Browse by author "Sousa, Lisete"
Showing 1 - 7 of 7
- Arrow plot and CA maps on microarray preprocessing methods
  Publication. Silva, Carina; Freitas, Adelaide; Roque, Sara; Sousa, Lisete
  Microarrays allow thousands of genes to be monitored simultaneously, quantifying the abundance of transcripts under the same experimental condition at the same time. Among the various array technologies available, this work focuses on double-channel cDNA microarray experiments, which appear in numerous technical protocols associated with genomic studies. Microarray experiments involve many steps, and each one can affect the quality of the raw data. Background correction and normalization are preprocessing techniques used to clean and correct the raw data when undesirable fluctuations arise from technical factors. Several recent studies have shown that no preprocessing strategy outperforms the others in all circumstances, so it is difficult to provide general recommendations. This work proposes exploratory techniques to visualize the effects of preprocessing methods on the statistical analysis of two-channel cancer microarray data sets in which the cancer types (classes) are known. The arrow plot was used to select differentially expressed genes, and the graph of profiles resulting from correspondence analysis was used to visualize the results. Six background correction methods and six normalization methods were combined, giving 36 preprocessing strategies, which were applied to a published cDNA microarray database (Liver), available at http://genome-www5.stanford.edu/, whose microarrays were already classified by cancer type. All statistical analyses were performed using the R statistical software.
- Arrow plot and correspondence analysis maps for visualizing the effects of background correction and normalization methods on microarray data
  Publication. Silva, Carina; Freitas, Adelaide; Roque, Sara; Sousa, Lisete
  Among various available array technologies, double-channel cDNA microarray experiments provide numerous technical protocols associated with functional genomic studies. The chapter begins by detailing the arrow plot, a recent graphical methodology for detecting differentially expressed (DE) genes, and briefly mentions the significance analysis of microarrays (SAM) procedure, which is, in contrast, quite well known. Next, it introduces correspondence analysis (CA) and explains how the resulting graphic can be interpreted. CA is then carried out in both class comparison and class prediction applications on the lymphoma (lym), lung (lun), and liver (liv) data sets. CA is applied to all three databases in order to obtain graphical representations of background correction (BC) and normalization (NM) profiles in a two-dimensional reduced space. Whenever possible, more than one preprocessing strategy should be applied to microarray data, and the results from the preprocessed data should be compared before drawing any conclusion or carrying out subsequent analysis.
- Arrow Plot: a new graphical tool for selecting up and down regulated genes and genes differentially expressed on samples subgroups
  Publication. Silva, Carina; Turkman, Maria Antónia Amaral; Sousa, Lisete
  Background: A common task in analyzing microarray data is to determine which genes are differentially expressed across two (or more) kinds of tissue samples or samples subjected to experimental conditions. Several statistical methods have been proposed to accomplish this goal, generally based on measures of distance between classes. It is well known that biological samples are heterogeneous because of factors such as molecular subtypes or genetic background that are often unknown to the experimenter. For instance, in experiments that involve molecular classification of tumors it is important to identify significant subtypes of cancer. Bimodal or multimodal distributions often reflect the presence of mixtures of subsamples. Consequently, there can be genes differentially expressed in sample subgroups that are missed if the usual statistical approaches are used. In this paper we propose a new graphical tool that identifies not only genes with up and down regulation, but also genes with differential expression in different subclasses, which are usually missed by current statistical methods. This tool is based on two measures of distance between samples, namely the overlapping coefficient (OVL) between two densities and the area under the receiver operating characteristic (ROC) curve. The methodology proposed here was implemented in the open-source R software.
  Results: This method was applied to a publicly available dataset, as well as to a simulated dataset. We compared our results with those obtained using some of the standard methods for detecting differentially expressed genes, namely Welch t-statistic, fold change (FC), rank products (RP), average difference (AD), weighted average difference (WAD), moderated t-statistic (modT), intensity-based moderated t-statistic (ibmT), significance analysis of microarrays (samT) and area under the ROC curve (AUC). On both datasets all differentially expressed genes with bimodal or multimodal distributions were not selected by all standard selection procedures. We also compared our results with (i) area between ROC curve and rising area (ABCR) and (ii) the test for not proper ROC curves (TNRC). We found our methodology more comprehensive, because it detects both bimodal and multimodal distributions and allows different variances in the two samples. Another advantage of our method is that the behavior of different kinds of differentially expressed genes can be analyzed graphically.
  Conclusion: Our results indicate that the arrow plot represents a new flexible and useful tool for the analysis of gene expression profiles from microarrays.
- Challenges and opportunities for statistics in Omics data analysis
  Publication. Carrasquinha, Eunice; Sousa, Lisete; Silva, Carina; Gama-Carvalho, Margarida; Figueiredo, Andreia; Pinto, Francisco Rodrigues
  Omics data, comprising a diverse array of high-throughput molecular datasets, present substantial statistical challenges due to their intrinsic heterogeneity and variability. Effectively distinguishing biologically meaningful variations from random noise requires the application and development of robust statistical approaches. Interdisciplinary collaboration plays a pivotal role in refining these methodologies and enhancing the understanding of intricate biological systems. This chapter reviews the importance of statistical methods in omics data analysis, highlighting the need for ongoing advancements to address key challenges, including experimental design, preprocessing, dimensionality reduction, statistical modeling of complex datasets, and the interpretation of results. The pursuit of improved reliability in biological insights creates opportunities for the development and refinement of advanced statistical methodologies.
- Estatística em biologia molecular: o passado, o presente e o futuro [Statistics in molecular biology: the past, the present and the future]
  Publication. Sousa, Lisete; Silva, Carina
  We live in the most measurable era in history. In the petabyte (1000 terabytes) era, the challenge is no longer storing data but making sense of it. Since this is the era of the data revolution, data analysis becomes an integral part of several sciences. Molecular biology, for example, is no longer a science in which biologists study one gene at a time; it now produces thousands (now millions) of measurements per sample to analyze. Moreover, unlike DNA analysis, which is static, gene expression analysis is dynamic, since different genes are expressed in different tissues. The geneticist John Craig Venter used to sequence isolated organisms, but with the advent of new technologies and computers with high memory capacity, which allow the analysis of highly complex data, he moved on to studying entire ecosystems: sequencing the microorganisms of the ocean, since 2003, and of the air, since 2005. The complexity of the data is further amplified by new technologies which, when they first appear, are still little explored and produce noisier data than their predecessors. This complexity and degree of variability make statistics an important and unequivocal contribution to the analysis. In fact, the role of statistics in molecular biology goes beyond a mere intervention. It is an inseparable pillar of this science! Statistics has been earning its place in this new field, becoming an essential component of recognized merit.
- Impact of OVL variation on AUC bias estimated by non-parametric methods
  Publication. Silva, Carina; Turkman, Maria Antónia Amaral; Sousa, Lisete
  The area under the ROC curve (AUC) is the most commonly used index in ROC methodology for evaluating the performance of a classifier that discriminates between two mutually exclusive conditions. The AUC can take values between 0.5 and 1, where values close to 1 indicate that the classification model has high discriminative power. The overlap coefficient (OVL) between two density functions is defined as the common area under both functions. This coefficient is used as a measure of agreement between two distributions and takes values between 0 and 1, where values close to 1 indicate almost totally overlapping densities. These two measures were used to construct the arrow plot for selecting differentially expressed genes. A simulation study using the bootstrap method is presented in order to estimate the AUC bias and standard error using empirical and kernel methods. To assess the impact of OVL variation on the AUC bias, samples from various continuous distributions were simulated, considering different parameter values and fixed OVL values between 0 and 1. Samples of sizes 15, 30, 50, and 100, with 1000 bootstrap replicates for each scenario, were considered.
- Variant calling in genomics: a comparative performance analysis and decision guide
  Publication. Pinto, Vera; Sousa, Lisete; Silva, Carina; Nejat Mahdieh
  The accurate detection of genetic variants is critical for advancing genomics research and precision medicine. However, this task remains challenging due to pervasive sequencing errors and complex genomic regions. The choice of variant calling software significantly influences results, creating a need for clear, evidence-based guidance. This study provides a performance evaluation and a guide for selecting variant callers by benchmarking seven widely used tools (GATK, FreeBayes, DeepVariant, Samtools, Strelka2, Octopus, and Varscan2) and highlighting their algorithmic trade-offs. The well-characterized NA12878 genome from the Genome in a Bottle consortium was analyzed. High-coverage whole-genome sequencing data were processed with each variant caller, and the resulting variant calling files were benchmarked against a gold-standard reference. Performance was assessed using precision, recall, and F1-score on a chromosome 20 subset and on full whole-genome data. The analysis revealed that DeepVariant's deep learning approach achieved the highest precision (0.7869) and F1-score (0.8754) on chromosome 20. For whole-genome analysis, Strelka2 excelled in precision (0.8326), while Octopus demonstrated superior recall (0.9838). FreeBayes exhibited high sensitivity but lower precision, underscoring a key trade-off. There is no universally superior variant caller; the optimal choice depends on the specific research objectives, whether prioritizing precision, recall, or computational efficiency. This study serves as an evidence-based resource for researchers and clinicians, enabling informed tool selection.
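
The precision, recall, and F1-score mentioned in the variant calling entry follow the standard definitions based on counts of true positives, false positives, and false negatives. A minimal Python sketch of those definitions is given here; the function name `precision_recall_f1` and the counts are hypothetical and for illustration only, and they are not taken from the study or from any benchmarking tool.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from counts of true positives,
    false positives and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts (assumption): 90 calls match the reference,
# 10 calls are spurious, 30 reference variants are missed.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.4f} recall={r:.4f} F1={f1:.4f}")
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values, which is why a caller with high sensitivity but lower precision can still end up with a modest F1-score.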

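
Several entries in this listing describe the arrow plot as built on two quantities computed per gene: the area under the ROC curve (AUC) and the overlap coefficient (OVL) between two densities. The sketch below shows one simple way to estimate both for a single gene in plain Python. It is an illustration under stated assumptions, not the authors' R implementation: the function names `auc` and `ovl_hist` are hypothetical, the AUC uses the empirical (pairwise comparison) estimator, and the OVL uses a crude histogram estimate with an arbitrary number of bins, whereas the entries mention empirical and kernel methods.

```python
def auc(x, y):
    """Empirical AUC: proportion of pairs with y > x, counting ties as 0.5."""
    wins = 0.0
    for xi in x:
        for yj in y:
            if yj > xi:
                wins += 1.0
            elif yj == xi:
                wins += 0.5
    return wins / (len(x) * len(y))


def ovl_hist(x, y, bins=10):
    """Histogram estimate of the overlap coefficient (assumption: equal-width
    bins over the pooled range): sum over bins of the smaller relative frequency."""
    lo = min(min(x), min(y))
    hi = max(max(x), max(y))
    if hi == lo:
        return 1.0  # all values identical, so the samples overlap completely
    width = (hi - lo) / bins

    def freqs(sample):
        counts = [0] * bins
        for v in sample:
            idx = min(int((v - lo) / width), bins - 1)  # clamp the maximum value
            counts[idx] += 1
        return [c / len(sample) for c in counts]

    return sum(min(a, b) for a, b in zip(freqs(x), freqs(y)))


# Hypothetical expression values for one gene in two classes (assumption).
control = [4.0, 5.0, 6.0, 7.0]
shifted = [8.0, 9.0, 10.0, 11.0]   # clearly higher than control
bimodal = [1.0, 2.0, 9.0, 10.0]    # two subgroups on either side of control

print(auc(control, shifted), ovl_hist(control, shifted))
print(auc(control, bimodal), ovl_hist(control, bimodal))
```

In this toy example the shifted sample gives an AUC of 1 with no overlap, while the bimodal sample gives an AUC of 0.5 with no overlap. That second pattern is the kind of subgroup behavior the arrow plot entries say is missed by methods that rely on a single distance measure.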