Browsing by Author "Francisco, Alexandre P."
- An ontology and a REST API for sequence based microbial typing data
  Publication. Almeida, João; Tiple, João; Ramirez, Mário; Melo-Cristino, José; Vaz, Cátia; Francisco, Alexandre P.; Carrico, Joao
  In the Microbial typing field, the need to have a common understanding of the concepts described and the ability to share results within the community is an increasingly important requisite for the continued development of portable and accurate sequence-based typing methods. These methods are used for bacterial strain identification and are fundamental tools in Clinical Microbiology and Bacterial Population Genetics studies. In this article we propose an ontology designed for the microbial typing field, focusing on the widely used Multi Locus Sequence Typing methodology, and a RESTful API for accessing information systems based on the proposed ontology. This constitutes an important first step to accurately describe, analyze, curate, and manage information for microbial typing methodologies based on sequence based typing methodologies, and allows for the future integration with data analysis Web services.
- Computing RF Tree Distance over Succinct Representations
  Publication. Branco, António Pedro; Vaz, Cátia; Francisco, Alexandre P.
  There are several tools available to infer phylogenetic trees, which depict the evolutionary relationships among biological entities such as viral and bacterial strains in infectious outbreaks or cancerous cells in tumor progression trees. These tools rely on several inference methods available to produce phylogenetic trees, with resulting trees not being unique. Thus, methods for comparing phylogenies that are capable of revealing where two phylogenetic trees agree or differ are required. An approach is then proposed to compute a similarity or dissimilarity measure between trees, with the Robinson–Foulds distance being one of the most used, and which can be computed in linear time and space. Nevertheless, given the large and increasing volume of phylogenetic data, phylogenetic trees are becoming very large with hundreds of thousands of leaves. In this context, space requirements become an issue both while computing tree distances and while storing trees. We propose then an efficient implementation of the Robinson–Foulds distance over tree succinct representations. Our implementation also generalizes the Robinson–Foulds distances to labelled phylogenetic trees, i.e., trees containing labels on all nodes, instead of only on leaves. Experimental results show that we are able to still achieve linear time while requiring less space. Our implementation in C++ is available as an open-source tool.
- Distance-based phylogenetic inference from typing data: a unifying view
  Publication. Vaz, Cátia; Nascimento, Marta; Carrico, Joao; Rocher, Tatiana; Francisco, Alexandre P.
  Typing methods are widely used in the surveillance of infectious diseases, outbreaks investigation and studies of the natural history of an infection. Moreover, their use is becoming standard, in particular with the introduction of high-throughput sequencing. On the other hand, the data being generated are massive and many algorithms have been proposed for a phylogenetic analysis of typing data, addressing both correctness and scalability issues. Most of the distance-based algorithms for inferring phylogenetic trees follow the closest pair joining scheme. This is one of the approaches used in hierarchical clustering. Moreover, although phylogenetic inference algorithms may seem rather different, the main difference among them resides on how one defines cluster proximity and on which optimization criterion is used. Both cluster proximity and optimization criteria rely often on a model of evolution. In this work, we review, and we provide a unified view of these algorithms. This is an important step not only to better understand such algorithms but also to identify possible computational bottlenecks and improvements, important to deal with large data sets.
- Dynamic phylogenetic inference for sequence-based typing data
  Publication. Francisco, Alexandre P.; Nascimento, Marta; Vaz, Cátia
  Typing methods are widely used in the surveillance of infectious diseases, outbreaks investigation and studies of the natural history of an infection. Their use is becoming standard, in particular with the introduction of High Throughput Sequencing (HTS). On the other hand, the data being generated is massive and many algorithms have been proposed for phylogenetic analysis of typing data, such as the goeBURST algorithm. These algorithms must, however, be rerun from scratch whenever new data becomes available. We address this issue by proposing a dynamic version of the goeBURST algorithm. Experimental results show that this new version is efficient at integrating new data and updating inferred evolutionary patterns, improving the update running time by at least one order of magnitude.
- NGS4Cloud: Cloud-based NGS Data Processing
  Publication. Forja, João; Almeida, Alexandre; Francisco, Alexandre P.; Simão, José; Vaz, Cátia
  Motivation and challenges: Next-Generation Sequencing (NGS) technologies are greatly increasing the amount of genomic computer data, revolutionizing the biosciences field and leading to the development of more complex NGS Data Analysis techniques [2]. These techniques, known as pipelines or workflows, consist of running and refining a series of intertwined computational analysis and visualization tasks on large amounts of data. These pipelines involve the use of multiple software tools and data resources in a staged fashion, with the output of one tool being passed as input to the next one. To simplify the design and execution of biomedical workflows by end users, especially those that use multiple software tools and data resources, a number of scientific workflow systems have been developed over the past decade. Examples include Galaxy [1] and Swift [3]. However, most of these scientific workflow systems cannot be easily deployed and most of the time are only available to users with access to specialized IT support. There are two main issues to address in the design of an execution environment for these pipelines. First, due to the complexity of configuring and parametrizing pipelines, the use of NGS Data Analysis techniques is not an easy task for a user without IT knowledge. Second, since input data can reach terabytes or even petabytes, pipeline execution requires, in general, a great amount of computational resources.
- NGSPipes: from specification to automatic deployment of NGS pipelines
  Publication. Dantas, Bruno; Fleitas, Calmenelias; Francisco, Alexandre P.; Simão, José; Vaz, Cátia
  Biosciences have been revolutionized by next generation sequencing (NGS) technologies in recent years, leading to new perspectives in medical, industrial and environmental applications. Although our motivation comes from biosciences, the following is true for many areas of science: published results are usually hard to reproduce either because data is not available or tools are not readily available, which delays the adoption of new methodologies and hinders innovation. Our focus is on tool readiness and pipelines availability. Even though most tools are freely available, pipelines are in general barely described and their configuration is far from trivial, with many parameters to be tuned. In this paper we discuss how to effectively build and use pipelines, relying on state of the art computing technologies to execute them without users needing to configure, install and manage tools, servers and complex workflow management systems. A framework is also proposed showing that we can have public pipelines ready to process and analyse very high volume experimental data, produced for instance by high-throughput technologies, and that can be executed by users without effort. The NGSPipes framework and underlying architecture provide a major step towards open science and true collaboration in what concerns tools and pipelines among computational biology researchers and practitioners, who may share and replicate results in an easier and more transparent way.
- phyloDB: a framework for large-scale phylogenetic analysis of sequence based typing data
  Publication. Lourenço, Bruno; Vaz, Cátia; Coimbra, Miguel E.; Francisco, Alexandre P.
  PHYLODB is a modular and extensible framework for large-scale phylogenetic analyses of sequence based typing data, which are essential for understanding epidemics evolution. It relies on the Neo4j graph database for data storage and processing, providing a schema and an API for representing and querying phylogenetic data. Custom algorithms are also supported, allowing users to perform heavy computations directly over the data, and to store results in the database. Multiple computation results are stored as multilayer networks, promoting and facilitating comparative analyses, as well as avoiding unnecessary ab initio computations. The experimental evaluation results showcase that PHYLODB is efficient and scalable with respect to both API operations and algorithms execution.
- PHYLOViZ Online: web-based tool for visualization, phylogenetic inference, analysis and sharing of minimum spanning trees
  Publication. Ribeiro-Gonçalves, Bruno; Francisco, Alexandre P.; Vaz, Cátia; Ramirez, Mário; Carrico, Joao
  High-throughput sequencing methods generated allele and single nucleotide polymorphism information for thousands of bacterial strains that are publicly available in online repositories and created the possibility of generating similar information for hundreds to thousands of strains more in a single study. Minimum spanning tree analysis of allelic data offers a scalable and reproducible methodological alternative to traditional phylogenetic inference approaches, useful in epidemiological investigations and population studies of bacterial pathogens. PHYLOViZ Online was developed to allow users to do these analyses without software installation and to enable easy accessing and sharing of data and analyses results from any Internet enabled computer. PHYLOViZ Online also offers a RESTful API for programmatic access to data and algorithms, allowing it to be seamlessly integrated into any third party web service or software.
- PHYLOViZ: phylogenetic inference and data visualization for sequence based typing methods
  Publication. Francisco, Alexandre P.; Vaz, Cátia; Monteiro, Pedro T.; Melo-Cristino, José; Ramirez, Mario; Carrico, Joao
  Background: With the decrease of DNA sequencing costs, sequence-based typing methods are rapidly becoming the gold standard for epidemiological surveillance. These methods provide reproducible and comparable results needed for a global scale bacterial population analysis, while retaining their usefulness for local epidemiological surveys. Online databases that collect the generated allelic profiles and associated epidemiological data are available, but this wealth of data remains underused and is frequently poorly annotated since no user-friendly tool exists to analyze and explore it.
  Results: PHYLOViZ is platform independent Java software that allows the integrated analysis of sequence-based typing methods, including SNP data generated from whole genome sequence approaches, and associated epidemiological data. goeBURST and its Minimum Spanning Tree expansion are used for visualizing the possible evolutionary relationships between isolates. The results can be displayed as an annotated graph overlaying the query results of any other epidemiological data available.
  Conclusions: PHYLOViZ is user-friendly software that allows the combined analysis of multiple data sources for microbial epidemiological and population studies. It is freely available at http://www.phyloviz.net.
- Towards distance-based phylogenetic inference in average-case linear-time
  Publication. Crochemore, Maxime; Francisco, Alexandre P.; Pissis, Solon; Vaz, Cátia
  Computing genetic evolution distances among a set of taxa dominates the running time of many phylogenetic inference methods. Most genetic evolution distance definitions rely, even if indirectly, on computing the pairwise Hamming distance among sequences or profiles. We propose here an average-case linear-time algorithm to compute pairwise Hamming distances among a set of taxa under a given Hamming distance threshold. This article includes both a theoretical analysis and extensive experimental results concerning the proposed algorithm. We further show how this algorithm can be successfully integrated into a well known phylogenetic inference method.
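
Several of the abstracts above build on pairwise Hamming distances between allelic profiles, restricted to pairs within a distance threshold. As a point of reference, the problem itself can be sketched with a naive quadratic baseline. This is not the average-case linear-time algorithm from the last entry; the function names, the threshold parameter `k`, and the sample profiles are all assumptions made for illustration.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length profiles differ."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    return sum(x != y for x, y in zip(a, b))


def pairs_within_threshold(profiles, k):
    """Naive O(n^2 * m) baseline: compare every pair of profiles and keep
    those whose Hamming distance is at most k (hypothetical helper)."""
    result = {}
    for (i, p), (j, q) in combinations(enumerate(profiles), 2):
        d = hamming(p, q)
        if d <= k:
            result[(i, j)] = d
    return result


# Made-up 7-locus allelic profiles, for illustration only.
profiles = [
    (1, 1, 1, 1, 1, 1, 1),
    (1, 1, 2, 1, 1, 1, 1),
    (1, 3, 2, 1, 1, 1, 1),
    (4, 4, 4, 4, 4, 4, 4),
]

close = pairs_within_threshold(profiles, 2)
print(close)  # {(0, 1): 1, (0, 2): 2, (1, 2): 1}
```

The quadratic number of comparisons in this baseline is what the abstract means by distance computation dominating the running time on large data sets.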