Cartes, Jorge Avila; Bonizzoni, Paola; Ciccolella, Simone; Vedova, Gianluca Della; Denti, Luca
PangeBlocks: customized construction of pangenome graphs via maximal blocks Journal Article
In: BMC Bioinformatics, vol. 25, no. 1, 2024, ISSN: 1471-2105.
Abstract | Links | BibTeX | Tags: WP2: Evolutionary/Comparative CPG
@article{AvilaCartes2024a,
title = {PangeBlocks: customized construction of pangenome graphs via maximal blocks},
author = {Jorge Avila Cartes and Paola Bonizzoni and Simone Ciccolella and Gianluca Della Vedova and Luca Denti},
url = {https://github.com/AlgoLab/pangeblocks},
doi = {10.1186/s12859-024-05958-5},
issn = {1471-2105},
year = {2024},
date = {2024-11-01},
urldate = {2024-11-01},
journal = {BMC Bioinformatics},
volume = {25},
number = {1},
publisher = {Springer Science and Business Media LLC},
abstract = {\textbf{Background}
The construction of a pangenome graph is a fundamental task in pangenomics. A natural theoretical question is how to formalize the computational problem of building an optimal pangenome graph, making explicit the underlying optimization criterion and the set of feasible solutions. Current approaches build a pangenome graph with some heuristics, without assuming some explicit optimization criteria. Thus it is unclear how a specific optimization criterion affects the graph topology and downstream analysis, like read mapping and variant calling.
\textbf{Results}
In this paper, by leveraging the notion of maximal block in a Multiple Sequence Alignment (MSA), we reframe the pangenome graph construction problem as an exact cover problem on blocks called Minimum Weighted Block Cover (MWBC). Then we propose an Integer Linear Programming (ILP) formulation for the MWBC problem that allows us to study the most natural objective functions for building a graph. We provide an implementation of the ILP approach for solving the MWBC and we evaluate it on SARS-CoV-2 complete genomes, showing how different objective functions lead to pangenome graphs that have different properties, hinting that the specific downstream task can drive the graph construction phase.
\textbf{Conclusion}
We show that a customized construction of a pangenome graph based on selecting objective functions has a direct impact on the resulting graphs. In particular, our formalization of the MWBC problem, based on finding an optimal subset of blocks covering an MSA, paves the way to novel practical approaches to graph representations of an MSA where the user can guide the construction.},
keywords = {WP2: Evolutionary/Comparative CPG},
pubstate = {published},
tppubtype = {article}
}
The construction of a pangenome graph is a fundamental task in pangenomics. A natural theoretical question is how to formalize the computational problem of building an optimal pangenome graph, making explicit the underlying optimization criterion and the set of feasible solutions. Current approaches build a pangenome graph with some heuristics, without assuming some explicit optimization criteria. Thus it is unclear how a specific optimization criterion affects the graph topology and downstream analysis, like read mapping and variant calling.
Results
In this paper, by leveraging the notion of maximal block in a Multiple Sequence Alignment (MSA), we reframe the pangenome graph construction problem as an exact cover problem on blocks called Minimum Weighted Block Cover (MWBC). Then we propose an Integer Linear Programming (ILP) formulation for the MWBC problem that allows us to study the most natural objective functions for building a graph. We provide an implementation of the ILP approach for solving the MWBC and we evaluate it on SARS-CoV-2 complete genomes, showing how different objective functions lead to pangenome graphs that have different properties, hinting that the specific downstream task can drive the graph construction phase.
Conclusion
We show that a customized construction of a pangenome graph based on selecting objective functions has a direct impact on the resulting graphs. In particular, our formalization of the MWBC problem, based on finding an optimal subset of blocks covering an MSA, paves the way to novel practical approaches to graph representations of an MSA where the user can guide the construction.
Parmigiani, Luca; Garrison, Erik; Stoye, Jens; Marschall, Tobias; Doerr, Daniel
Panacus: fast and exact pangenome growth and core size estimation Journal Article
In: Bioinformatics, 2024, ISSN: 1367-4811.
Abstract | Links | BibTeX | Tags: WP2: Evolutionary/Comparative CPG
@article{Parmigiani2024b,
title = {Panacus: fast and exact pangenome growth and core size estimation},
author = {Luca Parmigiani and Erik Garrison and Jens Stoye and Tobias Marschall and Daniel Doerr},
editor = {Janet Kelso},
url = {https://github.com/marschall-lab/panacus},
doi = {10.1093/bioinformatics/btae720},
issn = {1367-4811},
year = {2024},
date = {2024-11-01},
urldate = {2024-11-01},
journal = {Bioinformatics},
publisher = {Oxford University Press (OUP)},
abstract = {Motivation: Using a single linear reference genome poses a limitation to exploring the full genomic diversity of a species. The release of a draft human pangenome underscores the increasing relevance of pangenomics to overcome these limitations. Pangenomes are commonly represented as graphs, which can represent billions of base pairs of sequence. Presently, there is a lack of scalable software able to perform key tasks on pangenomes, such as quantifying universally shared sequence across genomes (the core genome) and measuring the extent of genomic variability as a function of sample size (pangenome growth).
Results: We introduce Panacus (pangenome-abacus), a tool designed to rapidly perform these tasks and visualize the results in interactive plots. Panacus can process GFA files, the accepted standard for pangenome graphs, and is able to analyze a human pangenome graph with 110 million nodes in less than one hour.
Availability and implementation: Panacus is implemented in Rust and is published as Open Source software under the MIT license. The source code and documentation are available at https://github.com/marschall-lab/panacus. Panacus can be installed via Bioconda at https://bioconda.github.io/recipes/panacus/README.html.},
keywords = {WP2: Evolutionary/Comparative CPG},
pubstate = {published},
tppubtype = {article}
}
Results: We introduce Panacus (pangenome-abacus), a tool designed to rapidly perform these tasks and visualize the results in interactive plots. Panacus can process GFA files, the accepted standard for pangenome graphs, and is able to analyze a human pangenome graph with 110 million nodes in less than one hour.
Availability and implementation: Panacus is implemented in Rust and is published as Open Source software under the MIT license. The source code and documentation are available at https://github.com/marschall-lab/panacus. Panacus can be installed via Bioconda at https://bioconda.github.io/recipes/panacus/README.html.
Gabory, Esteban; Mwaniki, Moses Njagi; Pisanti, Nadia; Pissis, Solon P.; Radoszewski, Jakub; Sweering, Michelle; Zuba, Wiktor
Pangenome comparison via ED strings Journal Article
In: Frontiers in Bioinformatics, vol. 4, 2024, ISSN: 2673-7647.
Abstract | Links | BibTeX | Tags: WP2: Evolutionary/Comparative CPG
@article{Gabory2024,
title = {Pangenome comparison via ED strings},
author = {Esteban Gabory and Moses Njagi Mwaniki and Nadia Pisanti and Solon P. Pissis and Jakub Radoszewski and Michelle Sweering and Wiktor Zuba},
doi = {10.3389/fbinf.2024.1397036},
issn = {2673-7647},
year = {2024},
date = {2024-09-01},
journal = {Frontiers in Bioinformatics},
volume = {4},
publisher = {Frontiers Media SA},
abstract = {Introduction: An elastic-degenerate (ED) string is a sequence of sets of strings. It can also be seen as a directed acyclic graph whose edges are labeled by strings. The notion of ED strings was introduced as a simple alternative to variation and sequence graphs for representing a pangenome, that is, a collection of genomic sequences to be analyzed jointly or to be used as a reference.
Methods: In this study, we define notions of matching statistics of two ED strings as similarity measures between pangenomes and, consequently infer a corresponding distance measure. We then show that both measures can be computed efficiently, in both theory and practice, by employing the intersection graph of two ED strings.
Results: We also implemented our methods as a software tool for pangenome comparison and evaluated their efficiency and effectiveness using both synthetic and real datasets.
Discussion: As for efficiency, we compare the runtime of the intersection graph method against the classic product automaton construction showing that the intersection graph is faster by up to one order of magnitude. For showing effectiveness, we used real SARS-CoV-2 datasets and our matching statistics similarity measure to reproduce a well-established clade classification of SARS-CoV-2, thus demonstrating that the classification obtained by our method is in accordance with the existing one.},
keywords = {WP2: Evolutionary/Comparative CPG},
pubstate = {published},
tppubtype = {article}
}
Methods: In this study, we define notions of matching statistics of two ED strings as similarity measures between pangenomes and, consequently infer a corresponding distance measure. We then show that both measures can be computed efficiently, in both theory and practice, by employing the intersection graph of two ED strings.
Results: We also implemented our methods as a software tool for pangenome comparison and evaluated their efficiency and effectiveness using both synthetic and real datasets.
Discussion: As for efficiency, we compare the runtime of the intersection graph method against the classic product automaton construction showing that the intersection graph is faster by up to one order of magnitude. For showing effectiveness, we used real SARS-CoV-2 datasets and our matching statistics similarity measure to reproduce a well-established clade classification of SARS-CoV-2, thus demonstrating that the classification obtained by our method is in accordance with the existing one.
Brejová, Broňa; Gagie, Travis; Herencsárová, Eva; Vinař, Tomáš
Maximum-scoring path sets on pangenome graphs of constant treewidth Journal Article
In: Frontiers in Bioinformatics, vol. 4, 2024, ISSN: 2673-7647.
Abstract | Links | BibTeX | Tags: WP2: Evolutionary/Comparative CPG
@article{Brejova2024,
title = {Maximum-scoring path sets on pangenome graphs of constant treewidth},
author = {Broňa Brejová and Travis Gagie and Eva Herencsárová and Tomáš Vinař},
doi = {10.3389/fbinf.2024.1391086},
issn = {2673-7647},
year = {2024},
date = {2024-07-01},
journal = {Frontiers in Bioinformatics},
volume = {4},
publisher = {Frontiers Media SA},
abstract = {We generalize a problem of finding maximum-scoring segment sets, previously studied by Csűrös (IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2004, 1, 139–150), from sequences to graphs. Namely, given a vertex-weighted graph G and a non-negative startup penalty c, we can find a set of vertex-disjoint paths in G with maximum total score when each path’s score is its vertices’ total weight minus c. We call this new problem maximum-scoring path sets (MSPS). We present an algorithm that has a linear-time complexity for graphs with a constant treewidth. Generalization from sequences to graphs allows the algorithm to be used on pangenome graphs representing several related genomes and can be seen as a common abstraction for several biological problems on pangenomes, including searching for CpG islands, ChIP-seq data analysis, analysis of region enrichment for functional elements, or simple chaining problems.},
keywords = {WP2: Evolutionary/Comparative CPG},
pubstate = {published},
tppubtype = {article}
}
Sierra, Pío; Durbin, Richard
Identification of transposable element families from pangenome polymorphisms Journal Article
In: Mobile DNA, vol. 15, no. 1, pp. 13, 2024, ISSN: 1759-8753.
Abstract | Links | BibTeX | Tags: WP2: Evolutionary/Comparative CPG
@article{Sierra2024,
title = {Identification of transposable element families from pangenome polymorphisms},
author = {Pío Sierra and Richard Durbin},
url = {https://github.com/piosierra/pantera},
doi = {10.1186/s13100-024-00323-y},
issn = {1759-8753},
year = {2024},
date = {2024-06-01},
urldate = {2024-06-01},
journal = {Mobile DNA},
volume = {15},
number = {1},
pages = {13},
publisher = {Springer Science and Business Media LLC},
abstract = {Background: Transposable Elements (TEs) are segments of DNA, typically a few hundred base pairs up to several tens of thousands bases long, that have the ability to generate new copies of themselves in the genome. Most existing methods used to identify TEs in a newly sequenced genome are based on their repetitive character, together with detection based on homology and structural features. As new high quality assemblies become more common, including the availability of multiple independent assemblies from the same species, an alternative strategy for identification of TE families becomes possible in which we focus on the polymorphism at insertion sites caused by TE mobility.
Results: We develop the idea of using the structural polymorphisms found in pangenomes to create a library of the TE families recently active in a species, or in a closely related group of species. We present a tool, pantera, that achieves this task, and illustrate its use both on species with well-curated libraries, and on new assemblies.
Conclusions: Our results show that pantera is sensitive and accurate, tending to correctly identify complete elements with precise boundaries, and is particularly well suited to detect larger, low copy number TEs that are often undetected with existing de novo methods.},
keywords = {WP2: Evolutionary/Comparative CPG},
pubstate = {published},
tppubtype = {article}
}
Results: We develop the idea of using the structural polymorphisms found in pangenomes to create a library of the TE families recently active in a species, or in a closely related group of species. We present a tool, pantera, that achieves this task, and illustrate its use both on species with well-curated libraries, and on new assemblies.
Conclusions: Our results show that pantera is sensitive and accurate, tending to correctly identify complete elements with precise boundaries, and is particularly well suited to detect larger, low copy number TEs that are often undetected with existing de novo methods.
Parmigiani, Luca; Wittler, Roland; Stoye, Jens
Revisiting pangenome openness with k-mers Journal Article
In: Peer Community Journal, vol. 4, 2024, ISSN: 2804-3871.
Abstract | Links | BibTeX | Tags: WP2: Evolutionary/Comparative CPG
@article{Parmigiani2024,
title = {Revisiting pangenome openness with k-mers},
author = {Luca Parmigiani and Roland Wittler and Jens Stoye},
doi = {10.24072/pcjournal.415},
issn = {2804-3871},
year = {2024},
date = {2024-04-01},
journal = {Peer Community Journal},
volume = {4},
publisher = {Peer Community In},
abstract = {Pangenomics is the study of related genomes collectively, usually from the same species or closely related taxa. Originally, pangenomes were defined for bacterial species. After the concept was extended to eukaryotic genomes, two definitions of pangenome evolved in parallel: the gene-based approach, which defines the pangenome as the union of all genes, and the sequence-based approach, which defines the pangenome as the set of all nonredundant genomic sequences. Estimating the total size of the pangenome for a given species has been subject of study since the very first mention of pangenomes. Traditionally, this is performed by predicting the ratio at which new genes are discovered, referred to as the openness of the species. Here, we abstract each genome as a set of items, which is entirely agnostic of the two approaches (gene-based, sequence-based). Genes are a viable option for items, but also other possibilities are feasible, e.g., genome sequence substrings of fixed length k (k-mers). In the present study, we investigate the use of k-mers to estimate the openness as an alternative to genes, and compare the results. An efficient implementation is also provided.},
keywords = {WP2: Evolutionary/Comparative CPG},
pubstate = {published},
tppubtype = {article}
}
Almeida, Miguel Vasconcelos; Blumer, Moritz; Yuan, Chengwei Ulrika; Sierra, Pío; Price, Jonathan L.; Quah, Fu Xiang; Friman, Aleksandr; Dallaire, Alexandra; Vernaz, Grégoire; Putman, Audrey L. K.; Smith, Alan M.; Joyce, Domino A.; Butter, Falk; Haase, Astrid D.; Durbin, Richard; Santos, M. Emília; Miska, Eric A.
Dynamic co-evolution of transposable elements and the piRNA pathway in African cichlid fishes Unpublished
2024.
Abstract | Links | BibTeX | Tags: WP2: Evolutionary/Comparative CPG
@unpublished{Almeida2024,
title = {Dynamic co-evolution of transposable elements and the piRNA pathway in African cichlid fishes},
author = {Miguel Vasconcelos Almeida and Moritz Blumer and Chengwei Ulrika Yuan and Pío Sierra and Jonathan L. Price and Fu Xiang Quah and Aleksandr Friman and Alexandra Dallaire and Grégoire Vernaz and Audrey L. K. Putman and Alan M. Smith and Domino A. Joyce and Falk Butter and Astrid D. Haase and Richard Durbin and M. Emília Santos and Eric A. Miska},
doi = {10.1101/2024.04.01.587621},
year = {2024},
date = {2024-04-01},
publisher = {Cold Spring Harbor Laboratory},
abstract = {East African cichlid fishes have diversified in an explosive fashion, but the (epi)genetic basis of the phenotypic diversity of these fishes remains largely unknown. Although transposable elements (TEs) have been associated with phenotypic variation in cichlids, little is known about their transcriptional activity and epigenetic silencing. Here, we describe dynamic patterns of TE expression in African cichlid gonads and during early development. Orthology inference revealed an expansion of piwil1 genes in Lake Malawi cichlids, likely driven by PiggyBac TEs. The expanded piwil1 copies have signatures of positive selection and retain amino acid residues essential for catalytic activity. Furthermore, the gonads of African cichlids express a Piwi-interacting RNA (piRNA) pathway that target TEs. We define the genomic sites of piRNA production in African cichlids and find divergence in closely related species, in line with fast evolution of piRNA-producing loci. Our findings suggest dynamic co-evolution of TEs and host silencing pathways in the African cichlid radiations. We propose that this co-evolution has contributed to cichlid genomic diversity.},
keywords = {WP2: Evolutionary/Comparative CPG},
pubstate = {published},
tppubtype = {unpublished}
}
Lemane, Téo; Lezzoche, Nolan; Lecubin, Julien; Pelletier, Eric; Lescot, Magali; Chikhi, Rayan; Peterlongo, Pierre
Indexing and real-time user-friendly queries in terabyte-sized complex genomic datasets with kmindex and ORA Journal Article
In: Nature Computational Science, vol. 4, no. 2, pp. 104–109, 2024, ISSN: 2662-8457.
Abstract | Links | BibTeX | Tags: WP1: Primary CPG, WP2: Evolutionary/Comparative CPG, WP3: Translational CPG
@article{Lemane2024,
title = {Indexing and real-time user-friendly queries in terabyte-sized complex genomic datasets with kmindex and ORA},
author = {Téo Lemane and Nolan Lezzoche and Julien Lecubin and Eric Pelletier and Magali Lescot and Rayan Chikhi and Pierre Peterlongo},
doi = {10.1038/s43588-024-00596-6},
issn = {2662-8457},
year = {2024},
date = {2024-02-01},
journal = {Nature Computational Science},
volume = {4},
number = {2},
pages = {104–109},
publisher = {Springer Science and Business Media LLC},
abstract = {Public sequencing databases contain vast amounts of biological information, yet they are largely underutilized as it is challenging to efficiently search them for any sequence(s) of interest. We present kmindex, an approach that can index thousands of metagenomes and perform sequence searches in a fraction of a second. The index construction is an order of magnitude faster than previous methods, while search times are two orders of magnitude faster. With negligible false positive rates below 0.01%, kmindex outperforms the precision of existing approaches by four orders of magnitude. Here we demonstrate the scalability of kmindex by successfully indexing 1,393 marine seawater metagenome samples from the Tara Oceans project. Additionally, we introduce the publicly accessible web server Ocean Read Atlas, which enables real-time queries on the Tara Oceans dataset.},
keywords = {WP1: Primary CPG, WP2: Evolutionary/Comparative CPG, WP3: Translational CPG},
pubstate = {published},
tppubtype = {article}
}
Cartes, Jorge Avila; Bonizzoni, Paola; Ciccolella, Simone; Vedova, Gianluca Della; Denti, Luca; Didelot, Xavier; Monti, Davide Cesare; Pirola, Yuri
RecGraph: recombination-aware alignment of sequences to variation graphs Journal Article
In: Bioinformatics, vol. 40, no. 5, pp. btae292, 2024, ISSN: 1367-4811.
Abstract | Links | BibTeX | Tags: WP2: Evolutionary/Comparative CPG
@article{Avila2024,
title = {RecGraph: recombination-aware alignment of sequences to variation graphs},
author = {Jorge Avila Cartes and Paola Bonizzoni and Simone Ciccolella and Gianluca Della Vedova and Luca Denti and Xavier Didelot and Davide Cesare Monti and Yuri Pirola},
url = {https://github.com/AlgoLab/RecGraph},
doi = {10.1093/bioinformatics/btae292},
issn = {1367-4811},
year = {2024},
date = {2024-01-01},
urldate = {2024-01-01},
journal = {Bioinformatics},
volume = {40},
number = {5},
pages = {btae292},
abstract = {Bacterial genomes present more variability than human genomes, which requires important adjustments in computational tools that are developed for human data. In particular, bacteria exhibit a mosaic structure due to homologous recombinations, but this fact is not sufficiently captured by standard read mappers that align against linear reference genomes. The recent introduction of pangenomics provides some insights in that context, as a pangenome graph can represent the variability within a species. However, the concept of sequence-to-graph alignment that captures the presence of recombinations has not been previously investigated.In this paper, we present the extension of the notion of sequence-to-graph alignment to a variation graph that incorporates a recombination, so that the latter are explicitly represented and evaluated in an alignment. Moreover, we present a dynamic programming approach for the special case where there is at most a recombination—we implement this case as RecGraph. From a modelling point of view, a recombination corresponds to identifying a new path of the variation graph, where the new arc is composed of two halves, each extracted from an original path, possibly joined by a new arc. Our experiments show that RecGraph accurately aligns simulated recombinant bacterial sequences that have at most a recombination, providing evidence for the presence of recombination events.
Our implementation is open source and available at https://github.com/AlgoLab/RecGraph.},
keywords = {WP2: Evolutionary/Comparative CPG},
pubstate = {published},
tppubtype = {article}
}
Our implementation is open source and available at https://github.com/AlgoLab/RecGraph.
Schulz, Tizian; Parmigiani, Luca; Rempel, Andreas; Stoye, Jens
Methods for Pangenomic Core Detection Book Chapter
In: Setubal, João Carlos; Stadler, Peter F.; Stoye, Jens (Ed.): Comparative Genomics: Methods and Protocols, pp. 73–106, Springer US, New York, NY, 2024, ISSN: 1940-6029.
Abstract | Links | BibTeX | Tags: WP2: Evolutionary/Comparative CPG
@inbook{Schulz2024,
title = {Methods for Pangenomic Core Detection},
author = {Tizian Schulz and Luca Parmigiani and Andreas Rempel and Jens Stoye},
editor = {João Carlos Setubal and Peter F. Stadler and Jens Stoye},
doi = {10.1007/978-1-0716-3838-5_4},
issn = {1940-6029},
year = {2024},
date = {2024-01-01},
booktitle = {Comparative Genomics: Methods and Protocols},
pages = {73–106},
publisher = {Springer US},
address = {New York, NY},
abstract = {Computational pangenomics deals with the joint analysis of all genomic sequences of a species. It has already been successfully applied to various tasks in many research areas. Further advances in DNA sequencing technologies constantly let more and more genomic sequences become available for many species, leading to an increasing attractiveness of pangenomic studies. At the same time, larger datasets also pose new challenges for data structures and algorithms that are needed to handle the data. Efficient methods oftentimes make use of the concept of k-mers.},
keywords = {WP2: Evolutionary/Comparative CPG},
pubstate = {published},
tppubtype = {inbook}
}
Baláž, Andrej; Gagie, Travis; Goga, Adrián; Heumos, Simon; Navarro, Gonzalo; Petescia, Alessia; Sirén, Jouni
Wheeler Maps Proceedings Article
In: Soto, José A.; Wiese, Andreas (Ed.): LATIN 2024: Theoretical Informatics, pp. 178–192, Springer Nature Switzerland, Cham, 2024, ISSN: 1611-3349.
Abstract | Links | BibTeX | Tags: WP2: Evolutionary/Comparative CPG
@inproceedings{Balaz2024,
title = {Wheeler Maps},
author = {Andrej Baláž and Travis Gagie and Adrián Goga and Simon Heumos and Gonzalo Navarro and Alessia Petescia and Jouni Sirén},
editor = {José A. Soto and Andreas Wiese},
doi = {10.1007/978-3-031-55598-5_12},
issn = {1611-3349},
year = {2024},
date = {2024-01-01},
booktitle = {LATIN 2024: Theoretical Informatics},
pages = {178–192},
publisher = {Springer Nature Switzerland},
address = {Cham},
abstract = {Motivated by challenges in pangenomic read alignment, we propose a generalization of Wheeler graphs that we call Wheeler maps. A Wheeler map stores a text T[1..n] and an assignment of tags to the characters of T such that we can preprocess a pattern P[1..m] and then, given i and j, quickly return all the distinct tags labeling the first characters of the occurrences of P[i..j] in T. For the applications that most interest us, characters with long common contexts are likely to have the same tag, so we consider the number t of runs in the list of tags sorted by their characters’ positions in the Burrows-Wheeler Transform (BWT) of T. We show how, given a straight-line program with g rules for T, we can build an O(g+r+t)-space Wheeler map, where r is the number of runs in the BWT of T, with which we can preprocess a pattern P[1..m] in $O(m łog n)$ time and then return the k distinct tags for P[i..j] in optimal O(k) time for any given i and j.},
keywords = {WP2: Evolutionary/Comparative CPG},
pubstate = {published},
tppubtype = {inproceedings}
}
Willink, Beatriz; Tunström, Kalle; Nilén, Sofie; Chikhi, Rayan; Lemane, Téo; Takahashi, Michihiko; Takahashi, Yuma; Svensson, Erik I.; Wheat, Christopher West
The genomics and evolution of inter-sexual mimicry and female-limited polymorphisms in damselflies Journal Article
In: Nature Ecology & Evolution, vol. 8, no. 1, pp. 83–97, 2023, ISSN: 2397-334X.
Abstract | Links | BibTeX | Tags: WP2: Evolutionary/Comparative CPG, WP3: Translational CPG
@article{Willink2023,
title = {The genomics and evolution of inter-sexual mimicry and female-limited polymorphisms in damselflies},
author = {Beatriz Willink and Kalle Tunström and Sofie Nilén and Rayan Chikhi and Téo Lemane and Michihiko Takahashi and Yuma Takahashi and Erik I. Svensson and Christopher West Wheat},
doi = {10.1038/s41559-023-02243-1},
issn = {2397-334X},
year = {2023},
date = {2023-11-01},
urldate = {2023-11-01},
journal = {Nature Ecology & Evolution},
volume = {8},
number = {1},
pages = {83–97},
publisher = {Springer Science and Business Media LLC},
abstract = {Sex-limited morphs can provide profound insights into the evolution and genomic architecture of complex phenotypes. Inter-sexual mimicry is one particular type of sex-limited polymorphism in which a novel morph resembles the opposite sex. While inter-sexual mimics are known in both sexes and a diverse range of animals, their evolutionary origin is poorly understood. Here, we investigated the genomic basis of female-limited morphs and male mimicry in the common bluetail damselfly. Differential gene expression between morphs has been documented in damselflies, but no causal locus has been previously identified. We found that male mimicry originated in an ancestrally sexually dimorphic lineage in association with multiple structural changes, probably driven by transposable element activity. These changes resulted in ~900 kb of novel genomic content that is partly shared by male mimics in a close relative, indicating that male mimicry is a trans-species polymorphism. More recently, a third morph originated following the translocation of part of the male-mimicry sequence into a genomic position ~3.5 mb apart. We provide evidence of balancing selection maintaining male mimicry, in line with previous field population studies. Our results underscore how structural variants affecting a handful of potentially regulatory genes and morph-specific genes can give rise to novel and complex phenotypic polymorphisms.},
keywords = {WP2: Evolutionary/Comparative CPG, WP3: Translational CPG},
pubstate = {published},
tppubtype = {article}
}
Orozco-Arias, Simon; Sierra, Pío; Durbin, Richard; González, Josefa
MCHelper automatically curates transposable element libraries across eukaryotic species Unpublished
2023.
Abstract | Links | BibTeX | Tags: WP2: Evolutionary/Comparative CPG
@unpublished{OrozcoArias2023,
title = {MCHelper automatically curates transposable element libraries across eukaryotic species},
author = {Simon Orozco-Arias and Pío Sierra and Richard Durbin and Josefa González},
doi = {10.1101/2023.10.17.562682},
year = {2023},
date = {2023-10-01},
publisher = {Cold Spring Harbor Laboratory},
abstract = {The number of species with high quality genome sequences continues to increase, in part due to scaling up of multiple large scale biodiversity sequencing projects. While the need to annotate genic sequences in these genomes is widely acknowledged, the parallel need to annotate transposable element sequences that have been shown to alter genome architecture, rewire gene regulatory networks, and contribute to the evolution of host traits is becoming ever more evident. However, accurate genome-wide annotation of transposable element sequences is still technically challenging. Several de novo transposable element identification tools are now available, but manual curation of the libraries produced by these tools is needed to generate high quality genome annotations. Manual curation is time-consuming, and thus impractical for large-scale genomic studies, and lacks reproducibility. In this work, we present the Manual Curator Helper tool MCHelper, which automates the TE library curation process. By leveraging MCHelper’s fully automated mode with the outputs from three de novo transposable element identification tools, RepeatModeler2, EDTA and REPET, in fruit fly, rice, hooded crow, zebrafish, maize, and human, we show a substantial improvement in the quality of the transposable element libraries and genome annotations. MCHelper libraries are less redundant, with up to 65% reduction in the number of consensus sequences, have up to 11.4% fewer false positive sequences, and up to ∼48% fewer “unclassified/unknown” transposable element consensus sequences. Genome-wide transposable element annotations were also improved, including larger unfragmented insertions. Moreover, MCHelper is an easy to install and easy to use tool.},
keywords = {WP2: Evolutionary/Comparative CPG},
pubstate = {published},
tppubtype = {unpublished}
}
Herencsárová, Eva; Brejová, Broňa
Identifying Clusters in Graph Representations of Genomes Proceedings Article
In: Brejová, Broňa; Ciencialová, Lucie; Holeňa, Martin; Jajcay, Róbert; Jajcayová, Tatiana; Lexa, Matej; Mráz, František; Pardubská, Dana; Plátek, Martin (Ed.): Proceedings of the 23rd Conference Information Technologies – Applications and Theory (ITAT 2023), pp. 232–241, CEUR Workshop Proceedings, Tatranské Matliare, Slovakia, 2023.
Abstract | Links | BibTeX | Tags: WP2: Evolutionary/Comparative CPG
@inproceedings{Herencsarova2023,
title = {Identifying Clusters in Graph Representations of Genomes},
author = {Eva Herencsárová and Broňa Brejová},
editor = {Broňa Brejová and Lucie Ciencialová and Martin Holeňa and Róbert Jajcay and Tatiana Jajcayová and Matej Lexa and František Mráz and Dana Pardubská and Martin Plátek},
url = {https://ceur-ws.org/Vol-3498/paper30.pdf},
year = {2023},
date = {2023-09-23},
urldate = {2023-09-01},
booktitle = {Proceedings of the 23rd Conference Information Technologies – Applications and Theory (ITAT 2023)},
volume = {3498},
pages = {232–241},
publisher = {CEUR Workshop Proceedings},
address = {Tatranské Matliare, Slovakia},
abstract = {In many bioinformatics applications the task is to identify biologically significant locations in an individual genome. In our work, we are interested in finding high-density clusters of such biologically meaningful locations in a graph representation of a pangenome, which is a collection of related genomes. Different formulations of finding such clusters were previously studied for sequences. In this work, we study an extension of this problem for graphs, which we formalize as finding a set of vertex-disjoint paths with a maximum score in a weighted directed graph. We provide a linear-time algorithm for a special class of graphs corresponding to elastic-degenerate strings, one of pangenome representations. We also provide a fixed-parameter tractable algorithm for directed acyclic graphs with a special path decomposition of a limited width.},
keywords = {WP2: Evolutionary/Comparative CPG},
pubstate = {published},
tppubtype = {inproceedings}
}
Bernardini, Giulia; Iersel, Leo; Julien, Esther; Stougie, Leen
Constructing phylogenetic networks via cherry picking and machine learning Journal Article
In: Algorithms for Molecular Biology, vol. 18, no. 1, pp. 13, 2023, ISSN: 1748-7188.
Abstract | Links | BibTeX | Tags: WP2: Evolutionary/Comparative CPG
@article{Bernardini2023,
title = {Constructing phylogenetic networks via cherry picking and machine learning},
author = {Giulia Bernardini and Leo Iersel and Esther Julien and Leen Stougie},
doi = {10.1186/s13015-023-00233-3},
issn = {1748-7188},
year = {2023},
date = {2023-09-16},
urldate = {2023-09-16},
journal = {Algorithms for Molecular Biology},
volume = {18},
number = {1},
pages = {13},
publisher = {Springer Science and Business Media LLC},
abstract = {\textbf{Background}
Combining a set of phylogenetic trees into a single phylogenetic network that explains all of them is a fundamental challenge in evolutionary studies. Existing methods are computationally expensive and can either handle only small numbers of phylogenetic trees or are limited to severely restricted classes of networks.
\textbf{Results}
In this paper, we apply the recently-introduced theoretical framework of cherry picking to design a class of efficient heuristics that are guaranteed to produce a network containing each of the input trees, for practical-size datasets consisting of binary trees. Some of the heuristics in this framework are based on the design and training of a machine learning model that captures essential information on the structure of the input trees and guides the algorithms towards better solutions. We also propose simple and fast randomised heuristics that prove to be very effective when run multiple times.
\textbf{Conclusions}
Unlike the existing exact methods, our heuristics are applicable to datasets of practical size, and the experimental study we conducted on both simulated and real data shows that these solutions are qualitatively good, always within some small constant factor from the optimum. Moreover, our machine-learned heuristics are one of the first applications of machine learning to phylogenetics and show its promise.},
keywords = {WP2: Evolutionary/Comparative CPG},
pubstate = {published},
tppubtype = {article}
}
Combining a set of phylogenetic trees into a single phylogenetic network that explains all of them is a fundamental challenge in evolutionary studies. Existing methods are computationally expensive and can either handle only small numbers of phylogenetic trees or are limited to severely restricted classes of networks.
Results
In this paper, we apply the recently-introduced theoretical framework of cherry picking to design a class of efficient heuristics that are guaranteed to produce a network containing each of the input trees, for practical-size datasets consisting of binary trees. Some of the heuristics in this framework are based on the design and training of a machine learning model that captures essential information on the structure of the input trees and guides the algorithms towards better solutions. We also propose simple and fast randomised heuristics that prove to be very effective when run multiple times.
Conclusions
Unlike the existing exact methods, our heuristics are applicable to datasets of practical size, and the experimental study we conducted on both simulated and real data shows that these solutions are qualitatively good, always within some small constant factor from the optimum. Moreover, our machine-learned heuristics are one of the first applications of machine learning to phylogenetics and show its promise.
Cozzi, Davide; Rossi, Massimiliano; Rubinacci, Simone; Gagie, Travis; Köppl, Dominik; Boucher, Christina; Bonizzoni, Paola
μ- PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank data Journal Article
In: Bioinformatics, vol. 39, no. 9, 2023, ISSN: 1367-4811.
Abstract | Links | BibTeX | Tags: WP2: Evolutionary/Comparative CPG
@article{Cozzi2023-bx,
title = {μ- PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank data},
author = {Davide Cozzi and Massimiliano Rossi and Simone Rubinacci and Travis Gagie and Dominik Köppl and Christina Boucher and Paola Bonizzoni},
url = {https://github.com/dlcgold/muPBWT},
doi = {10.1093/bioinformatics/btad552},
issn = {1367-4811},
year = {2023},
date = {2023-09-01},
urldate = {2023-09-01},
journal = {Bioinformatics},
volume = {39},
number = {9},
publisher = {Oxford University Press (OUP)},
abstract = {The Positional Burrows–Wheeler Transform (PBWT) is a data structure that indexes haplotype sequences in a manner that enables finding maximal haplotype matches in h sequences containing w variation sites in O(hw) time. This represents a significant improvement over classical quadratic-time approaches. However, the original PBWT data structure does not allow for queries over Biobank panels that consist of several millions of haplotypes, if an index of the haplotypes must be kept entirely in memory.
In this article, we leverage the notion of r-index proposed for the BWT to present a memory-efficient method for constructing and storing the run-length encoded PBWT, and computing set maximal matches (SMEMs) queries in haplotype sequences. We implement our method, which we refer to as μ-PBWT, and evaluate it on datasets of 1000 Genome Project and UK Biobank data. Our experiments demonstrate that the μ-PBWT reduces the memory usage up to a factor of 20% compared to the best current PBWT-based indexing. In particular, μ-PBWT produces an index that stores high-coverage whole genome sequencing data of chromosome 20 in about a third of the space of its BCF file. μ-PBWT is an adaptation of techniques for the run-length compressed BWT for the PBWT (RLPBWT) and it is based on keeping in memory only a succinct representation of the RLPBWT that still allows the efficient computation of set maximal matches (SMEMs) over the original panel.
Our implementation is open source and available at https://github.com/dlcgold/muPBWT. The binary is available at https://bioconda.github.io/recipes/mupbwt/README.html.},
keywords = {WP2: Evolutionary/Comparative CPG},
pubstate = {published},
tppubtype = {article}
}
In this article, we leverage the notion of r-index proposed for the BWT to present a memory-efficient method for constructing and storing the run-length encoded PBWT, and computing set maximal matches (SMEMs) queries in haplotype sequences. We implement our method, which we refer to as μ-PBWT, and evaluate it on datasets of 1000 Genome Project and UK Biobank data. Our experiments demonstrate that the μ-PBWT reduces the memory usage up to a factor of 20% compared to the best current PBWT-based indexing. In particular, μ-PBWT produces an index that stores high-coverage whole genome sequencing data of chromosome 20 in about a third of the space of its BCF file. μ-PBWT is an adaptation of techniques for the run-length compressed BWT for the PBWT (RLPBWT) and it is based on keeping in memory only a succinct representation of the RLPBWT that still allows the efficient computation of set maximal matches (SMEMs) over the original panel.
Our implementation is open source and available at https://github.com/dlcgold/muPBWT. The binary is available at https://bioconda.github.io/recipes/mupbwt/README.html.
Ayad, Lorraine A K; Chikhi, Rayan; Pissis, Solon P
Seedability: optimizing alignment parameters for sensitive sequence comparison Journal Article
In: Bioinform. Adv., vol. 3, no. 1, 2023, ISSN: 2635-0041.
Abstract | Links | BibTeX | Tags: WP1: Primary CPG, WP2: Evolutionary/Comparative CPG
@article{Ayad2023-li,
title = {Seedability: optimizing alignment parameters for sensitive sequence comparison},
author = {Lorraine A K Ayad and Rayan Chikhi and Solon P Pissis},
url = {https://github.com/lorrainea/Seedability},
doi = {10.1093/bioadv/vbad108},
issn = {2635-0041},
year = {2023},
date = {2023-08-01},
urldate = {2023-08-01},
journal = {Bioinform. Adv.},
volume = {3},
number = {1},
publisher = {Oxford University Press (OUP)},
abstract = {Most sequence alignment techniques make use of exact k-mer hits, called seeds, as anchors to optimize alignment speed. A large number of bioinformatics tools employing seed-based alignment techniques, such as Minimap2, use a single value of k per sequencing technology, without a strong guarantee that this is the best possible value. Given the ubiquity of sequence alignment, identifying values of k that lead to more sensitive alignments is thus an important task. To aid this, we present Seedability, a seed-based alignment framework designed for estimating an optimal seed k-mer length (as well as a minimal number of shared seeds) based on a given alignment identity threshold. In particular, we were motivated to make Minimap2 more sensitive in the pairwise alignment of short sequences.
The experimental results herein show improved alignments of short and divergent sequences when using the parameter values determined by Seedability in comparison to the default values of Minimap2. We also show several cases of pairs of real divergent sequences, where the default parameter values of Minimap2 yield no output alignments, but the values output by Seedability produce plausible alignments.
https://github.com/lorrainea/Seedability (distributed under GPL v3.0).},
keywords = {WP1: Primary CPG, WP2: Evolutionary/Comparative CPG},
pubstate = {published},
tppubtype = {article}
}
The experimental results herein show improved alignments of short and divergent sequences when using the parameter values determined by Seedability in comparison to the default values of Minimap2. We also show several cases of pairs of real divergent sequences, where the default parameter values of Minimap2 yield no output alignments, but the values output by Seedability produce plausible alignments.
https://github.com/lorrainea/Seedability (distributed under GPL v3.0).
Lee, Sewon; Kim, Gyuri; Karin, Eli Levy; Mirdita, Milot; Park, Sukhwan; Chikhi, Rayan; Babaian, Artem; Kryshtafovych, Andriy; Steinegger, Martin
Petascale Homology Search for Structure Prediction Unpublished
bioRxiv, 2023.
Abstract | Links | BibTeX | Tags: WP2: Evolutionary/Comparative CPG
@unpublished{Lee2023,
title = {Petascale Homology Search for Structure Prediction},
author = {Sewon Lee and Gyuri Kim and Eli Levy Karin and Milot Mirdita and Sukhwan Park and Rayan Chikhi and Artem Babaian and Andriy Kryshtafovych and Martin Steinegger},
doi = {10.1101/2023.07.10.548308},
year = {2023},
date = {2023-07-01},
urldate = {2023-07-01},
publisher = {Cold Spring Harbor Laboratory},
abstract = {The recent CASP15 competition highlighted the critical role of multiple sequence alignments (MSAs) in protein structure prediction, as demonstrated by the success of the top AlphaFold2-based prediction methods. To push the boundaries of MSA utilization, we conducted a petabase-scale search of the Sequence Read Archive (SRA), resulting in gigabytes of aligned homologs for CASP15 targets. These were merged with default MSAs produced by ColabFold-search and provided to ColabFold-predict. By using SRA data, we achieved highly accurate predictions (GDT_TS > 70) for 66% of the non-easy targets, whereas using ColabFold-search default MSAs scored highly in only 52%. Next, we tested the effect of deep homology search and ColabFold’s advanced features, such as more recycles, on prediction accuracy. While SRA homologs were most significant for improving ColabFold’s CASP15 ranking from 11th to 3rd place, other strategies contributed too. We analyze these in the context of existing strategies to improve prediction.},
howpublished = {bioRxiv},
keywords = {WP2: Evolutionary/Comparative CPG},
pubstate = {published},
tppubtype = {unpublished}
}
Gabory, Esteban; Mwaniki, Moses Njagi; Pisanti, Nadia; Pissis, Solon P; Radoszewski, Jakub; Sweering, Michelle; Zuba, Wiktor
Comparing elastic-degenerate strings: Algorithms, lower bounds, and applications Proceedings Article
In: Bulteau, Laurent; Lipt'ak, Zsuzsanna (Ed.): 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023), pp. 11:1–11:20, Schloss Dagstuhl - Leibniz-Zentrum für Informatik, Dagstuhl, Germany, 2023, ISBN: 978-3-95977-276-1.
Abstract | Links | BibTeX | Tags: WP1: Primary CPG, WP2: Evolutionary/Comparative CPG
@inproceedings{Gabory2023-vh,
title = {Comparing elastic-degenerate strings: Algorithms, lower bounds, and applications},
author = {Esteban Gabory and Moses Njagi Mwaniki and Nadia Pisanti and Solon P Pissis and Jakub Radoszewski and Michelle Sweering and Wiktor Zuba},
editor = {Bulteau, Laurent and Lipt'{a}k, Zsuzsanna},
doi = {10.4230/LIPIcs.CPM.2023.11},
isbn = {978-3-95977-276-1},
year = {2023},
date = {2023-06-01},
urldate = {2023-06-01},
booktitle = {34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)},
volume = {259},
pages = {11:1--11:20},
publisher = {Schloss Dagstuhl - Leibniz-Zentrum für Informatik},
address = {Dagstuhl, Germany},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
abstract = {An elastic-degenerate (ED) string T is a sequence of n sets T[1],…,T[n] containing m strings in total whose cumulative length is N. We call n, m, and N the length, the cardinality and the size of T, respectively. The language of T is defined as ℒ(T) = {S_1 ⋯ S_n : S_i ∈ T[i] for all i ∈ [1,n]}. ED strings have been introduced to represent a set of closely-related DNA sequences, also known as a pangenome. The basic question we investigate here is: Given two ED strings, how fast can we check whether the two languages they represent have a nonempty intersection? We call the underlying problem the ED String Intersection (EDSI) problem. For two ED strings T₁ and T₂ of lengths n₁ and n₂, cardinalities m₁ and m₂, and sizes N₁ and N₂, respectively, we show the following: - There is no 𝒪((N₁N₂)^{1-ε})-time algorithm, thus no 𝒪((N₁m₂+N₂m₁)^{1-ε})-time algorithm and no 𝒪((N₁n₂+N₂n₁)^{1-ε})-time algorithm, for any constant ε > 0, for EDSI even when T₁ and T₂ are over a binary alphabet, unless the Strong Exponential-Time Hypothesis is false. - There is no combinatorial 𝒪((N₁+N₂)^{1.2-ε}f(n₁,n₂))-time algorithm, for any constant ε > 0 and any function f, for EDSI even when T₁ and T₂ are over a binary alphabet, unless the Boolean Matrix Multiplication conjecture is false. - An 𝒪(N₁log N₁log n₁+N₂log N₂log n₂)-time algorithm for outputting a compact (RLE) representation of the intersection language of two unary ED strings. In the case when T₁ and T₂ are given in a compact representation, we show that the problem is NP-complete. - An 𝒪(N₁m₂+N₂m₁)-time algorithm for EDSI. - An Õ(N₁^{ω-1}n₂+N₂^{ω-1}n₁)-time algorithm for EDSI, where ω is the exponent of matrix multiplication; the Õ notation suppresses factors that are polylogarithmic in the input size. We also show that the techniques we develop have applications outside of ED string comparison.},
keywords = {WP1: Primary CPG, WP2: Evolutionary/Comparative CPG},
pubstate = {published},
tppubtype = {inproceedings}
}
Denti, Luca; Khorsand, Parsoa; Bonizzoni, Paola; Hormozdiari, Fereydoun; Chikhi, Rayan
SVDSS: structural variation discovery in hard-to-call genomic regions using sample-specific strings from accurate long reads Journal Article
In: Nat. Methods, vol. 20, no. 4, pp. 550–558, 2023, ISBN: 1548-7105.
Abstract | Links | BibTeX | Tags: WP2: Evolutionary/Comparative CPG
@article{Denti2023-xa,
title = {SVDSS: structural variation discovery in hard-to-call genomic regions using sample-specific strings from accurate long reads},
author = {Luca Denti and Parsoa Khorsand and Paola Bonizzoni and Fereydoun Hormozdiari and Rayan Chikhi},
doi = {10.1038/s41592-022-01674-1},
isbn = {1548-7105},
year = {2023},
date = {2023-04-01},
urldate = {2023-04-01},
journal = {Nat. Methods},
volume = {20},
number = {4},
pages = {550–558},
publisher = {Springer Science and Business Media LLC},
abstract = {Structural variants (SVs) account for a large amount of sequence variability across genomes and play an important role in human genomics and precision medicine. Despite intense efforts over the years, the discovery of SVs in individuals remains challenging due to the diploid and highly repetitive structure of the human genome, and by the presence of SVs that vastly exceed sequencing read lengths. However, the recent introduction of low-error long-read sequencing technologies such as PacBio HiFi may finally enable these barriers to be overcome. Here we present SV discovery with sample-specific strings (SVDSS)—a method for discovery of SVs from long-read sequencing technologies (for example, PacBio HiFi) that combines and effectively leverages mapping-free, mapping-based and assembly-based methodologies for overall superior SV discovery performance. Our experiments on several human samples show that SVDSS outperforms state-of-the-art mapping-based methods for discovery of insertion and deletion SVs in PacBio HiFi reads and achieves notable improvements in calling SVs in repetitive regions of the genome.},
keywords = {WP2: Evolutionary/Comparative CPG},
pubstate = {published},
tppubtype = {article}
}