Cartes, Jorge Avila; Bonizzoni, Paola; Ciccolella, Simone; Vedova, Gianluca Della; Denti, Luca
PangeBlocks: customized construction of pangenome graphs via maximal blocks Journal Article
In: BMC Bioinformatics, vol. 25, no. 1, 2024, ISSN: 1471-2105.
Abstract | Links | BibTeX | Tags: WP2: Evolutionary/Comparative CPG
@article{AvilaCartes2024a,
title = {PangeBlocks: customized construction of pangenome graphs via maximal blocks},
author = {Jorge Avila Cartes and Paola Bonizzoni and Simone Ciccolella and Gianluca Della Vedova and Luca Denti},
url = {https://github.com/AlgoLab/pangeblocks},
doi = {10.1186/s12859-024-05958-5},
issn = {1471-2105},
year = {2024},
date = {2024-11-01},
urldate = {2024-11-01},
journal = {BMC Bioinformatics},
volume = {25},
number = {1},
publisher = {Springer Science and Business Media LLC},
abstract = {\textbf{Background}
The construction of a pangenome graph is a fundamental task in pangenomics. A natural theoretical question is how to formalize the computational problem of building an optimal pangenome graph, making explicit the underlying optimization criterion and the set of feasible solutions. Current approaches build a pangenome graph with some heuristics, without assuming some explicit optimization criteria. Thus it is unclear how a specific optimization criterion affects the graph topology and downstream analysis, like read mapping and variant calling.
\textbf{Results}
In this paper, by leveraging the notion of maximal block in a Multiple Sequence Alignment (MSA), we reframe the pangenome graph construction problem as an exact cover problem on blocks called Minimum Weighted Block Cover (MWBC). Then we propose an Integer Linear Programming (ILP) formulation for the MWBC problem that allows us to study the most natural objective functions for building a graph. We provide an implementation of the ILP approach for solving the MWBC and we evaluate it on SARS-CoV-2 complete genomes, showing how different objective functions lead to pangenome graphs that have different properties, hinting that the specific downstream task can drive the graph construction phase.
\textbf{Conclusion}
We show that a customized construction of a pangenome graph based on selecting objective functions has a direct impact on the resulting graphs. In particular, our formalization of the MWBC problem, based on finding an optimal subset of blocks covering an MSA, paves the way to novel practical approaches to graph representations of an MSA where the user can guide the construction.},
keywords = {WP2: Evolutionary/Comparative CPG},
pubstate = {published},
tppubtype = {article}
}
The construction of a pangenome graph is a fundamental task in pangenomics. A natural theoretical question is how to formalize the computational problem of building an optimal pangenome graph, making explicit the underlying optimization criterion and the set of feasible solutions. Current approaches build a pangenome graph with some heuristics, without assuming some explicit optimization criteria. Thus it is unclear how a specific optimization criterion affects the graph topology and downstream analysis, like read mapping and variant calling.
Results
In this paper, by leveraging the notion of maximal block in a Multiple Sequence Alignment (MSA), we reframe the pangenome graph construction problem as an exact cover problem on blocks called Minimum Weighted Block Cover (MWBC). Then we propose an Integer Linear Programming (ILP) formulation for the MWBC problem that allows us to study the most natural objective functions for building a graph. We provide an implementation of the ILP approach for solving the MWBC and we evaluate it on SARS-CoV-2 complete genomes, showing how different objective functions lead to pangenome graphs that have different properties, hinting that the specific downstream task can drive the graph construction phase.
Conclusion
We show that a customized construction of a pangenome graph based on selecting objective functions has a direct impact on the resulting graphs. In particular, our formalization of the MWBC problem, based on finding an optimal subset of blocks covering an MSA, paves the way to novel practical approaches to graph representations of an MSA where the user can guide the construction.
Rizzo, Nicola; Equi, Massimo; Norri, Tuukka; Mäkinen, Veli
Elastic founder graphs improved and enhanced Journal Article
In: Theoretical Computer Science, vol. 982, no. 114269, pp. 114269, 2024, ISSN: 0304-3975.
Abstract | Links | BibTeX | Tags: WP1: Primary CPG
@article{Rizzo2024-jy,
title = {Elastic founder graphs improved and enhanced},
author = {Nicola Rizzo and Massimo Equi and Tuukka Norri and Veli Mäkinen},
doi = {10.1016/j.tcs.2023.114269},
issn = {0304-3975},
year = {2024},
date = {2024-01-01},
urldate = {2024-01-01},
journal = {Theoretical Computer Science},
volume = {982},
number = {114269},
pages = {114269},
publisher = {Elsevier BV},
abstract = {Indexing labeled graphs for pattern matching is a central challenge of pangenomics. Equi et al. (2022) [14] developed the Elastic Founder Graph (EFG) representing an alignment of m sequences of length n, drawn from alphabet Σ plus the special gap character: the paths spell the original sequences or their recombination. By enforcing the semi-repeat-free property, the EFG admits a polynomial-space index for linear-time pattern matching, breaking through the conditional lower bounds on indexing labeled graphs (Equi et al. [13]). In this work, we improve the space of the EFG index answering pattern matching queries in linear time, from linear in the length of all strings spelled by three consecutive node labels, to linear in the size of the edge labels. Then, we develop linear-time construction algorithms optimizing for different metrics: we improve the existing linearithmic construction algorithms to O(mn), by solving the novel exclusive ancestor set problem on trees; we propose, for the simplified gapless setting, an O(mn)-time solution minimizing the maximum block height, that we generalize by substituting block height with prefix-aware height. Finally, to show the versatility of the framework, we develop a BWT-based EFG index and study how to encode and perform document listing queries on a set of paths of the graphs, reporting which paths present a given pattern as a substring. We propose the EFG framework as an improved and enhanced version of the framework for the gapless setting, along with construction methods that are valid in any setting concerned with the segmentation of aligned sequences.},
keywords = {WP1: Primary CPG},
pubstate = {published},
tppubtype = {article}
}
Benoit, Gaëtan; Raguideau, Sébastien; James, Robert; Phillippy, Adam M.; Chikhi, Rayan; Quince, Christopher
High-quality metagenome assembly from long accurate reads with metaMDBG Journal Article
In: Nature Biotechnology, 2024, ISSN: 1546-1696.
Abstract | Links | BibTeX | Tags: WP1: Primary CPG
@article{Benoit2024,
title = {High-quality metagenome assembly from long accurate reads with metaMDBG},
author = {Gaëtan Benoit and Sébastien Raguideau and Robert James and Adam M. Phillippy and Rayan Chikhi and Christopher Quince},
doi = {10.1038/s41587-023-01983-6},
issn = {1546-1696},
year = {2024},
date = {2024-01-01},
urldate = {2024-01-01},
journal = {Nature Biotechnology},
publisher = {Springer Science and Business Media LLC},
abstract = {We introduce metaMDBG, a metagenomics assembler for PacBio HiFi reads. MetaMDBG combines a de Bruijn graph assembly in a minimizer space with an iterative assembly over sequences of minimizers to address variations in genome coverage depth and an abundance-based filtering strategy to simplify strain complexity. For complex communities, we obtained up to twice as many high-quality circularized prokaryotic metagenome-assembled genomes as existing methods and had better recovery of viruses and plasmids.},
keywords = {WP1: Primary CPG},
pubstate = {published},
tppubtype = {article}
}
Cartes, Jorge Avila; Bonizzoni, Paola; Ciccolella, Simone; Vedova, Gianluca Della; Denti, Luca; Didelot, Xavier; Monti, Davide Cesare; Pirola, Yuri
RecGraph: recombination-aware alignment of sequences to variation graphs Journal Article
In: Bioinformatics, vol. 40, no. 5, pp. btae292, 2024, ISSN: 1367-4811.
Abstract | Links | BibTeX | Tags: WP2: Evolutionary/Comparative CPG
@article{Avila2024,
title = {RecGraph: recombination-aware alignment of sequences to variation graphs},
author = {Jorge Avila Cartes and Paola Bonizzoni and Simone Ciccolella and Gianluca Della Vedova and Luca Denti and Xavier Didelot and Davide Cesare Monti and Yuri Pirola},
url = {https://github.com/AlgoLab/RecGraph},
doi = {10.1093/bioinformatics/btae292},
issn = {1367-4811},
year = {2024},
date = {2024-01-01},
urldate = {2024-01-01},
journal = {Bioinformatics},
volume = {40},
number = {5},
pages = {btae292},
abstract = {Bacterial genomes present more variability than human genomes, which requires important adjustments in computational tools that are developed for human data. In particular, bacteria exhibit a mosaic structure due to homologous recombinations, but this fact is not sufficiently captured by standard read mappers that align against linear reference genomes. The recent introduction of pangenomics provides some insights in that context, as a pangenome graph can represent the variability within a species. However, the concept of sequence-to-graph alignment that captures the presence of recombinations has not been previously investigated.In this paper, we present the extension of the notion of sequence-to-graph alignment to a variation graph that incorporates a recombination, so that the latter are explicitly represented and evaluated in an alignment. Moreover, we present a dynamic programming approach for the special case where there is at most a recombination—we implement this case as RecGraph. From a modelling point of view, a recombination corresponds to identifying a new path of the variation graph, where the new arc is composed of two halves, each extracted from an original path, possibly joined by a new arc. Our experiments show that RecGraph accurately aligns simulated recombinant bacterial sequences that have at most a recombination, providing evidence for the presence of recombination events.
Our implementation is open source and available at https://github.com/AlgoLab/RecGraph.},
keywords = {WP2: Evolutionary/Comparative CPG},
pubstate = {published},
tppubtype = {article}
}
Our implementation is open source and available at https://github.com/AlgoLab/RecGraph.
Sauer, Lena M.; Cánovas, Rodrigo; Roche, Daniel Barry; Shams-Eldin, Hosam; Ravel, Patrice; Colinge, Jacques; Schwarz, Ralph T.; Mamoun, Choukri Ben; Rivals, Eric; Cornillot, Emmanuel
In: Malaria Journal, vol. 22, no. 1, pp. 27-46, 2023, ISBN: 1475-2875.
Abstract | Links | BibTeX | Tags: Misc
@article{Sauer2023,
title = {FT-GPI, a highly sensitive and accurate predictor of GPI-anchored proteins, reveals the composition and evolution of the GPI proteome in Plasmodium species},
author = {Lena M. Sauer and Rodrigo Cánovas and Daniel Barry Roche and Hosam Shams-Eldin and Patrice Ravel and Jacques Colinge and Ralph T. Schwarz and Choukri Ben Mamoun and Eric Rivals and Emmanuel Cornillot},
doi = {10.1186/s12936-022-04430-0},
isbn = {1475-2875},
year = {2023},
date = {2023-12-01},
urldate = {2023-12-01},
journal = {Malaria Journal},
volume = {22},
number = {1},
pages = {27-46},
publisher = {BioMed Central},
abstract = {Protozoan parasites are known to attach specific and diverse group of proteins to their plasma membrane via a GPI anchor. In malaria parasites, GPI-anchored proteins (GPI-APs) have been shown to play an important role in host–pathogen interactions and a key function in host cell invasion and immune evasion. Because of their immunogenic properties, some of these proteins have been considered as malaria vaccine candidates. However, identification of all possible GPI-APs encoded by these parasites remains challenging due to their sequence diversity and limitations of the tools used for their characterization.},
keywords = {Misc},
pubstate = {published},
tppubtype = {article}
}
Kang, Xiongbin; Xu, Jialu; Luo, Xiao; Schönhuth, Alexander
Hybrid-hybrid correction of errors in long reads with HERO Journal Article
In: Genome Biol., vol. 24, no. 1, pp. 275, 2023, ISBN: 1474-760X.
Abstract | Links | BibTeX | Tags: WP1: Primary CPG
@article{Kang2023-jk,
title = {Hybrid-hybrid correction of errors in long reads with HERO},
author = {Xiongbin Kang and Jialu Xu and Xiao Luo and Alexander Schönhuth},
doi = {10.1186/s13059-023-03112-7},
isbn = {1474-760X},
year = {2023},
date = {2023-12-01},
urldate = {2023-12-01},
journal = {Genome Biol.},
volume = {24},
number = {1},
pages = {275},
abstract = {Although generally superior, hybrid approaches for correcting errors in third-generation sequencing (TGS) reads, using next-generation sequencing (NGS) reads, mistake haplotype-specific variants for errors in polyploid and mixed samples. We suggest HERO, as the first “hybrid-hybrid”approach, to make use of both de Bruijn graphs and overlap graphs for optimal catering to the particular strengths of NGS and TGS reads. Extensive benchmarking experiments demonstrate that HERO improves indel and mismatch error rates by on average 65% (27~95%) and 20% (4~61%). Using HERO prior to genome assembly significantly improves the assemblies in the majority of the relevant categories.},
keywords = {WP1: Primary CPG},
pubstate = {published},
tppubtype = {article}
}
Bernardini, Giulia; Fici, Gabriele; Gawrychowski, Paweł; Pissis, Solon P
Substring Complexity in Sublinear Space Proceedings Article
In: Iwata, Satoru; Kakimura, Naonori (Ed.): 34th International Symposium on Algorithms and Computation (ISAAC 2023), pp. 12:1–12:19, Schloss Dagstuhl - Leibniz-Zentrum für Informatik, Dagstuhl, Germany, 2023, ISBN: 978-3-95977-289-1.
Abstract | Links | BibTeX | Tags: WP1: Primary CPG
@inproceedings{Bernardini2023-nr,
title = {Substring Complexity in Sublinear Space},
author = {Giulia Bernardini and Gabriele Fici and Paweł Gawrychowski and Solon P Pissis},
editor = {Iwata, Satoru and Kakimura, Naonori},
url = {https://arxiv.org/abs/2112.10376},
doi = {10.4230/LIPICS.ISAAC.2023.12},
isbn = {978-3-95977-289-1},
year = {2023},
date = {2023-11-01},
urldate = {2023-11-01},
booktitle = {34th International Symposium on Algorithms and Computation (ISAAC 2023)},
volume = {283},
pages = {12:1--12:19},
publisher = {Schloss Dagstuhl - Leibniz-Zentrum für Informatik},
address = {Dagstuhl, Germany},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
abstract = {Shannon’s entropy is a definitive lower bound for statistical compression. Unfortunately, no such clear measure exists for the compressibility of repetitive strings. Thus, ad hoc measures are employed to estimate the repetitiveness of strings, e.g., the size z of the Lempel–Ziv parse or the number r of equal-letter runs of the Burrows-Wheeler transform. A more recent one is the size γ of a smallest string attractor. Let T be a string of length n. A string attractor of T is a set of positions of T capturing the occurrences of all the substrings of T. Unfortunately, Kempa and Prezza [STOC 2018] showed that computing γ is NP-hard. Kociumaka et al. [LATIN 2020] considered a new measure of compressibility that is based on the function S_T(k) counting the number of distinct substrings of length k of T, also known as the substring complexity of T. This new measure is defined as δ = sup{S_T(k)/k, k ≥ 1} and lower bounds all the relevant ad hoc measures previously considered. In particular, δ ≤ γ always holds and δ can be computed in 𝒪(n) time using Θ(n) working space. Kociumaka et al. showed that one can construct an 𝒪(δ log n/(δ))-sized representation of T supporting efficient direct access and efficient pattern matching queries on T. Given that for highly compressible strings, δ is significantly smaller than n, it is natural to pose the following question: Can we compute δ efficiently using sublinear working space? It is straightforward to show that in the comparison model, any algorithm computing δ using 𝒪(b) space requires Ω(n^{2-o(1)}/b) time through a reduction from the element distinctness problem [Yao, SIAM J. Comput. 1994]. We thus wanted to investigate whether we can indeed match this lower bound. We address this algorithmic challenge by showing the following bounds to compute δ: - 𝒪((n³log b)/b²) time using 𝒪(b) space, for any b ∈ [1,n], in the comparison model. - 𝒪̃(n²/b) time using 𝒪̃(b) space, for any b ∈ [√n,n], in the word RAM model. This gives an 𝒪̃(n^{1+ε})-time and 𝒪̃(n^{1-ε})-space algorithm to compute δ, for any 0 < ε ≤ 1/2. Let us remark that our algorithms compute S_T(k), for all k, within the same complexities.},
keywords = {WP1: Primary CPG},
pubstate = {published},
tppubtype = {inproceedings}
}
Loukides, Grigorios; Pissis, Solon P; Sweering, Michelle
Bidirectional string anchors for improved text indexing and top-$K$ similarity search Journal Article
In: IEEE Trans. Knowl. Data Eng., vol. 35, no. 11, pp. 11093–11111, 2023, ISBN: 1558-2191.
Abstract | Links | BibTeX | Tags: WP1: Primary CPG
@article{Loukides2023-qt,
title = {Bidirectional string anchors for improved text indexing and top-$K$ similarity search},
author = {Grigorios Loukides and Solon P Pissis and Michelle Sweering},
doi = {10.1109/TKDE.2022.3231780},
isbn = {1558-2191},
year = {2023},
date = {2023-11-01},
urldate = {2023-11-01},
journal = {IEEE Trans. Knowl. Data Eng.},
volume = {35},
number = {11},
pages = {11093–11111},
publisher = {Institute of Electrical and Electronics Engineers (IEEE)},
abstract = {The minimizers sampling mechanism is a popular mechanism for string sampling. However, minimizers sampling mechanisms lack good guarantees on the expected size of their samples for different combinations of their input parameters. Furthermore, indexes constructed over minimizers samples lack good worst-case guarantees for on-line pattern searches. In response, we propose bidirectional string anchors (bd-anchors), a new string sampling mechanism. Given an integer ℓ , our mechanism selects the lexicographically smallest rotation in every length-ℓ fragment. We show that, like minimizers samples, bd-anchors samples are approximately uniform, locally consistent, and computable in linear time. Furthermore, our experiments demonstrate that the bd-anchors sample sizes decrease proportionally to ℓ ; and that these sizes are competitive to or smaller than the minimizers sample sizes. We theoretically justify these results by analyzing the expected size of bd-anchors samples. We also prove that computing a total order on the input alphabet which minimizes the bd-anchors sample size is NP-hard. We next highlight the benefits of bd-anchors in two important applications: text indexing and top-K similarity search. For the first application, we develop an index for performing on-line pattern searches in near-optimal time, and show experimentally that a simple implementation of our index is consistently faster for on-line pattern searches than an analogous implementation of a minimizers-based index; we also show that it is substantially faster than two classic text indexes. For the second application, we develop a heuristic for top-K similarity search under edit distance, and show experimentally that it is generally as accurate as the state-of-the-art tool for the same purpose but more than one order of magnitude faster.},
keywords = {WP1: Primary CPG},
pubstate = {published},
tppubtype = {article}
}
Romashchenko, Nikolai; Linard, Benjamin; Pardi, Fabio; Rivals, Eric
EPIK: Precise and scalable evolutionary placement with informative textitk-mers Journal Article
In: Bioinformatics, vol. 39, no. 12, 2023, ISSN: 1367-4811.
Abstract | Links | BibTeX | Tags: WP1: Primary CPG
@article{Romashchenko2023-kq,
title = {EPIK: Precise and scalable evolutionary placement with informative textitk-mers},
author = {Nikolai Romashchenko and Benjamin Linard and Fabio Pardi and Eric Rivals},
url = {https://github.com/phylo42/IPK
https://github.com/phylo42/EPIK},
doi = {10.1093/bioinformatics/btad692},
issn = {1367-4811},
year = {2023},
date = {2023-11-01},
urldate = {2023-11-01},
journal = {Bioinformatics},
volume = {39},
number = {12},
publisher = {Oxford University Press (OUP)},
abstract = {Phylogenetic placement enables phylogenetic analysis of massive collections of newly sequenced DNA, when de novo tree inference is too unreliable or inefficient. Assuming that a high-quality reference tree is available, the idea is to seek the correct placement of the new sequences in that tree. Recently, alignment-free approaches to phylogenetic placement have emerged, both to circumvent the need to align the new sequences and to avoid the calculations that typically follow the alignment step. A promising approach is based on the inference of k-mers that can be potentially related to the reference sequences, also called phylo-k-mers. However, its usage is limited by the time and memory-consuming stage of reference data preprocessing and the large numbers of k-mers to consider.
We suggest a filtering method for selecting informative phylo-k-mers based on mutual information, which can significantly improve the efficiency of placement, at the cost of a small loss in placement accuracy. This method is implemented in IPK, a new tool for computing phylo-k-mers that significantly outperforms the software previously available. We also present EPIK, a new software for phylogenetic placement, supporting filtered phylo-k-mer databases. Our experiments on real-world data show that EPIK is the fastest phylogenetic placement tool available, when placing hundreds of thousands and millions of queries while still providing accurate placements.
IPK and EPIK are freely available at https://github.com/phylo42/IPK and https://github.com/phylo42/EPIK. Both are implemented in C++ and Python and supported on Linux and MacOS.},
keywords = {WP1: Primary CPG},
pubstate = {published},
tppubtype = {article}
}
We suggest a filtering method for selecting informative phylo-k-mers based on mutual information, which can significantly improve the efficiency of placement, at the cost of a small loss in placement accuracy. This method is implemented in IPK, a new tool for computing phylo-k-mers that significantly outperforms the software previously available. We also present EPIK, a new software for phylogenetic placement, supporting filtered phylo-k-mer databases. Our experiments on real-world data show that EPIK is the fastest phylogenetic placement tool available, when placing hundreds of thousands and millions of queries while still providing accurate placements.
IPK and EPIK are freely available at https://github.com/phylo42/IPK and https://github.com/phylo42/EPIK. Both are implemented in C++ and Python and supported on Linux and MacOS.
Andreace, Francesco; Lechat, Pierre; Dufresne, Yoann; Chikhi, Rayan
Comparing methods for constructing and representing human pangenome graphs Journal Article
In: Genome Biology, vol. 24, no. 1, pp. 274, 2023, ISSN: 1474-760X.
Abstract | Links | BibTeX | Tags: WP1: Primary CPG
@article{Andreace2023,
title = {Comparing methods for constructing and representing human pangenome graphs},
author = {Francesco Andreace and Pierre Lechat and Yoann Dufresne and Rayan Chikhi},
doi = {10.1186/s13059-023-03098-2},
issn = {1474-760X},
year = {2023},
date = {2023-11-01},
urldate = {2023-11-01},
journal = {Genome Biology},
volume = {24},
number = {1},
pages = {274},
publisher = {Springer Science and Business Media LLC},
abstract = {As a single reference genome cannot possibly represent all the variation present across human individuals, pangenome graphs have been introduced to incorporate population diversity within a wide range of genomic analyses. Several data structures have been proposed for representing collections of genomes as pangenomes, in particular graphs.},
keywords = {WP1: Primary CPG},
pubstate = {published},
tppubtype = {article}
}
González, Camila Duitama; Rangavittal, Samarth; Vicedomini, Riccardo; Chikhi, Rayan; Richard, Hugues
aKmerBroom: Ancient oral DNA decontamination using Bloom filters on k-mer sets Journal Article
In: iScience, vol. 26, no. 11, pp. 108057, 2023, ISSN: 2589-0042.
Abstract | Links | BibTeX | Tags: WP1: Primary CPG
@article{DuitamaGonzalez2023,
title = {aKmerBroom: Ancient oral DNA decontamination using Bloom filters on k-mer sets},
author = {Camila Duitama González and Samarth Rangavittal and Riccardo Vicedomini and Rayan Chikhi and Hugues Richard},
doi = {10.1016/j.isci.2023.108057},
issn = {2589-0042},
year = {2023},
date = {2023-11-01},
urldate = {2023-11-01},
journal = {iScience},
volume = {26},
number = {11},
pages = {108057},
publisher = {Elsevier BV},
abstract = {Dental calculus samples are modeled as a mixture of DNA coming from dental plaque and contaminants. Current computational decontamination methods such as Recentrifuge and DeconSeq require either a reference database or sequenced negative controls, and therefore have limited use cases. We present a reference-free decontamination tool tailored for the removal of contaminant DNA of ancient oral sample called aKmerBroom. Our tool builds a Bloom filter of known ancient and modern oral k-mers, then scans an input set of ancient metagenomic reads using multiple passes to iteratively retain reads likely to be of oral origin. On synthetic data, aKmerBroom achieves over 89.53% sensitivity and 94.00% specificity. On real datasets, aKmerBroom shows higher read retainment (+60% on average) than other methods. We anticipate aKmerBroom will be a valuable tool for the processing of ancient oral samples as it will prevent contaminated datasets from being completely discarded in downstream analyses.},
keywords = {WP1: Primary CPG},
pubstate = {published},
tppubtype = {article}
}
Willink, Beatriz; Tunström, Kalle; Nilén, Sofie; Chikhi, Rayan; Lemane, Téo; Takahashi, Michihiko; Takahashi, Yuma; Svensson, Erik I.; Wheat, Christopher West
The genomics and evolution of inter-sexual mimicry and female-limited polymorphisms in damselflies Journal Article
In: Nature Ecology & Evolution, vol. 8, no. 1, pp. 83–97, 2023, ISSN: 2397-334X.
Abstract | Links | BibTeX | Tags: WP2: Evolutionary/Comparative CPG, WP3: Translational CPG
@article{Willink2023,
title = {The genomics and evolution of inter-sexual mimicry and female-limited polymorphisms in damselflies},
author = {Beatriz Willink and Kalle Tunström and Sofie Nilén and Rayan Chikhi and Téo Lemane and Michihiko Takahashi and Yuma Takahashi and Erik I. Svensson and Christopher West Wheat},
doi = {10.1038/s41559-023-02243-1},
issn = {2397-334X},
year = {2023},
date = {2023-11-01},
urldate = {2023-11-01},
journal = {Nature Ecology & Evolution},
volume = {8},
number = {1},
pages = {83–97},
publisher = {Springer Science and Business Media LLC},
abstract = {Sex-limited morphs can provide profound insights into the evolution and genomic architecture of complex phenotypes. Inter-sexual mimicry is one particular type of sex-limited polymorphism in which a novel morph resembles the opposite sex. While inter-sexual mimics are known in both sexes and a diverse range of animals, their evolutionary origin is poorly understood. Here, we investigated the genomic basis of female-limited morphs and male mimicry in the common bluetail damselfly. Differential gene expression between morphs has been documented in damselflies, but no causal locus has been previously identified. We found that male mimicry originated in an ancestrally sexually dimorphic lineage in association with multiple structural changes, probably driven by transposable element activity. These changes resulted in ~900 kb of novel genomic content that is partly shared by male mimics in a close relative, indicating that male mimicry is a trans-species polymorphism. More recently, a third morph originated following the translocation of part of the male-mimicry sequence into a genomic position ~3.5 mb apart. We provide evidence of balancing selection maintaining male mimicry, in line with previous field population studies. Our results underscore how structural variants affecting a handful of potentially regulatory genes and morph-specific genes can give rise to novel and complex phenotypic polymorphisms.},
keywords = {WP2: Evolutionary/Comparative CPG, WP3: Translational CPG},
pubstate = {published},
tppubtype = {article}
}
Herencsárová, Eva; Brejová, Broňa
Identifying Clusters in Graph Representations of Genomes Proceedings Article
In: Brejová, Broňa; Ciencialová, Lucie; Holeňa, Martin; Jajcay, Róbert; Jajcayová, Tatiana; Lexa, Matej; Mráz, František; Pardubská, Dana; Plátek, Martin (Ed.): Proceedings of the 23rd Conference Information Technologies – Applications and Theory (ITAT 2023), pp. 232–241, CEUR Workshop Proceedings, Tatranské Matliare, Slovakia, 2023.
Abstract | Links | BibTeX | Tags: WP2: Evolutionary/Comparative CPG
@inproceedings{Herencsarova2023,
title = {Identifying Clusters in Graph Representations of Genomes},
author = {Eva Herencsárová and Broňa Brejová},
editor = {Broňa Brejová and Lucie Ciencialová and Martin Holeňa and Róbert Jajcay and Tatiana Jajcayová and Matej Lexa and František Mráz and Dana Pardubská and Martin Plátek},
url = {https://ceur-ws.org/Vol-3498/paper30.pdf},
year = {2023},
date = {2023-09-23},
urldate = {2023-09-01},
booktitle = {Proceedings of the 23rd Conference Information Technologies – Applications and Theory (ITAT 2023)},
volume = {3498},
pages = {232–241},
publisher = {CEUR Workshop Proceedings},
address = {Tatranské Matliare, Slovakia},
abstract = {In many bioinformatics applications the task is to identify biologically significant locations in an individual genome. In our work, we are interested in finding high-density clusters of such biologically meaningful locations in a graph representation of a pangenome, which is a collection of related genomes. Different formulations of finding such clusters were previously studied for sequences. In this work, we study an extension of this problem for graphs, which we formalize as finding a set of vertex-disjoint paths with a maximum score in a weighted directed graph. We provide a linear-time algorithm for a special class of graphs corresponding to elastic-degenerate strings, one of pangenome representations. We also provide a fixed-parameter tractable algorithm for directed acyclic graphs with a special path decomposition of a limited width.},
keywords = {WP2: Evolutionary/Comparative CPG},
pubstate = {published},
tppubtype = {inproceedings}
}
Bernardini, Giulia; Iersel, Leo; Julien, Esther; Stougie, Leen
Constructing phylogenetic networks via cherry picking and machine learning Journal Article
In: Algorithms for Molecular Biology, vol. 18, no. 1, pp. 13, 2023, ISSN: 1748-7188.
Abstract | Links | BibTeX | Tags: WP2: Evolutionary/Comparative CPG
@article{Bernardini2023,
title = {Constructing phylogenetic networks via cherry picking and machine learning},
author = {Giulia Bernardini and Leo Iersel and Esther Julien and Leen Stougie},
doi = {10.1186/s13015-023-00233-3},
issn = {1748-7188},
year = {2023},
date = {2023-09-16},
urldate = {2023-09-16},
journal = {Algorithms for Molecular Biology},
volume = {18},
number = {1},
pages = {13},
publisher = {Springer Science and Business Media LLC},
abstract = {\textbf{Background}
Combining a set of phylogenetic trees into a single phylogenetic network that explains all of them is a fundamental challenge in evolutionary studies. Existing methods are computationally expensive and can either handle only small numbers of phylogenetic trees or are limited to severely restricted classes of networks.
\textbf{Results}
In this paper, we apply the recently-introduced theoretical framework of cherry picking to design a class of efficient heuristics that are guaranteed to produce a network containing each of the input trees, for practical-size datasets consisting of binary trees. Some of the heuristics in this framework are based on the design and training of a machine learning model that captures essential information on the structure of the input trees and guides the algorithms towards better solutions. We also propose simple and fast randomised heuristics that prove to be very effective when run multiple times.
\textbf{Conclusions}
Unlike the existing exact methods, our heuristics are applicable to datasets of practical size, and the experimental study we conducted on both simulated and real data shows that these solutions are qualitatively good, always within some small constant factor from the optimum. Moreover, our machine-learned heuristics are one of the first applications of machine learning to phylogenetics and show its promise.},
keywords = {WP2: Evolutionary/Comparative CPG},
pubstate = {published},
tppubtype = {article}
}
Combining a set of phylogenetic trees into a single phylogenetic network that explains all of them is a fundamental challenge in evolutionary studies. Existing methods are computationally expensive and can either handle only small numbers of phylogenetic trees or are limited to severely restricted classes of networks.
Results
In this paper, we apply the recently-introduced theoretical framework of cherry picking to design a class of efficient heuristics that are guaranteed to produce a network containing each of the input trees, for practical-size datasets consisting of binary trees. Some of the heuristics in this framework are based on the design and training of a machine learning model that captures essential information on the structure of the input trees and guides the algorithms towards better solutions. We also propose simple and fast randomised heuristics that prove to be very effective when run multiple times.
Conclusions
Unlike the existing exact methods, our heuristics are applicable to datasets of practical size, and the experimental study we conducted on both simulated and real data shows that these solutions are qualitatively good, always within some small constant factor from the optimum. Moreover, our machine-learned heuristics are one of the first applications of machine learning to phylogenetics and show its promise.
Romashchenko, Nikolai; Linard, Benjamin; Rivals, Eric; Pardi, Fabio
Computing Phylo- k-mers Journal Article
In: IEEE/ACM Trans. Comput. Biol. Bioinform., vol. 20, no. 5, pp. 2889–2897, 2023, ISBN: 1557-9964.
Abstract | Links | BibTeX | Tags: WP1: Primary CPG
@article{Romashchenko2023-hj,
title = {Computing Phylo- k-mers},
author = {Nikolai Romashchenko and Benjamin Linard and Eric Rivals and Fabio Pardi},
url = {https://hal-lirmm.ccsd.cnrs.fr/lirmm-03778953v2},
doi = {10.1109/TCBB.2023.3278049},
isbn = {1557-9964},
year = {2023},
date = {2023-09-01},
urldate = {2023-09-01},
journal = {IEEE/ACM Trans. Comput. Biol. Bioinform.},
volume = {20},
number = {5},
pages = {2889–2897},
abstract = {Finding the correct position of new sequences within an established phylogenetic tree is an increasingly relevant problem in evolutionary bioinformatics and metagenomics. Recently, alignment-free approaches for this task have been proposed. One such approach is based on the concept of phylogenetically-informative k-mers or phylo- k-mers for short. In practice, phylo- k-mers are inferred from a set of related reference sequences and are equipped with scores expressing the probability of their appearance in different locations within the input reference phylogeny. Computing phylo- k-mers, however, represents a computational bottleneck to their applicability in real-world problems such as the phylogenetic analysis of metabarcoding reads and the detection of novel recombinant viruses. Here we consider the problem of phylo- k-mer computation: how can we efficiently find all k-mers whose probability lies above a given threshold for a given tree node? We describe and analyze algorithms for this problem, relying on branch-and-bound and divide-and-conquer techniques. We exploit the redundancy of adjacent windows of the alignment to save on computation. Besides computational complexity analyses, we provide an empirical evaluation of the relative performance of their implementations on simulated and real-world data. The divide-and-conquer algorithms are found to surpass the branch-and-bound approach, especially when many phylo- k-mers are found.},
keywords = {WP1: Primary CPG},
pubstate = {published},
tppubtype = {article}
}
Cozzi, Davide; Rossi, Massimiliano; Rubinacci, Simone; Gagie, Travis; Köppl, Dominik; Boucher, Christina; Bonizzoni, Paola
μ- PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank data Journal Article
In: Bioinformatics, vol. 39, no. 9, 2023, ISSN: 1367-4811.
Abstract | Links | BibTeX | Tags: WP2: Evolutionary/Comparative CPG
@article{Cozzi2023-bx,
title = {μ- PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank data},
author = {Davide Cozzi and Massimiliano Rossi and Simone Rubinacci and Travis Gagie and Dominik Köppl and Christina Boucher and Paola Bonizzoni},
url = {https://github.com/dlcgold/muPBWT},
doi = {10.1093/bioinformatics/btad552},
issn = {1367-4811},
year = {2023},
date = {2023-09-01},
urldate = {2023-09-01},
journal = {Bioinformatics},
volume = {39},
number = {9},
publisher = {Oxford University Press (OUP)},
abstract = {The Positional Burrows–Wheeler Transform (PBWT) is a data structure that indexes haplotype sequences in a manner that enables finding maximal haplotype matches in h sequences containing w variation sites in O(hw) time. This represents a significant improvement over classical quadratic-time approaches. However, the original PBWT data structure does not allow for queries over Biobank panels that consist of several millions of haplotypes, if an index of the haplotypes must be kept entirely in memory.
In this article, we leverage the notion of r-index proposed for the BWT to present a memory-efficient method for constructing and storing the run-length encoded PBWT, and computing set maximal matches (SMEMs) queries in haplotype sequences. We implement our method, which we refer to as μ-PBWT, and evaluate it on datasets of 1000 Genome Project and UK Biobank data. Our experiments demonstrate that the μ-PBWT reduces the memory usage up to a factor of 20% compared to the best current PBWT-based indexing. In particular, μ-PBWT produces an index that stores high-coverage whole genome sequencing data of chromosome 20 in about a third of the space of its BCF file. μ-PBWT is an adaptation of techniques for the run-length compressed BWT for the PBWT (RLPBWT) and it is based on keeping in memory only a succinct representation of the RLPBWT that still allows the efficient computation of set maximal matches (SMEMs) over the original panel.
Our implementation is open source and available at https://github.com/dlcgold/muPBWT. The binary is available at https://bioconda.github.io/recipes/mupbwt/README.html.},
keywords = {WP2: Evolutionary/Comparative CPG},
pubstate = {published},
tppubtype = {article}
}
In this article, we leverage the notion of r-index proposed for the BWT to present a memory-efficient method for constructing and storing the run-length encoded PBWT, and computing set maximal matches (SMEMs) queries in haplotype sequences. We implement our method, which we refer to as μ-PBWT, and evaluate it on datasets of 1000 Genome Project and UK Biobank data. Our experiments demonstrate that the μ-PBWT reduces the memory usage up to a factor of 20% compared to the best current PBWT-based indexing. In particular, μ-PBWT produces an index that stores high-coverage whole genome sequencing data of chromosome 20 in about a third of the space of its BCF file. μ-PBWT is an adaptation of techniques for the run-length compressed BWT for the PBWT (RLPBWT) and it is based on keeping in memory only a succinct representation of the RLPBWT that still allows the efficient computation of set maximal matches (SMEMs) over the original panel.
Our implementation is open source and available at https://github.com/dlcgold/muPBWT. The binary is available at https://bioconda.github.io/recipes/mupbwt/README.html.
Bonnet, Konstantinn; Marschall, Tobias; Doerr, Daniel
Constructing founder sets under allelic and non-allelic homologous recombination Journal Article
In: Algorithms Mol. Biol., vol. 18, no. 1, pp. 15, 2023, ISBN: 1748-7188.
Abstract | Links | BibTeX | Tags: WP1: Primary CPG
@article{Bonnet2023-gs,
title = {Constructing founder sets under allelic and non-allelic homologous recombination},
author = {Konstantinn Bonnet and Tobias Marschall and Daniel Doerr},
doi = {10.1186/s13015-023-00241-3},
isbn = {1748-7188},
year = {2023},
date = {2023-09-01},
urldate = {2023-09-01},
journal = {Algorithms Mol. Biol.},
volume = {18},
number = {1},
pages = {15},
abstract = {Homologous recombination between the maternal and paternal copies of a chromosome is a key mechanism for human inheritance and shapes population genetic properties of our species. However, a similar mechanism can also act between different copies of the same sequence, then called non-allelic homologous recombination (NAHR). This process can result in genomic rearrangements—including deletion, duplication, and inversion—and is underlying many genomic disorders. Despite its importance for genome evolution and disease, there is a lack of computational models to study genomic loci prone to NAHR. In this work, we propose such a computational model, providing a unified framework for both (allelic) homologous recombination and NAHR. Our model represents a set of genomes as a graph, where haplotypes correspond to walks through this graph. We formulate two founder set problems under our recombination model, provide flow-based algorithms for their solution, describe exact methods to characterize the number of recombinations, and demonstrate scalability to problem instances arising in practice.},
keywords = {WP1: Primary CPG},
pubstate = {published},
tppubtype = {article}
}
Baláž, Andrej; Petescia, Alessia
Prefix-free graphs and suffix array construction in sublinear space Proceedings Article
In: Brejová, Broňa; Ciencialová, Lucie; Holeňa, Martin; Jajcay, Róbert; Jajcayová, Tatiana; Lexa, Matej; Mráz, František; Pardubská, Dana; Plátek, Martin (Ed.): Proceedings of the 23rd Conference Information Technologies – Applications and Theory (ITAT 2023), pp. 232–241, CEUR Workshop Proceedings, Tatranské Matliare, Slovakia, 2023.
Abstract | Links | BibTeX | Tags: WP1: Primary CPG
@inproceedings{Balaz2023,
title = {Prefix-free graphs and suffix array construction in sublinear space},
author = {Andrej Baláž and Alessia Petescia},
editor = {Broňa Brejová and Lucie Ciencialová and Martin Holeňa and Róbert Jajcay and Tatiana Jajcayová and Matej Lexa and František Mráz and Dana Pardubská and Martin Plátek},
url = {https://ceur-ws.org/Vol-3498/paper27.pdf},
year = {2023},
date = {2023-09-01},
urldate = {2023-09-01},
booktitle = {Proceedings of the 23rd Conference Information Technologies – Applications and Theory (ITAT 2023)},
volume = {3498},
pages = {232–241},
publisher = {CEUR Workshop Proceedings},
address = {Tatranské Matliare, Slovakia},
abstract = {A recent paradigm shift in bioinformatics from a single reference genome to a pangenome brought with it several graph structures. These graph structures must implement operations, such as efficient construction from multiple genomes and read mapping. Read mapping is a well-studied problem in sequential data, and, together with data structures such as suffix array and Burrows-Wheeler transform, allows for efficient computation. Attempts to achieve comparatively high performance on graphs bring many complications since the common data structures on strings are not easily obtainable for graphs. In this work, we introduce prefix-free graphs, a novel pangenomic data structure; we show how to construct them and how to use them to obtain well-known data structures from stringology in sublinear space, allowing for many efficient operations on pangenomes.},
keywords = {WP1: Primary CPG},
pubstate = {published},
tppubtype = {inproceedings}
}
Ayad, Lorraine A K; Chikhi, Rayan; Pissis, Solon P
Seedability: optimizing alignment parameters for sensitive sequence comparison Journal Article
In: Bioinform. Adv., vol. 3, no. 1, 2023, ISSN: 2635-0041.
Abstract | Links | BibTeX | Tags: WP1: Primary CPG, WP2: Evolutionary/Comparative CPG
@article{Ayad2023-li,
title = {Seedability: optimizing alignment parameters for sensitive sequence comparison},
author = {Lorraine A K Ayad and Rayan Chikhi and Solon P Pissis},
url = {https://github.com/lorrainea/Seedability},
doi = {10.1093/bioadv/vbad108},
issn = {2635-0041},
year = {2023},
date = {2023-08-01},
urldate = {2023-08-01},
journal = {Bioinform. Adv.},
volume = {3},
number = {1},
publisher = {Oxford University Press (OUP)},
abstract = {Most sequence alignment techniques make use of exact k-mer hits, called seeds, as anchors to optimize alignment speed. A large number of bioinformatics tools employing seed-based alignment techniques, such as Minimap2, use a single value of k per sequencing technology, without a strong guarantee that this is the best possible value. Given the ubiquity of sequence alignment, identifying values of k that lead to more sensitive alignments is thus an important task. To aid this, we present Seedability, a seed-based alignment framework designed for estimating an optimal seed k-mer length (as well as a minimal number of shared seeds) based on a given alignment identity threshold. In particular, we were motivated to make Minimap2 more sensitive in the pairwise alignment of short sequences.
The experimental results herein show improved alignments of short and divergent sequences when using the parameter values determined by Seedability in comparison to the default values of Minimap2. We also show several cases of pairs of real divergent sequences, where the default parameter values of Minimap2 yield no output alignments, but the values output by Seedability produce plausible alignments.
https://github.com/lorrainea/Seedability (distributed under GPL v3.0).},
keywords = {WP1: Primary CPG, WP2: Evolutionary/Comparative CPG},
pubstate = {published},
tppubtype = {article}
}
The experimental results herein show improved alignments of short and divergent sequences when using the parameter values determined by Seedability in comparison to the default values of Minimap2. We also show several cases of pairs of real divergent sequences, where the default parameter values of Minimap2 yield no output alignments, but the values output by Seedability produce plausible alignments.
https://github.com/lorrainea/Seedability (distributed under GPL v3.0).
Rizzo, Nicola; Cáceres, Manuel; Mäkinen, Veli
Finding Maximal Exact Matches in Graphs Proceedings Article
In: Belazzougui, Djamal; Ouangraoua, A"ida (Ed.): 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023), pp. 10:1–10:17, Schloss Dagstuhl - Leibniz-Zentrum für Informatik, Dagstuhl, Germany, 2023, ISBN: 978-3-95977-294-5.
Abstract | Links | BibTeX | Tags: WP1: Primary CPG
@inproceedings{Rizzo2023-hg,
title = {Finding Maximal Exact Matches in Graphs},
author = {Nicola Rizzo and Manuel Cáceres and Veli Mäkinen},
editor = {Belazzougui, Djamal and Ouangraoua, A"{i}da},
doi = {10.4230/LIPIcs.WABI.2023.10},
isbn = {978-3-95977-294-5},
year = {2023},
date = {2023-08-01},
urldate = {2023-08-01},
booktitle = {23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)},
volume = {273},
pages = {10:1--10:17},
publisher = {Schloss Dagstuhl - Leibniz-Zentrum für Informatik},
address = {Dagstuhl, Germany},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
abstract = {We study the problem of finding maximal exact matches (MEMs) between a query string Q and a labeled graph G. MEMs are an important class of seeds, often used in seed-chain-extend type of practical alignment methods because of their strong connections to classical metrics. A principled way to speed up chaining is to limit the number of MEMs by considering only MEMs of length at least κ (κ-MEMs). However, on arbitrary input graphs, the problem of finding MEMs cannot be solved in truly sub-quadratic time under SETH (Equi et al., ICALP 2019) even on acyclic graphs. In this paper we show an O(n⋅ L ⋅ d^{L-1} + m + M_{κ,L})-time algorithm finding all κ-MEMs between Q and G spanning exactly L nodes in G, where n is the total length of node labels, d is the maximum degree of a node in G, m = |Q|, and M_{κ,L} is the number of output MEMs. We use this algorithm to develop a κ-MEM finding solution on indexable Elastic Founder Graphs (Equi et al., Algorithmica 2022) running in time O(nH² + m + M_κ), where H is the maximum number of nodes in a block, and M_κ is the total number of κ-MEMs. Our results generalize to the analysis of multiple query strings (MEMs between G and any of the strings). Additionally, we provide some preliminary experimental results showing that the number of graph MEMs is an order of magnitude smaller than the number of string MEMs of the corresponding concatenated collection.},
keywords = {WP1: Primary CPG},
pubstate = {published},
tppubtype = {inproceedings}
}