Kenneweg, Philip; Dandinasivara, Raghuram; Luo, Xiao; Hammer, Barbara; Schönhuth, Alexander
Generating Synthetic Genotypes using Diffusion Models Unpublished
2024.
Abstract | Links | BibTeX | Tags: WP3: Translational CPG
@unpublished{Kenneweg2024,
title = {Generating Synthetic Genotypes using Diffusion Models},
author = {Philip Kenneweg and Raghuram Dandinasivara and Xiao Luo and Barbara Hammer and Alexander Schönhuth},
doi = {10.48550/ARXIV.2412.03278},
year = {2024},
date = {2024-12-01},
publisher = {arXiv},
abstract = {In this paper, we introduce the first diffusion model designed to generate complete synthetic human genotypes, which, by standard protocols, one can straightforwardly expand into full-length, DNA-level genomes. The synthetic genotypes mimic real human genotypes without just reproducing known genotypes, in terms of approved metrics. When training biomedically relevant classifiers with synthetic genotypes, accuracy is near-identical to the accuracy achieved when training classifiers with real data. We further demonstrate that augmenting small amounts of real with synthetically generated genotypes drastically improves performance rates. This addresses a significant challenge in translational human genetics: real human genotypes, although emerging in large volumes from genome wide association studies, are sensitive private data, which limits their public availability. Therefore, the integration of additional, insensitive data when striving for rapid sharing of biomedical knowledge of public interest appears imperative.},
keywords = {WP3: Translational CPG},
pubstate = {published},
tppubtype = {unpublished}
}
Cartes, Jorge Avila; Bonizzoni, Paola; Ciccolella, Simone; Vedova, Gianluca Della; Denti, Luca
PangeBlocks: customized construction of pangenome graphs via maximal blocks Journal Article
In: BMC Bioinformatics, vol. 25, no. 1, 2024, ISSN: 1471-2105.
Abstract | Links | BibTeX | Tags: WP2: Evolutionary/Comparative CPG
@article{AvilaCartes2024a,
title = {PangeBlocks: customized construction of pangenome graphs via maximal blocks},
author = {Jorge Avila Cartes and Paola Bonizzoni and Simone Ciccolella and Gianluca Della Vedova and Luca Denti},
url = {https://github.com/AlgoLab/pangeblocks},
doi = {10.1186/s12859-024-05958-5},
issn = {1471-2105},
year = {2024},
date = {2024-11-01},
urldate = {2024-11-01},
journal = {BMC Bioinformatics},
volume = {25},
number = {1},
publisher = {Springer Science and Business Media LLC},
abstract = {\textbf{Background}
The construction of a pangenome graph is a fundamental task in pangenomics. A natural theoretical question is how to formalize the computational problem of building an optimal pangenome graph, making explicit the underlying optimization criterion and the set of feasible solutions. Current approaches build a pangenome graph with some heuristics, without assuming some explicit optimization criteria. Thus it is unclear how a specific optimization criterion affects the graph topology and downstream analysis, like read mapping and variant calling.
\textbf{Results}
In this paper, by leveraging the notion of maximal block in a Multiple Sequence Alignment (MSA), we reframe the pangenome graph construction problem as an exact cover problem on blocks called Minimum Weighted Block Cover (MWBC). Then we propose an Integer Linear Programming (ILP) formulation for the MWBC problem that allows us to study the most natural objective functions for building a graph. We provide an implementation of the ILP approach for solving the MWBC and we evaluate it on SARS-CoV-2 complete genomes, showing how different objective functions lead to pangenome graphs that have different properties, hinting that the specific downstream task can drive the graph construction phase.
\textbf{Conclusion}
We show that a customized construction of a pangenome graph based on selecting objective functions has a direct impact on the resulting graphs. In particular, our formalization of the MWBC problem, based on finding an optimal subset of blocks covering an MSA, paves the way to novel practical approaches to graph representations of an MSA where the user can guide the construction.},
keywords = {WP2: Evolutionary/Comparative CPG},
pubstate = {published},
tppubtype = {article}
}
The construction of a pangenome graph is a fundamental task in pangenomics. A natural theoretical question is how to formalize the computational problem of building an optimal pangenome graph, making explicit the underlying optimization criterion and the set of feasible solutions. Current approaches build a pangenome graph with some heuristics, without assuming some explicit optimization criteria. Thus it is unclear how a specific optimization criterion affects the graph topology and downstream analysis, like read mapping and variant calling.
Results
In this paper, by leveraging the notion of maximal block in a Multiple Sequence Alignment (MSA), we reframe the pangenome graph construction problem as an exact cover problem on blocks called Minimum Weighted Block Cover (MWBC). Then we propose an Integer Linear Programming (ILP) formulation for the MWBC problem that allows us to study the most natural objective functions for building a graph. We provide an implementation of the ILP approach for solving the MWBC and we evaluate it on SARS-CoV-2 complete genomes, showing how different objective functions lead to pangenome graphs that have different properties, hinting that the specific downstream task can drive the graph construction phase.
Conclusion
We show that a customized construction of a pangenome graph based on selecting objective functions has a direct impact on the resulting graphs. In particular, our formalization of the MWBC problem, based on finding an optimal subset of blocks covering an MSA, paves the way to novel practical approaches to graph representations of an MSA where the user can guide the construction.
Balvert, Marleen; Cooper-Knock, Johnathan; Stamp, Julian; Byrne, Ross P.; Mourragui, Soufiane; Gils, Juami; Benonisdottir, Stefania; Schlüter, Johannes; Kenna, Kevin; Abeln, Sanne; Iacoangeli, Alfredo; Daub, Joséphine T.; Browning, Brian L.; Taş, Gizem; Hu, Jiajing; Wang, Yan; Alhathli, Elham; Harvey, Calum; Pianesi, Luna; Schulte, Sara C.; González-Domínguez, Jorge; Garrisson, Erik; Al-Chalabi, Ammar; Cartes, Jorge Avila; Baaijens, Jasmijn; Berg, Joanna; Bolognini, Davide; Bonizzoni, Paola; Guarracino, Andrea; Koyuturk, Mehmet; Markowska, Magda; Dandinasivara, Raghuram; Bemmelen, Jasper; Vorbrugg, Sebastian; Zhang, Sai; Pasanuic, Bogdan; Snyder, Michael P.; Schönhuth, Alexander; Sng, Letitia M. F.; Twine, Natalie A.
Considerations in the search for epistasis Journal Article
In: Genome Biology, vol. 25, no. 1, pp. 296, 2024, ISSN: 1474-760X.
Abstract | Links | BibTeX | Tags: WP3: Translational CPG
@article{Balvert2024,
title = {Considerations in the search for epistasis},
author = {Marleen Balvert and Johnathan Cooper-Knock and Julian Stamp and Ross P. Byrne and Soufiane Mourragui and Juami Gils and Stefania Benonisdottir and Johannes Schlüter and Kevin Kenna and Sanne Abeln and Alfredo Iacoangeli and Joséphine T. Daub and Brian L. Browning and Gizem Taş and Jiajing Hu and Yan Wang and Elham Alhathli and Calum Harvey and Luna Pianesi and Sara C. Schulte and Jorge González-Domínguez and Erik Garrisson and Ammar Al-Chalabi and Jorge Avila Cartes and Jasmijn Baaijens and Joanna Berg and Davide Bolognini and Paola Bonizzoni and Andrea Guarracino and Mehmet Koyuturk and Magda Markowska and Raghuram Dandinasivara and Jasper Bemmelen and Sebastian Vorbrugg and Sai Zhang and Bogdan Pasanuic and Michael P. Snyder and Alexander Schönhuth and Letitia M. F. Sng and Natalie A. Twine},
doi = {10.1186/s13059-024-03427-z},
issn = {1474-760X},
year = {2024},
date = {2024-11-01},
journal = {Genome Biology},
volume = {25},
number = {1},
pages = {296},
publisher = {Springer Science and Business Media LLC},
abstract = {Epistasis refers to changes in the effect on phenotype of a unit of genetic information, such as a single nucleotide polymorphism or a gene, dependent on the context of other genetic units. Such interactions are both biologically plausible and good candidates to explain observations which are not fully explained by an additive heritability model. However, the search for epistasis has so far largely failed to recover this missing heritability. We identify key challenges and propose that future works need to leverage idealized systems, known biology and even previously identified epistatic interactions, in order to guide the search for new interactions.},
keywords = {WP3: Translational CPG},
pubstate = {published},
tppubtype = {article}
}
Zhang, Wenhai; Liu, Yuansheng; Xu, Jialu; Chen, Enlian; Schönhuth, Alexander; Luo, Xiao
PanTax: Strain-level taxonomic classification of metagenomic data using pangenome graphs Unpublished
2024.
Abstract | Links | BibTeX | Tags: WP1: Primary CPG
@unpublished{Zhang2024,
title = {PanTax: Strain-level taxonomic classification of metagenomic data using pangenome graphs},
author = {Wenhai Zhang and Yuansheng Liu and Jialu Xu and Enlian Chen and Alexander Schönhuth and Xiao Luo},
url = {https://github.com/LuoGroup2023/PanTax},
doi = {10.1101/2024.11.15.623887},
year = {2024},
date = {2024-11-01},
urldate = {2024-11-01},
publisher = {Cold Spring Harbor Laboratory},
abstract = {Microbes are omnipresent, thriving in a range of habitats from oceans to soils and even within our gastrointestinal tracts. They play a vital role in maintaining ecological equilibrium and promoting the health of their hosts. Consequently, understanding the strain diversity within microbial communities is crucial, as variations between strains can lead to distinct phenotypic expressions or diverse biological functions. However, current methods for taxonomic classification from metagenomic sequencing data have several limitations, including their reliance solely on species resolution, support for either short or long reads, or their confinement to a given single species. Most notably, the majority of existing taxonomic classifiers rely solely on a single linear representative genome as a reference, which fails to capture the strain diversity, thereby introducing single-reference biases.
Here, we present PanTax, a pangenome graph-based taxonomic classification method that overcomes the shortcomings of single-reference genome-based approaches, because pangenome graphs possess the capability to depict the genetic variability present across multiple evolutionarily or environmentally related genomes. PanTax provides a comprehensive solution to taxonomic classification for strain resolution, compatibility with both short and long reads, and compatibility with single or multiple species. Extensive benchmarking results demonstrate that PanTax drastically outperforms state-of-the-art approaches, primarily evidenced by its significantly higher precision or recall (at both species and strain levels), while maintaining comparable or better performance in other aspects across various datasets. PanTax is a user-friendly open-source tool that is publicly accessible at https://github.com/LuoGroup2023/PanTax.},
keywords = {WP1: Primary CPG},
pubstate = {published},
tppubtype = {unpublished}
}
Here, we present PanTax, a pangenome graph-based taxonomic classification method that overcomes the shortcomings of single-reference genome-based approaches, because pangenome graphs possess the capability to depict the genetic variability present across multiple evolutionarily or environmentally related genomes. PanTax provides a comprehensive solution to taxonomic classification for strain resolution, compatibility with both short and long reads, and compatibility with single or multiple species. Extensive benchmarking results demonstrate that PanTax drastically outperforms state-of-the-art approaches, primarily evidenced by its significantly higher precision or recall (at both species and strain levels), while maintaining comparable or better performance in other aspects across various datasets. PanTax is a user-friendly open-source tool that is publicly accessible at https://github.com/LuoGroup2023/PanTax.
Parmigiani, Luca; Garrison, Erik; Stoye, Jens; Marschall, Tobias; Doerr, Daniel
Panacus: fast and exact pangenome growth and core size estimation Journal Article
In: Bioinformatics, 2024, ISSN: 1367-4811.
Abstract | Links | BibTeX | Tags: WP2: Evolutionary/Comparative CPG
@article{Parmigiani2024b,
title = {Panacus: fast and exact pangenome growth and core size estimation},
author = {Luca Parmigiani and Erik Garrison and Jens Stoye and Tobias Marschall and Daniel Doerr},
editor = {Janet Kelso},
url = {https://github.com/marschall-lab/panacus},
doi = {10.1093/bioinformatics/btae720},
issn = {1367-4811},
year = {2024},
date = {2024-11-01},
urldate = {2024-11-01},
journal = {Bioinformatics},
publisher = {Oxford University Press (OUP)},
abstract = {Motivation: Using a single linear reference genome poses a limitation to exploring the full genomic diversity of a species. The release of a draft human pangenome underscores the increasing relevance of pangenomics to overcome these limitations. Pangenomes are commonly represented as graphs, which can represent billions of base pairs of sequence. Presently, there is a lack of scalable software able to perform key tasks on pangenomes, such as quantifying universally shared sequence across genomes (the core genome) and measuring the extent of genomic variability as a function of sample size (pangenome growth).
Results: We introduce Panacus (pangenome-abacus), a tool designed to rapidly perform these tasks and visualize the results in interactive plots. Panacus can process GFA files, the accepted standard for pangenome graphs, and is able to analyze a human pangenome graph with 110 million nodes in less than one hour.
Availability and implementation: Panacus is implemented in Rust and is published as Open Source software under the MIT license. The source code and documentation are available at https://github.com/marschall-lab/panacus. Panacus can be installed via Bioconda at https://bioconda.github.io/recipes/panacus/README.html.},
keywords = {WP2: Evolutionary/Comparative CPG},
pubstate = {published},
tppubtype = {article}
}
Results: We introduce Panacus (pangenome-abacus), a tool designed to rapidly perform these tasks and visualize the results in interactive plots. Panacus can process GFA files, the accepted standard for pangenome graphs, and is able to analyze a human pangenome graph with 110 million nodes in less than one hour.
Availability and implementation: Panacus is implemented in Rust and is published as Open Source software under the MIT license. The source code and documentation are available at https://github.com/marschall-lab/panacus. Panacus can be installed via Bioconda at https://bioconda.github.io/recipes/panacus/README.html.
Frolova, Daria; Lima, Leandro; Roberts, Leah Wendy; Bohnenkämper, Leonard; Wittler, Roland; Stoye, Jens; Iqbal, Zamin
Applying rearrangement distances to enable plasmid epidemiology with pling Journal Article
In: Microbial Genomics, vol. 10, no. 10, 2024, ISSN: 2057-5858.
Abstract | Links | BibTeX | Tags: WP3: Translational CPG
@article{Frolova2024,
title = {Applying rearrangement distances to enable plasmid epidemiology with pling},
author = {Daria Frolova and Leandro Lima and Leah Wendy Roberts and Leonard Bohnenkämper and Roland Wittler and Jens Stoye and Zamin Iqbal},
url = {https://github.com/iqbal-lab-org/pling},
doi = {10.1099/mgen.0.001300},
issn = {2057-5858},
year = {2024},
date = {2024-10-01},
urldate = {2024-10-01},
journal = {Microbial Genomics},
volume = {10},
number = {10},
publisher = {Microbiology Society},
abstract = {Plasmids are a key vector of antibiotic resistance, but the current bioinformatics toolkit is not well suited to tracking them. The rapid structural changes seen in plasmid genomes present considerable challenges to evolutionary and epidemiological analysis. Typical approaches are either low resolution (replicon typing) or use shared k-mer content to define a genetic distance. However, this distance can both overestimate plasmid relatedness by ignoring rearrangements, and underestimate by over-penalizing gene gain/loss. Therefore a model is needed which captures the key components of how plasmid genomes evolve structurally – through gene/block gain or loss, and rearrangement. A secondary requirement is to prevent promiscuous transposable elements (TEs) leading to over-clustering of unrelated plasmids. We choose the ‘Double Cut and Join Indel’ (DCJ-Indel) model, in which plasmids are studied at a coarse level, as a sequence of signed integers (representing genes or aligned blocks), and the distance between two plasmids is the minimum number of rearrangement events or indels needed to transform one into the other. We show how this gives much more meaningful distances between plasmids. We introduce a software workflow pling (https://github.com/iqbal-lab-org/pling), which uses the DCJ-Indel model, to calculate distances between plasmids and then cluster them. In our approach, we combine containment distances and DCJ-Indel distances to build a TE-aware plasmid network. We demonstrate superior performance and interpretability to other plasmid clustering tools on the ‘Russian Doll’ dataset and a hospital transmission dataset.},
keywords = {WP3: Translational CPG},
pubstate = {published},
tppubtype = {article}
}
Kang, Xiongbin; Zhang, Wenhai; Li, Yichen; Luo, Xiao; Schönhuth, Alexander
HyLight: Strain aware assembly of low coverage metagenomes Journal Article
In: Nature Communications, vol. 15, no. 1, pp. 8665, 2024, ISSN: 2041-1723.
Abstract | Links | BibTeX | Tags: WP1: Primary CPG
@article{Kang2024,
title = {HyLight: Strain aware assembly of low coverage metagenomes},
author = {Xiongbin Kang and Wenhai Zhang and Yichen Li and Xiao Luo and Alexander Schönhuth},
doi = {10.1038/s41467-024-52907-0},
issn = {2041-1723},
year = {2024},
date = {2024-10-01},
journal = {Nature Communications},
volume = {15},
number = {1},
pages = {8665},
publisher = {Springer Science and Business Media LLC},
abstract = {Different strains of identical species can vary substantially in terms of their spectrum of biomedically relevant phenotypes. Reconstructing the genomes of microbial communities at the level of their strains poses significant challenges, because sequencing errors can obscure strain-specific variants. Next-generation sequencing (NGS) reads are too short to resolve complex genomic regions. Third-generation sequencing (TGS) reads, although longer, are prone to higher error rates or substantially more expensive. Limiting TGS coverage to reduce costs compromises the accuracy of the assemblies. This explains why prior approaches agree on losses in strain awareness, accuracy, tendentially excessive costs, or combinations thereof. We introduce HyLight, a metagenome assembly approach that addresses these challenges by implementing the complementary strengths of TGS and NGS data. HyLight employs strain-resolved overlap graphs (OG) to accurately reconstruct individual strains within microbial communities. Our experiments demonstrate that HyLight produces strain-aware and contiguous assemblies at minimal error content, while significantly reducing costs because utilizing low-coverage TGS data. HyLight achieves an average improvement of 19.05% in preserving strain identity and demonstrates near-complete strain awareness across diverse datasets. In summary, HyLight offers considerable advances in metagenome assembly, insofar as it delivers significantly enhanced strain awareness, contiguity, and accuracy without the typical compromises observed in existing approaches.},
keywords = {WP1: Primary CPG},
pubstate = {published},
tppubtype = {article}
}
Rivals, Eric
Incremental computation of the set of period sets Unpublished
2024.
Abstract | Links | BibTeX | Tags: WP1: Primary CPG
@unpublished{Rivals2024,
title = {Incremental computation of the set of period sets},
author = {Eric Rivals},
doi = {10.48550/ARXIV.2410.12077},
year = {2024},
date = {2024-10-01},
urldate = {2024-10-01},
publisher = {arXiv},
abstract = {Overlaps between words are crucial in many areas of computer science, such as code design, stringology, and bioinformatics. A self overlapping word is characterized by its periods and borders. A period of a word $u$ is the starting position of a suffix of $u$ that is also a prefix $u$, and such a suffix is called a border. Each word of length, say $n>0$, has a set of periods, but not all combinations of integers are sets of periods. Computing the period set of a word $u$ takes linear time in the length of $u$. We address the question of computing, the set, denoted $Gamma_n$, of all period sets of words of length $n$. Although period sets have been characterized, there is no formula to compute the cardinality of $Gamma_n$ (which is exponential in $n$), and the known dynamic programming algorithm to enumerate $Gamma_n$ suffers from its space complexity. We present an incremental approach to compute $Gamma_n$ from $Gamma_n-1$, which reduces the space complexity, and then a constructive certification algorithm useful for verification purposes. The incremental approach defines a parental relation between sets in $Gamma_n-1$ and $Gamma_n$, enabling one to investigate the dynamics of period sets, and their intriguing statistical properties. Moreover, the period set of a word $u$ is the key for computing the absence probability of $u$ in random texts. Thus, knowing $Gamma_n$ is useful to assess the significance of word statistics, such as the number of missing words in a random text.},
keywords = {WP1: Primary CPG},
pubstate = {published},
tppubtype = {unpublished}
}
Gabory, Esteban; Mwaniki, Moses Njagi; Pisanti, Nadia; Pissis, Solon P.; Radoszewski, Jakub; Sweering, Michelle; Zuba, Wiktor
Pangenome comparison via ED strings Journal Article
In: Frontiers in Bioinformatics, vol. 4, 2024, ISSN: 2673-7647.
Abstract | Links | BibTeX | Tags: WP2: Evolutionary/Comparative CPG
@article{Gabory2024,
title = {Pangenome comparison via ED strings},
author = {Esteban Gabory and Moses Njagi Mwaniki and Nadia Pisanti and Solon P. Pissis and Jakub Radoszewski and Michelle Sweering and Wiktor Zuba},
doi = {10.3389/fbinf.2024.1397036},
issn = {2673-7647},
year = {2024},
date = {2024-09-01},
journal = {Frontiers in Bioinformatics},
volume = {4},
publisher = {Frontiers Media SA},
abstract = {Introduction: An elastic-degenerate (ED) string is a sequence of sets of strings. It can also be seen as a directed acyclic graph whose edges are labeled by strings. The notion of ED strings was introduced as a simple alternative to variation and sequence graphs for representing a pangenome, that is, a collection of genomic sequences to be analyzed jointly or to be used as a reference.
Methods: In this study, we define notions of matching statistics of two ED strings as similarity measures between pangenomes and, consequently infer a corresponding distance measure. We then show that both measures can be computed efficiently, in both theory and practice, by employing the intersection graph of two ED strings.
Results: We also implemented our methods as a software tool for pangenome comparison and evaluated their efficiency and effectiveness using both synthetic and real datasets.
Discussion: As for efficiency, we compare the runtime of the intersection graph method against the classic product automaton construction showing that the intersection graph is faster by up to one order of magnitude. For showing effectiveness, we used real SARS-CoV-2 datasets and our matching statistics similarity measure to reproduce a well-established clade classification of SARS-CoV-2, thus demonstrating that the classification obtained by our method is in accordance with the existing one.},
keywords = {WP2: Evolutionary/Comparative CPG},
pubstate = {published},
tppubtype = {article}
}
Methods: In this study, we define notions of matching statistics of two ED strings as similarity measures between pangenomes and, consequently infer a corresponding distance measure. We then show that both measures can be computed efficiently, in both theory and practice, by employing the intersection graph of two ED strings.
Results: We also implemented our methods as a software tool for pangenome comparison and evaluated their efficiency and effectiveness using both synthetic and real datasets.
Discussion: As for efficiency, we compare the runtime of the intersection graph method against the classic product automaton construction showing that the intersection graph is faster by up to one order of magnitude. For showing effectiveness, we used real SARS-CoV-2 datasets and our matching statistics similarity measure to reproduce a well-established clade classification of SARS-CoV-2, thus demonstrating that the classification obtained by our method is in accordance with the existing one.
Bernardini, Giulia; Gabory, Esteban; Pissis, Solon P.; Stougie, Leen; Sweering, Michelle; Zuba, Wiktor
Elastic-Degenerate String Matching with 1 Error or Mismatch Journal Article
In: Theory of Computing Systems, vol. 68, no. 5, pp. 1442–1467, 2024, ISSN: 1433-0490.
Abstract | Links | BibTeX | Tags: WP1: Primary CPG
@article{Bernardini2024b,
title = {Elastic-Degenerate String Matching with 1 Error or Mismatch},
author = {Giulia Bernardini and Esteban Gabory and Solon P. Pissis and Leen Stougie and Michelle Sweering and Wiktor Zuba},
doi = {10.1007/s00224-024-10194-8},
issn = {1433-0490},
year = {2024},
date = {2024-09-01},
urldate = {2024-09-01},
journal = {Theory of Computing Systems},
volume = {68},
number = {5},
pages = {1442–1467},
publisher = {Springer Science and Business Media LLC},
abstract = {An elastic-degenerate (ED) string is a sequence of n finite sets of strings of total length N, introduced to represent a set of related DNA sequences, also known as a pangenome. The ED string matching (EDSM) problem consists in reporting all occurrences of a pattern of length m in an ED text. The EDSM problem has recently received some attention by the combinatorial pattern matching community, culminating in an $tilde{mathcal{O}}(nw^{w-1})+mathcal{O}(N)$-time algorithm [Bernardini et al., SIAM J. Comput. 2022], where $omega$ denotes the matrix multiplication exponent and the $tilde{mathcal{O}}(cdot)$ notation suppresses polylog factors. In the k-EDSM problem, the approximate version of EDSM, we are asked to report all pattern occurrences with at most k errors. k-EDSM can be solved in $mathcal{O}(k^2mG+kN)$ time, under edit distance, or $mathcal{O}(kmG+kN)$ time, under Hamming distance, where G denotes the total number of strings in the ED text [Bernardini et al., Theor. Comput. Sci. 2020]. Unfortunately, G is only bounded by N, and so even for k = 1, the existing algorithms run in $Omega(mN)$ time in the worst case. In this paper we make progress in this direction. We show that 1-EDSM can be solved in $mathcal{O}((nm^2+N)log m)$ or $mathcal{O}(nm^3+N)$ time under edit distance. For the decision version of the problem, we present a faster $mathcal{O}(nm^2sqrt{log m} + Nlog log m)$-time algorithm. We also show that 1-EDSM can be solved in $mathcal{O}(nm^2+N log m)$ time under Hamming distance. Our algorithms for edit distance rely on non-trivial reductions from 1-EDSM to special instances of classic computational geometry problems (2d rectangle stabbing or 2d range emptiness), which we show how to solve efficiently. In order to obtain an even faster algorithm for Hamming distance, we rely on employing and adapting the k-errata trees for indexing with errors [Cole et al., STOC 2004]. This is an extended version of a paper presented at LATIN 2022.},
keywords = {WP1: Primary CPG},
pubstate = {published},
tppubtype = {article}
}
Brejová, Broňa; Gagie, Travis; Herencsárová, Eva; Vinař, Tomáš
Maximum-scoring path sets on pangenome graphs of constant treewidth Journal Article
In: Frontiers in Bioinformatics, vol. 4, 2024, ISSN: 2673-7647.
Abstract | Links | BibTeX | Tags: WP2: Evolutionary/Comparative CPG
@article{Brejova2024,
title = {Maximum-scoring path sets on pangenome graphs of constant treewidth},
author = {Broňa Brejová and Travis Gagie and Eva Herencsárová and Tomáš Vinař},
doi = {10.3389/fbinf.2024.1391086},
issn = {2673-7647},
year = {2024},
date = {2024-07-01},
journal = {Frontiers in Bioinformatics},
volume = {4},
publisher = {Frontiers Media SA},
abstract = {We generalize a problem of finding maximum-scoring segment sets, previously studied by Csűrös (IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2004, 1, 139–150), from sequences to graphs. Namely, given a vertex-weighted graph G and a non-negative startup penalty c, we can find a set of vertex-disjoint paths in G with maximum total score when each path’s score is its vertices’ total weight minus c. We call this new problem maximum-scoring path sets (MSPS). We present an algorithm that has a linear-time complexity for graphs with a constant treewidth. Generalization from sequences to graphs allows the algorithm to be used on pangenome graphs representing several related genomes and can be seen as a common abstraction for several biological problems on pangenomes, including searching for CpG islands, ChIP-seq data analysis, analysis of region enrichment for functional elements, or simple chaining problems.},
keywords = {WP2: Evolutionary/Comparative CPG},
pubstate = {published},
tppubtype = {article}
}
Sierra, Pío; Durbin, Richard
Identification of transposable element families from pangenome polymorphisms Journal Article
In: Mobile DNA, vol. 15, no. 1, pp. 13, 2024, ISSN: 1759-8753.
Abstract | Links | BibTeX | Tags: WP2: Evolutionary/Comparative CPG
@article{Sierra2024,
title = {Identification of transposable element families from pangenome polymorphisms},
author = {Pío Sierra and Richard Durbin},
url = {https://github.com/piosierra/pantera},
doi = {10.1186/s13100-024-00323-y},
issn = {1759-8753},
year = {2024},
date = {2024-06-01},
urldate = {2024-06-01},
journal = {Mobile DNA},
volume = {15},
number = {1},
pages = {13},
publisher = {Springer Science and Business Media LLC},
abstract = {Background: Transposable Elements (TEs) are segments of DNA, typically a few hundred base pairs up to several tens of thousands bases long, that have the ability to generate new copies of themselves in the genome. Most existing methods used to identify TEs in a newly sequenced genome are based on their repetitive character, together with detection based on homology and structural features. As new high quality assemblies become more common, including the availability of multiple independent assemblies from the same species, an alternative strategy for identification of TE families becomes possible in which we focus on the polymorphism at insertion sites caused by TE mobility.
Results: We develop the idea of using the structural polymorphisms found in pangenomes to create a library of the TE families recently active in a species, or in a closely related group of species. We present a tool, pantera, that achieves this task, and illustrate its use both on species with well-curated libraries, and on new assemblies.
Conclusions: Our results show that pantera is sensitive and accurate, tending to correctly identify complete elements with precise boundaries, and is particularly well suited to detect larger, low copy number TEs that are often undetected with existing de novo methods.},
keywords = {WP2: Evolutionary/Comparative CPG},
pubstate = {published},
tppubtype = {article}
}
Results: We develop the idea of using the structural polymorphisms found in pangenomes to create a library of the TE families recently active in a species, or in a closely related group of species. We present a tool, pantera, that achieves this task, and illustrate its use both on species with well-curated libraries, and on new assemblies.
Conclusions: Our results show that pantera is sensitive and accurate, tending to correctly identify complete elements with precise boundaries, and is particularly well suited to detect larger, low copy number TEs that are often undetected with existing de novo methods.
Gabory, Esteban; Liu, Chang; Loukides, Grigorios; Pissis, Solon P.; Zuba, Wiktor
Space-Efficient Indexes for Uncertain Strings Proceedings Article
In: 2024 IEEE 40th International Conference on Data Engineering (ICDE), pp. 4828–4842, IEEE, 2024.
Abstract | Links | BibTeX | Tags: WP1: Primary CPG
@inproceedings{Gabory2024a,
title = {Space-Efficient Indexes for Uncertain Strings},
author = {Esteban Gabory and Chang Liu and Grigorios Loukides and Solon P. Pissis and Wiktor Zuba},
url = {https://arxiv.org/abs/2403.14256},
doi = {10.1109/icde60146.2024.00367},
year = {2024},
date = {2024-05-01},
urldate = {2024-05-01},
booktitle = {2024 IEEE 40th International Conference on Data Engineering (ICDE)},
pages = {4828–4842},
publisher = {IEEE},
abstract = {Strings in the real world are often encoded with some level of uncertainty, for example, due to: unreliable data measurements; flexible sequence modeling; or noise introduced for privacy protection. In the character-level uncertainty model, an uncertain string X of length $n$ on an alphabet Σ is a sequence of $n$ probability distributions over Σ. Given an uncertain string $X$ and a weight threshold $frac{1}{z}in(0,1]$, we say that pattern $P$ occurs in $X$ at position $i$, if the product of probabilities of the letters of $P$ at positions $i,…, i+ vert Pvert-1$ is at least $frac{1}{z}$. While indexing standard strings for online pattern searches can be performed in linear time and space, indexing uncertain strings is much more challenging. Specifically, the state-of-the-art index for uncertain strings has $O(nz)$ size, requires $O(nz)$ time and $O(nz)$ space to be constructed, and answers pattern matching queries in the optimal $O(m+ vert Occ vert)$ time, where $m$ is the length of $P$ and $vert Occvert$ is the total number of occurrences of $P$ in $X$ . For large $n$ and (moderate) $z$ values, this index is completely impractical to construct, which outweighs the benefit of the supported optimal pattern matching queries. We were thus motivated to design a space-efficient index at the expense of slower yet competitive pattern matching queries. We show that when we have at hand a lower bound ℓ on the length of the supported pattern queries, as is often the case in real-world applications, we can slash the index size and the construction space roughly by ℓ. In particular, we propose an index of $mathcal{O}(frac{nz}{l}log z)$ expected size, which can be constructed using $mathcal{O}(frac{nz}{l}log z)$ expected space, and supports very fast pattern matching queries in expectation, for patterns of length m ≥ ℓ. We have implemented and evaluated several versions of our index. The best-performing version of our index is up to two orders of magnitude smaller than the state of the art in terms of both index size and construction space, while offering faster or very competitive query and construction times.},
keywords = {WP1: Primary CPG},
pubstate = {published},
tppubtype = {inproceedings}
}
Taş, Gizem; Westerdijk, Timo; Postma, Eric; Rheenen, Wouter; Bakker, Mark K.; Eijk, Kristel R.; Kooyman, Maarten; Khleifat, Ahmad Al; Iacoangeli, Alfredo; Ticozzi, Nicola; Cooper-Knock, Johnathan; Gromicho, Marta; Chandran, Siddharthan; Morrison, Karen E.; Shaw, Pamela J.; Hardy, John; Sendtner, Michael; Meyer, Thomas; Başak, Nazli; Fogh, Isabella; Chiò, Adriano; Calvo, Andrea; Pupillo, Elisabetta; Logroscino, Giancarlo; Gotkine, Marc; Vourc’h, Patrick; Corcia, Philippe; Couratier, Philippe; Millecamps, Stèphanie; Salachas, François; Pardina, Jesus S. Mora; Rojas-García, Ricardo; Dion, Patrick; Ross, Jay P.; Ludolph, Albert C.; Weishaupt, Jochen H.; Freischmidt, Axel; Bensimon, Gilbert; Tittmann, Lukas; Lieb, Wolfgang; Franke, Andre; Ripke, Stephan; Whiteman, David C.; Olsen, Catherine M.; Uitterlinden, Andre G.; Hofman, Albert; Amouyel, Philippe; Traynor, Bryan; Singleton, Adrew B.; Neto, Miguel Mitne; Cauchi, Ruben J.; Ophoff, Roel A.; Deerlin, Vivianna M.; Grosskreutz, Julian; Graff, Caroline; Brylev, Lev; Rogelj, Boris; Koritnik, Blaž; Zidar, Janez; Stević, Zorica; Drory, Vivian; Povedano, Monica; Blair, Ian P.; Kiernan, Matthew C.; Nicholson, Garth A.; Henders, Anjali K.; Carvalho, Mamede; Pinto, Susana; Petri, Susanne; Weber, Markus; Rouleau, Guy A.; Silani, Vincenzo; Glass, Jonathan; Brown, Robert H.; Landers, John E.; Shaw, Christopher E.; Andersen, Peter M.; Garton, Fleur C.; McRae, Allan F.; McLaughlin, Russell L.; Hardiman, Orla; Kenna, Kevin P.; Wray, Naomi R.; Al-Chalabi, Ammar; Damme, Philip Van; Berg, Leonard H.; Veldink, Jan H.; Veldink, Jan H.; Schönhuth, Alexander; Balvert, Marleen
Computing linkage disequilibrium aware genome embeddings using autoencoders Journal Article
In: Bioinformatics, vol. 40, no. 6, 2024, ISSN: 1367-4811.
Abstract | Links | BibTeX | Tags: WP3: Translational CPG
@article{Tas2024,
title = {Computing linkage disequilibrium aware genome embeddings using autoencoders},
author = {Gizem Taş and Timo Westerdijk and Eric Postma and Wouter Rheenen and Mark K. Bakker and Kristel R. Eijk and Maarten Kooyman and Ahmad Al Khleifat and Alfredo Iacoangeli and Nicola Ticozzi and Johnathan Cooper-Knock and Marta Gromicho and Siddharthan Chandran and Karen E. Morrison and Pamela J. Shaw and John Hardy and Michael Sendtner and Thomas Meyer and Nazli Başak and Isabella Fogh and Adriano Chiò and Andrea Calvo and Elisabetta Pupillo and Giancarlo Logroscino and Marc Gotkine and Patrick Vourc’h and Philippe Corcia and Philippe Couratier and Stèphanie Millecamps and François Salachas and Jesus S. Mora Pardina and Ricardo Rojas-García and Patrick Dion and Jay P. Ross and Albert C. Ludolph and Jochen H. Weishaupt and Axel Freischmidt and Gilbert Bensimon and Lukas Tittmann and Wolfgang Lieb and Andre Franke and Stephan Ripke and David C. Whiteman and Catherine M. Olsen and Andre G. Uitterlinden and Albert Hofman and Philippe Amouyel and Bryan Traynor and Adrew B. Singleton and Miguel Mitne Neto and Ruben J. Cauchi and Roel A. Ophoff and Vivianna M. Deerlin and Julian Grosskreutz and Caroline Graff and Lev Brylev and Boris Rogelj and Blaž Koritnik and Janez Zidar and Zorica Stević and Vivian Drory and Monica Povedano and Ian P. Blair and Matthew C. Kiernan and Garth A. Nicholson and Anjali K. Henders and Mamede Carvalho and Susana Pinto and Susanne Petri and Markus Weber and Guy A. Rouleau and Vincenzo Silani and Jonathan Glass and Robert H. Brown and John E. Landers and Christopher E. Shaw and Peter M. Andersen and Fleur C. Garton and Allan F. McRae and Russell L. McLaughlin and Orla Hardiman and Kevin P. Kenna and Naomi R. Wray and Ammar Al-Chalabi and Philip Van Damme and Leonard H. Berg and Jan H. Veldink and Jan H. Veldink and Alexander Schönhuth and Marleen Balvert},
editor = {Peter Robinson},
url = {https://github.com/gizem-tas/haploblock-autoencoders},
doi = {10.1093/bioinformatics/btae326},
issn = {1367-4811},
year = {2024},
date = {2024-05-01},
urldate = {2024-05-01},
journal = {Bioinformatics},
volume = {40},
number = {6},
publisher = {Oxford University Press (OUP)},
abstract = {Motivation: The completion of the genome has paved the way for genome-wide association studies (GWAS), which explained certain proportions of heritability. GWAS are not optimally suited to detect non-linear effects in disease risk, possibly hidden in non-additive interactions (epistasis). Alternative methods for epistasis detection using, e.g. deep neural networks (DNNs) are currently under active development. However, DNNs are constrained by finite computational resources, which can be rapidly depleted due to increasing complexity with the sheer size of the genome. Besides, the curse of dimensionality complicates the task of capturing meaningful genetic patterns for DNNs; therefore necessitates dimensionality reduction.
Results: We propose a method to compress single nucleotide polymorphism (SNP) data, while leveraging the linkage disequilibrium (LD) structure and preserving potential epistasis. This method involves clustering correlated SNPs into haplotype blocks and training per-block autoencoders to learn a compressed representation of the block’s genetic content. We provide an adjustable autoencoder design to accommodate diverse blocks and bypass extensive hyperparameter tuning. We applied this method to genotyping data from Project MinE, and achieved 99% average test reconstruction accuracy—i.e. minimal information loss—while compressing the input to nearly 10% of the original size. We demonstrate that haplotype-block based autoencoders outperform linear Principal Component Analysis (PCA) by approximately 3% chromosome-wide accuracy of reconstructed variants. To the extent of our knowledge, our approach is the first to simultaneously leverage haplotype structure and DNNs for dimensionality reduction of genetic data.
Availability and implementation: Data are available for academic use through Project MinE at https://www.projectmine.com/research/data-sharing/, contingent upon terms and requirements specified by the source studies. Code is available at https://github.com/gizem-tas/haploblock-autoencoders.},
keywords = {WP3: Translational CPG},
pubstate = {published},
tppubtype = {article}
}
Results: We propose a method to compress single nucleotide polymorphism (SNP) data, while leveraging the linkage disequilibrium (LD) structure and preserving potential epistasis. This method involves clustering correlated SNPs into haplotype blocks and training per-block autoencoders to learn a compressed representation of the block’s genetic content. We provide an adjustable autoencoder design to accommodate diverse blocks and bypass extensive hyperparameter tuning. We applied this method to genotyping data from Project MinE, and achieved 99% average test reconstruction accuracy—i.e. minimal information loss—while compressing the input to nearly 10% of the original size. We demonstrate that haplotype-block based autoencoders outperform linear Principal Component Analysis (PCA) by approximately 3% chromosome-wide accuracy of reconstructed variants. To the extent of our knowledge, our approach is the first to simultaneously leverage haplotype structure and DNNs for dimensionality reduction of genetic data.
Availability and implementation: Data are available for academic use through Project MinE at https://www.projectmine.com/research/data-sharing/, contingent upon terms and requirements specified by the source studies. Code is available at https://github.com/gizem-tas/haploblock-autoencoders.
Rivals, Eric; Wang, Pengfei
Counting overlapping pairs of strings Unpublished
2024.
Abstract | Links | BibTeX | Tags: WP1: Primary CPG
@unpublished{Rivals2024a,
title = {Counting overlapping pairs of strings},
author = {Eric Rivals and Pengfei Wang},
doi = {10.48550/ARXIV.2405.09393},
year = {2024},
date = {2024-05-01},
publisher = {arXiv},
abstract = {A correlation is a binary vector that encodes all possible positions of overlaps of two words, where an overlap for an ordered pair of words (u,v) occurs if a suffix of word u matches a prefix of word v. As multiple pairs can have the same correlation, it is relevant to count how many pairs of words share the same correlation depending on the alphabet size and word length n. We exhibit recurrences to compute the number of such pairs – which is termed population size – for any correlation; for this, we exploit a relationship between overlaps of two words and self-overlap of one word. This theorem allows us to compute the number of pairs with a longest overlap of a given length and to show that the expected length of the longest border of two words asymptotically diverges, which solves two open questions raised by Gabric in 2022. Finally, we also provide bounds for the asymptotic of the population ratio of any correlation. Given the importance of word overlaps in areas like word combinatorics, bioinformatics, and digital communication, our results may ease analyses of algorithms for string processing, code design, or genome assembly.},
keywords = {WP1: Primary CPG},
pubstate = {published},
tppubtype = {unpublished}
}
Hannoush, Khodor; Marchet, Camille; Peterlongo, Pierre
Cdbgtricks: Strategies to update a compacted de Bruijn graph Unpublished
2024.
Abstract | Links | BibTeX | Tags: WP1: Primary CPG
@unpublished{Hannoush2024,
title = {Cdbgtricks: Strategies to update a compacted de Bruijn graph},
author = {Khodor Hannoush and Camille Marchet and Pierre Peterlongo},
url = {https://github.com/khodor14/Cdbgtricks},
doi = {10.1101/2024.05.24.595676},
year = {2024},
date = {2024-05-01},
urldate = {2024-05-01},
publisher = {Cold Spring Harbor Laboratory},
abstract = {We propose Cdbgtricks, a new method for updating a compacted de Bruijn graph when adding novel sequences, such as full genomes. Our method indexes the graph, enabling to identify in constant time the location (unitig and offset) of any k-mer. The update operation that we propose also updates the index. Our results show that Cdbgtricks is faster than Bifrost and GGCAT. We benefit from the index of the graph to provide new functionalities, such as reporting the subgraph that shares a desired percentage of k-mers with a query sequence with the ability to query a set of reads. The open-source Cdbgtricks software is available at https://github.com/khodor14/Cdbgtricks.},
keywords = {WP1: Primary CPG},
pubstate = {published},
tppubtype = {unpublished}
}
Parmigiani, Luca; Wittler, Roland; Stoye, Jens
Revisiting pangenome openness with k-mers Journal Article
In: Peer Community Journal, vol. 4, 2024, ISSN: 2804-3871.
Abstract | Links | BibTeX | Tags: WP2: Evolutionary/Comparative CPG
@article{Parmigiani2024,
title = {Revisiting pangenome openness with k-mers},
author = {Luca Parmigiani and Roland Wittler and Jens Stoye},
doi = {10.24072/pcjournal.415},
issn = {2804-3871},
year = {2024},
date = {2024-04-01},
journal = {Peer Community Journal},
volume = {4},
publisher = {Peer Community In},
abstract = {Pangenomics is the study of related genomes collectively, usually from the same species or closely related taxa. Originally, pangenomes were defined for bacterial species. After the concept was extended to eukaryotic genomes, two definitions of pangenome evolved in parallel: the gene-based approach, which defines the pangenome as the union of all genes, and the sequence-based approach, which defines the pangenome as the set of all nonredundant genomic sequences. Estimating the total size of the pangenome for a given species has been subject of study since the very first mention of pangenomes. Traditionally, this is performed by predicting the ratio at which new genes are discovered, referred to as the openness of the species. Here, we abstract each genome as a set of items, which is entirely agnostic of the two approaches (gene-based, sequence-based). Genes are a viable option for items, but also other possibilities are feasible, e.g., genome sequence substrings of fixed length k (k-mers). In the present study, we investigate the use of k-mers to estimate the openness as an alternative to genes, and compare the results. An efficient implementation is also provided.},
keywords = {WP2: Evolutionary/Comparative CPG},
pubstate = {published},
tppubtype = {article}
}
Almeida, Miguel Vasconcelos; Blumer, Moritz; Yuan, Chengwei Ulrika; Sierra, Pío; Price, Jonathan L.; Quah, Fu Xiang; Friman, Aleksandr; Dallaire, Alexandra; Vernaz, Grégoire; Putman, Audrey L. K.; Smith, Alan M.; Joyce, Domino A.; Butter, Falk; Haase, Astrid D.; Durbin, Richard; Santos, M. Emília; Miska, Eric A.
Dynamic co-evolution of transposable elements and the piRNA pathway in African cichlid fishes Unpublished
2024.
Abstract | Links | BibTeX | Tags: WP2: Evolutionary/Comparative CPG
@unpublished{Almeida2024,
title = {Dynamic co-evolution of transposable elements and the piRNA pathway in African cichlid fishes},
author = {Miguel Vasconcelos Almeida and Moritz Blumer and Chengwei Ulrika Yuan and Pío Sierra and Jonathan L. Price and Fu Xiang Quah and Aleksandr Friman and Alexandra Dallaire and Grégoire Vernaz and Audrey L. K. Putman and Alan M. Smith and Domino A. Joyce and Falk Butter and Astrid D. Haase and Richard Durbin and M. Emília Santos and Eric A. Miska},
doi = {10.1101/2024.04.01.587621},
year = {2024},
date = {2024-04-01},
publisher = {Cold Spring Harbor Laboratory},
abstract = {East African cichlid fishes have diversified in an explosive fashion, but the (epi)genetic basis of the phenotypic diversity of these fishes remains largely unknown. Although transposable elements (TEs) have been associated with phenotypic variation in cichlids, little is known about their transcriptional activity and epigenetic silencing. Here, we describe dynamic patterns of TE expression in African cichlid gonads and during early development. Orthology inference revealed an expansion of piwil1 genes in Lake Malawi cichlids, likely driven by PiggyBac TEs. The expanded piwil1 copies have signatures of positive selection and retain amino acid residues essential for catalytic activity. Furthermore, the gonads of African cichlids express a Piwi-interacting RNA (piRNA) pathway that target TEs. We define the genomic sites of piRNA production in African cichlids and find divergence in closely related species, in line with fast evolution of piRNA-producing loci. Our findings suggest dynamic co-evolution of TEs and host silencing pathways in the African cichlid radiations. We propose that this co-evolution has contributed to cichlid genomic diversity.},
keywords = {WP2: Evolutionary/Comparative CPG},
pubstate = {published},
tppubtype = {unpublished}
}
Charalampopoulos, Panagiotis; Chen, Huiping; Christen, Peter; Loukides, Grigorios; Pisanti, Nadia; Pissis, Solon P.; Radoszewski, Jakub
Pattern Masking for Dictionary Matching: Theory and Practice Journal Article
In: Algorithmica, vol. 86, no. 6, pp. 1948–1978, 2024, ISSN: 1432-0541.
Abstract | Links | BibTeX | Tags: WP1: Primary CPG
@article{Charalampopoulos2024,
title = {Pattern Masking for Dictionary Matching: Theory and Practice},
author = {Panagiotis Charalampopoulos and Huiping Chen and Peter Christen and Grigorios Loukides and Nadia Pisanti and Solon P. Pissis and Jakub Radoszewski},
doi = {10.1007/s00453-024-01213-8},
issn = {1432-0541},
year = {2024},
date = {2024-03-01},
journal = {Algorithmica},
volume = {86},
number = {6},
pages = {1948–1978},
publisher = {Springer Science and Business Media LLC},
abstract = {Data masking is a common technique for sanitizing sensitive data maintained in database systems which is becoming increasingly important in various application areas, such as in record linkage of personal data. This work formalizes the Pattern Masking for Dictionary Matching (PMDM) problem: given a dictionary D of d strings, each of length l, a query string q of length l, and a positive integer z, we are asked to compute a smallest set K ⊆ 1, ... , l, so that if q[i] is replaced by a wildcard for all i ∈ K , then q matches at least z strings from D. Solving PMDM allows providing data utility guarantees as opposed to existing approaches. We first show, through a reduction from the well-known k-Clique problem, that a decision version of the PMDM problem is NP-complete, even for binary strings. We thus approach the problem from a more practical perspective. We show a combinatorial $$O((dl)^|K|/3 + dl)$$-time and O(dl)-space algorithm for PMDM for |K| = O(1). In fact, we show that we cannot hope for a faster combinatorial algorithm, unless the combinatorial k-Clique hypothesis fails (Abboud et al. in SIAM J Comput 47:2527–2555, 2018; Lincoln et al., in: 29th ACM-SIAM Symposium on Discrete Algorithms (SODA), 2018). Our combinatorial algorithm, executed with small |K|, is the backbone of a greedy heuristic that we propose. Our experiments on real-world and synthetic datasets show that our heuristic finds nearly-optimal solutions in practice and is also very efficient. We also generalize this algorithm for the problem of masking multiple query strings simultaneously so that every string has at least z matches in D. PMDM can be viewed as a generalization of the decision version of the dictionary matching with mismatches problem: by querying a PMDM data structure with string q and z = 1, one obtains the minimal number of mismatches of q with any string from D. The query time or space of all known data structures for the more restricted problem of dictionary matching with at most k mismatches incurs some exponential factor with respect to k. A simple exact algorithm for PMDM runs in time $$O(2^ld)$$. We present a data structure for PMDM that answers queries over D in time $$O(2^l/2(2^l/2 + τ )l)$$ and requires space $$O(2^ld^2/τ^2 + 2^l/2d)$$, for any parameter τ ∈ [1, d]. We complement our results by showing a two-way polynomial-time reduction between PMDM and the Minimum Union problem [Chlamtáˇc et al., ACM-SIAM Symposium on Discrete Algorithms (SODA) 2017]. This gives a polynomial-time $$O(d^1/4+epsilon)$$-approximation algorithm for PMDM, which is tight under a plausible complexity conjecture. This is an extended version of a paper that was presented at International Symposium on Algorithms and Computation (ISAAC) 2021.},
keywords = {WP1: Primary CPG},
pubstate = {published},
tppubtype = {article}
}
Rizzo, Nicola; Cáceres, Manuel; Mäkinen, Veli
Finding maximal exact matches in graphs Journal Article
In: Algorithms for Molecular Biology, vol. 19, no. 1, pp. 10, 2024, ISSN: 1748-7188.
Abstract | Links | BibTeX | Tags: WP1: Primary CPG
@article{Rizzo2024,
title = {Finding maximal exact matches in graphs},
author = {Nicola Rizzo and Manuel Cáceres and Veli Mäkinen},
url = {https://github.com/algbio/efg-mems},
doi = {10.1186/s13015-024-00255-5},
issn = {1748-7188},
year = {2024},
date = {2024-03-01},
urldate = {2024-03-01},
journal = {Algorithms for Molecular Biology},
volume = {19},
number = {1},
pages = {10},
publisher = {Springer Science and Business Media LLC},
abstract = {Background: We study the problem of finding maximal exact matches (MEMs) between a query string Q and a labeled graph G. MEMs are an important class of seeds, often used in seed-chain-extend type of practical alignment methods because of their strong connections to classical metrics. A principled way to speed up chaining is to limit the number of MEMs by considering only MEMs of length at least $kappa$($kappa$-MEMs). However, on arbitrary input graphs, the problem of finding MEMs cannot be solved in truly sub-quadratic time under SETH (Equi et al., TALG 2023) even on acyclic graphs.
Results: In this paper we show an $O(n cdot L cdot d^{L-1} + m + M_{kappa , L})$-time algorithm finding all $kappa$-MEMs between Q and G spanning exactly L nodes in G, where n is the total length of node labels, d is the maximum degree of a node in G, $m = |Q|$, and $M_{kappa ,L}$ is the number of output MEMs. We use this algorithm to develop a $kappa$-MEM finding solution on indexable Elastic Founder Graphs (Equi et al., Algorithmica 2022) running in time $O(nH^2 + m + M_{kappa})$, where H is the maximum number of nodes in a block, and $M_{kappa}$ is the total number of $kappa$-MEMs. Our results generalize to the analysis of multiple query strings (MEMs between G and any of the strings). Additionally, we provide some experimental results showing that the number of graph MEMs is an order of magnitude smaller than the number of string MEMs of the corresponding concatenated collection.
Conclusions: We show that seed-chain-extend type of alignment methods can be implemented on top of indexable Elastic Founder Graphs by providing an efficient way to produce the seeds between a set of queries and the graph. The code is available in https://github.com/algbio/efg-mems.},
keywords = {WP1: Primary CPG},
pubstate = {published},
tppubtype = {article}
}
Results: In this paper we show an $O(n \cdot L \cdot d^{L-1} + m + M_{\kappa , L})$-time algorithm finding all $\kappa$-MEMs between Q and G spanning exactly L nodes in G, where n is the total length of node labels, d is the maximum degree of a node in G, $m = |Q|$, and $M_{\kappa ,L}$ is the number of output MEMs. We use this algorithm to develop a $\kappa$-MEM finding solution on indexable Elastic Founder Graphs (Equi et al., Algorithmica 2022) running in time $O(nH^2 + m + M_{\kappa})$, where H is the maximum number of nodes in a block, and $M_{\kappa}$ is the total number of $\kappa$-MEMs. Our results generalize to the analysis of multiple query strings (MEMs between G and any of the strings). Additionally, we provide some experimental results showing that the number of graph MEMs is an order of magnitude smaller than the number of string MEMs of the corresponding concatenated collection.
Conclusions: We show that seed-chain-extend type of alignment methods can be implemented on top of indexable Elastic Founder Graphs by providing an efficient way to produce the seeds between a set of queries and the graph. The code is available in https://github.com/algbio/efg-mems.