Publications – ALPACA ITN Project

122 entries « ‹ 1 of 7 › »

Ghaffaari, Ali; Schönhuth, Alexander; Marschall, Tobias

DiVerG: Scalable Distance Index for Validation of Paired-End Alignments in Sequence Graphs Proceedings Article

In: Brejová, Broňa; Patro, Rob (Ed.): 25th International Conference on Algorithms for Bioinformatics (WABI 2025), pp. 10:1–10:24, Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl, Germany, 2025, ISSN: 1868-8969.

Abstract | Links | BibTeX | Tags: WP1: Primary CPG

@inproceedings{ghaffaari_et_al:LIPIcs.WABI.2025.10,

title = {DiVerG: Scalable Distance Index for Validation of Paired-End Alignments in Sequence Graphs},

author = {Ali Ghaffaari and Alexander Schönhuth and Tobias Marschall},

editor = {Broňa Brejová and Rob Patro},

url = {https://github.com/cartoonist/diverg},

doi = {10.4230/LIPIcs.WABI.2025.10},

issn = {1868-8969},

year  = {2025},

date = {2025-01-01},

urldate = {2025-01-01},

booktitle = {25th International Conference on Algorithms for Bioinformatics (WABI 2025)},

volume = {344},

pages = {10:1–10:24},

publisher = {Schloss Dagstuhl – Leibniz-Zentrum für Informatik},

address = {Dagstuhl, Germany},

series = {Leibniz International Proceedings in Informatics (LIPIcs)},

abstract = {Determining the distance between two loci within a genomic region is a recurrent operation in various tasks in computational genomics. A notable example of this task arises in paired-end read mapping as a form of validation of distances between multiple alignments. While straightforward for a single genome, graph-based reference structures render the operation considerably more involved. Given the sheer number of such queries in a typical read mapping experiment, an efficient algorithm for answering distance queries is crucial. In this paper, we introduce DiVerG, a compact data structure as well as a fast and scalable algorithm, for constructing distance indexes for general sequence graphs on multi-core CPU and many-core GPU architectures. DiVerG is based on PairG, but overcomes the limitations of PairG by exploiting the extensive potential for improvements in terms of scalability and space efficiency. As a consequence, DiVerG can process substantially larger datasets, such as whole human genomes, which are unmanageable by PairG. DiVerG offers faster index construction time and consistently faster query time with gains proportional to the size of the underlying compact data structure. We demonstrate that our method performs favorably on multiple real datasets at various scales. DiVerG achieves superior performance over PairG; e.g. resulting to 2.5--4x speed-up in query time, 44--340x smaller index size, and 3--50x faster construction time for the genome graph of the MHC region, as a particularly variable region of the human genome.The implementation is available at: https://github.com/cartoonist/diverg},

keywords = {WP1: Primary CPG},

pubstate = {published},

tppubtype = {inproceedings}

}

Kenneweg, Philip; Dandinasivara, Raghuram; Luo, Xiao; Hammer, Barbara; Schönhuth, Alexander

Generating Synthetic Genotypes using Diffusion Models Unpublished

2024.

Abstract | Links | BibTeX | Tags: WP3: Translational CPG

Cartes, Jorge Avila; Bonizzoni, Paola; Ciccolella, Simone; Vedova, Gianluca Della; Denti, Luca

PangeBlocks: customized construction of pangenome graphs via maximal blocks Journal Article

In: BMC Bioinformatics, vol. 25, no. 1, 2024, ISSN: 1471-2105.

Abstract | Links | BibTeX | Tags: WP2: Evolutionary/Comparative CPG

@article{AvilaCartes2024a,

title = {PangeBlocks: customized construction of pangenome graphs via maximal blocks},

author = {Jorge Avila Cartes and Paola Bonizzoni and Simone Ciccolella and Gianluca Della Vedova and Luca Denti},

url = {https://github.com/AlgoLab/pangeblocks},

doi = {10.1186/s12859-024-05958-5},

issn = {1471-2105},

year  = {2024},

date = {2024-11-01},

urldate = {2024-11-01},

journal = {BMC Bioinformatics},

volume = {25},

number = {1},

publisher = {Springer Science and Business Media LLC},

abstract = {\textbf{Background}

The construction of a pangenome graph is a fundamental task in pangenomics. A natural theoretical question is how to formalize the computational problem of building an optimal pangenome graph, making explicit the underlying optimization criterion and the set of feasible solutions. Current approaches build a pangenome graph with some heuristics, without assuming some explicit optimization criteria. Thus it is unclear how a specific optimization criterion affects the graph topology and downstream analysis, like read mapping and variant calling. 



\textbf{Results} 

In this paper, by leveraging the notion of maximal block in a Multiple Sequence Alignment (MSA), we reframe the pangenome graph construction problem as an exact cover problem on blocks called Minimum Weighted Block Cover (MWBC). Then we propose an Integer Linear Programming (ILP) formulation for the MWBC problem that allows us to study the most natural objective functions for building a graph. We provide an implementation of the ILP approach for solving the MWBC and we evaluate it on SARS-CoV-2 complete genomes, showing how different objective functions lead to pangenome graphs that have different properties, hinting that the specific downstream task can drive the graph construction phase. 



\textbf{Conclusion}

We show that a customized construction of a pangenome graph based on selecting objective functions has a direct impact on the resulting graphs. In particular, our formalization of the MWBC problem, based on finding an optimal subset of blocks covering an MSA, paves the way to novel practical approaches to graph representations of an MSA where the user can guide the construction.},

keywords = {WP2: Evolutionary/Comparative CPG},

pubstate = {published},

tppubtype = {article}

}

Balvert, Marleen; Cooper-Knock, Johnathan; Stamp, Julian; Byrne, Ross P.; Mourragui, Soufiane; Gils, Juami; Benonisdottir, Stefania; Schlüter, Johannes; Kenna, Kevin; Abeln, Sanne; Iacoangeli, Alfredo; Daub, Joséphine T.; Browning, Brian L.; Taş, Gizem; Hu, Jiajing; Wang, Yan; Alhathli, Elham; Harvey, Calum; Pianesi, Luna; Schulte, Sara C.; González-Domínguez, Jorge; Garrisson, Erik; Al-Chalabi, Ammar; Cartes, Jorge Avila; Baaijens, Jasmijn; Berg, Joanna; Bolognini, Davide; Bonizzoni, Paola; Guarracino, Andrea; Koyuturk, Mehmet; Markowska, Magda; Dandinasivara, Raghuram; Bemmelen, Jasper; Vorbrugg, Sebastian; Zhang, Sai; Pasanuic, Bogdan; Snyder, Michael P.; Schönhuth, Alexander; Sng, Letitia M. F.; Twine, Natalie A.

Considerations in the search for epistasis Journal Article

In: Genome Biology, vol. 25, no. 1, pp. 296, 2024, ISSN: 1474-760X.

Abstract | Links | BibTeX | Tags: WP3: Translational CPG

Zhang, Wenhai; Liu, Yuansheng; Xu, Jialu; Chen, Enlian; Schönhuth, Alexander; Luo, Xiao

PanTax: Strain-level taxonomic classification of metagenomic data using pangenome graphs Unpublished

2024.

Abstract | Links | BibTeX | Tags: WP1: Primary CPG

@unpublished{Zhang2024,

title = {PanTax: Strain-level taxonomic classification of metagenomic data using pangenome graphs},

author = {Wenhai Zhang and Yuansheng Liu and Jialu Xu and Enlian Chen and Alexander Schönhuth and Xiao Luo},

url = {https://github.com/LuoGroup2023/PanTax},

doi = {10.1101/2024.11.15.623887},

year  = {2024},

date = {2024-11-01},

urldate = {2024-11-01},

publisher = {Cold Spring Harbor Laboratory},

abstract = {Microbes are omnipresent, thriving in a range of habitats from oceans to soils and even within our gastrointestinal tracts. They play a vital role in maintaining ecological equilibrium and promoting the health of their hosts. Consequently, understanding the strain diversity within microbial communities is crucial, as variations between strains can lead to distinct phenotypic expressions or diverse biological functions. However, current methods for taxonomic classification from metagenomic sequencing data have several limitations, including their reliance solely on species resolution, support for either short or long reads, or their confinement to a given single species. Most notably, the majority of existing taxonomic classifiers rely solely on a single linear representative genome as a reference, which fails to capture the strain diversity, thereby introducing single-reference biases. 

Here, we present PanTax, a pangenome graph-based taxonomic classification method that overcomes the shortcomings of single-reference genome-based approaches, because pangenome graphs possess the capability to depict the genetic variability present across multiple evolutionarily or environmentally related genomes. PanTax provides a comprehensive solution to taxonomic classification for strain resolution, compatibility with both short and long reads, and compatibility with single or multiple species. Extensive benchmarking results demonstrate that PanTax drastically outperforms state-of-the-art approaches, primarily evidenced by its significantly higher precision or recall (at both species and strain levels), while maintaining comparable or better performance in other aspects across various datasets. PanTax is a user-friendly open-source tool that is publicly accessible at https://github.com/LuoGroup2023/PanTax.},

keywords = {WP1: Primary CPG},

pubstate = {published},

tppubtype = {unpublished}

}

Parmigiani, Luca; Garrison, Erik; Stoye, Jens; Marschall, Tobias; Doerr, Daniel

Panacus: fast and exact pangenome growth and core size estimation Journal Article

In: Bioinformatics, 2024, ISSN: 1367-4811.

Abstract | Links | BibTeX | Tags: WP2: Evolutionary/Comparative CPG

Frolova, Daria; Lima, Leandro; Roberts, Leah Wendy; Bohnenkämper, Leonard; Wittler, Roland; Stoye, Jens; Iqbal, Zamin

Applying rearrangement distances to enable plasmid epidemiology with pling Journal Article

In: Microbial Genomics, vol. 10, no. 10, 2024, ISSN: 2057-5858.

Abstract | Links | BibTeX | Tags: WP3: Translational CPG

@article{Frolova2024,

title = {Applying rearrangement distances to enable plasmid epidemiology with pling},

author = {Daria Frolova and Leandro Lima and Leah Wendy Roberts and Leonard Bohnenkämper and Roland Wittler and Jens Stoye and Zamin Iqbal},

url = {https://github.com/iqbal-lab-org/pling},

doi = {10.1099/mgen.0.001300},

issn = {2057-5858},

year  = {2024},

date = {2024-10-01},

urldate = {2024-10-01},

journal = {Microbial Genomics},

volume = {10},

number = {10},

publisher = {Microbiology Society},

abstract = {Plasmids are a key vector of antibiotic resistance, but the current bioinformatics toolkit is not well suited to tracking them. The rapid structural changes seen in plasmid genomes present considerable challenges to evolutionary and epidemiological analysis. Typical approaches are either low resolution (replicon typing) or use shared k-mer content to define a genetic distance. However, this distance can both overestimate plasmid relatedness by ignoring rearrangements, and underestimate by over-penalizing gene gain/loss. Therefore a model is needed which captures the key components of how plasmid genomes evolve structurally – through gene/block gain or loss, and rearrangement. A secondary requirement is to prevent promiscuous transposable elements (TEs) leading to over-clustering of unrelated plasmids. We choose the ‘Double Cut and Join Indel’ (DCJ-Indel) model, in which plasmids are studied at a coarse level, as a sequence of signed integers (representing genes or aligned blocks), and the distance between two plasmids is the minimum number of rearrangement events or indels needed to transform one into the other. We show how this gives much more meaningful distances between plasmids. We introduce a software workflow pling (https://github.com/iqbal-lab-org/pling), which uses the DCJ-Indel model, to calculate distances between plasmids and then cluster them. In our approach, we combine containment distances and DCJ-Indel distances to build a TE-aware plasmid network. We demonstrate superior performance and interpretability to other plasmid clustering tools on the ‘Russian Doll’ dataset and a hospital transmission dataset.},

keywords = {WP3: Translational CPG},

pubstate = {published},

tppubtype = {article}

}

Kang, Xiongbin; Zhang, Wenhai; Li, Yichen; Luo, Xiao; Schönhuth, Alexander

HyLight: Strain aware assembly of low coverage metagenomes Journal Article

In: Nature Communications, vol. 15, no. 1, pp. 8665, 2024, ISSN: 2041-1723.

Abstract | Links | BibTeX | Tags: WP1: Primary CPG

@article{Kang2024,

title = {HyLight: Strain aware assembly of low coverage metagenomes},

author = {Xiongbin Kang and Wenhai Zhang and Yichen Li and Xiao Luo and Alexander Schönhuth},

doi = {10.1038/s41467-024-52907-0},

issn = {2041-1723},

year  = {2024},

date = {2024-10-01},

journal = {Nature Communications},

volume = {15},

number = {1},

pages = {8665},

publisher = {Springer Science and Business Media LLC},

abstract = {Different strains of identical species can vary substantially in terms of their spectrum of biomedically relevant phenotypes. Reconstructing the genomes of microbial communities at the level of their strains poses significant challenges, because sequencing errors can obscure strain-specific variants. Next-generation sequencing (NGS) reads are too short to resolve complex genomic regions. Third-generation sequencing (TGS) reads, although longer, are prone to higher error rates or substantially more expensive. Limiting TGS coverage to reduce costs compromises the accuracy of the assemblies. This explains why prior approaches agree on losses in strain awareness, accuracy, tendentially excessive costs, or combinations thereof. We introduce HyLight, a metagenome assembly approach that addresses these challenges by implementing the complementary strengths of TGS and NGS data. HyLight employs strain-resolved overlap graphs (OG) to accurately reconstruct individual strains within microbial communities. Our experiments demonstrate that HyLight produces strain-aware and contiguous assemblies at minimal error content, while significantly reducing costs because utilizing low-coverage TGS data. HyLight achieves an average improvement of 19.05% in preserving strain identity and demonstrates near-complete strain awareness across diverse datasets. In summary, HyLight offers considerable advances in metagenome assembly, insofar as it delivers significantly enhanced strain awareness, contiguity, and accuracy without the typical compromises observed in existing approaches.},

keywords = {WP1: Primary CPG},

pubstate = {published},

tppubtype = {article}

}

Rivals, Eric

Incremental computation of the set of period sets Unpublished

2024.

Abstract | Links | BibTeX | Tags: WP1: Primary CPG

10.

Gabory, Esteban; Mwaniki, Moses Njagi; Pisanti, Nadia; Pissis, Solon P.; Radoszewski, Jakub; Sweering, Michelle; Zuba, Wiktor

Pangenome comparison via ED strings Journal Article

In: Frontiers in Bioinformatics, vol. 4, 2024, ISSN: 2673-7647.

Abstract | Links | BibTeX | Tags: WP2: Evolutionary/Comparative CPG

11.

Bernardini, Giulia; Gabory, Esteban; Pissis, Solon P.; Stougie, Leen; Sweering, Michelle; Zuba, Wiktor

Elastic-Degenerate String Matching with 1 Error or Mismatch Journal Article

In: Theory of Computing Systems, vol. 68, no. 5, pp. 1442–1467, 2024, ISSN: 1433-0490.

Abstract | Links | BibTeX | Tags: WP1: Primary CPG

@article{Bernardini2024b,

title = {Elastic-Degenerate String Matching with 1 Error or Mismatch},

author = {Giulia Bernardini and Esteban Gabory and Solon P. Pissis and Leen Stougie and Michelle Sweering and Wiktor Zuba},

doi = {10.1007/s00224-024-10194-8},

issn = {1433-0490},

year  = {2024},

date = {2024-09-01},

urldate = {2024-09-01},

journal = {Theory of Computing Systems},

volume = {68},

number = {5},

pages = {1442–1467},

publisher = {Springer Science and Business Media LLC},

abstract = {An elastic-degenerate (ED) string is a sequence of n finite sets of strings of total length N, introduced to represent a set of related DNA sequences, also known as a pangenome. The ED string matching (EDSM) problem consists in reporting all occurrences of a pattern of length m in an ED text. The EDSM problem has recently received some attention by the combinatorial pattern matching community, culminating in an $tilde{mathcal{O}}(nw^{w-1})+mathcal{O}(N)$-time algorithm [Bernardini et al., SIAM J. Comput. 2022], where $omega$ denotes the matrix multiplication exponent and the $tilde{mathcal{O}}(cdot)$ notation suppresses polylog factors. In the k-EDSM problem, the approximate version of EDSM, we are asked to report all pattern occurrences with at most k errors. k-EDSM can be solved in $mathcal{O}(k^2mG+kN)$ time, under edit distance, or $mathcal{O}(kmG+kN)$ time, under Hamming distance, where G denotes the total number of strings in the ED text [Bernardini et al., Theor. Comput. Sci. 2020]. Unfortunately, G is only bounded by N, and so even for k = 1, the existing algorithms run in $Omega(mN)$ time in the worst case. In this paper we make progress in this direction. We show that 1-EDSM can be solved in $mathcal{O}((nm^2+N)log m)$ or $mathcal{O}(nm^3+N)$ time under edit distance. For the decision version of the problem, we present a faster $mathcal{O}(nm^2sqrt{log m} + Nlog log m)$-time algorithm. We also show that 1-EDSM can be solved in $mathcal{O}(nm^2+N log m)$ time under Hamming distance. Our algorithms for edit distance rely on non-trivial reductions from 1-EDSM to special instances of classic computational geometry problems (2d rectangle stabbing or 2d range emptiness), which we show how to solve efficiently. In order to obtain an even faster algorithm for Hamming distance, we rely on employing and adapting the k-errata trees for indexing with errors [Cole et al., STOC 2004]. This is an extended version of a paper presented at LATIN 2022.},

keywords = {WP1: Primary CPG},

pubstate = {published},

tppubtype = {article}

}

An elastic-degenerate (ED) string is a sequence of n finite sets of strings of total length N, introduced to represent a set of related DNA sequences, also known as a pangenome. The ED string matching (EDSM) problem consists in reporting all occurrences of a pattern of length m in an ED text. The EDSM problem has recently received some attention by the combinatorial pattern matching community, culminating in an $\tilde{\mathcal{O}}(nw^{w-1})+\mathcal{O}(N)$-time algorithm [Bernardini et al., SIAM J. Comput. 2022], where $\omega$ denotes the matrix multiplication exponent and the $\tilde{\mathcal{O}}(\cdot)$ notation suppresses polylog factors. In the k-EDSM problem, the approximate version of EDSM, we are asked to report all pattern occurrences with at most k errors. k-EDSM can be solved in $\mathcal{O}(k^2mG+kN)$ time, under edit distance, or $\mathcal{O}(kmG+kN)$ time, under Hamming distance, where G denotes the total number of strings in the ED text [Bernardini et al., Theor. Comput. Sci. 2020]. Unfortunately, G is only bounded by N, and so even for k = 1, the existing algorithms run in $\Omega(mN)$ time in the worst case. In this paper we make progress in this direction. We show that 1-EDSM can be solved in $\mathcal{O}((nm^2+N)\log m)$ or $\mathcal{O}(nm^3+N)$ time under edit distance. For the decision version of the problem, we present a faster $\mathcal{O}(nm^2\sqrt{\log m} + N\log \log m)$-time algorithm. We also show that 1-EDSM can be solved in $\mathcal{O}(nm^2+N \log m)$ time under Hamming distance. Our algorithms for edit distance rely on non-trivial reductions from 1-EDSM to special instances of classic computational geometry problems (2d rectangle stabbing or 2d range emptiness), which we show how to solve efficiently. In order to obtain an even faster algorithm for Hamming distance, we rely on employing and adapting the k-errata trees for indexing with errors [Cole et al., STOC 2004]. This is an extended version of a paper presented at LATIN 2022.

12.

Brejová, Broňa; Gagie, Travis; Herencsárová, Eva; Vinař, Tomáš

Maximum-scoring path sets on pangenome graphs of constant treewidth Journal Article

In: Frontiers in Bioinformatics, vol. 4, 2024, ISSN: 2673-7647.

Abstract | Links | BibTeX | Tags: WP2: Evolutionary/Comparative CPG

13.

Sierra, Pío; Durbin, Richard

Identification of transposable element families from pangenome polymorphisms Journal Article

In: Mobile DNA, vol. 15, no. 1, pp. 13, 2024, ISSN: 1759-8753.

Abstract | Links | BibTeX | Tags: WP2: Evolutionary/Comparative CPG

14.

Gabory, Esteban; Liu, Chang; Loukides, Grigorios; Pissis, Solon P.; Zuba, Wiktor

Space-Efficient Indexes for Uncertain Strings Proceedings Article

In: 2024 IEEE 40th International Conference on Data Engineering (ICDE), pp. 4828–4842, IEEE, 2024.

Abstract | Links | BibTeX | Tags: WP1: Primary CPG

@inproceedings{Gabory2024a,

title = {Space-Efficient Indexes for Uncertain Strings},

author = {Esteban Gabory and Chang Liu and Grigorios Loukides and Solon P. Pissis and Wiktor Zuba},

url = {https://arxiv.org/abs/2403.14256},

doi = {10.1109/icde60146.2024.00367},

year  = {2024},

date = {2024-05-01},

urldate = {2024-05-01},

booktitle = {2024 IEEE 40th International Conference on Data Engineering (ICDE)},

pages = {4828–4842},

publisher = {IEEE},

abstract = {Strings in the real world are often encoded with some level of uncertainty, for example, due to: unreliable data measurements; flexible sequence modeling; or noise introduced for privacy protection. In the character-level uncertainty model, an uncertain string X of length $n$ on an alphabet Σ is a sequence of $n$ probability distributions over Σ. Given an uncertain string $X$ and a weight threshold $frac{1}{z}in(0,1]$, we say that pattern $P$ occurs in $X$ at position $i$, if the product of probabilities of the letters of $P$ at positions $i,…, i+ vert Pvert-1$ is at least $frac{1}{z}$. While indexing standard strings for online pattern searches can be performed in linear time and space, indexing uncertain strings is much more challenging. Specifically, the state-of-the-art index for uncertain strings has $O(nz)$ size, requires $O(nz)$ time and $O(nz)$ space to be constructed, and answers pattern matching queries in the optimal $O(m+ vert Occ vert)$ time, where $m$ is the length of $P$ and $vert Occvert$ is the total number of occurrences of $P$ in $X$ . For large $n$ and (moderate) $z$ values, this index is completely impractical to construct, which outweighs the benefit of the supported optimal pattern matching queries. We were thus motivated to design a space-efficient index at the expense of slower yet competitive pattern matching queries. We show that when we have at hand a lower bound ℓ on the length of the supported pattern queries, as is often the case in real-world applications, we can slash the index size and the construction space roughly by ℓ. In particular, we propose an index of $mathcal{O}(frac{nz}{l}log z)$ expected size, which can be constructed using $mathcal{O}(frac{nz}{l}log z)$ expected space, and supports very fast pattern matching queries in expectation, for patterns of length m ≥ ℓ. We have implemented and evaluated several versions of our index. The best-performing version of our index is up to two orders of magnitude smaller than the state of the art in terms of both index size and construction space, while offering faster or very competitive query and construction times.},

keywords = {WP1: Primary CPG},

pubstate = {published},

tppubtype = {inproceedings}

}

Strings in the real world are often encoded with some level of uncertainty, for example, due to: unreliable data measurements; flexible sequence modeling; or noise introduced for privacy protection. In the character-level uncertainty model, an uncertain string X of length $n$ on an alphabet Σ is a sequence of $n$ probability distributions over Σ. Given an uncertain string $X$ and a weight threshold $\frac{1}{z}\in(0,1]$, we say that pattern $P$ occurs in $X$ at position $i$, if the product of probabilities of the letters of $P$ at positions $i,…, i+ \vert P\vert-1$ is at least $\frac{1}{z}$. While indexing standard strings for online pattern searches can be performed in linear time and space, indexing uncertain strings is much more challenging. Specifically, the state-of-the-art index for uncertain strings has $O(nz)$ size, requires $O(nz)$ time and $O(nz)$ space to be constructed, and answers pattern matching queries in the optimal $O(m+ \vert Occ \vert)$ time, where $m$ is the length of $P$ and $\vert Occ\vert$ is the total number of occurrences of $P$ in $X$ . For large $n$ and (moderate) $z$ values, this index is completely impractical to construct, which outweighs the benefit of the supported optimal pattern matching queries. We were thus motivated to design a space-efficient index at the expense of slower yet competitive pattern matching queries. We show that when we have at hand a lower bound ℓ on the length of the supported pattern queries, as is often the case in real-world applications, we can slash the index size and the construction space roughly by ℓ. In particular, we propose an index of $\mathcal{O}(\frac{nz}{l}\log z)$ expected size, which can be constructed using $\mathcal{O}(\frac{nz}{l}\log z)$ expected space, and supports very fast pattern matching queries in expectation, for patterns of length m ≥ ℓ. We have implemented and evaluated several versions of our index. The best-performing version of our index is up to two orders of magnitude smaller than the state of the art in terms of both index size and construction space, while offering faster or very competitive query and construction times.

15.

Taş, Gizem; Westerdijk, Timo; Postma, Eric; Rheenen, Wouter; Bakker, Mark K.; Eijk, Kristel R.; Kooyman, Maarten; Khleifat, Ahmad Al; Iacoangeli, Alfredo; Ticozzi, Nicola; Cooper-Knock, Johnathan; Gromicho, Marta; Chandran, Siddharthan; Morrison, Karen E.; Shaw, Pamela J.; Hardy, John; Sendtner, Michael; Meyer, Thomas; Başak, Nazli; Fogh, Isabella; Chiò, Adriano; Calvo, Andrea; Pupillo, Elisabetta; Logroscino, Giancarlo; Gotkine, Marc; Vourc’h, Patrick; Corcia, Philippe; Couratier, Philippe; Millecamps, Stèphanie; Salachas, François; Pardina, Jesus S. Mora; Rojas-García, Ricardo; Dion, Patrick; Ross, Jay P.; Ludolph, Albert C.; Weishaupt, Jochen H.; Freischmidt, Axel; Bensimon, Gilbert; Tittmann, Lukas; Lieb, Wolfgang; Franke, Andre; Ripke, Stephan; Whiteman, David C.; Olsen, Catherine M.; Uitterlinden, Andre G.; Hofman, Albert; Amouyel, Philippe; Traynor, Bryan; Singleton, Adrew B.; Neto, Miguel Mitne; Cauchi, Ruben J.; Ophoff, Roel A.; Deerlin, Vivianna M.; Grosskreutz, Julian; Graff, Caroline; Brylev, Lev; Rogelj, Boris; Koritnik, Blaž; Zidar, Janez; Stević, Zorica; Drory, Vivian; Povedano, Monica; Blair, Ian P.; Kiernan, Matthew C.; Nicholson, Garth A.; Henders, Anjali K.; Carvalho, Mamede; Pinto, Susana; Petri, Susanne; Weber, Markus; Rouleau, Guy A.; Silani, Vincenzo; Glass, Jonathan; Brown, Robert H.; Landers, John E.; Shaw, Christopher E.; Andersen, Peter M.; Garton, Fleur C.; McRae, Allan F.; McLaughlin, Russell L.; Hardiman, Orla; Kenna, Kevin P.; Wray, Naomi R.; Al-Chalabi, Ammar; Damme, Philip Van; Berg, Leonard H.; Veldink, Jan H.; Veldink, Jan H.; Schönhuth, Alexander; Balvert, Marleen

Computing linkage disequilibrium aware genome embeddings using autoencoders Journal Article

In: Bioinformatics, vol. 40, no. 6, 2024, ISSN: 1367-4811.

Abstract | Links | BibTeX | Tags: WP3: Translational CPG

@article{Tas2024,

title = {Computing linkage disequilibrium aware genome embeddings using autoencoders},

author = {Gizem Taş and Timo Westerdijk and Eric Postma and Wouter Rheenen and Mark K. Bakker and Kristel R. Eijk and Maarten Kooyman and Ahmad Al Khleifat and Alfredo Iacoangeli and Nicola Ticozzi and Johnathan Cooper-Knock and Marta Gromicho and Siddharthan Chandran and Karen E. Morrison and Pamela J. Shaw and John Hardy and Michael Sendtner and Thomas Meyer and Nazli Başak and Isabella Fogh and Adriano Chiò and Andrea Calvo and Elisabetta Pupillo and Giancarlo Logroscino and Marc Gotkine and Patrick Vourc’h and Philippe Corcia and Philippe Couratier and Stèphanie Millecamps and François Salachas and Jesus S. Mora Pardina and Ricardo Rojas-García and Patrick Dion and Jay P. Ross and Albert C. Ludolph and Jochen H. Weishaupt and Axel Freischmidt and Gilbert Bensimon and Lukas Tittmann and Wolfgang Lieb and Andre Franke and Stephan Ripke and David C. Whiteman and Catherine M. Olsen and Andre G. Uitterlinden and Albert Hofman and Philippe Amouyel and Bryan Traynor and Adrew B. Singleton and Miguel Mitne Neto and Ruben J. Cauchi and Roel A. Ophoff and Vivianna M. Deerlin and Julian Grosskreutz and Caroline Graff and Lev Brylev and Boris Rogelj and Blaž Koritnik and Janez Zidar and Zorica Stević and Vivian Drory and Monica Povedano and Ian P. Blair and Matthew C. Kiernan and Garth A. Nicholson and Anjali K. Henders and Mamede Carvalho and Susana Pinto and Susanne Petri and Markus Weber and Guy A. Rouleau and Vincenzo Silani and Jonathan Glass and Robert H. Brown and John E. Landers and Christopher E. Shaw and Peter M. Andersen and Fleur C. Garton and Allan F. McRae and Russell L. McLaughlin and Orla Hardiman and Kevin P. Kenna and Naomi R. Wray and Ammar Al-Chalabi and Philip Van Damme and Leonard H. Berg and Jan H. Veldink and Jan H. Veldink and Alexander Schönhuth and Marleen Balvert},

editor = {Peter Robinson},

url = {https://github.com/gizem-tas/haploblock-autoencoders},

doi = {10.1093/bioinformatics/btae326},

issn = {1367-4811},

year  = {2024},

date = {2024-05-01},

urldate = {2024-05-01},

journal = {Bioinformatics},

volume = {40},

number = {6},

publisher = {Oxford University Press (OUP)},

abstract = {Motivation: The completion of the genome has paved the way for genome-wide association studies (GWAS), which explained certain proportions of heritability. GWAS are not optimally suited to detect non-linear effects in disease risk, possibly hidden in non-additive interactions (epistasis). Alternative methods for epistasis detection using, e.g. deep neural networks (DNNs) are currently under active development. However, DNNs are constrained by finite computational resources, which can be rapidly depleted due to increasing complexity with the sheer size of the genome. Besides, the curse of dimensionality complicates the task of capturing meaningful genetic patterns for DNNs; therefore necessitates dimensionality reduction. 

Results: We propose a method to compress single nucleotide polymorphism (SNP) data, while leveraging the linkage disequilibrium (LD) structure and preserving potential epistasis. This method involves clustering correlated SNPs into haplotype blocks and training per-block autoencoders to learn a compressed representation of the block’s genetic content. We provide an adjustable autoencoder design to accommodate diverse blocks and bypass extensive hyperparameter tuning. We applied this method to genotyping data from Project MinE, and achieved 99% average test reconstruction accuracy—i.e. minimal information loss—while compressing the input to nearly 10% of the original size. We demonstrate that haplotype-block based autoencoders outperform linear Principal Component Analysis (PCA) by approximately 3% chromosome-wide accuracy of reconstructed variants. To the extent of our knowledge, our approach is the first to simultaneously leverage haplotype structure and DNNs for dimensionality reduction of genetic data. 

Availability and implementation: Data are available for academic use through Project MinE at https://www.projectmine.com/research/data-sharing/, contingent upon terms and requirements specified by the source studies. Code is available at https://github.com/gizem-tas/haploblock-autoencoders.},

keywords = {WP3: Translational CPG},

pubstate = {published},

tppubtype = {article}

}

Motivation: The completion of the genome has paved the way for genome-wide association studies (GWAS), which explained certain proportions of heritability. GWAS are not optimally suited to detect non-linear effects in disease risk, possibly hidden in non-additive interactions (epistasis). Alternative methods for epistasis detection using, e.g. deep neural networks (DNNs) are currently under active development. However, DNNs are constrained by finite computational resources, which can be rapidly depleted due to increasing complexity with the sheer size of the genome. Besides, the curse of dimensionality complicates the task of capturing meaningful genetic patterns for DNNs; therefore necessitates dimensionality reduction.
Results: We propose a method to compress single nucleotide polymorphism (SNP) data, while leveraging the linkage disequilibrium (LD) structure and preserving potential epistasis. This method involves clustering correlated SNPs into haplotype blocks and training per-block autoencoders to learn a compressed representation of the block’s genetic content. We provide an adjustable autoencoder design to accommodate diverse blocks and bypass extensive hyperparameter tuning. We applied this method to genotyping data from Project MinE, and achieved 99% average test reconstruction accuracy—i.e. minimal information loss—while compressing the input to nearly 10% of the original size. We demonstrate that haplotype-block based autoencoders outperform linear Principal Component Analysis (PCA) by approximately 3% chromosome-wide accuracy of reconstructed variants. To the extent of our knowledge, our approach is the first to simultaneously leverage haplotype structure and DNNs for dimensionality reduction of genetic data.
Availability and implementation: Data are available for academic use through Project MinE at https://www.projectmine.com/research/data-sharing/, contingent upon terms and requirements specified by the source studies. Code is available at https://github.com/gizem-tas/haploblock-autoencoders.

16.

Rivals, Eric; Wang, Pengfei

Counting overlapping pairs of strings Unpublished

2024.

Abstract | Links | BibTeX | Tags: WP1: Primary CPG

17.

Hannoush, Khodor; Marchet, Camille; Peterlongo, Pierre

Cdbgtricks: Strategies to update a compacted de Bruijn graph Unpublished

2024.

Abstract | Links | BibTeX | Tags: WP1: Primary CPG

18.

Parmigiani, Luca; Wittler, Roland; Stoye, Jens

Revisiting pangenome openness with k-mers Journal Article

In: Peer Community Journal, vol. 4, 2024, ISSN: 2804-3871.

Abstract | Links | BibTeX | Tags: WP2: Evolutionary/Comparative CPG

19.

Almeida, Miguel Vasconcelos; Blumer, Moritz; Yuan, Chengwei Ulrika; Sierra, Pío; Price, Jonathan L.; Quah, Fu Xiang; Friman, Aleksandr; Dallaire, Alexandra; Vernaz, Grégoire; Putman, Audrey L. K.; Smith, Alan M.; Joyce, Domino A.; Butter, Falk; Haase, Astrid D.; Durbin, Richard; Santos, M. Emília; Miska, Eric A.

Dynamic co-evolution of transposable elements and the piRNA pathway in African cichlid fishes Unpublished

2024.

Abstract | Links | BibTeX | Tags: WP2: Evolutionary/Comparative CPG

20.

Charalampopoulos, Panagiotis; Chen, Huiping; Christen, Peter; Loukides, Grigorios; Pisanti, Nadia; Pissis, Solon P.; Radoszewski, Jakub

Pattern Masking for Dictionary Matching: Theory and Practice Journal Article

In: Algorithmica, vol. 86, no. 6, pp. 1948–1978, 2024, ISSN: 1432-0541.

Abstract | Links | BibTeX | Tags: WP1: Primary CPG

@article{Charalampopoulos2024,

title = {Pattern Masking for Dictionary Matching: Theory and Practice},

author = {Panagiotis Charalampopoulos and Huiping Chen and Peter Christen and Grigorios Loukides and Nadia Pisanti and Solon P. Pissis and Jakub Radoszewski},

doi = {10.1007/s00453-024-01213-8},

issn = {1432-0541},

year  = {2024},

date = {2024-03-01},

journal = {Algorithmica},

volume = {86},

number = {6},

pages = {1948–1978},

publisher = {Springer Science and Business Media LLC},

abstract = {Data masking is a common technique for sanitizing sensitive data maintained in database systems which is becoming increasingly important in various application areas, such as in record linkage of personal data. This work formalizes the Pattern Masking for Dictionary Matching (PMDM) problem: given a dictionary D of d strings, each of length l, a query string q of length l, and a positive integer z, we are asked to compute a smallest set K ⊆ 1, ... , l, so that if q[i] is replaced by a wildcard for all i ∈ K , then q matches at least z strings from D. Solving PMDM allows providing data utility guarantees as opposed to existing approaches. We first show, through a reduction from the well-known k-Clique problem, that a decision version of the PMDM problem is NP-complete, even for binary strings. We thus approach the problem from a more practical perspective. We show a combinatorial $$O((dl)^|K|/3 + dl)$$-time and O(dl)-space algorithm for PMDM for |K| = O(1). In fact, we show that we cannot hope for a faster combinatorial algorithm, unless the combinatorial k-Clique hypothesis fails (Abboud et al. in SIAM J Comput 47:2527–2555, 2018; Lincoln et al., in: 29th ACM-SIAM Symposium on Discrete Algorithms (SODA), 2018). Our combinatorial algorithm, executed with small |K|, is the backbone of a greedy heuristic that we propose. Our experiments on real-world and synthetic datasets show that our heuristic finds nearly-optimal solutions in practice and is also very efficient. We also generalize this algorithm for the problem of masking multiple query strings simultaneously so that every string has at least z matches in D. PMDM can be viewed as a generalization of the decision version of the dictionary matching with mismatches problem: by querying a PMDM data structure with string q and z = 1, one obtains the minimal number of mismatches of q with any string from D. The query time or space of all known data structures for the more restricted problem of dictionary matching with at most k mismatches incurs some exponential factor with respect to k. A simple exact algorithm for PMDM runs in time $$O(2^ld)$$. We present a data structure for PMDM that answers queries over D in time $$O(2^l/2(2^l/2 + τ )l)$$ and requires space $$O(2^ld^2/τ^2 + 2^l/2d)$$, for any parameter τ ∈ [1, d]. We complement our results by showing a two-way polynomial-time reduction between PMDM and the Minimum Union problem [Chlamtáˇc et al., ACM-SIAM Symposium on Discrete Algorithms (SODA) 2017]. This gives a polynomial-time $$O(d^1/4+epsilon)$$-approximation algorithm for PMDM, which is tight under a plausible complexity conjecture. This is an extended version of a paper that was presented at International Symposium on Algorithms and Computation (ISAAC) 2021.},

keywords = {WP1: Primary CPG},

pubstate = {published},

tppubtype = {article}

}

Data masking is a common technique for sanitizing sensitive data maintained in database systems which is becoming increasingly important in various application areas, such as in record linkage of personal data. This work formalizes the Pattern Masking for Dictionary Matching (PMDM) problem: given a dictionary D of d strings, each of length l, a query string q of length l, and a positive integer z, we are asked to compute a smallest set K ⊆ 1, ... , l, so that if q[i] is replaced by a wildcard for all i ∈ K , then q matches at least z strings from D. Solving PMDM allows providing data utility guarantees as opposed to existing approaches. We first show, through a reduction from the well-known k-Clique problem, that a decision version of the PMDM problem is NP-complete, even for binary strings. We thus approach the problem from a more practical perspective. We show a combinatorial $$O((dl)^|K|/3 + dl)$$-time and O(dl)-space algorithm for PMDM for |K| = O(1). In fact, we show that we cannot hope for a faster combinatorial algorithm, unless the combinatorial k-Clique hypothesis fails (Abboud et al. in SIAM J Comput 47:2527–2555, 2018; Lincoln et al., in: 29th ACM-SIAM Symposium on Discrete Algorithms (SODA), 2018). Our combinatorial algorithm, executed with small |K|, is the backbone of a greedy heuristic that we propose. Our experiments on real-world and synthetic datasets show that our heuristic finds nearly-optimal solutions in practice and is also very efficient. We also generalize this algorithm for the problem of masking multiple query strings simultaneously so that every string has at least z matches in D. PMDM can be viewed as a generalization of the decision version of the dictionary matching with mismatches problem: by querying a PMDM data structure with string q and z = 1, one obtains the minimal number of mismatches of q with any string from D. The query time or space of all known data structures for the more restricted problem of dictionary matching with at most k mismatches incurs some exponential factor with respect to k. A simple exact algorithm for PMDM runs in time $$O(2^ld)$$. We present a data structure for PMDM that answers queries over D in time $$O(2^l/2(2^l/2 + τ )l)$$ and requires space $$O(2^ld^2/τ^2 + 2^l/2d)$$, for any parameter τ ∈ [1, d]. We complement our results by showing a two-way polynomial-time reduction between PMDM and the Minimum Union problem [Chlamtáˇc et al., ACM-SIAM Symposium on Discrete Algorithms (SODA) 2017]. This gives a polynomial-time $$O(d^1/4+epsilon)$$-approximation algorithm for PMDM, which is tight under a plausible complexity conjecture. This is an extended version of a paper that was presented at International Symposium on Algorithms and Computation (ISAAC) 2021.

122 entries « ‹ 1 of 7 › »