Publications – ALPACA ITN Project

21.

Lemane, Téo; Lezzoche, Nolan; Lecubin, Julien; Pelletier, Eric; Lescot, Magali; Chikhi, Rayan; Peterlongo, Pierre

kmindex and ORA: indexing and real-time user-friendly queries in terabyte-sized complex genomic datasets Unpublished

bioRxiv, 2023.

Tags: WP1: Primary CPG

22.

Ayad, Lorraine A K; Loukides, Grigorios; Pissis, Solon P

Text Indexing for Long Patterns: Anchors are All you Need Proceedings Article

In: pp. 2117–2131, Proceedings VLDB Endowment, 2023, ISSN: 2150-8097.

Tags: WP1: Primary CPG

@inproceedings{Ayad2023-vw,

title = {Text Indexing for Long Patterns: Anchors are All you Need},

author = {Lorraine A K Ayad and Grigorios Loukides and Solon P Pissis},

doi = {10.14778/3598581.3598586},

issn = {2150-8097},

year  = {2023},

date = {2023-05-01},

urldate = {2023-05-01},

journal = {Proceedings VLDB Endowment},

volume = {16},

number = {9},

pages = {2117--2131},

publisher = {Proceedings VLDB Endowment},

abstract = {In many real-world database systems, a large fraction of the data is represented by strings: sequences of letters over some alphabet. This is because strings can easily encode data arising from different sources. It is often crucial to represent such string datasets in a compact form but also to simultaneously enable fast pattern matching queries. This is the classic text indexing problem. The four absolute measures anyone should pay attention to when designing or implementing a text index are: (i) index space; (ii) query time; (iii) construction space; and (iv) construction time. Unfortunately, however, most (if not all) widely-used indexes (e.g., suffix tree, suffix array, or their compressed counterparts) are not optimized for all four measures simultaneously, as it is difficult to have the best of all four worlds. Here, we take an important step in this direction by showing that text indexing with locally consistent anchors (lc-anchors) offers remarkably good performance in all four measures, when we have at hand a lower bound l on the length of the queried patterns --- which is arguably a quite reasonable assumption in practical applications. Specifically, we improve on the construction of the index proposed by Loukides and Pissis, which is based on bidirectional string anchors (bd-anchors), a new type of lc-anchors, by: (i) designing an average-case linear-time algorithm to compute bd-anchors; and (ii) developing a semi-external-memory implementation to construct the index in small space using near-optimal work. We then present an extensive experimental evaluation, based on the four measures, using real benchmark datasets. The results show that, for long patterns, the index constructed using our improved algorithms compares favorably to all classic indexes: (compressed) suffix tree; (compressed) suffix array; and the FM-index.},

keywords = {WP1: Primary CPG},

pubstate = {published},

tppubtype = {inproceedings}

}

23.

Liao, Wen-Wei; Asri, Mobin; Ebler, Jana; Doerr, Daniel; Haukness, Marina; Hickey, Glenn; Lu, Shuangjia; Lucas, Julian K; Monlong, Jean; Abel, Haley J; Buonaiuto, Silvia; Chang, Xian H; Cheng, Haoyu; Chu, Justin; Colonna, Vincenza; Eizenga, Jordan M; Feng, Xiaowen; Fischer, Christian; Fulton, Robert S; Garg, Shilpa; Groza, Cristian; Guarracino, Andrea; Harvey, William T; Heumos, Simon; Howe, Kerstin; Jain, Miten; Lu, Tsung-Yu; Markello, Charles; Martin, Fergal J; Mitchell, Matthew W; Munson, Katherine M; Mwaniki, Moses Njagi; Novak, Adam M; Olsen, Hugh E; Pesout, Trevor; Porubsky, David; Prins, Pjotr; Sibbesen, Jonas A; Sirén, Jouni; Tomlinson, Chad; Villani, Flavia; Vollger, Mitchell R; Antonacci-Fulton, Lucinda L; Baid, Gunjan; Baker, Carl A; Belyaeva, Anastasiya; Billis, Konstantinos; Carroll, Andrew; Chang, Pi-Chuan; Cody, Sarah; Cook, Daniel E; Cook-Deegan, Robert M; Cornejo, Omar E; Diekhans, Mark; Ebert, Peter; Fairley, Susan; Fedrigo, Olivier; Felsenfeld, Adam L; Formenti, Giulio; Frankish, Adam; Gao, Yan; Garrison, Nanibaa' A; Giron, Carlos Garcia; Green, Richard E; Haggerty, Leanne; Hoekzema, Kendra; Hourlier, Thibaut; Ji, Hanlee P; Kenny, Eimear E; Koenig, Barbara A; Kolesnikov, Alexey; Korbel, Jan O; Kordosky, Jennifer; Koren, Sergey; Lee, Hojoon; Lewis, Alexandra P; Magalhães, Hugo; Marco-Sola, Santiago; Marijon, Pierre; McCartney, Ann; McDaniel, Jennifer; Mountcastle, Jacquelyn; Nattestad, Maria; Nurk, Sergey; Olson, Nathan D; Popejoy, Alice B; Puiu, Daniela; Rautiainen, Mikko; Regier, Allison A; Rhie, Arang; Sacco, Samuel; Sanders, Ashley D; Schneider, Valerie A; Schultz, Baergen I; Shafin, Kishwar; Smith, Michael W; Sofia, Heidi J; Tayoun, Ahmad N Abou; Thibaud-Nissen, Françoise; Tricomi, Francesca Floriana; Wagner, Justin; Walenz, Brian; Wood, Jonathan M D; Zimin, Aleksey V; Bourque, Guillaume; Chaisson, Mark J P; Flicek, Paul; Phillippy, Adam M; Zook, Justin M; Eichler, Evan E; Haussler, David; Wang, Ting; Jarvis, Erich D; Miga, Karen H; Garrison, Erik; Marschall, Tobias; Hall, Ira M; Li, Heng; Paten, Benedict

A draft human pangenome reference Journal Article

In: Nature, vol. 617, no. 7960, pp. 312–324, 2023, ISSN: 1476-4687.

Tags: WP3: Translational CPG

@article{Liao2023-do,

title = {A draft human pangenome reference},

author = {Wen-Wei Liao and Mobin Asri and Jana Ebler and Daniel Doerr and Marina Haukness and Glenn Hickey and Shuangjia Lu and Julian K Lucas and Jean Monlong and Haley J Abel and Silvia Buonaiuto and Xian H Chang and Haoyu Cheng and Justin Chu and Vincenza Colonna and Jordan M Eizenga and Xiaowen Feng and Christian Fischer and Robert S Fulton and Shilpa Garg and Cristian Groza and Andrea Guarracino and William T Harvey and Simon Heumos and Kerstin Howe and Miten Jain and Tsung-Yu Lu and Charles Markello and Fergal J Martin and Matthew W Mitchell and Katherine M Munson and Moses Njagi Mwaniki and Adam M Novak and Hugh E Olsen and Trevor Pesout and David Porubsky and Pjotr Prins and Jonas A Sibbesen and Jouni Sirén and Chad Tomlinson and Flavia Villani and Mitchell R Vollger and Lucinda L Antonacci-Fulton and Gunjan Baid and Carl A Baker and Anastasiya Belyaeva and Konstantinos Billis and Andrew Carroll and Pi-Chuan Chang and Sarah Cody and Daniel E Cook and Robert M Cook-Deegan and Omar E Cornejo and Mark Diekhans and Peter Ebert and Susan Fairley and Olivier Fedrigo and Adam L Felsenfeld and Giulio Formenti and Adam Frankish and Yan Gao and Nanibaa' A Garrison and Carlos Garcia Giron and Richard E Green and Leanne Haggerty and Kendra Hoekzema and Thibaut Hourlier and Hanlee P Ji and Eimear E Kenny and Barbara A Koenig and Alexey Kolesnikov and Jan O Korbel and Jennifer Kordosky and Sergey Koren and Hojoon Lee and Alexandra P Lewis and Hugo Magalhães and Santiago Marco-Sola and Pierre Marijon and Ann McCartney and Jennifer McDaniel and Jacquelyn Mountcastle and Maria Nattestad and Sergey Nurk and Nathan D Olson and Alice B Popejoy and Daniela Puiu and Mikko Rautiainen and Allison A Regier and Arang Rhie and Samuel Sacco and Ashley D Sanders and Valerie A Schneider and Baergen I Schultz and Kishwar Shafin and Michael W Smith and Heidi J Sofia and Ahmad N Abou Tayoun and Françoise Thibaud-Nissen and Francesca Floriana Tricomi and Justin Wagner and Brian Walenz and 
Jonathan M D Wood and Aleksey V Zimin and Guillaume Bourque and Mark J P Chaisson and Paul Flicek and Adam M Phillippy and Justin M Zook and Evan E Eichler and David Haussler and Ting Wang and Erich D Jarvis and Karen H Miga and Erik Garrison and Tobias Marschall and Ira M Hall and Heng Li and Benedict Paten},

doi = {10.1038/s41586-023-05896-x},

issn = {1476-4687},

year  = {2023},

date = {2023-05-01},

urldate = {2023-05-01},

journal = {Nature},

volume = {617},

number = {7960},

pages = {312--324},

publisher = {Springer Science and Business Media LLC},

abstract = {Here the Human Pangenome Reference Consortium presents a first draft of the human pangenome reference. The pangenome contains 47 phased, diploid assemblies from a cohort of genetically diverse individuals. These assemblies cover more than 99% of the expected sequence in each genome and are more than 99% accurate at the structural and base pair levels. Based on alignments of the assemblies, we generate a draft pangenome that captures known variants and haplotypes and reveals new alleles at structurally complex loci. We also add 119 million base pairs of euchromatic polymorphic sequences and 1,115 gene duplications relative to the existing reference GRCh38. Roughly 90 million of the additional base pairs are derived from structural variation. Using our draft pangenome to analyse short-read data reduced small variant discovery errors by 34% and increased the number of structural variants detected per haplotype by 104% compared with GRCh38-based workflows, which enabled the typing of the vast majority of structural variant alleles per sample.},

keywords = {WP3: Translational CPG},

pubstate = {published},

tppubtype = {article}

}

24.

Forgia, Marco; Navarro, Beatriz; Daghino, Stefania; Cervera, Amelia; Gisel, Andreas; Perotto, Silvia; Aghayeva, Dilzara N.; Akinyuwa, Mary F.; Gobbi, Emanuela; Zheludev, Ivan N.; Edgar, Robert C.; Chikhi, Rayan; Turina, Massimo; Babaian, Artem; Di Serio, Francesco; Peña, Marcos

Hybrids of RNA viruses and viroid-like elements replicate in fungi Journal Article

In: Nature Communications, vol. 14, no. 1, pp. 2591, 2023, ISSN: 2041-1723.

Tags: WP1: Primary CPG

@article{Forgia2023,

title = {Hybrids of RNA viruses and viroid-like elements replicate in fungi},

author = {Marco Forgia and Beatriz Navarro and Stefania Daghino and Amelia Cervera and Andreas Gisel and Silvia Perotto and Dilzara N. Aghayeva and Mary F. Akinyuwa and Emanuela Gobbi and Ivan N. Zheludev and Robert C. Edgar and Rayan Chikhi and Massimo Turina and Artem Babaian and Francesco {Di Serio} and Marcos Peña},

doi = {10.1038/s41467-023-38301-2},

issn = {2041-1723},

year  = {2023},

date = {2023-05-01},

urldate = {2023-05-01},

journal = {Nature Communications},

volume = {14},

number = {1},

pages = {2591},

publisher = {Springer Science and Business Media LLC},

abstract = {Earth’s life may have originated as self-replicating RNA, and it has been argued that RNA viruses and viroid-like elements are remnants of such pre-cellular RNA world. RNA viruses are defined by linear RNA genomes encoding an RNA-dependent RNA polymerase (RdRp), whereas viroid-like elements consist of small, single-stranded, circular RNA genomes that, in some cases, encode paired self-cleaving ribozymes. Here we show that the number of candidate viroid-like elements occurring in geographically and ecologically diverse niches is much higher than previously thought. We report that, amongst these circular genomes, fungal ambiviruses are viroid-like elements that undergo rolling circle replication and encode their own viral RdRp. Thus, ambiviruses are distinct infectious RNAs showing hybrid features of viroid-like RNAs and viruses. We also detected similar circular RNAs, containing active ribozymes and encoding RdRps, related to mitochondrial-like fungal viruses, highlighting fungi as an evolutionary hub for RNA viruses and viroid-like elements. Our findings point to a deep co-evolutionary history between RNA viruses and subviral elements and offer new perspectives in the origin and evolution of primordial infectious agents, and RNA life.},

keywords = {WP1: Primary CPG},

pubstate = {published},

tppubtype = {article}

}

25.

Mille, Marie; Ripoll, Julie; Cazaux, Bastien; Rivals, Eric

dipwmsearch: a Python package for searching di-PWM motifs Journal Article

In: Bioinformatics, vol. 39, no. 4, 2023, ISSN: 1367-4811.

Tags: WP1: Primary CPG

26.

Denti, Luca; Khorsand, Parsoa; Bonizzoni, Paola; Hormozdiari, Fereydoun; Chikhi, Rayan

SVDSS: structural variation discovery in hard-to-call genomic regions using sample-specific strings from accurate long reads Journal Article

In: Nat. Methods, vol. 20, no. 4, pp. 550–558, 2023, ISSN: 1548-7105.

Tags: WP2: Evolutionary/Comparative CPG

27.

Porubsky, David; Vollger, Mitchell R; Harvey, William T; Rozanski, Allison N; Ebert, Peter; Hickey, Glenn; Hasenfeld, Patrick; Sanders, Ashley D; Stober, Catherine; Human Pangenome Reference Consortium; Korbel, Jan O; Paten, Benedict; Marschall, Tobias; Eichler, Evan E

Gaps and complex structurally variant loci in phased genome assemblies Journal Article

In: Genome Res., vol. 33, no. 4, pp. 496–510, 2023, ISSN: 1549-5469.

Tags: WP3: Translational CPG

@article{Porubsky2023-ue,

title = {Gaps and complex structurally variant loci in phased genome assemblies},

author = {David Porubsky and Mitchell R Vollger and William T Harvey and Allison N Rozanski and Peter Ebert and Glenn Hickey and Patrick Hasenfeld and Ashley D Sanders and Catherine Stober and {Human Pangenome Reference Consortium} and Jan O Korbel and Benedict Paten and Tobias Marschall and Evan E Eichler},

doi = {10.1101/gr.277334.122},

issn = {1549-5469},

year  = {2023},

date = {2023-04-01},

urldate = {2023-04-01},

journal = {Genome Res.},

volume = {33},

number = {4},

pages = {496--510},

abstract = {There has been tremendous progress in phased genome assembly production by combining long-read data with parental information or linked-read data. Nevertheless, a typical phased genome assembly generated by trio-hifiasm still generates more than 140 gaps. We perform a detailed analysis of gaps, assembly breaks, and misorientations from 182 haploid assemblies obtained from a diversity panel of 77 unique human samples. Although trio-based approaches using HiFi are the current gold standard, chromosome-wide phasing accuracy is comparable when using Strand-seq instead of parental data. Importantly, the majority of assembly gaps cluster near the largest and most identical repeats (including segmental duplications [35.4%], satellite DNA [22.3%], or regions enriched in GA/AT-rich DNA [27.4%]). Consequently, 1513 protein-coding genes overlap assembly gaps in at least one haplotype, and 231 are recurrently disrupted or missing from five or more haplotypes. Furthermore, we estimate that 6-7 Mbp of DNA are misorientated per haplotype irrespective of whether trio-free or trio-based approaches are used. Of these misorientations, 81% correspond to bona fide large inversion polymorphisms in the human species, most of which are flanked by large segmental duplications. We also identify large-scale alignment discontinuities consistent with 11.9 Mbp of deletions and 161.4 Mbp of insertions per haploid genome. Although 99% of this variation corresponds to satellite DNA, we identify 230 regions of euchromatic DNA with frequent expansions and contractions, nearly half of which overlap with 197 protein-coding genes. Such variable and incompletely assembled regions are important targets for future algorithmic development and pangenome representation.},

keywords = {WP3: Translational CPG},

pubstate = {published},

tppubtype = {article}

}

28.

Břinda, Karel; Lima, Leandro; Pignotti, Simone; Quinones-Olvera, Natalia; Salikhov, Kamil; Chikhi, Rayan; Kucherov, Gregory; Iqbal, Zamin; Baym, Michael

Efficient and Robust Search of Microbial Genomes via Phylogenetic Compression Unpublished

bioRxiv, 2023.

Tags: WP2: Evolutionary/Comparative CPG

29.

Carr, Victoria R; Pissis, Solon P; Mullany, Peter; Shoaie, Saeed; Gomez-Cabrero, David; Moyes, David L

Palidis: fast discovery of novel insertion sequences Journal Article

In: Microb. Genom., vol. 9, no. 3, 2023, ISSN: 2057-5858.

Tags: WP2: Evolutionary/Comparative CPG

30.

Frouin, Arthur; Laporte, Fabien; Hafner, Lukas; Maury, Mylene; McCaw, Zachary R.; Julienne, Hanna; Henches, Léo; Chikhi, Rayan; Lecuit, Marc; Aschard, Hugues

ChoruMM: a versatile multi-components mixed model for bacterial-GWAS Unpublished

bioRxiv, 2023.

Tags: WP1: Primary CPG

31.

Luo, Xiao; Kang, Xiongbin; Schönhuth, Alexander

Predicting the prevalence of complex genetic diseases from individual genotype profiles using capsule networks Journal Article

In: Nat. Mach. Intell., vol. 5, no. 2, pp. 114–125, 2023, ISSN: 2522-5839.

Tags: WP3: Translational CPG

@article{Luo2023-le,

title = {Predicting the prevalence of complex genetic diseases from individual genotype profiles using capsule networks},

author = {Xiao Luo and Xiongbin Kang and Alexander Schönhuth},

doi = {10.1038/s42256-022-00604-2},

issn = {2522-5839},

year  = {2023},

date = {2023-02-01},

urldate = {2023-02-01},

journal = {Nat. Mach. Intell.},

volume = {5},

number = {2},

pages = {114--125},

publisher = {Springer Science and Business Media LLC},

abstract = {Diseases that have a complex genetic architecture tend to suffer from considerable amounts of genetic variants that, although playing a role in the disease, have not yet been revealed as such. Two major causes for this phenomenon are genetic variants that do not stack up effects, but interact in complex ways; in addition, as recently suggested, the omnigenic model postulates that variants interact in a holistic manner to establish disease phenotypes. Here we present DiseaseCapsule, as a capsule-network-based approach that explicitly addresses to capture the hierarchical structure of the underlying genome data, and has the potential to fully capture the non-linear relationships between variants and disease. DiseaseCapsule is the first such approach to operate in a whole-genome manner when predicting disease occurrence from individual genotype profiles. In experiments, we evaluated DiseaseCapsule on amyotrophic lateral sclerosis (ALS) and Parkinson’s disease, with a particular emphasis on ALS, which is known to have a complex genetic architecture and is affected by 40% missing heritability. On ALS, DiseaseCapsule achieves 86.9% accuracy on hold-out test data in predicting disease occurrence, thereby outperforming all other approaches by large margins. Also, DiseaseCapsule required sufficiently less training data for reaching optimal performance. Last but not least, the systematic exploitation of the network architecture yielded 922 genes of particular interest, and 644 ‘non-additive’ genes that are crucial factors in DiseaseCapsule, but remain masked within linear schemes.},

keywords = {WP3: Translational CPG},

pubstate = {published},

tppubtype = {article}

}

32.

Mwaniki, Njagi Moses; Pisanti, Nadia

Optimal Sequence Alignment to ED-Strings Proceedings Article

In: Bansal, Mukul S.; Cai, Zhipeng; Mangul, Serghei (Ed.): Bioinformatics Research and Applications, pp. 204–216, Springer Nature, Germany, 2023, ISSN: 0302-9743.

Tags: WP1: Primary CPG

33.

Rivals, Eric; Sweering, Michelle; Wang, Pengfei

Convergence of the number of period sets in strings Proceedings Article

In: Etessami, Kousha; Feige, Uriel; Puppis, Gabriele (Ed.): 50th International Colloquium on Automata, Languages, and Programming (ICALP 2023), pp. 100:1–100:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl, Germany, 2023, ISBN: 978-3-95977-278-5.

Tags: WP1: Primary CPG

34.

Rizzo, Nicola; Cáceres, Manuel; Mäkinen, Veli

Chaining of maximal exact matches in graphs Proceedings Article

In: Nardini, Franco Maria; Pisanti, Nadia; Venturini, Rossano (Ed.): String Processing and Information Retrieval, pp. 353–366, Springer Nature Switzerland, Cham, 2023, ISBN: 978-3-031-43980-3.

Tags: WP1: Primary CPG