Publications – ALPACA ITN Project

77 entries « ‹ 3 of 4 › »

41.

Dufresne, Yoann; Lemane, Teo; Marijon, Pierre; Peterlongo, Pierre; Rahman, Amatur; Kokot, Marek; Medvedev, Paul; Deorowicz, Sebastian; Chikhi, Rayan

The K-mer File Format: a standardized and compact disk representation of sets of k-mers Journal Article

In: Bioinformatics, vol. 38, no. 18, pp. 4423–4425, 2022, ISSN: 1367-4811.

Abstract | Links | BibTeX | Tags: WP1: Primary CPG

42.

Kang, Xiongbin; Luo, Xiao; Schönhuth, Alexander

StrainXpress: strain aware metagenome assembly from short reads Journal Article

In: Nucleic Acids Res, vol. 50, no. 17, pp. e101, 2022, ISSN: 1362-4962.

Abstract | Links | BibTeX | Tags: WP1: Primary CPG

43.

Porrelli, Stefano; Gerbault-Seureau, Michèle; Rozzi, Roberto; Chikhi, Rayan; Curaudeau, Manon; Ropiquet, Anne; Hassanin, Alexandre

Draft genome of the lowland anoa (Bubalus depressicornis) and comparison with buffalo genome assemblies (Bovidae, Bubalina) Journal Article

In: G3 Genes|Genomes|Genetics, vol. 12, iss. 11, 2022, ISSN: 2160-1836.

Abstract | Links | BibTeX | Tags: WP2: Evolutionary/Comparative CPG

@article{Porrelli2022,

title = {Draft genome of the lowland anoa (Bubalus depressicornis) and comparison with buffalo genome assemblies (Bovidae, Bubalina)},

author = {Stefano Porrelli and Michèle Gerbault-Seureau and Roberto Rozzi and Rayan Chikhi and Manon Curaudeau and Anne Ropiquet and Alexandre Hassanin},

editor = {D -J Koning},

doi = {10.1093/g3journal/jkac234},

issn = {2160-1836},

year  = {2022},

date = {2022-09-01},

urldate = {2022-09-01},

journal = {G3 Genes|Genomes|Genetics},

volume = {12},

issue = {11},

publisher = {Oxford University Press (OUP)},

abstract = {Genomic data for wild species of the genus Bubalus (Asian buffaloes) are still lacking while several whole genomes are currently available for domestic water buffaloes. To address this, we sequenced the genome of a wild endangered dwarf buffalo, the lowland anoa (Bubalus depressicornis), produced a draft genome assembly and made comparison to published buffalo genomes. The lowland anoa genome assembly was 2.56 Gbp long and contained 103,135 contigs, the longest contig being 337.39 kbp long. N50 and L50 values were 38.73 and 19.83 kbp, respectively, mean coverage was 44× and GC content was 41.74%. Two strategies were adopted to evaluate genome completeness: (1) determination of genomic features with de novo and homology-based predictions using annotations of chromosome-level genome assembly of the river buffalo and (2) employment of benchmarking against universal single-copy orthologs (BUSCO). Homology-based predictions identified 94.51% complete and 3.65% partial genomic features. De novo gene predictions identified 32,393 genes, representing 97.14% of the reference’s annotated genes, whilst BUSCO search against the mammalian orthologs database identified 71.1% complete, 11.7% fragmented, and 17.2% missing orthologs, indicating a good level of completeness for downstream analyses. Repeat analyses indicated that the lowland anoa genome contains 42.12% of repetitive regions. The genome assembly of the lowland anoa is expected to contribute to comparative genome analyses among bovid species.},

keywords = {WP2: Evolutionary/Comparative CPG},

pubstate = {published},

tppubtype = {article}

}

44.

Goga, Adrián; Baláž, Andrej

Prefix-free parsing for building large tunnelled Wheeler graphs Proceedings Article

In: Boucher, Christina; Rahmann, Sven (Ed.): 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022), pp. 18:1–18:12, Schloss Dagstuhl - Leibniz-Zentrum für Informatik, Dagstuhl, Germany, 2022, ISBN: 978-3-95977-243-3.

Abstract | Links | BibTeX | Tags: WP1: Primary CPG

@inproceedings{Goga2022-eb,

title = {Prefix-free parsing for building large tunnelled Wheeler graphs},

author = {Adrián Goga and Andrej Baláž},

editor = {Boucher, Christina and Rahmann, Sven},

doi = {10.4230/LIPIcs.WABI.2022.18},

isbn = {978-3-95977-243-3},

year  = {2022},

date = {2022-08-01},

urldate = {2022-08-01},

booktitle = {22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)},

volume = {242},

pages = {18:1--18:12},

publisher = {Schloss Dagstuhl - Leibniz-Zentrum für Informatik},

address = {Dagstuhl, Germany},

series = {Leibniz International Proceedings in Informatics (LIPIcs)},

abstract = {We propose a new technique for creating a space-efficient index for large repetitive text collections, such as pangenomic databases containing sequences of many individuals from the same species. We combine two recent techniques from this area: Wheeler graphs (Gagie et al., 2017) and prefix-free parsing (PFP, Boucher et al., 2019). Wheeler graphs are a general framework encompassing several indexes based on the Burrows-Wheeler transform (BWT), such as the FM-index. Wheeler graphs admit a succinct representation which can be further compacted by employing the idea of tunnelling, which exploits redundancies in the form of parallel, equally-labelled paths called blocks that can be merged into a single path. The problem of finding the optimal set of blocks for tunnelling, i.e. the one that minimizes the size of the resulting Wheeler graph, is known to be NP-complete and remains the most computationally challenging part of the tunnelling process. To find an adequate set of blocks in less time, we propose a new method based on the prefix-free parsing (PFP). The idea of PFP is to divide the input text into phrases of roughly equal sizes that overlap by a fixed number of characters. The phrases are then sorted lexicographically. The original text is represented by a sequence of phrase ranks (the parse) and a list of all used phrases (the dictionary). In repetitive texts, the PFP representation of the text is generally much shorter than the original since individual phrases are used many times in the parse, thus reducing the size of the dictionary. To speed up the block selection for tunnelling, we apply the PFP to obtain the parse and the dictionary of the original text, tunnel the Wheeler graph of the parse using existing heuristics and subsequently use this tunnelled parse to construct a compact Wheeler graph of the original text. Compared with constructing a Wheeler graph from the original text without PFP, our method is much faster and uses less memory on collections of pangenomic sequences. Therefore, our method enables the use of Wheeler graphs as a pangenomic reference for real-world pangenomic datasets.},

keywords = {WP1: Primary CPG},

pubstate = {published},

tppubtype = {inproceedings}

}

We propose a new technique for creating a space-efficient index for large repetitive text collections, such as pangenomic databases containing sequences of many individuals from the same species. We combine two recent techniques from this area: Wheeler graphs (Gagie et al., 2017) and prefix-free parsing (PFP, Boucher et al., 2019). Wheeler graphs are a general framework encompassing several indexes based on the Burrows-Wheeler transform (BWT), such as the FM-index. Wheeler graphs admit a succinct representation which can be further compacted by employing the idea of tunnelling, which exploits redundancies in the form of parallel, equally-labelled paths called blocks that can be merged into a single path. The problem of finding the optimal set of blocks for tunnelling, i.e. the one that minimizes the size of the resulting Wheeler graph, is known to be NP-complete and remains the most computationally challenging part of the tunnelling process. To find an adequate set of blocks in less time, we propose a new method based on the prefix-free parsing (PFP). The idea of PFP is to divide the input text into phrases of roughly equal sizes that overlap by a fixed number of characters. The phrases are then sorted lexicographically. The original text is represented by a sequence of phrase ranks (the parse) and a list of all used phrases (the dictionary). In repetitive texts, the PFP representation of the text is generally much shorter than the original since individual phrases are used many times in the parse, thus reducing the size of the dictionary. To speed up the block selection for tunnelling, we apply the PFP to obtain the parse and the dictionary of the original text, tunnel the Wheeler graph of the parse using existing heuristics and subsequently use this tunnelled parse to construct a compact Wheeler graph of the original text. Compared with constructing a Wheeler graph from the original text without PFP, our method is much faster and uses less memory on collections of pangenomic sequences. Therefore, our method enables the use of Wheeler graphs as a pangenomic reference for real-world pangenomic datasets.

45.

Charalampopoulos, Panagiotis; Pissis, Solon P.; Radoszewski, Jakub

Longest Palindromic Substring in Sublinear Time Proceedings Article

In: Bannai, Hideo; Holub, Jan (Ed.): 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022), pp. 20:1–20:9, Schloss Dagstuhl -- Leibniz-Zentrum für Informatik, Dagstuhl, Germany, 2022, ISSN: 1868-8969.

Abstract | Links | BibTeX | Tags: WP1: Primary CPG, WP2: Evolutionary/Comparative CPG

46.

Bernardini, Giulia; Conte, Alessio; Gabory, Esteban; Grossi, Roberto; Loukides, Grigorios; Pissis, Solon P.; Punzi, Giulia; Sweering, Michelle

On Strings Having the Same Length- k Substrings Proceedings Article

In: Bannai, Hideo; Holub, Jan (Ed.): 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022), pp. 16:1–16:17, Schloss Dagstuhl -- Leibniz-Zentrum für Informatik, Dagstuhl, Germany, 2022, ISSN: 1868-8969.

Abstract | Links | BibTeX | Tags: WP1: Primary CPG, WP2: Evolutionary/Comparative CPG

@inproceedings{Bernardini2022b,

title = {On Strings Having the Same Length- k Substrings},

author = {Bernardini, Giulia and Conte, Alessio and Gabory, Esteban and Grossi, Roberto and Loukides, Grigorios and Pissis, Solon P. and Punzi, Giulia and Sweering, Michelle},

editor = {Bannai, Hideo and Holub, Jan},

doi = {10.4230/LIPIcs.CPM.2022.16},

issn = {1868-8969},

year  = {2022},

date = {2022-06-22},

urldate = {2022-06-22},

booktitle = {33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022)},

volume = {223},

pages = {16:1--16:17},

publisher = {Schloss Dagstuhl -- Leibniz-Zentrum für Informatik},

address = {Dagstuhl, Germany},

series = {Leibniz International Proceedings in Informatics (LIPIcs)},

abstract = {Let Substr_k(X) denote the set of length-k substrings of a given string X for a given integer k > 0. We study the following basic string problem, called z-Shortest 𝒮_k-Equivalent Strings: Given a set 𝒮_k of n length-k strings and an integer z > 0, list z shortest distinct strings T₁,…,T_z such that Substr_k(T_i) = 𝒮_k, for all i ∈ [1,z]. The z-Shortest 𝒮_k-Equivalent Strings problem arises naturally as an encoding problem in many real-world applications; e.g., in data privacy, in data compression, and in bioinformatics. The 1-Shortest 𝒮_k-Equivalent Strings, referred to as Shortest 𝒮_k-Equivalent String, asks for a shortest string X such that Substr_k(X) = 𝒮_k.

Our main contributions are summarized below:

- Given a directed graph G(V,E), the Directed Chinese Postman (DCP) problem asks for a shortest closed walk that visits every edge of G at least once. DCP can be solved in 𝒪̃(|E||V|) time using an algorithm for min-cost flow. We show, via a non-trivial reduction, that if Shortest 𝒮_k-Equivalent String over a binary alphabet has a near-linear-time solution then so does DCP.

- We show that the length of a shortest string output by Shortest 𝒮_k-Equivalent String is in 𝒪(k+n²). We generalize this bound by showing that the total length of z shortest strings is in 𝒪(zk+zn²+z²n). We derive these upper bounds by showing (asymptotically tight) bounds on the total length of z shortest Eulerian walks in general directed graphs.

- We present an algorithm for solving z-Shortest 𝒮_k-Equivalent Strings in 𝒪(nk+n²log²n+zn²log n+|output|) time. If z = 1, the time becomes 𝒪(nk+n²log²n) by the fact that the size of the input is Θ(nk) and the size of the output is 𝒪(k+n²).},

keywords = {WP1: Primary CPG, WP2: Evolutionary/Comparative CPG},

pubstate = {published},

tppubtype = {inproceedings}

}

47.

Teramo, Antonella; Binatti, Andrea; Ciabatti, Elena; Schiavoni, Gianluca; Tarrini, Giulia; Baril`a, Gregorio; Calabretto, Giulia; Vicenzetto, Cristina; Gasparini, Vanessa Rebecca; Facco, Monica; Petrini, Iacopo; Grossi, Roberto; Pisanti, Nadia; Bortoluzzi, Stefania; Falini, Brunangelo; Tiacci, Enrico; Galimberti, Sara; Semenzato, Gianpietro; Zambello, Renato

Defining TCRγδlymphoproliferative disorders by combined immunophenotypic and molecular evaluation Journal Article

In: Nature Communications, vol. 13, no. 1, pp. 3298, 2022, ISBN: 2041-1723.

Abstract | Links | BibTeX | Tags: WP3: Translational CPG

48.

Bernardini, Giulia; Chen, Huiping; Loukides, Grigorios; Pissis, Solon P.; Stougie, Leen; Sweering, Michelle

Making de Bruijn Graphs Eulerian Proceedings Article

In: Bannai, Hideo; Holub, Jan (Ed.): 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022), pp. 12:1–12:18, Schloss Dagstuhl -- Leibniz-Zentrum für Informatik, Dagstuhl, Germany, 2022, ISSN: 1868-8969.

Abstract | Links | BibTeX | Tags: WP1: Primary CPG, WP2: Evolutionary/Comparative CPG

@inproceedings{Bernardini2022c,

title = {Making de Bruijn Graphs Eulerian},

author = {Bernardini, Giulia and Chen, Huiping and Loukides, Grigorios and Pissis, Solon P. and Stougie, Leen and Sweering, Michelle},

editor = {Bannai, Hideo and Holub, Jan},

doi = {10.4230/LIPIcs.CPM.2022.12},

issn = {1868-8969},

year  = {2022},

date = {2022-06-06},

urldate = {2022-06-06},

booktitle = {33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022)},

volume = {223},

pages = {12:1--12:18},

publisher = {Schloss Dagstuhl -- Leibniz-Zentrum für Informatik},

address = {Dagstuhl, Germany},

series = {Leibniz International Proceedings in Informatics (LIPIcs)},

abstract = {A directed multigraph is called Eulerian if it has a circuit which uses each edge exactly once. Euler’s theorem tells us that a weakly connected directed multigraph is Eulerian if and only if every node is balanced. Given a collection S of strings over an alphabet Σ, the de Bruijn graph (dBG) of order k of S is a directed multigraph $G_{S,k}(V,E)$, where V is the set of length-(k-1) substrings of the strings in S, and $G_{S,k}$ contains an edge (u,v) with multiplicity $m_{u,v}$, if and only if the string u[0]⋅ v is equal to the string u⋅ v[k-2] and this string occurs exactly $m_{u,v}$ times in total in strings in S. Let $G_{Σ,k}(V_{Σ,k},E_{Σ,k})$ be the complete dBG of $Σ^k$. The Eulerian Extension (EE) problem on $G_{S,k}$ asks to extend $G_{S,k}$ with a set $cal{B}$ of nodes from $V_{Σ,k}$ and a smallest multiset $cal{A}$ of edges from $E_{Σ,k}$ to make it Eulerian. Note that extending dBGs is algorithmically much more challenging than extending general directed multigraphs because some edges in dBGs are by definition forbidden. Extending dBGs lies at the heart of sequence assembly [Medvedev et al., WABI 2007], one of the most important tasks in bioinformatics. The novelty of our work with respect to existing works is that we allow not only to duplicate existing edges of $G_{S,k}$ but to also add novel edges and nodes, in an effort to (i) connect multiple components and (ii) reduce the total EE cost. It is easy to show that EE on $G_{S,k}$ is NP-hard via a reduction from shortest common superstring. We further show that EE remains NP-hard, even when we are not allowed to add new nodes, via a highly non-trivial reduction from 3-SAT. We thus investigate the following two problems underlying EE in dBGs:

1) When $G_{S,k}$ is not weakly connected, we are asked to connect its d > 1 components using a minimum-weight spanning tree, whose edges are paths on the underlying $G_{Σ,k}$ and weights are the corresponding path lengths. This way of connecting guarantees that no new unbalanced node is added. We show that this problem can be solved in $O(|V|klog d+|E|)$ time, which is nearly optimal, since the size of $G_{S,k}$ is $Θ(|V|k+|E|)$.

2) When $G_{S,k}$ is not balanced, we are asked to extend $G_{S,k}$ to $H_{S,k}(V cup cal{B},E cup cal{A})$ such that every node of $H_{S,k}$ is balanced and the total number $|cal{A}|$ of added edges is minimized. We show that this problem can be solved in the optimal $O(k|V| + |E|+ |cal{A}|)$ time. Let us stress that, although our main contributions are theoretical, the algorithms we design for the above two problems are practical. We combine the two algorithms in one method that makes any dBG Eulerian; and show experimentally that the cost of the obtained feasible solutions on real-world dBGs is substantially smaller than the corresponding cost obtained by existing greedy approaches.},

keywords = {WP1: Primary CPG, WP2: Evolutionary/Comparative CPG},

pubstate = {published},

tppubtype = {inproceedings}

}

A directed multigraph is called Eulerian if it has a circuit which uses each edge exactly once. Euler’s theorem tells us that a weakly connected directed multigraph is Eulerian if and only if every node is balanced. Given a collection S of strings over an alphabet Σ, the de Bruijn graph (dBG) of order k of S is a directed multigraph $G_{S,k}(V,E)$, where V is the set of length-(k-1) substrings of the strings in S, and $G_{S,k}$ contains an edge (u,v) with multiplicity $m_{u,v}$, if and only if the string u[0]⋅ v is equal to the string u⋅ v[k-2] and this string occurs exactly $m_{u,v}$ times in total in strings in S. Let $G_{Σ,k}(V_{Σ,k},E_{Σ,k})$ be the complete dBG of $Σ^k$. The Eulerian Extension (EE) problem on $G_{S,k}$ asks to extend $G_{S,k}$ with a set $cal{B}$ of nodes from $V_{Σ,k}$ and a smallest multiset $cal{A}$ of edges from $E_{Σ,k}$ to make it Eulerian. Note that extending dBGs is algorithmically much more challenging than extending general directed multigraphs because some edges in dBGs are by definition forbidden. Extending dBGs lies at the heart of sequence assembly [Medvedev et al., WABI 2007], one of the most important tasks in bioinformatics. The novelty of our work with respect to existing works is that we allow not only to duplicate existing edges of $G_{S,k}$ but to also add novel edges and nodes, in an effort to (i) connect multiple components and (ii) reduce the total EE cost. It is easy to show that EE on $G_{S,k}$ is NP-hard via a reduction from shortest common superstring. We further show that EE remains NP-hard, even when we are not allowed to add new nodes, via a highly non-trivial reduction from 3-SAT. We thus investigate the following two problems underlying EE in dBGs:
1) When $G_{S,k}$ is not weakly connected, we are asked to connect its d > 1 components using a minimum-weight spanning tree, whose edges are paths on the underlying $G_{Σ,k}$ and weights are the corresponding path lengths. This way of connecting guarantees that no new unbalanced node is added. We show that this problem can be solved in $O(|V|klog d+|E|)$ time, which is nearly optimal, since the size of $G_{S,k}$ is $Θ(|V|k+|E|)$.
2) When $G_{S,k}$ is not balanced, we are asked to extend $G_{S,k}$ to $H_{S,k}(V cup cal{B},E cup cal{A})$ such that every node of $H_{S,k}$ is balanced and the total number $|cal{A}|$ of added edges is minimized. We show that this problem can be solved in the optimal $O(k|V| + |E|+ |cal{A}|)$ time. Let us stress that, although our main contributions are theoretical, the algorithms we design for the above two problems are practical. We combine the two algorithms in one method that makes any dBG Eulerian; and show experimentally that the cost of the obtained feasible solutions on real-world dBGs is substantially smaller than the corresponding cost obtained by existing greedy approaches.

49.

Sládeček, T.; Gažiová, M.; Pös, O.; Pös, Z.; Budiš, J.; Radvánsky, J.; Szemes, T.

Combination of expert decision systems with artificial intelligence leads to superior accuracy of automated prediction of clinical effect of copy number variation Presentation

Poster Presentation at ESHG, 01.06.2022.

BibTeX | Tags: Misc, WP3: Translational CPG

50.

Bernardini, Giulia; Gawrychowski, Paweł; Pisanti, Nadia; Pissis, Solon P.; Rosone, Giovanna

Elastic-Degenerate String Matching via Fast Matrix Multiplication Journal Article

In: SIAM Journal on Computing, vol. 51, no. 3, pp. 549–576, 2022.

Abstract | Links | BibTeX | Tags: WP1: Primary CPG

@article{Bernardini2022,

title = {Elastic-Degenerate String Matching via Fast Matrix Multiplication},

author = {Giulia Bernardini and Paweł Gawrychowski and Nadia Pisanti and Solon P. Pissis and Giovanna Rosone},

doi = {10.1137/20m1368033},

year  = {2022},

date = {2022-05-01},

urldate = {2022-05-01},

journal = {SIAM Journal on Computing},

volume = {51},

number = {3},

pages = {549--576},

publisher = {Society for Industrial & Applied Mathematics (SIAM)},

abstract = {An elastic-degenerate (ED) string is a sequence of $n$ sets of strings of total length $N$ which was recently proposed to model a set of similar sequences. The ED string matching (EDSM) problem is to find all occurrences of a pattern of length $m$ in an ED text. The EDSM problem has recently received some attention in the combinatorial pattern matching community, and an $mathcal{O}(nm^{1.5}sqrt{log m} + N)$-time algorithm is known [Aoyama et al., CPM 2018]. The standard assumption in the prior work on this question is that $N$ is substantially larger than both $n$ and $m$, and thus we would like to have a linear dependency on the former. Under this assumption, the natural open problem is whether we can decrease the 1.5 exponent in the time complexity, similarly as in the related (but, to the best of our knowledge, not equivalent) word break problem [Backurs and Indyk, FOCS 2016]. Our starting point is a conditional lower bound for the EDSM problem. We use the popular combinatorial Boolean matrix multiplication (BMM) conjecture stating that there is no truly subcubic combinatorial algorithm for BMM [Abboud and Williams, FOCS 2014]. By designing an appropriate reduction, we show that a combinatorial algorithm solving the EDSM problem in $mathcal{O}(nm^{1.5-epsilon} + N)$ time, for any $epsilon>0$, refutes this conjecture. Our reduction should be understood as an indication that decreasing the exponent requires fast matrix multiplication. String periodicity and fast Fourier transform are two standard tools in string algorithms. Our main technical contribution is that we successfully combine these tools with fast matrix multiplication to design a noncombinatorial $tilde{mathcal{O}}(nm^{omega-1}+N)$-time algorithm for EDSM, where $omega$ denotes the matrix multiplication exponent and the $tilde{mathcal{O}}(cdot)$ notation suppresses polylog factors. To the best of our knowledge, we are the first to combine these tools. In particular, using the fact that $omega<2.373$ [Alman and Williams, SODA 2021; Le Gall, ISSAC 2014; Williams, STOC 2012], we obtain an $mathcal{O}(nm^{1.373} + N)$-time algorithm for EDSM. An important building block in our solution that might find applications in other problems is a method of selecting a small set of length-$ell$ substrings of the pattern, called anchors, so that any occurrence of a string from an ED text set contains at least one but not too many (on average) such anchors inside.},

keywords = {WP1: Primary CPG},

pubstate = {published},

tppubtype = {article}

}

An elastic-degenerate (ED) string is a sequence of $n$ sets of strings of total length $N$ which was recently proposed to model a set of similar sequences. The ED string matching (EDSM) problem is to find all occurrences of a pattern of length $m$ in an ED text. The EDSM problem has recently received some attention in the combinatorial pattern matching community, and an $mathcal{O}(nm^{1.5}sqrt{log m} + N)$-time algorithm is known [Aoyama et al., CPM 2018]. The standard assumption in the prior work on this question is that $N$ is substantially larger than both $n$ and $m$, and thus we would like to have a linear dependency on the former. Under this assumption, the natural open problem is whether we can decrease the 1.5 exponent in the time complexity, similarly as in the related (but, to the best of our knowledge, not equivalent) word break problem [Backurs and Indyk, FOCS 2016]. Our starting point is a conditional lower bound for the EDSM problem. We use the popular combinatorial Boolean matrix multiplication (BMM) conjecture stating that there is no truly subcubic combinatorial algorithm for BMM [Abboud and Williams, FOCS 2014]. By designing an appropriate reduction, we show that a combinatorial algorithm solving the EDSM problem in $mathcal{O}(nm^{1.5-epsilon} + N)$ time, for any $epsilon>0$, refutes this conjecture. Our reduction should be understood as an indication that decreasing the exponent requires fast matrix multiplication. String periodicity and fast Fourier transform are two standard tools in string algorithms. Our main technical contribution is that we successfully combine these tools with fast matrix multiplication to design a noncombinatorial $tilde{mathcal{O}}(nm^{omega-1}+N)$-time algorithm for EDSM, where $omega$ denotes the matrix multiplication exponent and the $tilde{mathcal{O}}(cdot)$ notation suppresses polylog factors. To the best of our knowledge, we are the first to combine these tools. In particular, using the fact that $omega<2.373$ [Alman and Williams, SODA 2021; Le Gall, ISSAC 2014; Williams, STOC 2012], we obtain an $mathcal{O}(nm^{1.373} + N)$-time algorithm for EDSM. An important building block in our solution that might find applications in other problems is a method of selecting a small set of length-$ell$ substrings of the pattern, called anchors, so that any occurrence of a string from an ED text set contains at least one but not too many (on average) such anchors inside.

51.

Charalampopoulos, Panagiotis; Iliopoulos, Costas S.; Kociumaka, Tomasz; Pissis, Solon P.; Radoszewski, Jakub; Straszy'nski, Juliusz

Efficient Computation of Sequence Mappability Journal Article

In: Algorithmica, vol. 84, no. 5, pp. 1418–1440, 2022, ISBN: 1432-0541.

Abstract | Links | BibTeX | Tags: WP2: Evolutionary/Comparative CPG

52.

Meyer, Fernando; Fritz, Adrian; Deng, Zhi-Luo; Koslicki, David; Lesker, Till Robin; Gurevich, Alexey; Robertson, Gary; Alser, Mohammed; Antipov, Dmitry; Beghini, Francesco; Bertrand, Denis; Brito, Jaqueline J.; Brown, C. Titus; Buchmann, Jan; Buluç, Aydin; Chen, Bo; Chikhi, Rayan; Clausen, Philip T. L. C.; Cristian, Alexandru; Dabrowski, Piotr Wojciech; Darling, Aaron E.; Egan, Rob; Eskin, Eleazar; Georganas, Evangelos; Goltsman, Eugene; Gray, Melissa A.; Hansen, Lars Hestbjerg; Hofmeyr, Steven; Huang, Pingqin; Irber, Luiz; Jia, Huijue; Jørgensen, Tue Sparholt; Kieser, Silas D.; Klemetsen, Terje; Kola, Axel; Kolmogorov, Mikhail; Korobeynikov, Anton; Kwan, Jason; LaPierre, Nathan; Lemaitre, Claire; Li, Chenhao; Limasset, Antoine; Malcher-Miranda, Fabio; Mangul, Serghei; Marcelino, Vanessa R.; Marchet, Camille; Marijon, Pierre; Meleshko, Dmitry; Mende, Daniel R.; Milanese, Alessio; Nagarajan, Niranjan; Nissen, Jakob; Nurk, Sergey; Oliker, Leonid; Paoli, Lucas; Peterlongo, Pierre; Piro, Vitor C.; Porter, Jacob S.; Rasmussen, Simon; Rees, Evan R.; Reinert, Knut; Renard, Bernhard; Robertsen, Espen Mikal; Rosen, Gail L.; Ruscheweyh, Hans-Joachim; Sarwal, Varuni; Segata, Nicola; Seiler, Enrico; Shi, Lizhen; Sun, Fengzhu; Sunagawa, Shinichi; Sørensen, Søren Johannes; Thomas, Ashleigh; Tong, Chengxuan; Trajkovski, Mirko; Tremblay, Julien; Uritskiy, Gherman; Vicedomini, Riccardo; Wang, Zhengyang; Wang, Ziye; Wang, Zhong; Warren, Andrew; Willassen, Nils Peder; Yelick, Katherine; You, Ronghui; Zeller, Georg; Zhao, Zhengqiao; Zhu, Shanfeng; Zhu, Jie; Garrido-Oter, Ruben; Gastmeier, Petra; Hacquard, Stephane; Häußler, Susanne; Khaledi, Ariane; Maechler, Friederike; Mesny, Fantin; Radutoiu, Simona; Schulze-Lefert, Paul; Smit, Nathiana; Strowig, Till; Bremges, Andreas; Sczyrba, Alexander; McHardy, Alice Carolyn

Critical Assessment of Metagenome Interpretation: the second round of challenges Journal Article

In: Nature Methods, vol. 19, no. 4, pp. 429–440, 2022, ISSN: 1548-7105.

Abstract | Links | BibTeX | Tags: WP1: Primary CPG

@article{Meyer2022,

title = {Critical Assessment of Metagenome Interpretation: the second round of challenges},

author = {Fernando Meyer and Adrian Fritz and Zhi-Luo Deng and David Koslicki and Till Robin Lesker and Alexey Gurevich and Gary Robertson and Mohammed Alser and Dmitry Antipov and Francesco Beghini and Denis Bertrand and Jaqueline J. Brito and C. Titus Brown and Jan Buchmann and Aydin Buluç and Bo Chen and Rayan Chikhi and Philip T. L. C. Clausen and Alexandru Cristian and Piotr Wojciech Dabrowski and Aaron E. Darling and Rob Egan and Eleazar Eskin and Evangelos Georganas and Eugene Goltsman and Melissa A. Gray and Lars Hestbjerg Hansen and Steven Hofmeyr and Pingqin Huang and Luiz Irber and Huijue Jia and Tue Sparholt Jørgensen and Silas D. Kieser and Terje Klemetsen and Axel Kola and Mikhail Kolmogorov and Anton Korobeynikov and Jason Kwan and Nathan LaPierre and Claire Lemaitre and Chenhao Li and Antoine Limasset and Fabio Malcher-Miranda and Serghei Mangul and Vanessa R. Marcelino and Camille Marchet and Pierre Marijon and Dmitry Meleshko and Daniel R. Mende and Alessio Milanese and Niranjan Nagarajan and Jakob Nissen and Sergey Nurk and Leonid Oliker and Lucas Paoli and Pierre Peterlongo and Vitor C. Piro and Jacob S. Porter and Simon Rasmussen and Evan R. Rees and Knut Reinert and Bernhard Renard and Espen Mikal Robertsen and Gail L. Rosen and Hans-Joachim Ruscheweyh and Varuni Sarwal and Nicola Segata and Enrico Seiler and Lizhen Shi and Fengzhu Sun and Shinichi Sunagawa and Søren Johannes Sørensen and Ashleigh Thomas and Chengxuan Tong and Mirko Trajkovski and Julien Tremblay and Gherman Uritskiy and Riccardo Vicedomini and Zhengyang Wang and Ziye Wang and Zhong Wang and Andrew Warren and Nils Peder Willassen and Katherine Yelick and Ronghui You and Georg Zeller and Zhengqiao Zhao and Shanfeng Zhu and Jie Zhu and Ruben Garrido-Oter and Petra Gastmeier and Stephane Hacquard and Susanne Häußler and Ariane Khaledi and Friederike Maechler and Fantin Mesny and Simona Radutoiu and Paul Schulze-Lefert and Nathiana Smit and Till Strowig and Andreas Bremges and Alexander Sczyrba and Alice Carolyn McHardy},

doi = {10.1038/s41592-022-01431-4},

issn = {1548-7105},

year = {2022},

date = {2022-04-01},

urldate = {2022-04-01},

journal = {Nature Methods},

volume = {19},

number = {4},

pages = {429–440},

publisher = {Springer Science and Business Media LLC},

abstract = {Evaluating metagenomic software is key for optimizing metagenome interpretation and focus of the Initiative for the Critical Assessment of Metagenome Interpretation (CAMI). The CAMI II challenge engaged the community to assess methods on realistic and complex datasets with long- and short-read sequences, created computationally from around 1,700 new and known genomes, as well as 600 new plasmids and viruses. Here we analyze 5,002 results by 76 program versions. Substantial improvements were seen in assembly, some due to long-read data. Related strains still were challenging for assembly and genome recovery through binning, as was assembly quality for the latter. Profilers markedly matured, with taxon profilers and binners excelling at higher bacterial ranks, but underperforming for viruses and Archaea. Clinical pathogen detection results revealed a need to improve reproducibility. Runtime and memory usage analyses identified efficient programs, including top performers with other metrics. The results identify challenges and guide researchers in selecting methods for analyses.},

keywords = {WP1: Primary CPG},

pubstate = {published},

tppubtype = {article}

}

53.

Sarkar, Aditya; Giros, Augustin; Mockly, Louis; Moore, Jaden; Moore, Andrew; Nagareddy, Anish; Fu, Boyang; Fiscutean, Andrada; Chhugani, Karishma; Darci-Maher, Nicholas; Patel, Yesha M.; Sarwal, Varuni; Chang, Yutong; Ginjala, Srishti; Garmire, Lana X.; Bao, Riyue; Sankararaman, Sriram; Chikhi, Rayan; Mangul, Serghei

Unlocking the microblogging potential for science and medicine Unpublished

bioRxiv, 2022.

Abstract | Links | BibTeX | Tags: WP1: Primary CPG

@unpublished{Sarkar2022,

title = {Unlocking the microblogging potential for science and medicine},

author = {Aditya Sarkar and Augustin Giros and Louis Mockly and Jaden Moore and Andrew Moore and Anish Nagareddy and Boyang Fu and Andrada Fiscutean and Karishma Chhugani and Nicholas Darci-Maher and Yesha M. Patel and Varuni Sarwal and Yutong Chang and Srishti Ginjala and Lana X. Garmire and Riyue Bao and Sriram Sankararaman and Rayan Chikhi and Serghei Mangul},

doi = {10.1101/2022.04.22.488804},

year  = {2022},

date = {2022-04-01},

urldate = {2022-04-01},

publisher = {Cold Spring Harbor Laboratory},

abstract = {Microblogging platform Twitter allows researchers to showcase their work, receive constructive feedback, find jobs, and build scientific collaborations. While existing literature has analyzed the benefits of Twitter in the development and distribution of scientific knowledge, most of the studies only took into account a limited number of researchers, which affected the generalizability of derived results. Our study analyzed the activity of 6,000 biomedical scientists on Twitter using data-driven approaches, a third of whom were female. Furthermore, we estimated that up to a quarter of the members of the scientific community are engaged on Twitter. While the number of male scientists joining the microblogging platform every year has decreased, the number of female scientists has remained roughly the same. Scientists are very selective in who they are following as compared to the general public. We also found that the type of tweets and retweets one posts may affect the number of followers, specifically, that a moderate to high level of professionalism and a high level of positivity is correlated with followers count. Moreover, female scientists send fewer negative tweets as compared to male scientists (31.1% for females and 34.7% for males). Our analysis could provide insights and launch a conversation on the advantages and limitations of using Twitter for disseminating scientific information and engaging in constructive discussion and collaborations within the scientific community.},

howpublished = {bioRxiv},

keywords = {WP1: Primary CPG},

pubstate = {published},

tppubtype = {unpublished}

}

54.

Baaijens, Jasmijn A.; Bonizzoni, Paola; Boucher, Christina; Vedova, Gianluca Della; Pirola, Yuri; Rizzi, Raffaella; Sirén, Jouni

Computational graph pangenomics: a tutorial on data structures and their applications Journal Article

In: Natural Computing, vol. 21, no. 1, pp. 81–108, 2022, ISBN: 1572-9796.

Abstract | Links | BibTeX | Tags: WP1: Primary CPG

@article{Baaijens2022,

title = {Computational graph pangenomics: a tutorial on data structures and their applications},

author = {Jasmijn A. Baaijens and Paola Bonizzoni and Christina Boucher and Gianluca Della Vedova and Yuri Pirola and Raffaella Rizzi and Jouni Sirén},

doi = {10.1007/s11047-022-09882-6},

isbn = {1572-9796},

year  = {2022},

date = {2022-03-01},

urldate = {2022-03-01},

journal = {Natural Computing},

volume = {21},

number = {1},

pages = {81--108},

abstract = {Computational pangenomics is an emerging research field that is changing the way computer scientists are facing challenges in biological sequence analysis. In past decades, contributions from combinatorics, stringology, graph theory and data structures were essential in the development of a plethora of software tools for the analysis of the human genome. These tools allowed computational biologists to approach ambitious projects at population scale, such as the 1000 Genomes Project. A major contribution of the 1000 Genomes Project is the characterization of a broad spectrum of genetic variations in the human genome, including the discovery of novel variations in the South Asian, African and European populations---thus enhancing the catalogue of variability within the reference genome. Currently, the need to take into account the high variability in population genomes as well as the specificity of an individual genome in a personalized approach to medicine is rapidly pushing the abandonment of the traditional paradigm of using a single reference genome. A graph-based representation of multiple genomes, or a graph pangenome, is replacing the linear reference genome. This means completely rethinking well-established procedures to analyze, store, and access information from genome representations. Properly addressing these challenges is crucial to face the computational tasks of ambitious healthcare projects aiming to characterize human diversity by sequencing 1M individuals (Stark et al. 2019). This tutorial aims to introduce readers to the most recent advances in the theory of data structures for the representation of graph pangenomes. We discuss efficient representations of haplotypes and the variability of genotypes in graph pangenomes, and highlight applications in solving computational problems in human and microbial (viral) pangenomes.},

keywords = {WP1: Primary CPG},

pubstate = {published},

tppubtype = {article}

}

55.

Luo, Xiao; Kang, Xiongbin; Schönhuth, Alexander

Strainline: full-length de novo viral haplotype reconstruction from noisy long reads Journal Article

In: Genome Biology, vol. 23, iss. 1, no. 29, pp. 1–27, 2022, ISSN: 1474-760X.

Abstract | Links | BibTeX | Tags: WP1: Primary CPG, WP3: Translational CPG

56.

Gažiová, M.; Sládeček, T.; Pös, O.; Števko, M.; Krampl, W.; Pös, Z.; Hekel, R.; Hlavačka, M.; Kucharík, M.; Radvánszky, J.; Budiš, J.; Szemes, T.

Automated prediction of the clinical impact of structural copy number variations Journal Article

In: Scientific Reports, vol. 12, no. 1, pp. 555, 2022, ISSN: 2045-2322.

Abstract | Links | BibTeX | Tags: Misc, WP3: Translational CPG

57.

Rizzo, Nicola; Mäkinen, Veli

Linear Time Construction of Indexable Elastic Founder Graphs Proceedings Article

In: Bazgan, Cristina; Fernau, Henning (Ed.): Combinatorial Algorithms, pp. 480–493, Springer International Publishing, Cham, 2022, ISBN: 978-3-031-06678-8.

Abstract | Links | BibTeX | Tags: WP1: Primary CPG

@inproceedings{Rizzo2022,

title = {Linear Time Construction of Indexable Elastic Founder Graphs},

author = {Nicola Rizzo and Veli Mäkinen},

editor = {Cristina Bazgan and Henning Fernau},

url = {https://arxiv.org/abs/2201.06492},

doi = {10.1007/978-3-031-06678-8_35},

isbn = {978-3-031-06678-8},

year  = {2022},

date = {2022-01-01},

urldate = {2022-01-01},

booktitle = {Combinatorial Algorithms},

pages = {480--493},

publisher = {Springer International Publishing},

address = {Cham},

abstract = {Pattern matching on graphs has been widely studied lately due to its importance in genomics applications. Unfortunately, even the simplest problem of deciding if a string appears as a subpath of a graph admits a quadratic lower bound under the Orthogonal Vectors Hypothesis (Equi et al. ICALP 2019, SOFSEM 2021). To avoid this bottleneck, the research has shifted towards more specific graph classes, e.g. those induced from multiple sequence alignments (MSAs). Consider segmenting $mathsf{MSA}[1..m,1..n]$ into $b$ blocks $mathsf{MSA}[1..m,1..j_1]$, $mathsf{MSA}[1..m,j_1+1..j_2]$, $ldots$, $mathsf{MSA}[1..m,j_b-1+1..n]$. The distinct strings in the rows of the blocks, after the removal of gap symbols, form the nodes of an elastic founder graph (EFG) where the edges represent the original connections observed in the MSA. An EFG is called indexable if a node label occurs as a prefix of only those paths that start from a node of the same block. Equi et al. (ISAAC 2021) showed that such EFGs support fast pattern matching and gave an $O(mn log m)$-time algorithm for preprocessing the MSA in a way that allows the construction of indexable EFGs maximizing the number of blocks and, alternatively, minimizing the maximum length of a block, in $O(n)$ and $O(n loglog n)$ time respectively. Using the suffix tree and solving a novel ancestor problem on trees, we improve the preprocessing to $O(mn)$ time and the $O(n loglog n)$-time EFG construction to $O(n)$ time, thus showing that both types of indexable EFGs can be constructed in time linear in the input size.},

howpublished = {arXiv},

keywords = {WP1: Primary CPG},

pubstate = {published},

tppubtype = {inproceedings}

}

58.

Mwaniki, Njagi Moses; Garrison, Erik; Pisanti, Nadia

Fast Exact String to D-Texts Alignments Unpublished

arXiv, 2022.

Abstract | Links | BibTeX | Tags: WP1: Primary CPG

59.

Zhong, Haodi; Loukides, Grigorios; Pissis, Solon P.

Clustering sequence graphs Journal Article

In: Data & Knowledge Engineering, vol. 138, pp. 101981, 2022, ISSN: 0169-023X.

Abstract | Links | BibTeX | Tags: WP2: Evolutionary/Comparative CPG

60.

Cartes, Jorge Avila; Anand, Santosh; Ciccolella, Simone; Bonizzoni, Paola; Vedova, Gianluca Della

Accurate and fast clade assignment via deep learning and frequency chaos game representation Journal Article

In: GigaScience, vol. 12, 2022, ISSN: 2047-217X, (giac119).

Abstract | Links | BibTeX | Tags: WP2: Evolutionary/Comparative CPG

@article{Cartes2022,

title = {Accurate and fast clade assignment via deep learning and frequency chaos game representation},

author = {Jorge Avila Cartes and Santosh Anand and Simone Ciccolella and Paola Bonizzoni and Gianluca Della Vedova},

url = {https://github.com/AlgoLab/CouGaR-g},

doi = {10.1093/gigascience/giac119},

issn = {2047-217X},

year  = {2022},

date = {2022-01-01},

urldate = {2022-01-01},

journal = {GigaScience},

volume = {12},

publisher = {Cold Spring Harbor Laboratory},

abstract = {Since the beginning of the coronavirus disease 2019 pandemic, there has been an explosion of sequencing of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus, making it the most widely sequenced virus in the history. Several databases and tools have been created to keep track of genome sequences and variants of the virus; most notably, the GISAID platform hosts millions of complete genome sequences, and it is continuously expanding every day. A challenging task is the development of fast and accurate tools that are able to distinguish between the different SARS-CoV-2 variants and assign them to a clade.In this article, we leverage the frequency chaos game representation (FCGR) and convolutional neural networks (CNNs) to develop an original method that learns how to classify genome sequences that we implement into CouGaR-g, a tool for the clade assignment problem on SARS-CoV-2 sequences. On a testing subset of the GISAID, CouGaR-g achieved an $96.29%$ overall accuracy, while a similar tool, Covidex, obtained a $77,12%$ overall accuracy. As far as we know, our method is the first using deep learning and FCGR for intraspecies classification. Furthermore, by using some feature importance methods, CouGaR-g allows to identify k-mers that match SARS-CoV-2 marker variants.By combining FCGR and CNNs, we develop a method that achieves a better accuracy than Covidex (which is based on random forest) for clade assignment of SARS-CoV-2 genome sequences, also thanks to our training on a much larger dataset, with comparable running times. Our method implemented in CouGaR-g is able to detect k-mers that capture relevant biological information that distinguishes the clades, known as marker variants.The trained models can be tested online providing a FASTA file (with 1 or multiple sequences) at https://huggingface.co/spaces/BIASLab/sars-cov-2-classification-fcgr. CouGaR-g is also available at https://github.com/AlgoLab/CouGaR-g under the GPL.},

note = {giac119},

keywords = {WP2: Evolutionary/Comparative CPG},

pubstate = {published},

tppubtype = {article}

}

77 entries « ‹ 3 of 4 › »