1st ESR Research Platform Workshop

📅 June 15 — 16, 2023
📍 Online

ESR Research Platform workshop series is an opportunity for ALPACA fellows to share their research with each other as well as other members of the consortium in order to facilitate the possibilities of further discussions and intra-network collaborations.

The first workshop will take place online in June 15–16, 2023.

Programme

All timings shown are Central Europe Summer Time (CEST)
Thursday, June 15
9:00–9:45	Pío Sierra (University of Cambridge) De novo Transposable Elements identification in a pangenome Abstract
9:45–10:30	Pengfei Wang (CNRS) Combinatorics and Algorithms of Pangenome Abstract
10:30–10:45	Coffee break
10:45–11:30	Luca Parmigiani (Bielefeld University) A BWT-based algorithm for random de Bruijn sequence construction Abstract
11:30–12:15	Daria Frolova (EMBL) Approximating plasmid relatedness using Double Cut and Join Indel model Abstract
12:15–13:30	Lunch break
13:30–14:15	Njagi Mwaniki (University of Pisa) The Skip Edit Operation and Implementing an Elastic-Degenerate String Intersection Graph Abstract
14:15–15:00	Francesco Andreace (Institut Pasteur) Construction and representation of human pangenome graphs Abstract
15:00–15:15	Coffee break
15:15–16:00	~~Raghuram D. Rangaram (Bielefeld University)~~ CANCELLED
Friday, June 16
9:00–9:45	Tomáš Sládeček (Geneton) Combination of Expert Guidelines-based and Machine Learning-based Approaches Leads to Superior Accuracy of Automated Prediction of Clinical Effect of Copy Number Variations Abstract
9:45–10:30	Estéban Gabory (NWO-I) Elastic Degenerate Strings: Definitions, previous results, and comparison problems Abstract
10:30–10:45	Coffee break
10:45–11:30	Khodor Hannoush (INRIA) CCdBGUpdater: Updating a Compacted and Colored de Bruijn Graph Abstract
11:30–12:15	Alessia Petescia (Comenius University Bratislava) MARIA: MEMs on grAphs with r-index alignment Abstract
12:15–13:30	Lunch break
13:30–14:15	Nicola Rizzo (University of Helsinki) Chaining of Maximal Exact Matches in Graphs Abstract
14:15–15:00	Konstantinn Bonnet (HHU) strpg: towards pangenome scale graph visualizations Abstract
15:00–15:15	Coffee break
15:15–16:00	Jorge Avila (University of Milano-Bicocca) Clustering pangenome graphs with Ricci-Flow Abstract

Abstracts

De novo Transposable Elements identification in a pangenome

Pío Sierra (University of Cambridge)

Abstract: Transposable Elements (TEs) are fragments of DNA, typically a few hundred base pairs up to 20kb, that have the ability to move around in the genome, often generating duplicate copies of themselves in the process. During the last 20 years we have been using methods to identify them based on their repetitive character in a genome, together with detection based on homology and structural features. Today it is becoming common to have more than a single complete assembly for a species, and that will become even more prevalent with the introduction of pangenomes. This opens up the possibility of looking at TE detection from a new angle. In my project I develop the idea of using the structural polymorphisms that we find in pangenomes to identify TEs and create libraries of the families present in a species. My results show that this protocol is particularly well suited to detect larger, lower copy number TEs that could previously go undetected with existing de novo methods. My plan is to deliver a tool that can be used for downstream de novo TE library creation starting from a pangenome of genomes from one species or a set of closely related species.

Combinatorics and Algorithms of Pangenome

Pengfei Wang (University of Montpellier, CNRS)

Abstract: In this presentation, I will talk about my following several projects:

(1) In computational pangenomics, several data structures have been designed to store overlapping sequences taken from a set of individual reads. These data structures, such as Overlap graph, etc., must store the information of overlaps between strings. A basic research question regarding overlaps is how many combinations of overlaps do exist? Last year, we closed one related conjecture which was proposed by Guibas and Odlyzko [3] in 1981. This result contributes to a publication in ICALP’2023. This is a joint work with Michelle Sweering (Solon Pissis group in Amsterdam)

(2) Elastic degenerate strings (ED strings) have been introduced to represent a set of closely-related DNA sequences (a Pangenome), A recent paper (CWI & Uni.Pisa) shows that it takes polynomial time to decide whether two ED strings intersect or not. We would like to further investigate the multiple intersection of ED strings, that is how fast could we check whether three or more EDS strings has a nonempty intersection. First notice this is a NP-hard problem. However, is this a w[t]-hard or even XP-hard problem? I am looking into it with Estéban Gabory, Michelle Sweering, and Hilde Verbeek (Solon Pissis group in Amsterdam).

(3) A degenerate string is a string with “undetermined” symbols, which can denote arbitrary subsets of the alphabet Σ. We may consider it is a simplified version of elastic degenerate string. One application in computational pangenomics is to present the local gapless alignment (a “simplified” pangenome). We study the paper of combinatorics on partial words by Blanchet-Sadri et al. [1] and extend the results to degenerate strings. In particular, we define different periods on degenerates strings and investigate the combinatorial structures of different periods. This is a joint work with Estéban Gabory, Michelle Sweering, and Hilde Verbeek from CWI (Solon Pissis group in Amsterdam).

(4) Brinda et al.[2] infers antibiotic resistance profiles of pathogens via their closest known relatives. They provide a super rapidly way by using nanopore sequencing and alignment-free methods to sequence and compare the reads to a comprehensive DB of strains with known resistance profile. However, It may cause ambiguity when predicts lineage based on method given by the paper. We are thinking about to use phylogenetic placement (e.g. rapid phylogenetic placement with informative k-mers[4]) to improve lineage classification, and hence help with resistance prediction. I collaborate this work with Nikolai Romashchenko (our group).

[1] F. Blanchet-Sadri, Justin Fowler, Joshua D. Gafni, and Kevin H. Wilson. Combinatorics on partial word correlations. Journal of Combinatorial Theory, Series A, 117(1):607–624, 2010.
[2] Karel Brinda et al. Rapid inference of antibiotic resistance and susceptibility by genomic neighbour typing. Nature microbiology, 2020.
[3] Leonidas J. Guibas and Andrew M. Odlyzko. Periods in strings. Journal of Combinatorial Theory, Series. A, 30:19–42, 1981.
[4] Nikolai Romashchenko. Computing informative k-mers for phylogenetic placement. (Calcul de k-mers informatifs pour le placement phylogénétique). PhD thesis, University of Montpellier, France, 2021.

Construction and representation of human pangenome graphs

Francesco Andreace (Institut Pasteur)

Abstract: As a single reference genome cannot possibly represent all the variation present across human individuals, pangenome graphs have been introduced to incorporate population diversity within a wide range of genomic analyses. Several data structures have been proposed for representing collections of genomes as pangenomes, in particular graphs. In this work we collect all publicly available high-quality human haplotypes and constructed the largest human pangenome graphs to date, incorporating 52 individuals in addition to two synthetic references (CHM13 and GRCh38). We build variation graphs and de Bruijn graphs of this collection using five of the state-of-the-art tools: Bifrost, mdbg, Minigraph, Minigraph-Cactus and pggb. We examine differences in the way each of these tools represents variations between input sequences, both in terms of overall graph structure and representation of specific genetic loci. This work sheds light on key differences between pangenome graph representations, informing end-users on how to select the most appropriate graph type for their application. Joint work with Pierre Lechat, Yoann Dufresne and Rayan Chikhi.

Approximating plasmid relatedness using Double Cut and Join Indel model

Daria Frolova (EMBL-EBI)

Bacterial cells contain the chromosome, plus between 0 and 15 small circular DNA molecules which occur at copy number 1-100, and replicate independently, called plasmids. They are extremely diverse, and are known major carriers of antimicrobial resistance. Dense sequencing in laboratories and hospitals clearly show that structural changes (gene gain/loss, inversions) occur at comparable or higher rates than SNPs. This presents considerable
challenges to evolutionary and epidemiological analysis. Sophisticated maximum likelihoodand Bayesian approaches, as used in phylogenetics, are not available as we do not have goodevolutionary models.

Given that, the typical approach is to base analysis on shared sequence content, either via aligned blocks or shared kmers, and use this to define a genetic distance. These approaches create artefacts with plasmids of different size, which is a significant limitation given gene gain/loss is a common process, and plasmids sometimes fuse/cointegrate. Furthermore, the numeric values are not interpretable in terms of genetic events separating a pair of plasmids. We therefore introduce a model which captures the key components of how plasmid genomes evolve structurally – through gene/block gain or loss, and through rearrangement. We use the “Double Cut and Join Indel” (DCJ-Indel) model for measuring rearrangement distances
between genomes, which Bohnenkamper et al recently solved [1]. Plasmids are studied at a coarse level, as a sequence of integers (representing genes or aligned blocks), and the distance between two plasmids is the minimum number of rearrangement events (insertion, deletion, or “double cut and join”) between them.

We introduce a software workflow Pling, which uses the DCJ-Indel model, to calculate distances between plasmids and then cluster them. The user is allowed to specify whether to use genes as markers or to use sequence blocks found through alignment. We show that distances measured in these two ways are in general very concordant. Finally, we apply Pling to a set of plasmids from rivers, livestock and human bloodstream infections, isolated in
Oxfordshire over a period of 10 years [2].

[1] L. Bohnenkämper, M. D.V. Braga, D. Doerr, and J. Stoye, Journal of Computational Biology 28:4, 410-431 (2021)
[2] W. Matlock, S. Lipworth, K. K. Chau, M. AbuOun, L. Barker, J. Kavanagh, M. Andersson, S. Oakley, M. Morgan, D. W. Crook, D. S. Read, M. Anjum, L. P. Shaw, N. Stoesser, REHAB Consortium, UK eLife 12, e85302 (2023)

The Skip Edit Operation and Implementing an Elastic-Degenerate String Intersection Graph

Njagi Mwaniki (University of Pisa)

Abstract: An Elastic-Degenerate String (ED-string) $T$ is a sequence of n sets of strings $T[1], . . . , T[n]$ containing m strings in total whose cumulative length is $N$. It can be used to represent the outcome of a multiple sequence alignment (MSA) by highlighting gaps and variants as a list of strings that can differ in size and that can possibly include the empty string.

Partial Order Alignment (POA) lets us align a string to a directed acyclic graph allowing progressive MSA. However, the POA edit transcript introduces ambiguity by not refecting the possible elasticity of the MSA. To solve this, we introduce a skip operation to represent and track gaps, allowing us to have an exhaustive edit transcript. Further, taking l as the length of the query string, we show that this can be done in $O(N \cdot l)$ time and space.

Finally, in collaborative work with CWI, we demonstrate how to solve the ED-string intersection (EDSI) problem, which asks whether a single string is encoded by both EDS, by implementing an intersection graph between two elastic degenerate strings.

A BWT-based algorithm for random de Bruijn sequence construction

Luca Parmigiani (Bielefeld University)

Abstract: A binary de Bruijn sequence (dB sequence) of order k is a circular binary string that contains each k-length word exactly once as a substring. Most existing algorithms construct a specific dB sequence, or members of a specific class of dB sequences, representing only a tiny fraction of the complete set. The only algorithms capable of generating all dB sequences are based on finding Euler cycles in de Bruijn graphs, which can be very expensive in practice. We present an algorithm for constructing random binary dB sequences which uses the extended Burrows-Wheeler Transform. Our method is simple to implement (less than 120 lines of C++ code) and can produce random dB sequences of any order. Even though it does not output dB sequences uniformly at random, it provably outputs each dB sequence with positive probability. The algorithm runs in space and time linear in the length of the dB sequence and needs less than one second on a
laptop computer for orders up to 23, including outputting the sequence. It can be straightforwardly extended to any constant-size alphabet.
To the best of our knowledge, this is the first practical algorithm for generating random dB sequences which is capable of producing all dB sequences. Apart from its immediate usefulness in contexts where it is desirable to use a dB sequence that cannot be guessed easily (such as cryptographic protocols), we also demonstrate our algorithm’s potential in theoretical studies, giving hitherto unknown estimates of the average discrepancy of binary dB sequences.

Combination of Expert Guidelines-based and Machine Learning-based Approaches Leads to Superior Accuracy of Automated Prediction of Clinical Effect of Copy Number Variations

Tomáš Sládeček (Geneton)

Abstract: Clinical interpretation of copy number variants (CNVs) is a complex process that requires skilled clinical professionals. General recommendations have been recently released to guide the CNV interpretation based on predefined criteria to uniform the decision process. Several semiautomatic computational methods have been proposed to recommend appropriate choices, relieving clinicians of tedious searching in vast genomic databases. We have developed and evaluated such a tool called MarCNV and tested it on CNV records collected from the ClinVar database. Alternatively, the emerging machine learning-based tools, such as the recently published ISV (Interpretation of Structural Variants), showed promising ways of even fully automated predictions using the broader characterization of affected genomic elements. Such tools utilize features additional to ACMG criteria, thus providing supporting evidence and the potential to improve CNV classification. Since both approaches contribute to the evaluation of CNVs clinical impact, we propose a combined solution in the form of a decision support tool based on automated ACMG guidelines (MarCNV) supplemented by a machine learning-based pathogenicity prediction (ISV) for the classification of CNVs. We provide evidence that such a combined approach is able to reduce the number of uncertain classifications and reveal potentially incorrect classifications using automated guidelines.

Elastic Degenerate Strings: Definitions, previous results, and comparison problems

Estéban Gabory (NWO-I)

Abstract: An elastic-degenerate (ED) string $T$ is a sequence of $n$ sets $T[1], . . . , T[n]$ containing $m$ strings in total whose cumulative length is $N$. The ED string $T$ encodes for the set of strings $S_1…S_n$ with $S_i\in T[i]$ for each $i$ (we call this set its language). ED strings have been introduced to represent a set of closely-related DNA sequences, such as a pangenome. After giving a brief overview of existing results on ED strings, we will extend the study of the ED string intersection problem, presented in Njagi Mwaniki’s talk, which asks whether two ED strings have respective languages with nonempty intersection. We will present the theoretical part of the obtained results, namely some or all of the following:

Lower bounds: Unless the Strong Exponential Hypothesis is false,there is no algorithm for EDSI which is subquadratic in the size of both ED strings. We also present a more strict lower bound for a combinatorial algorithm.
Unary Case: We study the simple case where both ED strings are defined over a single letter alphabet.
Efficient Algorithm: We give a more efficient non combinatorial algorithm using fast matrix multiplication
Applications: We show how to apply our techniques for Acronym Generation

CCdBGUpdater: Updating a Compacted and Colored de Bruijn Graph

Khodor Hannoush (INRIA)

Abstract: De Bruijn graphs have proven to be a valuable tool for genomic data analysis, enabling efficient representation and investigation of DNA sequences. In recent years, compact de Bruijn graphs were proposed as a promising approach to reduce memory requirements and improve computational efficiency. In this study, we present a novel method for updating compact Bruijn graphs, which is during the optimization phase, focusing on improving performance without considering color information in this iteration of development.
Our method leverages the structural properties of compact de Bruijn graphs to streamline the update process. By capitalizing on the compact representation, we achieve a better performance on the tests that we performed. Although our current implementation does not handle colors, it serves as a stepping stone towards developing an optimized approach that encompasses color information in future iterations.

MARIA: MEMs on grAphs with r-index alignment

Alessia Petescia (Comenius University Bratislava)

Abstract: Using a single reference genome for sequence analyses can introduce the so-called reference bias. This problem can be overcome by using a pan-genome, i.e. a collection of genomes meant to be analyzed jointly.

Pan-genomes can be represented as a set of strings. Related genomes from the same species are expected to be highly similar, making them compressible by using text indexes. However, mapping against such a collection may result in a huge number of matches across the genomes, making it extremely difficult to list all of them and to use them for subsequent analyses.

Another suitable representation are pan-genome graphs, where common sub-sequences within genomes can form the nodes of a graph and each sequence is represented as a path through these nodes. Although appealing, pan-genomes are challenging to index. Including more sequences might increase the number of possible paths in the graph, leading to chimeric recombinations and increasing the indexing complexity.

Current tools perform mapping on pan-genomes either using their text representations or their graph representations, but not both. Here, we introduce MARIA, a new tool for succinctly indexing pan-genome graphs. Given a pattern, such as a MEM, and its position on a set of genomes, MARIA can quickly list all the distinct nodes in the corresponding pan-genome graph where the pattern occurs, only accounting for non-chimeric paths. Consequently, MARIA allows to leverage existing efficient and compact data structures for collection of sequences (such as the r-index) also on pan-genome graphs. We expect that MARIA will be a valuable tool for subsequent analyses, such as seeds chaining for reads mapping.

Chaining of Maximal Exact Matches in Graphs

Nicola Rizzo (University of Helsinki)

Abstract: The (co-linear) chaining of short matches between two sequences approximates alignment and was recently extended to the pangenomic setting of a sequence and a labeled graph. Current formulations in this setting have not been proven to fully capture edit distance. Through a new symmetric formulation of the problem, we show how chain Maximal Exact Matches between a sequence and an acyclic labeled graph to obtain the Longest Common Subsequence in parameterized linearithmic time. (Joint work with Manuel Càceres and Veli Mäkinen)

strpg: towards pangenome scale graph visualizations

Konstantinn Bonnet (Heinrich Heine University Düsseldorf)

Abstract: Visualizing large graphs in the million node scale and beyond remains a challenge and is relevant to multiple fields of study. In pangenomics, as databases are continuously enriched by new and high quality assemblies, there is a growing need for tools and methods tailored to handle extremely large
volumes of data. As pangenome sizes multiply, available techniques quickly hit operational limits in terms of processing time and memory. In particular,
visualizing graphs in a general, intuitive and interactive manner is useful for analysis and interpretation, yet computationally prohibitive at such a scale.
This talk will present and discuss preliminary work on a new visualization tool aiming to address these limitations by using offline techniques combining graph coarsening and partitioning with external memory, in order to minimize CPU and memory usage during interaction. In addition, it is provisioned to allow for displaying intuitive path and read alignment information, and is implemented in a portable, highly modular and extensible manner with the goal of providing a general framework for experimentation with layouting.

Clustering pangenome graphs with Ricci-Flow

Jorge Avila Cartes (University of Milano-Bicocca)

Abstract: The goal of a pangenome graph is to represent a set of genomic sequences in a compact form, by taking advantage of any shared subsequences between the input sequences, such that any of the input sequences can be represented as a path in this graph.

Regardless of the increasing development of graph builders, the comparison of different graph representations has been always based on some downstream analysis (like variant calling or read alignment against graphs), even though metrics for graph comparison exist [1], to the best of our knowledge no prior work has extended any of these approaches to the field of Pangenomics.

The comparison of pangenome graphs imposes a challenge from the combinatorial point of view since paths and labels need to be considered. For example, in [2] they study the problem of covering alignment between graphs, and the conclusion was that comparing two pan-genome representations is NP-hard if accepting the notion of covering alignment presented in their work.

Motivated by this discovery, we propose to study the comparison of two pangenome graphs using Riemannian Geometry, in particular, with the use of Ricci-Flow which was proposed in the 1980s. This area of mathematics gained significant attention in the early 2000s when Perelman published three papers proving the Poincaré Conjecture, one of the Millenium Problems.

In this work, we explore how to align variation graphs by using Riemannian Geometry. This framework can lead to defining a metric for the comparison of variation graphs, which can then be used to perform clustering.

[1] Wills, P., & Meyer, F. G. (2020). Metrics for graph comparison: a practitioner’s guide. Plos one, 15(2), e0228728.
[2] Rizzi, R., Cairo, M., Mäkinen, V., Tomescu, A. I., & Valenzuela, D. (2018). Hardness of covering alignment: phase transition in post-sequence genomics. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 16(1), 23-30.

1st ESR Research Platform Workshop