Institut Pasteur

Location: Paris, France

Supervisor: Dr. Rayan Chikhi

ESR: Francesco Andreace

Objectives

Compacted de Bruijn graphs are natural candidates for representing pan-genome graphs. The problem of constructing compacted de Bruijn graphs has been studied extensively, both when the input is one or more genomes and when it is raw sequencing data. There is, however, no work in the literature devoted to efficiently merging two already compacted graphs into a single one. This operation is non-trivial, as it requires tracking k-mers within unitigs again in order to split or merge unitigs where needed. A typical use case is combining pan-genomes constructed from different data sources that then need to be analysed jointly. The ESR will lay the algorithmic foundations for performing this operation efficiently and will provide a software implementation.
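
To make the merge operation concrete, the following is a minimal sketch of the naive baseline only: decompose both sets of unitigs back into k-mers, take the union, and recompact. It is not the algorithm the project aims to develop; an efficient method would avoid this full decomposition. The function names (kmers_of, compact, merge_compacted) are hypothetical, and the sketch assumes a node-centric, forward-strand graph over the alphabet ACGT, ignoring reverse complements.

```python
from typing import Iterable, List, Set

ALPHABET = "ACGT"


def kmers_of(unitigs: Iterable[str], k: int) -> Set[str]:
    """Decompose unitigs back into their constituent k-mers."""
    return {u[i:i + k] for u in unitigs for i in range(len(u) - k + 1)}


def successors(kmer: str, kmers: Set[str]) -> List[str]:
    return [kmer[1:] + c for c in ALPHABET if kmer[1:] + c in kmers]


def predecessors(kmer: str, kmers: Set[str]) -> List[str]:
    return [c + kmer[:-1] for c in ALPHABET if c + kmer[:-1] in kmers]


def compact(kmers: Set[str]) -> List[str]:
    """Compact a k-mer set into unitigs (maximal non-branching paths)."""

    def starts_unitig(x: str) -> bool:
        # x starts a unitig unless it has a unique predecessor
        # whose unique successor is x.
        preds = predecessors(x, kmers)
        return len(preds) != 1 or len(successors(preds[0], kmers)) != 1

    def walk(start: str, visited: Set[str]) -> str:
        path, cur = start, start
        visited.add(start)
        while True:
            succs = successors(cur, kmers)
            if len(succs) != 1:
                break  # dead end or branch: the unitig stops here
            nxt = succs[0]
            if nxt in visited or len(predecessors(nxt, kmers)) != 1:
                break  # closed cycle or incoming branch at the next k-mer
            path += nxt[-1]
            visited.add(nxt)
            cur = nxt
        return path

    visited: Set[str] = set()
    unitigs = [walk(x, visited) for x in sorted(kmers) if starts_unitig(x)]
    # Any k-mers left over lie on isolated cycles with no branching.
    for x in sorted(kmers):
        if x not in visited:
            unitigs.append(walk(x, visited))
    return sorted(unitigs)


def merge_compacted(unitigs_a: Iterable[str], unitigs_b: Iterable[str], k: int) -> List[str]:
    """Naive merge: union of k-mers from both graphs, then recompaction."""
    return compact(kmers_of(unitigs_a, k) | kmers_of(unitigs_b, k))


# A unitig is split when the other graph introduces a branch ...
print(merge_compacted(["ACGTA"], ["CGTC"], 3))  # ['ACGT', 'GTA', 'GTC']
# ... and two unitigs are joined when they overlap without branching.
print(merge_compacted(["ACGT"], ["CGTA"], 3))   # ['ACGTA']
```

The two calls show both effects named above: in the first, the k-mer CGT gains a second successor and the unitig ACGTA must be split; in the second, the unitigs from the two graphs chain into a single longer unitig. The cost of this baseline is proportional to the total number of k-mers, which is what a dedicated merging algorithm would aim to improve on.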

Expected Results

An algorithm and software for merging pan-genome graphs. The new techniques will enable scalable analysis of large data collections by decomposing the work into smaller, tractable graph construction steps.

Efficiently merging compacted de Bruijn graphs