Objectives
Understanding the role of viruses is key to understanding the environment (e.g., the sea, the soil or the air) and the functioning of human, animal and plant microbiomes. Despite their comparatively small genomes, viruses pose specific challenges for constructing and mining pan-genomes because they evolve rapidly through mutation and recombination. This rapid evolution leads to high intra-host diversity: numerous viral haplotypes coexist at different frequencies within an infected individual. When considering different viral species, even in closely related families, one may face a high level of genome sequence divergence, which can translate into a complex graph for representing a pan-genome. Our goal is to design adequate data structures to store viral pan-genomes and efficient query algorithms to analyse, compare, or mine such pan-genomes. We aim to develop queries that find similarities among viral sequences in the presence of highly divergent individual genomes. For instance, mining pan-genomes in general requires enhancing the usual pattern matching queries. Advanced query methods are also needed to search for, e.g., probabilistic motifs, which have numerous applications, including the annotation of protein binding sites in DNA or RNA (for instance, for transcription factors). Such queries are useful for classifying viral genomes and for analysing virome datasets.
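To make the notion of a probabilistic motif query concrete, the sketch below scans a single linear sequence with a position probability matrix and reports windows whose log-odds score exceeds a threshold. It is only an illustration under stated assumptions: the matrix values, the uniform background, the threshold and the names `MOTIF`, `log_odds` and `scan` are hypothetical, and the code works on a plain string rather than on a pan-genome data structure, which is the problem this project actually targets.

```python
import math

# Hypothetical position probability matrix for a 3-base motif.
# The values are illustrative only and do not describe a real binding site.
MOTIF = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
BACKGROUND = 0.25  # assumed uniform background frequency for each base


def log_odds(window, motif=MOTIF, background=BACKGROUND):
    """Sum of per-position log2(motif probability / background probability)."""
    return sum(math.log2(col[base] / background) for col, base in zip(motif, window))


def scan(sequence, motif=MOTIF, threshold=2.0):
    """Return (position, window, score) for every window scoring >= threshold."""
    k = len(motif)
    hits = []
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        # Skip windows containing ambiguous bases such as N.
        if any(base not in "ACGT" for base in window):
            continue
        score = log_odds(window, motif)
        if score >= threshold:
            hits.append((i, window, score))
    return hits


if __name__ == "__main__":
    for position, window, score in scan("TTAGCTT"):
        print(position, window, round(score, 3))
```

Scanning every haplotype separately in this way repeats work on the regions that the genomes share, which is one reason a query that runs directly on a pan-genome representation is attractive.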
Expected Results
- Structures for representing pan-genomes of viral species.
- Efficient algorithms for querying pan-genomes.