Pan-genome representations for deep machine learning applications

Bielefeld University

Location: Bielefeld, Germany

ESR: Raghuram Dandinasivara Rangaram

Objectives

The number of sequenced genomes, and in many application areas also the volume of annotations, has reached the critical mass (hundreds of thousands of sequenced genomes) needed for the successful application of deep learning pipelines. However, deep learning relies on well-structured (for example, image-like) input to unfold its power. In this project, we will develop systematic approaches for transforming input data into highly structured representations that serve as input to deep learning pipelines. The goal is (i) to design these representations and (ii) to construct deep network architectures that make it possible to derive predictions from them.

Expected Results

Pan-genome data structures have the potential to pre-bundle ubiquitous commonalities, and therefore to help distinguish common from differential features of genomes. For example, it was previously pointed out that viewing genomes as binary vectors over paths in pan-genome graphs, or transforming frequented regions of pan-genome graphs into input features, could significantly improve performance on classification tasks. Given the power of deep learning in general and the promise of this prior work, we expect that predicting biomedical determinants and unravelling genotype-phenotype associations with these representations will offer substantial advantages over conventional approaches driven by genome-wide association studies (GWAS).
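
To make the idea of binary vectors concrete, the following is a minimal sketch under one simple reading of that representation: each genome is a path through a pan-genome graph and is encoded as a presence/absence vector over the graph's nodes. The toy paths, node IDs, and function names are hypothetical and chosen for illustration only; they are not taken from the project or from prior work.

```python
from typing import Dict, List, Tuple


def binary_node_vectors(paths: Dict[str, List[int]]) -> Tuple[List[int], Dict[str, List[int]]]:
    """Encode each genome path as a 0/1 vector over all nodes seen in any path.

    Assumption: a genome is given as the list of graph node IDs its path visits.
    Returns the sorted node IDs (the column order) and one vector per genome.
    """
    nodes = sorted({node for path in paths.values() for node in path})
    column = {node: i for i, node in enumerate(nodes)}
    vectors = {}
    for genome, path in paths.items():
        vec = [0] * len(nodes)
        for node in path:
            vec[column[node]] = 1  # node is present on this genome's path
        vectors[genome] = vec
    return nodes, vectors


def split_core_and_differential(
    nodes: List[int], vectors: Dict[str, List[int]]
) -> Tuple[List[int], List[int]]:
    """Separate nodes shared by every genome from nodes present in only some."""
    columns = list(zip(*vectors.values()))  # one tuple of 0/1 values per node
    core = [node for node, col in zip(nodes, columns) if all(col)]
    differential = [node for node, col in zip(nodes, columns) if not all(col)]
    return core, differential


# Hypothetical toy pan-genome: three genomes as paths over six nodes.
toy_paths = {
    "genome_a": [1, 2, 4, 5],
    "genome_b": [1, 3, 4, 5],
    "genome_c": [1, 2, 4, 6],
}

nodes, vectors = binary_node_vectors(toy_paths)
core, differential = split_core_and_differential(nodes, vectors)
```

In this sketch the vectors could be stacked into a genomes-by-nodes matrix and passed to a classifier, while the core/differential split illustrates how a pan-genome structure separates common from differential features.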
