Active STANDARD GRANT National Science Foundation (US)

III:Small: Expressiveness of Genome Graphs: Construction, Comparison, and Heterogeneity

$6M USD

Funder	National Science Foundation (US)
Recipient Organization	Carnegie-Mellon University
Country	United States
Start Date	Apr 01, 2023
End Date	Mar 31, 2026
Duration	1,095 days
Number of Grantees	1
Roles	Principal Investigator
Data Source	National Science Foundation (US)
Grant ID	`2232121`

Grant Description

Differences (also known as variants) in a person's genome help to determine specific characteristics such as their susceptibility to disease, their response to drugs, and other significant aspects of their biology. Similarly, differences in the genomes of bacteria and viruses help to determine their specific characteristics, such as whether they are harmful to humans or animals.

These genetic differences are important to understand and to take into account when studying the biology of an organism because they strongly influence how even individual cells in the organism function. Advances in genome sequencing technology have generated huge catalogs of such differences in many organisms, including humans. These rich repositories of genomic information cannot be fully integrated and analyzed because effective computational methods are lacking. Existing methods suffer from computational inefficiencies and lapses of accuracy when drawing conclusions from collections of genomic differences.

This project will develop new computational methods to increase accuracy and decrease computational resource requirements for storing, comparing, and evaluating catalogs of genomic differences. It will result in new scientific software that will better organize catalogs of differences to make computational analyses more tractable. It will also result in software that more accurately measures the diversity of a population of individuals and software that supports making better comparisons between populations.

The project will validate these methods by subtyping cancer tumors, by assessing the diversity of cells in various types of tumors, and by comparing populations of bacteria found in different environments. The project will result in faster, more accurate software for the analysis of many genomic differences that will advance our understanding of how genomic variants affect human health and biological processes.

To better explain the innovations developed during this project and the importance of studying genomic differences, the project will also produce a series of educational videos that will help other people understand the main ideas behind the techniques developed in this project.

Genome graphs have emerged as an important data structure in the analysis of collections of genomic variants. These are graphs in which nodes (or edges) are labeled with genomic sequences (strings) and paths in the graph represent substrings that are present in the population that the graph represents. They can be used as representations of a “reference” genome for a population of organisms.
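As a rough illustration of the data structure described above, the following sketch builds a tiny node-labeled genome graph and spells the strings of its paths. The node labels, the adjacency lists, and the helper names `spell` and `spelled_strings` are all hypothetical; this is not the project's software, and it assumes the graph is acyclic.

```python
from typing import Dict, List

# Hypothetical toy data: node id -> sequence label, and adjacency lists.
NODE_SEQ: Dict[int, str] = {1: "ACG", 2: "T", 3: "G", 4: "TTA"}
EDGES: Dict[int, List[int]] = {1: [2, 3], 2: [4], 3: [4], 4: []}


def spell(path: List[int]) -> str:
    """Concatenate node labels along a path, checking each step is an edge."""
    for u, v in zip(path, path[1:]):
        if v not in EDGES[u]:
            raise ValueError(f"no edge {u} -> {v}")
    return "".join(NODE_SEQ[n] for n in path)


def spelled_strings(source: int, sink: int) -> List[str]:
    """Enumerate strings spelled by all source-to-sink paths (acyclic graph assumed)."""
    results: List[str] = []
    stack = [[source]]
    while stack:
        path = stack.pop()
        if path[-1] == sink:
            results.append(spell(path))
            continue
        for nxt in EDGES[path[-1]]:
            stack.append(path + [nxt])
    return results


if __name__ == "__main__":
    for s in sorted(spelled_strings(1, 4)):
        print(s)
```

In this toy graph, nodes 2 and 3 act as two alleles of a single-base variant, so the two source-to-sink paths spell two sequences that differ at one position.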

Genome graphs have been used to reduce bias in the reference genome, to form more inclusive reference genomes, and to reduce the space and time required for genomic sequence analyses. For these reasons, many tools are being adapted to use genome graphs as references in lieu of traditional linear (single sequence) references. While genome graphs have consistently proved useful in these areas, the algorithms for a number of problems associated with them suffer from poor computational scaling and lack of formalization.

The project will develop and validate algorithms for several central genome graph problems: (goal 1) constructing genome graphs, (goal 2) comparing genome graphs, and (goal 3) assessing the complexity of genome graphs. The framework that the project will use to solve these problems is innovative in that it exploits the under-explored connection between graph flow decompositions and genome graphs.
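To make the term "flow decomposition" concrete, here is a minimal sketch of the generic textbook procedure: repeatedly peel a source-to-sink path off a flow and record its bottleneck weight. It is not the project's algorithm. The function name `decompose_flow`, the example graph, and the reading of each weighted path as a genome with an abundance are assumptions made for illustration. The input is assumed to be a directed acyclic graph whose edge flows satisfy conservation at every internal node.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def decompose_flow(
    flow: Dict[Edge, int], source: str, sink: str
) -> List[Tuple[List[str], int]]:
    """Greedily split a flow on a DAG into weighted source-to-sink paths."""
    residual = dict(flow)  # remaining flow per edge; the input is not modified
    paths: List[Tuple[List[str], int]] = []
    while True:
        # Walk from the source along any edge that still carries flow.
        path = [source]
        node = source
        while node != sink:
            nxt = next(
                (v for (u, v), f in residual.items() if u == node and f > 0),
                None,
            )
            if nxt is None:
                break
            path.append(nxt)
            node = nxt
        if node != sink:
            break  # no remaining flow leaves the source (or conservation failed)
        steps = list(zip(path, path[1:]))
        bottleneck = min(residual[e] for e in steps)
        for e in steps:
            residual[e] -= bottleneck
        paths.append((path, bottleneck))
    return paths


if __name__ == "__main__":
    # Hypothetical flow: two routes share the edge c -> t.
    example = {
        ("s", "a"): 3, ("s", "b"): 2,
        ("a", "c"): 3, ("b", "c"): 2,
        ("c", "t"): 5,
    }
    for p, w in decompose_flow(example, "s", "t"):
        print(" -> ".join(p), "weight", w)
```

Under the assumed interpretation, each returned path would correspond to one sequence in the population and its weight to how often that sequence occurs. A flow generally admits many different decompositions, and this greedy walk returns only one of them.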

This approach reveals natural relationships between genome graphs and the population of strings they represent. This global view of the expressive power of a genome graph is central to the formulations that the project will explore. The problems that the project will tackle bridge graph theory and genomics, leading to greater interactions and connections between those fields.

The project's algorithms will allow genome graphs to more accurately reflect desired populations, will allow information from multiple genomes to be better integrated, and will advance the informatics tools needed to exploit large collections of genomic variants. The project will apply and evaluate these algorithms to (1) improve sequence alignment for mapping populations of genomes, (2) improve clustering of cancer tumor sequences and metagenomic samples, and (3) better model the progression of heterogeneity in metastatic cancer samples.

The developed algorithms will be implemented in an open-source library to encourage their use in other systems. Finally, the project will create open-source, free instructional videos to introduce concepts such as pan-genomics, genome graphs, and the developed algorithms to a wider audience.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

All Grantees

Carnegie-Mellon University
