Active NON-SBIR/STTR RPGS NIH (US)

Integrating the reference pangenome with biobank-scale data for complex trait analysis

$13.11M USD

Funder	NATIONAL HUMAN GENOME RESEARCH INSTITUTE
Recipient Organization	University of California, San Diego
Country	United States
Start Date	Sep 20, 2024
End Date	Aug 31, 2027
Duration	1,075 days
Number of Grantees	1
Roles	Principal Investigator
Data Source	NIH (US)
Grant ID	`10977573`

Grant Description

Project summary: The human reference pangenome, which represents a collection of genome sequences in a single data structure, has the potential to transform human genetics applications. Compared to a traditional linear reference genome, pangenomes enable analysis of megabases of genetic sequence that were previously ignored, reduce bias when analyzing diverse genomes, and provide dramatically improved genotyping of structurally complex regions of the genome. These complex regions likely harbor medically relevant variants contributing to a range of human traits. However, pangenomes have yet to be integrated into medical genetics and complex trait workflows due to a lack of analysis and visualization tools that are accessible to non-experts.

Our central hypothesis is that pangenomes can be used to improve fine-mapping of trait associations and detection of pathogenic variants in complex regions by identifying particular paths enriched in individuals with a phenotype of interest. We focus on developing and applying tools that leverage pangenomes to identify, visualize, and fine-map genomic loci associated with complex traits. The tools proposed below are motivated by two major challenges identified by our own efforts to this end. First, visualizing and browsing pangenome subgraphs for loci of interest, a critical step in exploring and understanding complex genomic regions, is currently a cumbersome and time-consuming process involving multiple command-line tools geared toward bioinformatics experts. Second, there is a lack of tools for integrating the reference pangenome with existing biobank datasets for which both genotype and phenotype data are available for complex trait analysis.

Our proposal integrates multiple large datasets encompassing a range of technologies and builds on existing pangenome resources and the computational infrastructure developed by the Human Pangenome Reference Consortium (HPRC). In particular, we use genotype data and whole genome sequencing (WGS) datasets available for hundreds of thousands of individuals of a range of ancestries from the UK Biobank and All of Us, as well as thousands of phenotypes available for these samples. A key goal is to enable backwards compatibility with existing biobank-scale datasets that have been mapped to linear reference genomes, which will facilitate more immediate use of the pangenome reference. We additionally use near-complete long-read assemblies and the reference pangenomes (primarily minigraph-cactus) released by the HPRC. Further, our tools are designed to integrate with the current pangenome computational ecosystem by incorporating existing file formats (e.g. rGFA) and toolkits (e.g. vg).

To this end we will develop a web-based pangenome browser that integrates with existing data based on linear genomes (Aim 1), develop metrics to quantify local graph complexity and use these metrics to characterize existing GWAS signals (Aim 2), and integrate pangenomes with existing biobank datasets to perform fine-mapping and visualization of individual trait-associated loci (Aim 3).

All Grantees

University of California, San Diego

