| Funder | National Science Foundation (US) |
|---|---|
| Recipient Organization | Cornell University |
| Country | United States |
| Start Date | Jul 01, 2022 |
| End Date | Jun 30, 2027 |
| Duration | 1,825 days |
| Number of Grantees | 1 |
| Roles | Principal Investigator |
| Data Source | National Science Foundation (US) |
| Grant ID | 2145577 |
The cost of genome sequencing has decreased by orders of magnitude over the past two decades, enabling the creation of datasets comprising up to millions of genomes of plants, animals, and humans. The long-term vision behind this research is to develop new methods for analyzing such datasets and to understand how genetic and environmental factors determine complex traits relevant to medicine and agriculture.
Existing methods for analyzing genetic data often struggle with the size and the complexity of today's massive datasets. This research seeks to improve existing approaches via novel techniques in artificial intelligence and machine learning. Specifically, this project will develop new mathematical models of genomic sequences that will serve as the basis for algorithms for genetic data analysis, including for tasks such as analyzing human ancestry, understanding the effect of genetics on disease, and more.
The second part of the project will explore a specific application of the new models--assaying genomic sequences with high accuracy and low cost--and will develop open-source software for this task. This software will help lower the cost of acquiring massive genetic datasets and will facilitate large-scale genetic studies. These efforts will positively impact downstream applications that rely on accurate genomes--medical genetics, animal breeding, and others--and will contribute to enabling cheaper and more accurate medical diagnosis, explaining the role of genetics in human disease, and helping breed more nourishing crops, ultimately improving human and environmental health.
The long-term research vision behind this project is to create next-generation algorithms for statistical genetics based on novel methods in machine learning and deep learning. This proposal takes a first step in this direction by creating novel methods for modeling genetic variation and applying them to two important problems in statistical genetics: genotype imputation and low-pass genome sequencing.
Both problems involve determining the complete sequence of a genome from a small number of measurements obtained using an inexpensive assay. Specifically, this research has two primary aims: (1) to develop a novel deep generative model of genetic sequences that replaces classical approaches based on hidden Markov models and that can serve as the foundation for algorithms throughout statistical genetics; (2) to significantly reduce the cost of genomic assays via novel algorithms for imputation and low-pass sequencing based on the new model.
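To make the imputation problem concrete, the following is a minimal Python sketch of the kind of classical hidden Markov model baseline that the first aim proposes to replace. It is not the project's method. It assumes a toy haplotype-copying model in which the target genome copies one reference haplotype at a time, switching between them with a small probability. The function name `impute`, the parameters `switch` and `error`, and the example panel are all hypothetical and chosen only for illustration.

```python
def impute(panel, observed, switch=0.05, error=0.01):
    """Return P(allele == 1) at every site under a toy haplotype-copying HMM.

    panel    -- list of reference haplotypes, each a list of 0/1 alleles
    observed -- dict mapping site index -> measured allele (sparse assay)
    switch   -- assumed per-site probability of jumping to a random haplotype
    error    -- assumed probability that a measured allele is wrong
    """
    k, m = len(panel), len(panel[0])

    def emit(h, s):
        # Unmeasured sites carry no information about the hidden state.
        if s not in observed:
            return 1.0
        return 1.0 - error if panel[h][s] == observed[s] else error

    def normalize(v):
        total = sum(v)
        return [x / total for x in v]

    # Forward pass: probability of each copied haplotype given data so far.
    fwd = [normalize([emit(h, 0) / k for h in range(k)])]
    for s in range(1, m):
        prev = fwd[-1]  # sums to 1, so the jump mass into each state is switch / k
        fwd.append(normalize([
            emit(h, s) * ((1 - switch) * prev[h] + switch / k)
            for h in range(k)
        ]))

    # Backward pass: likelihood of the remaining data given each state.
    bwd = [[1.0] * k for _ in range(m)]
    for s in range(m - 2, -1, -1):
        nxt = [emit(h, s + 1) * bwd[s + 1][h] for h in range(k)]
        jump = switch * sum(nxt) / k
        bwd[s] = normalize([(1 - switch) * nxt[h] + jump for h in range(k)])

    # Posterior over copied haplotypes, averaged into an allele probability.
    dosages = []
    for s in range(m):
        post = normalize([f * b for f, b in zip(fwd[s], bwd[s])])
        dosages.append(sum(p * panel[h][s] for h, p in enumerate(post)))
    return dosages


# Hypothetical example: three reference haplotypes, two measured sites.
panel = [
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 1, 0],
]
observed = {0: 1, 4: 1}
dosages = impute(panel, observed)
```

In this example only the second reference haplotype agrees with both measurements, so the unmeasured sites in between are imputed as very likely carrying allele 1. A deep generative model, as envisioned in the first aim, would replace the hand-specified transition and emission rules with a learned model of genetic variation.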
Central to this effort is the development of new techniques, approaches, and frameworks in deep generative modeling that address challenges posed by genetic data--including high dimensionality and long-range sequence dependencies--and that are useful beyond genomics. Ultimately, we envision this work laying the foundation for a new field of deep statistical genetics and inspiring new algorithms for problems throughout the field, including haplotyping, ancestry inference, genome-wide association study analysis, polygenic risk scoring, and beyond.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.