Completed CONTINUING GRANT National Science Foundation (US)

CAREER: Developing New Computational Methods to Address the Missing Data Problem in Population Genomics

$4.84M USD

Funder	National Science Foundation (US)
Recipient Organization	California State University San Marcos Corporation
Country	United States
Start Date	May 01, 2021
End Date	Sep 30, 2021
Duration	152 days
Number of Grantees	1
Roles	Principal Investigator
Data Source	National Science Foundation (US)
Grant ID	`2042516`

Grant Description

Population genomic data are becoming increasingly affordable and accessible, causing a sudden data explosion in the field of evolutionary biology. This growth in data generation brings another important issue: the missing data problem. Data may be missing because they are unobserved (e.g., due to the sampling method), observed incorrectly (e.g., due to errors in the method of observation), or impossible to observe (e.g., due to extinction).

Missing data are often not accounted for and can cause incorrect conclusions in population genomics research. This project will build bioinformatics software to address these three missing data problems. The statistical framework and methods developed by this project will be utilized extensively by evolutionary biologists in a variety of fields.

Additionally, this project will develop accessible software pipelines and curricular material for recruiting and retaining underrepresented groups in computer programming and bioinformatics at a variety of levels (K-12, undergraduate, graduate, and postgraduate).

Population genomic data are considered to be missing due to (1) sequencing or genotyping errors, (2) systematic bias in the generation of genotyping libraries (e.g., from techniques such as restriction associated DNA sequencing (RADseq)), or (3) the absence of genomic data from un-sampled, perhaps extinct “ghost” populations. This project will develop a series of tools to address all three missing data problems by accounting for missing data as an unobserved variable in statistical models for the estimation of population genetic parameters and evolutionary history from genomic data.
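To illustrate what treating missing data as an unobserved variable can look like, here is a minimal sketch. It is not the project's software. It assumes a single biallelic locus in Hardy-Weinberg equilibrium, a simple symmetric genotyping-error model, and genotypes that are missing at random. The names `MISSING`, `obs_prob`, `log_likelihood`, and `estimate_frequency` are hypothetical.

```python
import math

MISSING = None  # hypothetical marker for a genotype that was not observed


def genotype_probs(p):
    """Hardy-Weinberg probabilities of carrying 0, 1, or 2 copies of an allele with frequency p."""
    q = 1.0 - p
    return [q * q, 2.0 * p * q, p * p]


def obs_prob(observed, true, error_rate):
    """P(observed genotype | true genotype) under a symmetric error model (an assumption)."""
    if observed == true:
        return 1.0 - error_rate
    return error_rate / 2.0


def log_likelihood(p, genotypes, error_rate=0.01):
    """Log-likelihood of allele frequency p, summing over the unobserved true genotypes."""
    prior = genotype_probs(p)
    total = 0.0
    for g in genotypes:
        if g is MISSING:
            # Under missing-at-random, marginalizing over all true genotypes
            # gives probability 1, so a missing call adds log(1) = 0.
            continue
        total += math.log(sum(prior[t] * obs_prob(g, t, error_rate) for t in range(3)))
    return total


def estimate_frequency(genotypes, error_rate=0.01, grid=999):
    """Grid-search maximum likelihood estimate of the allele frequency."""
    candidates = [(i + 1) / (grid + 1) for i in range(grid)]
    return max(candidates, key=lambda p: log_likelihood(p, genotypes, error_rate))


if __name__ == "__main__":
    calls = [0, 1, 2, 1, MISSING, MISSING, 1, 0]
    print(estimate_frequency(calls))
```

In this toy model a missing call carries no information, so it leaves the estimate unchanged. Genotyping errors are handled by summing over the possible true genotypes rather than trusting each call outright. Data that are missing for systematic reasons, as in RADseq, would need an explicit model of the missingness process.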

Specifically, we will (1) build a parallelized statistical framework for estimating population genetic structure from multi-allelic, multi-locus genomic data that incorporates missing data into a maximum likelihood framework, (2) systematically explore RADseq data sets, using extensive simulations and a meta-analysis of published studies, to both quantify and account for how missing data due to “lost” polymorphisms at restriction sites bias the estimation of evolutionary history, and (3) develop a statistical model to classify genomic loci as having introgressed from extant or from “ghost” populations based on their coalescent histories under the Isolation with Migration (IM) model. This work will form the basis of a set of robust tools that evolutionary biologists in a variety of fields can use to systematically assess and account for the effects of missing data in their population genomic data sets.
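The bias described in aim (2) can be made concrete with a small simulation. The sketch below is a toy illustration, not the project's simulation framework. It assumes one biallelic locus and that each chromosome copy independently fails to be sequenced with probability `dropout_freq`, a hypothetical parameter standing in for a mutated restriction site. Real allele dropout is linked to particular haplotypes, which this sketch ignores.

```python
import random


def simulate_heterozygosity(n_individuals, allele_freq, dropout_freq, seed=0):
    """Return (true, observed) heterozygosity at one simulated RADseq locus."""
    rng = random.Random(seed)
    true_het = 0   # individuals that are truly heterozygous
    obs_het = 0    # individuals called heterozygous from sequenced copies
    genotyped = 0  # individuals with at least one sequenced copy

    for _ in range(n_individuals):
        copies = []
        for _ in range(2):
            allele = 1 if rng.random() < allele_freq else 0
            sequenced = rng.random() >= dropout_freq
            copies.append((allele, sequenced))

        if copies[0][0] != copies[1][0]:
            true_het += 1

        seen = [allele for allele, sequenced in copies if sequenced]
        if not seen:
            continue  # both copies dropped out: the genotype is missing
        genotyped += 1
        # A heterozygote with one dropped copy is miscalled as a homozygote.
        if len(seen) == 2 and seen[0] != seen[1]:
            obs_het += 1

    observed = obs_het / genotyped if genotyped else float("nan")
    return true_het / n_individuals, observed


if __name__ == "__main__":
    true_h, obs_h = simulate_heterozygosity(10000, allele_freq=0.5, dropout_freq=0.2, seed=1)
    print(f"true heterozygosity:     {true_h:.3f}")
    print(f"observed heterozygosity: {obs_h:.3f}")
```

Under these assumptions the observed heterozygosity falls below the true value because heterozygotes that lose one copy are scored as homozygotes. This is the kind of systematic distortion that simply discarding missing genotypes does not correct.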

This CAREER grant will also strengthen University-public partnerships through (1) week-long summer bioinformatics workshops for high-school biology teachers in the Philadelphia and San Diego areas, (2) development of curricular material for The Galaxy Project, the Conservation Genomics Workshop at the University of Montana, and the California State University Program for Education and Research in Biotechnology (CSUPERB), and (3) recruitment and retention of underrepresented student scholars into genomics research. All curricular material, software, and pipelines developed will be shared via the PI’s GitHub page: www.github.com/arunsethuraman.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

All Grantees

California State University San Marcos Corporation
