| Funder | National Science Foundation (US) |
|---|---|
| Recipient Organization | California State University San Marcos Corporation |
| Country | United States |
| Start Date | May 01, 2021 |
| End Date | Sep 30, 2021 |
| Duration | 152 days |
| Number of Grantees | 1 |
| Roles | Principal Investigator |
| Data Source | National Science Foundation (US) |
| Grant ID | 2042516 |
Population genomic data are becoming increasingly affordable and accessible, causing a sudden data explosion in the field of evolutionary biology. This increase in data generation brings another important issue: the missing data problem. Data may be missing because they are unobserved (e.g., due to the sampling method), observed incorrectly (e.g., due to errors in the method of observation), or impossible to observe (e.g., due to extinction).
Missing data are often not accounted for and can cause incorrect conclusions in population genomics research. This project will build bioinformatics software to address these three missing data problems. The statistical framework and methods developed by this project will be utilized extensively by evolutionary biologists in a variety of fields.
Additionally, this project will develop accessible software pipelines and curricular material for recruiting and retaining underrepresented groups into computer programming and bioinformatics at a variety of levels (K-12, undergraduate, graduate, and postgraduate).
Population genomic data are considered to be missing due to (1) sequencing or genotyping errors, (2) systematic bias in the generation of genotyping libraries (e.g., from techniques such as restriction associated DNA sequencing (RADseq)), or (3) the absence of genomic data from un-sampled, perhaps extinct “ghost” populations. This project will develop a series of tools to address all three missing data problems by accounting for missing data as an unobserved variable in statistical models for the estimation of population genetic parameters and evolutionary history from genomic data.
Specifically, we will (1) build a parallelized statistical framework for estimating population genetic structure from multi-allelic, multi-locus genomic data that incorporates missing data into maximum likelihood estimation, (2) systematically explore RADseq data sets, using extensive simulations and a meta-analysis of published studies, to both quantify and account for how missing data due to “lost” polymorphisms at restriction sites bias estimation of evolutionary history, and (3) develop a statistical model to classify genomic loci as having introgressed from extant or from “ghost” populations based on their coalescent histories under the Isolation with Migration (IM) model. This work will form the basis of a set of robust tools that evolutionary biologists in a variety of fields can use to systematically assess and account for the effects of missing data in their population genomic data sets.
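As a rough illustration of what treating a missing genotype as an unobserved variable in a likelihood can look like, the sketch below estimates allele frequencies at a single multi-allelic locus. It is not the project's software or statistical model. It assumes one population in Hardy-Weinberg equilibrium and genotypes that are missing completely at random, in which case summing over all possible values of an unobserved genotype contributes a factor of 1 to the likelihood. The function names `allele_freq_mle` and `log_likelihood` and the example genotypes are hypothetical.

```python
import math
from collections import Counter


def allele_freq_mle(genotypes):
    """Maximum likelihood allele frequencies at one locus.

    Each genotype is a pair of allele labels, or None when it is missing.
    Assumes Hardy-Weinberg equilibrium and data missing completely at random.
    """
    counts = Counter()
    for genotype in genotypes:
        if genotype is None:
            # Marginalizing over an unobserved genotype multiplies the
            # likelihood by 1, so it adds nothing to the allele counts.
            continue
        counts.update(genotype)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no observed genotypes at this locus")
    return {allele: n / total for allele, n in counts.items()}


def log_likelihood(genotypes, freqs):
    """Log-likelihood of the observed genotypes under Hardy-Weinberg."""
    ll = 0.0
    for genotype in genotypes:
        if genotype is None:
            continue  # log(1) = 0 for a marginalized missing genotype
        a, b = genotype
        prob = freqs[a] * freqs[b]
        if a != b:
            prob *= 2  # a heterozygote can arise in two orders
        ll += math.log(prob)
    return ll


# Hypothetical data: six individuals, two with missing genotypes.
genotypes = [("A", "A"), ("A", "C"), None, ("C", "T"), ("A", "A"), None]
freqs = allele_freq_mle(genotypes)
print(freqs)
print(log_likelihood(genotypes, freqs))
```

Under these assumptions the missing individuals simply drop out of the estimate. When missingness depends on the genotype itself, as with alleles lost at mutated restriction sites in RADseq, this shortcut no longer holds and the missingness mechanism has to be modeled explicitly.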
This CAREER grant will also strengthen University-public partnerships through (1) week-long summer bioinformatics workshops for high-school biology teachers in the Philadelphia and San Diego areas, (2) development of curricular material for The Galaxy Project, the Conservation Genomics Workshop at the University of Montana, and the California State University Program for Education and Research in Biotechnology (CSUPERB), and (3) recruitment and retention of underrepresented student scholars into genomics research. All curricular material, software, and pipelines developed will be shared via the PI’s GitHub page: www.github.com/arunsethuraman.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.