| Funder | NATIONAL LIBRARY OF MEDICINE |
|---|---|
| Recipient Organization | University of Wisconsin-Madison |
| Country | United States |
| Start Date | Aug 07, 2024 |
| End Date | May 31, 2028 |
| Duration | 1,393 days |
| Number of Grantees | 1 |
| Roles | Principal Investigator |
| Data Source | NIH (US) |
| Grant ID | 10942272 |
PROJECT SUMMARY/ABSTRACT

Biomedical research data sets are increasingly being deposited in public, centralized databases, such as the Sequence Read Archive (SRA), to which researchers submit sequencing-based data. Large centralized databases create opportunities for training powerful machine learning models, as well as for reanalysis and cross-study meta-analysis of biomedical data. These analyses can answer questions that were not addressed in the papers first describing the data, including questions that can only be answered by aggregating data from multiple studies. Unfortunately, researchers have not been able to fully capitalize on databases of biomedical data sets, largely because the metadata provided for data sets are often unstructured, unstandardized, and incomplete. For example, the primary metadata for samples with assays deposited in the SRA are provided as a list of key-value pairs, with no standardization of the keys or values and no required fields. Such poor metadata pose challenges for integrating data sets with these databases as well as for querying for specific data sets of interest.

To fully enable the opportunities offered by large biomedical databases, we propose to develop automated methods for curating the metadata contained within them. These methods will standardize the metadata of a database by assigning to each record a set of standardized terms for concepts represented within biomedical ontologies and will additionally identify the relationship between each concept and record (e.g., a record’s corresponding biological sample was derived from liver tissue). A complementary set of methods will be developed to identify missing or unstandardized concepts in metadata. The developed methods will use machine learning approaches that can be trained with minimal human effort. To achieve high accuracy with sparse training data, we will take advantage of cutting-edge approaches in deep learning, natural language processing, and active learning.

As a specific application of these general methods, we will use them to standardize and enhance the metadata contained within the SRA and the Gene Expression Omnibus (GEO) for the most commonly assayed species using a comprehensive set of ontology concepts and relationships. The resulting standardized metadata for the SRA and GEO will be made freely available and easily accessible via a web interface, bulk downloads, and R and Python interface packages. The developed methods, along with the standardized metadata they produce, will allow biomedical databases to be used to their full potential in advancing our understanding of fundamental biology and human health.