Active NON-SBIR/STTR RPGS NIH (US)

Automated methods for standardization and enhancement of metadata in biomedical databases

$3.32M USD

Funder	NATIONAL LIBRARY OF MEDICINE
Recipient Organization	University of Wisconsin-Madison
Country	United States
Start Date	Aug 07, 2024
End Date	May 31, 2028
Duration	1,393 days
Number of Grantees	1
Roles	Principal Investigator
Data Source	NIH (US)
Grant ID	`10942272`

Grant Description

PROJECT SUMMARY/ABSTRACT

Biomedical research data sets are increasingly being deposited in public, centralized databases, such as the Sequence Read Archive (SRA), to which researchers submit sequencing-based data. Large centralized databases greatly enable opportunities for training powerful machine learning models, as well as for reanalysis and cross-study meta-analysis of biomedical data. These analyses can be used to answer questions that were not addressed in the papers first describing the data, including those that could only be answered by aggregating data from multiple studies. Unfortunately, researchers have not been able to fully capitalize on databases of biomedical data sets largely because the metadata provided for data sets are often unstructured, unstandardized, and incomplete. For example, the primary metadata for samples with assays deposited in the SRA are provided as a list of key-value pairs, with no standardization of the keys or values and no required fields. Such poor metadata pose challenges for integrating datasets with these databases as well as for querying for specific data sets of interest.

To fully enable the opportunities offered by large biomedical databases, we propose to develop automated methods for curating the metadata contained within them. These methods will standardize the metadata of a database by assigning to each record a set of standardized terms for concepts represented within biomedical ontologies and will additionally identify the relationship between each concept and record (e.g., a record’s corresponding biological sample was derived from liver tissue). A complementary set of methods will be developed to identify missing or unstandardized concepts in metadata. The developed methods will use machine learning approaches that can be trained with minimal human effort. To achieve high accuracy with sparse training data, we will take advantage of cutting-edge approaches in deep learning, natural language processing, and active learning. As a specific application of these general methods, we will use them to standardize and enhance the metadata contained within the SRA and the Gene Expression Omnibus (GEO) for the most commonly assayed species using a comprehensive set of ontology concepts and relationships.

The resulting standardized metadata for the SRA and GEO will be made freely available and easily accessible via a web interface, bulk downloads, and R and Python interface packages. The developed methods, along with the standardized metadata they produce, will allow biomedical databases to be used to their full potential in advancing our understanding of fundamental biology and human health.

All Grantees

University of Wisconsin-Madison
