Active STANDARD GRANT National Science Foundation (US)

EAGER: NAIRR Pilot: Making Biomedical Data FAIR on the NAIRR

$2.38M USD

Funder	National Science Foundation (US)
Recipient Organization	University of Washington
Country	United States
Start Date	Apr 01, 2025
End Date	Jun 30, 2026
Duration	455 days
Number of Grantees	2
Roles	Principal Investigator; Co-Principal Investigator
Data Source	National Science Foundation (US)
Grant ID	`2451163`

Grant Description

Artificial Intelligence (AI) methods depend on data. However, much of the most valuable biomedical data is subject to strict access controls due to its sensitivity. These controls keep the data from being findable, accessible, interoperable, and reusable (FAIR), effectively making it “dark data.” Dark data is a major reason why AI remains underdeveloped in healthcare.

To address this problem, this project will develop methods that can safely and securely illuminate dark health data and make it accessible for research without compromising the privacy of the data donors. By creating fully open but privacy-preserving replicas of the original datasets, the project will empower researchers to find, access, inspect, socialize, critique, and reuse this data.

Systematically illuminating dark data will support beneficial AI health applications, particularly for data stored in isolated repositories, such as data about rare diseases.

This project’s goal is to advance the state of the art in privacy-preserving biomedical data sharing through the development of a software library with efficient algorithms and cryptographic protocols. This will enable data custodians from secured, siloed repositories to contribute data to a joint and secure synthetic data generation process. The solution combines Secure Multiparty Computation with Differential Privacy to ensure both input privacy (data custodians do not need to disclose their data to anyone, including intermediary and aggregating servers) and output privacy (the synthetic datasets do not reveal sensitive information about individuals in the training data).
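To make the two privacy notions concrete, here is a minimal Python sketch, not the project's library or protocol. It assumes a toy setting in which each custodian holds one private count: additive secret sharing stands in for Secure Multiparty Computation (input privacy), and Laplace noise on the aggregate stands in for Differential Privacy (output privacy). The names `share`, `secure_sum`, `dp_count`, the modulus `PRIME`, and the sensitivity of 1 are all illustrative assumptions. A real deployment would also generate the noise inside the secure computation and would not rely on floating-point sampling.

```python
import random
import secrets

# Assumed modulus for additive secret sharing (hypothetical choice).
PRIME = 2**61 - 1


def share(value, n_parties):
    """Split an integer into n additive shares that sum to value mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


def secure_sum(private_counts):
    """Sum private counts while each party only ever sees random-looking shares."""
    n = len(private_counts)
    all_shares = [share(count, n) for count in private_counts]
    # Party j receives the j-th share from every custodian and adds them locally.
    partial_sums = [
        sum(all_shares[i][j] for i in range(n)) % PRIME for j in range(n)
    ]
    # Only the combined partial sums reveal the total.
    return sum(partial_sums) % PRIME


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(private_counts, epsilon, rng):
    """Release the joint count with Laplace noise (a count has sensitivity 1)."""
    return secure_sum(private_counts) + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    # Three hypothetical custodians with made-up patient counts.
    counts = [12, 7, 30]
    print(secure_sum(counts))
    print(dp_count(counts, epsilon=1.0, rng=random.Random()))
```

No single party in `secure_sum` learns another custodian's count, and the released value from `dp_count` is perturbed so that the presence or absence of one individual changes its distribution only by a factor bounded by the chosen `epsilon`.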

Additionally, the project will improve the understanding of current techniques’ strengths and limitations in training generative artificial intelligence models for high-fidelity and high-utility synthetic genomics (germline) data. It will investigate the limits of direct germline sequence generation using state-of-the-art generative foundation models, generate variant call file data using frontier models for tabular data generation, and validate the concordance of mutation annotation files derived from synthetic and original variant call files by comparing pathogenicity predictions.
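The description does not say how concordance will be measured. As one possible illustration, the Python sketch below compares the distribution of pathogenicity classes predicted for variants in an original and a synthetic annotation file, using total variation distance (0 means identical distributions, 1 means no overlap). The metric, the function names, and the class labels are assumptions chosen for the example, not the project's stated method.

```python
from collections import Counter


def class_frequencies(labels):
    """Return the relative frequency of each pathogenicity class."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation_distance(original_labels, synthetic_labels):
    """Half the L1 distance between the two class-frequency distributions."""
    p = class_frequencies(original_labels)
    q = class_frequencies(synthetic_labels)
    classes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in classes)


if __name__ == "__main__":
    # Hypothetical predicted classes, one per annotated variant.
    original = ["benign", "benign", "pathogenic", "pathogenic"]
    synthetic = ["benign", "benign", "benign", "pathogenic"]
    print(total_variation_distance(original, synthetic))
```

A distributional comparison is used here because synthetic variants need not correspond one-to-one with original ones; a per-variant agreement rate would only apply where such a matching exists.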

The project focuses on rare diseases such as neurofibromatosis and acute myeloid leukemia, where existing data is scarce and fragmented across different data custodians.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

All Grantees

University of Washington
