Completed STANDARD GRANT National Science Foundation (US)

CRII: III: Toward the Compression of Pangenomic DNA Sequence Data Using Context-Free Grammars

$1.75M USD

Funder	National Science Foundation (US)
Recipient Organization	National Center for Genome Resources
Country	United States
Start Date	Aug 01, 2021
End Date	Jul 31, 2024
Duration	1,095 days
Number of Grantees	1
Roles	Principal Investigator
Data Source	National Science Foundation (US)
Grant ID	`2105391`

Grant Description

DNA sequence data are becoming ubiquitous in various domains of science, such as medicine and agriculture. However, the volume of these data and the rate at which they are being generated are rapidly outpacing storage and analysis capabilities. This project aims to address both the storage and analysis issues by developing new techniques for compressing DNA sequence data such that analyses can be performed directly on the compressed data.

Specifically, the project aims to compress collections of DNA sequence data from the same species, or pangenomes. In addition to reducing data storage costs and transmission times, this will enable the analysis of pangenomes at an unprecedented scale, which could aid researchers seeking to understand the genetic basis of complex diseases in medical contexts, or similarly complex traits that are targets of directed breeding efforts in agriculture.

The primary goal of this project is to develop new methods for compressing pangenomic DNA sequence data. The motivation comes from the fact that these data are too large to store uncompressed but must be continuously analyzed by the research community. This project addresses the issue by building on preliminary work that compresses collections of strings using context-free grammars in a manner that allows the string content of a compressed collection to be updated over time.
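As a rough illustration of what compressing a collection of strings with a context-free grammar can look like, the sketch below uses Re-Pair, a well-known grammar compression heuristic that repeatedly replaces the most frequent adjacent pair of symbols with a new nonterminal. This is not the project's method, which is not described in this listing; the function names, the `R<n>` rule naming, and the use of `$` to join sequences are all assumptions made for the example.

```python
from collections import Counter


def repair_compress(text):
    """Toy Re-Pair: returns (start sequence, rules) for `text`.

    Illustrative only; rule names like "R0" are hypothetical and this is
    not the algorithm developed under the grant.
    """
    seq = list(text)   # working sequence of symbols (single characters at first)
    rules = {}         # nonterminal name -> (left symbol, right symbol)
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break      # no adjacent pair repeats, so stop adding rules
        name = f"R{len(rules)}"
        rules[name] = pair
        # Replace non-overlapping occurrences of the pair, scanning left to right.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(name)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(symbol, rules):
    """Recursively expand one symbol back into the characters it derives."""
    if symbol not in rules:
        return symbol
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)


def repair_decompress(seq, rules):
    """Rebuild the original text from the start sequence and the rules."""
    return "".join(expand(symbol, rules) for symbol in seq)


# Two made-up, highly similar sequences standing in for a tiny "pangenome".
genomes = ["ACGTACGTAC", "ACGTTCGTAC"]
text = "$".join(genomes)  # "$" as a separator is an assumption of this sketch
seq, rules = repair_compress(text)
print(len(text), "characters ->", len(seq), "symbols and", len(rules), "rules")
```

Because the sequences share most of their content, repeated substrings collapse into shared rules, which is the intuition behind applying grammars to genomes of the same species. The sketch does not support updating a compressed collection or searching it without decompression, both of which the project description names as goals.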

The first aim of the project is to develop new algorithms for compressing multiple genomes of the same species in a manner that enables search and computation directly on compressed archives. The second aim is to develop methods for mapping sequencing reads to compressed archives. The third aim is to develop methods for compressing reads by using their mappings to integrate them into the compressed archive while enabling search and computation.

The fourth aim is to develop methods for performing searches directly on compressed read archives. The fifth aim is to implement these methods in an open-source software package, including an Application Programming Interface, for use by biological data science researchers. This project will innovate both DNA sequence data compression techniques and general data compression techniques. It will also enable pangenomic analyses at scale across industry and academia.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

All Grantees

National Center for Genome Resources

