Completed STANDARD GRANT National Science Foundation (US)

EAGER: Scalable, Content-Based, Domain-Agnostic Search of Scientific Data through Concise Topological Representations

$1.8M USD

Funder	National Science Foundation (US)
Recipient Organization	Tulane University
Country	United States
Start Date	Oct 01, 2021
End Date	Sep 30, 2023
Duration	729 days
Number of Grantees	1
Roles	Principal Investigator
Data Source	National Science Foundation (US)
Grant ID	`2136744`

Grant Description

Cutting-edge science relies on scientists’ ability to sift through and access the massive amounts of data that are being produced by the latest research. Much of that data is stored in online databases and is searchable only by using specific, scientific terms, like keywords, tags, or descriptions. If someone doesn’t know exactly the right terms to use, they often can’t access all the data that might be useful for their research.

By using mathematical approaches for information retrieval in a new way, this project will study whether a powerful search tool, called content-based search, can be modified for scientific data. If successful, this project will free data users from needing to know exactly which keywords to use, transforming how scientists are able to access and share data and creating new opportunities for scientists with vastly different expertise to work together.

One particularly promising way to describe the content of scientific data is through a dataset’s topology. Therefore, this project will develop approaches to compute topological similarity that are smaller, faster, and more scalable than previously thought possible, with the goal of creating a method for cross-cutting, content-based search of scientific data.

Specifically, the investigators will develop a learned hash function that converts a dataset’s persistence diagram (the common encoding of its topology) into a simple binary code. The hash will be trained so that the bitwise distance between two codes preserves a measure of topological similarity between the corresponding datasets. Topological comparison, currently an expensive bottleneck, would then carry only nominal processing cost and could scale to large database queries.
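To make the idea of comparing datasets by bitwise distance concrete, the following is a minimal sketch and not the project's method. It assumes a persistence diagram is given as a list of (birth, death) pairs. The featurization `lifetimes_vector` and the untrained random-projection hash `make_hash` are hypothetical stand-ins for the learned hash described above. Only the Hamming-distance comparison reflects the "bitwise distance" mentioned in the text.

```python
import random

def lifetimes_vector(diagram, k=8):
    """Hypothetical featurization: the k largest lifetimes (death - birth), zero-padded."""
    lifetimes = sorted((death - birth for birth, death in diagram), reverse=True)
    return (lifetimes + [0.0] * k)[:k]

def make_hash(n_bits=16, k=8, seed=0):
    """Stand-in for a learned hash: signs of random projections (assumption, not trained)."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(n_bits)]

    def hash_diagram(diagram):
        vec = lifetimes_vector(diagram, k)
        code = 0
        for plane in planes:
            dot = sum(w * x for w, x in zip(plane, vec))
            # Append one bit per projection: 1 if the projection is non-negative.
            code = (code << 1) | (1 if dot >= 0 else 0)
        return code

    return hash_diagram

def hamming(a, b):
    """Bitwise (Hamming) distance between two integer-encoded binary codes."""
    return bin(a ^ b).count("1")

def search(query_code, database_codes, top=3):
    """Return the names of the `top` database entries closest to the query code."""
    ranked = sorted(database_codes.items(), key=lambda item: hamming(query_code, item[1]))
    return [name for name, _ in ranked[:top]]

# Illustrative, made-up diagrams; the names are hypothetical.
hash_diagram = make_hash()
diagram_a = [(0.0, 1.0), (0.2, 0.5)]
diagram_b = [(0.0, 0.1), (0.3, 2.0), (0.4, 0.6)]
database = {"a": hash_diagram(diagram_a), "b": hash_diagram(diagram_b)}
nearest = search(hash_diagram(diagram_a), database, top=1)
```

Each comparison here is a single XOR followed by a bit count, which is why a code-based search can be far cheaper than comparing diagrams directly. A trained hash would replace the random projections with parameters fitted so that small Hamming distance tracks topological similarity.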

Initially, this project will focus on binary codes that preserve clusters and neighborhoods, with the ultimate aim of developing codes that are rank-preserving or semi-metric-preserving. The investigators will also explore strategies for training a learned hash function on synthetic data, with the goal of developing a fully domain-oblivious approach to content-based search.
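One way to think about "neighborhood preserving" and "rank preserving" is as agreement between two sets of distances from a query: a reference topological distance and the Hamming distance between codes. The sketch below is only illustrative. The two metrics, their names, and the sample numbers are assumptions and are not taken from the award.

```python
from itertools import combinations

def neighborhood_recall(ref_dists, code_dists, k):
    """Fraction of the k nearest items under ref_dists that are also
    among the k nearest items under code_dists."""
    ref_top = set(sorted(ref_dists, key=ref_dists.get)[:k])
    code_top = set(sorted(code_dists, key=code_dists.get)[:k])
    return len(ref_top & code_top) / k

def rank_agreement(ref_dists, code_dists):
    """Fraction of item pairs that both distances order the same way.
    Pairs tied under either distance count as disagreement."""
    pairs = list(combinations(ref_dists, 2))
    same = sum(
        1
        for a, b in pairs
        if (ref_dists[a] - ref_dists[b]) * (code_dists[a] - code_dists[b]) > 0
    )
    return same / len(pairs)

# Made-up distances from one query to four datasets (hypothetical values).
reference = {"a": 0.1, "b": 0.5, "c": 0.9, "d": 2.0}
code_based = {"a": 1, "b": 3, "c": 2, "d": 4}

recall_at_2 = neighborhood_recall(reference, code_based, k=2)
agreement = rank_agreement(reference, code_based)
```

In this example the codes swap the order of "b" and "c", so the two-nearest neighborhoods overlap only partially and one of the six pairs is ordered differently. A rank-preserving code would drive `rank_agreement` toward 1.0, while a neighborhood-preserving code only needs high recall for small `k`.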

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

All Grantees

Tulane University
