Completed | Standard Grant | National Science Foundation (US)

III: Medium: Dataset Search and Ranking for Data Augmentation and Explanation

$10.93M USD

Funder	National Science Foundation (US)
Recipient Organization	New York University
Country	United States
Start Date	Sep 01, 2021
End Date	Aug 31, 2025
Duration	1,460 days
Number of Grantees	2
Roles	Principal Investigator; Co-Principal Investigator
Data Source	National Science Foundation (US)
Grant ID	`2106888`

Grant Description

There has been an explosion in the volume of data that is being collected and cataloged about the environment, society, and populace. Moreover, with the push towards transparency and open data, scientists, governments, and organizations are increasingly making these data available on the Web. Combined with advances in analytics and machine learning, such growing access to data should in theory allow for progress on many of the world’s most important scientific and societal questions.

However, this opportunity is often missed due to a central technical barrier: it is currently nearly impossible for domain experts to sift through the vast amount of publicly available information to discover the datasets they need for a specific application. Data repository platforms, such as CKAN and Dataverse, and dataset search engines, such as Google Dataset Search, aim to make it easy to share and find datasets.

But these systems only support simple, keyword-based queries and metadata search, which are insufficient for users to properly specify their information needs. The investigators envision a new kind of dataset search engine that unlocks the untapped value in open data by supporting a richer set of findability queries that cater to the needs of analytics tasks, and aid in the construction and refinement of machine learning models.

By empowering scientists and practitioners with the ability to discover relevant data, the project has great potential to stimulate data reuse both within and across domains.

The project will develop methods where the user’s existing data forms the basis of a query that retrieves additional, related data from a large collection of datasets and attributes. There are many technical hurdles to overcome to support such queries. One primary challenge is computational efficiency: this project will develop novel algorithms for rapidly computing and searching for dataset relationships.
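As an illustration of what "rapidly computing dataset relationships" can look like, the sketch below uses MinHash, a well-known randomized sketching technique, to estimate how much two key columns overlap without comparing them value by value. It is a minimal example under stated assumptions, not the project's algorithm: the function names (`minhash_signature`, `estimate_jaccard`), the choice of BLAKE2b as the hash, and the default signature size are all hypothetical choices made for illustration.

```python
import hashlib


def _hash(value, seed):
    """Deterministic 64-bit hash of a value under a given seed (illustrative choice)."""
    data = f"{seed}:{value}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(values, num_hashes=128):
    """Summarize the distinct values of a column as a fixed-size MinHash signature."""
    distinct = set(values)
    if not distinct:
        raise ValueError("cannot sketch an empty column")
    # For each seeded hash function, keep only the smallest hash over the column.
    return [min(_hash(v, seed) for v in distinct) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity, used here only to check the estimate."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Hypothetical key columns: the user's table and a candidate dataset.
    user_keys = [f"zip-{i}" for i in range(0, 1000)]
    candidate_keys = [f"zip-{i}" for i in range(500, 1500)]
    sig_user = minhash_signature(user_keys)
    sig_candidate = minhash_signature(candidate_keys)
    print("estimated:", estimate_jaccard(sig_user, sig_candidate))
    print("exact:    ", exact_jaccard(user_keys, candidate_keys))
```

The design point is that each signature is computed once per column and has a fixed size, so a search engine could in principle compare a query column against many indexed columns by comparing short signatures rather than full data.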

The investigators will build on a rich variety of tools, including randomized sketching and hashing algorithms, and contribute new theoretical analyses to understand these methods. The algorithms contributed will address both highly structured data (e.g., spatio-temporal) and generic numerical or categorical data. A second challenge is usability: the project will develop novel methods for assessing the significance of discovered data relationships, for pruning out coincidental or spurious relationships, and for ranking and presenting datasets to the end-user.
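To make the idea of "assessing the significance of discovered data relationships" concrete, the sketch below applies a standard permutation test to a correlation between two numeric attributes. This is a generic statistical technique swapped in for illustration; the abstract does not say which significance methods the project uses, and the names `pearson` and `permutation_p_value`, the default number of permutations, and the fixed seed are all assumptions.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def permutation_p_value(x, y, num_permutations=1000, seed=0):
    """Estimate how often a shuffled pairing is at least as correlated as the real one."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(num_permutations):
        # Shuffling y breaks any real link between x and y.
        rng.shuffle(shuffled)
        if abs(pearson(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (num_permutations + 1)


if __name__ == "__main__":
    # Hypothetical attributes from a user's table and a discovered dataset.
    x = list(range(30))
    y = [2 * v + 1 for v in x]
    print("p-value:", permutation_p_value(x, y))
```

A small p-value suggests the relationship is unlikely to be a coincidence of the pairing, so a ranking step could demote candidate datasets whose relationships do not pass such a check. When many candidates are tested at once, some correction for multiple comparisons would also be needed.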

Finally, the project will contribute a formalism for the dataset search problem that supports a wide range of findability queries based on dataset relationships. Active plans for engaging high-school students in STEM-related activities are also detailed.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

All Grantees

New York University
