Loading…
Loading grant details…
| Funder | National Science Foundation (US) |
|---|---|
| Recipient Organization | Williams College |
| Country | United States |
| Start Date | Jun 15, 2021 |
| End Date | Nov 30, 2024 |
| Duration | 1,264 days |
| Number of Grantees | 1 |
| Roles | Principal Investigator |
| Data Source | National Science Foundation (US) |
| Grant ID | 2103813 |
Given a large dataset, how can one find a similar item to a given query? Preprocessing the dataset to quickly find these similar items is called "similarity search." Similarity search is a fundamental computing problem with applications in data science, machine learning, and bioinformatics. Similarity search has been studied extensively both in theory and in practice, and the state of the art has been highly optimized.
Unfortunately, many previous techniques require a very large amount of extra storage space to help answer queries quickly. This can be a significant limitation, as storage space is limited by the hardware being used to store the dataset. Surprisingly, space-efficient similarity-search methods have only been studied in a few restricted settings.
This project seeks to close this gap, first by creating black-box methods that can be used for space-efficient similarity search in a far wider variety of settings, and then by extending known methods to achieve improved results. This project will be integrated into undergraduate courses and research opportunities at Williams College.
The project seeks to solve this problem from several new directions. First, Fiat-Naor function inversion is a cryptographic technique giving a time-space tradeoff for inverting functions. Initial results show that Fiat-Naor inversion can be used to give a black-box result that improves the space usage of a wide variety of similarity-search algorithms.
Even better, Fiat-Naor inversion can be integrated with known space-efficient similarity-search techniques to give improved query times using the same space. Furthermore, it appears that Fiat-Naor inversion interfaces particularly well with similarity search, and that tuning Fiat-Naor inversion to similarity search can give even better bounds. Second, the project will look at similarity search in the context of text strings.
Similarity search on text strings has a rich literature of distinctive techniques tailor-made to the case of strings, generally using trie-based data structures. This project seeks to combine these techniques with more recent advances made in other areas (usually using hashing), to achieve better bounds. Finally, this project is examining heuristics for edit-distance similarity search, with the goal of matching or improving the current (space-inefficient) state of the art while retaining very small space requirements.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
Williams College
Complete our application form to express your interest and we'll guide you through the process.
Apply for This Grant