Completed STANDARD GRANT National Science Foundation (US)

III: Small: Large-Scale High Dimensional Dense Vector Management

$6M USD

Funder	National Science Foundation (US)
Recipient Organization	Rutgers University New Brunswick
Country	United States
Start Date	Sep 01, 2022
End Date	Aug 31, 2025
Duration	1,095 days
Number of Grantees	1
Roles	Principal Investigator
Data Source	National Science Foundation (US)
Grant ID	`2212629`

Grant Description

Real-world objects such as images and documents often contain rich metadata information. In addition, the rapid development of machine learning, especially deep learning, in recent years makes it possible to extract meaningful relationships between real-world objects and encode them in numerical representations. In this way, the semantics of objects can be conveniently processed by computers.

The numerical representations of semantics play an important role in many data science and artificial intelligence applications, such as face recognition, image retrieval, video understanding, recommender systems, text analysis, and knowledge-base management. In these applications, the numerical representations of real-world objects and their associated metadata are usually jointly queried.

While metadata management and representation management have each been investigated extensively on their own, jointly managing metadata and representations is under-investigated, which makes it difficult to do in practice. Unfortunately, the large data volume and the notorious "curse of dimensionality" phenomenon, which makes all high-dimensional data objects appear far apart, make the joint management of metadata and representations challenging.

To support the ability of applications to work with both traditional and numerical representations of data, this project will study how to leverage the synergy between them. If successful, this project will advance the development of science and technology by providing new knowledge about data management. Moreover, despite being widely used, metadata and representations are still largely managed by individual application developers.

Without careful implementation, the performance can hardly meet the needs of a wide variety of potential users. This project will deliver an end-to-end data system that relieves machine learning practitioners and application developers of the burden of managing, on their own, the representations and metadata that their programs create. In addition, this project includes curriculum development and student training at Rutgers University to amplify the impact of the work.

Large-scale high dimensional dense vectors are ubiquitous nowadays due to the rapid development of representation learning (e.g., the learned feature vectors from well-established machine-learning systems such as word2vec, doc2vec, node2vec, graph2vec, item2vec, etc.). They play an important role in many applications in areas such as data mining, natural language processing, computer vision, information retrieval, and recommendations.

However, large-scale high-dimensional dense vectors are notorious for being hard to query efficiently due to the well-known "curse of dimensionality" phenomenon. Existing research on high-dimensional dense vector management mainly focuses on approximate nearest-neighbor search (ANNS). A few widely used, compute-intensive dense vector queries, by contrast, are under-examined by the research community and not well supported by existing systems.

This project will study three of them: multi-modal ANNS, parallel vector similarity join, and rank estimation. Specifically, multi-modal ANNS are queries involving both dense vectors (e.g., vector representations of product images or documents) and their structured attributes (e.g., product price or last edit time). Given a collection of dense vectors, the vector similarity join connects every vector with its nearest neighbors.
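As a rough illustration of the first two query types, the sketch below runs an exact, brute-force version of each on a toy catalog. It is not the project's method: the project targets approximate, indexed, and parallel algorithms, whereas this is a plain linear scan. The names `filtered_knn`, `knn_self_join`, the `catalog` data, and the `price` attribute are hypothetical and chosen only for the example.

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def filtered_knn(query, items, predicate, k):
    """Exact k nearest neighbors of `query` among items whose attributes pass `predicate`.

    `items` is a list of (item_id, vector, attributes) triples (assumed layout).
    """
    candidates = (
        (euclidean(query, vec), item_id)
        for item_id, vec, attrs in items
        if predicate(attrs)
    )
    return [item_id for _, item_id in heapq.nsmallest(k, candidates)]


def knn_self_join(items, k):
    """Map every item id to the ids of its k nearest other items (quadratic scan)."""
    result = {}
    for item_id, vec, _ in items:
        others = (
            (euclidean(vec, other_vec), other_id)
            for other_id, other_vec, _ in items
            if other_id != item_id
        )
        result[item_id] = [other_id for _, other_id in heapq.nsmallest(k, others)]
    return result


# Hypothetical toy catalog: 2-D vectors, each with a structured price attribute.
catalog = [
    ("a", (0.0, 0.0), {"price": 5}),
    ("b", (1.0, 0.0), {"price": 50}),
    ("c", (0.0, 2.0), {"price": 8}),
    ("d", (5.0, 5.0), {"price": 3}),
]

# Query that combines a vector with a structured attribute condition.
cheap_neighbors = filtered_knn((0.9, 0.1), catalog, lambda attrs: attrs["price"] < 10, k=2)

# Similarity join: every vector connected to its single nearest neighbor.
joined = knn_self_join(catalog, k=1)
```

In this toy data, item `b` is the closest vector to the query but fails the price condition, so the filtered result differs from a plain nearest-neighbor search. The quadratic cost of the join loop is the expense that motivates parallel algorithms for this operation.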

To deal with the huge computational cost of this operation, this project will study lock-free, massively parallel algorithms on CPUs and GPUs. Rank estimation approximates the rank of a data vector (e.g., vector representations of items in recommendation or documents in information retrieval) in a set of data vectors ordered by their distance to a relevant vector (e.g., a user who purchased the item or a keyword query related to the document).
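To make the rank estimation query concrete, the sketch below compares an exact rank computation with a simple sampling-based estimate. Uniform random sampling is an assumption made for illustration only and is not claimed to be the estimator this project studies. The function names and the synthetic data are hypothetical.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_rank(query, target, vectors):
    """1-based position of `target` when `vectors` are ordered by distance to `query`."""
    target_dist = euclidean(query, target)
    return 1 + sum(1 for v in vectors if euclidean(query, v) < target_dist)


def estimated_rank(query, target, vectors, sample_size, rng):
    """Estimate the rank from a uniform random sample instead of a full scan."""
    target_dist = euclidean(query, target)
    sample = rng.sample(vectors, sample_size)
    closer = sum(1 for v in sample if euclidean(query, v) < target_dist)
    # Scale the fraction of closer vectors in the sample up to the full set.
    return 1 + round(closer / sample_size * len(vectors))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
vectors = [tuple(rng.random() for _ in range(8)) for _ in range(2000)]
query = tuple(rng.random() for _ in range(8))  # stands in for the "relevant vector"
target = vectors[0]                            # the data vector whose rank we want

true_rank = exact_rank(query, target, vectors)
approx_rank = estimated_rank(query, target, vectors, sample_size=200, rng=rng)
```

The exact version computes one distance per data vector, while the estimate computes only `sample_size` distances at the price of some error. That trade-off is what makes approximation attractive when many ranks must be computed.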

Such a scheme is useful in machine-learning model evaluation. The long-term goal of this project is to build an end-to-end system to make large-scale dense vector management transparent to machine-learning practitioners and application developers.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

All Grantees

Rutgers University New Brunswick
