Completed STANDARD GRANT National Science Foundation (US)

RI: Medium: Improving grounding, generalization and contextual reasoning in vision and language models

$12M USD

Funder	National Science Foundation (US)
Recipient Organization	Princeton University
Country	United States
Start Date	Sep 01, 2021
End Date	Aug 31, 2025
Duration	1,460 days
Number of Grantees	2
Roles	Principal Investigator; Co-Principal Investigator
Data Source	National Science Foundation (US)
Grant ID	`2107048`

Grant Description

Recent Artificial Intelligence (AI) advances have brought us closer to the possibility of important and exciting real-world applications: ranging from robot assistants for the elderly or differently-abled, to large-scale video analysis of footage from police body-worn cameras to examine police-civilian interactions. Such applications require AI models to understand both visual and natural language cues.

However, the state of vision-and-language technology is still not quite ready for these scenarios. Current visual recognition models appear to recognize many different objects but lack an understanding of the interconnection and structure of the visual world. Current image captioning systems output reasonable but completely generic image descriptions.

Modern visual question answering systems are not robust to simple changes like synonyms or word rearrangements. This research will lead to fundamental advances in visual recognition and natural language understanding, laying the groundwork for more effective human-machine collaboration.

The goal of this research is to move towards a tighter, more accurate and contextual integration of visual recognition and natural language processing. This involves addressing three key challenges: (1) enabling accurate and scalable grounding by establishing robust bi-directional connections between visual input and natural language tokens; (2) improving generalization of vision-and-language models to novel concepts and tasks; and (3) enabling contextual reasoning to allow models to effectively adapt to human or task-specific needs.

The unifying theme is that all three challenges require innovation in not only modeling but also in reliable and insightful benchmarking: current evaluation frameworks are insufficient to drive progress in this space. The roadmap is to redesign existing benchmarks and evaluation paradigms, use the newly formulated metrics to identify the shortcomings in existing systems, and rely on these insights to drive the deep learning modeling innovations.

This research uses the team’s expertise in designing multi-modal models for vision and language as well as in constructing effective large-scale benchmarks. The findings will be disseminated through technical workshops, open access publications, and open-source code. They will also be integrated into undergraduate, graduate and K-12 curriculum through collaboration with foundations like AI4ALL.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

All Grantees

Princeton University

Interested in applying for this grant?

Complete our application form to express your interest and we'll guide you through the process.

Apply for This Grant

RI: Medium: Improving grounding, generalization and contextual reasoning in vision and language models

Grant Description

All Grantees

Interested in applying for this grant?

Quick Summary

Related Grants