| Funder | National Science Foundation (US) |
|---|---|
| Recipient Organization | Princeton University |
| Country | United States |
| Start Date | Sep 01, 2021 |
| End Date | Aug 31, 2025 |
| Duration | 1,460 days |
| Number of Grantees | 2 |
| Roles | Principal Investigator; Co-Principal Investigator |
| Data Source | National Science Foundation (US) |
| Grant ID | 2107048 |
Recent Artificial Intelligence (AI) advances have brought us closer to the possibility of important and exciting real-world applications, ranging from robot assistants for the elderly or differently-abled to large-scale video analysis of footage from police body-worn cameras to examine police-civilian interactions. Such applications require AI models to understand both visual and natural language cues.
However, the state of vision-and-language technology is still not quite ready for these scenarios. Current visual recognition models appear to recognize many different objects but lack an understanding of the interconnection and structure of the visual world. Current image captioning systems output reasonable but completely generic image descriptions.
Modern visual question answering systems are not robust to simple changes like synonyms or word rearrangements. This research will lead to fundamental advances in visual recognition and natural language understanding, laying the groundwork for more effective human-machine collaboration.
The goal of this research is to move towards a tighter, more accurate and contextual integration of visual recognition and natural language processing. This involves addressing three key challenges: (1) enabling accurate and scalable grounding by establishing robust bi-directional connections between visual input and natural language tokens; (2) improving generalization of vision-and-language models to novel concepts and tasks; and (3) enabling contextual reasoning to allow models to effectively adapt to human or task-specific needs.
The unifying theme is that all three challenges require innovation not only in modeling but also in reliable and insightful benchmarking: current evaluation frameworks are insufficient to drive progress in this space. The roadmap is to redesign existing benchmarks and evaluation paradigms, use the newly formulated metrics to identify the shortcomings of existing systems, and rely on these insights to drive deep learning modeling innovations.
This research uses the team’s expertise in designing multi-modal models for vision and language as well as in constructing effective large-scale benchmarks. The findings will be disseminated through technical workshops, open-access publications, and open-source code. They will also be integrated into undergraduate, graduate, and K-12 curricula through collaboration with foundations like AI4ALL.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.