
Active Studentship | UKRI Gateway to Research

Learning world models from raw visual data


Funder: Engineering and Physical Sciences Research Council
Recipient Organization: University of Oxford
Country: United Kingdom
Start Date: Sep 30, 2024
End Date: Mar 30, 2028
Duration: 1,277 days
Number of Grantees: 1
Roles: Student
Data Source: UKRI Gateway to Research
Grant ID: 2922573
Grant Description

This project falls within the EPSRC information and communication technologies theme, specifically under Artificial intelligence technologies.

Introduction and context

Understanding reality through visual perception is a key frontier in artificial intelligence (AI) research, particularly as the deployment of AI systems continues to expand into diverse real-world applications, such as augmented reality, embodied AI, humanoid robotics, and self-driving cars. However, current state-of-the-art AI systems, including large language/multimodal models and image/video generators, have a limited understanding of the physics and geometry of the real world.

These models, while capable of understanding static scenes and generating exceptionally realistic images, often fail to accurately interpret and represent the complexities of our physical environment, including physically plausible motion, object permanence, and temporal consistency. This project addresses these limitations by developing novel algorithms that enhance AI's capacity to comprehend and generate representations of the real world from raw visual data (e.g., publicly available videos).

The overarching goal of this project is to improve machines' ability to perceive, reason, and act in the three-dimensional world and over time, facilitating their deployment in real-world applications.

Research Questions

The project aims to answer the following questions:

1. How can we learn robust representations of the real world from raw (i.e., minimally annotated) visual data, such as videos, without human supervision?

2. How can we efficiently generate novel 3D worlds while respecting the underlying physical laws?

Novel methodology

The primary objective of this research is to create novel AI algorithms that learn directly from unprocessed visual inputs, such as images and videos, with minimal human intervention (supervision). High-quality, well-annotated 3D data is scarce because creating 3D assets is costly, which prevents learning directly from 3D data.
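To make the idea of learning from raw video without human labels concrete, here is a minimal, illustrative sketch (not the project's actual method) of one widely used self-supervised objective: an InfoNCE-style contrastive loss that pulls embeddings of matched frame pairs together while pushing other clips apart. The random "frame features", dimensions, and temperature below are all assumptions chosen for the toy demo.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style contrastive loss over frame embeddings.

    Each anchor should be most similar to its own positive (e.g. a nearby
    frame of the same video); every other positive in the batch acts as a
    negative. Inputs are (N, D) arrays of embeddings.
    """
    # L2-normalise so dot products are cosine similarities.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature              # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # correct pairs lie on the diagonal

# Toy demo: random vectors stand in for the output of a learned video encoder.
rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 16))
loss_aligned = info_nce_loss(frames, frames)                        # matched pairs
loss_shuffled = info_nce_loss(frames, np.roll(frames, 1, axis=0))   # mismatched pairs
```

In practice the embeddings would come from a trained network and the positives from temporal augmentations of the same clip; the point of the sketch is only that the objective needs no human annotations, which is what "minimal supervision" buys.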

By focusing on raw visual perception (e.g., raw videos), possibly augmented with other modalities such as language and audio, the main goal of the project is to improve automatic understanding of the real world. The expected novel contributions may include:

1. Novel self-supervised architectures and learning objectives, tailored for representation learning from video (and other modalities).

2. Algorithms for generating 3D assets and physically plausible videos while allowing fine-grained control of the generated outputs, for instance by conditioning on the initial scene, a desired motion, or an action.

3. New datasets tailored for specific tasks, such as scene understanding or object tracking, possibly from an egocentric viewpoint.

Potential impact and applications

Better Synthetic Data: Generating realistic 3D worlds that follow physical laws allows for creating simulation environments in which robots or other autonomous systems can be trained. This speeds up development, since simulated experience can be gathered faster, more cheaply, and more safely than real-world experience. Similarly, the developed generative models can synthesize multi-view training data that may improve existing 3D reconstruction methods.
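The "learned simulator" idea behind such training environments can be sketched in a few lines: a world model is rolled out by repeatedly applying a learned transition function to a latent state, one step per action. This is only a toy illustration under stated assumptions; the linear dynamics below stand in for whatever transition model would actually be learned from video.

```python
import numpy as np

def rollout(transition, z0, actions):
    """Unroll a latent world model: starting from latent state z0, apply the
    (learned) transition function once per action to predict future states."""
    states = [np.asarray(z0, dtype=float)]
    for u in actions:
        states.append(transition(states[-1], u))
    return np.stack(states)   # shape: (len(actions) + 1, latent_dim)

# A hypothetical linear system z' = A z + B u stands in for a learned model.
rng = np.random.default_rng(0)
A = 0.9 * np.eye(4)                   # mildly contracting latent dynamics
B = 0.1 * rng.normal(size=(4, 2))     # how actions perturb the latent state
transition = lambda z, u: A @ z + B @ u

# Simulate 10 steps of a constant action from the zero state.
trajectory = rollout(transition, np.zeros(4), [np.ones(2)] * 10)
```

An agent can be trained against `rollout` exactly as it would be against a real environment, which is why a faster-than-real-time learned simulator translates directly into faster development.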

Improved Scene Understanding: Learning representations directly from videos can improve scene understanding, object recognition, and motion tracking, which are crucial for applications like augmented reality (AR) and virtual reality (VR). This can enable more immersive and interactive virtual experiences, bridging the gap between the virtual and physical worlds.

Reduced Cost of 3D Content Generation: Modern (AAA) game titles, which aim for photorealistic graphics and complex worlds, can cost anywhere from $50 million to over $500 million to develop, due to a significant amount of manual labour. Consequently, algorithms that can generate novel 3D worlds could significantly reduce game development costs.

All Grantees

University of Oxford
