| Field | Value |
|---|---|
| Funder | Engineering and Physical Sciences Research Council |
| Recipient Organization | University of Oxford |
| Country | United Kingdom |
| Start Date | Sep 30, 2023 |
| End Date | Mar 30, 2027 |
| Duration | 1,277 days |
| Number of Grantees | 2 |
| Roles | Student; Supervisor |
| Data Source | UKRI Gateway to Research |
| Grant ID | 2888076 |
Reinforcement learning (RL) is a subfield of machine learning in which an agent (e.g. an autonomous vehicle) learns by acting in an environment (e.g. a real road, or a simulation of one). Despite great progress on complex games (Atari, Go, StarCraft), RL has not yet been successfully applied to many real-world problems. The root cause of this is the inability of RL agents to generalise to unseen scenarios.
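To make this setup concrete, here is a minimal sketch of the agent-environment loop using the open-source Gymnasium API; the random action is a stand-in for a learned policy, and CartPole-v1 stands in for a driving simulator.

```python
import gymnasium as gym

# Minimal agent-environment interaction loop. CartPole-v1 and the random
# action stand in for a driving simulator and a trained policy.
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # a trained agent would act on `obs`
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated

env.close()
print(f"Episode return: {total_reward}")
```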
Specifically, an RL agent trained in simulation transfers poorly when deployed in the real world, owing to the inevitable inaccuracies of simulation. (Training an agent directly in the real world is often impractical, given the large volume of training data needed and the potential dangers.)
Recent pioneering work has demonstrated significant empirical gains in generalisation from training a teacher that learns to propose high-quality scenarios (e.g. road layouts) for the agent to train on, mirroring results from supervised learning that show the importance of data quality for generalisation. A limitation of this work is that the teacher must learn from a sparse and noisy signal, resulting in low sample efficiency and demanding large computational resources; as a result, it has only been successfully applied to very simple problems.
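As a rough illustration of why this outer loop is expensive, consider the following hill-climbing sketch of a teacher; `train_agent_on`, the scenario parameterisation, and the update rule are all hypothetical stand-ins, not the cited method.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_agent_on(scenario):
    """Stand-in for the expensive inner loop: train the agent on the
    scenario, then return a sparse, noisy scalar (e.g. learning progress)."""
    return float(-np.linalg.norm(scenario - 1.0)) + rng.normal(scale=0.5)

# Outer loop: the teacher hill-climbs its proposal towards scenarios that
# yield a high training signal. Every teacher step consumes an entire
# agent-training run, and the signal is noisy; this is exactly the
# sample-efficiency bottleneck described above.
best_scenario = np.zeros(4)
best_signal = train_agent_on(best_scenario)
for _ in range(50):
    candidate = best_scenario + rng.normal(scale=0.3, size=4)
    signal = train_agent_on(candidate)
    if signal > best_signal:
        best_scenario, best_signal = candidate, signal
```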
To reduce signal noise, I have proposed methods that encourage the teacher to maintain a diverse set of scenarios, using metrics based on approximate surprise, ease of discrimination, and distance in a learned latent space. Furthermore, I propose a novel data augmentation method whereby scenarios are decomposed into a set of 'sub-scenarios', expanding the training data at minimal computational cost.
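A sketch of both ideas follows, under stated assumptions: `encode` is a hypothetical placeholder for the learned latent encoder, and the slicing rule in `decompose` is an illustrative stand-in for the actual decomposition.

```python
import numpy as np

def encode(scenario):
    """Hypothetical learned encoder mapping a scenario to a latent vector.
    Here the raw parameters serve as the 'latent'."""
    return np.asarray(scenario, dtype=float)

def diversity_score(candidate, pool):
    """Distance in latent space from the candidate to its nearest
    neighbour in the pool: higher means the candidate adds more diversity."""
    z = encode(candidate)
    return min(np.linalg.norm(z - encode(p)) for p in pool)

def decompose(scenario, window=3):
    """Hypothetical sub-scenario augmentation: slice a long scenario
    (e.g. a road made of segments) into overlapping shorter pieces,
    multiplying the training data at negligible cost."""
    return [scenario[i:i + window] for i in range(len(scenario) - window + 1)]

pool = [[0, 0, 1, 2], [3, 3, 2, 1]]
candidate = [5, 4, 3, 2]
print(diversity_score(candidate, pool))  # keep candidates with high scores
print(decompose([0, 1, 2, 3, 4]))        # [[0,1,2], [1,2,3], [2,3,4]]
```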
Finally, the current state-of-the-art method trains the teacher by applying random perturbations. I suggest a method for targeted perturbations: continually approximating the agent's regret (the difference between how well it performed at the task and how well an optimal agent would have done) and applying perturbations where this regret is lowest. All of these techniques aim to improve the efficiency of the overall process, reducing the resources needed and opening this powerful technique up to more complex domains, benefiting real-world applications such as autonomous driving.
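A schematic of regret-guided perturbation, with the same caveats: `optimal_return` would itself have to be approximated in practice (no oracle exists), and `perturb` is a hypothetical mutation operator.

```python
import numpy as np

rng = np.random.default_rng(1)

def agent_return(scenario):
    """Stand-in for evaluating the current agent on a scenario."""
    return float(-np.linalg.norm(np.asarray(scenario)))

def optimal_return(scenario):
    """Stand-in for the optimal agent's return; in practice this upper
    bound must itself be estimated, not queried exactly."""
    return 0.0

def regret(scenario):
    # Regret = optimal return minus the agent's return on the scenario.
    return optimal_return(scenario) - agent_return(scenario)

def perturb(scenario, scale=0.2):
    """Hypothetical mutation operator: a small random edit to the scenario."""
    return np.asarray(scenario) + rng.normal(scale=scale, size=len(scenario))

# Targeted perturbation: rather than mutating scenarios at random, mutate
# the scenario where the agent's regret is lowest (i.e. already mastered),
# pushing it back towards the frontier of the agent's abilities.
pool = [rng.normal(size=4) for _ in range(8)]
idx = int(np.argmin([regret(s) for s in pool]))
pool[idx] = perturb(pool[idx])
```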
It should be noted that, while I use autonomous driving as a running example, the methods being developed will be generalisable to any RL problem and will be evaluated across a diverse range of environments. This project falls within the EPSRC Artificial intelligence technologies research area.