Active Studentship (UKRI Gateway to Research)

Scaling Unsupervised Environment Design


Funder: Engineering and Physical Sciences Research Council
Recipient Organization: University of Oxford
Country: United Kingdom
Start Date: Sep 30, 2023
End Date: Mar 30, 2027
Duration: 1,277 days
Number of Grantees: 2
Roles: Student; Supervisor
Data Source: UKRI Gateway to Research
Grant ID: 2888076
Grant Description

Reinforcement learning (RL) is a subfield of machine learning in which an agent (e.g. an autonomous vehicle) learns by acting in an environment (e.g. a real road or a simulation of one). Despite great progress on complex video games (Atari, Go, StarCraft), RL has not yet been successfully applied to many real-world problems. The root cause is the inability of RL agents to generalise to unseen scenarios.

Specifically, an RL agent trained in simulation does not transfer well when deployed in the real world, owing to the inevitable inaccuracies of simulation (note that, given the large volume of training data needed and the potential dangers, it is often impractical to train an agent in the real world).

Recent pioneering work has demonstrated significant empirical generalisation benefits from training a teacher that learns to propose high-quality scenarios (e.g. road layouts) for the agent to train on, mirroring results from supervised learning that show the importance of data quality for generalisation. A limitation of this work is that the teacher must learn from a sparse and noisy signal, resulting in low sample efficiency and necessitating large computational resources; consequently it has only been successfully applied to very simple problems.
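The teacher-student scheme described above can be sketched in a few lines. This is a minimal illustrative sketch, not the project's actual method: the scenario representation, the `propose`/`evaluate` functions, and the use of regret (the gap between an optimal policy's return and the agent's) as the teacher's scoring signal are all simplifying assumptions made here for clarity.

```python
import random

def approx_regret(agent_return, optimal_return):
    """Regret: the gap between an optimal policy's return and the agent's."""
    return optimal_return - agent_return

def collect_scenarios(propose, evaluate, n_proposals, keep, seed=0):
    """Have the teacher propose `n_proposals` scenarios, score each by the
    student's approximated regret, and keep the `keep` highest-regret ones
    (the scenarios with the most remaining learning potential)."""
    rng = random.Random(seed)
    scored = []
    for _ in range(n_proposals):
        scenario = propose(rng)
        agent_ret, optimal_ret = evaluate(scenario)
        scored.append((approx_regret(agent_ret, optimal_ret), scenario))
    scored.sort(key=lambda rs: rs[0], reverse=True)
    return [s for _, s in scored[:keep]]

# Toy usage: a scenario is just a road "difficulty" value; the student
# does worse on harder roads, so higher difficulty means higher regret.
propose = lambda rng: round(rng.uniform(0.0, 1.0), 3)
evaluate = lambda d: (1.0 - d, 1.0)   # (agent return, optimal return)
curriculum = collect_scenarios(propose, evaluate, n_proposals=20, keep=5)
```

The sparse-and-noisy-signal problem mentioned above shows up here directly: each scenario yields a single scalar score, so many proposals are needed before the teacher's ranking becomes reliable.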

To reduce signal noise, I have proposed methods that encourage the teacher to maintain a diverse set of scenarios, using metrics based on approximated surprise, ease of discrimination, and distance in a learned latent space. I also propose a novel data augmentation method in which scenarios are decomposed into a set of 'sub-scenarios', expanding the training data at minimal computational cost.
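One plausible reading of the sub-scenario decomposition is sketched below, assuming (purely for illustration; the description does not specify the decomposition) that a scenario is an ordered sequence of road segments and that any contiguous sub-sequence is itself a valid, shorter scenario.

```python
def decompose(scenario, min_len=2):
    """Split one scenario (here, an ordered list of road segments) into
    every contiguous 'sub-scenario' of at least `min_len` segments.
    Each sub-scenario is a valid shorter training scenario, so the
    training set grows at almost no computational cost."""
    n = len(scenario)
    return [scenario[i:j]
            for i in range(n)
            for j in range(i + min_len, n + 1)]

# A 4-segment road yields 6 sub-scenarios with min_len=2.
road = ["straight", "curve", "junction", "roundabout"]
subs = decompose(road)
```

Note the augmentation is combinatorial: a scenario of n segments yields O(n^2) contiguous sub-scenarios, each obtained by slicing rather than by any new simulation or learning.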

Finally, the current state-of-the-art method trains the teacher by applying random perturbations. I suggest a method for targeted perturbations: constantly approximate the agent's regret (the difference between how well it performed at the task and how well an optimal agent would have performed) and apply perturbations where this is lowest. All of these techniques aim to improve the efficiency of the overall process, reducing the resources needed and opening this powerful technique up to more complex domains, benefiting real-world applications such as autonomous driving.
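The targeted-perturbation idea can be sketched as follows, keeping the description's criterion of perturbing where approximated regret is lowest (i.e. scenarios the agent has essentially mastered). The scenario encoding, the regret proxy, and the perturbation function are all illustrative assumptions, not the project's actual design.

```python
import random

def perturb_lowest_regret(scenarios, regret_of, perturb, rng):
    """Targeted (rather than random) perturbation: find the scenario
    where the agent's approximated regret is lowest and perturb it,
    producing a new scenario near the edge of the agent's competence."""
    base = min(scenarios, key=regret_of)
    return base, perturb(base, rng)

# Toy usage: scenarios are difficulty values, the agent is best on easy
# roads (low difficulty => low regret), and a perturbation nudges the
# mastered scenario's difficulty upward.
rng = random.Random(0)
scenarios = [0.9, 0.3, 0.6]
regret_of = lambda d: d                       # proxy: harder => higher regret
perturb = lambda d, r: d + r.uniform(0.0, 0.2)
base, new_scenario = perturb_lowest_regret(scenarios, regret_of, perturb, rng)
```

Compared with random perturbation, each edit here is spent on a scenario whose regret estimate indicates it is already solved, which is where a small change is most likely to yield a new scenario of useful difficulty.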

It should be noted that, while I use autonomous driving as a running example, the methods being developed will be generalisable to any RL problem and will be evaluated over a diverse range of environments. This project falls within the EPSRC Artificial intelligence technologies research area.

All Grantees

University of Oxford
