| Field | Value |
|---|---|
| Funder | Engineering and Physical Sciences Research Council |
| Recipient Organization | University of Oxford |
| Country | United Kingdom |
| Start Date | Sep 30, 2023 |
| End Date | Mar 30, 2027 |
| Duration | 1,277 days |
| Number of Grantees | 2 |
| Roles | Student; Supervisor |
| Data Source | UKRI Gateway to Research |
| Grant ID | 2888076 |
Reinforcement learning (RL) is a subfield of machine learning in which an agent (e.g. an autonomous vehicle) learns by acting in an environment (e.g. a real road, or a simulation of one). Despite great progress on complex games (Atari, Go, StarCraft), RL has not yet been successfully applied to many real-world problems. The root cause of this is the inability of RL agents to generalise to unseen scenarios.
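To make this setup concrete, here is a minimal sketch of the agent-environment loop using the open-source Gymnasium API; the random action is a stand-in for a learned policy, and CartPole-v1 stands in for a driving simulator.

```python
import gymnasium as gym

# Minimal agent-environment interaction loop. CartPole-v1 and the random
# action stand in for a driving simulator and a trained policy.
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # a trained agent would act on `obs`
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated

env.close()
print(f"Episode return: {total_reward}")
```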
Specifically, an RL agent trained in simulation transfers poorly when deployed in the real world, owing to the inevitable inaccuracies of simulation. (Training an agent directly in the real world is often impractical, given the large volume of training data needed and the potential dangers.)
Recent pioneering work has demonstrated significant empirical gains in generalisation from training a teacher that learns to propose high-quality scenarios (e.g. road layouts) for the agent to train on, mirroring results from supervised learning that show the importance of data quality for generalisation. A limitation of this work is that the teacher must learn from a sparse and noisy signal, resulting in low sample efficiency and demanding large computational resources; as a result, it has only been successfully applied to very simple problems.
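As a rough illustration of why this outer loop is expensive, consider the following hill-climbing sketch of a teacher; `train_agent_on`, the scenario parameterisation, and the update rule are all hypothetical stand-ins, not the cited method.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_agent_on(scenario):
    """Stand-in for the expensive inner loop: train the agent on the
    scenario, then return a sparse, noisy scalar (e.g. learning progress)."""
    return float(-np.linalg.norm(scenario - 1.0)) + rng.normal(scale=0.5)

# Outer loop: the teacher hill-climbs its proposal towards scenarios that
# yield a high training signal. Every teacher step consumes an entire
# agent-training run, and the signal is noisy; this is exactly the
# sample-efficiency bottleneck described above.
best_scenario = np.zeros(4)
best_signal = train_agent_on(best_scenario)
for _ in range(50):
    candidate = best_scenario + rng.normal(scale=0.3, size=4)
    signal = train_agent_on(candidate)
    if signal > best_signal:
        best_scenario, best_signal = candidate, signal
```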
To reduce signal noise, I have proposed methods that encourage the teacher to maintain a diverse set of scenarios, using metrics based on approximate surprise, ease of discrimination, and distance in a learned latent space. Furthermore, I propose a novel data augmentation method whereby scenarios are decomposed into a set of 'sub-scenarios', expanding the training data at minimal computational cost.
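A sketch of both ideas follows, under stated assumptions: `encode` is a hypothetical placeholder for the learned latent encoder, and the slicing rule in `decompose` is an illustrative stand-in for the actual decomposition.

```python
import numpy as np

def encode(scenario):
    """Hypothetical learned encoder mapping a scenario to a latent vector.
    Here the raw parameters serve as the 'latent'."""
    return np.asarray(scenario, dtype=float)

def diversity_score(candidate, pool):
    """Distance in latent space from the candidate to its nearest
    neighbour in the pool: higher means the candidate adds more diversity."""
    z = encode(candidate)
    return min(np.linalg.norm(z - encode(p)) for p in pool)

def decompose(scenario, window=3):
    """Hypothetical sub-scenario augmentation: slice a long scenario
    (e.g. a road made of segments) into overlapping shorter pieces,
    multiplying the training data at negligible cost."""
    return [scenario[i:i + window] for i in range(len(scenario) - window + 1)]

pool = [[0, 0, 1, 2], [3, 3, 2, 1]]
candidate = [5, 4, 3, 2]
print(diversity_score(candidate, pool))  # keep candidates with high scores
print(decompose([0, 1, 2, 3, 4]))        # [[0,1,2], [1,2,3], [2,3,4]]
```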
Finally, the current state-of-the-art method trains the teacher by applying random perturbations. I suggest a method for targeted perturbations: continually approximating the agent's regret (the difference between how well it performed at the task and how well an optimal agent would have done) and applying perturbations where this regret is lowest. All of these techniques aim to improve the efficiency of the overall process, reducing the resources needed and opening this powerful technique up to more complex domains, benefiting real-world applications such as autonomous driving.
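A schematic of regret-guided perturbation, with the same caveats: `optimal_return` would itself have to be approximated in practice (no oracle exists), and `perturb` is a hypothetical mutation operator.

```python
import numpy as np

rng = np.random.default_rng(1)

def agent_return(scenario):
    """Stand-in for evaluating the current agent on a scenario."""
    return float(-np.linalg.norm(np.asarray(scenario)))

def optimal_return(scenario):
    """Stand-in for the optimal agent's return; in practice this upper
    bound must itself be estimated, not queried exactly."""
    return 0.0

def regret(scenario):
    # Regret = optimal return minus the agent's return on the scenario.
    return optimal_return(scenario) - agent_return(scenario)

def perturb(scenario, scale=0.2):
    """Hypothetical mutation operator: a small random edit to the scenario."""
    return np.asarray(scenario) + rng.normal(scale=scale, size=len(scenario))

# Targeted perturbation: rather than mutating scenarios at random, mutate
# the scenario where the agent's regret is lowest (i.e. already mastered),
# pushing it back towards the frontier of the agent's abilities.
pool = [rng.normal(size=4) for _ in range(8)]
idx = int(np.argmin([regret(s) for s in pool]))
pool[idx] = perturb(pool[idx])
```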
It should be noted that, while I use autonomous driving as a running example, the methods being developed will be generalisable to any RL problem and will be evaluated across a diverse range of environments. This project falls within the EPSRC Artificial intelligence technologies research area.