Completed STANDARD GRANT National Science Foundation (US)

CRII: OAC: Scalability of Deep-Learning Methods on HPC Systems: An I/O-centric Approach

$1.75M USD

Funder	National Science Foundation (US)
Recipient Organization	University of Southern California
Country	United States
Start Date	Jun 01, 2021
End Date	May 31, 2023
Duration	729 days
Number of Grantees	1
Roles	Principal Investigator
Data Source	National Science Foundation (US)
Grant ID	`2105044`

Grant Description

Machine learning (ML) algorithms have became key techniques in many scientific domains over the last few years. Thanks to the recent democratization of graphics processing units (GPUs), machine learning is mostly fueled by deep learning (DL) techniques that require extensive computational capabilities and vast volumes of data. Training large deep neural networks (DNN) is compute-intensive.

However, thanks to GPUs, the cost of the compute-intensive components of the training has been reduced, while the relative cost of memory accesses has increased following the huge increase in the size of inputs datasets and in the complexity of the ML models. Due to their increasing requirements in terms of computational and memory capabilities, DNNs are now trained on distributed systems and have recently gained attention from the high-performance computing (HPC) community.

A key challenge on HPC systems at extreme scale is the communication bottleneck, as communication is much slower than the required computations and also accounts for high energy consumption on large-scale machines. A lack of a comprehensive understanding of the different trade-offs, costs, and impacts induced by ML algorithms may severely impair science discoveries and AI breakthroughs in the near future.

This project aims to address this problem by developing accurate performance models that can capture the complexity of training a DNN at scale in terms of I/O (communication) and, based on these models, producing efficient scheduling heuristics to reduce communication when training DNNs on HPC machines. Reducing data exchanges during the training phase decreases the execution time of this costly process and is likely to also reduce its energy consumption.

The training of DNNs is becoming essential for many scientific domains, so optimizing the execution of this key component will help NSF fulfill its mission to advance and promote the progress of science. The proposed research will provide researchers with performance models that are key to supporting the development of novel middleware systems for large-scale ML on HPC platforms.

Educational and outreach activities will include the development of pedagogic modules that will teach students key concepts of distributed computing and training of large neural networks and enable students to participate in workshops and conferences that serve the community.

Training large neural networks on distributed HPC systems is challenging. DNN training involves complex communications patterns with some randomness due to the optimization method used to solve the network, which most of the time is stochastic gradient descent (SGD). Most distributed ML has been designed to run on cloud infrastructures, however HPC machines exhibit different characteristics in terms of hardware with fast interconnect networks and advanced communications capabilities, such as remote direct memory access (RDMA) and, in terms of software with the usage of the message passing interface (MPI) and OpenMP parallel programming models.

This project will design performance models taking into account HPC characteristics that will give useful insights into the behavior of DNN training at scale, for example, how the data communication volume evolves with the DNN batch size or how to leverage HPC multi-layered storage, such as burst buffers, to improve DNN training performance. This project is organized around three research thrusts: (i) estimation of data movement costs when training DNNs on HPC machines (ii) augmenting performance models with energy metrics and (iii) developing bi-objective heuristics minimizing communication and energy while still providing training accuracy guarantees.

In order to address these three research thrusts, this project will adopt a simulation-driven approach. The first step will be to characterize the I/O behaviors of DNNs when trained on HPC machines. Based on the analysis of the collected data, several performance models and scheduling heuristics will be designed.

Then, a simulator of the HPC machine will be developed using the NSF-funded WRENCH project. This simulator will be calibrated with the data collected during the characterization phase. Finally, the performance models and the scheduling heuristics will be evaluated using the calibrated simulator.

The project will also leverage the simulator to continuously improve the performance models and heuristics. This project will provide scientists with models to better understand performance trade-offs arising when training large-scale neural networks on complex distributed systems.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

All Grantees

University of Southern California

Interested in applying for this grant?

Complete our application form to express your interest and we'll guide you through the process.

Apply for This Grant

CRII: OAC: Scalability of Deep-Learning Methods on HPC Systems: An I/O-centric Approach

Grant Description

All Grantees

Interested in applying for this grant?

Quick Summary

Related Grants