| Field | Value |
|---|---|
| Funder | National Science Foundation (US) |
| Recipient Organization | University of Maryland, College Park |
| Country | United States |
| Start Date | Mar 01, 2021 |
| End Date | Feb 28, 2026 |
| Duration | 1,825 days |
| Number of Grantees | 1 |
| Roles | Principal Investigator |
| Data Source | National Science Foundation (US) |
| Grant ID | 2047120 |
Recent advances in machine learning (ML) are driving scientific discovery across many disciplines. This presents a unique opportunity for the parallel computing community to remove human guesswork from the performance engineering loop and instead use data-driven ML models for performance modeling, forecasting and tuning.
Analyzing data about software performance and the operational efficiency of parallel systems can identify performance anomalies and their root causes. This can transform the process of optimizing the performance of parallel software and the operational efficiency of parallel systems. By using data-driven statistical modeling based on machine learning, the impact of human error in the process can be minimized, and parallel software and systems can become truly self-tuning.
This work is leveraging and contributing to the growing body of work on ML for Systems, and brings its benefits to extreme-scale parallel software and systems. The project is also engaging high school students and training undergraduate and graduate students in parallel computing, preparing them for careers in high performance computing (HPC) to address a significant shortage of computer and computational scientists in HPC, in both industry and national laboratories.
The project is applying statistical and ML algorithms to analyze performance data, and using the trained models and insights to enable the self-tuning of performance of parallel software and systems. This work is developing a holistic methodology for accomplishing the following tasks: (1) analyze large volumes of software and system data collected over time, (2) apply machine learning to model application and system behavior, and (3) use these models to guide application, runtime and system optimization decisions that impact future executions.
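The three tasks above can be illustrated with a minimal sketch. It is not the project's method: the measurements are invented, the names `HISTORY`, `predict_runtime` and `select_config` are hypothetical, and a simple k-nearest-neighbor regressor stands in for whatever ML model would actually be trained. The sketch takes past (configuration, runtime) records, predicts the runtime of a candidate configuration from its nearest recorded neighbors, and picks the candidate with the lowest predicted runtime for a future run.

```python
import math

# (1) Hypothetical historical performance data (made up for illustration):
#     (threads, block_size) -> measured runtime in seconds.
HISTORY = [
    ((1, 64), 120.0),
    ((2, 64), 64.0),
    ((4, 64), 36.0),
    ((8, 64), 25.0),
    ((4, 128), 30.0),
    ((8, 128), 19.0),
    ((16, 128), 22.0),
    ((8, 256), 21.0),
    ((16, 256), 24.0),
]


def _distance(a, b):
    """Euclidean distance in log2 space, so 4 -> 8 is as far as 8 -> 16."""
    return math.sqrt(sum((math.log2(x) - math.log2(y)) ** 2 for x, y in zip(a, b)))


def predict_runtime(config, history=HISTORY, k=3):
    """(2) Model behavior: inverse-distance-weighted k-nearest-neighbor estimate."""
    nearest = sorted(history, key=lambda rec: _distance(config, rec[0]))[:k]
    # The small epsilon avoids division by zero for configurations already seen.
    weights = [1.0 / (_distance(config, cfg) + 1e-9) for cfg, _ in nearest]
    total = sum(weights)
    return sum(w * runtime for w, (_, runtime) in zip(weights, nearest)) / total


def select_config(candidates, history=HISTORY, k=3):
    """(3) Guide a future execution: choose the lowest predicted runtime."""
    return min(candidates, key=lambda c: predict_runtime(c, history, k))


if __name__ == "__main__":
    candidates = [(2, 64), (8, 128), (16, 256), (32, 128)]
    best = select_config(candidates)
    print("chosen configuration:", best)
```

In a real setting the history would come from monitoring data collected over many runs, and the model would likely need retraining as codes and systems evolve.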
This holistic approach of data-driven self-tuning can significantly improve the performance and portability of parallel software, and the operational efficiency of HPC and data center systems, even as codes and systems evolve. Better performance of individual jobs leads to faster scientific results and increased job throughput. This work is making advances in three key areas.
First, it is developing ML-based mechanisms to model the performance of parallel software and using such models to automatically optimize performance by selecting high-performance configurations. Second, it is developing automated methods to analyze large-scale longitudinal monitoring data from parallel systems, along with mechanisms that use trained ML models to automatically tune the operation of those systems.
Third, the first two thrusts can be used to automatically tune the performance of parallel codes as they are ported to new or future architectures, using techniques such as transfer learning. This project is leading to the development of a suite of techniques and frameworks to analyze performance-related data gathered at different levels (job, system and facility) and to make decisions that optimize various metrics related to operational efficiency.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.