| Funder | National Science Foundation (US) |
|---|---|
| Recipient Organization | University of Kansas Center for Research Inc |
| Country | United States |
| Start Date | Oct 01, 2025 |
| End Date | Sep 30, 2030 |
| Duration | 1,825 days |
| Number of Grantees | 1 |
| Roles | Principal Investigator |
| Data Source | National Science Foundation (US) |
| Grant ID | 2441633 |
Resource scheduling is a critical component of high-performance computing (HPC) systems. Despite extensive literature on scheduling, new challenges continue to arise from advances in hardware and software and from evolving models, metrics, and performance demands. Today’s HPC systems operate on an unprecedented scale, presenting significant challenges for resource management, particularly when facing uncertainty introduced by emerging application characteristics and system-level complexities.
Existing schedulers lack robust mechanisms to effectively handle uncertainty, limiting their ability to achieve optimal performance. This project takes on the grand challenge of scheduling HPC resources under uncertainty by introducing an integrated approach that combines algorithm design and machine learning (ML). The approach leverages the rigor of algorithmic analysis to provide performance guarantees while utilizing ML’s predictive capabilities to manage uncertainty effectively.
The anticipated outcome is a substantial enhancement to current HPC schedulers, enabling more efficient execution of a diverse range of scientific applications in fields such as neuroscience, medical research, climate modeling, and artificial intelligence. Additionally, the project includes a series of synergistic activities, including outreach programs, curriculum development, and student recruitment, aimed at engaging students from K-12 through graduate levels.
These efforts focus particularly on underrepresented and underserved communities, offering research opportunities that foster success in STEM and CS education.
Technically, this project aims to design, implement, and evaluate scheduling algorithms that integrate ML prediction models to enhance efficiency. The focus will be on addressing three primary sources of uncertainty: (1) inherent runtime variability of emerging applications; (2) resource contention in job co-scheduling; and (3) structural variations within dynamic workflows.
These aspects represent uncertainties across temporal, spatial, and structural dimensions, all of which demand solutions due to their growing prevalence in modern HPC environments. Algorithmically, approximation and semi-online algorithms will be developed to provide performance guarantees relative to theoretical lower bounds for metrics such as job completion time and resource utilization.
On the ML front, various models, including those based on regression and reinforcement learning, will be trained to deliver accurate predictions for job runtime, performance degradation, and structural variability. A key ambition of this project is to establish an incubation framework that enables the effective integration of heuristic-based algorithms and data-driven ML models.
This approach aims to achieve a level of performance that neither paradigm could accomplish independently. The framework will offer a novel perspective on resource management and potentially set the stage for future HPC advancements.
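As a rough illustration of pairing a guaranteed algorithm with learned predictions, the sketch below runs classic greedy list scheduling with jobs ordered by predicted runtime, then compares the resulting makespan against a simple lower bound. It is not the project's method: the function names, the job data, and the "predicted" runtimes (standing in for the output of an ML model) are all hypothetical. List scheduling is used here because its makespan stays within a factor of (2 - 1/m) of the lower bound on m machines whatever the job order, so inaccurate predictions can degrade quality but cannot break the guarantee.

```python
import heapq


def list_schedule(predicted, actual, machines):
    """Greedy list scheduling on identical machines.

    Jobs are ordered by *predicted* runtime (longest first); each job then
    starts on whichever machine becomes free earliest, and occupies it for
    its *actual* runtime. Returns the makespan (latest finish time).
    """
    order = sorted(range(len(actual)), key=lambda j: predicted[j], reverse=True)
    free_at = [0.0] * machines  # min-heap of times at which machines become free
    heapq.heapify(free_at)
    for j in order:
        start = heapq.heappop(free_at)  # earliest-available machine
        heapq.heappush(free_at, start + actual[j])
    return max(free_at)


def makespan_lower_bound(actual, machines):
    """No schedule can beat the average load or the longest single job."""
    return max(sum(actual) / machines, max(actual))


# Hypothetical data: true runtimes and imperfect predictions for six jobs.
actual = [7.0, 5.0, 4.0, 3.0, 3.0, 2.0]
predicted = [6.0, 5.5, 3.0, 4.0, 2.5, 2.0]
machines = 3

makespan = list_schedule(predicted, actual, machines)
lower = makespan_lower_bound(actual, machines)
ratio_bound = 2 - 1 / machines  # classic list-scheduling guarantee

print(f"makespan={makespan}, lower bound={lower}, ratio={makespan / lower:.3f}")
```

Better predictions bring the job order closer to true longest-first, which tends to tighten the ratio in practice, while the worst-case bound holds regardless.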
This project is jointly funded by Software and Hardware Foundations and the Established Program to Stimulate Competitive Research (EPSCoR).
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.