Active STANDARD GRANT National Science Foundation (US)

CIF: Small: Tackling Demand and Service Uncertainty in Multi-Access Parallel Computing

$5.65M USD

Funder National Science Foundation (US)
Recipient Organization Carnegie-Mellon University
Country United States
Start Date Dec 01, 2024
End Date Nov 30, 2027
Duration 1,094 days
Number of Grantees 2
Roles Principal Investigator; Co-Principal Investigator
Data Source National Science Foundation (US)
Grant ID 2428569
Grant Description

Modern computing systems, especially those supporting machine learning applications, are increasingly burdened by highly variable and unpredictable workloads and service capabilities. For example, machine learning tasks such as generating movie recommendations or interacting with large language models exhibit significant variation in arrival times, user characteristics, and computational demands.

Service variability and uncertainty also become pervasive as computing systems scale, due to differences in hardware generations and the dynamic sharing of resources among many applications. These challenges render traditional resource allocation algorithms ineffective, since such algorithms rely on knowing specific details about job demand and server capabilities.

This project aims to advance fundamental knowledge and innovate algorithm design for resource allocation in modern computing systems, addressing the challenges posed by high variability and uncertainty in both demand and service. The resulting algorithms aim to improve the performance and energy efficiency of data centers, thereby reducing their carbon footprint while maintaining flexibility, affordability, and scalability in computing services.

Additionally, the project will promote interdisciplinary collaboration and educational initiatives, offering mentorship opportunities for underrepresented groups in STEM and contributing to high school and undergraduate curricula in data science and machine learning.

Demand and service have become highly heterogeneous and unpredictable in modern computing systems, especially those serving machine learning applications. Such uncertainty poses great challenges to developing performant resource allocation algorithms. Existing performance analyses and algorithms based on traditional queueing and stochastic-systems methods, which rely on knowledge of arrival and service rates, are ineffective in these modern ML-workload-dominated computing systems.
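As an illustrative example (not from the grant text), the reliance of classical queueing analysis on known rates is visible in even the simplest model: the mean response time of a stable M/M/1 queue is T = 1/(mu - lambda), which cannot be evaluated at all without exact values for the arrival rate lambda and service rate mu:

```python
def mm1_mean_response_time(arrival_rate: float, service_rate: float) -> float:
    """Mean response time T = 1 / (mu - lambda) for a stable M/M/1 queue.

    The formula requires exact knowledge of both rates -- precisely the
    assumption that breaks down under the demand and service uncertainty
    this project targets.
    """
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: arrival rate must be below service rate")
    return 1.0 / (service_rate - arrival_rate)

# With lambda = 8 jobs/s and mu = 10 jobs/s, mean response time is 0.5 s.
print(mm1_mean_response_time(8.0, 10.0))  # 0.5
```

If either rate is misestimated, the prediction degrades sharply near saturation (lambda close to mu), which is one way uncertainty undermines rate-based resource allocation.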

Although significant progress has been made from the systems perspective to address these challenges, there is as yet no theoretical foundation that offers a deeper understanding of the fundamental limits of system performance and guides the design of resource allocation algorithms. This project aims to make fundamental advances in queueing and scheduling algorithms for modern computing systems with machine learning workloads.

It focuses on two dimensions of uncertainty in computing systems, namely, demand uncertainty (Thrust 1) and service uncertainty (Thrust 2). The proposed research brings two new techniques to the design of computing servers and resource allocation algorithms: 1) coding-theoretic techniques to design more flexible servers to handle multiple job types and heterogeneous demands without having to overprovision resources, and 2) online-learning-based job scheduling strategies that dynamically estimate service capabilities.
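To make the second technique concrete, here is a hypothetical sketch (not the grant's proposed algorithm) of an online-learning job dispatcher in the spirit of Thrust 2: a UCB1-style rule routes each job to one of several servers whose service rates are unknown, maintaining empirical estimates and an exploration bonus so that faster servers are discovered without prior knowledge of their capabilities. The class name, the two-server simulation, and the noise model are all illustrative assumptions:

```python
import math
import random

class UCBScheduler:
    """Illustrative UCB1-style dispatcher: learns unknown server speeds online."""

    def __init__(self, num_servers: int):
        self.counts = [0] * num_servers        # jobs sent to each server
        self.mean_speed = [0.0] * num_servers  # empirical service-rate estimates

    def pick_server(self, t: int) -> int:
        # Try every server once, then pick the highest UCB index:
        # empirical mean + exploration bonus that shrinks with observations.
        for i, c in enumerate(self.counts):
            if c == 0:
                return i
        ucb = [m + math.sqrt(2 * math.log(t) / c)
               for m, c in zip(self.mean_speed, self.counts)]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def record(self, server: int, observed_speed: float) -> None:
        # Incremental update of the server's estimated service rate.
        self.counts[server] += 1
        n = self.counts[server]
        self.mean_speed[server] += (observed_speed - self.mean_speed[server]) / n

# Toy simulation: server 1 is truly twice as fast as server 0.
random.seed(0)
true_rates = [1.0, 2.0]
sched = UCBScheduler(num_servers=2)
for t in range(1, 501):
    s = sched.pick_server(t)
    # Noisy throughput observation around the server's true (hidden) rate.
    sched.record(s, true_rates[s] + random.uniform(-0.2, 0.2))
# After enough jobs, the faster server receives the bulk of the traffic.
print(sched.counts)
```

The point of the sketch is that no arrival or service rate is given up front: the dispatcher's only inputs are its own noisy observations, matching the project's premise that rates must be estimated dynamically rather than assumed known.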

The research exploits the cross-pollination of ideas among queueing theory, information/coding theory, and online learning.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

All Grantees

Carnegie-Mellon University
