Active STANDARD GRANT National Science Foundation (US)

Elements: A Sustainable, Resource-Efficient Cyberinfrastructure for Notebook Interactive ML Training Workloads

$6M USD

Funder National Science Foundation (US)
Recipient Organization University of Virginia Main Campus
Country United States
Start Date Sep 15, 2024
End Date Aug 31, 2027
Duration 1,080 days
Number of Grantees 2
Roles Principal Investigator; Co-Principal Investigator
Data Source National Science Foundation (US)
Grant ID 2411009
Grant Description

Notebooks are general-purpose programming platforms widely used in machine learning (ML), artificial intelligence (AI), data science, and data analytics across almost every science and engineering field. Although Notebooks serve a wide range of disciplines, the dominant production Notebook workload is interactive ML training (IMLT). To guarantee high interactivity, modern Notebook services typically allocate and reserve GPU resources for actively running Notebook sessions.

These Notebook sessions are long-running but characterized by intermittent and sporadic GPU usage. Consequently, during most of their lifetimes, Notebook sessions do not use the reserved GPUs, resulting in extremely low GPU utilization and prohibitively high cost. This project aims to build a new Notebook platform solution for IMLT workloads to address these issues.
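To make the utilization problem concrete, the following is a minimal sketch of how GPU utilization for a single session could be computed. The function name, the session length, and the burst intervals are hypothetical values chosen for illustration; they are not measurements from the project.

```python
def gpu_utilization(session_seconds, busy_intervals):
    """Fraction of a session's lifetime during which its reserved GPU was busy.

    busy_intervals is a list of non-overlapping (start, end) pairs in seconds.
    """
    if session_seconds <= 0:
        raise ValueError("session_seconds must be positive")
    busy = sum(end - start for start, end in busy_intervals)
    return busy / session_seconds


# Hypothetical 8-hour session with three short training bursts (in seconds).
session = 8 * 3600
bursts = [(600, 1500), (7200, 8100), (20000, 21800)]
utilization = gpu_utilization(session, bursts)
print(f"{utilization:.1%}")  # prints 12.5%
```

In this made-up case the GPU is reserved for eight hours but busy for only one, so seven of the eight reserved GPU-hours go unused.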

If successful, the project will provide an efficient and interactive Notebook platform that significantly reduces GPU resource wastage. The project will advance understanding of large-scale cluster computing systems and yield insights into achieving high carbon efficiency and sustainability in large-scale GPU computing infrastructure. The integrated educational plan will create a new, versatile educational platform.

This will include new pedagogical tools, new courses, as well as a proof-of-concept carbon-efficient Notebook service built on energy-efficient computers. This initiative aims to provide graduate and undergraduate students with multidisciplinary research and education experiences and develop outreach activities for K-12 students.

This proposal rethinks resource management for large-scale Notebook IMLT workloads by designing a novel, resource-efficient, and sustainable cyberinfrastructure called REITOS. The research is organized around four key thrusts: (1) REITOS will develop distributed Notebook algorithms that replicate the Notebook kernel state for high availability and high interactivity. (2) Distributed Notebooks will enable a new way of oversubscribing and dynamically sharing a significantly smaller pool of GPU resources. (3) REITOS proposes new GPU cluster scheduling algorithms to dynamically preempt or migrate Notebook processes in cluster setups. (4) The project will establish a sustainable REITOS community by developing and maintaining a healthy GitHub community project that will encompass the entire REITOS ecosystem.
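The oversubscription idea in thrust (2) can be illustrated with a toy sketch: more sessions than GPUs, with a GPU bound to a session only while one of its cells is running. This is not REITOS's actual algorithm, which the abstract does not specify; the class `SharedGpuPool` and its methods are hypothetical names, and the first-come, first-served queue is an assumption made for simplicity.

```python
from collections import deque


class SharedGpuPool:
    """Hypothetical pool: a GPU is bound to a session only while a cell runs."""

    def __init__(self, num_gpus):
        self.free = deque(range(num_gpus))  # idle GPU ids
        self.bound = {}                     # session id -> GPU id
        self.waiting = deque()              # sessions queued for a GPU

    def start_cell(self, session):
        """Bind a free GPU to the session, or queue the session if none is free."""
        if session in self.bound:
            return self.bound[session]
        if self.free:
            self.bound[session] = self.free.popleft()
            return self.bound[session]
        if session not in self.waiting:
            self.waiting.append(session)
        return None

    def finish_cell(self, session):
        """Release the session's GPU and hand it to the next waiting session."""
        gpu = self.bound.pop(session)
        if self.waiting:
            self.bound[self.waiting.popleft()] = gpu
        else:
            self.free.append(gpu)


# Three sessions share two GPUs: "c" waits until "a" finishes its cell.
pool = SharedGpuPool(num_gpus=2)
results = [pool.start_cell(s) for s in ("a", "b", "c")]
pool.finish_cell("a")
print(results)     # [0, 1, None]
print(pool.bound)  # {'b': 1, 'c': 0}
```

A real system would also need to preserve kernel state across GPU rebinding and decide when to preempt or migrate a running session, which is what thrusts (1) and (3) address.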

The potential contributions of this project are multifold: REITOS will enable new capabilities that are urgently required by existing Notebook platforms, such as high availability and efficient GPU sharing. The proposed research will create new cyberinfrastructure techniques as standalone, reusable modules that independent applications can adopt.

REITOS will be deployed to active Notebook user communities at multiple facilities through community outreach and collaborations.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

All Grantees

University of Virginia Main Campus
