Completed STANDARD GRANT National Science Foundation (US)

CNS Core: Small: Integrating Real-Time Learning and Control for Large and Dynamic Networked Computer Systems

$5.2M USD

Funder	National Science Foundation (US)
Recipient Organization	Purdue University
Country	United States
Start Date	Oct 01, 2021
End Date	Sep 30, 2025
Duration	1,460 days
Number of Grantees	3
Roles	Principal Investigator; Former Principal Investigator; Former Co-Principal Investigator
Data Source	National Science Foundation (US)
Grant ID	`2113893`

Grant Description

Large computer and network systems (such as data centers) are the workhorses driving our information society. However, they are also increasingly difficult to control and operate due to their enormous size, fast-changing workloads, and significant uncertainty in resource requirements and availability. Traditional approaches to control and optimization rely on carefully constructed models of the systems under study, but they become insufficient in such a fast-changing environment, where crucial components of the system model are either unknown or constantly changing.

Instead, this project aims to develop new methods that can quickly learn an updated model from fresh real-time data, and that integrate such real-time learning with real-time control to improve the efficiency, adaptability, and quality-of-service (QoS) of large-scale and dynamic networked computer systems. Specifically, the project focuses on the operation of large data centers serving big-data analytics and deep-learning training workloads, and develops new real-time learning and stochastic control policies that are not only efficient, but also scalable, interpretable, and adaptive.

The intellectual merits include: (i) real-time learning and control policies that can learn, from real-time feedback, server-dependent features of the computing and network jobs, to greatly improve the throughput of data centers running large and heterogeneous workloads, reduce job completion times, and meet service deadlines; and (ii) real-time learning and control policies tailored to the unique features of deep-learning training workloads, which can quickly estimate the total training time and the dependency across heterogeneous processing units, to optimize both throughput and delay.

The proposed research has the potential to have a lasting impact on knowledge discovery and education. The results could enable data centers to run jobs faster and complete them sooner, and therefore benefit the computing industry, both by improving the overall efficiency of data centers running diverse and fast-changing workloads, and by improving the satisfaction of users who rely on data centers for business decisions and data analytics.

The research findings may contribute to the general theory of both online learning and stochastic control, which will also be useful for other computer and network systems with both uncertain system dynamics and uncertain agent features, such as wireless networks and online service platforms. Students on the project will be trained in both theoretical tools (including online learning, stochastic control, and data analytics) and system-building skills (including cluster computing and data-center networking), which are essential for the future big-data economy.

Further, the outreach activity integrated with the research will broaden high school students' knowledge of the key principles of online learning and big data.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

All Grantees

Purdue University
