Active CONTINUING GRANT National Science Foundation (US)

CAREER: Low-Effort Runtime Assurance for Robust Cloud Systems

$2.74M USD

Funder National Science Foundation (US)
Recipient Organization University of Virginia Main Campus
Country United States
Start Date Feb 01, 2025
End Date Jan 31, 2030
Duration 1,825 days
Number of Grantees 1
Roles Principal Investigator
Data Source National Science Foundation (US)
Grant ID 2441284
Grant Description

Cloud systems are the backbone of modern societies. Ensuring their robustness is paramount. Despite significant efforts to improve cloud availability, today's cloud systems are increasingly inadequate at managing emerging failure modes.

One example is silent semantic failures, which violate system semantics without generating error signals. The current practice relies on statistical methods to analyze system metrics (e.g., logs, resource usage, I/O), often leaving silent semantic failures undetected and causing huge damage. This gap calls for a paradigm shift from merely monitoring metrics to verifying execution during failures at runtime.

However, state-of-the-art runtime checking solutions require high manual effort for writing semantic checkers, a time-consuming and error-prone process even for small programs. Consequently, few of these solutions have been adopted in production systems. Their absence jeopardizes the stability of critical infrastructure and leads to significant economic losses.

The overall objective of this proposal is to develop a framework that provides runtime assurance for detecting, diagnosing, mitigating, and preventing cloud failures with low developer effort. The central hypothesis is that the bottleneck of manual effort arises from the isolation between runtime checkers and other system components. The project’s novelty lies in leveraging automated reasoning to systematically extract insights from existing system resources, elevating checker construction from isolated, labor-intensive work to a continuous and integrated activity throughout cloud development.

The project's broader significance lies in its potential to greatly reduce the impact of cloud failures, minimizing financial losses and raising service availability to a new standard. Leveraging the PI’s prior experience, the project will deploy and validate the developed techniques in collaboration with partners including Microsoft and Amazon.

The specific aims of this research include synergistic thrusts for handling cloud failures end-to-end. First, because timely detection is the initial step in dealing with failures, the project synthesizes semantic checkers from test cases to detect ongoing silent failures. Second, since the root cause of a failure often hides among numerous services, the project models system dynamics and helps developers verify their failure hypotheses.

This project further proposes a systematic approach to mitigate failures without side effects on production services. It uses a novel dry-run execution on transformed system code to verify the consequences of failure mitigation, which reduces the risk of mitigation actions. Lastly, the project deploys prevention mechanisms that verify system execution against the triggering conditions of past failures to avoid recurrence of similar incidents.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

All Grantees

University of Virginia Main Campus
