Active CONTINUING GRANT National Science Foundation (US)

CAREER: Mitigating the Risks of Metastable Failures in Distributed Systems

$2.12M USD

Funder National Science Foundation (US)
Recipient Organization University of New Hampshire
Country United States
Start Date Feb 01, 2025
End Date Jan 31, 2030
Duration 1,825 days
Number of Grantees 1
Roles Principal Investigator
Data Source National Science Foundation (US)
Grant ID 2440896
Grant Description

Large software systems, such as cloud systems, often exhibit intricate failure patterns that arise from complicated interactions between components. A metastable failure is one such pattern, characterized by a positive feedback loop that continuously worsens the failure and prevents the system from recovering. In this project, we investigate this phenomenon and work towards mitigating the risks and impacts of metastable failures on critical software infrastructure.
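To make the feedback loop concrete, the following toy simulation may help. It is an illustrative sketch, not a model taken from the project: the function `simulate`, its parameters, and the overload rule (useful throughput falls as `capacity * capacity / demand` once demand exceeds capacity, and every failed request is retried on the next tick) are all assumptions chosen for simplicity.

```python
def simulate(ticks=40, capacity=100.0, new_load=80.0,
             trigger=range(5, 10), degraded_capacity=50.0):
    """Return per-tick goodput for a toy retry-amplification model.

    All names and numbers here are hypothetical, chosen only to
    illustrate a self-sustaining overload.
    """
    retries = 0.0
    goodput_history = []
    for t in range(ticks):
        # The trigger temporarily lowers capacity (e.g., a brief outage).
        cap = degraded_capacity if t in trigger else capacity
        demand = new_load + retries
        if demand <= cap:
            goodput = demand  # everything is served
        else:
            # Assumption: overload wastes work (for example, on requests
            # that time out), so goodput shrinks as demand grows.
            goodput = cap * cap / demand
        # Assumption: every failed request is retried on the next tick.
        retries = demand - goodput
        goodput_history.append(goodput)
    return goodput_history


# A mild, short trigger versus a more severe, longer one.
mild = simulate(trigger=range(5, 6), degraded_capacity=70.0)
severe = simulate()
print("final goodput after mild trigger:  ", mild[-1])
print("final goodput after severe trigger:", severe[-1])
```

Under these assumptions, the system returns to its normal goodput after the mild trigger, while after the severe trigger the retry load keeps demand above capacity even once full capacity is restored. The trigger is gone, but the feedback loop sustains the failure.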

The project's novelty lies in several methods that address the problem holistically at different stages of its life cycle. Firstly, we study the patterns of metastable failures in various vulnerable distributed algorithms and systems. Secondly, we use the acquired knowledge to design new, more failure-resistant systems. Thirdly, we work on eliminating the feedback loops or reducing their amplification effects, so that engineers have sufficient time to deal with the problem before it becomes harder to recover from.

The project's significance extends beyond metastable failures, as several of our techniques apply to other performance failures.

Furthermore, this project provides ample educational and knowledge transfer opportunities, as we incorporate our findings into classroom instruction and design broader educational materials for practicing professionals.

The first step in mitigating the risks of metastable failures is understanding the triggers that initiate the feedback mechanisms and designing algorithms to be more resistant to them. This step requires designing algorithms that can operate under a broader range of environmental conditions, such as weaker communication and synchrony assumptions, node crashes, gray failures, and other adversarial effects.

If a failure progresses to the feedback loop stage, we work on reducing the impacts of the feedback mechanisms through intelligent admission control, work prioritization, scheduling, and budgeting. In this project, we will produce tangible methods and strategies for mitigating metastable failures in distributed systems. We will provide protocols for more tolerant state machine replication, distributed caching, distributed transactions, and microservice-style applications.
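As a rough illustration of what admission control combined with work prioritization can look like, here is a minimal sketch. It is not one of the project's protocols: the class `AdmissionController`, its fields, and the policy of reserving a few slots for high-priority work are hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class AdmissionController:
    """Hypothetical in-flight limiter that sheds low-priority work first."""
    max_in_flight: int       # assumed total number of concurrent requests
    reserved_for_high: int   # slots that only high-priority work may use
    in_flight: int = 0

    def try_admit(self, high_priority: bool) -> bool:
        """Admit a request if there is room for its priority class."""
        if high_priority:
            limit = self.max_in_flight
        else:
            limit = self.max_in_flight - self.reserved_for_high
        if self.in_flight >= limit:
            return False  # shed the request instead of queueing it
        self.in_flight += 1
        return True

    def release(self) -> None:
        """Mark one admitted request as finished."""
        if self.in_flight > 0:
            self.in_flight -= 1


# Usage: low-priority work is rejected first as the server fills up.
ctrl = AdmissionController(max_in_flight=4, reserved_for_high=1)
decisions = [ctrl.try_admit(high_priority=False) for _ in range(4)]
print(decisions)
print(ctrl.try_admit(high_priority=True))
```

Rejecting work early, rather than queueing it until it times out, is one common way to keep a server from spending capacity on requests that will fail anyway. Which limits and priority classes are appropriate depends on the system.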

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

All Grantees

University of New Hampshire
