| Field | Value |
|---|---|
| Funder | National Science Foundation (US) |
| Recipient Organization | University of New Hampshire |
| Country | United States |
| Start Date | Feb 01, 2025 |
| End Date | Jan 31, 2030 |
| Duration | 1,825 days |
| Number of Grantees | 1 |
| Roles | Principal Investigator |
| Data Source | National Science Foundation (US) |
| Grant ID | 2440896 |
Large software systems, such as cloud systems, often exhibit intricate failure patterns that arise from complicated interactions between components. Metastable failure is one such pattern: a positive feedback loop continuously worsens the failure and prevents the system from recovering. In this project, we investigate this phenomenon and work towards mitigating the risks and impacts of metastable failures on critical software infrastructure.
The project's novelty lies in several methods that address the problem holistically, at different stages of a failure's life cycle. Firstly, we study the patterns of metastable failures in various vulnerable distributed algorithms and systems. Secondly, we use the acquired knowledge to design new, more failure-resistant systems.
Thirdly, we work on eliminating the feedback loops or reducing their amplification effects, so that engineers have sufficient time to deal with the problem before recovery becomes more difficult. The project's broader significance goes beyond metastable failures, as several of our techniques apply to other performance failures.
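As an illustration of reducing amplification, one widely used technique is a retry budget, which caps retries at a fixed percentage of first-attempt requests so that retries cannot multiply load during an overload. The following is a minimal sketch under that assumption and is not the project's own method; the class `RetryBudget` and its parameter `retry_percent` are hypothetical names introduced here.

```python
class RetryBudget:
    """Cap retries at a fixed percentage of first-attempt requests (illustrative)."""

    def __init__(self, retry_percent: int = 10) -> None:
        # Assumption: each first attempt earns `retry_percent` hundredths of a retry.
        self.retry_percent = retry_percent
        self.credit = 0  # accumulated credit, in hundredths of one retry

    def record_request(self) -> None:
        """Call once for every first-attempt request."""
        self.credit += self.retry_percent

    def try_retry(self) -> bool:
        """Return True and spend credit if a retry is allowed, else False."""
        if self.credit >= 100:
            self.credit -= 100
            return True
        return False


budget = RetryBudget(retry_percent=10)
for _ in range(20):
    budget.record_request()

# Five retries are attempted, but only the budgeted share is allowed.
allowed = sum(budget.try_retry() for _ in range(5))
```

Integer credit is used instead of a floating-point token count so that the budget is exact. With a 10% budget, twenty first attempts earn credit for two retries, and the rest are refused rather than sent to an already struggling service.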
Furthermore, this project provides ample educational and knowledge transfer opportunities, as we incorporate our findings into classroom teaching and design broader educational materials for practicing professionals.
The first step in mitigating the risks of metastable failures is to understand the triggers that initiate the feedback mechanisms and to design algorithms that are more resistant to them. This step requires algorithms that can operate under a broader range of environmental conditions, such as weaker communication and synchrony assumptions, node crashes, gray failures, and other adversarial effects.
If a failure progresses to the feedback loop stage, we aim to reduce the impact of the feedback mechanisms through intelligent admission control, work prioritization, scheduling, and budgeting. In this project, we will produce tangible methods and strategies for mitigating metastable failures in distributed systems. We will provide more failure-tolerant protocols for state machine replication, distributed caching, distributed transactions, and micro-service-style applications.
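To make admission control and work prioritization concrete, the sketch below bounds the amount of in-flight work and sheds low-priority requests first, keeping some capacity in reserve for high-priority work. It is an illustrative example only, not a protocol from this project; `AdmissionController`, `capacity`, and `reserved_for_high` are hypothetical names.

```python
class AdmissionController:
    """Bound in-flight work and shed low-priority requests first (illustrative)."""

    def __init__(self, capacity: int, reserved_for_high: int) -> None:
        if not 0 <= reserved_for_high <= capacity:
            raise ValueError("reserved_for_high must be between 0 and capacity")
        self.capacity = capacity
        self.reserved_for_high = reserved_for_high
        self.in_flight = 0

    def admit(self, high_priority: bool = False) -> bool:
        """Admit a request if its priority class still has room."""
        # Low-priority work may not use the slots reserved for high priority.
        limit = self.capacity if high_priority else self.capacity - self.reserved_for_high
        if self.in_flight >= limit:
            return False  # reject early instead of queueing more work
        self.in_flight += 1
        return True

    def release(self) -> None:
        """Call when an admitted request finishes."""
        if self.in_flight > 0:
            self.in_flight -= 1


controller = AdmissionController(capacity=3, reserved_for_high=1)
decisions = [
    controller.admit(),                    # low priority, admitted
    controller.admit(),                    # low priority, admitted
    controller.admit(),                    # low priority, shed: only the reserved slot is left
    controller.admit(high_priority=True),  # high priority, takes the reserved slot
    controller.admit(high_priority=True),  # rejected: capacity is full
]
```

Rejecting work at the door keeps queues short, which matters here because long queues lead to timeouts and retries that feed the loop. A production system would more likely derive the limits from measured latency or queueing delay than from fixed constants.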
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.