Active CONTINUING GRANT National Science Foundation (US)

CAREER: Operationalising Fault-Tolerance in an Uncertain World.

$2.87M USD

Funder	National Science Foundation (US)
Recipient Organization	University of California-Berkeley
Country	United States
Start Date	Jan 01, 2025
End Date	Dec 31, 2029
Duration	1,825 days
Number of Grantees	1
Roles	Principal Investigator
Data Source	National Science Foundation (US)
Grant ID	`2442542`

Grant Description

Modern applications, whether e-commerce sites, ML serving systems, or medical applications, place increasingly stringent requirements on the cloud storage systems that underpin them. These systems must offer good performance and scalability, as well as robustness against hardware failures and malicious attacks. Consensus systems, specifically, serve as the root of trust to bootstrap application correctness and ensure that machines agree on a shared state in spite of failures.

The guarantees of consensus systems rely on the trust model (the set of assumptions about reality) accurately describing the conditions under which the system will operate. This means correctly modeling the network, the types of failures that can arise, and the total number of possible failures. Unfortunately, existing trust models fail to capture realistic deployment conditions.

The project's novelty lies in developing new trust models and protocols that explicitly recognise the true, uncertain nature of large-scale distributed systems. Its broader significance lies in its potential to significantly improve the performance and robustness of consensus systems and, by extension, of all the systems that depend on them.

Production consensus implementations are deployed over heterogeneous networks that mix LAN and WAN links, experience blips, and are subject to attack, misconfiguration, or link failures. Every replica in these systems has some probability of failure, and this failure rate evolves over time. Yet engineers currently have no good way to express these realistic setups precisely, as current abstractions are too coarse-grained.

Engineers must either over-insure or under-insure, leading to poor performance and unnecessarily high replication factors. This project (1) revisits the network model by eschewing the idea that the network is necessarily fully synchronous or fully asynchronous, paying particular attention to how protocols recover from blips, and (2) revisits the failure model by introducing probability-native consensus protocols that view failure rates as dynamically evolving probability distributions.
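To make the over-insure/under-insure trade-off concrete, the sketch below estimates how likely it is that a fixed fault threshold f is exceeded when each replica has its own failure probability. It is purely illustrative and not taken from the project: the function names are hypothetical, replicas are assumed to fail independently, and the probabilities are treated as fixed rather than evolving over time.

```python
from typing import Sequence


def failure_count_distribution(fail_probs: Sequence[float]) -> list[float]:
    """Return dist, where dist[k] is the probability that exactly k replicas fail.

    Assumption (for illustration only): replicas fail independently, each
    with its own fixed probability from fail_probs.
    """
    dist = [1.0]  # with zero replicas, zero failures is certain
    for p in fail_probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)  # this replica survives
            nxt[k + 1] += mass * p      # this replica fails
        dist = nxt
    return dist


def prob_threshold_exceeded(fail_probs: Sequence[float], f: int) -> float:
    """Probability that more than f replicas fail, i.e. the assumed bound is violated."""
    dist = failure_count_distribution(fail_probs)
    return sum(dist[f + 1:])


if __name__ == "__main__":
    # Hypothetical deployment: three replicas tolerating f = 1 failure,
    # with one replica noticeably less reliable than the others.
    probs = [0.01, 0.01, 0.10]
    print(f"P(more than 1 failure) = {prob_threshold_exceeded(probs, 1):.6f}")
```

Comparing this probability across replica counts and thresholds shows the trade-off: adding replicas lowers the chance that the bound is violated but raises the replication factor, while a bound that is too small leaves a non-negligible chance that the trust model does not hold.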

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

All Grantees

University of California-Berkeley
