Completed STANDARD GRANT National Science Foundation (US)

Collaborative Research: FMitF: Track I: Composable Verification of Crash-Safe Distributed Systems with Grove

$5M USD

Funder	National Science Foundation (US)
Recipient Organization	Massachusetts Institute of Technology
Country	United States
Start Date	Oct 01, 2021
End Date	Sep 30, 2025
Duration	1,460 days
Number of Grantees	2
Roles	Co-Principal Investigator; Principal Investigator
Data Source	National Science Foundation (US)
Grant ID	`2123864`

Grant Description

Distributed systems play a crucial role in computer systems infrastructure. Nevertheless, developing reliable distributed systems is challenging due to the need to contend with concurrency across machines, concurrency within each machine, unreliable networks that can delay or drop messages, and partial failures if one or more machines crash and reboot while others continue running.

As a result, distributed systems are error-prone and subtle bugs can lead to significant outages. Traditional testing approaches are insufficient to eliminate all such bugs. This project's novelty is a new approach to formal verification of distributed systems that allows verifying components in a modular fashion.

It allows for verification of distributed systems in the presence of crashes. This project's impact is intended to include improving the reliability and correctness of distributed systems and avoid costly outages. In addition, new lab assignments for systems-verification classes are being developed, focused on distributed systems.

The technical approach addresses two specific challenges: reasoning about crash recovery in distributed systems, as well as composing distributed systems from smaller components. Crash recovery is challenging because individual nodes can crash and reboot. Once a node starts running again, it might no longer be consistent with the rest of the system that did not crash.

This means the node may have lost all of its memory contents on crash but may have kept some state durably on disk. The second challenge lies in composing specifications and proofs of distributed systems (such as a key-value store) that are built out of smaller components (such as a configuration service, a lock service, or the implementation of an individual node).

Scaling verification of distributed systems requires the proof to reflect this modularity. For example, reasoning about an application that uses a lock service should not require reasoning about the network messages sent by the lock service itself. It should be done purely using the specifications for the lock service client stubs.

This project tackles these challenges using concurrent separation logic, which provides a natural approach for composing proofs about multiple components, as well as abstracting away implementation details with a pre/post-condition specification. This project extends earlier work with techniques for distributed system reasoning, including new kinds of per-node invariants (which might need to be repaired on crash) as opposed to global invariants (which must hold even if some nodes have crashed).

In addition, the project provides techniques for reasoning about exactly-once semantics of Remote Procedure Calls (RPC) on top of unreliable computer networks and locks that span multiple machines.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

All Grantees

Massachusetts Institute of Technology

Interested in applying for this grant?

Complete our application form to express your interest and we'll guide you through the process.

Apply for This Grant

Collaborative Research: FMitF: Track I: Composable Verification of Crash-Safe Distributed Systems with Grove

Grant Description

All Grantees

Interested in applying for this grant?

Quick Summary

Related Grants