Completed STANDARD GRANT National Science Foundation (US)

Collaborative Research: CNS Core: Small: A new framework for building fail-slow fault-tolerant distributed systems

$2.5M USD

Funder National Science Foundation (US)
Recipient Organization SUNY at Stony Brook
Country United States
Start Date Oct 01, 2021
End Date Sep 30, 2025
Duration 1,460 days
Number of Grantees 1
Roles Principal Investigator
Data Source National Science Foundation (US)
Grant ID 2130590
Grant Description

This project targets a long-standing and increasingly pervasive challenge in distributed system design and implementation: fail-slow fault tolerance. Most existing fault-tolerant distributed systems are developed and tested to tolerate faults in which a node has stopped completely, but they often perform poorly under “fail-slow” faults, in which a faulty node has not crashed but is operating at a degraded speed far below its normal performance.

Fail-slow faults can happen for various reasons, including hardware (e.g., an overheated chip), software (e.g., a process uses up all the memory), network (e.g., a loose cable), and human error (e.g., an administrator launches too many processes on the same node). In many current fault-tolerant distributed systems, fail-slow nodes can degrade the performance of the entire system by holding up the healthy nodes in their execution.

For example, a healthy node may keep buffering outbound messages to the slow nodes until it uses up its memory and crashes. Improving fail-slow fault tolerance is important because fail-slow faults have been reported to be common in large-scale distributed systems deployed in modern data centers. The performance issues they cause are less visible than crashes and hard to debug.
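
The buffering failure mode described above can be illustrated with a small simulation. This is only a minimal sketch under stated assumptions, not the framework this project proposes: the `PeerOutbox` class, the `simulate` function, and all parameters are hypothetical names introduced here for illustration. It shows one common mitigation, a bounded per-peer outbound buffer, which caps memory use and records dropped messages as a signal that a peer may be slow.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class PeerOutbox:
    """Bounded outbound buffer for one peer (hypothetical helper)."""
    capacity: int
    queue: deque = field(default_factory=deque)
    dropped: int = 0

    def enqueue(self, message: bytes) -> bool:
        """Queue a message; return False and count a drop if the buffer is full."""
        if len(self.queue) >= self.capacity:
            self.dropped += 1
            return False
        self.queue.append(message)
        return True

    def drain(self, max_messages: int) -> list:
        """Remove up to max_messages, standing in for what the peer received."""
        sent = []
        while self.queue and len(sent) < max_messages:
            sent.append(self.queue.popleft())
        return sent

    @property
    def suspected_slow(self) -> bool:
        """A peer is flagged once any message had to be dropped."""
        return self.dropped > 0


def simulate(rounds, produce_per_round, drain_rates, capacity):
    """Send the same messages to every peer; each peer drains at its own rate."""
    outboxes = {peer: PeerOutbox(capacity) for peer in drain_rates}
    for _ in range(rounds):
        for _ in range(produce_per_round):
            for box in outboxes.values():
                box.enqueue(b"msg")
        for peer, rate in drain_rates.items():
            outboxes[peer].drain(rate)
    return outboxes


if __name__ == "__main__":
    boxes = simulate(rounds=10, produce_per_round=5,
                     drain_rates={"healthy": 5, "slow": 1}, capacity=8)
    for peer, box in boxes.items():
        print(peer, len(box.queue), box.dropped, box.suspected_slow)
```

Dropping messages is only one possible policy, and whether it is acceptable depends on the protocol. The point of the sketch is that the sender's memory use stays bounded by `capacity` no matter how slow a peer becomes.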

To help improve this situation, this work will develop a set of novel, transformative technologies, including distributed-system programming support, design patterns, and runtime verification techniques, that will be encapsulated in a unified programming framework and will dramatically improve the performance and fault-tolerance of modern distributed systems.

This research may have a major impact on industry and society, since distributed systems are the cornerstones of modern computing infrastructures such as cloud computing, cluster and datacenter technologies, and high performance computing. In particular, this work will be done in collaboration with widely used distributed databases, specifically MongoDB and TiDB.

The PIs envision this effort as a catalyst for multidisciplinary research and education on distributed systems technologies at Stony Brook University and the University of Illinois. The PIs will use this work as a core that they hope will eventually bring together other faculty of diverse expertise with interests in cloud computing, distributed systems, and software engineering technologies.

Both universities are experiencing an unprecedented surge of students in Computer Science. The PIs are working with their departments to broaden the course offerings with multidisciplinary courses in the general areas of cloud computing, distributed systems, reliable systems, and software engineering. The PIs will incorporate the topics in this proposal into the courses they teach.

The PIs have a long-standing commitment to undergraduate education and research, and to broadening the participation of under-represented minorities. They will use this work to involve undergraduates and under-represented students in their research groups.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

All Grantees

SUNY at Stony Brook
