Active STANDARD GRANT National Science Foundation (US)

CNS Core: Small: Testing and detecting software upgrade failures in data-intensive distributed systems

$6M USD

Funder	National Science Foundation (US)
Recipient Organization	Purdue University
Country	United States
Start Date	Oct 01, 2023
End Date	Sep 30, 2026
Duration	1,095 days
Number of Grantees	1
Roles	Principal Investigator
Data Source	National Science Foundation (US)
Grant ID	`2300562`

Grant Description

In the current big data era, Internet services are often built on top of data-intensive distributed systems. Such distributed systems must go through frequent software upgrades as vendors add new features, improve performance, and deploy patches. With the rise of continuous deployment in industry, a major Internet company may perform thousands of distributed system software deployments in a single day.

Unfortunately, distributed systems can experience upgrade failures, that is, failures that occur during a software upgrade. These failures often have large-scale impact because the upgrade is performed on the entire system. In production they are typically mitigated with canary deployment, which slowly rolls out updates from a small scale to the entire cluster and downgrades if a failure is encountered.
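The canary deployment described above can be illustrated with a minimal Python sketch. It is not taken from the project; the function name `canary_rollout`, the stage fractions, and the `upgrade`, `downgrade`, and `healthy` callbacks are all hypothetical stand-ins for whatever a real deployment tool provides.

```python
from typing import Callable, List, Sequence


def canary_rollout(
    nodes: List[str],
    upgrade: Callable[[str], None],
    downgrade: Callable[[str], None],
    healthy: Callable[[str], bool],
    stages: Sequence[float] = (0.05, 0.25, 1.0),  # assumed rollout fractions
) -> bool:
    """Upgrade nodes in growing stages; downgrade everything on failure."""
    upgraded: List[str] = []
    for fraction in stages:
        # Number of nodes that should be on the new version after this stage.
        target = max(1, int(len(nodes) * fraction))
        for node in nodes[len(upgraded):target]:
            upgrade(node)
            upgraded.append(node)
        # Check every upgraded node before widening the rollout.
        if not all(healthy(node) for node in upgraded):
            for node in reversed(upgraded):
                downgrade(node)
            return False
    return True
```

Each stage waits for a health check before widening, which is why a full rollout can take hours. The downgrade step only restores the software version; it does not undo data already written in a new format.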

However, canary deployment can easily take hours, which forces a trade-off between safe and fast upgrades. In addition, many upgrade failures have persistent impact and cannot be easily resolved by downgrading. Despite the severe consequences of upgrade failures and the challenges faced by production mitigation techniques, no existing testing or program analysis techniques systematically test and analyze the distributed system upgrade procedure.

This work proposes to develop such techniques, optimized to detect upgrade failures at an early stage by exploiting unique properties of the distributed system software upgrade procedure. Data-intensive distributed systems deployed in public or private clouds are now a cornerstone of many critical computing systems. The proposed techniques are expected to substantially improve the reliability of data-intensive distributed systems during upgrades and, consequently, to reduce service disruptions and improve the availability of cloud systems.

In addition, improved reliability of the upgrade procedure will lead to more timely feedback about new features in production, which is critical for developers’ productivity and the quality of the resulting software.

In this project we plan to (1) implement differential testing between two standard distributed system upgrade procedures: full-stop upgrade and rolling upgrade, (2) explore using the source code differences between versions to design differential test oracles, feedback metrics, and input mutation strategies that are specially tuned to trigger and detect upgrade failures, (3) design static program analysis, guided by source code differences, to detect data format incompatibilities between versions, and (4) validate the proposed testing and detection techniques through direct experimentation on real-world data-intensive distributed systems. The proposed fault localization and static analysis techniques will reduce the valuable time and effort that developers spend on root cause diagnosis, which is extremely challenging for bugs in distributed systems. All products of the project will be open sourced to ensure widespread impact.
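As a rough illustration of item (1), the Python sketch below runs the same workload under a full-stop upgrade and under a rolling upgrade of a toy two-node key-value cluster, then compares the observed results. Everything in it is an assumption made for illustration: the `Cluster` class, the `v2:` header that the hypothetical version 2 adds to stored values, and the round-robin routing are not taken from the project.

```python
from typing import Dict, List, Tuple

Op = Tuple[str, str, str]  # ("put", key, value) or ("get", key, "")


def encode(version: int, value: str) -> str:
    # Hypothetical format change: version 2 prefixes stored values.
    return f"v2:{value}" if version == 2 else value


def decode(version: int, raw: str) -> str:
    if version == 2:
        # Version 2 reads both the old and the new format.
        return raw[3:] if raw.startswith("v2:") else raw
    # Version 1 does not know the header and returns it verbatim.
    return raw


class Cluster:
    def __init__(self, versions: List[int]) -> None:
        self.versions = versions
        self.store: Dict[str, str] = {}  # storage shared by all nodes

    def run(self, workload: List[Op]) -> List[str]:
        results = []
        for i, (kind, key, value) in enumerate(workload):
            version = self.versions[i % len(self.versions)]  # round-robin
            if kind == "put":
                self.store[key] = encode(version, value)
                results.append("ok")
            else:
                results.append(decode(version, self.store.get(key, "")))
        return results


def full_stop_upgrade(workload: List[Op]) -> List[str]:
    # All nodes are stopped and upgraded before the workload runs.
    return Cluster([2, 2]).run(workload)


def rolling_upgrade(workload: List[Op]) -> List[str]:
    # The workload runs while node 0 is new and node 1 is still old.
    return Cluster([2, 1]).run(workload)


def differs(workload: List[Op]) -> bool:
    # Differential oracle: a mismatch hints at a possible upgrade failure.
    return full_stop_upgrade(workload) != rolling_upgrade(workload)
```

With the workload `[("put", "k", "x"), ("get", "k", "")]`, the old node in the rolling upgrade returns the raw `v2:x` value while the full-stop run returns `x`, so `differs` reports a mismatch. A real oracle would also need to tolerate legitimate nondeterminism between runs, which this sketch ignores.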

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

All Grantees

Purdue University
