Active CONTINUING GRANT National Science Foundation (US)

CAREER: Modeling and Mitigating Error Propagation in High-Performance Computing Applications

$3.46M USD

Funder National Science Foundation (US)
Recipient Organization University of Iowa
Country United States
Start Date Jul 01, 2025
End Date Jun 30, 2030
Duration 1,825 days
Number of Grantees 1
Roles Principal Investigator
Data Source National Science Foundation (US)
Grant ID 2441136
Grant Description

In the world of high-performance computing (HPC), the growing complexity and shrinking size of hardware components make systems more vulnerable to "soft errors": temporary glitches that can disrupt calculations. Traditionally, these issues were managed through hardware-based solutions like redundancy, but these approaches consume significant energy, a major concern for modern processors.

This project addresses the challenge of making HPC systems more resilient to soft errors without the high energy costs of traditional methods. It focuses on identifying and protecting the most vulnerable parts of a program — the specific states where errors are most likely to cause problems. By doing this efficiently, the project aims to ensure that programs can continue to function correctly even when errors occur.

The broader benefits of this project include advancing the field of reliable computing, promoting energy-efficient technologies, and supporting education by making cutting-edge resilience techniques accessible to software developers and classrooms. Ultimately, this work contributes to the creation of more robust and efficient computing systems that can handle the increasing demands of modern technology, benefiting industries, education, and society as a whole.

This project aims to address the increasing vulnerability of HPC systems to transient hardware faults, or soft errors, which are exacerbated by larger system scales, advanced technology scaling, and lower operating voltages. Traditional hardware-only solutions such as dual modular redundancy are becoming less viable due to their high energy consumption, making it essential for future HPC applications to tolerate such faults.

The project focuses on developing a compiler-directed framework that rapidly and accurately models error propagation, identifying and protecting only the most vulnerable program states to minimize performance and energy overheads. The project involves integrating static program analysis, dynamic input fuzzing, program invariants, redundancy, and compiler code transformations to create an efficient protection strategy.
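As a rough illustration of what "identifying the most vulnerable program states" can mean, the sketch below injects single-bit flips into one intermediate value of a dot product and flags the bit positions whose corruption changes the final result beyond a tolerance. This is not the project's framework, which the description says relies on static analysis, input fuzzing, and compiler transformations; brute-force injection is swapped in here only because it is easy to show. The function names, the fault model (one flip in one partial product), and the tolerance are all assumptions made for the example.

```python
import math
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit (0..63) of its IEEE 754 double encoding flipped."""
    (encoded,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", encoded ^ (1 << bit)))
    return flipped


def dot(xs, ys, fault=None):
    """Dot product; `fault=(index, bit)` corrupts one partial product.

    The fault model is hypothetical: exactly one bit flip in one intermediate value.
    """
    total = 0.0
    for i, (x, y) in enumerate(zip(xs, ys)):
        product = x * y
        if fault is not None and fault[0] == i:
            product = flip_bit(product, fault[1])
        total += product
    return total


def vulnerable_bits(xs, ys, index, tolerance=1e-6):
    """List the bit positions of partial product `index` whose flip is not tolerable.

    A flip counts as harmful when the output is non-finite or its relative
    error against the fault-free ("golden") run exceeds `tolerance`.
    """
    golden = dot(xs, ys)
    flagged = []
    for bit in range(64):
        faulty = dot(xs, ys, fault=(index, bit))
        if not math.isfinite(faulty) or abs(faulty - golden) > tolerance * abs(golden):
            flagged.append(bit)
    return flagged


if __name__ == "__main__":
    bits = vulnerable_bits([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], index=0)
    print(f"{len(bits)} of 64 bit positions exceed the tolerance:", bits)
```

For inputs like these, flips in the sign, exponent, and high mantissa bits are flagged, while flips in the low mantissa bits stay within the tolerance. That asymmetry is the reason selective protection can cost less than duplicating every computation.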

By automating the process of hardening programs to meet specific reliability targets, the investigator aims to advance the field of reliable computing, reduce the barriers to implementing resilience techniques in HPC systems, and contribute to the development of energy-efficient, fault-tolerant software.

This project is jointly funded by the Software and Hardware Foundations Program, the Office of Advanced Cyberinfrastructure, and the Established Program to Stimulate Competitive Research (EPSCoR) Program.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

All Grantees

University of Iowa
