| Field | Value |
|---|---|
| Funder | National Science Foundation (US) |
| Recipient Organization | University of Iowa |
| Country | United States |
| Start Date | Jul 01, 2025 |
| End Date | Jun 30, 2030 |
| Duration | 1,825 days |
| Number of Grantees | 1 |
| Roles | Principal Investigator |
| Data Source | National Science Foundation (US) |
| Grant ID | 2441136 |
In high-performance computing (HPC), the growing complexity and shrinking size of hardware components make systems more vulnerable to "soft errors": temporary glitches that can disrupt calculations. Traditionally, these errors were managed through hardware-based solutions such as redundancy, but those approaches consume significant energy, a major concern for modern processors.
This project addresses the challenge of making HPC systems more resilient to soft errors without the high energy costs of traditional methods. It focuses on identifying and protecting the most vulnerable parts of a program, namely the specific states where errors are most likely to cause problems. By doing this efficiently, the project aims to ensure that programs continue to function correctly even when errors occur.
The broader benefits of this project include advancing the field of reliable computing, promoting energy-efficient technologies, and supporting education by making cutting-edge resilience techniques accessible to software developers and classrooms. Ultimately, this work contributes to the creation of more robust and efficient computing systems that can handle the increasing demands of modern technology, benefiting industries, education, and society as a whole.
This project aims to address the increasing vulnerability of HPC systems to transient hardware faults, or soft errors, which are exacerbated by larger system scales, advanced technology scaling, and lower operating voltages. Traditional hardware-only solutions such as dual modular redundancy are becoming less viable due to their high energy consumption, making it essential for future HPC applications to tolerate such faults.
The project focuses on developing a compiler-directed framework that rapidly and accurately models error propagation, identifying and protecting only the most vulnerable program states to minimize performance and energy overheads. The project involves integrating static program analysis, dynamic input fuzzing, program invariants, redundancy, and compiler code transformations to create an efficient protection strategy.
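To make the idea of selective protection concrete, here is a minimal sketch of a soft error and a duplication-based check. It is an illustration only, not the project's framework: the helper names `flip_bit`, `dot`, and `protected_dot`, the fault model (a single bit flip in a floating-point accumulator), and the sample inputs are all assumptions made for this example.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE 754 double encoding flipped."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (out,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return out


def dot(xs, ys, fault_at=None, bit=0):
    """Dot product; optionally corrupt the accumulator once at step `fault_at`.

    The injected bit flip is a hypothetical stand-in for a transient fault.
    """
    acc = 0.0
    for i, (x, y) in enumerate(zip(xs, ys)):
        acc += x * y
        if i == fault_at:
            acc = flip_bit(acc, bit)
    return acc


def protected_dot(xs, ys, fault_at=None, bit=0):
    """Duplication-based detection: compute twice and compare the results.

    For the demonstration, the fault is injected into the first copy only.
    """
    first = dot(xs, ys, fault_at, bit)
    second = dot(xs, ys)
    if first != second:
        raise RuntimeError("soft error detected: redundant results differ")
    return first


if __name__ == "__main__":
    xs, ys = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
    print(protected_dot(xs, ys))                   # fault-free run
    print(dot(xs, ys, fault_at=1, bit=0))          # low-order flip
    try:
        protected_dot(xs, ys, fault_at=1, bit=62)  # high-order exponent flip
    except RuntimeError as err:
        print(err)
```

With these sample inputs, a flip in the lowest mantissa bit at step 1 is absorbed by rounding in the next addition, while a flip in a high exponent bit changes the result and is caught by the comparison. That difference is the intuition behind protecting only the most vulnerable program states: duplicating every computation roughly doubles the work, so a selective scheme tries to spend redundancy only where a fault is likely to corrupt the output.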
By automating the process of hardening programs to meet specific reliability targets, the investigator aims to advance the field of reliable computing, reduce the barriers to implementing resilience techniques in HPC systems, and contribute to the development of energy-efficient, fault-tolerant software.
This project is jointly funded by the Software and Hardware Foundations Program, the Office of Advanced Cyberinfrastructure, and the Established Program to Stimulate Competitive Research (EPSCoR) Program.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.