Completed STANDARD GRANT National Science Foundation (US)

Collaborative Research: OAC Core: Enabling Extremely Fine-grained Parallelism on Modern Many-core Architectures

$1.66M USD

Funder	National Science Foundation (US)
Recipient Organization	University of Chicago
Country	United States
Start Date	Jul 01, 2021
End Date	Jun 30, 2025
Duration	1,460 days
Number of Grantees	1
Roles	Principal Investigator
Data Source	National Science Foundation (US)
Grant ID	`2107283`

Grant Description

Computer systems are becoming increasingly complex: multisocket systems with many-core processors and general graphic processors have the potential to address the needs of demanding applications at the node level. Programmability and efficiency are often not easy to find together due to the hardware growing several orders of magnitude in degree of parallelism to thousands of computing units on a chip.

Task parallelism is an important type of parallelism in which computation is broken down into a set of inter-dependent tasks which can be executed concurrently on various computing units. To achieve strong scaling and high levels of effective parallelism, there is a growing need in today's parallel languages with supporting over-decomposition (many more tasks than cores) in order to improve performance, hide latency caused by blocking operations, and otherwise achieve maximum speedup.

By enabling the efficient support of fine-grained parallelism across the growing range of scales seen in modern and future hardware, it is expected that the productivity of parallel programmers will be enhanced. Trends show evidence that most of the Top500 high-performance computing systems will likely employ hardware that this work directly targets.

The project aims to conduct a high-impact education program in distributed parallel programming with broad reach, encouraging student internships grounded in real-world challenges, and paving the way for technology transfer from research to open-source projects. Special emphasis is placed on engaging women and underrepresented minorities. This education facet will create a new and more accessible foundation for fluency in parallel computing for scientists and engineers.

This work explores novel data-structures and algorithms that allow for scalable runtime and execution models for fine-grained parallelism at sub-microsecond timescales. Preliminary work by the PIs at the language and runtime levels suggests a path to achieving this. The project objectives are: 1) unifying runtime enabling task granularities measured in cycles: design, analysis, and implementation of building blocks for efficient fine-grained computing on diverse node hardware; 2) evaluating performance of these building blocks in the context of real parallel systems and application kernels on a range of computer architectures; 3) measuring performance and scalability impact of runtime on benchmark kernels and real applications; and 4) integrating this research with education programs from undergraduate to graduate levels through new course material on parallel computing.

This high-risk/high-reward research is geared towards yielding transformative improvements in the ease and efficiency of programming parallel machines at every scale. The contributions lie in the realization of productive, implicitly parallel high-level languages optimized for single node deployments with many-core architectures to support fine-grained parallelism measured in cycles, enabling an entirely new class of many-task computing applications.

The dataflow architecture makes implicit parallelism tractable with a programming model whose impact could rival that of MATLAB, R, and Python, with the added benefit that the same code could also run in a distributed system or large-scale HPC systems. Thus, the scientist would be able to write a program once, run it at any suitable scale, and have it seamlessly use the most appropriate granularity for each component of the hardware.

This work’s innovations in dataflow architecture will be broadly applicable to a number of existing parallel programming systems such as OpenMP, Swift/Parsl, and CUDA/OpenCL, in terms of both efficiency in executing fine grained parallelism and adding support for implicit parallelism where possible. Target hardware includes Intel/AMD x86, ThunderX/2 ARM, IBM Power9, and NVIDIA/AMD GPUs.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

All Grantees

University of Chicago

Interested in applying for this grant?

Complete our application form to express your interest and we'll guide you through the process.

Apply for This Grant

Collaborative Research: OAC Core: Enabling Extremely Fine-grained Parallelism on Modern Many-core Architectures

Grant Description

All Grantees

Interested in applying for this grant?

Quick Summary

Related Grants