| Funder | National Science Foundation (US) |
|---|---|
| Recipient Organization | University of California-San Diego |
| Country | United States |
| Start Date | Sep 01, 2024 |
| End Date | Aug 31, 2026 |
| Duration | 729 days |
| Number of Grantees | 1 |
| Roles | Principal Investigator |
| Data Source | National Science Foundation (US) |
| Grant ID | 2438294 |
The NAIRR Pilot NVIDIA DGX Cloud offering allocates sizeable single-tenant DGX clusters to research groups for several weeks or even months. Each cluster is restricted to a single research project at any given time. The configuration offers excellent interconnect performance within the cluster and provides a scalable way to train production-ready models faster, decreasing the time to insight for science researchers.
It is a unique cloud offering in that each allocated system behaves like a typical batch-scheduled HPC system. NVIDIA provides initial support for setting up the cluster, backend maintenance of the cloud resources, and the security infrastructure surrounding it. Researchers still need the ongoing system monitoring, configuration changes, and in-depth user support for porting and performance tuning that typical national HPC systems provide to their research communities.
This project aims to support NAIRR Pilot researchers with onboarding, workflow porting, user management, cluster management, and software installation (via containers). It also explores profiling, performance tuning, and data movement strategies on the single-tenant compute clusters NVIDIA is providing for NAIRR Pilot projects. The project thus provides the type of support researchers expect from a national HPC system of this kind.
The goal of the project is to advance AI and scientific research at scale by exploring system configuration, usage modalities, and performance monitoring and tuning on the cloud resources. The single-tenant design allows testing of configurations that may not be possible on a multi-tenant on-premises cluster with thousands of users. For example, some profiling tools require settings that are typically not easily enabled on shared resources.
The NVIDIA DGX Cloud cluster supports the enroot tool, which converts container/OS images into unprivileged sandboxes, enabling researchers to easily develop customized software environments. Once a container image is finalized, it is usable on both the cloud resource and on-premises clusters, enabling performance comparisons with nearly identical software stacks.
The project explores strategies for moving large datasets to and from various offsite locations using different data movement tools. This work is required to support quick turnarounds when moving allocated projects on and off the DGX clusters, with minimal downtime between projects. The goal of the project is to develop usage guidelines, training, and documentation for profiling and performance optimization, along with optimal data movement strategies.
The NVIDIA DGX Cloud provides significant hardware and software options for NAIRR Pilot projects. The project's work enables a wide range of NAIRR Pilot researchers to use these resources and supports the development of usage guidelines, documentation for profiling and performance optimization, and data movement strategies. All of these have impact beyond the specific NVIDIA DGX Cloud clusters and simplify the use of future cloud-based systems.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.