Active STANDARD GRANT National Science Foundation (US)

EAGER: NAIRR Pilot: Enabling Large Scale Research Projects on the NVIDIA DGX Cloud Platform

$3M USD

Funder National Science Foundation (US)
Recipient Organization University of California-San Diego
Country United States
Start Date Sep 01, 2024
End Date Aug 31, 2026
Duration 729 days
Number of Grantees 1
Roles Principal Investigator
Data Source National Science Foundation (US)
Grant ID 2438294
Grant Description

The NAIRR Pilot NVIDIA DGX Cloud offering allocates research groups sizeable single-tenant clusters of DGXs for several weeks, or even months. The allocations are restricted to a single research project per cluster at any given time. The configuration offers excellent interconnect performance within the cluster and provides a scalable solution to train production-ready models faster, decreasing the time-to-insights for science researchers.

It is unique among cloud offerings in that each allocated system resembles a typical batch-scheduled HPC system. NVIDIA provides initial support for setting up the cluster, backend maintenance of cloud resources, and the security infrastructure encompassing it. There remains a need for the ongoing system monitoring, configuration changes, and in-depth user support for porting and performance tuning that typical national HPC systems provide to research communities.

This project aims to support NAIRR Pilot researchers with onboarding activities, porting of workflows, user management, cluster management, and software installs (via containers). It also explores profiling, performance tuning, and data movement strategies on the single-tenant compute clusters NVIDIA is providing for NAIRR Pilot projects. The project thus provides the type of support researchers expect from a national HPC system of this kind.

The goal of the project is to advance AI and scientific research at scale by exploring system configuration, usage modalities, and performance monitoring and tuning on the cloud resources. The single-tenant design allows testing of configurations that may not be possible on a multi-tenant on-premises cluster with thousands of users. For example, some profiling tools may require settings that are typically not easily enabled on shared resources.

The NVIDIA DGX Cloud cluster supports the enroot tool, which converts container/OS images into unprivileged sandboxes, enabling researchers to easily develop customized software environments. Once a container image is finalized, it is usable on both the cloud resource and on-premises clusters, enabling performance comparisons with nearly identical software stacks.

The project explores data movement strategies for large datasets to and from various offsite locations using different data movement tools. This data movement work is required to support quick turnarounds when moving allocated projects on and off the DGX clusters, with minimal downtime between projects. A further goal of the project is to develop usage guidelines, training, and documentation for profiling and performance optimization, along with optimal data movement strategies.

The NVIDIA DGX Cloud provides significant hardware and software options for NAIRR Pilot projects. The project’s work enables a wide range of NAIRR Pilot researchers to use these resources and produces usage guidelines, documentation for profiling and performance optimization, and data movement strategies. All of these have impact beyond the specific NVIDIA DGX Cloud clusters and simplify the use of future cloud-based systems.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

All Grantees

University of California-San Diego
