GPU maker Nvidia will soon release the next version of its CUDA parallel programming framework, version 12, to accompany the release of its new GPU architecture, codenamed Hopper.
“This is the biggest release we’ve ever had,” said Stephen Jones, CUDA architect at Nvidia, during a breakout session at Nvidia’s GPU Technology Conference, held virtually earlier this month.
CUDA started as a simple programming language in June 2007 targeting graphics, and is currently at version 11.7, with a major update, version 11.8, planned before the move to version 12.
Jones didn’t provide an exact ship date for CUDA 12, but Nvidia’s past release timeline suggests CUDA 12 will be available for download late this year or early next year.
Nvidia usually releases a new version of CUDA with each new GPU architecture. This is the first time in two years that CUDA users will experience a major version change.
GPUs were initially popular for graphics, but the chips’ ability to compute in parallel planted the seed for Nvidia’s hardware to be used in non-graphics applications. Today, Nvidia GPUs dominate the market as accelerators for AI, simulation, graphics, and supercomputing. But the proprietary CUDA parallel programming model runs only on Nvidia’s GPUs, effectively locking customers into the company’s hardware.
Nvidia is now trying to shift gears to expand into the software sector by selling AI software applications developed in CUDA. The company sees a $1 trillion market opportunity in software, with CUDA-based applications for self-driving cars, robots, medical devices and other AI systems.
A typical CUDA program has a GPU code section, which contains the kernels that run on the graphics cores, and a CPU code section, which sets up the runtime environment, including memory allocation and system management. CUDA also has a runtime system that includes libraries and a compiler that compiles the code into an executable.
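That split looks roughly like the following minimal sketch (the vector-add kernel and buffer sizes are our own illustration, not Nvidia’s code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// GPU code section: a kernel that runs on the graphics cores.
__global__ void add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// CPU code section: sets up memory, launches the kernel, cleans up.
int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    add<<<(n + 255) / 256, 256>>>(a, b, c, n);  // a grid of blocks of threads
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

NVCC splits this single source file, compiling the `__global__` kernel for the GPU and `main` for the CPU.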
CUDA binaries have CPU and GPU sections, plus a separate PTX assembly code section, which acts as a forward, and to some extent backward, compatibility layer across all versions of CUDA dating back to the first release in 2007.
But CUDA 12 apps will break on CUDA 11. Starting with CUDA 11, Nvidia has included a compatibility layer so APIs won’t break across minor releases; for example, a CUDA 11.5-based app will work with CUDA 11.1. But this compatibility layer does not carry over to new major versions.
“You can’t run CUDA 12 apps, for example, on a system with 11.2 installed, because the API signatures may have changed in a major release,” Jones said, adding, “That means two things. First of all, you need to care about which major version of CUDA is running on your [system], and second, some APIs and data structures will change.”
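In practice, an application can check for exactly this kind of major-version mismatch at startup using the runtime API (a sketch; the arithmetic follows CUDA’s convention of encoding version 12.0 as the integer 12000):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int driver = 0, runtime = 0;
    cudaDriverGetVersion(&driver);    // e.g. 12000 for CUDA 12.0
    cudaRuntimeGetVersion(&runtime);  // version the app was built against

    printf("driver %d.%d, runtime %d.%d\n",
           driver / 1000, (driver % 1000) / 10,
           runtime / 1000, (runtime % 1000) / 10);

    // Minor-version compatibility holds within a major release only.
    if (driver / 1000 < runtime / 1000)
        printf("major version mismatch: this app may not run\n");
    return 0;
}
```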
CUDA 12 is specifically tuned for the new GPU architecture called Hopper, which replaces the two-year-old Ampere architecture supported by CUDA 11. The Hopper-based flagship GPU, called the H100, has been measured at up to five times faster than the previous-generation, Ampere-based A100 flagship. Hopper’s speed improvements come from a slew of new features, including higher-throughput interconnect technologies and faster tensor cores for AI, vector, and floating-point operations.
Hopper features 132 streaming multiprocessors, PCIe Gen5 support, HBM3 memory, a 50MB L2 cache, and the new NVLink interconnect with 900GB/s of bandwidth.
If you want the best performance from Hopper, you’ll only get it from CUDA 12. Nvidia keeps its hardware and software tightly coupled, and if you’re using Khronos’ OpenCL, AMD’s ROCm, or other parallel programming frameworks, you won’t be able to harness the full power of Hopper.
The Hopper H100 GPU focuses on keeping data local and reducing the time it takes to execute code. The H100 has 132 streaming multiprocessor (SM) units, up from 15 in Kepler a decade ago. SM scaling is at the heart of CUDA 12, Jones said.
The CUDA programming model, at its core, asks users to divide work – such as processing an image – into blocks, which are arranged next to each other in a grid. Each block runs on a GPU as if it were a separate program, and Hopper can run several thousand blocks at a time. Each block, working on its own problem, is then broken down into threads.
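For the image-processing example, that decomposition might look like this (an illustrative sketch; the kernel, tile size, and function names are our own assumptions):

```cuda
#include <cuda_runtime.h>

// Each thread handles one pixel; each block handles a 16x16 tile of the image.
__global__ void brighten(unsigned char *img, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        img[y * width + x] = min(img[y * width + x] + 32, 255);
}

void launch(unsigned char *img, int width, int height) {
    dim3 block(16, 16);                               // threads per block
    dim3 grid((width + 15) / 16, (height + 15) / 16); // blocks per grid
    brighten<<<grid, block>>>(img, width, height);
}
```

Each block works on its tile independently, which is what lets Hopper keep thousands of blocks in flight at once.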
Nvidia has broken down this grid-block-thread hierarchy even further with a new layer called the “thread block cluster.” The thread block cluster relaxes the old structure, weaving together interconnected mini-grids at the block level, all of which add up to the larger grid. Until now, because of the GPU’s massive scale, “we embraced the concept of a grid composition made up of completely independent work blocks,” Jones said.
SMs are now organized in this hierarchy of thread block clusters, whose blocks can exchange data and synchronize with one another. A cluster of 16 blocks can run nearly 16,384 threads concurrently, which is a huge amount of concurrency, Jones said, adding that each block in a cluster can read and write the shared memory of all the other blocks in the cluster.
“What we’ve created is a way to target a localized subset of your grid to a localized set of runtime resources that opens up more programmability and performance opportunities,” Jones said.
The thread block cluster feature adds new launch syntax to the programming model that lets developers set the cluster size and the resources a task needs at launch, instead of relying on the CPU to get it right.
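A sketch of what that looks like with CUDA 12’s `__cluster_dims__` launch attribute and the cooperative-groups cluster API (the kernel body, a fixed 256-thread block, and the two-block cluster shape are our own illustration):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Compile-time cluster shape: 2 blocks per cluster (CUDA 12 / Hopper).
// Launch with blockDim.x == 256 to match the shared buffer below.
__global__ void __cluster_dims__(2, 1, 1) cluster_kernel(float *out) {
    __shared__ float buf[256];
    cg::cluster_group cluster = cg::this_cluster();

    buf[threadIdx.x] = (float)threadIdx.x;
    cluster.sync();  // every block in the cluster reaches this point

    // Read the shared memory of the *other* block in the cluster
    // (distributed shared memory across the cluster).
    unsigned peer = cluster.block_rank() ^ 1;
    float *peer_buf = cluster.map_shared_rank(buf, peer);
    out[cluster.block_rank() * blockDim.x + threadIdx.x] = peer_buf[threadIdx.x];
}
```

Alternatively, the cluster shape can be chosen at run time by launching with `cudaLaunchKernelEx` and the `cudaLaunchAttributeClusterDimension` attribute rather than hard-coding it on the kernel.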
Another new Hopper feature is the asynchronous transaction barrier, which cuts back-and-forth data movement for faster code execution. The asynchronous transaction barrier works like a waiting room in which threads sleep until the data from other threads arrives to complete a transaction. This reduces the energy, effort, and bandwidth needed to move data.
“You just say, ‘Wake me up when the data has arrived.’ I can have my thread on hold…waiting for data from lots of different places and only waking up when everything has happened,” Jones said.
On chips, work is usually divided into threads, which must coordinate with each other. With normal barriers, threads generally have to track where the data comes from and synchronize with its source, but that’s not the case on Hopper, where the transfer is just a single one-sided write.
“The asynchronous memory copy knows how many bytes it carries. The barrier knows how much it expects. When the data arrives, it counts itself in. They are one-sided memory copies, and they are seven times faster [at communication] because they just go one way and don’t have to go back and forth,” Jones said.
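The pattern Jones describes maps onto the asynchronous-barrier API that the libcu++ library already exposes; on Hopper the hardware transaction barrier accelerates it. A sketch (the tile size and kernel are assumptions; launch with one 256-thread block per tile):

```cuda
#include <cuda/barrier>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void consume(const float *global_in, float *global_out) {
    __shared__ float tile[256];
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;
    auto block = cg::this_thread_block();

    if (block.thread_rank() == 0)
        init(&bar, block.size());  // one expected arrival per thread
    block.sync();

    // Kick off the copy, then sleep until all the bytes have landed:
    // "wake me up when the data has arrived."
    cuda::memcpy_async(block, tile, global_in, sizeof(tile), bar);
    bar.arrive_and_wait();

    global_out[block.thread_rank()] = tile[block.thread_rank()] * 2.0f;
}
```

The threads never poll the source of the data; the barrier counts arriving bytes on their behalf.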
Hopper also has a new processing unit called Tensor Memory Accelerator, which the company has classified as a data movement engine. The engine allows bi-directional movement of large blocks of data between the global and shared memory hierarchy. TMA also supports asynchronous memory copying between blocks of threads in a cluster.
“You call [TMA] and it goes off to do the copying, which means the hardware takes care of calculating addresses and strides, checking limits, all that kind of stuff. It can cut out a section of data…and just drop it into shared memory or put it back the other way around. You don’t have to write a single line of code,” Jones said.
Hopper has new DPX instructions for dynamic programming, a technique in which one efficiently finds the solution to a large problem by recursively solving overlapping subproblems. This could make CUDA 12 relevant for applications that optimize or solve problems by searching over many partial solutions, such as route mapping or robot path planning.
“It’s very similar to a divide and conquer approach…. except it’s the overlapping data that’s harder to resolve,” Jones said.
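For example, the inner recurrence of an edit-distance computation takes a three-way minimum per cell, which is the shape of operation DPX accelerates. A sketch, assuming CUDA 12 exposes a fused three-way minimum intrinsic named `__vimin3_s32` (treat the exact name as an assumption):

```cuda
// One cell of an edit-distance style dynamic-programming recurrence:
// the minimum of three neighboring subproblems plus a cost. On Hopper,
// DPX fuses this into fewer, faster instructions; on older GPUs the
// same expression compiles to ordinary min operations.
__device__ int dp_cell(int up, int left, int diag, int subst_cost) {
    return __vimin3_s32(up + 1, left + 1, diag + subst_cost);
}
```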
Nvidia has also improved the concept of dynamic parallelism, which allows the GPU to launch a new kernel directly without using the CPU. “By adding special mechanisms to the dynamic parallel programming model, we were able to speed up launch performance by a factor of three,” Jones said.
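Dynamic parallelism itself looks like an ordinary kernel launch, just written inside a kernel (a minimal sketch; compile with relocatable device code, e.g. `nvcc -rdc=true`):

```cuda
#include <cstdio>

__global__ void child(int depth) {
    printf("child at depth %d, thread %d\n", depth, threadIdx.x);
}

// Dynamic parallelism: the parent kernel launches the child directly
// from the GPU, with no round trip through the CPU.
__global__ void parent() {
    if (threadIdx.x == 0)
        child<<<1, 4>>>(1);
}
```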
An Nvidia moderator didn’t say whether dynamic parallelism would move to the OpenMP or OpenACC standards, saying “whether it makes it into the standards as an explicit language feature is up to the committees.”
Nvidia is also actively trying to upstream certain features of the CUDA toolkit into standard C++ releases. CUDA has its own compiler, called NVCC, which targets GPUs, and a Runtime API with a simple C++-like interface; the runtime is built on top of a lower-level driver API. GPUs typically have computing elements, such as vector processors, that are better suited to applications like AI.