CUDA

NVIDIA's parallel computing platform and programming model that lets developers use GPUs for general-purpose computation, underpinning most modern deep learning frameworks.

4 min read | Last updated May 2026 | Infrastructure

CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and programming model for general-purpose computation on graphics processing units. Released publicly in 2007 after several years of internal development, CUDA turned GPUs from graphics-only accelerators into massively parallel processors usable for scientific computing, signal processing, finance, and, most consequentially, deep learning. Almost every mainstream deep learning framework runs on top of the CUDA stack, and the ecosystem of more than three hundred CUDA acceleration libraries and around six million registered developers is widely cited as NVIDIA's most durable competitive moat in the AI hardware market.

Programming model

CUDA extends C and C++ with a small set of keywords that distinguish code that runs on the host CPU from code that runs on the GPU device. Developers write kernels, functions executed in parallel by many threads, and launch them over a grid of thread blocks. Threads within a block share fast on-chip memory and can synchronise with one another, while blocks execute independently, so the hardware can schedule them across the streaming multiprocessors of any compatible GPU. Modern CUDA also provides cooperative groups for synchronising flexible sets of threads, unified memory that migrates pages between host and device on demand, and CUDA Graphs, which capture a sequence of kernels and memory operations so that it can be relaunched with low overhead.
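The pieces of this model fit together in a short program. The following is a minimal sketch rather than a reference implementation: the names block_sum, sum_ones_on_gpu and CUDA_CHECK, and the block size of 256 threads, are assumptions chosen for illustration. It shows a kernel launched over a one-dimensional grid, per-block shared memory with synchronisation, and unified memory allocated with cudaMallocManaged.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative block size; the reduction below assumes a power of two.
constexpr int kThreadsPerBlock = 256;

// Abort with a message if a CUDA runtime call fails (hypothetical helper).
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA error: %s\n",                      \
                         cudaGetErrorString(err_));                       \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Kernel: each thread block sums its slice of `in` into one partial result.
__global__ void block_sum(const float* in, float* partial, int n) {
    __shared__ float tile[kThreadsPerBlock];  // on-chip memory shared by the block

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // global element index

    tile[tid] = (gid < n) ? in[gid] : 0.0f;  // pad the last block with zeros
    __syncthreads();                         // the whole tile is now loaded

    // Tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) tile[tid] += tile[tid + stride];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = tile[0];
}

// Host side: sums n ones on the GPU and returns the total.
double sum_ones_on_gpu(int n) {
    const int blocks = (n + kThreadsPerBlock - 1) / kThreadsPerBlock;

    // Unified (managed) memory is addressable from both host and device.
    float* in = nullptr;
    float* partial = nullptr;
    CUDA_CHECK(cudaMallocManaged(&in, n * sizeof(float)));
    CUDA_CHECK(cudaMallocManaged(&partial, blocks * sizeof(float)));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    // Launch a one-dimensional grid of `blocks` thread blocks.
    block_sum<<<blocks, kThreadsPerBlock>>>(in, partial, n);
    CUDA_CHECK(cudaGetLastError());       // report launch errors
    CUDA_CHECK(cudaDeviceSynchronize());  // wait before the host reads results

    double total = 0.0;
    for (int b = 0; b < blocks; ++b) total += partial[b];

    CUDA_CHECK(cudaFree(in));
    CUDA_CHECK(cudaFree(partial));
    return total;
}

int main() {
    const int n = 1 << 20;
    std::printf("sum = %.0f, expected %d\n", sum_ones_on_gpu(n), n);
    return 0;
}
```

Saved as a .cu file, the sketch should build with nvcc and needs an NVIDIA GPU to run. The final sum over the per-block partial results is done on the host to keep the example short; a production reduction would usually finish on the device or call a library routine.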

Higher-level Python bindings such as Numba, CuPy, and PyCUDA make CUDA accessible without writing low-level kernels, while frameworks like PyTorch, TensorFlow, and JAX hide CUDA entirely behind familiar tensor APIs.

The CUDA ecosystem

A large portion of CUDA's value comes from optimised libraries that NVIDIA distributes alongside the toolkit. cuDNN provides hand-tuned implementations of convolutions, attention, and recurrent operators that every major deep learning framework calls into. cuBLAS and cuSPARSE accelerate dense and sparse linear algebra. NCCL handles multi-GPU collective communication. TensorRT compiles trained networks into highly optimised inference engines with quantisation and kernel fusion. Triton Inference Server packages those engines for production deployment. RAPIDS extends the model to data science, providing GPU-accelerated equivalents of pandas (cuDF), scikit-learn (cuML), and NetworkX (cuGraph).
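As an illustration of how application code typically leans on these libraries instead of hand-written kernels, the sketch below multiplies two small matrices with cuBLAS. The helper names require and gemm_on_gpu and the sample matrices are assumptions made for this example; cublasSgemm itself is the library's single-precision matrix multiply, and it expects column-major storage.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Exit with a message if a step failed (hypothetical helper).
static void require(bool ok, const char* what) {
    if (!ok) {
        std::fprintf(stderr, "%s failed\n", what);
        std::exit(EXIT_FAILURE);
    }
}

// Returns C = A * B, where A is m x k and B is k x n. All matrices are
// stored column-major, which is the layout cuBLAS expects.
std::vector<float> gemm_on_gpu(const std::vector<float>& A,
                               const std::vector<float>& B,
                               int m, int n, int k) {
    std::vector<float> C(static_cast<size_t>(m) * n);
    const size_t bytesA = A.size() * sizeof(float);
    const size_t bytesB = B.size() * sizeof(float);
    const size_t bytesC = C.size() * sizeof(float);

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    require(cudaMalloc(&dA, bytesA) == cudaSuccess, "cudaMalloc A");
    require(cudaMalloc(&dB, bytesB) == cudaSuccess, "cudaMalloc B");
    require(cudaMalloc(&dC, bytesC) == cudaSuccess, "cudaMalloc C");
    require(cudaMemcpy(dA, A.data(), bytesA, cudaMemcpyHostToDevice) == cudaSuccess,
            "copy A to device");
    require(cudaMemcpy(dB, B.data(), bytesB, cudaMemcpyHostToDevice) == cudaSuccess,
            "copy B to device");

    cublasHandle_t handle;
    require(cublasCreate(&handle) == CUBLAS_STATUS_SUCCESS, "cublasCreate");

    // cuBLAS computes C = alpha * A * B + beta * C.
    const float alpha = 1.0f, beta = 0.0f;
    require(cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                        &alpha, dA, m, dB, k, &beta, dC, m) == CUBLAS_STATUS_SUCCESS,
            "cublasSgemm");

    // The blocking copy waits for the multiply, which runs on the default stream.
    require(cudaMemcpy(C.data(), dC, bytesC, cudaMemcpyDeviceToHost) == cudaSuccess,
            "copy C to host");

    cublasDestroy(handle);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}

int main() {
    // A = [1 2 3; 4 5 6] and B = [7 8; 9 10; 11 12], written column by column.
    std::vector<float> A = {1, 4, 2, 5, 3, 6};
    std::vector<float> B = {7, 9, 11, 8, 10, 12};
    std::vector<float> C = gemm_on_gpu(A, B, 2, 2, 3);
    // Mathematically, A * B = [58 64; 139 154].
    std::printf("%g %g\n%g %g\n", C[0], C[2], C[1], C[3]);
    return 0;
}
```

The sketch needs to be linked against cuBLAS, for example with nvcc and the -lcublas flag. Frameworks such as PyTorch make calls of this kind internally, which is why most users never see them.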

Hardware support

CUDA is tied to NVIDIA GPUs and identifies each hardware generation through a versioned compute capability. Recent generations include Pascal, Volta, Turing, Ampere, Ada Lovelace, Hopper, and the Blackwell architecture announced in 2024. In September 2025 the company unveiled Rubin CPX, a GPU class purpose-built for massive-context inference workloads. Since Volta, each generation has included specialised matrix-multiplication units, called Tensor Cores, that accelerate the mixed-precision arithmetic at the core of transformer training and inference.
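A program can read the compute capability of the installed GPUs at run time through the CUDA runtime API. The sketch below assumes a deliberately coarse mapping from major and minor version to architecture family; the function name architecture_name and its table are illustrative and not an official NVIDIA lookup, and the entry for Ada Lovelace (8.9) is included for completeness.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Maps a compute capability to an architecture family name.
// Coarse, illustrative table; not an official NVIDIA mapping.
const char* architecture_name(int major, int minor) {
    switch (major) {
        case 6: return "Pascal";
        case 7: return (minor >= 5) ? "Turing" : "Volta";
        case 8: return (minor >= 9) ? "Ada Lovelace" : "Ampere";
        case 9: return "Hopper";
        default: return (major >= 10) ? "Blackwell or newer" : "pre-Pascal";
    }
}

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::printf("no CUDA-capable device found\n");
        return 0;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) continue;
        // prop.major and prop.minor together form the compute capability.
        std::printf("device %d: %s, compute capability %d.%d (%s), %d SMs\n",
                    dev, prop.name, prop.major, prop.minor,
                    architecture_name(prop.major, prop.minor),
                    prop.multiProcessorCount);
    }
    return 0;
}
```

Libraries and build systems use the same version number to decide which kernels to compile or dispatch for a given device.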

Alternatives

AMD's ROCm and the open SYCL standard target the same general-purpose GPU computing space without requiring NVIDIA hardware, and Intel's oneAPI provides a portable runtime spanning CPUs, GPUs, and accelerators. Apple GPUs use the proprietary Metal API. Compatibility tools allow some CUDA code to run on AMD hardware: ZLUDA is a translation layer for existing CUDA binaries, and AMD's HIP is a CUDA-like programming interface with tools that convert CUDA source. Even so, the breadth of the CUDA library ecosystem means that most production AI workloads in 2025 still target NVIDIA GPUs first.
