GPU (Graphics Processing Unit): From Rendering Pixels to Training AI
A GPU is a massively parallel processor originally designed for graphics rendering that has become the dominant hardware platform for AI training and inference.
A GPU (Graphics Processing Unit) is a specialized processor designed for massively parallel computation. Where a CPU has a small number of powerful cores (typically 4–64) optimized for low-latency sequential tasks, a GPU has thousands of simpler cores designed to execute many operations simultaneously. A modern data center GPU like NVIDIA's H100 has 16,896 CUDA cores.

## Architecture

GPUs exploit SIMD (Single Instruction, Multiple Data) parallelism: one instruction is dispatched across many execution units at once. Threads are grouped into "warps" (NVIDIA, 32 threads) or "wavefronts" (AMD, 64 threads) that execute in lockstep; NVIDIA calls its variant of this model SIMT (Single Instruction, Multiple Threads). GPUs also feature vastly higher memory bandwidth than CPUs: the H100 delivers ~3.35 TB/s via HBM3 memory, compared to ~100 GB/s on a typical server CPU.

## History

NVIDIA coined the term "GPU" in 1999 with the GeForce 256, the first chip to handle transform and lighting on-chip. Programmable shaders in the early 2000s opened the door to non-graphics computation. In 2006, NVIDIA launched CUDA (Compute Unified Device Architecture), enabling general-purpose GPU programming. The watershed moment came in 2012, when AlexNet (trained on two GTX 580 GPUs) won the ImageNet competition by a dramatic margin, signaling the start of the deep learning era.

## Role in AI

GPUs dominate deep learning because the core workloads of transformers and convolutional neural networks reduce to large matrix multiplications, which is exactly the operation GPUs parallelize best. The H100 delivers ~1,979 TFLOPS at FP16 precision with structured sparsity (~990 TFLOPS dense) using tensor cores, dedicated matrix-multiply units introduced with NVIDIA's Volta architecture in 2017. Training a frontier large language model involves trillions of multiply-accumulate operations; a single GPU can sustain hundreds of teraFLOPS on this workload, where a CPU would be orders of magnitude slower.

## Manufacturers

**NVIDIA** dominates AI and data center computing (H100, A100, B200 Blackwell).
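As an aside, the Role in AI figures above can be checked with back-of-the-envelope arithmetic. The sketch below counts the floating-point operations in a single large matrix multiply; the matrix dimensions are illustrative assumptions, and the throughput is the peak FP16 figure quoted above:

```python
# Back-of-the-envelope FLOP count for one matrix multiply.
# Multiplying an (M x K) matrix by a (K x N) matrix takes M*N*K
# multiply-accumulates, i.e. 2*M*N*K floating-point operations.
# The dimensions below are illustrative, not from any specific model.
M, K, N = 8192, 8192, 8192

flops = 2 * M * K * N        # ~1.1e12 floating-point operations
h100_fp16_peak = 1979e12     # peak FP16 tensor-core TFLOPS quoted above

seconds_at_peak = flops / h100_fp16_peak
print(f"{flops:.3e} FLOPs -> {seconds_at_peak * 1e3:.2f} ms at H100 FP16 peak")
```

Real kernels sustain only a fraction of peak throughput, so this is a lower bound on runtime, but it shows why trillions of such operations are tractable on a GPU.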
The CUDA ecosystem, with deep integration into PyTorch, TensorFlow, and JAX, creates strong lock-in. **AMD** competes with RDNA (consumer) and CDNA (data center, MI300X) architectures, with ROCm as an open-source CUDA alternative. **Intel** offers Arc consumer GPUs and Gaudi accelerators for data center AI.

## Power and Scale

An H100 SXM5 draws 700 W at full load. A large AI training cluster of 100,000 GPUs therefore consumes ~70 MW continuously. GPU power demand is the primary driver behind the explosive growth in AI data center electricity consumption. Training GPT-4-scale models reportedly consumed ~50 GWh of electricity.
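The cluster power figure follows from simple arithmetic, sketched here using the per-GPU draw and GPU count quoted above:

```python
# Cluster power arithmetic from the figures in the text.
gpu_tdp_w = 700          # H100 SXM5 power draw at full load, in watts
cluster_gpus = 100_000   # GPUs in a large training cluster

# Total draw: watts -> megawatts (1 MW = 1e6 W)
cluster_mw = gpu_tdp_w * cluster_gpus / 1e6
print(f"{cluster_gpus:,} GPUs x {gpu_tdp_w} W = {cluster_mw:.0f} MW")
```

Note this counts only the GPUs themselves; CPUs, networking, and cooling add substantial overhead on top of the ~70 MW.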