# Memory Wall
Coined in Wulf and McKee's 1994 paper 'Hitting the memory wall,' the memory wall is the widening gap between the rate at which CPU speed grows and the rate at which memory speed grows. Modern AI systems spend most of their compute energy shuttling data between memory and compute, not doing math. HBM3/HBM4 helps but doesn't solve it, and photonic and other alternative compute architectures typically move the bottleneck rather than removing it.

The **memory wall** is the growing performance gap between processor speed and memory subsystem speed. Modern systems — AI workloads especially — spend the majority of their compute energy moving data between memory and compute units, not performing arithmetic.

## Origin

Coined in the **1994 paper 'Hitting the memory wall: implications of the obvious'** by William A. Wulf and Sally A. McKee (University of Virginia). The paper observed that CPU speeds were improving ~60% per year while DRAM access latency was improving only ~7% per year. Extrapolating those trends, they argued the disparity would come to dominate system performance within a decade. It did.

## The gap today

- **CPU clock**: has effectively plateaued since ~2005 due to thermal/power limits, but IPC and core counts continue rising.
- **Compute throughput**: in AI accelerators, has grown roughly 5-10× per generation (A100 → H100 → B100 → ...).
- **DRAM access latency**: has stayed roughly flat — a typical DDR4/DDR5 access is ~60-100 ns.
- **Memory bandwidth**: grows through wider buses plus stacked technologies like HBM (High Bandwidth Memory), now at HBM3/HBM4, but per-bit energy is still dominated by the off-chip move.

## Energy economics

Key data point: moving a 64-bit word from off-chip DRAM to a CPU takes roughly **200 pJ**, while performing the actual computation on that data takes ~1-10 pJ. The **compute-to-move energy ratio is ~1:20 to 1:200** depending on workload. This means modern AI accelerators can spend >90% of their energy on data movement, not math.
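To make the >90% figure concrete, here is a back-of-envelope sketch that simply plugs in the per-word figures quoted above (rough literature numbers, not measurements of any specific chip):

```python
# Energy split between data movement and arithmetic for one kernel,
# using the illustrative per-word figures quoted above.
DRAM_MOVE_PJ = 200.0  # move one 64-bit word in from off-chip DRAM
COMPUTE_PJ = 5.0      # arithmetic on that word (middle of the ~1-10 pJ range)

def movement_fraction(words_moved: int, ops: int) -> float:
    """Fraction of total energy spent moving data rather than computing."""
    move_energy = words_moved * DRAM_MOVE_PJ
    math_energy = ops * COMPUTE_PJ
    return move_energy / (move_energy + math_energy)

# Streaming kernel: every operation touches a fresh word from DRAM.
print(f"no reuse:  {movement_fraction(1, 1):.0%} of energy is movement")
# Cached/tiled kernel: each fetched word is reused for 50 operations.
print(f"50x reuse: {movement_fraction(1, 50):.0%} of energy is movement")
```

With no reuse, ~98% of the energy goes to movement; even 50 operations per fetched word only brings the share down to ~44%. That is why every mitigation below is, one way or another, about reuse or about shortening the move.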
## Mitigations

### Cache hierarchies

- L1 / L2 / L3 SRAM caches exploit temporal + spatial locality.
- Effective for workloads with reuse; of little help for streaming or random-access patterns.

### Tiling and blocking

- Matrix operations are tiled so sub-tiles fit in cache; most of the arithmetic happens without fresh memory fetches (see the sketch after this list).
- BLAS libraries, cuBLAS, and oneDNN are all heavily tuned for tile sizes matching the cache hierarchy.
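A minimal sketch of blocking, assuming `TILE` stands in for whatever fits the target cache (real BLAS kernels additionally reorder loops, pack tiles, and vectorize):

```python
import numpy as np

TILE = 64  # stand-in tile size; real libraries tune this to L1/L2 capacity

def tiled_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """C = A @ B, computed one (TILE x TILE) block at a time so each
    loaded sub-tile is reused for many multiply-adds before eviction."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, TILE):
        for j in range(0, m, TILE):
            for p in range(0, k, TILE):
                # The three sub-tiles of this update fit in cache together:
                # O(TILE^3) arithmetic amortizes O(TILE^2) memory traffic.
                C[i:i+TILE, j:j+TILE] += (
                    A[i:i+TILE, p:p+TILE] @ B[p:p+TILE, j:j+TILE]
                )
    return C

A = np.random.rand(256, 256)
B = np.random.rand(256, 256)
assert np.allclose(tiled_matmul(A, B), A @ B)
```

The win is the cubic-arithmetic-over-quadratic-traffic ratio: bigger tiles mean more reuse per byte fetched, right up until the tiles stop fitting in cache.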
### High Bandwidth Memory (HBM)

- Stacked DRAM dies with a very wide interface, placed next to the compute die via an interposer. An order of magnitude or more bandwidth than DDR.
- Expensive and thermally constrained, but now standard on GPUs and TPUs.

### Near-memory / processing-in-memory (PIM)

- Computation units embedded inside the memory chips themselves, shrinking the move distance.
- Samsung HBM-PIM, UPMEM, and Mythic products are examples.

### On-chip SRAM scratchpads

- Explicit software-managed memory (not a cache). Groq TSPs and some TPU variants use this.

## Memory wall in photonic computing

Photonic processors such as the Q.ANT Photonic AI Processor (NPU 2, 2026) compute with light, but their inputs and outputs live in conventional electrical memory. Every read is an electrical→optical conversion; every write is an optical→electrical one. These conversions dominate latency and power once the photonic compute itself is fast enough. The research frontier for removing this constraint: **Optical SRAM and the Photonic Latch** — on-chip memory that stays in the optical domain. Still experimental.

## Broader pattern

Every compute acceleration — vector machines in the 1980s, GPUs in the 2000s, AI accelerators today, photonic chips tomorrow — hits the same wall from a different angle: making compute faster shifts the bottleneck to data movement. The principle generalizes: **systems-level performance is about the ratio of compute cost to data-movement cost, not either in isolation.**

## Architectural implications

- Workloads with high arithmetic intensity (in roofline-model terms: flops per byte of memory access) benefit most from compute acceleration.
- Workloads with low arithmetic intensity are memory-bound and see minimal benefit (see the roofline sketch at the end of this note).
- Optimization effort on modern accelerators goes largely into data layout + kernel fusion + attention-pattern tiling, not raw arithmetic tricks.

The memory wall is arguably the **single most important structural constraint** on modern computing, and it is underpublicized relative to its impact.
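The roofline sketch referenced above: a rough check of which side of the wall two common kernels land on. The peak-throughput and bandwidth numbers are illustrative placeholders, not any specific chip's datasheet.

```python
# Roofline-style check: is a kernel compute-bound or memory-bound?
PEAK_FLOPS = 1e15  # peak compute in FLOP/s (illustrative placeholder)
PEAK_BW = 3e12     # memory bandwidth in bytes/s (illustrative placeholder)
RIDGE = PEAK_FLOPS / PEAK_BW  # intensity needed to saturate the compute units

def attainable(ai: float) -> float:
    """Attainable FLOP/s for a kernel with arithmetic intensity `ai`."""
    return min(PEAK_FLOPS, PEAK_BW * ai)

# Large fp16 matmul: ~2n^3 flops over ~3n^2 elements x 2 bytes of traffic.
n = 4096
matmul_ai = (2 * n**3) / (3 * n**2 * 2)
# fp16 elementwise add: 1 flop per 3 elements x 2 bytes (2 reads, 1 write).
add_ai = 1 / (3 * 2)

for name, ai in [("matmul", matmul_ai), ("elementwise add", add_ai)]:
    side = "compute" if ai >= RIDGE else "memory"
    print(f"{name}: {ai:7.2f} flops/byte -> {side}-bound, "
          f"{attainable(ai):.1e} FLOP/s attainable")
```

The matmul clears the ridge point easily; the elementwise add is pinned at a tiny fraction of peak no matter how fast the arithmetic units get. That is the memory wall in one number.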