Pick one CUDA primitive at a time, then map the same hardware floorplan to CNN and Transformer workloads. The goal is to make data motion, scheduling, and kernel bottlenecks concrete.
What it’s doing
Pick a call from the list on the left.
What does each block mean?
Host↔GPU control path: CPU enqueues work then signals the GPU command queue over PCIe/NVLink control registers.
Data path: DMA engines move payload bytes; host pages are pinned for overlap, and IOMMU mappings define which memory ranges the device may touch.
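The two paths above can be sketched with the standard CUDA runtime calls. This is a minimal, illustrative sketch (buffer size and names are assumptions, not from the source): the CPU only enqueues work; the copy engine does the byte movement.

```cuda
// Sketch: control path (CPU enqueues) vs data path (DMA moves bytes).
#include <cuda_runtime.h>

int main(void) {
    const size_t n = 1 << 20;                  // 1M floats — illustrative size
    float *h_buf, *d_buf;
    cudaStream_t s;
    cudaStreamCreate(&s);
    cudaMallocHost(&h_buf, n * sizeof(float)); // pinned pages: DMA-able, enables overlap
    cudaMalloc(&d_buf, n * sizeof(float));
    // Control path: this call only enqueues a command and returns immediately.
    // Data path: a DMA copy engine then streams the payload over PCIe/NVLink
    // while the CPU keeps running.
    cudaMemcpyAsync(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice, s);
    cudaStreamSynchronize(s);                  // wait for the DMA to complete
    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    cudaStreamDestroy(s);
    return 0;
}
```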
Warps issued: 32 threads marching in lockstep — the GPU's smallest execution unit
HBM bytes: real DRAM traffic, not allocation size
L2 hits: loads served from on-die cache
Stalls: cycles a warp could not issue
FLOPs: floating-point math ops done
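One hypothetical kernel can tie these counters together (the kernel name, grid shape, and sizes here are illustrative assumptions):

```cuda
// Sketch: mapping the counters above to a single launch.
__global__ void axpy(float a, const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];   // 2 FLOPs per element (multiply + add)
}
// Launch: axpy<<<4096, 256>>>(2.0f, x, y, 1 << 20);
// Warps issued ≈ 4096 blocks * 256 threads / 32 = 32768 warps.
// HBM bytes ≈ bytes actually read/written (x and y in, y out) that miss L2 —
//   not the cudaMalloc allocation size.
// FLOPs ≈ 2 * n. Stalls accumulate while warps wait on loads that missed L2.
```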
warp
= 32 GPU threads on one SM that execute the same instruction at the same time.
“Warps issued” counts how many of these 32-thread groups the kernel has launched
— it is the GPU’s heartbeat. More warps issued per second = more useful work done.
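A thread can see which warp it belongs to with plain index arithmetic; this tiny sketch (names are illustrative) makes the 32-thread grouping concrete:

```cuda
// Sketch: each thread computes its warp index and its lane within that warp.
__global__ void warp_ids(int *warp_of, int *lane_of) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    warp_of[tid] = tid / 32;   // which 32-thread group this thread belongs to
    lane_of[tid] = tid % 32;   // position inside the warp (0..31)
}
```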
stream + DMA
= CPU submits transfer/launch commands; driver maps pages for DMA; GPU copy engines move bytes over PCIe/NVLink while SMs run kernels. For P2P/RDMA, IOMMU policy can permit or block direct device-to-device paths.
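The copy/compute overlap described here is usually expressed by chunking work across two streams. A hedged sketch, assuming a hypothetical `process` kernel and pre-allocated pinned `h_in` / device `d_in` buffers:

```cuda
// Sketch: two streams so DMA copy engines and SMs work concurrently.
cudaStream_t streams[2];
cudaStreamCreate(&streams[0]);
cudaStreamCreate(&streams[1]);
for (int k = 0; k < chunks; ++k) {
    cudaStream_t s = streams[k % 2];
    // Copy for chunk k is ordered before its kernel on the same stream,
    // but can overlap with the kernel of the previous chunk on the other stream.
    cudaMemcpyAsync(d_in + (size_t)k * chunk, h_in + (size_t)k * chunk,
                    chunk * sizeof(float), cudaMemcpyHostToDevice, s);
    process<<<grid, block, 0, s>>>(d_in + (size_t)k * chunk, chunk);  // hypothetical kernel
}
cudaDeviceSynchronize();
```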