CUDA Guide — Interactive GPU Simulator

Pick one CUDA primitive at a time, then map the same hardware floorplan to CNN and Transformer workloads. The goal is to make data motion, scheduling, and kernel bottlenecks concrete.

What it’s doing

Pick a call from the list on the left.

What does each block mean?

Host↔GPU control path: the CPU enqueues work, then signals the GPU command queue over PCIe/NVLink control registers. Data path: DMA engines move the payload bytes; host pages are pinned so copies can overlap with compute, and IOMMU mappings define which memory ranges the device may touch.
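The control/data split above can be sketched with the standard CUDA runtime API. This is a minimal illustration, not the simulator's code: the kernel body and buffer size are placeholders; the enqueue calls are the control path, and the DMA engines plus SMs draining the stream are the data path.

```cuda
#include <cuda_runtime.h>

__global__ void kernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;                      // placeholder work
}

int main() {
    const int n = 1 << 20;
    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float));        // pinned pages: DMA-able, enables overlap
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t s;
    cudaStreamCreate(&s);

    // Control path: these calls only enqueue commands and return immediately.
    cudaMemcpyAsync(d, h, n * sizeof(float), cudaMemcpyHostToDevice, s);
    kernel<<<(n + 255) / 256, 256, 0, s>>>(d, n); // ordered after the copy, same stream
    cudaMemcpyAsync(h, d, n * sizeof(float), cudaMemcpyDeviceToHost, s);

    // Data path: copy engines and SMs drain the queue; the host blocks only here.
    cudaStreamSynchronize(s);

    cudaStreamDestroy(s);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```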

- Warps issued: 32 threads marching in lockstep, the GPU's smallest execution unit
- HBM bytes: real DRAM traffic, not allocation size
- L2 hits: loads served from on-die cache
- Stalls: cycles a warp could not issue
- FLOPs: floating-point math ops done
warp = 32 GPU threads on one SM that execute the same instruction at the same time. “Warps issued” counts how many of these 32-thread groups the kernel has launched — it is the GPU’s heartbeat. More warps issued per second = more useful work done.
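The 32-thread granularity is visible directly in a kernel. This hedged sketch (kernel name `whoami` is illustrative) derives each thread's lane and warp index from `threadIdx.x` using the built-in `warpSize`, which is 32 on current NVIDIA GPUs; launching 2 blocks of 128 threads issues 8 warps.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void whoami() {
    int lane = threadIdx.x % warpSize;   // position within the warp (0..31)
    int warp = threadIdx.x / warpSize;   // warp index within the block
    // All 32 lanes of a warp execute the same instruction together;
    // only lane 0 prints, one line per warp.
    if (lane == 0)
        printf("block %d, warp %d issuing\n", blockIdx.x, warp);
}

int main() {
    whoami<<<2, 128>>>();                // 2 blocks x 128 threads = 8 warps issued
    cudaDeviceSynchronize();
    return 0;
}
```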
stream + DMA = CPU submits transfer/launch commands; driver maps pages for DMA; GPU copy engines move bytes over PCIe/NVLink while SMs run kernels. For P2P/RDMA, IOMMU policy can permit or block direct device-to-device paths.
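The P2P check mentioned above looks like this with the standard CUDA runtime API. A hedged sketch, assuming two devices are present; whether the direct path is actually usable also depends on topology and IOMMU policy, as the text notes.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int canAccess = 0;
    // Can device 0 read/write device 1's memory directly (NVLink/PCIe P2P)?
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    if (canAccess) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);  // second arg is reserved, must be 0
        // cudaMemcpyPeerAsync can now take the direct device-to-device path.
    } else {
        printf("No P2P path; copies bounce through host memory.\n");
    }
    return 0;
}
```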