Pick one CUDA primitive at a time, then map the same hardware floorplan to CNN and Transformer workloads. The goal is to make data motion, scheduling, and kernel bottlenecks concrete.
What it’s doing
Pick a call from the list on the left.
What does each block mean?
Host↔GPU control path: CPU enqueues work then signals the GPU command queue over PCIe/NVLink control registers.
Data path: DMA engines move payload bytes; host pages are pinned for overlap, and IOMMU mappings define which memory ranges the device may touch.
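The two paths above can be sketched with the standard CUDA runtime calls. This is a minimal, illustrative sketch (buffer size and names are assumptions, not from the source): the CPU only enqueues work; the copy engine does the byte movement.

```cuda
// Sketch: control path (CPU enqueues) vs data path (DMA moves bytes).
#include <cuda_runtime.h>

int main(void) {
    const size_t n = 1 << 20;                  // 1M floats — illustrative size
    float *h_buf, *d_buf;
    cudaStream_t s;
    cudaStreamCreate(&s);
    cudaMallocHost(&h_buf, n * sizeof(float)); // pinned pages: DMA-able, enables overlap
    cudaMalloc(&d_buf, n * sizeof(float));
    // Control path: this call only enqueues a command and returns immediately.
    // Data path: a DMA copy engine then streams the payload over PCIe/NVLink
    // while the CPU keeps running.
    cudaMemcpyAsync(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice, s);
    cudaStreamSynchronize(s);                  // wait for the DMA to complete
    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    cudaStreamDestroy(s);
    return 0;
}
```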
Warps issued: 32 threads marching in lockstep — the GPU's smallest execution unit
HBM bytes: real DRAM traffic, not allocation size
L2 hits: loads served from on-die cache
Stalls: cycles a warp could not issue
FLOPs: floating-point math ops done
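One hypothetical kernel can tie these counters together (the kernel name, grid shape, and sizes here are illustrative assumptions):

```cuda
// Sketch: mapping the counters above to a single launch.
__global__ void axpy(float a, const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];   // 2 FLOPs per element (multiply + add)
}
// Launch: axpy<<<4096, 256>>>(2.0f, x, y, 1 << 20);
// Warps issued ≈ 4096 blocks * 256 threads / 32 = 32768 warps.
// HBM bytes ≈ bytes actually read/written (x and y in, y out) that miss L2 —
//   not the cudaMalloc allocation size.
// FLOPs ≈ 2 * n. Stalls accumulate while warps wait on loads that missed L2.
```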
warp
= 32 GPU threads on one SM that execute the same instruction at the same time.
“Warps issued” counts how many of these 32-thread groups the kernel has launched
— it is the GPU’s heartbeat. More warps issued per second = more useful work done.
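A thread can see which warp it belongs to with plain index arithmetic; this tiny sketch (names are illustrative) makes the 32-thread grouping concrete:

```cuda
// Sketch: each thread computes its warp index and its lane within that warp.
__global__ void warp_ids(int *warp_of, int *lane_of) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    warp_of[tid] = tid / 32;   // which 32-thread group this thread belongs to
    lane_of[tid] = tid % 32;   // position inside the warp (0..31)
}
```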
stream + DMA
= CPU submits transfer/launch commands; driver maps pages for DMA; GPU copy engines move bytes over PCIe/NVLink while SMs run kernels. For P2P/RDMA, IOMMU policy can permit or block direct device-to-device paths.
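The copy/compute overlap described here is usually expressed by chunking work across two streams. A hedged sketch, assuming a hypothetical `process` kernel and pre-allocated pinned `h_in` / device `d_in` buffers:

```cuda
// Sketch: two streams so DMA copy engines and SMs work concurrently.
cudaStream_t streams[2];
cudaStreamCreate(&streams[0]);
cudaStreamCreate(&streams[1]);
for (int k = 0; k < chunks; ++k) {
    cudaStream_t s = streams[k % 2];
    // Copy for chunk k is ordered before its kernel on the same stream,
    // but can overlap with the kernel of the previous chunk on the other stream.
    cudaMemcpyAsync(d_in + (size_t)k * chunk, h_in + (size_t)k * chunk,
                    chunk * sizeof(float), cudaMemcpyHostToDevice, s);
    process<<<grid, block, 0, s>>>(d_in + (size_t)k * chunk, chunk);  // hypothetical kernel
}
cudaDeviceSynchronize();
```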