How to Find GPU Gaps in PyTorch 2.12 With torch.profiler
Learn how to identify performance bottlenecks and idle GPU lanes using the native torch.profiler in PyTorch 2.12 across Blackwell and AMD hardware.
Hugging Face’s new guide to torch.profiler details how to identify performance bottlenecks using the latest features in the PyTorch 2.12 ecosystem. The native tool captures execution traces across CPU and GPU architectures. Manual profiling remains necessary to find GPU gaps, the stretches where the GPU sits idle while it waits on CPU-bound data loading or Python overhead, especially when you scale PyTorch training across multiple nodes.
With the widespread adoption of torch.compile and automated kernel optimizers like AutoKernel, manual profiling has become more complex. Fused kernels and graph breaks obscure the standard execution timeline. The updated profiler acts as the primary diagnostic utility to decode these automated optimizations and verify efficient hardware utilization.
Trace Scheduling and Asynchronous Capture
Capturing every step of a training loop skews performance results through synchronization overhead. PyTorch 2.9 and later versions support capturing traces asynchronously. This lets you profile specific segments without interrupting the program flow.
The profiler schedule function mitigates first-iteration spikes, a common profiling error where one-time initialization costs are mistaken for steady-state training overhead. The recommended baseline schedule configuration relies on four parameters.
| Parameter | Recommended Value | Purpose |
|---|---|---|
| wait | 1 | Skips the initial step entirely to bypass initialization overhead. |
| warmup | 2 | Executes steps without recording to reach steady-state. |
| active | 6 | Actively records traces during these core iterations. |
| repeat | 1 | Runs the cycle a single time per profiling session. |
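The baseline schedule in the table can be sketched with `torch.profiler.schedule`. This is a minimal illustration, not the guide's own code: the `Linear` model, the batch shape, and the helper names `actions_for` and `run_profiled_steps` are hypothetical stand-ins, and the loop falls back to CPU-only profiling when no GPU is present.

```python
import torch
from torch.profiler import ProfilerAction, ProfilerActivity, profile, schedule

# Baseline schedule from the table: 1 skipped step, 2 warmup steps,
# 6 recorded steps, and a single cycle.
baseline = schedule(wait=1, warmup=2, active=6, repeat=1)


def actions_for(num_steps):
    """Return the profiler action the schedule assigns to each step index."""
    return [baseline(step) for step in range(num_steps)]


def run_profiled_steps(num_steps=10):
    """Run a toy training loop under the profiler; return how many traces were delivered."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    activities = [ProfilerActivity.CPU]
    if device == "cuda":
        activities.append(ProfilerActivity.CUDA)

    model = torch.nn.Linear(64, 64).to(device)  # hypothetical stand-in model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    delivered = []

    def on_trace_ready(prof):
        # Called once per completed cycle. A real run might export the trace
        # here, for example with prof.export_chrome_trace("trace.json").
        delivered.append(prof.step_num)

    with profile(activities=activities, schedule=baseline,
                 on_trace_ready=on_trace_ready) as prof:
        for _ in range(num_steps):
            loss = model(torch.randn(32, 64, device=device)).sum()
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            prof.step()  # tell the profiler that one iteration has finished
    return len(delivered)
```

With these values, step 0 is skipped, steps 1 and 2 are warmup, and steps 3 through 8 are recorded, so the loop needs at least nine iterations before a trace is delivered.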
Memory Snapshot API
PyTorch has deprecated the export_memory_timeline method within the profiler module. You must now use the Memory Snapshot API to track allocations. This requires calling torch.cuda.memory._record_memory_history to start recording allocation events, then torch.cuda.memory._dump_snapshot to write the snapshot file.
The updated memory attribution system natively integrates with the pytorch.org/memory_viz viewer. This integration resolves previous issues where unknown memory categories were vaguely labeled as allocator-reserved. Developers can now trace specific tensor allocations back to their exact origin in the training loop.
Hardware Compatibility and Graph Capture
PyTorch 2.12 introduces the torch.accelerator Graph API. This interface unifies graph capture and replay across CUDA, XPU, and third-party accelerators. The profiler hooks directly into this unified API to visualize graph-based workloads accurately.
This release ensures stable profiling for NVIDIA Hopper (H100, H200) and Blackwell (B200) GPUs running CUDA 13.2. It also supports AMD MI350X hardware via ROCm 7.2.3, extending the AMD coverage that matters when you fine-tune Qwen3 on AMD MI300X environments. The latest updates specifically target vLLM inference workloads. They reduce idle gaps between GPU kernels by merging gather operations during embedding-heavy tasks.
Configuring Trace Specifics
The profiler offers specific configuration flags to capture deeper execution details. Enabling these features adds performance overhead, so they should only be activated during targeted diagnostic runs.
- record_shapes: Saves operator input shapes. This parameter is critical for identifying dynamic shape recompilation issues within torch.compile.
- with_stack: Records file and line numbers for operations. Recent PyTorch optimizations reduce the overhead of this flag during large-scale execution traces.
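Both flags are keyword arguments to `torch.profiler.profile`. The sketch below enables them on a small CPU-only forward pass and groups the results by input shape. The model, the tensor sizes, and the helper name `profile_with_shapes` are hypothetical and chosen only to keep the example self-contained.

```python
import torch
from torch.profiler import ProfilerActivity, profile


def profile_with_shapes():
    """Profile one forward pass with shape and stack recording enabled."""
    model = torch.nn.Linear(64, 64)  # hypothetical stand-in model
    inputs = torch.randn(32, 64)

    with profile(
        activities=[ProfilerActivity.CPU],
        record_shapes=True,  # keep operator input shapes
        with_stack=True,     # keep Python file and line information
    ) as prof:
        model(inputs)

    # Grouping by input shape keeps calls with different shapes in separate rows.
    return prof.key_averages(group_by_input_shape=True)


if __name__ == "__main__":
    print(profile_with_shapes().table(sort_by="cpu_time_total", row_limit=10))
```

If the same operator appears in many rows with different input shapes, that is a hint that shapes vary from step to step, which is the situation that can trigger recompilation under torch.compile.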
For full implementation syntax and additional parameter configurations, consult the profiler documentation.
Begin your optimization workflow by profiling a short window of training steps with the baseline schedule. This immediately highlights data loading bottlenecks before you invest time optimizing individual CUDA kernels.
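One way to make data loading stand out in a trace is to label regions with `torch.profiler.record_function`. The sketch below is an assumption-laden illustration rather than the guide's method: `fake_batches` is a hypothetical slow data source that sleeps to imitate CPU-bound preprocessing, and the labels `data_loading` and `train_step` are arbitrary names.

```python
import time

import torch
from torch.profiler import ProfilerActivity, profile, record_function


def fake_batches(num_batches, delay_s=0.005):
    """Hypothetical slow data source standing in for a CPU-bound DataLoader."""
    for _ in range(num_batches):
        time.sleep(delay_s)  # imitates decoding or augmentation work on the CPU
        yield torch.randn(32, 64)


def profile_labeled_steps(num_batches=4):
    """Profile a toy loop and return the averaged events keyed by name."""
    model = torch.nn.Linear(64, 64)  # hypothetical stand-in model
    batches = fake_batches(num_batches)

    with profile(activities=[ProfilerActivity.CPU]) as prof:
        for _ in range(num_batches):
            with record_function("data_loading"):
                batch = next(batches)
            with record_function("train_step"):
                model(batch).sum().backward()

    return {evt.key: evt for evt in prof.key_averages()}


if __name__ == "__main__":
    events = profile_labeled_steps()
    for name in ("data_loading", "train_step"):
        print(name, events[name].count, events[name].cpu_time_total)
```

When the time under `data_loading` rivals or exceeds the time under `train_step`, the input pipeline is the first thing to fix, before any kernel-level tuning.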