How to Find GPU Gaps in PyTorch 2.12 With torch.profiler

Hugging Face’s new guide to torch.profiler explains how to identify performance bottlenecks using the latest features in the PyTorch 2.12 ecosystem. The built-in profiler captures execution traces across both CPU and GPU. Manual profiling remains necessary to find GPU gaps, the stretches of a trace where CUDA streams sit idle because of CPU-bound data loading or Python overhead, especially when you scale PyTorch training across multiple nodes.

With the widespread adoption of torch.compile and automated kernel optimizers like AutoKernel, manual profiling has become more complex. Fused kernels and graph breaks obscure the standard execution timeline. The updated profiler acts as the primary diagnostic utility to decode these automated optimizations and verify efficient hardware utilization.

Trace Scheduling and Asynchronous Capture

Capturing every step of a training loop skews performance results through synchronization overhead. PyTorch 2.9 and later versions support capturing traces asynchronously. This lets you profile specific segments without interrupting the program flow.

The profiler schedule function mitigates first-iteration spikes, a common profiling error where one-time initialization costs are mistaken for steady-state training overhead. The recommended baseline schedule configuration relies on four parameters.

Parameter	Recommended Value	Purpose
`wait`	1	Skips the initial step entirely to bypass initialization overhead.
`warmup`	2	Runs the profiler for these steps but discards the results, so tracing start-up overhead does not leak into the recorded steps.
`active`	6	Actively records traces during these core iterations.
`repeat`	1	Runs the cycle a single time per profiling session.
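The baseline values above can be sketched as a short profiling run. This is a minimal illustration, not the guide's own code: the `Linear` model, the batch shape, and the `on_trace_ready` callback body are hypothetical stand-ins, and the loop falls back to CPU-only profiling when no GPU is present.

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule

# Baseline schedule from the table: skip 1 step, warm up for 2, record 6, run once.
baseline = schedule(wait=1, warmup=2, active=6, repeat=1)
TOTAL_STEPS = 1 + 2 + 6  # wait + warmup + active

# The phase the schedule assigns to each step index (one extra step past the cycle).
phases = [baseline(step).name for step in range(TOTAL_STEPS + 1)]

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(64, 8).to(device)  # hypothetical stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

completed_traces = []  # one entry per finished active window


def on_trace_ready(prof):
    # Called once the active window ends; a real run would usually save the trace,
    # e.g. prof.export_chrome_trace("trace.json").
    completed_traces.append(prof.step_num)


with profile(activities=activities, schedule=baseline,
             on_trace_ready=on_trace_ready) as prof:
    for _ in range(TOTAL_STEPS):
        inputs = torch.randn(32, 64, device=device)
        loss = model(inputs).sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        prof.step()  # tell the profiler that one training iteration finished
```

Calling `prof.step()` once per iteration is what advances the schedule; without it the profiler never leaves the first phase.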

Memory Snapshot API

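PyTorch has deprecated the export_memory_timeline method within the profiler module. You must now use the Memory Snapshot API to track allocations. This means calling torch.cuda.memory._record_memory_history to start recording allocator events, then torch.cuda.memory._dump_snapshot to write the snapshot to a file. Both functions are underscore-prefixed, so treat them as subject to change between releases.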

The updated memory attribution system natively integrates with the pytorch.org/memory_viz viewer. This integration resolves previous issues where unknown memory categories were vaguely labeled as allocator-reserved. Developers can now trace specific tensor allocations back to their exact origin in the training loop.
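A minimal sketch of the snapshot workflow follows. It uses `torch.cuda.memory._record_memory_history` and `torch.cuda.memory._dump_snapshot`, the private functions described in PyTorch's memory-debugging documentation. The helper name `capture_memory_snapshot`, the output path, the `max_entries` value, and the small matmul workload are assumptions made for illustration, and the function does nothing on a machine without a CUDA device.

```python
import torch


def capture_memory_snapshot(path="memory_snapshot.pickle", max_entries=100_000):
    """Record allocator history around a small workload and dump it to `path`.

    Returns True if a snapshot was written, False when CUDA is unavailable.
    """
    if not torch.cuda.is_available():
        return False

    # Start recording allocation and free events, including stack traces.
    torch.cuda.memory._record_memory_history(max_entries=max_entries)
    try:
        # Hypothetical workload; replace with a few real training steps.
        x = torch.randn(1024, 1024, device="cuda")
        y = x @ x
        del x, y
        # Write the recorded history to a pickle file for the viewer.
        torch.cuda.memory._dump_snapshot(path)
    finally:
        # Stop recording so later code does not pay the bookkeeping cost.
        torch.cuda.memory._record_memory_history(enabled=None)
    return True


wrote_snapshot = capture_memory_snapshot()
```

The resulting pickle file can be dragged into the pytorch.org/memory_viz page to inspect allocations over time.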

Hardware Compatibility and Graph Capture

PyTorch 2.12 introduces the torch.accelerator Graph API. This interface unifies graph capture and replay across CUDA, XPU, and third-party accelerators. The profiler hooks directly into this unified API to visualize graph-based workloads accurately.

This release ensures stable profiling on NVIDIA Hopper (H100, H200) and Blackwell (B200) GPUs running CUDA 13.2. It also supports AMD MI350X hardware via ROCm 7.2.3, which is relevant when you fine-tune Qwen3 on AMD Instinct hardware such as the MI300X. The latest updates specifically target vLLM inference workloads. They reduce idle gaps between GPU kernels by merging gather operations during embedding-heavy tasks.

Configuring Trace Specifics

The profiler offers specific configuration flags to capture deeper execution details. Enabling these features adds performance overhead, so they should only be activated during targeted diagnostic runs.

record_shapes: Saves operator input shapes. This parameter is critical for identifying dynamic shape recompilation issues within torch.compile.
with_stack: Records file and line numbers for operations. Recent PyTorch optimizations reduce the overhead of this flag during large-scale execution traces.
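As a rough illustration of the two flags, the sketch below profiles a few matrix multiplications of different sizes on the CPU and groups the results by input shape. The tensor sizes and the loop are hypothetical; only `record_shapes`, `with_stack`, and `key_averages(group_by_input_shape=True)` come from the profiler itself.

```python
import torch
from torch.profiler import ProfilerActivity, profile

with profile(
    activities=[ProfilerActivity.CPU],
    record_shapes=True,  # keep operator input shapes
    with_stack=True,     # keep Python file and line information
) as prof:
    # Hypothetical workload: the same op called with varying shapes,
    # the pattern that can trigger recompilation under torch.compile.
    for size in (16, 32, 64):
        a = torch.randn(size, size)
        b = torch.randn(size, size)
        torch.matmul(a, b)

# Group statistics per (operator, input shapes) pair instead of per operator.
averages = prof.key_averages(group_by_input_shape=True)
report = averages.table(sort_by="cpu_time_total", row_limit=10)
print(report)

# Operator names seen in the run, for a quick programmatic check.
op_names = {evt.key for evt in averages}
```

If one operator appears in many rows with different shapes, that is a hint to look for dynamic-shape recompilation.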

For full implementation syntax and additional parameter configurations, consult the profiler documentation.

Begin your optimization workflow by profiling a handful of steady-state training steps with the baseline schedule. This immediately highlights data loading bottlenecks before you invest time optimizing individual CUDA kernels.
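One way to make data loading stand out in a trace is to label the phases of each step with `torch.profiler.record_function`. The sketch below is an assumption-laden toy: `fake_load_batch` simulates a slow loader with `time.sleep`, and the label names `data_loading` and `forward_backward` are arbitrary.

```python
import time

import torch
from torch.profiler import ProfilerActivity, profile, record_function


def fake_load_batch():
    # Hypothetical slow CPU-side loader: 20 ms of simulated I/O per batch.
    time.sleep(0.02)
    return torch.randn(32, 64)


model = torch.nn.Linear(64, 8)  # hypothetical stand-in model

with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(5):
        with record_function("data_loading"):
            batch = fake_load_batch()
        with record_function("forward_backward"):
            model(batch).sum().backward()

# Total time attributed to each labeled region, in microseconds.
totals = {
    evt.key: evt.cpu_time_total
    for evt in prof.key_averages()
    if evt.key in ("data_loading", "forward_backward")
}
print(totals)
```

When the `data_loading` total dominates, the GPU is likely waiting on input, and tuning kernels will not close the gap.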

