How to Fuse PyTorch MLP Kernels for a 30% Inference Speedup
Learn how to analyze PyTorch profiler traces and implement Liger kernel fusion to significantly reduce memory bandwidth bottlenecks in transformer models.
Hugging Face’s new technical guide on profiling and fusing MLPs demonstrates how kernel fusion cuts up to 30% of the wall-clock time for Multilayer Perceptron forward passes. Feed-Forward Networks account for roughly two-thirds of total FLOPs in standard transformer architectures. You can apply these memory optimizations to your own models using standard torch.compile passes or the new expert-tuned Liger kernels from the Hugging Face Hub.
The Memory Bandwidth Bottleneck
Standard PyTorch nn.Linear layers rely on the addmm operation to fold bias addition into the General Matrix Multiplication (GEMM) epilogue. The bias is added as the matrix product is written to memory, so it needs no separate kernel and no extra pass over the output.
The performance bottleneck occurs in the pointwise operations that follow. Executing a GeLU activation and a subsequent multiplication as separate kernels typically requires the GPU to write the intermediate result to High Bandwidth Memory (HBM) and immediately read it back. In memory-bandwidth-bound regimes, this write-then-read cycle limits inference throughput.
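The cost of that round trip can be estimated with simple arithmetic. The sketch below counts only the bytes the pointwise stage moves through HBM, ignoring the GEMMs and any cache reuse. The token count, intermediate width, and 2-byte element size are hypothetical values chosen for illustration, not figures from the guide.

```python
def pointwise_hbm_bytes(tokens: int, intermediate: int, bytes_per_elem: int = 2):
    """Estimate HBM traffic for the GeLU + multiply stage of a gated MLP."""
    tensor_bytes = tokens * intermediate * bytes_per_elem

    # Unfused: the GeLU kernel reads `gate` and writes `act` (2 passes),
    # then the multiply kernel reads `act` and `up` and writes `out` (3 passes).
    unfused = 5 * tensor_bytes

    # Fused: one kernel reads `gate` and `up` and writes `out` (3 passes).
    # The intermediate `act` never leaves registers.
    fused = 3 * tensor_bytes
    return unfused, fused


# Hypothetical shapes: 4096 tokens, intermediate width 11008, 16-bit elements.
unfused, fused = pointwise_hbm_bytes(tokens=4096, intermediate=11008)
saved = 1 - fused / unfused
print(f"unfused: {unfused / 1e6:.1f} MB, fused: {fused / 1e6:.1f} MB, saved: {saved:.0%}")
```

Under these assumptions, fusion removes two of the five tensor-sized passes over HBM in the pointwise stage, regardless of the shapes chosen.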
Analyzing the Traces
You can track exactly which GPU kernels execute your operations using the built-in torch.profiler. PyTorch profiling traces often look like dense blocks of operations, but reading the CUTLASS (CUDA Templates for Linear Algebra Subroutines) kernel names clarifies how the hardware is being used. CUTLASS names encode how the matrix multiplication is tiled and executed across the streaming multiprocessors.
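As a rough illustration of reading those names, the sketch below pulls the tiling fields out of a CUTLASS-style kernel name. The example name is illustrative and not taken from the guide's traces. The field interpretation (threadblock tile M x N, K-tile x pipeline stages, operand layout, alignment) is an assumption based on the naming scheme used by the CUTLASS profiler, and `parse_cutlass_name` is a hypothetical helper.

```python
import re
from typing import NamedTuple, Optional


class GemmTile(NamedTuple):
    tile_m: int     # threadblock tile rows
    tile_n: int     # threadblock tile columns
    tile_k: int     # K-dimension slice processed per main-loop step
    stages: int     # software pipeline stages
    layout: str     # operand layouts, e.g. "tn"
    alignment: int  # operand alignment in elements


# Assumed suffix shape: _<M>x<N>_<K>x<stages>_<layout>_align<A>
_TILE_PATTERN = re.compile(r"_(\d+)x(\d+)_(\d+)x(\d+)_([nt]{2})_align(\d+)")


def parse_cutlass_name(name: str) -> Optional[GemmTile]:
    """Return the tiling fields of a kernel name, or None if it does not match."""
    match = _TILE_PATTERN.search(name)
    if match is None:
        return None
    tile_m, tile_n, tile_k, stages, layout, alignment = match.groups()
    return GemmTile(int(tile_m), int(tile_n), int(tile_k), int(stages), layout, int(alignment))


# Illustrative kernel name of the kind that shows up in GPU traces.
example = "cutlass_80_tensorop_f16_s16816gemm_relu_f16_256x128_32x3_tn_align8"
tile = parse_cutlass_name(example)
print(tile)
```

Names produced by other libraries may not follow this pattern, in which case the helper returns `None` instead of guessing.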
When optimizing the MLP forward pass, your goal is to merge the separate pointwise kernels that follow the GEMM into a single fused kernel.
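To find the separate kernels in your own trace, a minimal profiling sketch might look like the following. The `GegluMLP` module, its layer sizes, and the tanh GeLU approximation are assumptions made for illustration, not the guide's benchmark model. The code falls back to CPU profiling when no GPU is present.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.profiler import ProfilerActivity, profile


class GegluMLP(nn.Module):
    """Eager gated MLP: down(gelu(gate(x)) * up(x)). Sizes are hypothetical."""

    def __init__(self, hidden: int = 256, intermediate: int = 1024) -> None:
        super().__init__()
        self.gate_proj = nn.Linear(hidden, intermediate)
        self.up_proj = nn.Linear(hidden, intermediate)
        self.down_proj = nn.Linear(intermediate, hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = F.gelu(self.gate_proj(x), approximate="tanh")
        return self.down_proj(gate * self.up_proj(x))


device = "cuda" if torch.cuda.is_available() else "cpu"
activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

mlp = GegluMLP().to(device).eval()
x = torch.randn(1024, 256, device=device)

with torch.no_grad():
    mlp(x)  # warm-up so one-time setup cost stays out of the trace
    with profile(activities=activities) as prof:
        out = mlp(x)
        if device == "cuda":
            torch.cuda.synchronize()

# In eager mode the activation and the multiply appear as separate ops.
op_names = {event.key for event in prof.key_averages()}
sort_key = "cuda_time_total" if device == "cuda" else "cpu_time_total"
print(prof.key_averages().table(sort_by=sort_key, row_limit=10))
```

In the eager trace, `aten::gelu` and `aten::mul` show up as their own entries after the linear layers. Those are the candidates for fusion.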
Fusing Kernels: Inductor vs Liger
You have two primary paths for fusing these pointwise operations to eliminate intermediate HBM traffic.
The first is using torch.compile. The Inductor backend automatically identifies the GeLU activation and multiplication steps and compiles them into a single optimized Triton kernel. This approach achieves the fastest absolute execution time but carries a strict recompilation penalty if your batch size or sequence length changes during execution.
The second path uses LigerGEGLUMLP layers, available via the Hugging Face Hub through the kernels library. These hand-written Triton kernels provide the memory benefits of fusion without the compile-time latency of Inductor.
Performance Comparison
The benchmark results for an MLP forward pass on an NVIDIA A100-SXM4-80GB GPU show the precise tradeoffs between the two approaches.
| Implementation | Execution Time | Dynamic Shape Support |
|---|---|---|
| Standard nn.Linear | Baseline | Yes |
| torch.compile (Inductor) | 89.4 µs | No (Triggers recompilation) |
| LigerGEGLUMLP | 92.8 µs | Yes (Zero compile penalty) |
Implementation Strategy
For static workloads where sequence lengths and batch sizes are guaranteed to remain constant, wrap your MLP modules in torch.compile. The Inductor compiler will generate the fastest possible Triton kernel for your specific hardware.
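A minimal sketch of the static path is shown below. The `GegluMLP` module and `compile_static` helper are hypothetical names used for illustration. Passing `dynamic=False` asks `torch.compile` to specialize on the input shapes it sees, which matches the fixed-shape assumption above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GegluMLP(nn.Module):
    """Gated MLP used for illustration. Sizes are hypothetical."""

    def __init__(self, hidden: int = 256, intermediate: int = 1024) -> None:
        super().__init__()
        self.gate_proj = nn.Linear(hidden, intermediate)
        self.up_proj = nn.Linear(hidden, intermediate)
        self.down_proj = nn.Linear(intermediate, hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = F.gelu(self.gate_proj(x), approximate="tanh")
        return self.down_proj(gate * self.up_proj(x))


def compile_static(module: nn.Module) -> nn.Module:
    # dynamic=False requests shape-specialized graphs: a new batch size or
    # sequence length will not reuse the existing compiled code.
    return torch.compile(module, dynamic=False)


mlp = GegluMLP().eval()
compiled_mlp = compile_static(mlp)  # lazy: nothing is compiled until the first call
```

The first call to `compiled_mlp(x)` triggers compilation, which needs Triton on a GPU or a working C++ toolchain on CPU, so run a warm-up pass with your production shapes before serving traffic.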
If your application processes dynamic inputs like streaming text generation or varied prompt sizes, replace your standard nn.Linear MLP blocks with LigerGEGLUMLP from the Hugging Face kernels library. You will sacrifice 3.4 µs per forward pass compared to the static compiled version, but you will completely bypass the recompilation stalls that otherwise degrade user-facing latency.
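The tradeoff can be put into numbers with a quick break-even estimate. The two timings come from the benchmark table. The 10-second recompilation stall is a hypothetical figure chosen only to show the arithmetic, so measure your own compile times before relying on it.

```python
INDUCTOR_US = 89.4  # torch.compile forward pass, from the benchmark table
LIGER_US = 92.8     # LigerGEGLUMLP forward pass, from the benchmark table


def break_even_passes(recompile_seconds: float) -> float:
    """Forward passes needed before Inductor's per-pass gain repays one recompile."""
    per_pass_gain_s = (LIGER_US - INDUCTOR_US) * 1e-6
    return recompile_seconds / per_pass_gain_s


# Hypothetical: assume a single recompilation stalls the request for 10 seconds.
passes = break_even_passes(10.0)
print(f"{passes:,.0f} forward passes to amortize one recompilation")
```

With a stall on that order, each shape change would need millions of forward passes at the new shape to pay for itself, which is why the Liger path suits workloads whose shapes change often.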