
How to Fuse PyTorch MLP Kernels for a 30% Inference Speedup

Learn how to analyze PyTorch profiler traces and implement Liger kernel fusion to significantly reduce memory bandwidth bottlenecks in transformer models.

Hugging Face’s new technical guide on profiling and fusing MLPs demonstrates how kernel fusion cuts up to 30% of the wall-clock time for Multilayer Perceptron forward passes. Feed-Forward Networks account for roughly two-thirds of total FLOPs in standard transformer architectures. You can apply these memory optimizations to your own models using standard torch.compile passes or the new expert-tuned Liger kernels from the Hugging Face Hub.

The Memory Bandwidth Bottleneck

Standard PyTorch nn.Linear layers rely on the addmm operation to fold bias addition into the General Matrix Multiplication (GEMM) epilogue. The bias is added as the matrix product is written out, so it needs neither a separate kernel launch nor an extra pass over memory.
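As a small illustration of that equivalence, the sketch below compares nn.Linear with an explicit torch.addmm call and with a separate matmul followed by an add. The tensor sizes are arbitrary values chosen for the example, and the check runs on CPU, so it demonstrates only numerical equivalence, not the GPU epilogue behavior itself.

```python
import torch

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration.
batch, in_features, out_features = 4, 8, 16
x = torch.randn(batch, in_features)
linear = torch.nn.Linear(in_features, out_features)

# nn.Linear computes x @ W^T + b.
y_linear = linear(x)

# torch.addmm(input, mat1, mat2) computes input + mat1 @ mat2,
# so the bias add and the matrix product are expressed as a single op.
y_addmm = torch.addmm(linear.bias, x, linear.weight.t())

# The unfused formulation: a matmul followed by a separate add.
y_unfused = x @ linear.weight.t() + linear.bias

print(torch.allclose(y_linear, y_addmm, atol=1e-5))
print(torch.allclose(y_linear, y_unfused, atol=1e-5))
```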

The performance bottleneck occurs in the pointwise operations that follow. Executing a GeLU activation and a subsequent multiplication as separate kernels requires the GPU to write the intermediate result to High Bandwidth Memory (HBM) and immediately read it back. These operations do very little arithmetic per byte moved, so they are memory-bandwidth-bound, and the write-then-read cycle directly limits inference throughput.
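A back-of-the-envelope estimate makes the cost of that round trip concrete. The sketch below counts only the bytes moved by the two pointwise operations and ignores the GEMMs and any caching effects. The token count, intermediate size, and 2-byte element width are assumptions picked for illustration, not figures from the guide.

```python
def pointwise_traffic_bytes(num_tokens: int, intermediate_size: int,
                            bytes_per_element: int = 2) -> tuple[int, int]:
    """Estimate HBM bytes moved by GELU followed by a multiply, unfused vs fused."""
    elements = num_tokens * intermediate_size
    # Unfused: the GELU kernel reads 1 tensor and writes 1,
    # then the multiply kernel reads 2 tensors and writes 1.
    unfused_passes = (1 + 1) + (2 + 1)
    # Fused: a single kernel reads the 2 inputs and writes the 1 output.
    fused_passes = 2 + 1
    return (unfused_passes * elements * bytes_per_element,
            fused_passes * elements * bytes_per_element)


# Hypothetical workload: 4096 tokens, intermediate size 16384, 16-bit elements.
unfused, fused = pointwise_traffic_bytes(4096, 16384)
print(f"unfused: {unfused / 1e6:.0f} MB, fused: {fused / 1e6:.0f} MB")
print(f"pointwise traffic removed by fusion: {1 - fused / unfused:.0%}")
```

Under this simple model, fusion removes two of the five tensor-sized passes over memory, whatever the tensor shape.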

Analyzing the Traces

You can see exactly which GPU kernels execute your operations using the built-in torch.profiler. Profiler traces often look like dense, undifferentiated blocks of operations, but reading the CUTLASS (CUDA Template Library for Linear Algebra Subroutines) kernel names clarifies what the hardware is doing. These names encode how the matrix multiplication is tiled and executed across the streaming multiprocessors.
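A minimal profiling sketch follows. The small nn.Sequential model, its sizes, and the "mlp_forward" label are placeholders invented for this example. The code falls back to CPU-only profiling when no GPU is present, in which case no CUTLASS kernel names will appear in the output.

```python
import torch
from torch.profiler import ProfilerActivity, profile, record_function

# Placeholder MLP; the sizes are arbitrary and not taken from the guide.
model = torch.nn.Sequential(
    torch.nn.Linear(64, 256),
    torch.nn.GELU(),
    torch.nn.Linear(256, 64),
)
x = torch.randn(32, 64)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    # Also record GPU kernels when a device is available.
    activities.append(ProfilerActivity.CUDA)
    model, x = model.cuda(), x.cuda()

with torch.no_grad():
    model(x)  # warm-up so one-time setup cost is not profiled
    with profile(activities=activities) as prof:
        with record_function("mlp_forward"):
            model(x)

# Aggregate events by name and print the most expensive ones.
op_names = [event.key for event in prof.key_averages()]
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```

Calling prof.export_chrome_trace with a file path writes a timeline that a trace viewer can open, which is usually easier to read than the summary table when you are looking for gaps between kernels.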

When optimizing your model's forward pass, your goal is to merge the separate pointwise kernels that appear after each GEMM into a single fused kernel.

Fusing Kernels: Inductor vs Liger

You have two primary paths for fusing these pointwise operations to eliminate intermediate HBM traffic.

The first is using torch.compile. The Inductor backend automatically identifies the GeLU activation and multiplication steps and compiles them into a single optimized Triton kernel. This approach achieves the fastest absolute execution time but carries a strict recompilation penalty if your batch size or sequence length changes during execution.
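The sketch below shows one way to hand the two pointwise operations to torch.compile. The function name gelu_mul, the demo helper, and the tensor sizes are hypothetical, and dynamic=False is an assumption chosen to match the static-shape scenario. Whether the compiler emits a single fused kernel depends on your PyTorch version and hardware, so inspect a profiler trace to confirm.

```python
import torch
import torch.nn.functional as F


def gelu_mul(gate: torch.Tensor, up: torch.Tensor) -> torch.Tensor:
    """Eager reference: two pointwise ops, GELU then an elementwise multiply."""
    return F.gelu(gate) * up


# dynamic=False asks the compiler to specialize on the input shapes it sees.
# Compilation is lazy: nothing is compiled until the first call.
compiled_gelu_mul = torch.compile(gelu_mul, dynamic=False)


def demo(device: str = "cuda" if torch.cuda.is_available() else "cpu") -> float:
    """Run eager and compiled versions once and return their max abs difference."""
    gate = torch.randn(128, 1024, device=device)
    up = torch.randn(128, 1024, device=device)
    expected = gelu_mul(gate, up)
    actual = compiled_gelu_mul(gate, up)  # first call triggers compilation
    return (expected - actual).abs().max().item()
```

Calling demo() should return a value close to zero. Calling compiled_gelu_mul afterwards with differently shaped inputs is the situation in which a recompilation can occur.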

The second path uses LigerGEGLUMLP layers, available via the Hugging Face Hub through the kernels library. These hand-written Triton kernels provide the memory benefits of fusion without the compile-time latency of Inductor.
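For orientation, the computation a GEGLU MLP performs can be written in plain, unfused PyTorch. This is a reference sketch only: it does not use Liger or the kernels library. The class name ReferenceGEGLUMLP, the gate_proj/up_proj/down_proj layer names, the bias-free projections, and the tanh GELU approximation are all assumptions modeled on common Hugging Face conventions, not details confirmed by the guide.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReferenceGEGLUMLP(nn.Module):
    """Unfused GEGLU MLP: down_proj(gelu(gate_proj(x)) * up_proj(x))."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # These two pointwise steps (GELU, then multiply) are the ones
        # a fused kernel would merge into a single pass over memory.
        gate = F.gelu(self.gate_proj(x), approximate="tanh")
        return self.down_proj(gate * self.up_proj(x))


mlp = ReferenceGEGLUMLP(hidden_size=64, intermediate_size=256)

# Eager modules accept varying sequence lengths without any recompilation.
short = mlp(torch.randn(2, 10, 64))
longer = mlp(torch.randn(2, 17, 64))
print(short.shape, longer.shape)
```

A reference like this is also handy as a correctness baseline when swapping in a fused layer: feed both the same weights and inputs and compare the outputs within a tolerance.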

Performance Comparison

The benchmark results for an MLP forward pass on an NVIDIA A100-SXM4-80GB GPU show the precise tradeoffs between the two approaches.

| Implementation | Execution Time | Dynamic Shape Support |
| --- | --- | --- |
| Standard nn.Linear | Baseline | Yes |
| torch.compile (Inductor) | 89.4 µs | No (triggers recompilation) |
| LigerGEGLUMLP | 92.8 µs | Yes (zero compile penalty) |

Implementation Strategy

For static workloads where sequence lengths and batch sizes are guaranteed to remain constant, wrap your MLP modules in torch.compile. The Inductor compiler will generate a Triton kernel specialized for those shapes and your hardware, which was the fastest option in the benchmark above.

If your application processes dynamic inputs like streaming text generation or varied prompt sizes, replace your standard nn.Linear MLP blocks with LigerGEGLUMLP from the Hugging Face kernels library. You will sacrifice 3.4 µs per forward pass compared to the static compiled version, but you will completely bypass the recompilation stalls that otherwise degrade user-facing latency.
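The tradeoff can be put into numbers with a little arithmetic. The two timings below come from the benchmark table. The recompilation cost is a purely hypothetical placeholder, since the guide does not state one, so measure your own before drawing conclusions.

```python
import math

# Timings from the benchmark table (microseconds per MLP forward pass).
compiled_us = 89.4
liger_us = 92.8

gap_us = liger_us - compiled_us
relative_overhead = gap_us / compiled_us


def break_even_calls(recompile_cost_us: float, per_call_gap_us: float) -> int:
    """Forward passes needed before one recompilation stall is paid back."""
    return math.ceil(recompile_cost_us / per_call_gap_us)


# Hypothetical assumption: a single recompilation stalls for 10 seconds.
hypothetical_recompile_us = 10.0 * 1e6

print(f"per-call gap: {gap_us:.1f} us ({relative_overhead:.1%} slower)")
print("forward passes to amortize one recompile:",
      break_even_calls(hypothetical_recompile_us, gap_us))
```

If input shapes change more often than the break-even count, the per-call saving from the compiled kernel never pays back the stall.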
