How to Fuse PyTorch MLP Kernels for a 30% Inference Speedup

Hugging Face’s new technical guide on profiling and fusing MLPs demonstrates how kernel fusion cuts up to 30% of the wall-clock time from Multilayer Perceptron (MLP) forward passes. These blocks, also called feed-forward networks, account for roughly two-thirds of total FLOPs in standard transformer architectures, so the saving matters at the model level. You can apply the same fusions to your own models using torch.compile or the expert-tuned Liger kernels from the Hugging Face Hub.

The Memory Bandwidth Bottleneck

Standard PyTorch nn.Linear layers rely on the addmm operation to fold bias addition into the General Matrix Multiplication (GEMM) epilogue. The addition happens seamlessly as the matrix product is written to memory.
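To make the bias folding concrete, the following minimal sketch checks that an nn.Linear forward pass matches a single torch.addmm call. The tensor sizes are illustrative assumptions, not values from the guide, and the comparison runs on CPU for simplicity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions, not taken from the benchmark).
batch, d_in, d_out = 4, 8, 16

linear = nn.Linear(d_in, d_out)
x = torch.randn(batch, d_in)

# nn.Linear computes x @ W^T + b.
y_linear = linear(x)

# torch.addmm(input, mat1, mat2) computes input + mat1 @ mat2 in one call,
# so the bias add rides along with the matrix multiplication.
y_addmm = torch.addmm(linear.bias, x, linear.weight.t())

max_abs_diff = (y_linear - y_addmm).abs().max().item()
print(f"max abs difference: {max_abs_diff:.2e}")
```

The two results should agree to within floating-point tolerance, which is why the bias add is not where the extra memory traffic comes from.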

The performance bottleneck occurs in the pointwise operations that follow. In eager mode, a GeLU activation and the subsequent elementwise multiplication each run as a separate kernel, so the GPU writes the intermediate result to High Bandwidth Memory (HBM) and immediately reads it back. Because these operations do very little arithmetic per byte, they are memory-bandwidth-bound, and this write-then-read cycle limits inference throughput.
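A back-of-the-envelope estimate shows how much traffic fusion can remove. The sketch below counts only the tensors that the GeLU and the multiply read and write, assuming each unfused operation is its own kernel. The helper name, the shape, and the bf16 data type are hypothetical choices for illustration, and caches and launch overhead are ignored.

```python
def pointwise_hbm_bytes(num_elements: int, bytes_per_element: int, fused: bool) -> int:
    """Estimate HBM bytes moved by the GeLU and multiply that follow the GEMMs.

    Counts only tensor reads and writes; caches and launch overhead are ignored.
    """
    if fused:
        # One kernel: read gate, read up, write output.
        tensors_moved = 3
    else:
        # GeLU kernel: read gate, write activated gate.
        # Multiply kernel: read activated gate, read up, write output.
        tensors_moved = 5
    return tensors_moved * num_elements * bytes_per_element


# Hypothetical shape: 4096 tokens, intermediate width 11008, bf16 (2 bytes).
tokens, intermediate, bf16_bytes = 4096, 11008, 2
n = tokens * intermediate

unfused = pointwise_hbm_bytes(n, bf16_bytes, fused=False)
fused = pointwise_hbm_bytes(n, bf16_bytes, fused=True)
saved_fraction = 1 - fused / unfused
print(f"unfused: {unfused / 1e6:.1f} MB, fused: {fused / 1e6:.1f} MB, saved: {saved_fraction:.0%}")
```

Under this simple model, fusion moves three tensors instead of five, removing 40% of the pointwise traffic regardless of shape. Real savings depend on the hardware and on how much of the forward pass the pointwise operations represent.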

Analyzing the Traces

You can see which kernels execute each operation, and how long each one takes, using the built-in torch.profiler. PyTorch profiling traces often present dense blocks of operations, but the CUTLASS (CUDA Templates for Linear Algebra Subroutines) kernel naming conventions help you read them. CUTLASS kernel names encode how the matrix multiplication is tiled and executed across the streaming multiprocessors.

When optimizing the MLP forward pass, your goal is to merge the separate pointwise kernels that follow the GEMM into a single fused kernel.
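As a starting point, here is a minimal profiling sketch. The small nn.Sequential model and its sizes are stand-ins chosen for illustration, not the model from the guide. The code falls back to CPU when no GPU is present, in which case no GPU kernel names will appear.

```python
import torch
import torch.nn as nn
from torch.profiler import ProfilerActivity, profile

device = "cuda" if torch.cuda.is_available() else "cpu"

# A small stand-in model; sizes are illustrative assumptions.
model = nn.Sequential(nn.Linear(256, 1024), nn.GELU(), nn.Linear(1024, 256)).to(device)
x = torch.randn(32, 256, device=device)

activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with torch.no_grad():
    model(x)  # warm-up so one-time initialization stays out of the trace
    with profile(activities=activities) as prof:
        model(x)
        if device == "cuda":
            torch.cuda.synchronize()

# Sorted by CPU time here; on a GPU you would typically sort by device time,
# whose key name differs between PyTorch versions.
summary = prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
print(summary)
```

On a CUDA device the table should also list the GPU kernels launched for each operator, which is where the GEMM kernel names and the separate pointwise kernels show up.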

Fusing Kernels: Inductor vs Liger

You have two primary paths for fusing these pointwise operations to eliminate intermediate HBM traffic.

The first is torch.compile. The Inductor backend automatically identifies the GeLU activation and multiplication steps and compiles them into a single Triton kernel. This approach achieved the fastest execution time in the benchmark below, but it incurs a recompilation penalty when your batch size or sequence length changes during execution.
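The torch.compile path can be sketched as follows. GeGLUMLP is a hypothetical module written for this example (it is not the guide's code), its layer names and sizes are assumptions, and the tanh GeLU approximation is an illustrative choice. Running the compiled module requires a working Inductor toolchain, so the comparison is kept inside a function.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GeGLUMLP(nn.Module):
    """Hypothetical gated MLP: down(gelu(gate(x)) * up(x))."""

    def __init__(self, hidden: int, intermediate: int) -> None:
        super().__init__()
        self.gate_proj = nn.Linear(hidden, intermediate, bias=False)
        self.up_proj = nn.Linear(hidden, intermediate, bias=False)
        self.down_proj = nn.Linear(intermediate, hidden, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # In eager mode the GeLU and the multiply run as separate kernels.
        gated = F.gelu(self.gate_proj(x), approximate="tanh") * self.up_proj(x)
        return self.down_proj(gated)


mlp = GeGLUMLP(hidden=64, intermediate=256).eval()

# torch.compile is lazy: compilation happens on the first call, and a new
# input shape may trigger another compilation.
compiled_mlp = torch.compile(mlp)


def max_difference(batch: int = 8) -> float:
    """Compare eager and compiled outputs on random input."""
    x = torch.randn(batch, 64)
    with torch.no_grad():
        return (mlp(x) - compiled_mlp(x)).abs().max().item()
```

Calling max_difference() triggers the first compilation. The returned value should be close to zero, since fusion changes how the work is scheduled rather than what is computed.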

The second path uses LigerGEGLUMLP layers, available via the Hugging Face Hub through the kernels library. These hand-written Triton kernels provide the memory benefits of fusion without the compile-time latency of Inductor.

Performance Comparison

The benchmark results for an MLP forward pass on an NVIDIA A100-SXM4-80GB GPU show the precise tradeoffs between the two approaches.

Implementation	Execution Time	Dynamic Shape Support
Standard `nn.Linear`	Baseline	Yes
`torch.compile` (Inductor)	89.4 µs	No (Triggers recompilation)
`LigerGEGLUMLP`	92.8 µs	Yes (Zero compile penalty)

Implementation Strategy

For static workloads where sequence lengths and batch sizes are guaranteed to remain constant, wrap your MLP modules in torch.compile. Inductor generates a Triton kernel specialized for your shapes and hardware, and it was the fastest option in the benchmark above.

If your application processes dynamic inputs such as streaming text generation or varied prompt sizes, replace your standard nn.Linear MLP blocks with LigerGEGLUMLP from the Hugging Face kernels library. You give up 3.4 µs per forward pass relative to the compiled version (92.8 µs versus 89.4 µs), but you avoid the recompilation stalls that otherwise degrade user-facing latency.
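One way to weigh the two options is to ask how many forward passes the compiled kernel needs before its per-call saving repays a single recompilation. The sketch below uses the two timings from the table. The 10-second recompilation cost is a hypothetical placeholder, so measure your own before relying on the result.

```python
COMPILED_US = 89.4  # torch.compile (Inductor), from the table
LIGER_US = 92.8     # LigerGEGLUMLP, from the table

gap_us = LIGER_US - COMPILED_US
relative_gap = gap_us / COMPILED_US


def break_even_calls(recompile_seconds: float) -> float:
    """Forward passes needed for the per-call saving to repay one recompilation."""
    return recompile_seconds * 1e6 / gap_us


# Hypothetical recompilation cost; measure your own.
assumed_recompile_s = 10.0
print(f"gap: {gap_us:.1f} µs ({relative_gap:.1%} slower than compiled)")
print(f"break-even: {break_even_calls(assumed_recompile_s):,.0f} forward passes per recompile")
```

With these assumptions the gap is about 3.8% and the break-even point is roughly 2.9 million forward passes per recompilation, which is why frequent shape changes favor the Liger path.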
