Transformers.js v4, Now Inside Chrome Extensions
Hugging Face has published an integration guide for running Transformers.js v4 and the 500 MB Gemma 4 E2B model locally inside Manifest V3 Chrome extensions.
On April 23, 2026, Hugging Face published a technical guide on integrating Transformers.js v4 within Manifest V3 Chrome Extensions. The release includes a live demo of a Gemma 4 E2B browser assistant that handles web navigation and data processing entirely on-device. For developers building browser-based tools, this architecture shifts inference workloads directly to the client.
WebGPU Performance Gains
Version 4 of Transformers.js introduces a complete C++ rewrite of the WebGPU runtime. The updated engine supports over 200 architectures and nearly 3,000 models from the Hugging Face Hub. Fused kernels drop build times from 2 seconds to 200 milliseconds. Hardware acceleration yields up to 100x faster performance than the earlier WebAssembly backends, which lets models such as the 1.2-billion-parameter LFM 2.5 reach high token speeds natively in the browser.
Manifest V3 Architecture
Chrome’s Manifest V3 enforces a strict Content Security Policy that forbids remotely hosted code and dynamic code evaluation. Hugging Face outlines a multi-process architecture that runs AI inference while staying compliant. A background service worker initializes the engine and hosts the model, so the model is not unloaded when the UI changes state.
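One way to keep a single model instance in the background worker is to cache the load promise, so concurrent requests share one initialization. The following is a minimal sketch under that assumption, not the guide's implementation: `createModelHolder`, `loadModel`, and the stub loader are hypothetical names, and a real extension would call into Transformers.js where the stub sits.

```typescript
// Hypothetical sketch: cache the in-flight load so the model is created once
// per service-worker lifetime, however many requests arrive.
type Loader<T> = () => Promise<T>;

function createModelHolder<T>(loadModel: Loader<T>): () => Promise<T> {
  let pending: Promise<T> | null = null;
  return () => {
    if (pending === null) {
      // Clear the cache on failure so a later request can retry the load.
      pending = loadModel().catch((err) => {
        pending = null;
        throw err;
      });
    }
    return pending;
  };
}

// Stand-in loader (assumption): counts how often a load is actually started.
let loadCount = 0;
const getModel = createModelHolder(async () => {
  loadCount += 1;
  return { name: "stub-model" };
});

// Two back-to-back requests share the same promise and trigger one load.
const first = getModel();
const second = getModel();
```

The cache lives only as long as the worker itself, so a restarted worker would go through the loader again.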
A dedicated side panel HTML interface communicates with this worker via message passing to keep the main thread responsive. The extension relies on chrome.runtime.sendMessage to route prompt requests from the user interface to the background process hosting the ONNX runtime.
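The request flow described here could look roughly like the sketch below. The message shapes (`PromptRequest`, `PromptResponse`), the helper names, and the echo generator are assumptions made for illustration; only `chrome.runtime.sendMessage` and `chrome.runtime.onMessage` are real extension APIs.

```typescript
// Looked up on globalThis so the sketch also loads outside an extension.
const chromeApi = (globalThis as any).chrome;

// Hypothetical message shapes; the guide's actual protocol may differ.
type PromptRequest = { type: "generate"; prompt: string };
type PromptResponse = { ok: true; text: string } | { ok: false; error: string };

// Narrow an unknown message before touching its fields.
function isPromptRequest(msg: unknown): msg is PromptRequest {
  if (typeof msg !== "object" || msg === null) return false;
  const m = msg as Record<string, unknown>;
  return m.type === "generate" && typeof m.prompt === "string";
}

// Background side: run the injected generator and reply through sendResponse.
function handleMessage(
  msg: unknown,
  generate: (prompt: string) => Promise<string>,
  sendResponse: (res: PromptResponse) => void,
): boolean {
  if (!isPromptRequest(msg)) return false; // not ours; leave it to other listeners
  generate(msg.prompt)
    .then((text) => sendResponse({ ok: true, text }))
    .catch((e) => sendResponse({ ok: false, error: String(e) }));
  return true; // tells Chrome the response will arrive asynchronously
}

// Side panel side: sendMessage returns a promise when no callback is passed.
async function askBackground(prompt: string): Promise<PromptResponse> {
  const req: PromptRequest = { type: "generate", prompt };
  return chromeApi.runtime.sendMessage(req);
}

// Wiring, active only inside an extension context.
if (chromeApi?.runtime?.onMessage) {
  const echo = async (prompt: string) => `echo: ${prompt}`; // placeholder for the model call
  chromeApi.runtime.onMessage.addListener(
    (msg: unknown, _sender: unknown, sendResponse: (res: PromptResponse) => void) =>
      handleMessage(msg, echo, sendResponse),
  );
}
```

Passing the generator in as a parameter keeps the routing logic testable without a browser; the side panel would call `askBackground("...")` and render the returned text or error.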
Asset Bundling and Caching
Where the service worker has no direct WebGPU access, the extension falls back to an offscreen document. Developers must bundle the ONNX and WASM helper files locally inside the extension’s distribution folder to satisfy CSP requirements.
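The fallback decision and the local asset lookup can be sketched as two small functions. Everything named here is illustrative: the `Capabilities` fields, the `assets/onnx/` folder, and the CPU-only last resort are assumptions rather than details from the guide. In a real extension, `getURL` would be `chrome.runtime.getURL`.

```typescript
// Hypothetical capability snapshot; field names are illustrative.
type Capabilities = {
  workerHasWebGPU: boolean; // e.g. "gpu" in navigator, checked inside the service worker
  offscreenAvailable: boolean; // e.g. chrome.offscreen is defined
};

type InferenceHost = "service-worker" | "offscreen-document" | "service-worker-wasm";

// Prefer the worker, fall back to an offscreen document, then to CPU-only WASM.
// The final WASM branch is an assumption added for completeness.
function chooseInferenceHost(caps: Capabilities): InferenceHost {
  if (caps.workerHasWebGPU) return "service-worker";
  if (caps.offscreenAvailable) return "offscreen-document";
  return "service-worker-wasm";
}

// Resolve bundled runtime files relative to the extension root so nothing
// is fetched from a remote host. "assets/onnx/" is a hypothetical folder.
function localAssetUrl(getURL: (path: string) => string, file: string): string {
  return getURL(`assets/onnx/${file}`);
}
```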
The recommended Gemma 4 E2B model requires approximately 500 MB of storage. It downloads once and persists in the browser’s IndexedDB cache. Support for 4-bit and 8-bit quantization reduces VRAM pressure on integrated GPUs. The standalone tokenizer library has also been optimized down to an 8.3 KB zero-dependency package, minimizing the initial extension install size.
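To see why quantization eases memory pressure, a rough estimate of weight size is parameter count times bits per weight, divided by 8. The sketch below applies that to the 1.2-billion-parameter figure quoted earlier for LFM 2.5. It is a back-of-the-envelope approximation that ignores quantization metadata, activations, and the KV cache, and `approxWeightBytes` is a hypothetical helper.

```typescript
// Rough weight footprint: parameters * bits per weight / 8 bits per byte.
// Ignores scales, zero points, activations, and KV cache (assumption).
function approxWeightBytes(paramCount: number, bitsPerWeight: number): number {
  return (paramCount * bitsPerWeight) / 8;
}

const params = 1.2e9; // the 1.2B-parameter model size mentioned in the article

const fp16Bytes = approxWeightBytes(params, 16); // 2,400,000,000 bytes (~2.4 GB)
const int8Bytes = approxWeightBytes(params, 8); //  1,200,000,000 bytes (~1.2 GB)
const q4Bytes = approxWeightBytes(params, 4); //      600,000,000 bytes (~0.6 GB)
```

Going from 16-bit to 4-bit weights cuts the estimate to a quarter, which is the kind of margin that matters on integrated GPUs sharing system memory.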
Privacy and Deployment Risks
Running models entirely on the client side ensures sensitive user queries never leave the device. This local execution model provides an alternative to API-dependent multi-agent systems that require constant cloud connectivity.
Storing model weights in the client-side cache introduces extraction risks. Anyone with access to the local machine can theoretically extract the cached ONNX files.
Community testing confirms these local models function on iOS and Android mobile browsers when experimental WebGPU flags are enabled, broadening the potential deployment surface beyond desktop environments.
If you build browser extensions, evaluate your inference requirements against client-side hardware capabilities. Offloading natural language tasks to a local WebGPU implementation eliminates server costs, but requires careful management of initial model download sizes and IndexedDB storage limits.
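A pre-download storage check is one way to manage those limits. The sketch below assumes the browser exposes `navigator.storage.estimate()`, which reports usage and quota for the origin's storage (IndexedDB included). The 10% headroom, the `hasRoomFor` helper, and the `MODEL_BYTES` constant are illustrative choices, with the size taken from the roughly 500 MB figure above.

```typescript
// Shape returned by navigator.storage.estimate(); both fields may be absent.
type StorageSnapshot = { usage?: number; quota?: number };

// True when free space covers the download plus a safety margin.
function hasRoomFor(snapshot: StorageSnapshot, bytesNeeded: number, headroom = 0.1): boolean {
  if (snapshot.quota === undefined || snapshot.usage === undefined) {
    return false; // unknown quota: be conservative
  }
  const free = snapshot.quota - snapshot.usage;
  return free >= bytesNeeded * (1 + headroom);
}

const MODEL_BYTES = 500 * 1024 * 1024; // ~500 MB, per the article

// Browser wiring; resolves to false where the Storage API is unavailable.
async function canDownloadModel(): Promise<boolean> {
  const nav = (globalThis as any).navigator;
  if (!nav?.storage?.estimate) return false;
  const snapshot: StorageSnapshot = await nav.storage.estimate();
  return hasRoomFor(snapshot, MODEL_BYTES);
}
```

An extension could call `canDownloadModel()` before starting the first download and show a clear message instead of failing partway through.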