MolmoWeb Outpaces GPT-4o in Visual Web Navigation Tasks
Ai2 releases MolmoWeb, an open-source browser agent that navigates from screenshots and outperforms proprietary models on major web navigation benchmarks.
The Allen Institute for AI has released MolmoWeb, an open-source visual web agent that navigates interfaces using pure pixel input rather than the underlying HTML. Built on the Molmo 2 multimodal family, the model is available under an Apache 2.0 license in 4B and 8B parameter variants. For teams building web automation systems, this release shifts reliance away from brittle DOM parsing and toward vision-based action prediction.
Visual Navigation Architecture
MolmoWeb operates in a continuous observation-action loop. It takes a task instruction alongside a live browser screenshot, then predicts the next interface action, such as clicking, typing, scrolling, or switching tabs. This visual approach mirrors human navigation and bypasses the complexity of parsing modern, dynamically rendered web applications. If you evaluate and test AI agents, expect a vision-first approach to need different tooling than standard text-based DOM logging.
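The loop described above can be sketched in a few lines. This is a minimal illustration, not MolmoWeb's actual client: the `Action` type and the `take_screenshot`, `predict_action`, and `execute` callables are hypothetical stand-ins for a browser driver and a model server.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Action:
    # Hypothetical action schema; the real model's output format may differ.
    kind: str                    # "click", "type", "scroll", "switch_tab", or "done"
    x: Optional[int] = None      # pixel coordinates for clicks
    y: Optional[int] = None
    text: Optional[str] = None   # text payload for typing


def run_agent(
    task: str,
    take_screenshot: Callable[[], bytes],
    predict_action: Callable[[str, bytes, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 20,
) -> List[Action]:
    """Observe, predict, act, and repeat until the model signals completion."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = take_screenshot()                        # observe pixels only
        action = predict_action(task, screenshot, history)    # model call
        history.append(action)
        if action.kind == "done":
            break
        execute(action)                                       # act in the browser
    return history
```

Passing the callables in as parameters keeps the loop testable without a browser, and the `max_steps` cap guards against an agent that never emits a terminal action.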
Benchmark Results
The 8B parameter flagship variant establishes a new performance standard for open-weight models. On the WebVoyager benchmark, MolmoWeb-8B achieved 78.2% accuracy on the first attempt. With pass@4 test-time scaling, where a task counts as solved if any of four parallel attempts succeeds, that score rose to 94.7%. The model scored 35.3% on Online-Mind2Web, 42.3% on DeepShop, and 49.5% on WebTailBench.
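To make the pass@4 figure concrete, here is a small sketch of how pass@k is typically computed from per-task attempt outcomes, alongside the value you would expect if every attempt succeeded independently with the same probability. The function names are illustrative and not taken from Ai2's evaluation code.

```python
from typing import List


def pass_at_k(attempts: List[List[bool]], k: int) -> float:
    """Fraction of tasks where at least one of the first k attempts succeeded."""
    solved = sum(1 for task in attempts if any(task[:k]))
    return solved / len(attempts)


def independent_pass_at_k(p: float, k: int) -> float:
    """pass@k if each attempt independently succeeded with probability p."""
    return 1 - (1 - p) ** k


# With a 78.2% single-attempt rate, fully independent attempts would give:
ceiling = independent_pass_at_k(0.782, 4)   # 1 - 0.218**4
```

Under that independence assumption the result is about 99.8%, noticeably above the reported 94.7%. One plausible reading is that failures cluster on a subset of harder tasks that fail on every attempt, so extra attempts help less than the uniform model predicts.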
These results push MolmoWeb-8B past OpenAI’s Computer-Using Agent on three of the four benchmarks. The model also exceeds the performance of GPT-4o-based SoM Agents on these standard evaluation sets.
Training Data and MolmoWebMix
Ai2 trained the models using supervised fine-tuning on 64 H100 GPUs. The system does not use reinforcement learning or distillation from proprietary models. The training relies entirely on MolmoWebMix, a newly released dataset containing 30,000 human task trajectories across over 1,100 websites. The dataset includes 590,000 individual subtask demonstrations and 2.2 million screenshot-question-answer pairs. Open access to this data provides a strong foundation if you want to build custom agents for domain-specific automation workflows.
Deployment Footprint and Current Limitations
The 4B variant prioritizes efficiency for local execution. Running with 4-bit quantization, the 4B model fits within the memory constraints of a free-tier GPU.
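A rough back-of-the-envelope estimate shows why 4-bit quantization matters here. The sketch below counts weight storage only and treats "4B" as exactly four billion parameters, which is an assumption; real usage is higher once activations, the KV cache, quantization scales, and image inputs are included.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 2**30


# Nominal 4B-parameter model (assumed count) at two precisions.
fp16_gib = weight_memory_gib(4e9, 16)   # roughly 7.5 GiB
int4_gib = weight_memory_gib(4e9, 4)    # roughly 1.9 GiB
```

Dropping from 16-bit to 4-bit weights cuts the weight footprint to a quarter, which leaves more of a small GPU's memory for the screenshot inputs and generation state.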
The architecture has specific failure modes to handle in production. Performance degrades when users provide ambiguous instructions or when the agent attempts to act before a webpage fully loads. Ai2 also explicitly excluded financial logins and payment tasks from the training data, meaning the model requires further adaptation for e-commerce checkouts or banking workflows.
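One way to reduce premature actions is to wait until consecutive screenshots stop changing before asking the model for its next step. The following is a hedged sketch of that idea, not something the MolmoWeb repository is known to provide; `take_screenshot` is a hypothetical callable that returns the current frame as bytes.

```python
import time
from typing import Callable


def wait_until_stable(
    take_screenshot: Callable[[], bytes],
    interval: float = 0.5,
    required_matches: int = 2,
    timeout: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once the screen is unchanged for `required_matches`
    consecutive polls, or False if `timeout` seconds pass first."""
    deadline = clock() + timeout
    previous = take_screenshot()
    matches = 0
    while clock() < deadline:
        sleep(interval)
        current = take_screenshot()
        if current == previous:
            matches += 1
            if matches >= required_matches:
                return True
        else:
            matches = 0          # page is still changing; start over
        previous = current
    return False
```

Pages with animations or autoplaying media may never produce identical frames, so the timeout matters: a caller can fall through and act anyway, or compare only a cropped region of the frame instead.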
You can self-host the inference client and model server using the provided GitHub repository. If your application relies on web scraping or automated testing, start by running the 4B model against your specific internal tools to measure the action latency compared to traditional script-based automation.
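For the latency comparison, a simple timing harness is enough to get started. This sketch assumes you can wrap one agent step (screenshot, model call, action) or one scripted step in a zero-argument callable named `step`; that wrapper is hypothetical and depends on your setup.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency(
    step: Callable[[], None],
    runs: int = 20,
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, float]:
    """Time repeated calls to `step` and summarize the distribution."""
    samples = []
    for _ in range(runs):
        start = clock()
        step()
        samples.append(clock() - start)
    samples.sort()
    # Nearest-rank style index for the 95th percentile, clamped to the last sample.
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "median_s": statistics.median(samples),
        "p95_s": samples[p95_index],
        "max_s": samples[-1],
    }
```

Run the same harness against the model-driven step and the existing scripted step on identical pages, and compare medians and tail latency rather than a single run.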