MolmoWeb Outpaces GPT-4o in Visual Web Navigation Tasks

The Allen Institute for AI has released MolmoWeb, an open-source visual web agent that navigates interfaces from screenshots alone rather than from the underlying HTML. Built on the Molmo 2 multimodal family, the model is available under an Apache 2.0 license in 4B and 8B parameter variants. For teams building web automation systems, this release shifts reliance away from brittle DOM parsing and toward vision-based action prediction.

MolmoWeb operates in a continuous observation-action loop. It takes a task instruction alongside a live browser screenshot, then predicts the next interface action, such as clicking, typing, scrolling, or switching tabs. This visual approach mirrors human navigation and bypasses the complexity of parsing modern, dynamically rendered web applications. If you evaluate and test AI agents, note that a vision-first agent needs different tooling than standard text-based DOM logging, because its inputs are screenshots rather than markup.
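The loop described above can be sketched in a few lines of Python. This is an illustrative skeleton only, not MolmoWeb's actual client: `Action`, `run_agent`, and the three callables (`take_screenshot`, `predict_action`, `execute`) are hypothetical names standing in for whatever browser driver and model server you use.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    """Hypothetical action record; the real model's output format may differ."""
    kind: str           # e.g. "click", "type", "scroll", "switch_tab", "done"
    argument: str = ""  # free-form payload such as coordinates or text


def run_agent(
    task: str,
    take_screenshot: Callable[[], bytes],
    predict_action: Callable[[str, bytes, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 20,
) -> List[Action]:
    """Observe, predict, act, and repeat until "done" or the step budget runs out."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = take_screenshot()                       # observe pixels only
        action = predict_action(task, screenshot, history)  # model call
        history.append(action)
        if action.kind == "done":
            break
        execute(action)                                      # act in the browser
    return history
```

Because every dependency is passed in as a callable, the same loop can be exercised with scripted stubs in tests and with a real browser and model in production. The `max_steps` cap is a design choice to stop an agent that never emits a terminal action.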

Benchmark Results

The 8B parameter flagship variant establishes a new performance standard for open-weight models. On the WebVoyager benchmark, MolmoWeb-8B achieved 78.2% accuracy on the first attempt. Using a pass@4 test-time scaling strategy with four parallel attempts, that score increased to 94.7%. The model scored 35.3% on Online-Mind2Web, 42.3% on DeepShop, and 49.5% on WebTailBench.
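To make the pass@4 figure concrete, here is a minimal sketch of how such a score can be computed, assuming pass@k here means a task counts as solved if any of its k attempts succeeds. The function names and the input layout (one list of booleans per task) are illustrative assumptions, not Ai2's evaluation code.

```python
from typing import Sequence


def pass_at_k(attempts: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of tasks where at least one of the first k attempts succeeded."""
    if not attempts:
        raise ValueError("need at least one task")
    solved = sum(1 for task in attempts if any(task[:k]))
    return solved / len(attempts)


def independent_estimate(p: float, k: int) -> float:
    """pass@k you would expect if all k attempts were independent with success rate p."""
    return 1.0 - (1.0 - p) ** k


if __name__ == "__main__":
    # With a 78.2% single-attempt rate, fully independent attempts would give
    # roughly 99.8% at k=4.
    print(f"{independent_estimate(0.782, 4):.3f}")
```

The reported 94.7% sits below that independence estimate, which is what you would expect if failures cluster on a subset of tasks that stay hard across retries rather than being spread randomly.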

These results push MolmoWeb-8B past OpenAI’s Computer-Using Agent on three of the four benchmarks. The model also exceeds the performance of GPT-4o-based SoM Agents on these standard evaluation sets.

Training Data and MolmoWebMix

Ai2 trained the models using supervised fine-tuning on 64 H100 GPUs. The system does not use reinforcement learning or distillation from proprietary models. The training relies entirely on MolmoWebMix, a newly released dataset containing 30,000 human task trajectories across over 1,100 websites. The dataset includes 590,000 individual subtask demonstrations and 2.2 million screenshot-question-answer pairs. Open access to this data provides a strong foundation if you want to build custom agents for domain-specific automation workflows.

Deployment Footprint and Current Limitations

The 4B variant prioritizes efficiency for local execution. Running with 4-bit quantization, the 4B model fits within the memory constraints of a free-tier GPU.
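A rough back-of-the-envelope calculation shows why 4-bit quantization matters here. The sketch below estimates weight storage only; it treats "4B" as exactly 4 billion parameters and ignores activations, the KV cache, quantization metadata, and any layers kept at higher precision, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    # Assumed parameter counts: 4e9 and 8e9, taken from the variant names.
    print(f"4B @ 4-bit:  {weight_memory_gib(4e9, 4):.2f} GiB")
    print(f"4B @ 16-bit: {weight_memory_gib(4e9, 16):.2f} GiB")
    print(f"8B @ 16-bit: {weight_memory_gib(8e9, 16):.2f} GiB")
```

Under these assumptions the 4-bit 4B weights come to just under 2 GiB, compared with roughly 7.5 GiB at 16-bit.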

The architecture has specific failure modes to handle in production. Performance degrades when users provide ambiguous instructions or when the agent attempts to act before a webpage fully loads. Ai2 also explicitly excluded financial logins and payment tasks from the training data, meaning the model requires further adaptation for e-commerce checkouts or banking workflows.
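One way to guard against the premature-action failure is to wait until consecutive screenshots stop changing before asking the model for its next step. The sketch below is an assumption about how you might do that, not part of MolmoWeb itself; `wait_until_stable` and its parameters are hypothetical, and `take_screenshot` is whatever capture function your browser driver provides.

```python
import time
from typing import Callable


def wait_until_stable(
    take_screenshot: Callable[[], bytes],
    stable_frames: int = 2,
    interval_s: float = 0.25,
    timeout_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bytes:
    """Poll screenshots until `stable_frames` consecutive captures are identical."""
    deadline = clock() + timeout_s
    previous = take_screenshot()
    matches = 0
    while matches < stable_frames:
        if clock() >= deadline:
            raise TimeoutError("page did not settle before the timeout")
        sleep(interval_s)
        current = take_screenshot()
        # Reset the counter whenever the page changes between captures.
        matches = matches + 1 if current == previous else 0
        previous = current
    return previous
```

Byte-for-byte comparison is deliberately strict: pages with animations, video, or blinking cursors may never settle, so the timeout path needs a sensible fallback, such as acting on the last frame or comparing a downscaled image with a tolerance.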

You can self-host the inference client and model server using the provided GitHub repository. If your application relies on web scraping or automated testing, start by running the 4B model against your specific internal tools to measure the action latency compared to traditional script-based automation.
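A small timing harness is enough to get a first latency comparison. The sketch below is generic and makes no assumptions about MolmoWeb's API: `step` is any zero-argument callable you supply, for example one agent action against your tool or one equivalent scripted action.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency(
    step: Callable[[], None],
    runs: int = 20,
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, float]:
    """Time `step` repeatedly and summarize the per-call latency in seconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    samples = []
    for _ in range(runs):
        start = clock()
        step()
        samples.append(clock() - start)
    ordered = sorted(samples)
    # Nearest-rank style index for the 95th percentile, clamped to the last sample.
    p95_index = min(len(ordered) - 1, int(0.95 * len(ordered)))
    return {
        "mean_s": statistics.mean(samples),
        "median_s": statistics.median(samples),
        "p95_s": ordered[p95_index],
    }
```

Run it once with the agent step and once with the scripted step, and compare the medians and tail values rather than single runs, since model inference time can vary noticeably between calls.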
