How to Run In-Loop Model Evaluations With olmo-eval
Learn how to set up olmo-eval to test large language model checkpoints during the training process using vLLM, LiteLLM, and Docker-based agent sandboxes.
On June 12, 2026, the Allen Institute for AI released olmo-eval, an open-source evaluation workbench designed specifically for the iterative model development loop. The tool extends the previous OLMES standard from 2024 to address the daily requirements of training large language models. You can use it to test numerous interventions across data, architecture, and hyperparameters while your models are actively training.
The framework provides a registry of benchmark tasks and composable suites with named variants for specific settings. This allows you to evaluate checkpoints rapidly without writing custom testing harnesses for every structural change you make to the model.
Core Architecture and Harness Abstraction
The fundamental design of olmo-eval separates execution policy from task definition. This harness abstraction means you can run any task as a standard baseline or as a tool-augmented operation without modifying the underlying task code.
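One way to picture this separation is the minimal sketch below. It is not olmo-eval's actual API: `Task`, `baseline_harness`, `tool_harness`, and the `CALL <tool> <arg>` convention are hypothetical names invented here to show one task definition running under two execution policies.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical types for illustration only; olmo-eval's real classes may differ.
Model = Callable[[str], str]


@dataclass(frozen=True)
class Task:
    """A task definition: what to ask and how to score, nothing about execution."""
    name: str
    prompt: str
    expected: str

    def score(self, output: str) -> float:
        return 1.0 if output.strip() == self.expected else 0.0


def baseline_harness(task: Task, model: Model) -> float:
    """Execution policy 1: a single plain model call."""
    return task.score(model(task.prompt))


def tool_harness(task: Task, model: Model,
                 tools: dict[str, Callable[[str], str]]) -> float:
    """Execution policy 2: allow one tool call, then ask the model again."""
    output = model(task.prompt)
    if output.startswith("CALL "):
        # Assumed toy protocol: "CALL <tool_name> <argument>"
        tool_name, _, arg = output[len("CALL "):].partition(" ")
        result = tools[tool_name](arg)
        output = model(f"{task.prompt}\nTool result: {result}")
    return task.score(output)
```

The `Task` class never changes between the two runs. Only the function that drives the model does, which is the property the harness abstraction is meant to provide.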
Named variants pin down a specific evaluation setting. For example, a variant string like humaneval:3shot:bpb tells the workbench to pull the exact configuration, prompt format, and few-shot examples required for that run. Because these settings live in the variant rather than in ad hoc scripts, the measurement mechanics stay identical when you compare a checkpoint from epoch 10 against one from epoch 20.
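To make the idea of a variant string concrete, here is a small sketch of how such a string could be split into its parts. The grammar `task[:Nshot][:metric]` is an assumption inferred from the single example humaneval:3shot:bpb. `Variant` and `parse_variant` are hypothetical names, and the real parser in olmo-eval may accept other modifiers.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Variant:
    task: str
    num_shots: Optional[int]
    metric: Optional[str]


def parse_variant(spec: str) -> Variant:
    """Parse 'task[:Nshot][:metric]' under the assumed grammar."""
    task, *modifiers = spec.split(":")
    if not task:
        raise ValueError(f"missing task name in {spec!r}")
    num_shots: Optional[int] = None
    metric: Optional[str] = None
    for mod in modifiers:
        match = re.fullmatch(r"(\d+)shot", mod)
        if match:
            num_shots = int(match.group(1))
        elif mod and metric is None:
            # Assumption: any other single modifier names the metric.
            metric = mod
        else:
            raise ValueError(f"unexpected modifier {mod!r} in {spec!r}")
    return Variant(task, num_shots, metric)
```

Parsing the string into a frozen record means two runs that use the same variant string compare equal, which makes it easy to check that two checkpoints were measured under the same settings.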
Environment Setup and Package Management
Installing and managing the workbench relies heavily on modern Python tooling. The project uses uv for package management, which provides fast, reproducible builds via a checked-in uv.lock file.
The core package is optimized for Linux, which is standard for most training clusters. macOS is supported for development and configuration tasks, though vLLM-specific features are disabled on Apple hardware. The infrastructure integrates directly with the Hugging Face hub for model and dataset access, and natively supports Beaker, the internal research platform used by Ai2.
Because exact code examples depend on your specific hardware configuration and dataset structure, review the olmo-eval documentation repository for the latest uv installation commands and Beaker job definitions.
Configuring Inference Backends
To evaluate models effectively in the loop, you need different execution strategies depending on the stage of the pipeline. olmo-eval supports three primary inference backends.
| Backend | Primary Use Case | Hardware Context |
|---|---|---|
| vLLM | High-throughput local execution for intermediate checkpoints. | Linux clusters with dedicated GPUs. |
| LiteLLM | Integration with commercial APIs to establish baselines. | Networked environments requiring API access. |
| Mock Provider | Rapid dry runs, debugging, and configuration testing. | Local development machines (macOS/Linux). |
The vLLM integration does most of the work during training. It lets the evaluation suite process thousands of prompts against a newly saved checkpoint efficiently. When you need to compare your checkpoint against a frontier model, the LiteLLM backend routes the exact same evaluation suite to external commercial APIs. If you need to verify your test configuration without burning compute, the Mock Provider simulates responses instantly.
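The value of a mock backend is easiest to see when the suite code depends only on a shared interface. The sketch below assumes a hypothetical `Provider` protocol with a single `generate` method. It does not reproduce olmo-eval's provider classes, and it does not call vLLM or LiteLLM.

```python
from typing import Protocol


class Provider(Protocol):
    """Assumed minimal interface that every backend would satisfy."""

    def generate(self, prompts: list[str]) -> list[str]: ...


class MockProvider:
    """Returns a canned response per prompt without touching a GPU or network."""

    def __init__(self, canned: str = "MOCK") -> None:
        self.canned = canned
        self.calls = 0

    def generate(self, prompts: list[str]) -> list[str]:
        self.calls += 1
        return [self.canned for _ in prompts]


def dry_run(provider: Provider, prompts: list[str]) -> dict[str, int]:
    """Push prompts through a provider and report basic sanity counts."""
    empty = sum(1 for p in prompts if not p.strip())
    responses = provider.generate(prompts)
    if len(responses) != len(prompts):
        raise RuntimeError("provider returned a different number of responses")
    return {"prompts": len(prompts), "empty_prompts": empty}
```

A check like `dry_run(MockProvider(), prompts)` catches empty or malformed prompts in seconds. Swapping in a GPU-backed or API-backed provider later would not require touching `dry_run`, as long as that provider exposes the same method.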
Running Advanced Evaluation Modes
Standard static benchmarks are often insufficient for testing modern model capabilities. The workbench includes native support for complex, multi-step testing scenarios.
For multi-turn agentic evaluation, olmo-eval supports tool calling and scaffolds execution inside sandboxed environments. You can configure the execution backend to spin up isolated environments using Docker, Podman, or Modal. This allows the model to execute generated code, interact with simulated file systems, and perform tool calls safely during the evaluation phase. Sandboxing prevents destructive actions on the host machine while capturing the full execution trace.
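As a rough illustration of what container isolation involves, the sketch below builds a `docker run` command for an untrusted script. It only constructs the argument list and does not start a container. The function name, image, mount path, and resource limit are illustrative assumptions, and olmo-eval's own sandbox configuration is likely structured differently.

```python
from pathlib import Path


def build_sandbox_command(script: Path, image: str = "python:3.12-slim",
                          memory: str = "512m",
                          runtime: str = "docker") -> list[str]:
    """Build (but do not run) a container command for an untrusted script."""
    script = script.resolve()
    return [
        runtime, "run",
        "--rm",                # delete the container when it exits
        "--network", "none",   # no network access from inside the sandbox
        "--memory", memory,    # cap memory use
        "--read-only",         # root filesystem is read-only
        "-v", f"{script}:/work/snippet.py:ro",  # mount the script read-only
        image,
        "python", "/work/snippet.py",
    ]
```

Passing the list to `subprocess.run(cmd, capture_output=True, text=True, timeout=30)` would execute it and capture stdout and stderr as part of the trace. Podman accepts the same flags, so `runtime="podman"` works with this sketch. Modal uses its own Python SDK and would need a different code path.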
The framework also provides native LLM-as-Judge scoring. You can configure auxiliary providers to act as graders for open-ended generation tasks. The judge model can be served locally alongside the model being evaluated, or you can route the grading requests to a larger API-based model via the LiteLLM backend.
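The general LLM-as-Judge pattern can be sketched in a few lines. The prompt template, the 1 to 5 scale, and the `judge` callable below are all assumptions chosen for illustration. They are not olmo-eval's grading prompts or interfaces.

```python
import re
from typing import Callable, Optional

# Hypothetical grading prompt; real judge prompts are usually task-specific.
JUDGE_TEMPLATE = (
    "Rate the answer to the question on a scale of 1 to 5.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Reply with 'Score: <number>'."
)


def judge_answer(question: str, answer: str,
                 judge: Callable[[str], str]) -> Optional[int]:
    """Ask a judge model for a 1-5 score; return None if the reply is unparseable."""
    reply = judge(JUDGE_TEMPLATE.format(question=question, answer=answer))
    match = re.search(r"Score:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None
```

Here `judge` stands in for whichever model does the grading, local or API-based. Returning `None` for unparseable replies keeps malformed judge output from being silently counted as a low score.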
Inspecting Outputs and Storing Results
Visibility into the evaluation process is critical when debugging training runs. olmo-eval includes built-in tooling for inspecting the exact data fed to the model and the subsequent outputs.
You can view formatted prompts exactly as they are constructed by the harness, examine the token arrays before they hit the model, and inspect the raw model responses. The workbench handles storage for both aggregate metrics and instance-level predictions. Storing instance-level predictions allows you to track exactly which specific questions a model started answering correctly or incorrectly between checkpoints, rather than just observing a shift in the final percentage score.
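The benefit of instance-level storage shows up when two checkpoints share the same aggregate score. The sketch below assumes predictions are available as a mapping from a hypothetical instance ID to a correct/incorrect flag. The actual storage format used by olmo-eval is not shown here.

```python
def accuracy(results: dict[str, bool]) -> float:
    """Fraction of instances answered correctly."""
    return sum(results.values()) / len(results) if results else 0.0


def diff_predictions(old: dict[str, bool],
                     new: dict[str, bool]) -> dict[str, list[str]]:
    """List instances whose correctness changed between two checkpoints."""
    shared = sorted(old.keys() & new.keys())
    return {
        "fixed": [i for i in shared if not old[i] and new[i]],
        "regressed": [i for i in shared if old[i] and not new[i]],
    }


# Illustrative data: both checkpoints score 50%, yet two instances flipped.
ckpt_a = {"q1": True, "q2": False, "q3": True, "q4": False}
ckpt_b = {"q1": True, "q2": True, "q3": False, "q4": False}
changes = diff_predictions(ckpt_a, ckpt_b)
```

In this toy data the aggregate accuracy is unchanged, but `changes` shows that `q2` was fixed and `q3` regressed, which is the kind of movement a single percentage hides.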
Tradeoffs and Limitations
The workbench is optimized for the active development loop, which introduces certain constraints. It is designed for speed and comparative analysis of checkpoints. If your primary goal is running complex, long-running agent deployments in persistent sandboxes, frameworks dedicated to agent benchmarking may offer more specialized orchestration tools.
Additionally, full hardware acceleration and vLLM support require a Linux environment. Developers working locally on macOS must rely on API backends or the mock provider, limiting the ability to run full-throughput local checkpoint evaluations on Apple Silicon.
Next Steps
Configure your environment using uv and initialize a basic mock provider run to validate your task definitions. Once the dry run completes successfully, swap the configuration to the vLLM backend and point the workbench at your latest training checkpoint to begin capturing instance-level evaluation data.