Hugging Face Defines the Scaffold vs Harness Agent Architecture
Hugging Face has published a new technical glossary formalizing the structural differences between an AI agent's scaffolding and its execution harness.
On May 25, 2026, Hugging Face released a technical glossary titled Harness, Scaffold, and the AI Agent Terms Worth Getting Right to address terminology fragmentation in the agent community. The publication follows growing confusion observed at ICLR 2026, where researchers and practitioners reported using conflicting definitions for foundational architectural concepts. Co-authored by researcher Ari Goldberg and reviewed by Hugging Face teams, the document establishes a formal structural model separating the behavior-defining layers of an agent from its underlying execution runtime.
Architectural Boundaries
The glossary formalizes a structural mental model for building these systems: an agent consists of a base model plus a harness. Within this model, Hugging Face draws a strict technical line between scaffolding and the harness itself.
Scaffolding acts as the behavior-defining layer that dictates how a model perceives and interacts with its environment. This encompasses system prompts, tool descriptions, context management, and the specific output formats or schemas the model must follow.
The harness serves as the execution layer or agentic runtime. Its responsibilities are strictly operational. The harness manages the control loop that decides when to invoke the model and when to halt. It handles actual tool execution, triggering the requested APIs or code. It also manages error states, including retries, timeouts, and malformed outputs, while enforcing operational guardrails.
The Impact on Evaluation and Tooling
Isolating the scaffold from the harness has immediate implications for evaluating and testing AI agents. According to the publication, improvements made exclusively to the scaffold layer have yielded 10 to 20 point performance increases on SWE-bench (Verified) tasks, all without altering the underlying model weights.
Despite this technical distinction, commercial products frequently blur the lines between these layers. Hugging Face notes that tools like Anthropic’s Claude Code, OpenAI’s Codex, and the Antigravity CLI often use “harness” as a catch-all term for the entire stack surrounding the base model. Claude Code’s official documentation explicitly describes the software as the “agentic harness around Claude.”
The glossary also contextualizes the rise of reusable agent capabilities, referencing Anthropic Skills. Unlike basic function calls, agent skills are distributed as structured packages of knowledge via .SKILL.md folders, bundling complex instructions and scripts for specific goals.
Updated Terminology for 2026
The release standardizes several other concepts that evolved during the early 2026 surge in terminal and coding agents, such as Google’s Gemini Nano terminal agent and the IBM Open Agent Leaderboard.
| Term | 2026 Definition |
|---|---|
| Sub-agent | An agent invoked by another agent for a specific subtask, maintaining its own independent model and scaffold. |
| Policy | The specific behavior an agent executes, representing a combination of learned model weights and the surrounding harness and scaffold. |
| Tool Search | The capability for an agent to search a repository for tools dynamically at runtime, rather than loading all tools into the system prompt upfront. |
If you build multi-agent systems, explicitly separating your scaffolding logic from your execution harness allows you to version and test your prompts and schemas independently of your API retry logic. Decoupling the behavior definition from the runtime engine prevents tight coupling that complicates debugging when agents fail at complex tasks.
Get Insanely Good at AI
The book for developers who want to understand how AI actually works. LLMs, prompt engineering, RAG, AI agents, and production systems.
Keep Reading
How to Build Advanced AI Agents with OpenClaw v2026
Learn to master OpenClaw v2026.3.22 by configuring reasoning files, integrating ClawHub skills, and deploying secure agent sandboxes.
Open Agent Leaderboard Evaluates Full Scaffolding and Task Costs
IBM and Hugging Face launched a benchmark that evaluates autonomous agents as complete systems, measuring both task success rates and the USD cost per run.
Google's Agents CLI: A Terminal Path to Agent Platform
Google Cloud has introduced Agents CLI, a command-line tool that gives AI coding assistants a machine-readable interface for agent scaffolding and deployment.
ServiceNow Ships a Benchmark for Testing Enterprise Voice Agents
ServiceNow AI released EVA, an open-source benchmark for evaluating voice agents on both task accuracy and spoken interaction quality.
NVIDIA Ships Nemotron 3 Content Safety 4B for On-Device Filtering
NVIDIA released Nemotron 3 Content Safety 4B, a multilingual multimodal moderation model for text and images, on Hugging Face.