Ai Agents 3 min read

Microsoft Foundry Ships ASSERT for Stochastic Agent Evaluation

Microsoft has released ASSERT, an open-source framework that translates text-based behavioral policies into automated evaluation tests for AI agents.

On June 2, 2026, Microsoft launched Adaptive Spec-driven Scoring for Evaluation and Regression Testing (ASSERT), an open-source framework for automating AI agent evaluation. The tool translates plain-text behavioral policies into executable adversarial tests and scoring rubrics. This allows teams to define granular safety and performance constraints without writing custom test harnesses for every possible agent interaction.

Traditional software testing relies on deterministic paths and exact string matches. ASSERT is built for non-deterministic language models. The framework asserts against pass rates and behavioral patterns over multiple iterations. When you evaluate and test AI agents, this approach measures the aggregate reliability of the system rather than the outcome of a single execution path.

Integration with Agent Control Specification

ASSERT operates within the broader Microsoft Foundry ecosystem as part of a new Agent Optimizer loop. Developers write a rule, such as restricting access to personal financial data without explicit approval. ASSERT generates the corresponding adversarial test scenarios. The newly announced Agent Control Specification (ACS) enforces these rules as runtime guardrails, while a public preview tool called Rubric handles success criteria like tone, cost, and safety.

ComponentFunction within Foundry Ecosystem
ASSERTGenerates stochastic test scenarios from natural language.
ACSEnforces defined policies as granular runtime guardrails.
RubricDefines and scores success criteria for tone, cost, and safety.

The architecture creates a continuous observe-evaluate-optimize cycle for production deployments. Microsoft positions this workflow as a direct solution to the black box problem of autonomous agents. The framework allows non-technical compliance officers to dictate safe operating parameters in plain text, which the system then enforces programmatically. This capability is designed to establish reliable boundaries for what Microsoft describes as async coworkers, marking a shift from immediate chat completions to independent, background task execution.

Regression Tracking and Accompanying Models

ASSERT includes native regression testing to track how model swaps or prompt modifications alter agent behavior over time. If a system update degrades task performance or weakens safety compliance, the framework catches the behavioral drift before deployment.

Microsoft released ASSERT alongside two distinct model updates. MAI-Thinking-1 debuts as the company’s homegrown reasoning model. Simultaneously, Project Polaris enters the ecosystem as a specialized coding model, scheduled to replace GPT-4 Turbo as the default engine for GitHub Copilot in August 2026. Teams migrating to these new models can use ASSERT to measure behavioral shifts between the architectures.

Transitioning from synchronous chatbots to asynchronous multi-agent systems requires strict behavioral validation. If your production environment relies on non-deterministic models, implement a spec-driven testing framework to ensure your agents adhere to baseline safety policies before they reach end users.

Get Insanely Good at AI

Get Insanely Good at AI

The book for developers who want to understand how AI actually works. LLMs, prompt engineering, RAG, AI agents, and production systems.

Keep Reading