Microsoft Foundry Ships ASSERT for Stochastic Agent Evaluation

On June 2, 2026, Microsoft launched Adaptive Spec-driven Scoring for Evaluation and Regression Testing (ASSERT), an open-source framework for automating AI agent evaluation. The tool translates plain-text behavioral policies into executable adversarial tests and scoring rubrics. This allows teams to define granular safety and performance constraints without writing custom test harnesses for every possible agent interaction.

Traditional software testing relies on deterministic paths and exact string matches. ASSERT is built for non-deterministic language models, so it asserts against pass rates and behavioral patterns over multiple iterations. The result measures the aggregate reliability of the agent rather than the outcome of a single execution path.
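To make the difference from exact-match testing concrete, here is a minimal Python sketch of a pass-rate assertion. It does not use ASSERT's actual API, which this article does not document. The names `pass_rate`, `make_flaky_agent`, and `no_balance_leak`, the simulated agent, and the 90% threshold are all hypothetical and chosen only for illustration.

```python
import random
from typing import Callable


def pass_rate(agent: Callable[[str], str], prompt: str,
              check: Callable[[str], bool], trials: int = 50) -> float:
    """Run the agent `trials` times and return the fraction of outputs passing `check`."""
    passed = sum(1 for _ in range(trials) if check(agent(prompt)))
    return passed / trials


def make_flaky_agent(leak_probability: float, seed: int) -> Callable[[str], str]:
    """Hypothetical stand-in for a non-deterministic agent (not a real model call)."""
    rng = random.Random(seed)

    def agent(prompt: str) -> str:
        if rng.random() < leak_probability:
            return "Your balance is $1,204.17"
        return "I need your explicit approval before sharing account data."

    return agent


def no_balance_leak(output: str) -> bool:
    """Toy behavioral check: the reply must not contain a dollar amount."""
    return "$" not in output


agent = make_flaky_agent(leak_probability=0.02, seed=0)
rate = pass_rate(agent, "What is my balance?", no_balance_leak, trials=200)

# Assert on the aggregate pass rate, not on any single response string.
assert rate >= 0.90, f"pass rate {rate:.2%} is below the 90% threshold"
```

A single leaked response does not fail this test, but a pattern of leaks does, which is the property an exact string match cannot express.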

Integration with Agent Control Specification

ASSERT operates within the broader Microsoft Foundry ecosystem as part of a new Agent Optimizer loop. Developers write a rule, such as a ban on accessing personal financial data without explicit approval, and ASSERT generates the corresponding adversarial test scenarios. The newly announced Agent Control Specification (ACS) enforces these rules as runtime guardrails, while a public preview tool called Rubric handles success criteria like tone, cost, and safety.

Component	Function within Foundry Ecosystem
ASSERT	Generates stochastic test scenarios from natural language.
ACS	Enforces defined policies as granular runtime guardrails.
Rubric	Defines and scores success criteria for tone, cost, and safety.

The architecture creates a continuous observe-evaluate-optimize cycle for production deployments. Microsoft positions this workflow as a direct answer to the black-box problem of autonomous agents: non-technical compliance officers state safe operating parameters in plain text, and the system enforces them programmatically. The goal is to set reliable boundaries for what Microsoft describes as "async coworkers," agents that run independent background tasks rather than return immediate chat completions.
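The division of labor described in this section can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the ACS or ASSERT interface: `Policy`, `ToolCall`, `enforce`, and the tool names are hypothetical. The rule's predicate is written by hand here, whereas the article says the real system derives enforcement from the plain-text rule.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class ToolCall:
    tool: str            # hypothetical tool name the agent wants to invoke
    user_approved: bool  # whether explicit approval was recorded


@dataclass(frozen=True)
class Policy:
    text: str                             # the plain-text rule, kept for audit messages
    violates: Callable[[ToolCall], bool]  # hand-written stand-in for the compiled rule


class PolicyViolation(Exception):
    """Raised when a tool call breaks a policy at runtime."""


def enforce(policies: List[Policy], call: ToolCall) -> ToolCall:
    """Runtime guardrail: block the call if any policy is violated."""
    for policy in policies:
        if policy.violates(call):
            raise PolicyViolation(policy.text)
    return call


FINANCE_RULE = Policy(
    text="Do not access personal financial data without explicit approval.",
    violates=lambda c: c.tool == "read_financial_records" and not c.user_approved,
)


def blocked(call: ToolCall) -> bool:
    """Return True if the guardrail rejects the call."""
    try:
        enforce([FINANCE_RULE], call)
        return False
    except PolicyViolation:
        return True


# Hand-picked probes; a generator would derive adversarial cases from the rule text.
probes = [
    ToolCall("read_financial_records", user_approved=False),
    ToolCall("read_financial_records", user_approved=True),
    ToolCall("search_docs", user_approved=False),
]
results = [blocked(p) for p in probes]
```

Keeping the original rule text on the policy object means a blocked call can report which plain-language rule it broke, which is the part a compliance reviewer would read.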

Regression Tracking and Accompanying Models

ASSERT includes native regression testing to track how model swaps or prompt modifications alter agent behavior over time. If a system update degrades task performance or weakens safety compliance, the framework catches the behavioral drift before deployment.
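A regression check of this kind reduces to comparing per-scenario pass rates before and after a change. The sketch below assumes pass rates have already been measured. The function name `find_regressions`, the scenario names, the sample rates, and the 2-point tolerance are hypothetical, and none of it reflects ASSERT's own reporting format.

```python
from typing import Dict, List


def find_regressions(baseline: Dict[str, float], candidate: Dict[str, float],
                     tolerance: float = 0.02) -> List[str]:
    """Return scenarios whose pass rate dropped by more than `tolerance`."""
    regressions = []
    for scenario, base_rate in baseline.items():
        # A scenario missing from the candidate run is treated as a 0% pass rate.
        new_rate = candidate.get(scenario, 0.0)
        if base_rate - new_rate > tolerance:
            regressions.append(scenario)
    return sorted(regressions)


# Illustrative numbers only: pass rates before and after a model swap.
baseline = {
    "refuses_unapproved_finance_access": 0.99,
    "completes_refund_task": 0.93,
}
candidate = {
    "refuses_unapproved_finance_access": 0.94,
    "completes_refund_task": 0.95,
}

drifted = find_regressions(baseline, candidate)
if drifted:
    print("Behavioral drift detected in:", ", ".join(drifted))
```

The tolerance matters because pass rates from a non-deterministic agent vary between runs; without it, ordinary sampling noise would be reported as drift.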

Microsoft released ASSERT alongside two distinct model updates. MAI-Thinking-1 debuts as the company’s homegrown reasoning model. Simultaneously, Project Polaris enters the ecosystem as a specialized coding model, scheduled to replace GPT-4 Turbo as the default engine for GitHub Copilot in August 2026. Teams migrating to these new models can use ASSERT to measure behavioral shifts between the architectures.

Transitioning from synchronous chatbots to asynchronous multi-agent systems requires strict behavioral validation. If your production environment relies on non-deterministic models, implement a spec-driven testing framework to ensure your agents adhere to baseline safety policies before they reach end users.
