Microsoft Foundry Ships ASSERT for Stochastic Agent Evaluation
Microsoft has released ASSERT, an open-source framework that translates text-based behavioral policies into automated evaluation tests for AI agents.
On June 2, 2026, Microsoft launched Adaptive Spec-driven Scoring for Evaluation and Regression Testing (ASSERT), an open-source framework for automating AI agent evaluation. The tool translates plain-text behavioral policies into executable adversarial tests and scoring rubrics. This allows teams to define granular safety and performance constraints without writing custom test harnesses for every possible agent interaction.
Traditional software testing relies on deterministic paths and exact string matches. ASSERT is built for non-deterministic language models. The framework asserts against pass rates and behavioral patterns over multiple iterations. When you evaluate and test AI agents, this approach measures the aggregate reliability of the system rather than the outcome of a single execution path.
Integration with Agent Control Specification
ASSERT operates within the broader Microsoft Foundry ecosystem as part of a new Agent Optimizer loop. Developers write a rule, such as restricting access to personal financial data without explicit approval. ASSERT generates the corresponding adversarial test scenarios. The newly announced Agent Control Specification (ACS) enforces these rules as runtime guardrails, while a public preview tool called Rubric handles success criteria like tone, cost, and safety.
| Component | Function within Foundry Ecosystem |
|---|---|
| ASSERT | Generates stochastic test scenarios from natural language. |
| ACS | Enforces defined policies as granular runtime guardrails. |
| Rubric | Defines and scores success criteria for tone, cost, and safety. |
The architecture creates a continuous observe-evaluate-optimize cycle for production deployments. Microsoft positions this workflow as a direct solution to the black box problem of autonomous agents. The framework allows non-technical compliance officers to dictate safe operating parameters in plain text, which the system then enforces programmatically. This capability is designed to establish reliable boundaries for what Microsoft describes as async coworkers, marking a shift from immediate chat completions to independent, background task execution.
Regression Tracking and Accompanying Models
ASSERT includes native regression testing to track how model swaps or prompt modifications alter agent behavior over time. If a system update degrades task performance or weakens safety compliance, the framework catches the behavioral drift before deployment.
Microsoft released ASSERT alongside two distinct model updates. MAI-Thinking-1 debuts as the company’s homegrown reasoning model. Simultaneously, Project Polaris enters the ecosystem as a specialized coding model, scheduled to replace GPT-4 Turbo as the default engine for GitHub Copilot in August 2026. Teams migrating to these new models can use ASSERT to measure behavioral shifts between the architectures.
Transitioning from synchronous chatbots to asynchronous multi-agent systems requires strict behavioral validation. If your production environment relies on non-deterministic models, implement a spec-driven testing framework to ensure your agents adhere to baseline safety policies before they reach end users.
Get Insanely Good at AI
The book for developers who want to understand how AI actually works. LLMs, prompt engineering, RAG, AI agents, and production systems.
Keep Reading
Cursor's Autoinstall Bootstraps RL Training Environments
Learn how Cursor uses previous model generations to automate reinforcement learning environment setups, mock dependencies, and verify target commands.
Holo3.1 Brings 140ms Local Computer Use Agents to 12GB GPUs
Hcompany released Holo3.1, an open-weights agent framework that runs computer-use tasks locally with 140ms latency and 74.2% OS-World accuracy.
Gemini Spark Preview Enables Headless DOM Navigation Workflows
Google's new Gemini Spark agent leverages the Gemini 2.0 Ultra architecture to execute autonomous, multi-step workflows across Chrome and Google Workspace.
IBM Pivots to Agent Logic to Control Multi-Step AI Workflows
A joint technical publication from IBM and Hugging Face details how strict state management and formal logic layers can govern long-running enterprise agents.
iOS 27 Siri Leaks Reveal Gemini Backbone and AI Extensions
Leaked technical details for Apple's iOS 27 reveal a redesigned Siri operating as a standalone chatbot powered by Google's Gemini models.