TML-Interaction-Small Achieves 0.40s Full-Duplex Latency
Thinking Machines Lab has released a research preview of TML-Interaction-Small, a 276-billion-parameter Mixture-of-Experts model for full-duplex conversation.
On May 11, 2026, Thinking Machines Lab introduced a research preview of TML-Interaction-Small, a 276-billion-parameter model designed for simultaneous listening and speaking. The startup’s debut release bypasses traditional turn-based dialogue systems in favor of a full-duplex architecture. For developers building real-time voice or video applications, the model shifts the engineering problem from external voice activity detection to handling continuous, time-aligned token streams.
Encoder-Free Early Fusion
The underlying model is a Mixture-of-Experts (MoE) system that activates 12 billion of its 276 billion parameters per inference. Standard voice-enabled AI pipelines rely on external voice activity detectors to determine when a user has finished speaking. Thinking Machines replaces that external harness with an encoder-free early-fusion approach.
Raw audio and visual signals pass directly through lightweight embedding layers within the transformer itself. This allows the model to process communication in 200-millisecond chunks. By analyzing these micro-turns natively, the system can react to interruptions, generate backchannel audio like conversational filler, and adjust its output mid-sentence based on new input.
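To make the framing concrete, here is a minimal sketch of slicing a raw PCM buffer into the 200-millisecond micro-turns described above. The 16 kHz sample rate is an assumption for illustration; Thinking Machines has not published an ingestion API for the preview.

```python
# Minimal sketch: frame raw PCM audio into 200 ms micro-turns.
# The sample rate is assumed; no TML-Interaction-Small API is public.

SAMPLE_RATE = 16_000   # 16 kHz mono, a common speech sample rate (assumed)
SAMPLE_WIDTH = 2       # bytes per 16-bit sample
CHUNK_MS = 200         # micro-turn length reported for the model

CHUNK_BYTES = SAMPLE_RATE * SAMPLE_WIDTH * CHUNK_MS // 1000  # 6,400 bytes

def iter_micro_turns(pcm: bytes):
    """Yield fixed 200 ms frames from a raw PCM byte buffer."""
    for start in range(0, len(pcm) - CHUNK_BYTES + 1, CHUNK_BYTES):
        yield pcm[start:start + CHUNK_BYTES]

if __name__ == "__main__":
    one_second = bytes(SAMPLE_RATE * SAMPLE_WIDTH)  # silent demo buffer
    frames = list(iter_micro_turns(one_second))
    print(len(frames), "frames of", CHUNK_BYTES, "bytes")  # 5 frames
```

In a full-duplex pipeline each frame would be sent to the model the moment it fills, rather than being buffered until a silence detector declares the turn over.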
Dual-Agent Architecture
To prevent heavy compute tasks from blocking the live audio stream, TML-Interaction-Small uses a dual-agent configuration. A fast Interaction Model manages timing, latency, and immediate dialogue. A parallel Background Agent handles intensive workloads like tool execution and web browsing.
The background agent feeds its results back into the live stream asynchronously. If you build multi-agent applications with strict latency budgets, this separation of interaction and reasoning layers offers a reference pattern for isolating the synchronous user interface from asynchronous tool calls, as in the sketch below.
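As a rough illustration of that pattern, this asyncio sketch runs a fast loop on a fixed 200-millisecond tick while a background task finishes a slow tool call and feeds its result back through a queue. Every name here is an illustrative placeholder, not part of any published TML interface.

```python
import asyncio

async def background_agent(task: str, results: asyncio.Queue) -> None:
    """Slow path: tool execution, web browsing, and similar workloads."""
    await asyncio.sleep(2.0)  # stand-in for a long-running tool call
    await results.put(f"result for {task!r}")

async def interaction_loop() -> None:
    """Fast path: responds on a fixed 200 ms tick without ever blocking."""
    results: asyncio.Queue = asyncio.Queue()
    pending = None
    for step in range(15):
        if step == 2:
            # Hand heavy work off without awaiting it inline.
            pending = asyncio.create_task(
                background_agent("look up weather", results)
            )
        # Drain any finished background results into the live stream.
        while not results.empty():
            print(f"tick {step}: injecting {results.get_nowait()}")
        print(f"tick {step}: listening/speaking")
        await asyncio.sleep(0.2)  # one micro-turn
    if pending is not None:
        await pending

asyncio.run(interaction_loop())
```

The key design choice is that the fast loop only drains finished results on each tick instead of awaiting the tool call inline, which is what keeps the latency budget intact while heavy work completes.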
Turn-Taking Latency and Benchmarks
Thinking Machines evaluated the model on its internal datasets and the FD-bench suite, which measures conversational timing and interaction quality. The architecture achieves sub-second turn-taking latency, outperforming recent real-time releases from Google and OpenAI.
| Metric | TML-Interaction-Small | Gemini-3.1-flash-live | GPT-realtime-2.0 |
|---|---|---|---|
| Turn-Taking Latency | 0.40s | 0.57s | 1.18s |
| TimeSpeak Accuracy | 64.7% | Not specified | 4.3% |
| Temporal Action-Counting | 35.4% | Not specified | 1.3% |
The 0.40-second latency places the model ahead of the 0.57-second baseline set when Gemini 3.1 Flash Live debuted. The gap on TimeSpeak, 64.7% against 4.3% for GPT-realtime-2.0, highlights the difficulty turn-based architectures face when evaluated on continuous, time-aware contexts.
Availability Timeline
TML-Interaction-Small is currently restricted to a select group of researchers, with no public API available. Thinking Machines plans to open API access for enterprise partners in Q2 2026. An open beta will follow in Q3, ahead of a targeted production release in Q4. The company also allocated $5 million in safety grants alongside the research preview.
If you maintain external voice activity detection modules or prompt-based timing logic, the shift toward early-fusion models will eventually make those middleware layers obsolete. Prepare to refactor voice applications to stream raw sensor data directly to the model rather than waiting for discrete silence thresholds.
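As a hypothetical before-and-after, the fragment below contrasts the two control flows. Every object and method name is invented for illustration, since no public API exists yet.

```python
# Hypothetical before/after; all client objects here are invented.

# Before: turn-based flow, gated by an external silence threshold.
def turn_based(frames, vad, model):
    buffer = bytearray()
    for frame in frames:
        buffer.extend(frame)
        if vad.is_silence(frame):          # middleware decides the turn
            yield model.complete(bytes(buffer))
            buffer.clear()

# After: full-duplex flow, every 200 ms frame streams immediately.
def full_duplex(frames, session):
    for frame in frames:
        session.send(frame)                # model decides timing itself
        yield from session.poll_events()   # interruptions, backchannels
```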