
TML-Interaction-Small Achieves 0.40s Full-Duplex Latency

Thinking Machines Lab has released a research preview of TML-Interaction-Small, a 276-billion-parameter Mixture-of-Experts model for full-duplex conversation.

On May 11, 2026, Thinking Machines Lab introduced a research preview of TML-Interaction-Small, a 276-billion-parameter model designed for simultaneous listening and speaking. The startup’s debut release bypasses traditional turn-based dialogue systems in favor of a full-duplex architecture. For developers building real-time voice or video applications, the model shifts the technical bottleneck from external activity detection to continuous, time-aligned token streams.

Encoder-Free Early Fusion

The underlying model is a Mixture-of-Experts (MoE) system that activates 12 billion parameters per inference. Standard voice-enabled AI pipelines rely on external voice activity detectors to determine when a user has finished speaking. Thinking Machines replaces that external detection stage with an encoder-free early fusion approach.

Raw audio and visual signals pass directly through lightweight embedding layers within the transformer itself. This allows the model to process communication in 200-millisecond chunks. By analyzing these micro-turns natively, the system can react to interruptions, generate backchannel audio like conversational filler, and adjust its output mid-sentence based on new input.
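To make the chunking concrete, here is a minimal sketch of the idea described above: raw audio is split into 200-millisecond micro-turns and projected directly into the model's embedding space, with no separate encoder. The sample rate, embedding dimension, and the linear projection standing in for the "lightweight embedding layers" are all assumptions for illustration, not details published by Thinking Machines.

```python
import numpy as np

SAMPLE_RATE = 16_000                             # assumed audio sample rate
CHUNK_MS = 200                                   # the 200 ms micro-turns described above
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_MS // 1000   # 3,200 samples per chunk
EMBED_DIM = 1024                                 # hypothetical model dimension

# Hypothetical lightweight embedding layer: a single linear projection
# standing in for the in-transformer audio embedding.
rng = np.random.default_rng(0)
proj = rng.standard_normal((CHUNK_SAMPLES, EMBED_DIM)) * 0.01

def embed_chunks(audio: np.ndarray):
    """Split raw audio into 200 ms chunks and project each chunk directly
    into the embedding space -- no external encoder, no VAD gate."""
    n_chunks = len(audio) // CHUNK_SAMPLES
    for i in range(n_chunks):
        chunk = audio[i * CHUNK_SAMPLES:(i + 1) * CHUNK_SAMPLES]
        yield chunk @ proj   # one time-aligned embedding per micro-turn

one_second = rng.standard_normal(SAMPLE_RATE)
embeddings = list(embed_chunks(one_second))
print(len(embeddings), embeddings[0].shape)   # 5 (1024,)
```

Because every 200 ms chunk produces an embedding whether or not anyone is speaking, the model sees silence, overlap, and interruptions as ordinary input rather than as pipeline events.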

Dual-Agent Architecture

To prevent heavy compute tasks from blocking the live audio stream, TML-Interaction-Small uses a dual-agent configuration. A fast Interaction Model manages timing, latency, and immediate dialogue. A parallel Background Agent handles intensive workloads like tool execution and web browsing.

The background agent feeds its results back into the live stream asynchronously. If you are building applications where multi-agent systems must maintain strict latency budgets, this separation of interaction and reasoning layers provides a reference for isolating synchronous user interfaces from asynchronous tool calls.
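The separation described above maps naturally onto an event loop: a fast loop keeps the live stream ticking while a slow worker runs concurrently and posts results to a queue. This is a generic sketch of that pattern using Python's asyncio, not Thinking Machines' implementation; all names and timings here are illustrative.

```python
import asyncio

async def background_agent(task: str, results: asyncio.Queue):
    """Slow worker standing in for tool execution or web browsing."""
    await asyncio.sleep(0.2)                   # simulated heavy work
    await results.put(f"result for {task!r}")

async def interaction_model(results: asyncio.Queue):
    """Fast loop that keeps the live stream responsive; on each tick it
    checks -- without blocking -- whether a tool result has arrived."""
    transcript = []
    for tick in range(5):
        await asyncio.sleep(0.06)              # one chunk of live dialogue
        transcript.append(f"tick {tick}")
        try:
            transcript.append(results.get_nowait())  # merged asynchronously
        except asyncio.QueueEmpty:
            pass
    return transcript

async def main():
    results: asyncio.Queue = asyncio.Queue()
    bg = asyncio.create_task(background_agent("web search", results))
    transcript = await interaction_model(results)
    await bg
    return transcript

transcript = asyncio.run(main())
print(transcript)
```

The key property is that the interaction loop never awaits the background task directly, so a slow tool call can never stall a latency-sensitive tick.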

Turn-Taking Latency and Benchmarks

Thinking Machines evaluated the model on its internal datasets and the FD-bench suite, which measures conversational timing and interaction quality. The architecture achieves sub-second turn-taking speeds, outperforming recent real-time releases from Google and OpenAI.

Metric                   | TML-Interaction-Small | Gemini-3.1-flash-live | GPT-realtime-2.0
Turn-Taking Latency      | 0.40s                 | 0.57s                 | 1.18s
TimeSpeak Accuracy       | 64.7%                 | Not specified         | 4.3%
Temporal Action-Counting | 35.4%                 | Not specified         | 1.3%

The 0.40-second latency places the model ahead of the baseline established when Gemini 3.1 Flash Live debuted. The disparity in the TimeSpeak benchmark highlights the difficulty turn-based architectures face when evaluated on continuous, time-aware contexts.

Availability Timeline

TML-Interaction-Small is currently restricted to a select group of researchers, with no public API available. Thinking Machines plans to open API access for enterprise partners in Q2 2026. An open beta will follow in Q3, ahead of a targeted production release in Q4. The company also allocated $5 million in safety grants alongside the research preview.

If you maintain external voice activity detection modules or prompt-based timing logic, the shift toward early-fusion models will eventually deprecate those middleware layers. Prepare to refactor voice applications to stream raw sensor data directly to the model rather than waiting for discrete silence thresholds.
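The refactor described above can be sketched as two contrasting pipelines: the legacy pattern buffers audio until a silence threshold fires, while the streaming pattern forwards every raw chunk immediately and leaves turn decisions to the model. The thresholds and frame sizes below are illustrative assumptions, not parameters from any real VAD library.

```python
import numpy as np

SAMPLE_RATE = 16_000
CHUNK = SAMPLE_RATE // 5   # 200 ms of samples per frame

def vad_pipeline(frames, silence_rms=0.01, silence_frames=3):
    """Legacy pattern: buffer audio and only emit a full utterance after
    a run of quiet frames crosses the silence threshold."""
    buffered, quiet = [], 0
    for frame in frames:
        buffered.append(frame)
        quiet = quiet + 1 if np.sqrt(np.mean(frame ** 2)) < silence_rms else 0
        if quiet >= silence_frames:
            yield np.concatenate(buffered)   # whole turn, delivered late
            buffered, quiet = [], 0

def streaming_pipeline(frames):
    """Early-fusion pattern: forward every raw chunk as it arrives and
    let the model handle turns, interruptions, and backchannels."""
    for frame in frames:
        yield frame                          # no waiting on silence

# Five frames of "speech" followed by three frames of silence.
frames = [np.full(CHUNK, 0.1) for _ in range(5)] + [np.zeros(CHUNK) for _ in range(3)]
print(len(list(vad_pipeline(frames))), len(list(streaming_pipeline(frames))))   # 1 8
```

The VAD pipeline emits a single utterance only after the silence run completes; the streaming pipeline has already delivered eight chunks by then, which is where the latency difference comes from.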
