
Boosting Kimi K2.5 Speed 3x via Cloudflare Infire Optimization

Cloudflare enhances Workers AI with the Infire engine, enabling extra-large models like Kimi K2.5 to run faster and more cost-effectively using Rust-based optimizations.

On April 16, 2026, Cloudflare detailed an overhauled AI inference stack designed specifically to run extra-large language models for agentic workflows. The core of the update is a highly optimized version of its proprietary Infire engine. By combining hardware orchestration and software-level improvements, Cloudflare increased the inference speed of Moonshot AI’s Kimi K2.5 by 3x. For developers deploying autonomous agents, the backend changes alter the performance profile of high-context, tool-heavy applications.

Inference Engine Architecture

The Infire inference engine, written in Rust, received several structural upgrades to maximize hardware utilization. Cloudflare introduced prefill-decode (PD) disaggregation to separate the processing of input tokens from the generation of output tokens. If you build systems with massive system prompts, this separation prevents long context processing from bottlenecking the decoding phase.
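The split can be illustrated with a toy sketch (illustrative only, not Cloudflare's implementation): a prefill stage processes the whole prompt in one pass and hands its KV cache to a separate decode stage, so long prompts never stall token generation.

```python
# Toy sketch of prefill/decode disaggregation. The two stages could run
# on different GPUs or hosts; only the KV cache crosses the boundary.

def prefill(prompt_tokens):
    """Process all input tokens in one batched pass and build the KV cache."""
    # Stand-in for a full forward pass over the prompt.
    return [("k%d" % i, "v%d" % i) for i, _ in enumerate(prompt_tokens)]

def decode(kv_cache, max_new_tokens):
    """Generate output tokens one at a time, reusing the prefilled cache."""
    output = []
    for step in range(max_new_tokens):
        # Stand-in for a single-token forward pass against the cache.
        output.append(f"tok{len(kv_cache) + step}")
    return output

cache = prefill(["Summarize", "this", "very", "long", "context"])
print(decode(cache, 3))  # ['tok5', 'tok6', 'tok7']
```

Because prefill is compute-bound and decode is memory-bound, scheduling them on separate workers lets each pool be sized independently.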

The stack utilizes speculative decoding to accelerate structured outputs. A smaller draft model generates candidate tokens, and the larger target model validates them in a single forward pass. This reduces the latency of repeated tool calls.

Combined with multi-GPU support and KV-cache optimizations, the updated engine achieves up to 20% higher tokens-per-second throughput on unconstrained systems. Cold starts for extra-large models now complete in under 20 seconds, with load times limited by disk read speed rather than software overhead.

Scaling Kimi K2.5 in Production

Cloudflare recently hosted Kimi K2.5 on Workers AI to handle large-scale enterprise tasks. The model features a 256k context window and natively supports vision and multi-turn tool calling. Software optimizations over the past month made the model 3x faster on Cloudflare’s infrastructure.
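Workers AI models are reachable over Cloudflare's documented REST route `/accounts/{account_id}/ai/run/{model}`. A minimal sketch follows; note that the Kimi model identifier shown is an assumption (check the Workers AI model catalog for the real name), and credentials are read from hypothetical environment variables.

```python
# Hedged sketch of calling a Workers AI chat model over the REST API.
# The /ai/run route is Cloudflare's documented endpoint shape; the model
# identifier below is an ASSUMPTION, not confirmed by the announcement.
import json
import os
import urllib.request

ACCOUNT_ID = os.environ.get("CF_ACCOUNT_ID", "your-account-id")
API_TOKEN = os.environ.get("CF_API_TOKEN")  # may be unset in local testing
MODEL = "@cf/moonshotai/kimi-k2.5"  # assumed identifier -- verify in catalog

def build_request(messages):
    """Build the POST request for a chat completion."""
    url = (f"https://api.cloudflare.com/client/v4/accounts/"
           f"{ACCOUNT_ID}/ai/run/{MODEL}")
    body = json.dumps({"messages": messages}).encode()
    return urllib.request.Request(
        url, data=body, method="POST",
        headers={"Authorization": f"Bearer {API_TOKEN}",
                 "Content-Type": "application/json"})

req = build_request([{"role": "user", "content": "Summarize this diff."}])
if API_TOKEN:  # only hit the network when credentials are configured
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["result"])
```

The same request body shape carries multi-turn conversations by appending prior assistant and tool messages to the `messages` list.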

The efficiency gains directly impact operating costs. Cloudflare migrated its internal security review agent, which processes 7 billion tokens daily, to Kimi K2.5 on Workers AI. The migration resulted in a 77% cost reduction compared to previous mid-tier proprietary models.

Infrastructure Primitives for Agents

Accompanying the infrastructure upgrades are specialized tools for agent orchestration. Cloudflare introduced AI Search, a retrieval primitive that allows dynamic instance creation for uploaded files. Agents also gain access to Artifacts, a Git-compatible versioned storage system for handling code and data handoffs.

Long-running agents require persistent environments. Project Think provides persistent workspaces, sub-agent coordination, and sandboxed code execution inside Dynamic Workers.

To manage tool calling efficiency, Cloudflare released Code Mode. This specialized Model Context Protocol server compresses the token footprint needed to interact with Cloudflare’s 2,500 APIs. The implementation reduces the required context from 1.17 million tokens down to approximately 1,000 tokens, achieving a 99.9% reduction in tool-calling overhead.
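The quoted figures are internally consistent, which a two-line check confirms:

```python
# Sanity check of the Code Mode numbers quoted above:
# 1.17 million tokens of tool schemas vs. ~1,000 tokens of context.
full_context = 1_170_000
code_mode = 1_000
reduction = 1 - code_mode / full_context
print(f"{reduction:.1%}")  # 99.9%
```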

If you run agentic workflows with high tool-call frequencies, evaluate your prefill and decode isolation strategies. The cost and speed improvements demonstrated with Kimi K2.5 show that optimizing the underlying inference architecture is essential for scaling long-context applications in production.
