
Workers AI Now Lets Agents Run Kimi K2.5

Cloudflare Workers AI now serves Kimi K2.5 with 256k context, tool calling, prompt caching metrics, session affinity, and batch inference.

Cloudflare has added Kimi K2.5 to Workers AI, marking its first public step into serving larger open models for agent workloads. The March 19 launch puts Moonshot AI’s @cf/moonshotai/kimi-k2.5 behind Cloudflare’s inference platform with a 256,000 token context window, tool calling, vision, structured outputs, and batch support, detailed in Cloudflare’s Workers AI rollout and the model page. If you build coding agents, research agents, or large-context automation on Cloudflare, this changes the platform’s practical ceiling.

Model capabilities and pricing

Kimi K2.5 on Workers AI is priced at $0.60 per million input tokens, $0.10 per million cached input tokens, and $3.00 per million output tokens. The exposed cached-token pricing matters as much as the base rates, because long-running agents often resend large prompt prefixes, memory blocks, tool schemas, and repository context.

| Model | Context window | Input | Cached input | Output | Tool calling | Vision | Batch |
|---|---|---|---|---|---|---|---|
| @cf/moonshotai/kimi-k2.5 | 256,000 tokens | $0.60 / 1M | $0.10 / 1M | $3.00 / 1M | Yes | Yes | Yes |

For teams already working on context engineering, the combination is the real story. A 256k window is useful, but the production win comes from making repeated context cheaper instead of paying full freight every turn.
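The arithmetic is easy to check against the published rates. Here is a minimal sketch (the pricing constants come from the launch; the function and the 100k-prefix scenario are illustrative assumptions):

```typescript
// Workers AI pricing for @cf/moonshotai/kimi-k2.5, per million tokens (from the launch post).
const INPUT = 0.60, CACHED_INPUT = 0.10, OUTPUT = 3.00;

// Estimated cost of one agent turn, splitting the prompt into a cached prefix
// (system prompt, tool schemas, repo context) and fresh tokens.
function turnCost(cachedTokens: number, freshTokens: number, outputTokens: number): number {
  return (cachedTokens * CACHED_INPUT + freshTokens * INPUT + outputTokens * OUTPUT) / 1_000_000;
}

// A 100k-token prefix resent every turn, with 2k fresh input and 1k output tokens:
const withCache = turnCost(100_000, 2_000, 1_000); // prefix billed at $0.10/1M -> ~$0.0142/turn
const withoutCache = turnCost(0, 102_000, 1_000);  // prefix billed at $0.60/1M -> ~$0.0642/turn
console.log(withCache, withoutCache);
```

In this hypothetical profile the cached prefix cuts per-turn cost by roughly three quarters, which is why cached-token share, not the headline rate, dominates the bill for long-running agents.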

Agent infrastructure, not just a new model

Cloudflare packaged the Kimi K2.5 launch with three platform changes aimed directly at agent traffic.

First, Workers AI now surfaces cached-token accounting and applies discounted pricing to those tokens. Second, Cloudflare added an x-session-affinity header, which helps route requests for the same session to the same model instance, improving prefix-cache hit rates. Third, the Asynchronous Batch API now uses a pull-based queuing model intended to consume work when capacity is available while protecting synchronous traffic. Cloudflare documents both prompt caching and batch inference in its developer docs.

If your agent architecture relies on persistent instructions, conversation state, or large tool definitions, session affinity is the operational detail to pay attention to. Prefix caching only helps when follow-up requests land where the cached prefix already exists.
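In practice that means pinning one stable session ID per conversation. A sketch of what the request side could look like, assuming Workers AI's REST endpoint and an OpenAI-style messages payload (the header name is from the launch; the exact endpoint path, ACCOUNT_ID placeholder, and payload shape are assumptions to verify against Cloudflare's docs):

```typescript
// Reuse the same session ID for every turn of a conversation so follow-up
// requests land on the instance that already holds the cached prefix.
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

function buildRequest(sessionId: string, messages: ChatMessage[], apiToken: string) {
  return {
    // Endpoint path is illustrative; substitute your real account ID.
    url: "https://api.cloudflare.com/client/v4/accounts/<ACCOUNT_ID>/ai/run/@cf/moonshotai/kimi-k2.5",
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiToken}`,
        "Content-Type": "application/json",
        // Same value every turn -> same instance -> higher prefix-cache hit rate.
        "x-session-affinity": sessionId,
      },
      body: JSON.stringify({ messages }),
    },
  };
}

// Usage: const { url, init } = buildRequest("agent-42", history, token);
// await fetch(url, init);
```

The design point is that affinity is per session, not per request: rotating the ID between turns would defeat the prefix cache even with caching enabled.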

Cloudflare’s internal benchmark is about cost control

Cloudflare tied this launch to its own security-review agent, which it says processes more than 7 billion tokens per day and found more than 15 confirmed issues in a single codebase. On that workload, moving to Kimi K2.5 on Workers AI cut cost by 77% compared with Cloudflare’s estimate for a mid-tier proprietary model, which it put at $2.4 million annually.

Those numbers are specific to Cloudflare’s workload, but they point to a broader pattern. For agent systems that repeatedly inspect large codebases, token economics matter more than leaderboard positioning. This is the same reason AI code review and agent evaluation increasingly depend on usage profiles, not single-run prompt quality.
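Taking Cloudflare's two figures at face value, the back-of-envelope is simple (both inputs come from Cloudflare's claim; the absolute result is only as good as its $2.4M estimate):

```typescript
// Cloudflare's stated numbers: 77% cost cut versus a $2.4M/yr estimate
// for a mid-tier proprietary model on the same security-review workload.
const proprietaryAnnual = 2_400_000;
const savings = 0.77;
const kimiAnnual = Math.round(proprietaryAnnual * (1 - savings)); // ~$552k/yr implied
console.log(kimiAnnual);
```

That implied ~$552k/yr is the kind of delta that swamps small benchmark differences for token-heavy workloads.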

Serving large models at the edge requires different optimizations

Cloudflare says Kimi K2.5 runs on its Infire inference engine and that it wrote custom kernels for the model. It also calls out data, tensor, and expert parallelization, plus disaggregated prefill, where prefill and generation run on different machines.

Those details matter because large-model serving breaks in predictable places. Prefill becomes expensive with long contexts. Interactive latency fights with background throughput. GPU utilization falls if every request pattern is treated the same. Cloudflare’s batch redesign is an attempt to separate those concerns instead of letting async agent jobs starve live traffic.

Cloudflare did not publish GPU type, TTFT, or tokens-per-second figures for this launch, so the practical comparison point is cost and feature coverage rather than latency benchmarks.

Upstream model positioning

Moonshot describes Kimi K2.5 as a native multimodal model built with continued pretraining over about 15T mixed visual and text tokens, with support for up to 100 sub-agents and 1,500 tool calls in its agent swarm design. Moonshot also reports benchmark results including 76.8 on SWE-Bench Verified, 78.5 on MMMU-Pro, 88.8 on OmniDocBench 1.5, 86.6 on VideoMMMU, and 60.6 on BrowseComp in its Kimi K2.5 technical writeup.

For developers, the more relevant implication is deployment fit. If you are already choosing between proprietary coding models and open-model infrastructure, this release gives Workers AI a stronger story for stateful, tool-using agents. It also connects naturally to Cloudflare’s broader push into hosted agent runtimes, including dynamic code execution for agents.

If you run agents on Cloudflare, test one workload with repeated long prefixes, enable x-session-affinity, and measure cached-token share before changing anything else. That is where this launch has the clearest production payoff.
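Measuring that share can be a few lines once you read usage back from each response. A sketch, assuming Workers AI reports cached tokens in an OpenAI-style usage object (the prompt_tokens_details.cached_tokens location is an assumption; check the actual response shape):

```typescript
// Cached-token share = cached prompt tokens / total prompt tokens.
// Field names follow the OpenAI-style usage object; verify against the real response.
type Usage = {
  prompt_tokens: number;
  completion_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number };
};

function cachedShare(usage: Usage): number {
  const cached = usage.prompt_tokens_details?.cached_tokens ?? 0;
  return usage.prompt_tokens > 0 ? cached / usage.prompt_tokens : 0;
}

// A long-prefix agent with affinity working should trend toward a high share:
cachedShare({
  prompt_tokens: 102_000,
  completion_tokens: 1_000,
  prompt_tokens_details: { cached_tokens: 100_000 },
}); // ~0.98
```

A share near zero on repeated-prefix traffic is the signal that requests are not landing on the instance holding the cache, which is exactly what session affinity is meant to fix.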
