Gimlet Labs Raises $80M Series A for AI Inference
Gimlet Labs raised an $80 million Series A led by Menlo Ventures to scale its multi-silicon AI inference cloud.
Gimlet Labs raised an $80 million Series A on March 23, bringing total funding to $92 million and putting fresh attention on a specific inference thesis: frontier AI workloads should run across multiple chip architectures, not a single GPU fleet. For teams operating large-scale serving systems, Gimlet’s funding announcement matters because it pairs new capital with concrete claims about heterogeneous inference performance, customer growth, and production deployment.
Menlo Ventures led the round, with Factory, Eclipse, Prosperity7, and Triatomic participating. Gimlet says demand has accelerated since its October 2025 debut, with its customer base tripling and new wins at one top-three frontier lab and one top-three hyperscaler, though those customers remain unnamed.
Product strategy
Gimlet’s product is Gimlet Cloud, which it describes as a multi-silicon inference cloud. The company says it can automatically map agentic workloads to different chips, slice a single model across architectures, and deploy either in Gimlet-managed datacenters or in a customer’s own environment.
The technical pitch is more specific than the headline. Gimlet’s launch architecture centers on an intelligent workload orchestrator, a hardware-agnostic compiler, and autonomous kernel generation. Its broader platform work spans an MLIR-based compiler, SLA-aware datacenter scheduling, and support for heterogeneous hardware targets.
If you build AI agents or long-running tool-using systems, this is the key design point. End-to-end agent workloads are not one uniform inference problem. Prefill, decode, tool calls, retrieval, and orchestration stress different parts of the system, which is why single-silicon optimization often leaves cost or latency on the table.
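One way to picture that design point is a router that classifies each serving stage by its dominant bottleneck and sends it to a matching hardware pool. The sketch below is purely illustrative: the stage taxonomy, pool names, and routing table are assumptions for this example, not Gimlet's actual API.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical stage taxonomy; real systems have more stages and finer signals.
class Stage(Enum):
    PREFILL = "prefill"    # compute-bound: long prompt processed in parallel
    DECODE = "decode"      # memory-bandwidth-bound: one token at a time
    TOOL_CALL = "tool"     # I/O-bound: waiting on external services

@dataclass
class Request:
    stage: Stage
    prompt_tokens: int
    max_new_tokens: int

# Map each stage to the hardware pool best suited to its bottleneck.
# Pool names are made up for illustration.
POOLS = {
    Stage.PREFILL: "compute-optimized",
    Stage.DECODE: "bandwidth-optimized",
    Stage.TOOL_CALL: "cpu-orchestration",
}

def route(req: Request) -> str:
    """Pick a hardware pool based on the request's dominant bottleneck."""
    return POOLS[req.stage]
```

On a homogeneous fleet this routing is a no-op; the heterogeneous bet is that each row of the table can point at different silicon.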
Performance claims
Gimlet attaches aggressive numbers to the thesis. The company says customers can see 3–10x speedups at the same cost and power envelope, plus order-of-magnitude better performance per watt. It also frames the problem at infrastructure scale, saying inference is reaching quadrillions of tokens per month and AI datacenter spending this year is heading toward $650 billion in CapEx.
Gimlet has not yet published a public benchmark suite backing the March 23 figures, but its recent technical examples show how the system is meant to work in practice.
Heterogeneous inference examples
One example is multivendor prefill and decode disaggregation. Gimlet says combinations such as NVIDIA B200 + Intel Gaudi 3 can deliver 1.7x TCO improvement over single-vendor disaggregation on common workloads.
A second example is speculative decoding on d-Matrix Corsair, where Gimlet says gpt-oss-120b paired with a 1.6B speculative decoder achieved 2–5x end-to-end request speedup versus running the same speculative decoder on GPU in interactivity-optimized setups, and up to 10x speedup in energy-optimized configurations.
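Speculative decoding itself is a well-known technique, and the structure of the claim is easier to evaluate with the algorithm in view. The toy sketch below uses greedy verification with stand-in functions for the draft and target models; production systems sample from full distributions and use rejection sampling, so this is a simplified illustration, not Gimlet's or d-Matrix's implementation.

```python
from typing import Callable, List

def speculative_decode(
    prompt: List[int],
    draft_next: Callable[[List[int]], int],   # cheap small model (e.g. the 1.6B drafter)
    target_next: Callable[[List[int]], int],  # expensive large model being accelerated
    k: int,
    n_tokens: int,
) -> List[int]:
    """Generate n_tokens greedily, letting the draft model propose k tokens per step."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # 1. Draft model proposes k tokens cheaply, one after another.
        proposal, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. Target model verifies; keep the longest agreeing prefix.
        #    In real systems this verification is one batched forward pass,
        #    which is where the speedup comes from.
        accepted, ctx = [], list(out)
        for t in proposal:
            if target_next(ctx) == t:
                accepted.append(t)
                ctx.append(t)
            else:
                break
        # 3. Always take one corrected token from the target, so progress
        #    is guaranteed even when the draft is wrong immediately.
        accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[: len(prompt) + n_tokens]
```

The speedup depends on the acceptance rate: when the drafter agrees with the target, each expensive verification pass yields several tokens instead of one, which is why the drafter's hardware placement matters.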
The hardware detail matters here. d-Matrix Corsair is described with 2 GB of on-chip SRAM and up to 150 TB/s memory bandwidth. This aligns with Gimlet’s March analysis of SRAM-centric inference, where the company argues that different inference stages are constrained by different resource bottlenecks, especially near-compute versus far-compute memory.
Supported silicon
The March 23 release names a wide silicon range: NVIDIA, AMD, Intel, ARM, Cerebras, and d-Matrix.
That breadth is the company's central bet, not a compatibility footnote. Gimlet is wagering that the winning inference layer sits above chip vendors and treats hardware as a scheduling and compilation target. This is adjacent to the operational concerns behind work on production inference across GPU clusters, but Gimlet extends the idea beyond one vendor's stack.
Why developers should care
Inference infrastructure is starting to look more like workload routing than straightforward model serving. If your system mixes chat, retrieval, tool use, and structured outputs, you are already dealing with heterogeneous bottlenecks even if your fleet is homogeneous. The practical question is whether your serving layer knows enough about workload shape to route prefill, decode, and auxiliary stages differently.
This affects API economics too. Teams focused on reducing LLM costs in production usually start with prompt trimming, caching, and smaller models. Gimlet’s approach targets the infrastructure layer underneath those tactics. When agent systems generate 5–15x more tokens than traditional chat workloads, which Gimlet claimed at launch, serving architecture becomes a first-order product decision.
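The arithmetic behind that claim is simple to work through. The numbers below are made up for illustration: the price, per-request token counts, and request volume are assumptions, with only the 5–15x agent multiplier taken from Gimlet's launch claim.

```python
# Hypothetical figures: price and volumes are illustrative, not any
# provider's actual rates. Only the 10x multiplier (within the claimed
# 5-15x range) comes from the article.
PRICE_PER_M_OUTPUT_TOKENS = 10.00   # dollars, assumed
CHAT_TOKENS_PER_REQUEST = 500       # assumed
AGENT_MULTIPLIER = 10               # within the claimed 5-15x range
REQUESTS_PER_MONTH = 1_000_000      # assumed

def monthly_cost(tokens_per_request: int) -> float:
    total_tokens = tokens_per_request * REQUESTS_PER_MONTH
    return total_tokens / 1_000_000 * PRICE_PER_M_OUTPUT_TOKENS

chat_cost = monthly_cost(CHAT_TOKENS_PER_REQUEST)                       # $5,000
agent_cost = monthly_cost(CHAT_TOKENS_PER_REQUEST * AGENT_MULTIPLIER)   # $50,000
```

At 10x the tokens, a line item that was rounding error becomes a budget conversation, which is the sense in which serving architecture turns into a product decision.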
It also matters for agent evaluation. Latency, tail behavior, and tool turnaround shape perceived intelligence in production just as much as model quality does.
Gimlet’s Series A is a financing event, but the deeper signal is architectural. If you run high-volume inference or agentic workloads, start measuring prefill, decode, and tool stages separately. The teams that benefit from heterogeneous inference first will be the ones that instrument their workloads well enough to route them.
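Measuring those stages separately does not require new infrastructure. A minimal sketch, assuming nothing beyond the Python standard library: tag each serving stage with a context manager and accumulate wall-clock time per stage, so prefill, decode, and tool latency can be compared directly. The stage names and sleep calls are placeholders for real work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per named stage.
stage_seconds: dict = defaultdict(float)

@contextmanager
def stage(name: str):
    """Time a block of work and attribute it to a named serving stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_seconds[name] += time.perf_counter() - start

# Usage inside a request handler (sleeps stand in for real work):
with stage("prefill"):
    time.sleep(0.01)   # prompt processing
with stage("decode"):
    time.sleep(0.02)   # token generation
with stage("tool"):
    time.sleep(0.005)  # external tool call
```

Once the breakdown exists, the routing question becomes empirical: if decode dominates, bandwidth-optimized hardware is the lever; if tool turnaround dominates, no accelerator change will help.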