Google launches TPU 8t for training and TPU 8i for inference
Google's eighth-generation TPUs split into the 8t for frontier training and the 8i for low-latency inference, with Broadcom and MediaTek as silicon design partners.
Google announced the eighth generation of Tensor Processing Units at the Google Cloud Next conference, splitting its silicon strategy into two specialized architectures. The TPU 8t focuses on massive-scale model training, while the TPU 8i targets the high-concurrency demands of real-time AI agents. For developers managing high-volume workloads, this dual-track hardware approach changes the cost and latency profiles of deploying models in production.
TPU 8t for Pretraining Scale
The TPU 8t is built for frontier model pretraining. A single superpod delivers 121 FP4 exaflops of compute, representing a 2.8x increase over the previous TPU v7 Ironwood generation. Google expanded the scaling limit, allowing clusters to reach 9,600 chips per superpod.
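The headline figures imply a per-chip number worth knowing when you size a job. A minimal back-of-envelope sketch, assuming the quoted 121 exaflops applies to a full 9,600-chip superpod (the announcement does not state per-chip figures directly):

```python
# Back-of-envelope: per-chip FP4 compute implied by the superpod figures.
# Assumption: the 121-exaflop figure covers a full 9,600-chip superpod.
POD_EXAFLOPS = 121         # FP4 exaflops per superpod (quoted)
CHIPS_PER_POD = 9_600      # scaling limit (quoted)

per_chip_pflops = POD_EXAFLOPS * 1e18 / CHIPS_PER_POD / 1e15
V7_POD_EXAFLOPS = POD_EXAFLOPS / 2.8  # implied by the 2.8x generational claim

print(f"Implied per-chip FP4 compute: {per_chip_pflops:.1f} PFLOPs")
print(f"Implied prior-generation pod: {V7_POD_EXAFLOPS:.1f} exaflops")
```

That works out to roughly 12.6 FP4 petaflops per chip, though the v7 comparison assumes pod sizes are held constant across generations, which Google has not confirmed.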
Data flow dictates training speed at this scale. The TPU 8t includes 216 GB of High Bandwidth Memory (HBM) per chip with 6.5 TB/s of bandwidth. A new system called TPUDirect pulls data directly into the chip, with storage access Google claims is 10x faster, keeping utilization high during long training runs. The architecture integrates SparseCore to accelerate the irregular memory lookups common in large language models. Networking relies on the new Virgo Network fabric arranged in a 3D torus topology.
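The bandwidth figure matters because it sets the roofline: how much arithmetic a kernel must do per byte moved before the chip is compute-bound rather than memory-bound. A hedged sketch, using the per-chip FP4 figure implied by dividing 121 exaflops across 9,600 chips (an assumption, since per-chip specs were not quoted):

```python
# Roofline sketch: arithmetic intensity needed to saturate the chip.
# Assumption: ~12.6 PFLOPs FP4 per chip (121 exaflops / 9,600 chips);
# 6.5 TB/s HBM bandwidth is quoted. Real kernels will vary.
PEAK_FLOPS = 121e18 / 9_600    # implied per-chip FP4 FLOP/s
HBM_BYTES_PER_S = 6.5e12       # quoted HBM bandwidth

ridge_point = PEAK_FLOPS / HBM_BYTES_PER_S  # FLOPs/byte to be compute-bound
print(f"Ridge point: ~{ridge_point:,.0f} FLOPs per byte")

def attainable_pflops(intensity_flops_per_byte: float) -> float:
    """Roofline model: min of the compute roof and the bandwidth roof."""
    return min(PEAK_FLOPS, intensity_flops_per_byte * HBM_BYTES_PER_S) / 1e15

print(f"At 100 FLOPs/byte: {attainable_pflops(100):.2f} PFLOPs (bandwidth-bound)")
```

Under these assumptions a kernel needs nearly 2,000 FP4 operations per HBM byte to hit peak, which is why features like TPUDirect and SparseCore that keep data moving matter as much as raw FLOPs.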
TPU 8i for Agent Inference
Serving AI models requires fundamentally different hardware priorities than training. The TPU 8i addresses the specific latency constraints of multi-agent coordination patterns. Google claims an 80% improvement in performance-per-dollar compared to the previous generation.
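It is worth translating that claim into serving costs. A performance-per-dollar gain of 80% does not mean bills drop 80%; at constant workload, cost per unit of work falls to 1/1.8 of the baseline. A quick sketch, assuming utilization stays constant:

```python
# What an 80% performance-per-dollar gain means for serving costs,
# assuming workload and utilization are held constant.
perf_per_dollar_gain = 0.80
relative_cost = 1 / (1 + perf_per_dollar_gain)  # cost per unit of work

print(f"Cost per unit of work: {relative_cost:.0%} of the prior baseline")
print(f"Effective savings: {1 - relative_cost:.0%}")
```

So the quoted figure implies roughly 44% lower cost for the same inference volume, before any pricing changes Google layers on top.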
The memory architecture is designed to keep active working sets entirely on-chip. Each TPU 8i pairs 288 GB of HBM with 384 MB of on-chip SRAM, tripling the SRAM capacity of the v7 chips. This configuration minimizes trips to main memory.
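The practical question for deployment is whether your model's weights fit in a single chip's HBM or must be sharded. A minimal capacity sketch; the 288 GB figure is from the announcement, while the model sizes and FP4 quantization below are illustrative assumptions:

```python
# Capacity sketch: does a model's weight footprint fit in TPU 8i HBM?
# 288 GB HBM is quoted; the parameter counts below are illustrative.
HBM_GB = 288

def weight_footprint_gb(params_billions: float, bits_per_weight: int = 4) -> float:
    """Weight memory at a given quantization level (FP4 by default)."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for params in (70, 400, 700):
    gb = weight_footprint_gb(params)
    verdict = "fits" if gb <= HBM_GB else "needs sharding"
    print(f"{params}B params @ FP4: {gb:.0f} GB -> {verdict} in {HBM_GB} GB HBM")
```

At 4-bit precision a 400B-parameter model occupies about 200 GB, leaving headroom for KV cache; a 700B model would still need to be split across chips.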
Network topology also shifts for these workloads. The traditional 3D torus is replaced by a new architecture called Boardfly, which increases the port count per chip to reduce network diameter, cutting latency by up to 50%. An on-chip Collectives Acceleration Engine (CAE) offloads global collective operations, reducing their latency by an additional factor of five. If you manage AI inference infrastructure, these hardware-level optimizations directly impact time-to-first-token.
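The diameter argument is easy to see with hop-count arithmetic. A sketch comparing a 3D torus to a high-radix fabric; the torus math is standard, but Google has not published Boardfly's topology, so the high-radix figure below is an illustrative assumption:

```python
# Topology sketch: hop-count diameter of a 3D torus vs. a high-radix fabric.
# The torus formula is standard; the Boardfly numbers are NOT public, so the
# high-radix diameter below is an illustrative assumption only.
def torus_diameter(x: int, y: int, z: int) -> int:
    """Max shortest-path hops in a 3D torus (wraparound links)."""
    return x // 2 + y // 2 + z // 2

# A 9,600-chip pod could be arranged, for example, as a 20 x 20 x 24 torus.
d_torus = torus_diameter(20, 20, 24)
print(f"20x20x24 torus diameter: {d_torus} hops")

# High-radix, dragonfly-style fabrics typically bound diameter at 2-3 hops
# regardless of scale -- raising per-chip port count is what buys this.
d_high_radix = 3  # illustrative assumption
print(f"High-radix fabric diameter: ~{d_high_radix} hops")
```

Going from tens of hops to a small constant is where a claimed 50% latency cut for small, latency-sensitive collectives becomes plausible.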
Architecture Comparison
| Hardware | Primary Workload | Headline Performance / Scale | Memory | Networking Fabric |
|---|---|---|---|---|
| TPU 8t | Pretraining | 121 FP4 exaflops per superpod (9,600 chips) | 216 GB HBM at 6.5 TB/s, plus SparseCore | Virgo (3D torus) |
| TPU 8i | Agent inference | 80% perf/$ gain over prior generation | 288 GB HBM + 384 MB on-chip SRAM | Boardfly (high port count) |
System Integration and Supply Chain
Google is moving away from x86 host processors. Both the 8t and 8i are hosted on custom Axion ARM-based processors. This full-stack integration removes the legacy bottlenecks associated with traditional CPU hosts.
The manufacturing strategy splits production between two silicon vendors. Broadcom developed the TPU 8t training chip. MediaTek secured the contract for the TPU 8i inference chip, marking a significant expansion for the company into data center hardware.
Capacity commitments are already scaling. Anthropic signed an agreement for up to 3.5 GW of next-generation TPU power. Google will also deploy NVIDIA Vera Rubin NVL72 rack-scale systems in the second half of 2026, integrating them onto the same Virgo networking fabric used by the TPU 8t clusters.
Hardware specialization forces a reevaluation of deployment architectures. If you build systems requiring high-frequency agent interactions, the TPU 8i shifts the economic threshold for serving complex models at scale. Evaluate your current inference provider’s roadmap to see if lower latency guarantees are arriving later this year, and adjust your routing logic to take advantage of the reduced overhead.
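What "adjust your routing logic" looks like in practice is splitting latency-sensitive agent traffic from batch work at the gateway. A hypothetical sketch; the backend names and thresholds are illustrative, not real endpoints or APIs:

```python
# Hypothetical routing sketch: send interactive agent calls to
# inference-optimized capacity and batch jobs to throughput capacity.
# Backend labels and the 500 ms threshold are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Request:
    latency_budget_ms: int
    interactive: bool

def route(req: Request) -> str:
    """Pick a backend class based on latency sensitivity."""
    if req.interactive or req.latency_budget_ms < 500:
        return "inference-optimized"   # e.g., TPU 8i-class capacity
    return "throughput-optimized"      # batch / offline serving

print(route(Request(latency_budget_ms=120, interactive=True)))
print(route(Request(latency_budget_ms=60_000, interactive=False)))
```

The point of the split is economic: once the fabric and memory are tuned for short, bursty agent calls, mixing batch traffic onto the same capacity wastes the latency optimizations you are paying for.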