
Google launches TPU 8t for training and TPU 8i for inference

Google's eighth-generation TPUs split into the 8t for frontier training and the 8i for low-latency inference, with Broadcom and MediaTek as silicon design partners.

Google announced the eighth generation of Tensor Processing Units at the Google Cloud Next conference, splitting its silicon strategy into two specialized architectures. The TPU 8t focuses on massive-scale model training, while the TPU 8i targets the high-concurrency demands of real-time AI agents. For developers managing high-volume workloads, this dual-track hardware approach changes the cost and latency profiles of deploying models in production.

TPU 8t for Pretraining Scale

The TPU 8t is built for frontier model pretraining. A single superpod delivers 121 FP4 exaflops of compute, representing a 2.8x increase over the previous TPU v7 Ironwood generation. Google expanded the scaling limit, allowing clusters to reach 9,600 chips per superpod.
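The published superpod figures imply a per-chip compute number that is worth working out. A quick back-of-envelope calculation, using only the 121-exaflop and 9,600-chip figures from the announcement:

```python
# Back-of-envelope: per-chip FP4 throughput implied by the published
# superpod numbers (121 FP4 exaflops across 9,600 chips). Both inputs
# come from the announcement; the derivation is simple arithmetic.

POD_EXAFLOPS = 121      # FP4 exaflops per superpod
CHIPS_PER_POD = 9_600   # chips per superpod

pod_flops = POD_EXAFLOPS * 1e18                 # total FP4 FLOP/s
per_chip_pflops = pod_flops / CHIPS_PER_POD / 1e15

print(f"Implied per-chip FP4 compute: {per_chip_pflops:.1f} PFLOP/s")
```

That works out to roughly 12.6 FP4 petaflops per chip, assuming compute scales linearly across the pod.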

Data flow dictates training speed at this scale. The TPU 8t includes 216 GB of High Bandwidth Memory (HBM) per chip with 6.5 TB/s of bandwidth. It uses a new system called TPUDirect to stream data from storage directly to the chip, delivering roughly 10x faster storage access and keeping utilization high during long training runs. The architecture integrates SparseCore to accelerate the irregular memory lookups common in large language models. Networking relies on the new Virgo Network fabric arranged in a 3D torus topology.
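The bandwidth figure lets you estimate the chip's roofline ridge point: how many FLOPs a kernel must perform per byte of HBM traffic before it stops being memory-bound. This sketch uses the article's 6.5 TB/s bandwidth and the per-chip compute implied by the superpod numbers:

```python
# Rough roofline check for the TPU 8t: arithmetic intensity (FLOPs per
# HBM byte) at which a kernel transitions from memory-bound to
# compute-bound. The 6.5 TB/s figure is from the article; per-chip
# compute is implied by 121 exaflops / 9,600 chips.

HBM_BANDWIDTH = 6.5e12            # bytes/s
PER_CHIP_FLOPS = 121e18 / 9_600   # implied FP4 FLOP/s per chip

ridge_point = PER_CHIP_FLOPS / HBM_BANDWIDTH   # FLOPs per byte
print(f"Ridge point: ~{ridge_point:.0f} FLOPs per HBM byte")
```

Kernels below that intensity are limited by the 6.5 TB/s memory system, which is why features like TPUDirect and SparseCore, both aimed at feeding the chip faster, matter as much as raw compute.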

TPU 8i for Agent Inference

Serving AI models requires fundamentally different hardware priorities than training. The TPU 8i addresses the specific latency constraints of multi-agent coordination patterns. Google claims an 80% improvement in performance-per-dollar compared to the previous generation.

The memory architecture is designed to keep active working sets entirely on-chip. Each TPU 8i pairs 288 GB of HBM with 384 MB of on-chip SRAM, tripling the SRAM capacity of the v7 chips. This configuration minimizes trips to main memory.
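To see what 384 MB of SRAM buys in practice, a rough sizing exercise helps. The SRAM figure is from the article; the model dimensions and FP8 precision below are hypothetical placeholders for a mid-sized model, not anything Google has disclosed:

```python
# Illustrative check: how many tokens of KV cache fit in the TPU 8i's
# 384 MB of on-chip SRAM? The SRAM size is from the article; the layer,
# head, and precision numbers are assumed for illustration only.

SRAM_BYTES = 384 * 1024**2   # 384 MB on-chip SRAM (from the article)
LAYERS = 32                  # assumption
KV_HEADS = 8                 # assumption (grouped-query attention)
HEAD_DIM = 128               # assumption
BYTES_PER_VALUE = 1          # assumption: FP8 KV cache

# Keys and values are both cached, hence the factor of 2.
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
tokens_in_sram = SRAM_BYTES // kv_bytes_per_token
print(f"{kv_bytes_per_token} bytes/token -> ~{tokens_in_sram} tokens in SRAM")
```

Under these assumptions a few thousand tokens of KV cache fit entirely on-chip, which is the kind of working set the design appears to target.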

Network topology also shifts for these workloads. The traditional 3D torus is replaced by a new architecture called Boardfly. This increases the port count per chip to reduce network diameter, cutting latency by up to 50%. An on-chip Collectives Acceleration Engine (CAE) offloads collective operations, reducing their latency by an additional factor of five. If you manage AI inference infrastructure, these hardware-level optimizations directly impact time-to-first-token.
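The diameter argument is easy to see with standard topology math. A 3D torus of k x k x k chips has a worst-case hop count of 3 x (k // 2), which grows with cluster size; the torus formula below is standard, while the Boardfly comparison is illustrative only, since Google has not published its radix:

```python
# Why higher radix cuts network diameter: worst-case hop count in a
# k*k*k 3D torus grows with k, while a higher-radix fabric (more ports
# per chip, as described for Boardfly) keeps diameter near-constant.

def torus_diameter(k: int) -> int:
    """Worst-case hop count for a k x k x k 3D torus."""
    return 3 * (k // 2)

# An illustrative ~4,096-chip cluster (16^3):
print(torus_diameter(16))   # 24 hops worst case
# A two-level high-radix fabric typically needs only 2-3 hops at the
# same scale, which is the property the article attributes to Boardfly.
```

Fewer hops on the worst-case path translates directly into lower tail latency for the all-to-all traffic patterns that multi-agent serving generates.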

Architecture Comparison

| Hardware | Primary Workload | Peak Compute / Scaling | Memory | Networking Fabric |
|---|---|---|---|---|
| TPU 8t | Pretraining | 121 FP4 exaflops (9,600 chips/pod) | 216 GB HBM + SparseCore | Virgo (3D torus) |
| TPU 8i | Agent inference | 80% perf/$ gain | 288 GB HBM + 384 MB SRAM | Boardfly |

System Integration and Supply Chain

Google is moving away from x86 host processors. Both the 8t and 8i are hosted on custom Axion ARM-based processors. This full-stack integration removes the legacy bottlenecks associated with traditional CPU hosts.

The manufacturing strategy splits production between two silicon vendors. Broadcom developed the TPU 8t training chip. MediaTek secured the contract for the TPU 8i inference chip, marking a significant expansion for the company into data center hardware.

Capacity commitments are already scaling. Anthropic signed an agreement for up to 3.5 GW of next-generation TPU power. Google will also deploy NVIDIA Vera Rubin NVL72 rack-scale systems in the second half of 2026, integrating them onto the same Virgo networking fabric used by the TPU 8t clusters.

Hardware specialization forces a reevaluation of deployment architectures. If you build systems requiring high-frequency agent interactions, the TPU 8i shifts the economic threshold for serving complex models at scale. Evaluate your current inference provider's roadmap for lower-latency guarantees arriving later this year, and adjust your routing logic to take advantage of the reduced overhead.
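As a minimal sketch of what "adjust your routing logic" might look like, the snippet below routes requests to whichever backend currently shows the lowest time-to-first-token. The backend names and latency figures are hypothetical; in production you would feed in live measurements:

```python
# Minimal latency-aware routing sketch. Backend names and latencies are
# hypothetical placeholders; real deployments would use measured
# time-to-first-token per backend pool.

def route(latency_ms: dict) -> str:
    """Pick the backend with the lowest observed latency."""
    return min(latency_ms, key=latency_ms.get)

# Hypothetical measurements for two serving pools:
backends = {"tpu-v7-pool": 180.0, "tpu-8i-pool": 95.0}
print(route(backends))   # tpu-8i-pool
```

A real router would also weigh cost, queue depth, and capacity, but even this simple policy captures the shift the article describes: when one pool's latency floor drops, traffic should follow.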
