The Context: An Arms Race
In the midst of AI's rapid development, the leading tech giants have launched an unprecedented arms race in computing power: Meta, Amazon, Google, and Microsoft alone are spending hundreds of billions of dollars a year in combined capital expenditure.
To gauge the depth of NVIDIA's "moat", particularly as Google's TPU matures in both technology and ecosystem, we must ask: will it deal a substantial blow to NVIDIA's market share? Notable signals include Apple training its foundation models on over 8,000 TPUs, Anthropic committing to use up to 1 million TPUs, and even OpenAI beginning to rent TPUs for inference.
This represents a clash of two distinct design philosophies: NVIDIA pursues a general-purpose route ("do everything"), while Google pursues a specialised route ("do one thing—AI—to the extreme"). Can Google’s specialised chips truly shake NVIDIA’s status?
To understand the difference between a GPU and a TPU, one must first look at their architectures.
From the era of gaming graphics cards to the current Blackwell and future Rubin, NVIDIA has adhered to a core design: SIMT (Single Instruction, Multiple Threads).
● Flexibility: A chip like the H100 contains thousands of CUDA Cores, each with its own registers and thread state. While they execute the same instruction in lockstep, they can process different data, take different code branches, and access different memory addresses.
● Structure: These cores are organised into Streaming Multiprocessors (SMs), which also contain Tensor Cores for matrix multiplication, Warp Schedulers, and L1 cache, all underpinned by HBM memory with bandwidths of 3.35 TB/s on the H100 and up to 8 TB/s on Blackwell-generation parts.
● The Trade-off: Supporting this flexibility means carrying a full suite of control logic (instruction decoders, dispatch units, warp schedulers) alongside the compute, which occupies significant chip area and power. Essentially, the GPU devotes a considerable portion of its transistors to "thinking" rather than "calculating".
This flexibility allows for complex branching logic (such as if-else statements). In hardware, however, it leads to Warp Divergence: if the threads in a Warp need to take different branches, the GPU serialises execution (running one branch while the other threads wait, then swapping), roughly doubling the clock cycles required for a two-way split. Consequently, while GPUs can do everything from rendering to scientific computing to AI, they pay for it in energy efficiency by stacking control units and caches.
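The divergence penalty can be sketched with a toy cost model (my own illustration; `BRANCH_CYCLES` is an assumed constant, not an NVIDIA figure): a warp pays for every distinct branch that any of its threads takes.

```python
WARP_SIZE = 32
BRANCH_CYCLES = 4  # assumed cost of one branch body, in cycles

def warp_cycles(predicates):
    """Cycles for one warp to execute an if/else: each branch that at
    least one thread takes is run serially, with the other threads
    masked off and waiting."""
    takes_if = any(predicates)
    takes_else = not all(predicates)
    return (int(takes_if) + int(takes_else)) * BRANCH_CYCLES

uniform = [True] * WARP_SIZE                         # all threads agree
divergent = [i % 2 == 0 for i in range(WARP_SIZE)]   # half and half
print(warp_cycles(uniform), warp_cycles(divergent))  # 4 8 — divergence doubles the cost
```

With a uniform predicate the warp runs one pass; as soon as even one thread disagrees, both paths execute back to back.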
Google has taken an extreme route: the Systolic Array.
● The Flow: The core idea is to let data flow between calculation units. Data flows in from one side, is processed, and passed directly to the next unit, avoiding frequent trips back to memory or registers.
● Matrix Multiplication Unit (MXU): The heart of the TPU is the MXU, a massive array (e.g., 128x128). Since the bulk of deep-learning computation (by most estimates, around 90%) is matrix multiplication, Google dedicated the chip area almost exclusively to it. Weight matrices are loaded once and stay stationary; input data flows across them like a wave.
● Efficiency: In a GPU, every calculation involves reading operands from registers and writing results back. In a TPU, data is read from HBM once and then passed along chains of up to 128 compute units without ever returning to memory.
The Impact on Efficiency: Reading data from HBM consumes enough energy to perform roughly 1,000 multiplication operations. By reusing data inside the array, the TPU amortises that cost across thousands of operations. Google's published figures show TPU v4 is 1.2 to 1.7 times more energy-efficient than the NVIDIA A100, and the latest Ironwood (TPU v7) doubles the performance per watt of the v6e. The memory hierarchy is designed so that as data moves closer to the compute units (from HBM to VMEM to vector registers), bandwidth increases and reuse frequency rises.
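The weight-stationary dataflow described above can be simulated in plain Python: each processing element holds one weight, rows of input stream in with a one-cycle skew, and partial sums flow down each column. This is my own toy model (not cycle-exact for any real chip, and the real MXU is 128x128, not 2x2), but it shows the key property: each input element is fetched once and then only handed between neighbours.

```python
def systolic_matmul(A, W):
    """Toy weight-stationary systolic matmul computing C = A @ W.
    W (K x N) is preloaded, one weight per processing element (PE);
    rows of A stream in with a one-cycle skew per array row, and
    partial sums flow down each column. At cycle t, PE[k][n] is
    working on input row m = t - k - n."""
    M, K, N = len(A), len(W), len(W[0])
    C = [[0] * N for _ in range(M)]
    for t in range(M + K + N - 2):   # cycles until the array drains
        for k in range(K):           # array row (holds A's k index)
            for n in range(N):       # array column (holds W's n index)
                m = t - k - n        # which input row this PE sees now
                if 0 <= m < M:
                    # Multiply-accumulate: the input arrives from the
                    # left neighbour, the partial sum from the PE above;
                    # neither value touches HBM again.
                    C[m][n] += A[m][k] * W[k][n]
    return C

print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Every (input row, PE) pair fires exactly one multiply-accumulate, so the result matches an ordinary matrix product while the memory system is touched only at the array's edges.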
● NVIDIA's View: AI algorithms evolve rapidly (e.g., Transformers today, something new tomorrow). Flexibility is paramount, even at the cost of efficiency. Independent instruction pointers are needed for complex control flows.
● Google's View: Deep learning is fundamentally matrix multiplication. Even complex algorithms can be converted into standard matrix operations via the XLA compiler. Therefore, cut the control logic and spend the transistors on calculation units.
Physics of Energy: Performing a math operation costs ~1 pJ (picojoule). Reading from registers costs 5-10 pJ, and reading from HBM costs 1,000+ pJ. Calculation is nearly free; data movement is expensive.
● GPU: Mitigates this with massive caches (L1/L2) and HBM, but essentially still moves data back and forth.
● TPU: Uses systolic arrays to pass data between adjacent units via microscopic wires, making the energy cost of transport negligible. For massive matrix ops, TPU energy efficiency can be 2-3 times that of a GPU.
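Using the picojoule figures above, a back-of-envelope model shows why reuse dominates. Compare a worst-case design that fetches both operands from HBM for every multiply against one that reads each operand from HBM once and forwards it between neighbouring units. The function and constants are illustrative, not measured silicon numbers (real GPUs sit between these extremes thanks to their caches).

```python
# Energy costs per operation, in picojoules, as cited in the text
# (order-of-magnitude figures, not datasheet values).
PJ_MAC = 1      # one multiply-accumulate
PJ_HBM = 1000   # one HBM access

def matmul_energy_pj(M, K, N, reuse):
    """Rough energy for an M x K by K x N matmul."""
    macs = M * K * N
    if reuse:
        # Systolic style: each operand is read from HBM exactly once,
        # then forwarded between adjacent units (treated as free here).
        hbm_reads = M * K + K * N
        return macs * PJ_MAC + hbm_reads * PJ_HBM
    # Worst case: both operands fetched from HBM for every MAC.
    return macs * (PJ_MAC + 2 * PJ_HBM)

naive = matmul_energy_pj(128, 128, 128, reuse=False)
tpu = matmul_energy_pj(128, 128, 128, reuse=True)
print(round(naive / tpu))  # data reuse cuts energy by roughly two orders of magnitude
```

The exact ratio depends on matrix shape, but the direction is clear: once movement costs 1,000x more than arithmetic, the architecture that moves data least wins.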

Training massive models (like GPT-5) requires thousands of cards working in concert, making the interconnect network the bottleneck.
● NVIDIA's Architecture: All data passes through switches. Within a node, 8 GPUs connect via NVLink (a memory-semantic interconnect allowing direct memory access between GPUs). Across nodes, it relies on InfiniBand or RoCE switches.
● Pros: Highly flexible and robust. If a line fails, switches re-route automatically. Suitable for any communication pattern.
● Cons: Extremely expensive and power-hungry. Networking equipment (switches, optical modules, cables) can comprise 20-30% of a cluster's cost.
● Google's Architecture: Chips connect directly to their six neighbours (up, down, left, right, front, back) via ICI (Inter-Chip Interconnect) links, forming a 3D Torus topology with no switches at all.
● Pros: Cost and power savings. Google claims TPU networking costs less than one-fifth of GPU networking.
● Cons: Latency increases with distance (hop count), and a single node failure can break the loop.
● The Solution (OCS): To address reliability, Google uses Optical Circuit Switching (OCS): switches with no electronic packet processing that steer light beams with tiny MEMS mirrors. They can reconfigure the topology in seconds to route around faulty cabinets or to optimise for a specific model architecture (e.g., 3D parallelism).
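The hop-count behaviour of the torus can be sketched directly: because every axis wraps around, the distance per axis is the shorter of the direct path and the wrap-around path. `torus_hops` and the 4x4x4 dimensions below are my own illustration, not the layout of any specific pod.

```python
def torus_hops(a, b, dims):
    """Minimal hop count between chip coordinates a and b on a
    wrap-around 3D torus of size dims. Each axis can wrap, so the
    per-axis distance is min(direct, wrap-around)."""
    return sum(min(abs(x - y), d - abs(x - y))
               for x, y, d in zip(a, b, dims))

dims = (4, 4, 4)  # e.g. a 64-chip slice (illustrative)
print(torus_hops((0, 0, 0), (1, 0, 0), dims))  # 1 — direct neighbour
print(torus_hops((0, 0, 0), (3, 3, 3), dims))  # 3 — wrap-around links help
print(torus_hops((0, 0, 0), (2, 2, 2), dims))  # 6 — worst case on this torus
```

This is exactly why the XLA compiler's placement matters: latency grows with hop count, so the software must keep chatty tensors on neighbouring chips.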
NVIDIA sells general compute for varied clients (e.g., MoE models where data jumps randomly between distant chips), requiring the flexibility of a switch network. Google is vertically integrated; the hardware team knows the software team's needs. The XLA compiler plans the computation graph to ensure frequent communication happens between physically adjacent chips.

NVIDIA currently holds nearly 90% of the commercial AI chip market. However, when including cloud provider ASICs (Google TPU, AWS Trainium, Azure Maia), the picture shifts. TrendForce predicts cloud ASICs will grow at 44.6% annually through 2026, compared to 16.1% for GPUs.
Real-world shifts:
● Anthropic: Committed to using up to 1 million TPUs.
● OpenAI: Has begun renting TPUs for inference, seeking alternatives despite close ties to NVIDIA.
● Midjourney: Reduced inference costs by 65% after migrating to TPUs.
● Training: NVIDIA remains dominant due to the high risk of switching. No one wants to delay a model like GPT-5 by months to save 20% on hardware. Only Google dares to train its flagship (Gemini 3) entirely on TPUs.
● Inference: The landscape is changing fast. Inference is stable, repetitive, and cost-sensitive. Specialised chips like Ironwood (TPU v7), optimised for inference with 192GB memory, offer better total cost of ownership (TCO) due to massive energy savings.
Hardware is only half the moat; the other half is the CUDA ecosystem.
● NVIDIA's Strength: Nearly two decades of development, millions of lines of code, and highly optimised libraries (cuDNN, TensorRT). The "path of least resistance" for developers is PyTorch + CUDA.
● Google's Attack (XLA): Google is betting on a compiler-based approach via XLA (Accelerated Linear Algebra).
○ Instead of hand-writing kernels (like CUDA), developers describe the computation graph in high-level languages like JAX. XLA compiles this into machine code.
○ Operator Fusion: In CUDA, a sequence like Add -> ReLU -> Softmax might require reading/writing to memory three times. XLA fuses these into a single kernel, keeping data in registers and reading from memory only once.
○ The Bet: Google believes that in the long run, compilers will outperform human engineers at optimisation.
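The fusion win can be modelled by counting memory round trips for the Add -> ReLU -> Softmax chain above: the unfused version writes every intermediate back to memory, while the fused version keeps intermediates "on chip" (here, local variables) and writes only the final result. A conceptual sketch in plain Python, not actual XLA output.

```python
import math

def unfused(x, y):
    """Three separate kernels: each one writes its result to memory."""
    trips = 0
    s = [a + b for a, b in zip(x, y)]; trips += 1   # Add writes s back
    r = [max(v, 0.0) for v in s];      trips += 1   # ReLU reads s, writes r
    m = max(r)                                      # Softmax reads r...
    e = [math.exp(v - m) for v in r]
    z = sum(e)
    out = [v / z for v in e];          trips += 1   # ...and writes out
    return out, trips

def fused(x, y):
    """One fused kernel: intermediates live in 'registers' (locals)."""
    r = [max(a + b, 0.0) for a, b in zip(x, y)]     # never written back
    m = max(r)
    e = [math.exp(v - m) for v in r]
    z = sum(e)
    return [v / z for v in e], 1                    # only out is written

x, y = [1.0, -2.0, 0.5], [0.5, 1.0, -1.0]
out_u, trips_u = unfused(x, y)
out_f, trips_f = fused(x, y)
print(trips_u, trips_f)  # 3 1 — same result, a third of the memory traffic
```

Both functions compute identical values; what the compiler eliminates is not arithmetic but trips to memory, which, per the energy figures earlier, is where the real cost lives.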
The moat is deep, but Google is encroaching on both the hardware and software fronts.
● Impact: TPUs will not kill NVIDIA, but they will likely impact its pricing power and 75% gross margins.
● Short Term: NVIDIA's dominance in training is unlikely to be shaken within the next 5 years.
● Long Term: The ASIC route (TPU) has proven its superior energy efficiency and cost-effectiveness. High-end clients (Apple, OpenAI) are diversifying. Furthermore, as compiler technology (XLA) improves, the "lock-in" effect of CUDA will weaken.
Ultimately, this competition is beneficial. It prevents monopoly, drives technical progress, reduces costs, and paves the way for the democratisation of AI.