Why this story matters now

On December 17, 2025, Reuters reported that Google is working with Meta to make PyTorch feel “native” on Google’s Tensor Processing Units (TPUs) via an internal effort dubbed TorchTPU—a direct attempt to reduce Nvidia’s long‑standing software edge built around CUDA. Google is also weighing open‑sourcing parts of the stack and has begun selling TPUs directly into customer data centers, not just via Google Cloud. Reuters.

<<stat label="Papers using NVIDIA hardware (2024)" value="91%" source="state-of-ai-compute-index-2024">

<<endstat>>

That headline number from the State of AI Compute Index hints at why this matters: moving PyTorch off “CUDA first” muscle memory and onto TPUs (without pain) would chip away at a core pillar of Nvidia’s software moat.

[Illustration: a PyTorch bridge spanning a software moat toward a TPU chip, with developers crossing.]

The moat: CUDA’s tight embrace of PyTorch

For years, PyTorch has been the default language of modern AI research and engineering, while CUDA has been the fastest path to production on Nvidia GPUs. PyTorch’s own docs center CUDA semantics (e.g., torch.cuda and stream management), reflecting a CUDA‑forward developer experience that reinforced Nvidia’s advantage. PyTorch docs.
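
To make that concrete, here is a minimal sketch of the CUDA-centric semantics the docs describe: querying torch.cuda and submitting work on an explicit stream. It assumes a CUDA-capable GPU and a recent PyTorch build, and falls back to CPU otherwise.

```python
# Illustration of the CUDA-forward developer experience: explicit torch.cuda
# device queries and stream management, as described in the PyTorch docs.
import torch

if torch.cuda.is_available():
    device = torch.device("cuda:0")
    print(torch.cuda.get_device_name(device))

    # Work submitted on a non-default stream runs asynchronously relative to
    # the default stream; synchronize before consuming results on the host.
    stream = torch.cuda.Stream()
    with torch.cuda.stream(stream):
        x = torch.randn(4096, 4096, device=device)
        y = x @ x
    torch.cuda.synchronize()
    print(y.norm().item())
else:
    print("No CUDA device found; the CUDA-specific APIs above are unavailable.")
```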

  • PyTorch’s share of research implementations is estimated at 70%+, according to the PyTorch Foundation’s 2024 review. PyTorch Foundation.
  • Nvidia’s advantage isn’t just FLOPS; it’s the integrated software stack—CUDA, cuDNN, NCCL, and a decade of tooling and know‑how—that makes GPUs easy to scale. Even bullish analysts describe this as a “systems moat.” (Overview perspectives: Reuters; Barron’s).

<<stat label="PyTorch share of research implementations" value="70%+" source="pytorch-2024-year-in-review">

<<endstat>>

What changed: Google and Meta are pushing PyTorch onto TPUs

Reuters’ Dec 17 report says Google is collaborating with Meta (the steward of PyTorch) on “TorchTPU,” aimed at full‑fidelity PyTorch on TPUs and a smoother on‑ramp for PyTorch shops that don’t want to re‑platform to JAX. Google has also elevated TPU go‑to‑market—selling chips into customer data centers and broadening access beyond its own cloud. Reuters.

The “native TPU backend” for PyTorch

Beyond headlines, the PyTorch/XLA team proposed—on October 20, 2025—an RFC to evolve from today’s torch_xla (lazy tensor, explicit mark_step) toward a more “native” TPU backend that aligns with PyTorch’s eager‑first design and integrates with torch.compile. The aim: a TPU device that feels like tensor.to("tpu") with JIT compilation largely hidden and cached. PyTorch/XLA RFC #9684.

  • This complements PyTorch/XLA’s migration to the OpenXLA compiler stack and PJRT runtime, which decouples frameworks (PyTorch, JAX, TensorFlow) from hardware backends (TPU, GPU, CPU). PyTorch/XLA releases; OpenXLA.
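
To make the RFC's contrast concrete, here is a minimal sketch of today's torch_xla flow (lazy tensors plus an explicit mark_step), assuming torch_xla is installed on a TPU host. The commented line at the end shows the eager-style direction the RFC proposes; a plain "tpu" device string is not a shipping API today.

```python
# Today's PyTorch/XLA flow: lazy tensors record a graph, mark_step compiles
# and executes it. Requires torch and torch_xla on a TPU host.
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()                      # today's XLA device handle
model = torch.nn.Linear(128, 64).to(device)
x = torch.randn(8, 128, device=device)        # lazy tensor: ops build a graph
loss = model(x).sum()
loss.backward()
xm.mark_step()                                # compile and run the recorded graph

# Proposed direction per RFC #9684 (illustrative only, not available today):
# model = torch.nn.Linear(128, 64).to("tpu")  # eager-first, JIT hidden and cached
```

The RFC's goal is for that mark_step bookkeeping to disappear behind torch.compile-style compilation and caching.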

TorchAx: bridging PyTorch syntax to JAX→XLA

Google’s TorchAx lets developers write PyTorch‑style code and execute it through JAX on TPUs—exposing a device='jax' target and interoperating between torch.Tensor and jax.Array. That reduces friction for teams comfortable with PyTorch but eager to ride JAX/XLA performance on TPUs. TorchAx docs.
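
A minimal sketch of that workflow, assuming the enable_globally() registration hook and the device='jax' target described in the TorchAx docs; verify the exact entry points against the version you install.

```python
# PyTorch syntax on the front end, JAX/XLA execution underneath (TorchAx-style).
import torch
import torchax

torchax.enable_globally()                    # register the 'jax' device with PyTorch

model = torch.nn.Sequential(
    torch.nn.Linear(784, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).to("jax")                                  # parameters are backed by jax.Array

x = torch.randn(32, 784, device="jax")       # torch.Tensor API, JAX array behind it
logits = model(x)                            # lowered through JAX to XLA (TPU if present)
print(logits.shape)
```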

Inference got a lift first: JetStream and vLLM TPU

Inference is often where enterprises feel cost pressure. In April 2024 Google open‑sourced JetStream, an LLM inference engine for XLA devices starting with TPUs, reporting up to 3× more inferences per dollar on Gemma 7B versus its prior TPU stack. JetStream supports models from PyTorch (via PyTorch/XLA) and JAX. Google Cloud Blog.

<<stat label="JetStream LLM inferences per dollar (vs prior TPU stack)" value="Up to 3x" source="gcloud-jetstream-2024">

<<endstat>>

In 2025, the vLLM project (popular for high‑throughput generation) announced a redesigned TPU backend. It can serve PyTorch‑defined models on TPUs by lowering through JAX under the hood, improving throughput and model coverage while keeping the vLLM UX consistent. vLLM blog; vLLM TPU docs.
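
In practice, "keeping the vLLM UX consistent" means the familiar offline-inference API works unchanged on a TPU build of vLLM. A minimal sketch, with the checkpoint below chosen purely as an example:

```python
# Standard vLLM offline-inference API; on the TPU backend the same code runs
# with models lowered to XLA under the hood. Assumes a TPU build of vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-2b")            # example model; swap in your own
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain what a TPU is in one paragraph."], params)
for out in outputs:
    print(out.outputs[0].text)
```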

Meanwhile, Google and Hugging Face launched Optimum‑TPU for training and serving open models (Llama, Gemma, Mistral) on v5e/v6e with TGI and JetStream options. Google Cloud; Optimum‑TPU PyPI; HF partnership update.

The hardware runway: Trillium (v6e)

Google’s sixth‑generation TPU, Trillium (v6e), reached GA in late 2024 with a 4.7× per‑chip compute uplift over v5e and 67% better energy efficiency—headroom JetStream/vLLM can exploit. Google Cloud; release notes.

<<stat label="Trillium (v6e) compute per chip vs v5e" value="4.7x" source="gcloud-trillium-2024">

<<endstat>>

Meta’s angle: cost, leverage, and Llama everywhere

Meta is PyTorch’s primary backer and the force behind Llama. According to Reuters (Nov 25, 2025), Meta has discussed renting Google Cloud TPUs in 2026 and potentially buying TPUs for its own data centers starting in 2027—moves that would diversify away from Nvidia and reduce inference costs while increasing negotiating leverage. Reuters.

From a developer‑experience standpoint, one critique has been that PyTorch does not run “natively” on TPUs today; it relies on the PyTorch/XLA bridge instead. That gap is exactly what the PyTorch/XLA RFC and Google’s TorchTPU initiative target. PyTorch/XLA RFC #9684; contextual commentary: The Register.

<<callout type="note" title="What’s announced vs. what’s shipping">

  • Reuters’ Dec 17 report describes Google’s TorchTPU as an internal project in active collaboration with Meta, with some components under consideration for open source. Google has confirmed the goal (more developer choice) but hasn’t publicly product‑named or GA’d a “native TPU backend” yet. Reuters.
  • Today, production options include PyTorch/XLA, JetStream for TPU inference, vLLM TPU, and Optimum‑TPU integrations on v5e/v6e. The PyTorch/XLA RFC signals where “native” is heading.

<<endcallout>>

What this means for AI builders

If you’ve standardized on PyTorch and want a credible non‑GPU path—especially for inference economics—TPUs are becoming far easier to adopt.

  • Minimal‑change paths: PyTorch/XLA on TPU, with PJRT runtime; Optimum‑TPU for HF workflows; vLLM TPU for high‑throughput serving; JetStream for cost‑efficiency.
  • Medium‑change paths: TorchAx to keep PyTorch‑style code while lowering through JAX→XLA for TPU performance.
  • Watchlist: A “native” TPU device in PyTorch (per the PyTorch/XLA RFC), tighter torch.compile integration, and potential open‑sourcing from TorchTPU.

<<callout type="tip" title="A quick, low‑risk test plan for PyTorch shops">

  1. Stand up a small v5e or v6e slice and install PyTorch/XLA 2.8+; use PJRT (PJRT_DEVICE=TPU). Start with reference notebooks and profiling tools. PyTorch/XLA.
  2. Benchmark inference with JetStream and vLLM TPU on the same models (e.g., Llama, Gemma) to compare cost/latency/throughput trade‑offs. JetStream; vLLM TPU.
  3. For Hugging Face pipelines, try Optimum‑TPU (TGI + JetStream PT option) and evaluate operational fit on GKE or Vertex AI. Optimum‑TPU.
  4. Explore TorchAx on a feature branch to gauge how much of your PyTorch code can run via JAX→XLA without refactors. TorchAx.

<<endcallout>>
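
Before benchmarking, a quick smoke test for step 1 can confirm the runtime is wired up. This sketch assumes PyTorch/XLA on a TPU VM and simply checks that a tensor lands on an XLA device via PJRT:

```python
# Smoke test: select the PJRT TPU runtime and run a tiny computation on the
# XLA device before investing in heavier benchmarks.
import os
os.environ.setdefault("PJRT_DEVICE", "TPU")   # PJRT runtime selection from the checklist

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
t = torch.ones(2, 2, device=device) * 3
xm.mark_step()                                # materialize the lazy computation
print(device, t.cpu())
```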

Will this dent Nvidia’s moat?

Short term, Nvidia’s position remains formidable. CUDA’s deep integration with PyTorch and the breadth of mature libraries still make GPUs the default choice, and Nvidia continues to expand that stack. But two structural shifts are notable:

  • Portability pressure is rising. OpenXLA, PJRT, and projects like TorchAx reduce switching costs by letting popular frameworks speak to multiple accelerators. OpenXLA.
  • Inference economics dominate at scale. JetStream, vLLM TPU, and Trillium’s perf/efficiency gains strengthen TPU’s value proposition for serving, where dollars‑per‑million‑tokens matter most. Google Cloud JetStream; Trillium.

At a glance: software paths from PyTorch to accelerators

Path | How it works | Maturity today | Best fit
PyTorch on Nvidia GPUs | Native CUDA path via torch.cuda, cuDNN/NCCL | Production default | Training + inference
PyTorch/XLA on TPU | PyTorch frontend, XLA compiler via PJRT | Widely used; “native” TPU backend proposed | Training + inference on TPUs
TorchAx on TPU | PyTorch syntax lowered through JAX→XLA | Emerging; good for experiments | Easing code migration to TPU
vLLM TPU | High‑throughput serving; PyTorch/JAX models lowered to XLA | Actively developed; strong throughput | Cost‑efficient LLM serving
Optimum‑TPU (HF) | Turnkey training/serving for open models | Production‑ready building blocks | HF pipelines on GKE/Vertex

The bottom line

Google and Meta’s push is less about a single library and more about ending the “CUDA or bust” assumption for PyTorch teams. If PyTorch can feel truly native on TPUs—with mature tooling, predictable performance, and turnkey inference paths—Nvidia’s software moat narrows. That won’t flip the market overnight, but it does create genuine multi‑vendor leverage for builders.


Sources