Why this story matters now
On December 17, 2025, Reuters reported that Google is working with Meta to make PyTorch feel “native” on Google’s Tensor Processing Units (TPUs) via an internal effort dubbed TorchTPU—a direct attempt to reduce Nvidia’s long‑standing software edge built around CUDA. Google is also weighing open‑sourcing parts of the stack and has begun selling TPUs directly into customer data centers, not just via Google Cloud. Reuters.
<<stat label="Papers using NVIDIA hardware (2024)" value="91%" source="state-of-ai-compute-index-2024">
<<endstat>>
That headline number from the State of AI Compute Index hints at why this matters: moving PyTorch off “CUDA first” muscle memory and onto TPUs (without pain) would chip away at a core pillar of Nvidia’s software moat.

The moat: CUDA’s tight embrace of PyTorch
For years, PyTorch has been the default language of modern AI research and engineering, while CUDA has been the fastest path to production on Nvidia GPUs. PyTorch’s own docs center on CUDA semantics (e.g., torch.cuda and stream management), reflecting a CUDA‑forward developer experience that reinforced Nvidia’s advantage. PyTorch docs.
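To see what “CUDA‑forward” looks like in practice, here is the kind of device handling most PyTorch codebases accumulate. Everything below is standard PyTorch API; the point is how naturally hard‑coded "cuda" strings and torch.cuda calls creep into everyday code, which is exactly the muscle memory any alternative backend has to accommodate.

```python
import torch

# The idiomatic PyTorch pattern: assume CUDA, fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(1024, 1024).to(device)
x = torch.randn(8, 1024, device=device)

# CUDA-specific semantics leak into ordinary scripts: streams, sync points,
# and memory stats all live under the torch.cuda namespace.
if device.type == "cuda":
    stream = torch.cuda.Stream()
    with torch.cuda.stream(stream):
        y = model(x)
    torch.cuda.synchronize()
    print(torch.cuda.memory_allocated() // 2**20, "MiB allocated on GPU")
else:
    y = model(x)
```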
- PyTorch’s share of research implementations is estimated at 70%+, according to the PyTorch Foundation’s 2024 review. PyTorch Foundation.
- Nvidia’s advantage isn’t just FLOPS; it’s the integrated software stack—CUDA, cuDNN, NCCL, and a decade of tooling and know‑how—that makes GPUs easy to scale. Even bullish analysts describe this as a “systems moat.” (Overview perspectives: Reuters; Barron’s).
<<stat label="PyTorch share of research implementations" value="70%+" source="pytorch-2024-year-in-review">
<<endstat>>
What changed: Google and Meta are pushing PyTorch onto TPUs
Reuters’ Dec 17 report says Google is collaborating with Meta (PyTorch’s creator and primary corporate backer) on “TorchTPU,” aimed at full‑fidelity PyTorch on TPUs and a smoother on‑ramp for PyTorch shops that don’t want to re‑platform to JAX. Google has also elevated TPU go‑to‑market—selling chips into customer data centers and broadening access beyond its own cloud. Reuters.
The “native TPU backend” for PyTorch
Beyond headlines, the PyTorch/XLA team proposed—on October 20, 2025—an RFC to evolve from today’s torch_xla (lazy tensor, explicit mark_step) toward a more “native” TPU backend that aligns with PyTorch’s eager‑first design and integrates with torch.compile. The aim: a TPU device that feels like tensor.to("tpu") with JIT compilation largely hidden and cached. PyTorch/XLA RFC #9684.
- This complements PyTorch/XLA’s migration to the OpenXLA compiler stack and PJRT runtime, which decouples frameworks (PyTorch, JAX, TensorFlow) from hardware backends (TPU, GPU, CPU). PyTorch/XLA releases; OpenXLA.
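Nothing from that RFC has shipped as a packaged "tpu" device yet, so the snippet below is a hypothetical sketch of the developer experience the proposal describes: the torch.device("tpu") string and the implicit, cached compilation are taken from the RFC’s stated aims, not from a released API.

```python
import torch

# Hypothetical sketch of the RFC's target experience (not a released API).
# Today, TPU execution still goes through torch_xla (xla_device(), mark_step()).
device = torch.device("tpu")   # aspirational device string per the RFC

model = torch.nn.Linear(1024, 1024).to(device)
x = torch.randn(8, 1024, device=device)

# Eager ops would dispatch to the TPU with JIT compilation hidden and cached,
# and torch.compile would provide whole-graph optimization, as it does on GPUs.
compiled_model = torch.compile(model)
y = compiled_model(x)
```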
TorchAx: bridging PyTorch syntax to JAX→XLA
Google’s TorchAx lets developers write PyTorch‑style code and execute it through JAX on TPUs—exposing a device='jax' target and interoperating between torch.Tensor and jax.Array. That reduces friction for teams comfortable with PyTorch but eager to ride JAX/XLA performance on TPUs. TorchAx docs.
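Based on the examples in the TorchAx docs, a minimal flow looks roughly like the sketch below; treat the torchax.enable_globally() call and the exact package name as assumptions drawn from the project’s published examples, while the device="jax" target is the one described above.

```python
import torch
import torchax  # TorchAx: execute PyTorch-style code through JAX -> XLA

# Assumption: enable_globally() registers the 'jax' backend, per TorchAx examples.
torchax.enable_globally()

model = torch.nn.Sequential(
    torch.nn.Linear(784, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).to("jax")                        # the device='jax' target described above

x = torch.randn(32, 784, device="jax")
logits = model(x)                  # runs via JAX/XLA (on TPU when one is attached)
print(logits.shape)
```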
Inference got a lift first: JetStream and vLLM TPU
Inference is often where enterprises feel cost pressure. In April 2024 Google open‑sourced JetStream, an LLM inference engine for XLA devices starting with TPUs, reporting up to 3× more inferences per dollar on Gemma 7B versus its prior TPU stack. JetStream supports models from PyTorch (via PyTorch/XLA) and JAX. Google Cloud Blog.
<<stat label="JetStream LLM inferences per dollar (vs prior TPU stack)" value="Up to 3x" source="gcloud-jetstream-2024">
<<endstat>>
In 2025, the vLLM project (popular for high‑throughput generation) announced a redesigned TPU backend. It can serve PyTorch‑defined models on TPUs by lowering through JAX under the hood, improving throughput and model coverage while keeping the vLLM UX consistent. vLLM blog; vLLM TPU docs.
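The “consistent UX” point is concrete: the entry points below are vLLM’s standard Python API, and the intent of the TPU backend is that the same script runs unchanged on a TPU host once the TPU build is installed (install specifics live in the vLLM TPU docs; the model name here is only an example).

```python
from vllm import LLM, SamplingParams

# Standard vLLM serving API; with the TPU backend installed, the same code
# is meant to target TPUs without changes. Model choice is illustrative.
llm = LLM(model="google/gemma-7b-it", max_model_len=2048)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain what an XLA compiler does."], params)

for out in outputs:
    print(out.outputs[0].text)
```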
Meanwhile, Google and Hugging Face launched Optimum‑TPU for training and serving open models (Llama, Gemma, Mistral) on v5e/v6e with TGI and JetStream options. Google Cloud; Optimum‑TPU PyPI; HF partnership update.
The hardware runway: Trillium (v6e)
Google’s sixth‑generation TPU, Trillium (v6e), reached GA in late 2024 with a 4.7× per‑chip compute uplift over v5e and 67% better energy efficiency—headroom JetStream/vLLM can exploit. Google Cloud; release notes.
<<stat label="Trillium (v6e) compute per chip vs v5e" value="4.7x" source="gcloud-trillium-2024">
<<endstat>>
Meta’s angle: cost, leverage, and Llama everywhere
Meta is PyTorch’s primary backer and the force behind Llama. According to Reuters (Nov 25, 2025), Meta has discussed renting Google Cloud TPUs in 2026 and potentially buying TPUs for its own data centers starting in 2027—moves that would diversify away from Nvidia and reduce inference costs while increasing negotiating leverage. Reuters.
From a developer‑experience standpoint, one critique has been that PyTorch does not run “natively” on TPUs today (relying on PyTorch/XLA). That gap is exactly what the PyTorch/XLA RFC and Google’s TorchTPU initiative target. PyTorch/XLA RFC #9684; contextual commentary: The Register.
<<callout type="note" title="What’s announced vs. what’s shipping">
- Reuters’ Dec 17 report describes Google’s TorchTPU as an internal project in active collaboration with Meta, with some components under consideration for open source. Google has confirmed the goal (more developer choice) but hasn’t publicly product‑named or GA’d a “native TPU backend” yet. Reuters.
- Today, production options include PyTorch/XLA, JetStream for TPU inference, vLLM TPU, and Optimum‑TPU integrations on v5e/v6e. The PyTorch/XLA RFC signals where “native” is heading.
<<endcallout>>
What this means for AI builders
If you’ve standardized on PyTorch and want a credible non‑GPU path—especially for inference economics—TPUs are becoming far easier to adopt.
- Minimal‑change paths: PyTorch/XLA on TPU, with PJRT runtime; Optimum‑TPU for HF workflows; vLLM TPU for high‑throughput serving; JetStream for cost‑efficiency.
- Medium‑change paths: TorchAx to keep PyTorch‑style code while lowering through JAX→XLA for TPU performance.
- Watchlist: a “native” TPU device in PyTorch (per the PyTorch/XLA RFC), tighter torch.compile integration, and potential open‑sourcing of parts of TorchTPU.
<<callout type="tip" title="A quick, low‑risk test plan for PyTorch shops">
- Stand up a small v5e or v6e slice and install PyTorch/XLA 2.8+; use the PJRT runtime (PJRT_DEVICE=TPU). Start with reference notebooks and profiling tools; a minimal setup sketch follows this callout. PyTorch/XLA.
- Benchmark inference with JetStream and vLLM TPU on the same models (e.g., Llama, Gemma) to compare cost/latency/throughput trade‑offs. JetStream; vLLM TPU.
- For Hugging Face pipelines, try Optimum‑TPU (TGI + JetStream PT option) and evaluate operational fit on GKE or Vertex AI. Optimum‑TPU.
- Explore TorchAx on a feature branch to gauge how much of your PyTorch code can run via JAX→XLA without refactors. TorchAx.
<<endcallout>>
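For the first bullet in that plan, a minimal PyTorch/XLA smoke test looks like the sketch below; the install line is an assumption of the usual wheel names, so verify the current command for your TPU generation against the PyTorch/XLA docs.

```python
# On the TPU VM (sketch; verify against the current PyTorch/XLA install docs):
#   pip install torch~=2.8.0 'torch_xla[tpu]~=2.8.0'
#   export PJRT_DEVICE=TPU
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()                 # resolves to the TPU via the PJRT runtime
model = torch.nn.Linear(1024, 1024).to(device)
x = torch.randn(8, 1024, device=device)

y = model(x)
xm.mark_step()                           # flush the lazily traced graph to XLA
print(y.device)                          # e.g., xla:0
```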
Will this dent Nvidia’s moat?
Short term, Nvidia’s position remains formidable. CUDA’s deep integration with PyTorch and the breadth of mature libraries still make GPUs the default choice, and Nvidia continues to expand that stack. But two structural shifts are notable:
- Portability pressure is rising. OpenXLA, PJRT, and projects like TorchAx reduce switching costs by letting popular frameworks speak to multiple accelerators. OpenXLA.
- Inference economics dominate at scale. JetStream, vLLM TPU, and Trillium’s perf/efficiency gains strengthen TPU’s value proposition for serving, where dollars‑per‑million‑tokens matter most. Google Cloud JetStream; Trillium.
At a glance: software paths from PyTorch to accelerators
| Path | How it works | Maturity today | Best fit |
|---|---|---|---|
| PyTorch on Nvidia GPUs | Native CUDA path via torch.cuda, cuDNN/NCCL | Production default | Training + inference |
| PyTorch/XLA on TPU | PyTorch frontend, XLA compiler via PJRT | Widely used; “native” TPU backend proposed | Training + inference on TPUs |
| TorchAx on TPU | PyTorch syntax lowered through JAX→XLA | Emerging; good for experiments | Easing code migration to TPU |
| vLLM TPU | High‑throughput serving; PyTorch/JAX models lowered to XLA | Actively developed; strong throughput | Cost‑efficient LLM serving |
| Optimum‑TPU (HF) | Turnkey training/serving for open models | Production‑ready building blocks | HF pipelines on GKE/Vertex |
The bottom line
Google and Meta’s push is less about a single library and more about ending the “CUDA or bust” assumption for PyTorch teams. If PyTorch can feel truly native on TPUs—with mature tooling, predictable performance, and turnkey inference paths—Nvidia’s software moat narrows. That won’t flip the market overnight, but it does create genuine multi‑vendor leverage for builders.
Sources
- Reuters (Dec 17, 2025): Google works to erode Nvidia’s software advantage with Meta’s help
- Reuters (Nov 25, 2025): Meta in talks to spend billions on Google’s chips, The Information reports
- PyTorch/XLA RFC (Oct 20, 2025): Evolving PyTorch/XLA for a more native experience on TPU
- OpenXLA: Project overview
- PyTorch docs: CUDA semantics
- PyTorch Foundation (2024): Year in Review
- Google Cloud (Apr 10, 2024): Accelerate AI inference with Cloud TPUs and GPUs (JetStream)
- Google Cloud (May 14, 2024): Introducing Trillium (6th‑gen TPUs) and Release notes
- vLLM (Oct 16, 2025): vLLM TPU redesign—unified backend for PyTorch and JAX
- TorchAx: What is TorchAx
- Google Cloud (2025): What’s new with AI Hypercomputer—advancing PyTorch support + Optimum‑TPU
- State of AI Report Compute Index update (Jan 21, 2025): “91% of AI papers used NVIDIA in 2024”