haifeng_jin

TPUs Are Not for Sale, But Why?

An analysis of Google’s unique approach to AI hardware

April 2024

Nvidia’s stock price has skyrocketed because of its GPUs’ dominance in the AI hardware market. Meanwhile, TPUs, Google’s well-known AI hardware, are not for sale. You can only rent virtual machines on Google Cloud to use them. Why did Google not join the game of selling AI hardware?

DISCLAIMER: The views expressed in this article are solely those of the author and do not necessarily reflect the opinions or viewpoints of Google or its affiliates. The entirety of the information presented within this article is sourced exclusively from publicly available materials.

A popular theory

One popular theory I have heard is that Google wants to attract more customers to its cloud services. If it sold TPUs to other cloud service providers, it would be less competitive in the cloud service market.

According to cloud service customers, this theory does not make much sense. No corporate-level customer wants to be locked into one specific cloud service provider. They want to be flexible enough to move to another whenever needed. Otherwise, if the provider increases the price, they can do nothing about it.

If using TPUs locks them into Google Cloud, they would rather not use TPUs at all. This is why many customers don’t want to use them. They only started to feel less locked in recently, when OpenXLA, an intermediate software layer for accessing TPUs, added support for more frameworks like PyTorch.

So, using TPUs to attract customers to Google Cloud is not a valid reason for not selling them. Then, what is the real reason? To answer this question better, we must look into how Google started the TPU project.

Why did Google start the TPU project?

The short answer is for proprietary usage. There was a time when GPUs could not meet the computing requirements of AI workloads.

Let’s try to estimate when the TPU project was started. Given it was first announced to the public in 2016, it would be a fair guess that it started around 2011. If that is true, they started the project pretty early, since we did not see a significant improvement in computer vision until AlexNet in 2012.

With this timeline, we know GPUs were less powerful than they are today when the project started. Google saw this AI revolution early and wanted faster hardware for large-scale computing. Its only choice was to build a new solution for it.

That was why Google started this project, but there are more questions. Why were GPUs not good enough back in the day? What potential improvements did Google see that were significant enough to justify starting its own hardware project?

The answer lies in the microarchitecture of GPUs and TPUs. Let’s examine the design of the cores on GPUs and TPUs.

The design idea of GPUs

First, let’s do a quick recap of the background knowledge of CPUs. When an instruction comes, it is decoded by the instruction decoder and fed into the arithmetic logic unit (ALU) together with data from the registers. The ALU does all the computing and returns the results to one of the registers. If you have multiple cores in the CPU, they can work in parallel.

What is a GPU? The name is short for graphics processing unit. It was designed for graphics computing and later found to be suitable for machine learning. Most of the operations in a GPU are matrix operations, which can run in parallel. This also means a GPU needs to support far fewer kinds of operations than a CPU.

The more specialized the chip is for a given task, the faster it is on the task.

The key idea of the GPU’s initial design was to have a feature-reduced CPU with smaller but more numerous cores for faster parallel computing. A GPU supports far fewer instructions than a CPU, which makes the area taken by a single core on a chip much smaller. This way, they can pack more cores onto the chip for large-scale parallel computing.

Why do fewer features mean a smaller area on the chip? In software, more features mean more code. In hardware, all features are implemented using logic circuits instead of code. More features mean the circuit is more complex. For example, a CPU must implement more instructions, so each of its cores needs more circuitry on the chip.

Smaller can also mean faster. A simpler design of the logic gates tends to allow a shorter cycle time.

The design idea of TPUs

The TPU takes this idea of specialized chips further for deep learning. The defining feature of a TPU is its matrix-multiply unit (MXU). Since matrix multiplication is the most frequent operation in deep learning, the TPU has a specialized core for it, the MXU.

The MXU is even more specialized than a GPU core. A GPU core is capable of many matrix operations, while the MXU does only one thing: matrix multiplication.

It works quite differently from a traditional CPU/GPU core. All dynamic behavior and generality are removed. It has a grid of nodes, each wired to its neighbors, an arrangement known as a systolic array. Each node only does multiplication and addition in a predefined manner. The results are pushed directly to the next node for the next multiplication and addition. So, everything is predefined and fixed.

This way, we save time by removing the need for instruction decoding, since each node just multiplies and adds whatever it receives. There is also no need to write intermediate results to registers and read them back, since we already know where the results should go and do not need to store them for arbitrary operations that come next.
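To make the grid of multiply-add nodes more concrete, here is a minimal Python sketch that simulates such a grid cycle by cycle. It is only an illustration under my own assumptions, not Google’s actual MXU design: the weights stay fixed in the nodes, inputs enter from the left edge with a one-cycle skew per row, and partial sums are pushed downward. The function names (systolic_matmul, naive_matmul) are made up for this example.

```python
def naive_matmul(x, w):
    """Reference result: ordinary triple-loop matrix multiplication."""
    return [[sum(x[r][k] * w[k][j] for k in range(len(w)))
             for j in range(len(w[0]))] for r in range(len(x))]


def systolic_matmul(x, w):
    """Simulate a grid of multiply-add nodes computing x @ w.

    Node (k, j) permanently holds w[k][j]. Every cycle it takes an input
    from its left neighbor and a partial sum from the node above, then
    pushes input to the right and (partial sum + input * weight) downward.
    No node decodes instructions or chooses where its result goes.
    """
    n, depth, m = len(x), len(w), len(w[0])
    x_reg = [[0] * m for _ in range(depth)]  # input each node saw last cycle
    p_reg = [[0] * m for _ in range(depth)]  # partial sum each node produced
    y = [[0] * m for _ in range(n)]

    for t in range(n + depth + m - 2):
        # Visit nodes bottom-right first so each node still sees its
        # neighbors' values from the previous cycle.
        for k in reversed(range(depth)):
            for j in reversed(range(m)):
                if j == 0:
                    r = t - k  # row k of the grid is fed with a k-cycle delay
                    x_in = x[r][k] if 0 <= r < n else 0
                else:
                    x_in = x_reg[k][j - 1]
                p_in = p_reg[k - 1][j] if k > 0 else 0
                x_reg[k][j] = x_in
                p_reg[k][j] = p_in + x_in * w[k][j]
        # Finished sums fall out of the bottom row of the grid.
        for j in range(m):
            r = t - (depth - 1) - j
            if 0 <= r < n:
                y[r][j] = p_reg[depth - 1][j]
    return y


x = [[1, 2], [3, 4], [5, 6]]
w = [[7, 8, 9], [10, 11, 12]]
print(systolic_matmul(x, w) == naive_matmul(x, w))  # prints True
```

Notice that the inner loop contains no branching on what operation to perform. The only control logic is the fixed timing of when data enters and leaves the grid, which is the point of the design.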

Besides the MXU, the TPU has also been designed for better scalability. It has dedicated ports for high-bandwidth inter-chip interconnection (ICI). It is designed to sit on the racks in Google’s data centers and to be used in clusters. Since it is for proprietary usage only, they don’t need to worry about selling single chips or the complexity of installing the chips on the racks.

Are TPUs still faster today?

It would be surprising if others had not come up with the same simple idea of building dedicated cores for tensor operations (matrix multiplication). Even if they had not, it would be surprising if they did not copy it.

From the timeline, it seems Nvidia came up with the same idea at about the same time. A similar product from Nvidia, the Tensor Cores, was first announced to the public in 2017, one year after Google’s TPU announcement.

It is unclear whether TPUs are still faster than GPUs today. I cannot find public benchmarks of the latest generations of TPUs and GPUs, and it is unclear to me which generation and metrics should be used for benchmarking.

However, we can use one universal application-oriented metric: dollars per epoch. I found one interesting benchmark from Google Cloud that aligns different hardware to the same axis: money. TPUs appear cheaper on Google Cloud if you have the same model, data, and number of epochs.
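As a rough illustration of how a dollars-per-epoch metric puts different hardware on one axis, here is a small Python sketch. All the numbers are hypothetical placeholders that I made up for the arithmetic. They are not real prices or measurements from Google Cloud or from the benchmark mentioned above, and the function name is my own.

```python
def dollars_per_epoch(price_per_chip_hour, num_chips, hours_per_epoch):
    """Cost of one training epoch: hourly price x chips x wall-clock hours."""
    return price_per_chip_hour * num_chips * hours_per_epoch


# Hypothetical accelerators, for illustration only.
# A is cheaper per hour but slower; B is pricier per hour but faster.
cost_a = dollars_per_epoch(price_per_chip_hour=2.0, num_chips=8, hours_per_epoch=1.5)
cost_b = dollars_per_epoch(price_per_chip_hour=3.0, num_chips=8, hours_per_epoch=0.75)

print(cost_a)  # 24.0
print(cost_b)  # 18.0
```

In this made-up case, the chip with the higher hourly price is still cheaper per epoch because it finishes sooner. The metric folds speed and price into a single number, so you do not need to agree on which hardware-level benchmark to use.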

Large models, like Midjourney, Claude, and Gemini, are all very sensitive to training cost because they consume so much computing power. As a result, many of them use TPUs on Google Cloud.

Why are TPUs cheaper?

One important reason is the software stack. You are using not only the hardware but also the software stack associated with it. Google has better vertical integration between its software stack and its AI hardware than the GPU ecosystem has.

Google has dedicated engineering teams to build a whole software stack for TPUs with strong vertical integration, from the model implementation (Vertex Model Garden) to the deep learning frameworks (Keras, JAX, and TensorFlow) to a compiler well-optimized for the TPUs (XLA).

The software stack for GPUs is very different. PyTorch, the most popular deep learning framework used with Nvidia GPUs, was mainly developed by Meta. The most widely used model libraries for PyTorch are transformers and diffusers, developed by HuggingFace. It is much harder to achieve tight vertical integration of the software stack across all these companies.

One caveat is that fewer models are implemented with JAX and TensorFlow. Sometimes, you may need to implement the model yourself or use it from PyTorch on TPUs. Depending on the implementation, you may experience some friction when using PyTorch on TPUs. So, there might be extra engineering costs besides the hardware cost itself.

Why not start selling TPUs?

We understand the project was started for proprietary usage and has acquired a pretty good user base on Google Cloud because of its lower price. Why didn’t Google just start selling TPUs to customers directly, as Nvidia does with its GPUs?

The short answer is to stay focused. Google is in fierce competition with OpenAI for generative AI. At the same time, it is in the middle of multiple waves of tech layoffs to lower its cost. A wise strategy now would be to focus its limited resources on the most important projects.

If Google started selling its TPUs, it would be competing with two strong opponents, Nvidia and OpenAI, at the same time, which may not be a wise move at the moment.

The huge overhead of selling hardware

Selling hardware directly to customers creates huge overhead for the company. In contrast, renting out TPUs through its cloud service is much more manageable.

When TPUs are only served on the cloud, Google can install all the TPUs and related software in a centralized way. There is no need to deal with various installation environments or the difficulty of deploying a TPU cluster.

Google also knows exactly how many TPUs to make. All the demand is internal, so there is far less uncertainty. Thus, managing the supply chain is much easier.

Sales also become much easier since it is just selling the cloud service. There is no need to build a new team experienced in selling hardware.

The advantages of the TPU approach

By avoiding all the overhead of selling hardware directly to customers, Google gains a few advantages in return.

First, they can have a more aggressive TPU architecture design. The TPUs have a unique way of connecting the chips. Unlike multiple GPUs that connect to the same board, TPUs are organized in cubes. Google arranges 64 TPUs in a 4 by 4 by 4 cube and interconnects them for faster inter-chip communication. There are 8960 chips in a single v5p Pod, and they can easily be used together. This is the advantage of fully controlling your hardware installation environment.
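The numbers above can be checked with a little arithmetic. The Python sketch below counts how many 4 by 4 by 4 cubes fit into 8960 chips and, as a toy model, compares the worst-case number of hops between two chips inside one cube with and without wraparound links. The link layout is my own simplifying assumption (each chip connected only to its nearest neighbors along the three axes); the text above only states the cube shape and the Pod size.

```python
from itertools import product

CUBE_SIDE = 4
CHIPS_PER_CUBE = CUBE_SIDE ** 3  # 4 * 4 * 4 = 64 chips
POD_CHIPS = 8960                 # chips in one v5p Pod, as stated in the text


def cubes_in_pod(pod_chips=POD_CHIPS):
    """Number of full cubes needed to hold the given number of chips."""
    count, leftover = divmod(pod_chips, CHIPS_PER_CUBE)
    assert leftover == 0, "pod size is not a whole number of cubes"
    return count


def hop_distance(a, b, side=CUBE_SIDE, wrap=False):
    """Hops between chips a and b, moving one neighbor at a time per axis."""
    total = 0
    for p, q in zip(a, b):
        d = abs(p - q)
        if wrap:  # assumed wraparound link from one face to the opposite face
            d = min(d, side - d)
        total += d
    return total


def worst_case_hops(side=CUBE_SIDE, wrap=False):
    """Largest hop distance over every pair of chips in one cube."""
    chips = list(product(range(side), repeat=3))
    return max(hop_distance(a, b, side, wrap) for a in chips for b in chips)


print(cubes_in_pod())               # 140
print(worst_case_hops(wrap=False))  # 9
print(worst_case_hops(wrap=True))   # 6
```

Under these assumptions, wiring opposite faces together cuts the worst-case distance from 9 hops to 6. Cabling like this is much easier to do when one company controls every rack the chips sit in.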

Second, they can iterate faster to push out new generations. Since they only need to support a small set of use cases for proprietary usage, the research and development cycle for every generation of the chips is drastically shorter. I wonder if Nvidia came up with the Tensor Core idea earlier than Google but, because of the overhead of selling hardware to external customers, could only announce it one year later.

From the perspective of serving its most important purpose, competing in GenAI, these advantages put Google in a very good position. Most importantly, with this in-house hardware solution, Google saved a huge amount of money by not buying GPUs from Nvidia at a monopoly price.

The downside of the TPU approach

So far, we have discussed many advantages of Google’s AI hardware approach, but is there any downside? Indeed, there is a big one. Google became a tech island.

Every pioneer in tech will become an island isolated from the rest of the world, at least for a while. This is because they started early when the corresponding infrastructure was not ready. They need to build everything from scratch. Due to the migration cost, they will stick with their solution even if everyone else uses something else.

This is exactly what Google is experiencing right now. The rest of the world is innovating with models from HuggingFace and PyTorch. Everyone is quickly tweaking each other’s models to develop better ones. However, Google cannot join this process easily since its infrastructure is largely built around TensorFlow and JAX. When an external model is put into production, it must be re-implemented in Google’s frameworks.

This “tech island” problem slows Google down in adopting good solutions from the external world and further isolates it from others. Google will either have to start bringing in more external solutions like HuggingFace, PyTorch, and GPUs or always ensure its in-house solutions are the best in the world.

What does the future of AI hardware look like?

Finally, let’s peek into the future of AI hardware. The short answer is mode collapse as the hardware becomes more specialized.

Hardware will be further coupled with the applications. For example, it may support more precision formats for better language model serving. Just as it supports bfloat16 and TF32 today, it may add better support for int8 and int4. Nvidia announced the second generation of its Transformer Engine, which works with the Blackwell GPU. This makes it easier to optimize their hardware for transformer models without changing the user code. A lot of co-design is happening.
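To give a feel for what these reduced-precision formats trade away, here is a small Python sketch. It relies on one fact I am sure of, that bfloat16 has the same sign and exponent bits as float32 and keeps only the top 7 mantissa bits. Everything else is a simplification of my own: the conversion truncates instead of rounding, and the int8 scheme is a basic symmetric one with a single scale per list. Real hardware and libraries use more careful variants.

```python
import struct


def to_bfloat16(x):
    """Truncate a float32 value to bfloat16 by keeping its top 16 bits."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits &= 0xFFFF0000  # drop the low 16 mantissa bits (truncation, no rounding)
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def quantize_int8(values):
    """Symmetric int8 quantization with one shared scale (a simple sketch)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Map int8 codes back to approximate real values."""
    return [q * scale for q in quantized]


print(to_bfloat16(3.14159))  # 3.140625

codes, scale = quantize_int8([1.27, -0.64, 0.0])
print(codes)                 # [127, -64, 0]
```

bfloat16 keeps the range of float32 but only two or three significant decimal digits, while int8 needs an extra scale factor to cover the range at all. Supporting each format efficiently requires dedicated circuits, which is why the choice of formats is a hardware and software co-design question.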

On the other hand, software cannot easily jump out of the transformer realm. If it does, it will be slow due to a lack of hardware support. Instead, developers implement their models with the hardware in mind. For example, the FlashAttention algorithm is designed to leverage the memory hierarchy of GPUs for better performance.
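As a rough illustration of what designing an algorithm around the memory hierarchy means, the Python sketch below computes attention for a single query by streaming over the keys and values one small block at a time, keeping only a running maximum, a running sum, and a running output. This blockwise softmax trick is the idea FlashAttention builds on, but the code is my own simplified sketch, not the FlashAttention kernel: it runs on the CPU, handles one query, skips the usual scaling by the square root of the dimension, and all function names are made up.

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def naive_attention(q, keys, values):
    """Softmax attention that materializes all scores at once."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


def blockwise_attention(q, keys, values, block_size=2):
    """Same result, but only one block of keys/values is live at a time."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [dot(q, k) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum.
        rescale = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * rescale + sum(weights)
        acc = [a * rescale + sum(w * v[i] for w, v in zip(weights, v_blk))
               for i, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[0.1, 0.2], [0.9, -0.3], [0.4, 0.4], [-0.5, 1.0], [0.7, 0.7]]
values = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.5], [3.0, 1.0, 0.0],
          [0.5, 0.5, 0.5], [2.0, -1.0, 1.0]]
full = naive_attention(q, keys, values)
tiled = blockwise_attention(q, keys, values, block_size=2)
print(all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(full, tiled)))  # prints True
```

On a GPU, each block would be sized to fit in fast on-chip memory so the full score matrix never has to be written out to slower memory. The algorithm only makes sense because of how the hardware is built.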

We see a big mode collapse coming soon. The hardware and software are so well optimized for each other for the current models that neither can easily leave the current design or algorithm. If a new model is completely different from transformers, it needs to be 10x better to get widely adopted. It must incentivize people to build new hardware that runs it as fast and as cheaply as today’s hardware runs transformers.

Summary

In conclusion, the TPU project started for proprietary usage when the GPU’s computing power was insufficient. Google wants to focus on GenAI instead of competing in the AI hardware market to avoid slowing the iteration speed and sacrificing its innovative design. Faster computing at a lower cost helped Google significantly in doing AI research and developing AI applications. However, it also made Google a tech island.

Looking into the future, AI hardware will be even more optimized for certain applications, like the transformer models. Neither the hardware nor the models can easily escape this mode collapse.