On April 4 local time, Google researchers published a paper online revealing for the first time technical details of the supercomputer the company uses to train its artificial intelligence models, and claiming the system is faster and more power-efficient than comparable Nvidia systems.
The new paper, titled "TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings," describes a custom TPU chip of Google's own design. More than 90 percent of Google's current AI training work runs on these chips.
Google says it has connected more than 4,000 TPU chips into a single supercomputer, with the chips working in concert to train models. Technology companies are racing to build such AI supercomputing systems because large models, including OpenAI's ChatGPT and Google's rival chatbot Bard, have grown too big to fit on a single chip and must instead be split across thousands of chips working together.
Google says its AI supercomputer can reconfigure the connections between chips easily and on the fly, helping to boost performance. "Circuit switching makes it easy to route around failed components," wrote Google researcher Norm Jouppi and Google Distinguished Engineer David Patterson. "This flexibility even allows us to change the topology of the supercomputer interconnect to accelerate the performance of a machine learning model."
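Conceptually, the reconfiguration Jouppi and Patterson describe can be pictured as rewiring a graph of chips. The following toy Python sketch is purely illustrative (it is not Google's implementation, and the function names are invented here): it models a small ring of chips and routes around one that has failed, loosely analogous to what the optical circuit switches do.

```python
# Toy illustration only -- NOT Google's actual system. We model a small chip
# interconnect as an adjacency map and "re-switch" links to bypass a faulty
# chip, the way TPU v4's optical circuit switches can route around failures.

def build_ring(n):
    """Connect n chips in a ring: chip i links to its two neighbors."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}

def bypass(links, faulty):
    """Remove a faulty chip and directly connect its former neighbors,
    preserving a working ring (a stand-in for optical re-switching)."""
    neighbors = links.pop(faulty)       # drop the failed chip
    for node in neighbors:
        links[node].discard(faulty)     # sever links to it
    a, b = sorted(neighbors)
    links[a].add(b)                     # splice its neighbors together
    links[b].add(a)
    return links

links = build_ring(6)
links = bypass(links, faulty=3)
# Chip 3 is gone; chips 2 and 4 are now linked directly, keeping the ring whole.
```

The point of the analogy: because the "wiring" is switchable rather than fixed, a failed unit can be excised without rebuilding the machine, and the same mechanism can reshape the topology to suit a particular model.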
Google said in the paper that for a system of comparable size, its TPU-based supercomputer is 1.7 times faster and 1.9 times more energy efficient than one based on Nvidia's A100 chip. Google did not, however, compare its fourth-generation TPU with Nvidia's current flagship AI GPU, the H100, which packs 80 billion transistors and is built on TSMC's newer 4N (4-nanometer) manufacturing process.
Google also hinted that the company is developing a next-generation TPU that will compete with the H100, but did not provide any details.
Nvidia, for its part, is using AI to improve chip design and develop more powerful GPUs. Last week, Nvidia published a paper demonstrating AutoDMP, a macro-placement technique that uses AI-accelerated optimization to arrange large circuit blocks on a chip.
According to earlier market analysis, Nvidia's A100 holds roughly 95% of the market for training large AI models. Nvidia declined to comment on whether Google has fully switched to its own chips for AI training. However, an Nvidia technical staffer told the Yicai reporter: "Google uses both Nvidia's chips and its own chips. In many cases, competition and cooperation coexist."
Although Google is only now releasing details about the supercomputer, the system has been running inside the company's data center in Mayes County, Oklahoma, since 2020. AI image generation company Midjourney also uses the system to train its models, Google said.
"Google has been trying to break free of Nvidia's chip dominance, but it is not as easy as it sounds," Gartner chip analyst Sheng Linghai told the Yicai reporter. "Nvidia has worked for decades to secure its current position in the industry. Google's TPUs are still mainly for its own internal use."