xAI’s Colossus supercomputer cluster uses 100,000 Nvidia Hopper GPUs — and it was all made possible using Nvidia’s Spectrum-X Ethernet networking platform
- Nvidia and xAI collaborate on Colossus development
- xAI has markedly cut down ‘flow collisions’ during AI model training
- Spectrum-X has been crucial in training the Grok AI model family
Nvidia has shed light on how xAI’s ‘Colossus’ supercomputer cluster can keep a handle on 100,000 Hopper GPUs – and it’s all down to using the chipmaker’s Spectrum-X Ethernet networking platform.
Spectrum-X, the company revealed, is designed to provide massive performance capabilities to multi-tenant, hyperscale AI factories using its Remote Directory Memory Access (RDMA) network.
The platform has been deployed at Colossus, the world’s largest AI supercomputer, since its inception. The Elon Musk-owned firm has been using the cluster to train its Grok series of large language models (LLMs), which power the chatbots offered to X users.
The facility was built in collaboration with Nvidia in just 122 days, and xAI is currently in the process of expanding it, with plans to deploy a total of 200,000 Nvidia Hopper GPUs.
Training Grok takes serious firepower
The Grok AI models are extremely large, with Grok-1 measuring in as 314 billion parameters and Grok-2 outperforming Claude 3.5 Sonnet and GPT-4 Turbo at the time of launch in August.
Naturally, training these models requires significant network performance. Using Nvidia’s Spectrum-X platform, xAI recorded zero application legacy degradation or packet loss as a result of ‘flow collisions’, or bottlenecks within AI networking paths.
xAI revealed it has been able to maintain 95% data throughput enabled by Spectrum-X’s congestion control capabilities. The company added this level of performance cannot be delivered at this scale via standard Ethernet.
Sign up to the TechRadar Pro newsletter to get all the top news, opinion, features and guidance your business needs to succeed!
Using traditional Ethernet, this typically creates thousands of flow collisions while delivering only 60% data throughput, according to Nvidia.
A spokesperson for xAI said the combination of Hopper GPUs and Spectrum-X has allowed the company to “push the boundaries of training AI models” and created a “super-accelerated and optimized AI factory”
“AI is becoming mission-critical and requires increased performance, security, scalability and cost-efficiency,” said Gilad Shainer, senior vice president of networking at Nvidia.
“The NvidiaSpectrum-X Ethernet networking platform is designed to provide innovators such as xAI with faster processing, analysis and execution of AI workloads, and in turn accelerates the development, deployment and time to market of AI solutions.”
Part of the Spectrum-X platform includes the Spectrum SN5600 Ethernet switch – this supports port speeds of up to 800Gb/s and is based on the Spectrum-4 switch ASIC, according to Nvidia.
xAI opted to combine the Spectrum-X SN5600 switch with NVIDIA BlueField-3 SuperNICs for higher performance.
You might also like
Nvidia and xAI collaborate on Colossus development xAI has markedly cut down ‘flow collisions’ during AI model training Spectrum-X has been crucial in training the Grok AI model family Nvidia has shed light on how xAI’s ‘Colossus’ supercomputer cluster can keep a handle on 100,000 Hopper GPUs – and it’s…
Recent Posts
- Bitcoin just hit $100,000
- ChatGPT search can’t find the real news, even with a publisher holding its hand
- Quordle today – hints and answers for Thursday, December 5 (game #1046)
- Humane wants to put the AI Pin’s software inside your phone, car, and smart speaker
- NYT Strands today — hints, answers and spangram for Thursday, December 5 (game #277)
Archives
- December 2024
- November 2024
- October 2024
- September 2024
- August 2024
- July 2024
- June 2024
- May 2024
- April 2024
- March 2024
- February 2024
- January 2024
- December 2023
- November 2023
- October 2023
- September 2023
- August 2023
- July 2023
- June 2023
- May 2023
- April 2023
- March 2023
- February 2023
- January 2023
- December 2022
- November 2022
- October 2022
- September 2022
- August 2022
- July 2022
- June 2022
- May 2022
- April 2022
- March 2022
- February 2022
- January 2022
- December 2021
- November 2021
- October 2021
- September 2021
- August 2021
- July 2021
- June 2021
- May 2021
- April 2021
- March 2021
- February 2021
- January 2021
- December 2020
- November 2020
- October 2020
- September 2020
- August 2020
- July 2020
- June 2020
- May 2020
- April 2020
- March 2020
- February 2020
- January 2020
- December 2019
- November 2019
- September 2018
- October 2017
- December 2011