The ‘most powerful AI training system in the world’ just went online

Elon Musk talks to the press as he arrives to have a look at the construction site of the new Tesla Gigafactory near Berlin.
Maja Hitij/Getty Images

The race for AI supremacy is once again accelerating as xAI CEO Elon Musk announced via Twitter that his company successfully brought its Colossus AI training cluster, which Musk bills as the world’s “most powerful,” online over the weekend.

“This weekend, the @xAI team brought our Colossus 100k H100 training cluster online. From start to finish, it was done in 122 days. Colossus is the most powerful AI training system in the world. Moreover, it will double in size to 200k (50k H200s) in a few months. Excellent work by the team, Nvidia and our many partners/suppliers,” Musk wrote in a post on X.

Musk’s “most powerful” claim is based on the number of GPUs employed by the system. With 100,000 Nvidia H100s driving it, Colossus is estimated to be larger than any other AI system developed to date.
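For a rough sense of what that GPU count implies, here is a back-of-envelope sketch. The per-GPU figure is an assumption, not from Musk's post: Nvidia rates an H100 SXM at roughly 1 petaFLOP/s (1e15 FLOP/s) of dense BF16 tensor-core throughput.

```python
# Back-of-envelope estimate of Colossus's aggregate peak compute.
# Assumption (not from the article): ~1 petaFLOP/s of dense BF16
# tensor-core throughput per Nvidia H100 SXM.
PEAK_FLOPS_PER_H100 = 1e15   # FLOP/s, assumed
num_gpus = 100_000           # H100 count cited by Musk

aggregate_flops = num_gpus * PEAK_FLOPS_PER_H100
print(f"~{aggregate_flops / 1e18:.0f} exaFLOP/s peak")  # ~100 exaFLOP/s
```

Real training runs sustain only a fraction of peak throughput, so this is an upper bound on paper, not a delivered figure.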

Musk began purchasing tens of thousands of GPUs in April 2023 to accelerate his company’s AI efforts, shortly after penning an open letter calling for an industrywide, six-month “pause” on AI development. In March of that year, Musk claimed that the company would leverage AI to “detect & highlight manipulation of public opinion” on Twitter, though the GPU supercomputer will likely also be used to train xAI’s large language model (LLM), Grok.

Grok was introduced by xAI in 2023 in response to the success of rival models like ChatGPT, Gemini, Llama 3.1, and Claude. The company released the updated Grok-2 as a beta in August. “We have introduced Grok-2, positioning us at the forefront of AI development,” xAI wrote in a recent blog post. “Our focus is on advancing core reasoning capabilities with our new compute cluster. We will have many more developments to share in the coming months.”

Musk claims that he can also develop Tesla into “a leader in AI & robotics.” However, a recent report from CNBC suggests that Musk has been diverting shipments of Nvidia’s highly sought-after GPUs from the electric automaker to xAI and Twitter. Doing so could delay Tesla’s efforts to install the compute resources needed to develop its autonomous vehicle technology and the Optimus humanoid robot.

“Elon prioritizing X H100 GPU cluster deployment at X versus Tesla by redirecting 12k of shipped H100 GPUs originally slated for Tesla to X instead,” an Nvidia memo from December obtained by CNBC reads. “In exchange, original X orders of 12k H100 slated for [January] and June to be redirected to Tesla.”

Andrew Tarantola
Andrew Tarantola is a journalist with more than a decade reporting on emerging technologies ranging from robotics and machine…
Grok 2.0 takes the guardrails off AI image generation
Elon Musk's xAI company has released two updated iterations of its Grok chatbot model, Grok-2 and Grok-2 mini. They promise improved performance over their predecessor, as well as new image-generation capabilities that will enable X (formerly Twitter) users to create AI imagery directly on the social media platform.

“We are excited to release an early preview of Grok-2, a significant step forward from our previous model, Grok-1.5, featuring frontier capabilities in chat, coding, and reasoning. At the same time, we are introducing Grok-2 mini, a small but capable sibling of Grok-2. An early version of Grok-2 has been tested on the LMSYS leaderboard under the name 'sus-column-r,'” xAI wrote in a recent blog post. The new models are currently in beta and reserved for Premium and Premium+ subscribers, though the company plans to make them available through its Enterprise API later in the month.

Read more
Meta’s next AI model to require nearly 10 times the power to train
Facebook parent company Meta will continue to invest heavily in its artificial intelligence research efforts, despite expecting the nascent technology to require years of work before becoming profitable, executives explained on the company's Q2 earnings call Wednesday.

Meta is "planning for the compute clusters and data we'll need for the next several years," CEO Mark Zuckerberg said on the call. Meta will need an "amount of compute… almost 10 times more than what we used to train Llama 3," he said, adding that Llama 4 will "be the most advanced [model] in the industry next year." For reference, the Llama 3 model was trained on a cluster of 16,384 Nvidia H100 80GB GPUs.
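Taking Zuckerberg's "almost 10 times" remark at face value, a rough sketch of the implied hardware scale (this extrapolation is ours, not Meta's, and assumes per-GPU throughput comparable to the Llama 3 cluster):

```python
# Rough implication of "almost 10 times more" compute than Llama 3.
llama3_gpus = 16_384   # H100 80GB GPUs used to train Llama 3 (from the article)
scale = 10             # "almost 10 times more" compute, per Zuckerberg

# If per-GPU throughput stayed constant, that budget maps to roughly
# this many H100-class GPUs; faster next-gen chips would lower the count.
implied_gpus = llama3_gpus * scale
print(implied_gpus)    # 163840
```

In practice Meta could reach the same compute budget with fewer, faster accelerators or a longer training run, so the GPU count is illustrative only.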

Read more
We just learned something surprising about how Apple Intelligence was trained
A new research paper from Apple reveals that the company relied on Google's Tensor Processing Units (TPUs), rather than Nvidia's more widely deployed GPUs, in training two crucial systems within its upcoming Apple Intelligence service. The paper notes that Apple used 2,048 Google TPUv5p chips to train its on-device AI models and 8,192 TPUv4 processors for its server AI models.

Nvidia's chips are highly sought after for good reason, having earned a reputation for performance and compute efficiency. Its products and systems are typically sold as standalone offerings, enabling customers to construct and operate them as they see fit.

Read more