xAI confirmed over the weekend that Colossus 2, its expansion campus in Memphis, now spans one million Nvidia GPUs across active training and inference fabrics, a milestone Elon Musk has been telegraphing since the original 100,000-unit phase came online in late 2024. The disclosure, made in a Friday post on X and corroborated by sources at the Tennessee Valley Authority and Memphis Light, Gas & Water, makes Colossus 2 the largest contiguous AI training facility ever assembled by a single operator. It surpasses Microsoft's Mount Pleasant build-out and the OpenAI/Oracle Stargate Phase 1 cluster in Abilene that went live earlier this month.
The headline number, however, masks a more interesting story: power availability in Memphis is now the primary bottleneck for xAI's training cadence, and the company's mix of hyperscale ambition, behind-the-meter gas turbines, and grid-tied capacity is becoming a case study other AI labs are watching closely.
What Is Actually Inside Colossus 2
The one-million figure aggregates several Nvidia generations. According to a Memphis Chamber filing referenced by The Commercial Appeal, the campus mixes roughly 200,000 H100 units retained from Colossus 1, around 350,000 H200s deployed during 2025, and an estimated 450,000 Blackwell-class GPUs (a blend of B200 and the higher-density Blackwell Ultra B300 parts that began shipping in volume this quarter). xAI has not broken out the precise B300 count, but Nvidia’s own Q1 FY27 commentary flagged “a single hyperscale customer” as the largest individual buyer of B300 systems through April.
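Those figures do sum to the headline number. A quick tally (counts are the filing's figures as relayed by The Commercial Appeal; the per-generation shares are derived here, not disclosed):

```python
# Tally of the reported Colossus 2 fleet mix. Counts are the article's
# figures; the percentage shares are computed, not disclosed by xAI.
fleet = {
    "H100 (retained from Colossus 1)": 200_000,
    "H200 (deployed during 2025)": 350_000,
    "Blackwell-class (B200 + B300)": 450_000,
}

total = sum(fleet.values())
print(f"Total GPUs: {total:,}")  # 1,000,000
for gen, count in fleet.items():
    print(f"  {gen}: {count:,} ({count / total:.0%})")
```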
Networking is dominated by Nvidia’s Spectrum-X Ethernet fabric paired with NVLink Switch trays inside each rack island. xAI engineers have publicly described an east-west bandwidth target of 800 Gbps per GPU at full mesh, which is what allows the cluster to act as one logical machine for Grok-4 training rather than a federation of smaller pods.
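xAI has not published aggregate fabric numbers, but the stated per-GPU target implies enormous totals at fleet scale. The sketch below is pure arithmetic on the 800 Gbps figure, not a disclosed spec:

```python
# Rough aggregate fabric math from the stated 800 Gbps-per-GPU target.
# The per-GPU figure is the one xAI engineers have described publicly;
# everything derived below is simple multiplication, not a disclosed spec.
GPUS = 1_000_000
PER_GPU_GBPS = 800

per_gpu_GBps = PER_GPU_GBPS / 8                # 100 GB/s injection per GPU
aggregate_pbps = GPUS * PER_GPU_GBPS / 1e6     # petabits per second
aggregate_PBps = GPUS * per_gpu_GBps / 1e6     # petabytes per second

print(f"Per-GPU injection: {per_gpu_GBps:.0f} GB/s")
print(f"Aggregate injection bandwidth: {aggregate_pbps:,.0f} Pbps "
      f"(~{aggregate_PBps:.0f} PB/s)")
```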
Total nameplate IT load for the campus is now approximately 750 megawatts; actual sustained draw, cooling and overhead included, runs closer to 580 MW, because available power does not allow the full fleet to run at peak. That puts Colossus 2 in the same weight class as a mid-size aluminum smelter, a comparison Memphis Light, Gas & Water staff have used in public board meetings.
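Dividing the campus-level figures by the fleet count gives a rough per-GPU average. The board-power reference points in the comments are approximate public TDPs, not xAI data, but they show why a blended average this low already hints at the utilization story below:

```python
# Implied average power per GPU from the article's campus-level figures.
# Board-power comparisons are approximate public TDPs, not xAI data.
GPUS = 1_000_000
NAMEPLATE_MW = 750   # nameplate IT load (article figure)
SUSTAINED_MW = 580   # sustained operating draw (article figure)

nameplate_w = NAMEPLATE_MW * 1e6 / GPUS   # 750 W per GPU
sustained_w = SUSTAINED_MW * 1e6 / GPUS   # 580 W per GPU

print(f"Nameplate average: {nameplate_w:.0f} W/GPU")
print(f"Sustained average: {sustained_w:.0f} W/GPU")
# For rough comparison: H100 SXM ~700 W, B200 ~1,000 W, B300 ~1,400 W.
# A blended average well under those peaks is consistent with the
# "power-shaped" utilization described later in this piece.
```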
The Power Story Is the Real Story
The constraint xAI has been dancing around since 2024 is electrons, not silicon. The Tennessee Valley Authority approved an additional 150 MW of grid-tied capacity for the campus in March, after months of regulatory back-and-forth, but the substation and 161-kV interconnect upgrades required to deliver that power are not scheduled for completion until Q3 2026.
In the interim, xAI continues to lean on a fleet of natural-gas turbines installed on-site, which the Southern Environmental Law Center puts at 35 units generating roughly 420 MW, or about 12 MW apiece. Those turbines remain controversial. A Shelby County air-quality complaint filed in February alleged that several units were operating without final Title V permits; xAI maintains the deployment is compliant under temporary authorizations and is in active dialogue with regulators.
What this means operationally: even with one million GPUs racked, the cluster cannot run all of them at full training intensity simultaneously. xAI’s current scheduling model rotates training and inference loads across power envelopes, a pattern Musk acknowledged on X by calling current utilization “power-shaped, not silicon-shaped.”
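xAI has not described the scheduler itself, but the basic shape of power-envelope admission control is easy to sketch. The following is a minimal illustration, not xAI's system; all pod names and figures are assumptions. The idea: greedily admit the highest-priority workloads that fit under whatever power budget the site has at that moment.

```python
# A minimal sketch of power-envelope scheduling, NOT xAI's actual system.
# Pod names, draws, and priorities are illustrative assumptions only.
from dataclasses import dataclass

@dataclass
class Pod:
    name: str
    draw_mw: float   # sustained draw if admitted
    priority: int    # higher runs first

def schedule(pods: list[Pod], budget_mw: float) -> list[Pod]:
    """Admit pods in priority order until the power budget is exhausted."""
    admitted, used = [], 0.0
    for pod in sorted(pods, key=lambda p: -p.priority):
        if used + pod.draw_mw <= budget_mw:
            admitted.append(pod)
            used += pod.draw_mw
    return admitted

pods = [
    Pod("grok-training-island-a", 180, priority=3),
    Pod("grok-training-island-b", 180, priority=3),
    Pod("inference-east", 120, priority=5),  # serving traffic preempts training
    Pod("inference-west", 120, priority=5),
    Pod("eval-and-research", 60, priority=1),
]

# Illustrative budgets: roughly the 580 MW sustained draw reported above,
# and a tighter envelope near the ~420 MW the on-site turbines provide.
for budget in (580, 450):
    running = schedule(pods, budget)
    print(f"{budget} MW envelope -> {[p.name for p in running]}")
```

Shrink the envelope and the training islands are the first thing shed, which matches the rotation pattern Musk was describing.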
Competitive Implications
For competitors, the Colossus 2 milestone is less about who has the biggest cluster and more about who can actually power and cool it. OpenAI’s Stargate Phase 1 at Abilene draws roughly 100 MW today and is targeting 1.2 GW by 2027, using a different mix of gas and grid. Anthropic’s training capacity sits primarily on Amazon’s Trainium fleet plus Google TPU v6 allocations, both of which spread load across multiple regional data centers — a hedge against single-site power constraints that Memphis now exemplifies.
The practical question for the industry is whether single-site mega-clusters retain a meaningful training-efficiency advantage as cross-region training fabrics mature. Google has been the most aggressive in publishing results from geographically distributed training runs; Microsoft is reportedly piloting similar approaches between its Mount Pleasant and Atlanta sites. If those efforts work, the Colossus 2 model — concentrate everything in one place, then fight for power — may turn out to be a transitional architecture rather than the destination.
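A back-of-envelope comparison shows why that question is still open. Assume, purely for illustration, a 2-trillion-parameter model with bf16 gradients, a fast in-cluster fabric, and a dedicated 10 Tbps inter-region link; none of these numbers come from Google, Microsoft, or xAI:

```python
# Back-of-envelope on why cross-region training is hard: time to move one
# full gradient exchange over a WAN link vs. an in-cluster fabric.
# Model size and link speeds are assumptions for illustration only.
PARAMS = 2e12            # assume a ~2T-parameter model
BYTES_PER_GRAD = 2       # bf16 gradients
payload_tb = PARAMS * BYTES_PER_GRAD / 1e12   # ~4 TB per full exchange

for label, tbps in [("intra-cluster fabric", 800.0),  # assumed share of fabric
                    ("inter-region WAN", 10.0)]:      # assumed dedicated link
    seconds = payload_tb * 8 / tbps
    print(f"{label}: {seconds:.2f} s per full gradient exchange")
```

Two orders of magnitude is the gap distributed-training work has to close, either by synchronizing less often or by shipping less data per step.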
For now, though, xAI has the largest contiguous training engine in the world. Whether it can keep it fed is the next chapter.
(Sources: xAI / Elon Musk posts on X, April 24–25, 2026; Tennessee Valley Authority quarterly filing; Memphis Light, Gas & Water board minutes; The Commercial Appeal; Nvidia Q1 FY27 earnings commentary; Southern Environmental Law Center filings.)