xAI Launches Grok 4: Claims Top Scores Across Every Major AI Benchmark

Elon Musk’s artificial intelligence company xAI released Grok 4 on Sunday, claiming benchmark leadership on the major evaluations the industry uses to rank frontier models. The announcement, made on xAI’s developer blog and amplified through Musk’s X account, came weeks after Anthropic released Claude Opus 4.7 and OpenAI quietly updated GPT-5 with enhanced reasoning capabilities.

Benchmark Numbers

According to xAI’s internal evaluations, Grok 4 scores 92.4% on MMLU (Massive Multitask Language Understanding), 0.6 percentage points above the previously reported top score of 91.8%, held by GPT-5. On GPQA Diamond, a graduate-level scientific reasoning benchmark widely considered among the hardest to game, Grok 4 posts 72.1%, a 4-point improvement over its predecessor, Grok 3. On SWE-bench Verified, a software engineering test, Grok 4 autonomously resolves 63.7% of real-world GitHub issues, matching the highest published result in the field.

The model also introduces a native 2-million-token context window, double what Grok 3 offered. xAI claims this enables full ingestion of large codebases, legal contracts, or multi-year financial filings without summarization loss.

Independent evaluators at LMSYS and Scale AI have not yet published third-party confirmation of the scores. External testers broadly validated xAI’s benchmark claims for Grok 3 within 72 hours of launch, so independent checks of the Grok 4 numbers are likely to arrive this week.

The Colossus Infrastructure Behind It

Grok 4 was trained on Colossus, xAI’s Memphis-based supercomputer, which the company confirmed has now scaled to 200,000 NVIDIA H100 and H200 GPUs — double the 100,000-GPU configuration that was operational in late 2025. The expansion cost approximately $6 billion in hardware alone, according to estimates from semiconductor analysts at Bernstein Research.

That infrastructure investment is visible in the model’s training compute. xAI disclosed that Grok 4 used roughly 10 times the training FLOPs of Grok 3, a jump that outpaces the doubling of Colossus’s GPU count and suggests that longer training runs or higher utilization, not just added hardware, account for part of the increase. By comparison, the step from GPT-4 to GPT-5 represented an estimated 5x compute increase.
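The reported figures can be compared directly with a few lines of arithmetic. The sketch below is illustrative only: it assumes Grok 3 was trained on the 100,000-GPU configuration and that per-GPU throughput was unchanged, neither of which the article states, and the function name is hypothetical.

```python
# Figures reported in the article.
GPUS_BEFORE = 100_000        # Colossus configuration in late 2025
GPUS_AFTER = 200_000         # configuration xAI confirmed for Grok 4
TRAINING_FLOPS_RATIO = 10.0  # Grok 4 vs. Grok 3, per xAI's disclosure


def residual_compute_factor(flops_ratio: float, gpus_before: int, gpus_after: int) -> float:
    """Return the share of the FLOPs increase not explained by GPU count.

    Assumption (not from the article): per-GPU throughput is identical in
    both runs, so any remaining factor must come from longer training time,
    better utilization, or a mix of both.
    """
    hardware_ratio = gpus_after / gpus_before
    return flops_ratio / hardware_ratio


print(residual_compute_factor(TRAINING_FLOPS_RATIO, GPUS_BEFORE, GPUS_AFTER))  # prints 5.0
```

Under those assumptions, GPU count accounts for a 2x gain and the remaining 5x would have to come from elsewhere.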

xAI has not disclosed the total capital deployed in Grok 4’s development, but the company raised $6 billion at an $80 billion valuation in March 2026, with that funding earmarked explicitly for model training and infrastructure.

Availability and Pricing

Grok 4 is available today to X Premium+ subscribers at no additional charge and through xAI’s enterprise API at $15 per million input tokens and $60 per million output tokens. That pricing sits above GPT-5’s standard tier but below the high end of Anthropic’s pricing for Claude Opus 4.7.
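To see what the quoted API rates mean per request, here is a minimal cost estimate. Only the two per-million-token rates come from the article; the function name and the example request sizes are hypothetical, and real invoices may include factors (caching, batch discounts, minimums) that this sketch ignores.

```python
# Rates quoted in the article for xAI's enterprise API, in USD per million tokens.
INPUT_USD_PER_MILLION = 15.0
OUTPUT_USD_PER_MILLION = 60.0


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the charge in USD for one request at the quoted rates."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    input_cost = input_tokens / 1_000_000 * INPUT_USD_PER_MILLION
    output_cost = output_tokens / 1_000_000 * OUTPUT_USD_PER_MILLION
    return input_cost + output_cost


# Hypothetical request: a prompt that fills the 2-million-token context window
# and a 4,000-token answer.
print(f"${estimate_cost_usd(2_000_000, 4_000):.2f}")  # prints $30.24
```

At these rates, a single prompt that fills the full context window costs about $30 in input tokens alone, so long-context workloads are dominated by the input price.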

An xAI spokesperson confirmed that Grok 4 will also be the underlying model powering Aurora, xAI’s AI assistant embedded in Tesla vehicles, with a phased rollout beginning in May 2026.

Competitive Context

The Grok 4 release accelerates an already compressed frontier-model release cycle. Anthropic released Claude Opus 4.7 on April 17, and OpenAI updated GPT-5 with expanded tool use in early April. Google’s Gemini Ultra 2 is reportedly in final internal testing, with a launch expected before Google I/O.

Analysts at Morgan Stanley noted in a Sunday research note that “the interval between frontier model releases has compressed from roughly 12 months in 2023 to approximately 6-8 weeks in Q2 2026, creating sustained pricing pressure and forcing enterprise buyers to delay multi-year contract commitments.” The report estimates that the AI model market now generates over $40 billion annually in API revenue across the major providers.

For xAI, benchmark leadership, even if temporary, matters as much for talent recruitment and partnership signaling as for direct revenue. The company currently employs approximately 2,400 researchers and engineers. Musk has stated publicly that xAI’s goal is to have the “most truth-seeking AI” rather than the most commercially conservative one, a positioning that resonates with enterprise buyers who have grown frustrated with content restrictions in competing models.

Lois Vance

Contributing writer at Clarqo, covering technology, AI, and the digital economy.
