When OpenAI shipped o3 and o4-mini in April 2025, the benchmarks landed like a thunderclap across the research community. A model that scores 91.6% on AIME 2024 — an invitational math competition whose hardest problems stump most graduate students — is no longer a curiosity. It is a signal that the frontier of AI reasoning has moved, fast.

The Numbers That Matter

OpenAI’s o3 achieved an unprecedented 87.5% on ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence), a benchmark specifically designed to resist pattern-matching. Previous models topped out around 50%. On SWE-bench Verified — a human-validated benchmark built from real GitHub issues — o3 resolved 71.7% of tasks autonomously, a resolution rate that rivals junior engineers on targeted tasks.

The smaller sibling, o4-mini, is not far behind. It hits 93.4% on AIME 2024 while running at a fraction of the compute cost, making it the most cost-efficient reasoning model OpenAI has released. Both models support multimodal reasoning with tool use — they can browse the web, write and execute code, and interpret images within a single chain of thought.

Why This Architecture Is Different

Earlier OpenAI models reasoned in a largely linear fashion: GPT-4o answered in a single pass, and even o1’s chain of thought mostly ran forward without revisiting earlier steps. The o3/o4 family uses what OpenAI describes as extended thinking with backtracking — the model can pause, evaluate competing approaches, discard dead ends, and restart a reasoning chain mid-task. This mirrors how a human mathematician works through a problem on a whiteboard.
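As a loose analogy only — OpenAI has not published the model's actual mechanism — the pattern it describes resembles classic depth-first search with pruning. The sketch below is illustrative code, not anything from OpenAI:

```python
# Illustrative analogy: a generic backtracking search that tries candidate
# steps, abandons dead ends, and resumes from earlier choices. This sketches
# the *pattern* described above, not the model's internals.

def backtrack(partial, candidates, is_solution, is_dead_end):
    """Depth-first search that prunes dead ends and backtracks."""
    if is_solution(partial):
        return partial
    for step in candidates(partial):
        extended = partial + [step]
        if is_dead_end(extended):
            continue  # prune: abandon this line of reasoning early
        result = backtrack(extended, candidates, is_solution, is_dead_end)
        if result is not None:
            return result
    return None  # nothing worked at this depth; caller backtracks further

# Toy use: find digits that sum to exactly 10 in at most three steps.
solution = backtrack(
    [],
    candidates=lambda p: range(1, 10),
    is_solution=lambda p: sum(p) == 10,
    is_dead_end=lambda p: sum(p) > 10 or len(p) > 3,
)
print(solution)  # → [1, 1, 8]
```

The key move — `continue` past a dead end rather than committing to it — is what distinguishes this from the single-pass generation of earlier models.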

The practical result is that the models are dramatically better at multi-step tasks where earlier steps constrain later ones: complex SQL queries, formal proofs, legal document analysis, and financial modeling. Early enterprise testers at Goldman Sachs and McKinsey have reportedly integrated o3 into internal research pipelines, according to people familiar with the deployments.

The Cost and Access Picture

OpenAI priced o3 at $10 per million input tokens and $40 per million output tokens — expensive relative to GPT-4o but competitive with Anthropic’s Claude Opus 4 for equivalent reasoning tasks. o4-mini sits at $1.10 / $4.40 per million tokens, making it accessible for high-volume applications.
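For budgeting purposes, those per-million-token rates translate directly into per-request costs. A minimal sketch, using only the prices quoted above (the helper function and token counts are invented for illustration, not part of any SDK):

```python
# Hypothetical cost estimator using the quoted rates:
# o3 at $10/$40 and o4-mini at $1.10/$4.40 per million input/output tokens.

RATES = {  # model -> (input $/1M tokens, output $/1M tokens)
    "o3": (10.00, 40.00),
    "o4-mini": (1.10, 4.40),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated dollar cost of a single request."""
    in_rate, out_rate = RATES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Example: a 5,000-token prompt with a 20,000-token reasoned answer.
print(round(estimate_cost("o3", 5_000, 20_000), 2))       # → 0.85
print(round(estimate_cost("o4-mini", 5_000, 20_000), 2))  # → 0.09
```

The roughly 9x gap per request is what makes o4-mini the default choice for high-volume pipelines, with o3 reserved for the hardest problems.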

Both models are available via API and through ChatGPT Pro. OpenAI has also enabled o3 within its Operator and Deep Research features, allowing autonomous multi-hour research tasks that string together dozens of tool calls with minimal human intervention.

Competitive Pressure Across the Industry

The release intensified an already heated race. Google’s Gemini 2.5 Pro had held the top slot on several coding benchmarks since its release; o3 has since surpassed it on a majority of public leaderboards. Anthropic’s Claude Opus 4 remains competitive on instruction-following and safety evaluations but trails on raw math performance.

The deeper competitive implication is structural: if reasoning models can reliably handle graduate-level STEM problems, the addressable market for AI in scientific research, drug discovery, and quantitative finance expands dramatically. OpenAI is betting its next growth wave lives there.

What Comes Next

OpenAI has indicated that o3-pro — a further scaled version with longer thinking budgets — is in internal testing. The company is also working on inference-time compute scaling, which would let users trade cost for accuracy by allocating more “thinking tokens” to harder problems. If the current trajectory holds, the gap between AI reasoning and human expert performance in structured domains could close within 18 months.
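The economics of that trade-off are linear: thinking tokens are billed as output tokens, so at o3's quoted $40 per million output tokens, each extra order of magnitude of deliberation costs ten times as much. A back-of-the-envelope sketch (the budget tiers are invented examples):

```python
# Hypothetical sketch of the cost side of inference-time compute scaling.
# Thinking tokens bill as output tokens at o3's quoted $40/1M rate; the
# specific budget tiers below are illustrative, not OpenAI's.

O3_OUTPUT_RATE = 40.00  # dollars per million output tokens

def thinking_cost(thinking_tokens: int) -> float:
    """Dollar cost of a given thinking-token budget at the o3 output rate."""
    return thinking_tokens * O3_OUTPUT_RATE / 1_000_000

for budget in (1_000, 10_000, 100_000):
    print(f"{budget:>7} thinking tokens -> ${thinking_cost(budget):.2f}")
```

Whether a $4.00 answer is worth forty $0.10 answers is exactly the cost-for-accuracy dial OpenAI wants users to be able to turn per problem.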

For enterprise buyers weighing adoption, the relevant question is no longer whether these models are capable. It is whether their internal workflows are ready for a co-pilot that sometimes outperforms the pilot.

Lois Vance

Contributing writer at Clarqo, covering technology, AI, and the digital economy.