Claude Opus 4.7
Anthropic
Default to Claude Opus 4.7 inside Claude Code for repo-scale edits. Codex is the strongest first-party OpenAI alternative for repo-scale work; pick Cursor instead when you need multi-model routing in the same workflow.
Directional ratings for the AI models people are actually using right now. Three buckets — code, image and video, writing and agents — refreshed after every release.
OpenAI
Sora 2 for video, Midjourney v7 for still images. Imagen 4 inside Gemini 3 is the best one-surface answer when you want both from a single account.
Anthropic
Claude Opus 4.7 leads for long-form editorial and stable agent loops. GPT-5.5 wins when you need broader tool integrations, especially around Microsoft and OpenAI surfaces.
Anthropic enables background agent tasks up to 60 minutes inside Claude Code.
Anthropic · Claude Opus 4.7
OpenAI exposes shareable tool registries for the Responses API.
OpenAI · GPT-5.5
Image generation inside Gemini 3 switches to Imagen 4 with stricter prompt adherence.
Google DeepMind · Gemini 3 Pro
R2 now batches independent tool calls in a single step.
DeepSeek · DeepSeek R2
Google replaces Gemini 2.5 Pro with Gemini 3 Pro across API and Vertex.
Google DeepMind · Gemini 3 Pro
Reasoning successor to R1, priced 60% below the GPT-5.5 reasoning tier.
DeepSeek · DeepSeek R2
GPT-5.5 replaces GPT-5 as the default model on chatgpt.com and the API.
OpenAI · GPT-5.5
Spark consolidates the Microsoft 365 and Windows assistant into one surface.
Microsoft · Copilot Spark
| Model | Provider | Code | Image / Video | Writing / Agents | Released |
|---|---|---|---|---|---|
| Claude Opus 4.7: Default for coding and agent loops in Claude Code. | Anthropic | 92 Leads SWE-Bench Verified and holds the longest reliable agent loops in coding. | — | 90 Top-tier long-form prose, strongest tool-use compliance across published evals. | |
| GPT-5.5: Strong all-rounder; image generation via integrated DALL-E successor. | OpenAI | 87 Competitive on SWE-Bench, weaker on multi-file refactors than Claude Opus 4.7. | 82 Best general-purpose image generation; video remains on the Sora 2 surface. | 86 Reliable agent runner, strong function calling, slightly looser editorial voice. | |
| Sora 2: Video-first; 60s coherent clips, audio bed. | OpenAI | — | 91 Best long-form video coherence and prompt adherence in published comparisons. | — | |
| Gemini 3 Pro: Strong multimodal context, integrated Imagen 4 and Veo 3 routing. | Google DeepMind | 84 Good repo-scale reasoning, behind Claude on agent loop stability. | 88 Imagen 4 + Veo 3 combination is the most consistent image-plus-video pair. | 84 Long-context wins on research-style tasks; weaker tool-call discipline. | |
| Copilot Spark: Wraps GPT-5.5 with Microsoft IDE and Office tooling. | Microsoft | 83 Best IDE-integrated experience; raw model behind Claude on hard tasks. | 70 Uses GPT image generation under the hood; trails dedicated image leaders. | 80 Solid for Office-tethered workflows; less flexible as a standalone agent. | |
| Grok 4: Real-time X integration, looser safety posture. | xAI | 78 Improving fast; SWE-Bench still trails the top three. | 74 Aurora image gen is competent; no first-party long-form video. | 76 Strong with real-time data, weaker on stable long-form structure. | |
| Llama 4 405B: Open-weights flagship; the practical pick for self-hosting. | Meta | 80 Best open-weights coder; closes the gap to GPT-5.5 on SWE-Bench Lite. | 68 Companion Emu 3 model lags Imagen 4 and Midjourney v7. | 78 Long-form is verbose but stable; tool-use is recent and improving. | |
| DeepSeek R2: Cheapest top-tier reasoning model in the matrix. | DeepSeek | 85 Excellent on competitive-programming and SWE-Bench Lite; weaker on long agent loops. | — | 80 Strong reasoning, terser prose; cost-effective for batch agents. | |
| Mistral Large 3: Strong European hosting and data-residency story. | Mistral | 76 Improved on Large 2; still behind the top three on agent loops. | — | 79 Tight, neutral prose; tool use is reliable but not best-in-class. | |
| GLM-5: Open-weights, strong on Chinese-language tasks. | Z.ai | 74 Competitive on HumanEval; weaker on English-language repo edits. | — | 72 Solid agent runner; English long-form trails the top tier. | |
| Kimi K2: Long-context specialist; up to 2M tokens. | Moonshot | 73 Long-context wins on repo-wide reading; raw editing trails the leaders. | — | 78 Excellent at synthesising large corpora; loose prose voice. | |
| Midjourney v7: Image only; no chat or tool use. | Midjourney | — | 90 Best aesthetic image fidelity; weaker on strict prompt adherence. | — | |
| Runway Gen-4: Video editing primitives plus generation. | Runway | — | 86 Strong on directed edits and motion control; behind Sora 2 on raw coherence. | — | |
OpenAI's first-party coding agent for CLI, IDE, cloud, and GitHub workflows.
Strongest agentic SWE-Bench results in the matrix.
Best multi-model IDE; lets you swap providers per task.
Tight VS Code and PR-review integration.
Best terminal-native option; strongest cost control.
VS Code extension with explicit plan/act split.
Cline fork with multi-mode workflow.
Directional ratings curated from public benchmarks, model cards, and hands-on use. Scores are 0–100 within a bucket, not across buckets. Each rating links to the primary evidence. Refreshed after any model release. Not authoritative; AI-assisted editorial.
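Because scores are only comparable within a bucket, one way to work with the matrix is to rank each bucket separately and skip models with no rating in that bucket. The sketch below is illustrative only: the scores are copied from a handful of rows in the table, while the dictionary layout, the bucket keys (`code`, `image_video`, `writing_agents`) and the `rank_bucket` helper are assumptions introduced for this example, not part of the site.

```python
from typing import Optional

# Scores copied from a few rows of the matrix; None marks a bucket shown as "—".
# The layout and the bucket key names are illustrative assumptions.
RATINGS: dict[str, dict[str, Optional[int]]] = {
    "Claude Opus 4.7": {"code": 92, "image_video": None, "writing_agents": 90},
    "GPT-5.5": {"code": 87, "image_video": 82, "writing_agents": 86},
    "Sora 2": {"code": None, "image_video": 91, "writing_agents": None},
    "Gemini 3 Pro": {"code": 84, "image_video": 88, "writing_agents": 84},
    "Midjourney v7": {"code": None, "image_video": 90, "writing_agents": None},
}


def rank_bucket(bucket: str) -> list[tuple[str, int]]:
    """Return (model, score) pairs for one bucket, highest score first.

    Models without a rating in the bucket are skipped, and scores are never
    compared across buckets.
    """
    rated = [
        (model, scores[bucket])
        for model, scores in RATINGS.items()
        if scores[bucket] is not None
    ]
    return sorted(rated, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for bucket in ("code", "image_video", "writing_agents"):
        print(bucket, rank_bucket(bucket))
```

Ranking per bucket keeps a 90 in image and video from being read as equivalent to a 90 in writing and agents.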