New month, new ratings. And for the first time since I started writing these, the top of my table is a tie.
Here is what changed. The Information ran a piece on June 1, "Inside OpenAI's Decision to Combine Codex and ChatGPT," and reading it was a strange experience, because it described, in OpenAI's own internal language, something I had already felt in my own commits over the last few weeks. Codex has pulled back ahead of Claude Code for me. Just barely. By a hair I would not bet a dollar on twice. But it pulled ahead, and I am not going to pretend otherwise just because Claude has held my number-one spot for a year.
So this month I want to do two things. First, be honest about the shift. Second — and this is the part that actually matters — tell you what I have learned about using these two together, because that is where the real leverage has been hiding the whole time.
Same format as always. Eight models. Four tiers. The only difference is that the S Tier is now genuinely level at the top.
The S Tier: Claude and ChatGPT — Tied at #1
I stopped trying to crown one of them this month. After enough hours, the question "which is better" started returning the wrong answer, because the honest answer is "the two of them, pointed at the same problem, in the right order."
Let me give you the division of labor I have actually settled into, because it is specific and it is earned:
Claude Code is the better engineer and architect. When a project is a blank page — when I need the initial framing, the data model, the file structure, the "understand the whole system before you touch it" first pass — Claude Code is who I hand it to, and it is not close. Opus 4.7 still holds context across multi-file, multi-day work better than anything else I run. It is the model I trust to set the foundation that everything else stands on.
Codex is the better designer, UX hand, and bug-checker. Once the bones exist, Codex 5.5 is the one I reach for to make it feel right — the spacing, the states, the edge cases, the "this button is technically correct but emotionally wrong" details. It is also a relentless bug-finder; it will spin up its own environment, run the thing, and come back with the problem instead of a guess. The Information quoted developers calling it exactly that — "relentless" — and that matched my experience to the word.
And here is the move that made the combo click: Claude Code is excellent at cleaning up Codex's UX mistakes. Codex occasionally over-designs, or solves the wrong half of a layout problem with great confidence. Hand that mess back to Claude with clear intent and it untangles it cleanly. The two models cover each other's blind spots, and I have shipped some of the best work of my year by letting them argue.
So why a tie and not a flip? Because the capability edge on raw coding went to Codex this month, but the tiebreaker I have used all along, whose long-term incentives I trust, still lands with Anthropic. No ads. The political even-handedness work. Claude Code and Claude Cowork stay paid-only, which keeps the incentives clean. Capability says Codex by a hair; trust says Claude by a hair. Net it out and you get the most honest thing I can write: a tie, and a recommendation to use both.
The news under all this was loud. OpenAI is folding Codex, ChatGPT, and its Atlas browser into a single desktop "superapp," shipping in the coming weeks — a bet that the center of gravity has moved to Codex and they want it in front of all 900-million-plus weekly users. The numbers back the bet: Codex passed five million weekly active users at the end of May, after four million a month earlier and three million two weeks before that, and it overtook Claude Code in Google search interest in mid-May for the first time. Greg Brockman reportedly told staff that Codex's enterprise revenue was growing 50% week over week; Sam Altman put overall Codex usage at 5% growth per day. OpenAI even reorganized internally around it, merging the Codex and ChatGPT teams under one "core product and platform" group.
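Those two growth figures are quoted on different clocks, so here is a quick back-of-the-envelope way to put them on the same one. This is a minimal sketch, not anything from OpenAI or The Information: it assumes steady compounding, which neither quote states, and the function names are my own. Keep in mind that the 50% figure is enterprise revenue and the 5% figure is overall usage, so they are not measuring the same thing.

```python
# Back-of-the-envelope conversion between daily and weekly growth rates.
# Assumption (not in the reporting): growth compounds at a steady rate.
# Function names are hypothetical helpers for illustration only.

def compound(rate: float, periods: int) -> float:
    """Total growth after compounding `rate` for `periods` periods."""
    return (1 + rate) ** periods - 1


def per_period(total: float, periods: int) -> float:
    """Steady per-period rate that compounds to `total` over `periods` periods."""
    return (1 + total) ** (1 / periods) - 1


daily_usage_growth = 0.05      # "5% growth per day" (overall Codex usage)
weekly_revenue_growth = 0.50   # "50% week over week" (Codex enterprise revenue)

# What 5% per day looks like over a 7-day week.
weekly_from_daily = compound(daily_usage_growth, 7)

# What 50% per week looks like as a daily rate.
daily_from_weekly = per_period(weekly_revenue_growth, 7)

print(f"5% per day compounds to about {weekly_from_daily:.1%} per week")
print(f"50% per week works out to about {daily_from_weekly:.1%} per day")
```

Under that assumption, 5% a day compounds to roughly 41% a week, and 50% a week is just under 6% a day. In other words, the two quotes describe growth of about the same steepness, even though one is revenue and the other is usage.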
On the other side of the ledger, Anthropic's revenue is doing its own quiet talking — more than $47 billion annualized as of last month, roughly five times where it started the year, with Claude Code's popularity cited as a big reason. Both companies are racing toward a public offering, and both need the capital. Neither is slowing down. That is exactly the environment where a tie is the truthful call.
The A Tier: Gemini and Llama
Gemini holds its spot, and the May momentum carried into June. Gemini 3.5 Flash is still doing the impossible-seeming thing, frontier-class intelligence at Flash speed and price, and Gemini 3.5 Pro is now coming out of testing. For brainstorming, for image and video generation, for anything sitting inside Google Workspace, it is A Tier without an argument.
But I will repeat what I said last month, because the month did not change it: on the long, stateful, build-the-whole-system tasks, Gemini still drifts where Claude and now Codex finish. It is the closest it has ever been. It is not yet in the combo I described above. And the Apple-Siri story is still the Apple-Siri story — every window has slipped, and iOS 27 in September is the only date left worth watching.
Llama holds. Meta keeps deploying it across every surface it owns — one open model family in front of billions of users — but Behemoth still has not shipped publicly, and the public flagship story has not moved since February. Enormous footprint, flagship still not in the room.
The B Tier: Mistral and Perplexity
Mistral keeps owning its question: what do you reach for when you care about cost, open-source licensing, and European data sovereignty? Mistral Small 4 is still the best open-source value near the frontier, and Forge — custom models on a company's own data — keeps turning up in real deployments. For anyone building for a European market, still the default.
Perplexity stays my number one for research and keeps pulling away. When I need to get a pile of facts right across a batch of pages and cite the sources, Deep Research on Opus is still the surface I trust to start. Comet is a genuinely good browser, Perplexity Computer has graduated from beta to tool in my head, and they held the no-ads line another quarter.
Both shops run narrow playbooks and run them well. In a market this size, that is a winning position, not a consolation prize.
The Nope Tier: Grok and DeepSeek
Grok did not get better in the only way that would matter. The deepfake and CSAM litigation against xAI is now in federal court, with the initial case management conference set for June 18 in the Northern District of California, and the underlying allegations have not gotten less ugly with time. xAI's answer is still more capital and more velocity, not more engagement with the questions on the table. The model improves on the technical axis. The platform has not earned back a cent of trust. I am not running it for anything that matters, and I would not tell you to either.
DeepSeek's V4 is somehow still pending — we are now many missed windows deep. The distillation accusations from OpenAI and Anthropic still stand, the CCP-level censorship is still baked in at the model layer, and the security findings are still unaddressed. The technical promise is real and the trust math has not changed. If you handle sensitive data or serve users who care about censorship, it is the wrong answer regardless of what V4 looks like when it finally ships.
What Changed This Month
Two things, in order of importance.
First: the coding-tool race genuinely flipped, by a hair, and then immediately taught me that the flip was the wrong thing to fixate on. Codex pulled ahead of Claude Code on raw coding — the usage numbers, the search interest, my own commits, and The Information's reporting all point the same way. But the moment I stopped asking "which one" and started running both in their best roles, I got more out of the pair than I ever got out of either as a solo number one. The headline is not "Codex won." The headline is "use both, in order."
Second: OpenAI is consolidating, and the superapp is the tell. Folding Codex, ChatGPT, and Atlas into one product is a bet that the agentic coding surface is the front door to everything else — knowledge work, not just code. Watch whether that focus sharpens the product or blurs it. Anthropic is making the opposite bet: keep Claude Code and Cowork clean, paid, and aimed at builders. Two coherent strategies, pulling apart. That divergence is the thing to watch for the rest of the year.
And the personal change: I retired the idea that this series has to name a single winner. Some months it will. This month the truthful answer is a tie at the top and a workflow that uses both. When you build with these tools at volume, you stop collecting opinions about them and start building a relationship with each one. The relationship I have now is with the pair.
The Updated Rankings
The full rankings — with the "Who Powers What" ecosystem map, the political bias table, and links to every model — live at johncderrick.com/ai-models. I update it monthly.
If you want to stop reading about these models and start building real systems with them, the Prompt Library has ten copy-paste prompts that work across Claude, ChatGPT, and Gemini. The combo workflow I described above started with prompts a lot like those.
The Protocol: I use these models every day, all of them. The rankings are not theoretical — they are operational. This month Claude Code framed and built, Codex designed and bug-checked, Claude cleaned up the UX, Gemini brainstormed, and Perplexity did the research. The full rankings live at johncderrick.com/ai-models and the prompts to put them to work live at johncderrick.com/prompts. Check the dates. If they are current, so are the opinions.