The State of the AI Models — My Rankings as of May 2026

I have been gone for a while. Not because I ran out of things to say — because I was heads-down building, and the building won.

Here is what "heads-down" actually looked like, in the most general terms I can give you: about fifty days of near-continuous building, and at the end of it a body of work — thousands of commits, dozens of self-running automations, hundreds of pages — that not long ago would have been a multi-year roadmap for an entire team. One person. Call it five years of work in fifty days.

I am not writing that to flex. I am writing it because I have been doing this for a year and that compression still stopped me cold. This is the part of the AI story the benchmark charts miss completely. The models did not just get a little smarter this past year; they crossed the line where a single person can hold the surface area that used to require a department. That is the multiplier effect, and once you feel it on a real project it is hard to unfeel.

And to be clear, it did not happen overnight. It took me two or three months of steady AI tinkering before the light bulb went on for what had been a straightforward publishing business: months of friction first, where the tools were a novelty and not much else, before they tipped over into real leverage. The multiplier is real, but you have to earn it before it shows up. The honest reaction, the one I keep landing on, is that I am not even sure what to think. Somewhere between proud and unsettled. That is the feeling I wanted to share, and it is the most useful possible context for this month's rankings.

I am keeping the what and the where under wraps for now — I am not ready to show my hand, and I will pull the curtain back with more detail when the time is right. Although, if you have ever poked around my portfolio, you can probably already guess what I have been pouring myself into. 😉 For this post, the project is just the proving ground. The models are the point.

Because here is the thing about this particular ranking post: I did not read about these models in May. I built with them, at volume, until they broke and I fixed it and they broke again. The opinions below are not benchmark-watching. They are fifty days of load-testing the entire frontier against real, shipping work — and most of that work ran through Claude Code.

Same format as always. Eight models. Four tiers. Here is where things stand.

The S Tier: Claude and ChatGPT

Claude is still my number one, and after the sprint I just finished, the gap is wider in my head than it has ever been on paper.

Opus 4.7 was the engine. The architecture-heavy pieces — the background automations, the AI assistant, the self-generating daily audio, the schedulers and data pipelines — all went to Opus 4.7, and it held context across multi-file, multi-day work in a way no other model I tried could match. Sonnet 4.6 stayed the daily driver for the routine 80%: fast, cheap, good enough. The new wrinkle this month is that Fast mode now supports Opus 4.7, so on the days when I was pushing dozens of commits an hour, I could keep the deep model and stop waiting on tokens. That mattered more than it sounds.

The off-screen news was loud too. Andrej Karpathy, a founding member of OpenAI, announced he is joining Anthropic, which is the kind of talent move that tells you where the serious builders think the center of gravity is. Anthropic also shipped Claude for Small Business (Claude wired into QuickBooks, HubSpot, Canva, Google Workspace, and the rest) and expanded the PwC alliance toward hundreds of thousands of professionals, and the no-ads promise held through another quarter. Claude builds my software, my workers, my podcast, and most of this site. That has not changed in a year, and after this sprint I am more certain of it, not less.

ChatGPT remains a strong number two, and GPT-5.5 is still the reason it is not number three.

For overflow, for a second opinion, for handing a task to a different brain and seeing what comes back, GPT-5.5 was excellent all month. The agentic and tool-use gains from April held up under real load. In May, OpenAI's news skewed toward security — Daybreak, Codex Security, and a GPT-5.5 "Trusted Access for Cyber" tier aimed at authorized red-teaming — which is a genuinely interesting direction. The 900-million-weekly-user ecosystem is still the largest in AI and Microsoft's Copilot stack is still the deepest enterprise integration anyone ships.

But the trust math from the last two months did not move. The Pentagon contract is still in force. Ads for free users are still in force. And the Musk v. Altman trial gave us the spectacle of Musk testifying under oath that xAI "partly" distills OpenAI's models — entertaining, but not a point in anyone's favor. If you are choosing between Claude and ChatGPT and capability is close, the tiebreaker is still whose long-term incentives you trust. It keeps landing in the same place.

The A Tier: Gemini and Llama

Gemini had the biggest month of anyone, and it still did not change my order.

Google I/O 2026 was the event of May. Gemini 3.5 Flash went generally available — frontier-class intelligence at Flash speed and price, and it actually beats Gemini 3.1 Pro on several coding and agentic benchmarks. Gemini 3.5 Pro is in testing and lands next month. Gemini Omni — reasoning in, real video out — is a legitimately new capability. AI Mode in Search now runs on 3.5 Flash, and the Antigravity agent platform got a real upgrade. On paper this is the most tempting Gemini has ever been for building.

So I tested it. During the sprint I threw real production code and real multi-file refactors at 3.5 Flash, because if anything was going to close the gap with Claude for the kind of work I do, this was the candidate. And it did close some of it — the speed is unreal and the first-draft quality is up. But on the long, stateful, "understand the whole system before you touch it" tasks, Claude still finished and Gemini still drifted. Brainstorming, image and now video generation, deep Workspace context — A Tier, no argument. Building real systems solo — not yet, but this is the closest it has gotten, and I will keep testing.

The Apple-Siri story is still the Apple-Siri story. Every window I have written about has slipped, and May did not deliver the clean "Gemini powers Siri" moment either. At this point the September iOS 27 target is the only date worth watching, and even that I will believe when I see it.

Llama holds. Behemoth still has not shipped publicly. Meta keeps deploying Llama across every surface it owns — billions of users on one open model family — but the public flagship story has not changed since February. Enormous deployment footprint, frontier flagship still not in the room.

The B Tier: Mistral and Perplexity

Mistral keeps owning its question: what do you reach for when you care about cost, open-source licensing, and European data sovereignty? Mistral Small 4 is still the best open-source value at the frontier, and Forge — custom models on a company's own data — keeps showing up in real deployments. For anyone building for a European market, it is still the default.

Perplexity earned its keep during the sprint specifically. When I needed to get a pile of country-by-country travel and regulatory facts right across a batch of new pages, Deep Research on Opus was the surface I trusted to get me started and cite its sources. Comet is a genuinely good browser, Perplexity Computer has graduated from beta to tool in my head, and they held the no-ads line another quarter. Still my number one for research, still pulling away.

Both shops run narrow playbooks and run them well. In a market this size, that is a winning position, not a consolation prize.

The Nope Tier: Grok and DeepSeek

Grok did not get better in the only way that would matter.

The deepfake and CSAM litigation against xAI is now consolidating in federal court — the initial case management conference is set for June 18 in the Northern District of California — and the underlying allegations have not gotten less ugly with time. xAI's answer continues to be more capital and more velocity, not more engagement with the questions on the table. The model improves on the technical axis. The platform has not earned back a cent of the trust it spent. I am not running it for anything that matters, and I would not tell you to either.

DeepSeek's V4 is somehow still pending — we are now many missed windows deep. The distillation accusations from OpenAI and Anthropic still stand, the CCP-level censorship is still baked in at the model layer, and the security findings are still unaddressed. The technical promise is real and the trust math has not changed. If you handle sensitive data or serve users who care about censorship, it is the wrong answer regardless of what V4 looks like when it finally ships.

What Changed This Month

Two things, in order of importance.

First: the frontier took a breath at the very top, and the action moved underneath it. April's intelligence ceiling held into May, with no new flagship blowing past it, so the real competition shifted to architecture, efficiency, and product defaults. Look at what shipped: Gemini 3.5 Flash, Claude for Small Business, Fast mode on Opus 4.7, AI Mode in Search. The headline is no longer "biggest model." It is "best model in the place you actually work."

Second, and quieter: talent moved. Karpathy to Anthropic is not a press cycle — it is a signal about where the builders want to build. Watch the people, not just the benchmarks.

And the personal change, the one that reframed all of it for me: I stopped evaluating these models this month and started living inside them for fifty days straight. The order barely moved. My conviction about the order moved a lot. When you build with a tool at that volume, you stop having opinions about it and start having a relationship with it. That is the only kind of ranking I know how to write.

The Updated Rankings

The full rankings — with the "Who Powers What" ecosystem map, the political bias table, and links to every model — live at johncderrick.com/ai-models. I update it monthly.

If you want to stop reading about these models and start building real systems with them, the Prompt Library has ten copy-paste prompts for AI assistants, email triage, knowledge bases, CRMs, and more. They work across Claude, ChatGPT, and Gemini. The fifty days I just described started with prompts a lot like those.

The Protocol: I use these models every day, all of them. The rankings are not theoretical — they are operational, and this month they are about as operational as it gets. Claude did the building. ChatGPT handled the overflow. Gemini brainstormed and got auditioned for the real work. Perplexity did the research. The full rankings live at johncderrick.com/ai-models and the prompts to put them to work live at johncderrick.com/prompts. Check the dates. If they are current, so are the opinions.
