The Agent Illusion: The 5 Stages of AI Development and Where We Really Are

by Dan Roque | Reading Time: 12 minutes | In Future of Work

Let me guess.

You've seen the headlines about "reasoning models" that can solve PhD-level problems. You've watched the demos of AI moving cursors, clicking buttons, and "using computers like a human." You've heard "agentic AI" repeated so many times it's started to sound like meaningless corporate noise.

And somewhere in the back of your mind, you're wondering: Wait, didn't OpenAI say Level 1 was just chatbots? So, if we've got models that can reason and act... aren't we already past that?

You're right to ask. And the answer is both simpler and more complicated than the hype cycle wants you to believe.

So, let's step away from the doom-scrolling for a minute. Grab some chalk. Let's walk up to the virtual board and actually look at what's turning inside these machines right now—March 2026—because the gap between what's being said and what's actually true is wider than ever.

AI is a tool we need to learn, not a magic trick to fear. And the only way to stop fearing it is to understand where the gears are actually grinding.

And that is why OpenAI’s five-level framework is so strategically vital to our understanding of artificial intelligence. It isn’t just a checklist for engineers; it’s a tool for grounding our expectations in reality rather than hype.

Instead of drifting in a sea of buzzwords, this roadmap gives us a shared language to track progress. Our mission today is to peek "under the hood" at the actual mechanisms driving this evolution. We’re going to look past the chat box and see the architectural shifts moving us from software that talks to software that thinks—and eventually, software that builds organizations. To understand where we’re going, we have to start with the "toddlers" currently on our screens.

 

The Chalkboard Breakdown: Where We Really Stand

Level 1: The Smooth-Talking Toddlers

It makes perfect sense that conversational AI was our starting line. Language is the primary way humans share knowledge, so teaching a machine to speak "human" was the ultimate gatekeeper to more complex tasks.

Just a few years back, most of us were interacting with Level 1 systems. GPT-4, Claude 3.5 Sonnet, Gemini 1.5—these are statistical pattern-matchers wearing a trench coat pretending to be reasoning engines. When we ask this generation of large language models a question, it's not "thinking" in any human sense. It's doing something closer to linguistic superglue: predicting which word statistically should come next based on the trillion-ish examples it's already seen.

The mechanism? Transformer architecture + next-token prediction + internet-scale training data.
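To make "next-token prediction" concrete, here is a toy sketch. The bigram table, vocabulary, and scores below are all invented for illustration; a real transformer learns billions of such weights from data rather than reading them from a dict, but the generation loop itself really is this simple: score the candidates, turn scores into probabilities, sample, append, repeat.

```python
import math
import random

# Toy "model": hand-written bigram scores standing in for learned
# transformer weights. Everything in this table is invented.
BIGRAM_LOGITS = {
    "the":     {"capital": 2.0, "model": 1.0, "cat": 0.5},
    "capital": {"of": 3.0, "city": 0.5},
    "of":      {"canada": 2.0, "france": 1.5},
    "canada":  {"is": 3.0},
    "is":      {"toronto": 1.8, "ottawa": 1.5},  # the statistically "capital-y" trap
}

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    exps = {tok: math.exp(v) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: v / total for tok, v in exps.items()}

def generate(prompt_token, steps, rng):
    """The core loop: predict a distribution, sample one token, append."""
    out = [prompt_token]
    for _ in range(steps):
        logits = BIGRAM_LOGITS.get(out[-1])
        if logits is None:  # nothing learned for this context
            break
        probs = softmax(logits)
        # Sample the next token in proportion to its probability.
        out.append(rng.choices(list(probs), weights=list(probs.values()))[0])
    return out

print(" ".join(generate("the", 5, random.Random(0))))
```

Note the trap baked into the last table row: "toronto" outscores "ottawa" after "canada is", so the sampler will often emit the plausible-but-wrong completion. That is the hallucination mechanism in miniature.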

This is why these models are simultaneously brilliant and breathtakingly stupid. They can write a sonnet in the style of Emily Dickinson but still confidently tell you that the capital of Canada is Toronto (it's Ottawa, and the model knows this, but somewhere in its statistical weights, Toronto just feels more capital-y).

The hallucination problem isn't a bug we haven't fixed yet—it's a feature of the architecture. These systems are designed to produce plausible text, not true text. They're mimicking understanding, not demonstrating it.

Think of Level 1 as the world's most articulate toddler. Fluent, surprising, occasionally profound, and completely incapable of explaining why they just said what they said.

But let's state this clearly: We are not in Level 1 anymore.

Level 1 systems—pure next-token predictors, fluent but fundamentally reactive—have been superseded. GPT-4 and its 2023-2024 era contemporaries were the peak of this paradigm. They could write sonnets and summarize emails… but they could not think through a novel problem.

We've moved on.

Level 2: The PhD Without a Toolkit

Now we get to the transition everyone's excited about—the jump from "chatty" to "reasoning."

With the release of OpenAI's o1 series (codenamed "Strawberry") and subsequent models throughout 2024-2025, we achieved Level 2. And no, this isn't hype; it's documented performance. This is firmly where we stand at the time of writing (early 2026), with Claude Opus 4.6, Gemini 3 Pro, DeepSeek v3.2, Kimi K2.5, Grok 4.1, and others releasing in the Q4 2025 to Q1 2026 window to challenge OpenAI's ChatGPT 5.2 directly.

Here's what changed: Instead of just predicting the next word instantly, these models engage in private chain-of-thought reasoning. They talk to themselves before they talk to you. They allocate more "thinking time" to hard problems. They backtrack when they hit dead ends. They check their own work.

The mechanism: Test-time compute scaling. The model is given more compute at inference time, which it spends working through reasoning steps internally before committing to an answer.

The results are stark:

  • On the American Invitational Mathematics Examination (AIME), GPT-4o solved about 12% of problems, while o1 reached 74% with a single attempt and 83% with 64-sample consensus.
  • Graduate-level physics problems became solvable.
  • Complex code generation with multiple interdependent components became reliable.
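That "64-sample consensus" figure refers to self-consistency: sample many independent reasoning chains and take the majority answer, so that individually unreliable chains vote each other's mistakes down. A minimal sketch, with a simulated noisy solver standing in for the model (the 60% accuracy and the answers are invented numbers for illustration):

```python
import random
from collections import Counter

def solve_once(problem, rng):
    """Stand-in for one sampled reasoning chain. A real model would
    generate a chain of thought and a final answer; this fake solver
    is right 60% of the time and scatters its errors otherwise."""
    return "42" if rng.random() < 0.6 else rng.choice(["41", "43", "7"])

def consensus_solve(problem, n_samples, rng):
    """Self-consistency: sample n independent attempts, return the
    majority answer. Correct chains tend to agree with each other;
    wrong chains tend to disagree among themselves."""
    answers = [solve_once(problem, rng) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

print(consensus_solve("some hard problem", 64, random.Random(1)))
```

A solver that fails 40% of the time becomes near-perfect under majority vote, because the wrong 40% splits across many different wrong answers while the right 60% concentrates on one.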

But—and this is crucial—Level 2 systems are still brains in jars. They reason brilliantly over whatever knowledge was baked into their training. They cannot look anything up. They cannot take actions. They cannot use tools.

Imagine the world's most brilliant physicist, locked in a windowless room with no phone, no internet, no books, and no way to touch anything. That's Level 2. 

 

Level 3: The Intern Gets Hands

This is where 2026 gets interesting. And messy.

Agentic AI—systems that can take actions, use tools, and persist across time—is no longer theoretical. Anthropic's "Computer Use" feature, released for Claude in late 2024 and refined through 2025, demonstrated something unprecedented: an AI that could look at a screenshot of your computer, move the cursor, click buttons, and navigate interfaces exactly like a human would.

No special APIs. No integrations. Just visual understanding + action planning.

The mechanism: Visual language models (to "see" the screen) + tightly looped action planning (do something, observe result, adjust, repeat) + persistent context across sessions.
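That tightly looped cycle can be sketched in a few lines. Everything below (the toy Environment, the one-step planner, the step budget) is an invented stand-in, not any vendor's actual computer-use API; the point is the shape of the loop: observe, plan one action, execute, observe again.

```python
from dataclasses import dataclass

@dataclass
class Environment:
    """A fake 'screen': a form with one field and a submit button."""
    form_value: str = ""
    submitted: bool = False

    def observe(self):
        return {"form_value": self.form_value, "submitted": self.submitted}

    def act(self, action, arg=None):
        if action == "type":
            self.form_value = arg
        elif action == "submit" and self.form_value:
            self.submitted = True

def plan(observation, goal):
    """Toy planner: decide the single next action from what is visible."""
    if observation["submitted"]:
        return ("done", None)
    if observation["form_value"] != goal:
        return ("type", goal)
    return ("submit", None)

def run_agent(env, goal, max_steps=10):
    """The do-observe-adjust loop, with a step budget as a crude guardrail."""
    for _ in range(max_steps):
        action, arg = plan(env.observe(), goal)
        if action == "done":
            return True
        env.act(action, arg)
    return False  # budget exhausted: hand control back to the human

env = Environment()
print(run_agent(env, "Jane Doe"))  # → True once the form is filled and submitted
```

Note that the planner only ever commits to one action before re-observing. That re-observation step is what separates an agent from a script, and it is also where real systems go wrong: if the model misreads the screen, every subsequent action compounds the misreading.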

Early 2026 models—Claude Opus 4.6, GPT-5's agentic layers, Gemini 2.0's tool-use framework—have shown remarkable capabilities:

  • Navigating complex spreadsheets to extract and transform data
  • Running simulated businesses with multi-year strategies
  • Booking travel itineraries across multiple sites
  • Debugging code by actually running it, seeing errors, and fixing them

So… we're in Level 3, right?

Well, yesn’t. Here's where the framework breaks down.

 

The "So What?" Layer: The Messy Middle

We are not in Level 3. We are stumbling toward Level 3, with one foot firmly planted in Level 2, and the other foot discovering that the ground ahead is made of loose gravel.

Here's what that actually looks like:

The Reasoning Is PhD-Level. The Action Is Toddler-Level.

That Claude Opus 4.6 demo where it ran a vending machine business? Genuinely impressive. It developed sophisticated strategies—heavy early investment, then pivoting to profitability—and executed them over simulated years.

The same model, asked to fill out a moderately complex web form with non-standard formatting? It clicked the wrong button, got confused, and needed human intervention three times.

The mechanism disconnect: Reasoning and action use different parts of the architecture, and they're not fully integrated. The "brain" can plan. The "hands" are still learning to grip.

We Have No Error Recovery

When a Level 2 system makes a mistake, it produces wrong text. You notice, you correct, you move on.

When a Level 3 system makes a mistake, it might:

  • Delete the wrong database
  • Purchase $47,000 worth of office supplies you don't need
  • Lock you out of your own accounts
  • Book 47 tables at a restaurant
  • Or order you 260 McNuggets due to an error loop (This one actually happened. McDonald’s ended its AI drive-thru trial after viral failures — including one incident where the system kept adding Chicken McNuggets until the order reportedly hit 260)

The unsolved problem: How does a system know it's made a mistake? How does it recover? How does it explain what went wrong? These aren't engineering tweaks—they're open research questions.
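None of this has a clean answer yet. What teams actually deploy today is the naive mitigation sketched below: wrap irreversible actions in a hard resource cap and halt when the cap is hit, rather than letting mistakes compound. The class, item names, and prices are all invented for illustration; this is a guardrail, not error recovery.

```python
class GuardedExecutor:
    """Naive guardrail: a hard spend cap around purchase actions."""

    def __init__(self, spend_limit):
        self.spend_limit = spend_limit
        self.spent = 0.0
        self.log = []

    def purchase(self, item, cost):
        # Precondition check: refuse anything that would blow the budget.
        # The 260-McNuggets failure mode is an unbounded loop of
        # individually reasonable-looking additions.
        if self.spent + cost > self.spend_limit:
            self.log.append(f"BLOCKED: {item} (${cost:.2f}) exceeds limit")
            return False
        self.spent += cost
        self.log.append(f"OK: {item} (${cost:.2f})")
        return True

ex = GuardedExecutor(spend_limit=50.0)
for _ in range(260):                      # a runaway "add another order" loop
    if not ex.purchase("10 pc McNuggets", 4.49):
        break                             # halt instead of compounding
print(ex.spent)  # never exceeds the $50 cap
```

Notice what the guardrail does not do: it cannot tell the system *why* it kept ordering, cannot undo the orders already placed, and cannot distinguish a runaway loop from a legitimate bulk purchase. Those are the open research questions.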

Goal Specification Is Harder Than It Looks

Tell a Level 3 agent: "Make my business more profitable."

But… what does that mean? Cut costs? Raise prices? Fire underperformers? Invest in marketing? All of the above? In what order? With what constraints?

Current systems have no mechanism for resolving this ambiguity. They'll pick one interpretation, execute it, and optimize for it, often with consequences no one anticipated.
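One partial mitigation is to force the ambiguity into the open before the agent runs: replace the one-line goal with a structured specification that makes the objective, hard constraints, and approval boundaries explicit. A hypothetical sketch; every field name here is invented, not any real framework's API.

```python
from dataclasses import dataclass, field

@dataclass
class GoalSpec:
    """What 'make my business more profitable' has to become
    before an agent can act on it safely. All fields hypothetical."""
    objective: str
    hard_constraints: list = field(default_factory=list)
    forbidden_actions: list = field(default_factory=list)
    max_budget: float = 0.0
    requires_human_approval: list = field(default_factory=list)

goal = GoalSpec(
    objective="Increase net margin by 5% within two quarters",
    hard_constraints=["No layoffs", "No price increase above 3%"],
    forbidden_actions=["cancel_vendor_contracts"],
    max_budget=25_000.0,
    requires_human_approval=["any single spend over $1,000"],
)
print(goal.objective)
```

Writing this spec is exactly the "goal decomposition" skill discussed later in this piece: the hard part isn't the data structure, it's knowing which constraints you forgot to state.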

The timeline is real, but the reliability isn't.

 

Levels 4 & 5: The Speculation Zone

Let's be brutally honest about what's ahead.

Level 4 (Innovative AI): Systems that generate genuinely novel hypotheses and inventions—does not exist. Not partially. Not almost. It doesn't exist. We have systems that are brilliant at recombination—taking existing ideas and combining them in new ways—but that's not true innovation. That's sophisticated collage. Genuine novelty requires something we don't know how to architect. Maybe it emerges from scale. Maybe it requires new paradigms. Maybe it's farther away than anyone wants to admit.

Level 5 (Organizational AI): Systems running entire companies autonomously—is even more speculative. The alignment problem at this scale isn't remotely close to being solved; it's not even well-defined. Ensuring that a network of agents making thousands of interdependent decisions over literal years reliably pursues human-aligned goals? That's not an engineering challenge. That's a fundamental unknown.

Timeline estimates are all over the map: optimists say 5-10 years, the mainstream says 20-30, skeptics say 50+, and the brutal answer is "never," at least with current paradigms. The honest answer? We don't know. Anyone who tells you differently is selling something.

 

So... What is the Roadmap For?

OpenAI's five-level framework is useful. It gives us a vocabulary. It helps us talk about progress.

But it's not really a roadmap anymore. It's a sketch. And the sketch is already outdated.

We are not cleanly in any level. We're straddling the gap between Level 2 and Level 3. The staircase is painted on fog: the steps exist, but we're still figuring out where they actually lead.

That's not a reason to panic. It's a reason to be precise.

The skills that matter right now aren't the ones the hype cycle talks about. They're quieter, less glamorous, and more important:

  • Reasoning supervision: Can you actually verify a complex AI output — or are you just trusting it because it sounds confident?
  • Goal decomposition: Can you break a vague business objective into steps specific enough for an agent to execute without going sideways?
  • Error handling: Can you design workflows that catch mistakes before they compound?
  • Judgment: Do you know when to let the system run and when to put your hand on the brake?

This is what it means to work alongside AI in 2026 — not delegation, not fear, but informed collaboration with a technology that is simultaneously more capable and more fragile than the headlines suggest.

The emergency stop exists for a reason. Learn where it is.

 

Works Cited

Anthropic. "Introducing Claude Sonnet 4.6." Anthropic, 17 Feb. 2026, www.anthropic.com/news/claude-sonnet-4-6. Accessed 10 Mar. 2026.

Cook, Jodie. “OpenAI’s 5 Levels of ‘Super AI’ (AGI to Outperform Human Capability).” Forbes, 16 July 2024, www.forbes.com/sites/jodiecook/2024/07/16/openais-5-levels-of-super-ai-agi-to-outperform-human-capability/. Accessed 10 Mar. 2026.

Crossa, Sebastian. "GPT-5.3 Codex: Agentic AI & The Future of Coding." LLM Stats, 5 Feb. 2026, llm-stats.com/blog/research/gpt-5-3-codex-launch. Accessed 12 Mar. 2026.

Duenas, Tom, and Diana Ruiz. The Path to Superintelligence: A Critical Analysis of OpenAI’s Five Levels of AI Progression. Preprint, 25 Aug. 2024, ResearchGate, www.researchgate.net/publication/383395776_The_Path_to_Superintelligence_A_Critical_Analysis_of_OpenAI%27s_Five_Levels_of_AI_Progression. Accessed 10 Mar. 2026.

Kim, Takyoung, et al. "ReIn: Conversational Error Recovery with Reasoning Inception." arXiv.org, 19 Feb. 2026, arxiv.org/abs/2602.17022. Accessed 11 Mar. 2026.

Leijtens, Remon. “The 5 Stages of AI.” Remon Design, 20 Mar. 2025, www.remon.design/the-5-stages-of-ai/. Accessed 11 Mar. 2026.

"OpenAI o1 Large Language Model Outperforms GPT-4o, Gemini 1.5 Flash, and Human Test Takers on Ophthalmology Board–Style Questions." ScienceDirect, vol. 5, no. 6, Nov.-Dec. 2025, www.sciencedirect.com/science/article/pii/S2666914525001423. Accessed 12 Mar. 2026.

OpenAI. “Planning for AGI and Beyond.” OpenAI, 24 Feb. 2023, openai.com/index/planning-for-agi-and-beyond/. Accessed 12 Mar. 2026.

OpenAI. Staying Ahead in the Age of AI: A Leadership Guide. PDF file, n.d., OpenAI, cdn.openai.com/pdf/ae250928-4029-4f26-9e23-afac1fcee14c/staying-ahead-in-the-age-of-ai.pdf. Accessed 12 Mar. 2026.

"OpenAI推出升级版o1-pro模型:更强推理能力但定价高昂 [OpenAI Launches Upgraded o1-pro Model: Stronger Reasoning Capabilities but High Pricing]." ZOL.com.cn, 20 Mar. 2025, ai.zol.com.cn/962/9621535.html. Accessed 12 Mar. 2026.

"OpenAI 时间表公开:2026-2028,打法彻底换了 [OpenAI Timeline Revealed: 2026-2028, Strategy Completely Changed]." 36Kr, 29 Oct. 2025, www.36kr.com/p/3529519807518854. Accessed 12 Mar. 2026.

raia AI. “Understanding OpenAI’s Five-Level System on the Path to AGI.” raia AI, n.d., www.raiaai.com/blogs/understanding-openai-s-five-level-system-on-the-path-to-agi. Accessed 13 Mar. 2026.

Rusanov, Andriy. "In a One-Year Simulation, Claude Opus 4.6 Operated a Vending Machine Better Than ChatGPT 5.2 and Gemini 3." Mezha, 11 Feb. 2026, mezha.ua/en/news/claude-opus-4-6-vending-308487/. Accessed 14 Mar. 2026.

Sanders, Hank. “260 McNuggets? McDonald’s Ends A.I. Drive-Through Tests Amid Errors.” The New York Times, 21 Jun. 2024, nytimes.com/2024/06/21/business/mcdonalds-ai-drive-thru-white-castle.html. Accessed 14 Mar. 2026.

Sherry, Ben. “5 Steps That OpenAI Thinks Will Lead to Artificial Intelligence Running a Company.” Inc.com, 23 July 2024, www.inc.com/ben-sherry/5-steps-that-openai-thinks-will-lead-to-artificial-intelligence-running-a-company.html. Accessed 14 Mar. 2026.

Solis, Brian. “AInsights: OpenAI Defines Five Stages to Track Progress Toward Human-Level Intelligence.” Brian Solis, 6 Aug. 2024, briansolis.com/2024/08/ainsights-openai-defines-five-stages-to-track-progress-toward-human-level-intelligence/. Accessed 14 Mar. 2026.

Swayne, Matt. “OpenAI’s Five Levels of AI – And Where Are We Now?” The AI Insider, 12 July 2024, theaiinsider.tech/2024/07/12/what-are-openais-five-levels-of-ai-and-where-are-we-now/. Accessed 15 Mar. 2026.

"Release Notes." Google AI for Developers, ai.google.dev/gemini-api/docs/changelog. Accessed 15 Mar. 2026.

 


 





