The headlines have been breathless: AI agents will transform knowledge work, automate professional services, replace junior analysts. Vendors promise autonomous systems that can handle complex workflows end-to-end. Investment is pouring into agentic AI at unprecedented rates.
Then comes the reality check.
In January 2026, researchers at Mercor released APEX-Agents, the most rigorous benchmark yet for measuring AI agent performance on real professional work. They tested leading models—GPT-5.2, Gemini 3, Claude Opus 4.5—on tasks drawn from investment banking, management consulting, and corporate law, designed by professionals from firms like Goldman Sachs and McKinsey.
The results? The best-performing model achieved 24% accuracy. Even with multiple attempts, no model exceeded 40%.
Every AI lab, effectively, received a failing grade.
What the Benchmark Actually Measures
APEX-Agents isn’t another academic exercise testing whether models can answer trivia questions or write poetry. It measures whether AI agents can do economically valuable work—the kind of tasks that junior and mid-level professionals perform daily.
The benchmark comprises 480 tasks across 33 detailed project environments. Each environment includes roughly 166 files and provides access to realistic workplace tools: documents, spreadsheets, presentations, PDFs, email, calendars, and code execution capabilities. A typical task takes a skilled human professional between one and two hours to complete.
These aren’t simplified versions of real work. They’re actual work: analysing financial models, drafting legal memoranda, preparing client presentations, synthesising information scattered across multiple sources. The kind of messy, context-dependent tasks that define knowledge work.
And that’s precisely why the models struggle.
Why AI Agents Fail at Knowledge Work
I’ve spent two decades consulting for professional services firms, and the benchmark results align with what I’ve observed in the field. The gap between AI demos and production reality isn’t about raw intelligence—it’s about context, ambiguity, and sustained attention.
The context problem is fundamental. Workplace information is scattered across tools, threads, and documents. An investment banking analyst might need to cross-reference a financial model in Excel, a term sheet in a PDF, email threads discussing deal terms, and Slack messages clarifying client preferences. Humans navigate this effortlessly, building mental models that connect disparate information sources. AI agents lose the thread.
Ambiguity compounds the challenge. Professional work constantly requires judgment calls. When a client’s instructions are unclear—and they often are—experienced professionals know how to interpret intent, ask clarifying questions, and make reasonable assumptions. AI agents either halt when instructions are ambiguous or, worse, proceed with interpretations that “technically comply with inputs while undermining organizational objectives.”
Multi-step reasoning introduces cascading failures. Every additional step in a workflow creates another opportunity for error. Given enough steps, small mistakes accumulate into unusable outputs. Researchers describe this as the fundamental reliability problem: “In systems as unreliable as today’s LLMs, every additional step introduces another chance for failure.”
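A quick back-of-envelope illustration makes the compounding visible. The per-step accuracy below is an assumption for illustration, not a figure from the benchmark; the point is simply that even a step which succeeds 95% of the time drags end-to-end reliability down quickly once a workflow stretches to dozens of steps.

```python
# Sketch: how per-step reliability compounds across a workflow.
# Assumes steps succeed or fail independently -- a simplification,
# but it shows why long chains of "pretty reliable" steps still
# produce unreliable end-to-end results.

def end_to_end_success(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step in the chain succeeds."""
    return per_step_accuracy ** steps

for steps in (5, 10, 20, 40):
    print(f"{steps:>2} steps at 95% per step: "
          f"{end_to_end_success(0.95, steps):.0%} end-to-end")
# 5 steps  -> ~77%
# 10 steps -> ~60%
# 20 steps -> ~36%
# 40 steps -> ~13%
```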
The Governance Reality Check
Beyond pure capability, the benchmark results underscore a governance challenge that many organisations are only beginning to appreciate.
According to recent surveys, 52% of senior leaders cite security and compliance concerns as the main barrier to deploying AI agents. Rather than rushing to automate, 69% of organisations still require humans to verify AI decisions before execution.
That caution looks increasingly justified. When agents achieve only 24% accuracy on professional tasks, the cost of verification—having humans check every output—approaches the cost of simply having humans do the work themselves. The productivity gains promised by automation evaporate if you need to review everything the agent produces.
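To make that calculus concrete, here is a rough sketch of the "agent drafts, human verifies" economics. Every number in it is an illustrative assumption rather than benchmark data, but the shape of the result holds: at low accuracy, review plus rework consumes most of the time the agent was supposed to save.

```python
# Back-of-envelope cost model for "agent drafts, human verifies".
# All hour figures are illustrative assumptions, not benchmark data.

def expected_human_hours(accuracy: float,
                         review_hours: float = 0.5,  # check any agent output
                         rework_hours: float = 1.2   # fix a wrong output
                         ) -> float:
    """Expected human hours per task when an agent drafts first."""
    return review_hours + (1 - accuracy) * rework_hours

baseline = 1.5  # hours if a human simply does the task themselves
for acc in (0.24, 0.50, 0.80):
    print(f"accuracy {acc:.0%}: {expected_human_hours(acc):.2f}h "
          f"vs {baseline:.1f}h doing it manually")
# At 24% accuracy the expected human time (~1.41h) is close to the
# 1.5h of just doing the work; at 80% it falls to ~0.74h.
```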
The failures of 2025’s agent deployments weren’t edge cases. They were structural. Agents acted in ways that couldn’t be explained, constrained, or reliably corrected. That ambiguity was tolerable in demos and pilots. It’s no longer tolerable when agents operate inside real workflows, regulated environments, and customer-facing processes.
Where Agents Actually Work
The benchmark results don’t mean AI agents are useless. They mean we need more precision about where agents create value and where they don’t.
Structured, repeatable tasks with clear success criteria remain strong candidates. Extracting specific data from standardised documents. Formatting outputs according to defined templates. Executing well-defined workflows with predictable inputs. These are the domains where agents consistently perform well—and where the 24% benchmark doesn’t apply because the tasks are fundamentally different from open-ended professional judgment.
Tasks where occasional mistakes create manageable consequences rather than catastrophic failures are also suitable. Drafting initial versions of documents for human review. Summarising large volumes of information for human synthesis. Generating options for human decision-makers to evaluate. The key is designing workflows where agent output is input to human judgment, not a replacement for it.
Where agents struggle, and will continue to struggle, is with decisions requiring significant professional judgment, with navigating ambiguous or conflicting information, and with sustaining reasoning across extended workflows. These tasks demand exactly the capabilities the benchmark reveals are weakest: tracking context across domains, managing ambiguity, and maintaining coherence over many steps.
I advised a consulting firm last year that tried to automate client proposal generation. The agents could produce plausible-looking documents, but partners spent more time correcting subtle errors and misaligned recommendations than they would have spent writing from scratch. The firm retreated to using agents for research synthesis—gathering, organising, and summarising background information—while keeping the actual proposal writing and strategic framing with humans. That division of labour works, and it respects the current capability boundaries.
The Trajectory Matters
Here’s what makes the benchmark results genuinely interesting rather than simply discouraging: the improvement trajectory is steep.
Mercor’s CEO observed that “right now it’s fair to say it’s like an intern that gets it right a quarter of the time, but last year it was the intern that gets it right five or 10% of the time. That kind of improvement year after year can have an impact so quickly.”
The drivers of improvement are worth understanding. Better reasoning models—the “thinking” variants that achieved the highest APEX scores—demonstrate that architectural advances translate directly to professional task performance. Enhanced context windows allow agents to maintain coherence across longer workflows. And techniques like self-refinement, where models critique and improve their own outputs through multiple iterations, are pushing accuracy higher without requiring fundamental capability breakthroughs.
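For readers who haven't seen it, self-refinement is simple to sketch. The loop below is a minimal illustration of the idea, not the method any particular lab uses; `generate` is a placeholder for whatever model call your stack exposes.

```python
# Sketch of a self-refinement loop: the model drafts, critiques its own
# draft, and revises. `generate(prompt)` is a placeholder for whatever
# LLM call you have available -- it is not a specific vendor API.
from typing import Callable

def self_refine(task: str,
                generate: Callable[[str], str],
                rounds: int = 3) -> str:
    draft = generate(f"Complete this task:\n{task}")
    for _ in range(rounds):
        critique = generate(
            f"Task:\n{task}\n\nDraft:\n{draft}\n\n"
            "List concrete errors or omissions. Reply 'OK' if there are none."
        )
        if critique.strip().upper() == "OK":
            break  # the model sees nothing left to fix
        draft = generate(
            f"Task:\n{task}\n\nDraft:\n{draft}\n\n"
            f"Critique:\n{critique}\n\nRewrite the draft, fixing every point."
        )
    return draft
```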
If accuracy improves from 24% to 48% to 72% over successive years—and nothing in the current research suggests that’s implausible—the calculus changes dramatically. Tasks that require too much human oversight to be economical today become viable tomorrow. The frontier keeps moving.
This creates a strategic challenge for technology leaders. You can’t ignore agentic AI because the trajectory is real. You can’t over-invest in current capabilities because they’re not yet production-ready for complex work. The answer is graduated deployment: starting with tasks that work today while building infrastructure and expertise for tasks that will work tomorrow.
Practical Implications for 2026
If you’re responsible for AI strategy in your organisation, here’s how I’d interpret the benchmark findings.
Reset expectations with leadership. The 24% figure is a useful corrective to vendor hype. Share it with executives who expect agents to replace knowledge workers this year. Agents will augment workers; replacement is further away than the marketing suggests.
Audit your deployment plans against task complexity. For each proposed agent application, ask: Does this task require sustained context across multiple sources? Does it involve significant ambiguity or judgment? Does it extend across many sequential steps? If the answers are yes, proceed with caution—or defer until capabilities mature.
Design for human-agent collaboration, not replacement. The most successful deployments I’ve seen position agents as force multipliers for human professionals, not substitutes. Agents handle the time-consuming but relatively structured parts of workflows; humans handle the judgment-intensive parts. This division respects current capability limitations while capturing real productivity gains.
Invest in evaluation infrastructure. You can’t improve what you can’t measure. The APEX benchmark is open source—consider adapting it to your specific context. Build internal benchmarks that measure agent performance on your actual tasks, with your actual data, against your actual quality standards.
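An internal benchmark doesn't need to be elaborate to be useful. The sketch below shows the basic shape: a list of your own tasks, each with a domain-specific pass/fail grader, run against whatever agent entry point you have. The names here (`Task`, `run_agent`) are illustrative and are not taken from the APEX codebase.

```python
# Minimal sketch of an internal agent benchmark: your tasks, your data,
# your own pass/fail graders. `run_agent` is a placeholder for however
# you invoke your agent in practice.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    name: str
    prompt: str                    # instructions plus pointers to files
    grade: Callable[[str], bool]   # domain-specific pass/fail check

def evaluate(tasks: list[Task], run_agent: Callable[[str], str]) -> float:
    passed = 0
    for task in tasks:
        output = run_agent(task.prompt)
        ok = task.grade(output)
        passed += ok
        print(f"{'PASS' if ok else 'FAIL'}  {task.name}")
    return passed / len(tasks)

# Example: a trivially checkable task; real graders would inspect
# spreadsheets, documents, or database state instead.
tasks = [Task("q3-revenue-lookup",
              "Report Q3 revenue from the attached model as 'revenue: <n>'",
              lambda out: "revenue:" in out.lower())]
# accuracy = evaluate(tasks, run_agent=my_agent)  # wire in your agent here
```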
Watch the trajectory, not just the snapshot. Current limitations are real, but so is the improvement rate. Build the organisational capability to rapidly adopt new agent capabilities as they mature. The enterprises that benefit most from the next capability jump will be those with infrastructure, governance, and expertise already in place.
The Honest Assessment
The benchmark tells us something we should have suspected: real work is harder than demos suggest. Professional judgment, contextual reasoning, and sustained attention are genuinely difficult capabilities—for humans and for AI.
But the benchmark also tells us something encouraging: we now have rigorous ways to measure progress. We can track whether next year’s models actually perform better on economically valuable tasks, not just on academic benchmarks disconnected from production reality.
Twenty-four percent isn’t impressive. But it’s measurable. It’s improvable. And it’s honest about where we actually stand.
For technology leaders navigating the hype cycle, that honesty is worth more than a thousand optimistic projections. Deploy where agents work today. Prepare for where they’ll work tomorrow. And resist the temptation to confuse the two.
The agents will certainly get better. The question is whether your organisation will be ready when they do.