The landscape agentsdelegationverificationreview 2026·06·12 · 4 min · dated

Colleague or contractor: the two ways to hand work to an AI

Edited by Luke Topfer | last reviewed 2026·06·12 |re-check by 2026·12·01

Overview

On 2 June 2026, OpenAI reported that non-developers — analysts, marketers, operators, researchers — now make up about 20% of the people using Codex, its software agent, and that they’re growing more than three times as fast as the developers it was built for. A coding tool’s fastest-growing audience doesn’t code.

That number is the visible edge of something every major assistant now ships: a second way of working, where the AI doesn’t answer you in the chat — it goes away, works alone for minutes or hours, and comes back with finished work. Claude calls its version Cowork; ChatGPT has agent mode and deep research; Gemini has Deep Research and a growing family of background agents.

What you’ll be able to do after this module: tell which of the two modes a task actually needs, and budget for the step nobody budgets for — checking work you didn’t watch happen.

The content

Start by naming the two modes plainly. In colleague mode the AI works beside you: you see every step, react in real time, and change direction mid-stream. This is ordinary chat, and it’s where almost everyone learned to use AI. In contractor mode you write a brief, the AI works alone, and what you get back is a deliverable — a report, a spreadsheet, a draft — rather than a reply.

The vendors themselves now draw exactly this line. Anthropic’s guidance for Cowork says to use it when “you want to delegate and come back to results”, and to stay in regular chat when “you want real-time back-and-forth collaboration” or “the work is exploratory and you’re thinking alongside Claude”. OpenAI says ChatGPT agent tasks “usually complete within 5–30 minutes”. Gemini’s Deep Research runs for five to ten minutes and notifies you when the report is ready. Both modes are already in the tool you have (the delegate modes sit on the paid tiers) — this isn’t a buying decision.

Here’s the reframe: contractor mode is not the upgrade. The hype reads delegation as what advanced users graduate to, but Anthropic’s own usage research found the opposite drift — the share of Claude.ai conversations that were pure hand-offs (“directive” use) climbed from 27% to 39% over 2025, then fell back to 32% by November as people pulled work back into the conversation. Mode choice isn’t a maturity ladder. It’s a property of the task: can you write down what done looks like, which sources count, and what evidence should come back? If you can’t, the task isn’t ready to leave the conversation — colleague mode is the work of finding that out. If you can, delegate it.

Then comes the part the productivity story skips. A contractor’s work arrives looking finished — and looking finished is not the same as being checked. The first controlled head-to-head of two frontier agents (a May 2026 preprint — one author, one gravitational-wave analysis pipeline, two runs, so a case study rather than a verdict) found the two agents’ scientific results converged, but one quietly patched over mismatches in the spec and left no record, while the other halted, explained, and restarted — leaving a trail. When you didn’t watch the work happen, that trail is all you have. The same lesson landed in ordinary knowledge work this month: Stanford’s Andrew Hall told Axios that a coding agent updated a five-year-old research paper of his — gathered new data, ran the analyses, drafted the paper — and that when a graduate student audited it, the agent had “made a number of errors” and “very much needed an expert, Ph.D.-level student to oversee it quite closely”.

Call this cost review debt: every delegated run creates time you owe checking work you didn’t watch happen. It’s real money. Workday’s survey of 3,200 workers found nearly 40% of the time AI saves is lost again to rework — correcting errors, rewriting content, verifying outputs. And we’re bad at estimating the debt: when METR ran a randomised trial with 16 experienced open-source developers, AI assistance made them 19% slower on real tasks — while they believed it had sped them up by 20%. The delegation only paid off if the checking costs less than doing the work yourself. That’s the whole economics of contractor mode in one line.

Try it

Take one real task from this week — something with a definable output, not an open question. Paste this into your own tool, either in chat or in its delegate mode (Deep Research, agent mode, Cowork, Research):

I want to hand you this task to complete in one run:
[describe the task in 2-3 sentences].

Hold me to a contractor's brief before you start:
1. Done means: [the finished artefact - a report, a table, a draft]
2. Use only: [the sources that count - files I give you, named
   sites, my own notes]
3. Come back with: the finished work PLUS the evidence - a link or
   citation for every factual claim, and a list of every assumption
   you made where my brief was unclear.

If any of those three lines is too vague to act on, ask me before
doing anything else.

When the work comes back, pay the review debt first: open two of its citations and check one number or quote against the original before you read the conclusions. The assumptions list is where delegated work usually goes wrong — read it like a contractor’s variation claim.

Where this breaks: if you can’t fill in “done means” without hand-waving, stop. The task isn’t ready to delegate, and no agent feature will fix that — take it back into the conversation and work out what good looks like first. That’s not a failure of the tool or of you; that’s the diagnostic doing its job.

Additional reading

Codex for every role, tool, and workflow — OpenAI (June 2026) — the announcement behind the 20%-non-developers figure, plus role-specific plugins for knowledge work.
First head-to-head comparison of agentic AI… — Inguglia, arXiv (May 2026) — the two-agents-same-spec preprint: identical science, opposite behaviour when the spec didn’t match reality.
Office workers embrace OpenAI’s Codex — Axios, via Yahoo Tech (June 2026) — the Andrew Hall story: an agent updated his paper; an expert audit found the limits.
Anthropic Economic Index: Economic primitives — Anthropic (January 2026) — the directive-use numbers: delegation rose through 2025, then fell back as users returned to iteration.
Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity — METR (July 2025) — the randomised trial behind the 19%-slower-while-feeling-20%-faster result; scope: 16 experienced developers on mature codebases.
Companies Are Leaving AI Gains on the Table — Workday Newsroom (January 2026) — the “nearly 40% of AI time savings are lost to rework” survey (3,200 workers, fielded November 2025).
Claude Cowork: a research preview — Claude tutorials (living page, fetched June 2026) — Anthropic’s own delegate-vs-stay-in-chat guidance, quoted above.
ChatGPT agent — OpenAI Help Center (living page, fetched June 2026) — run times, limits, and what agent outputs return (source links or screenshots).
Deep Research in Gemini Apps — Google Help (living page, fetched June 2026) — Gemini’s delegated research flow: editable plan, 5–10 minute runs, notification on completion.
AI Work Is Splitting in Two — Every (May 2026) — an independent take on the same split: delegation vs collaboration as the new meta-skill.
Claude vs. Codex isn’t about code — Nate B Jones, Nate’s Substack (June 2026, subscriber-only) — the essay that prompted this module: a deeper, tool-by-tool tour of how agent interfaces train their users.

Editor’s note

This site runs on delegated work — agents draft these modules, and a separate agent fact-checks every source before anything is published. That second step came from experience. An early draft attributed a quote to a paper that did not contain it; the work read as careful, and it was wrong in a way only the source could show. Delegate what you can specify; check what comes back against the originals, not the report.

✓signed-off-by: Luke Topfer <editor> · 2026·06·12

06 Was this useful?

Was this useful for your daily work?