Context engineering: curate the few high-signal tokens, don't dump everything in
Overview
The shift this teaches is from hunting for the cleverly-worded prompt to deciding what the model actually sees. In mid-2025 the field settled on a name for the discipline — context engineering — and that September Anthropic gave it its most durable handle: a model spends a finite “attention budget” reading whatever you give it, so your job is to spend it on the few tokens that matter. Today’s models make that easy to forget, which is exactly why it’s worth keeping.
The content
The obvious move — especially now that a context window can swallow an entire data room — is to pour everything in: the whole document, the full thread, every example, just in case. More context, more help.
That instinct is wrong. Anthropic’s framing is that the goal is “the smallest possible set of high-signal tokens that maximize the likelihood of some desired outcome.” Every irrelevant token you add competes for the model’s attention with the ones that matter.
The headline evidence has aged, though. The sharpest early demonstrations, Chroma’s “Context Rot” report (July 2025, run on models including GPT-4.1, Claude Opus 4 and Gemini 2.5) and the Stanford-led “Lost in the Middle” (2023), ran on models now a generation or more old, and the frontier has moved a long way since: by 2026 the leading models advertise roughly million-token windows and score near-perfectly on clean retrieval tests. Read only the 2025 papers and you’d conclude long context is broken. It no longer is.
What hasn’t changed is the principle. 2026 runs of the independent benchmarks keep finding it: NVIDIA’s RULER (2024) shows models reliably using only about half to two-thirds of their advertised window for multi-step work, and Adobe’s NoLiMa (2025) shows that flawless “find the needle” scores collapse once simple keyword matching is taken away — while real summarisation still sags as input grows. A bigger, better window raises the ceiling; it doesn’t repeal the cost of noise. And that cost was never only about model capability — surplus tokens dilute the signal, add lookalike distractors, and quietly buy you more latency and spend.
So treat the context window as a budget you spend, not a bucket you fill; the attention budget is the handle to keep. Before you paste, ask what each block does to earn its place. One caveat: the principle can be over-applied. Strip too hard and you starve the model of the fact it needed; the skill is finding the smallest sufficient set, not the smallest possible one.
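To make the budget framing concrete, here is a minimal Python sketch of spending a fixed token budget on ranked blocks. Everything in it is an assumption made for illustration: the `Block` type, the hand-assigned `relevance` scores, the `min_relevance` floor, and the word-count stand-in for a real tokenizer are all hypothetical, not part of Anthropic's guidance or any library. Required blocks are always kept, which is one way to guard against stripping too hard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    name: str
    text: str
    relevance: float  # hypothetical 0..1 score, assigned by hand or by any ranker
    required: bool = False  # facts the task cannot be done without


def estimate_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: count whitespace-separated words.
    return len(text.split())


def select_context(blocks: list[Block], budget: int, min_relevance: float = 0.5) -> list[Block]:
    """Keep required blocks, then add the most relevant optional ones that still fit."""
    chosen = [b for b in blocks if b.required]
    spent = sum(estimate_tokens(b.text) for b in chosen)
    if spent > budget:
        raise ValueError("required blocks alone exceed the budget")

    # Drop low-signal blocks outright, then consider the rest from most to least relevant.
    optional = sorted(
        (b for b in blocks if not b.required and b.relevance >= min_relevance),
        key=lambda b: b.relevance,
        reverse=True,
    )
    for block in optional:
        cost = estimate_tokens(block.text)
        if spent + cost <= budget:
            chosen.append(block)
            spent += cost
    return chosen


blocks = [
    Block("termination clause", "word " * 40, 0.9, required=True),
    Block("payment terms", "word " * 30, 0.8),
    Block("full email thread", "word " * 500, 0.3),
    Block("definitions", "word " * 50, 0.6),
]
picked = select_context(blocks, budget=100)
print([b.name for b in picked])  # prints ['termination clause', 'payment terms']
```

The email thread is dropped for low relevance and the definitions are dropped because they no longer fit. The second drop is the one to review by hand: if the task needs those definitions, mark them required or raise the budget.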
Try it
On your next real task — a contract to summarise, a dataset to interpret, a thread to draft a reply to — don’t paste the lot. Audit your context first with this:
I'm about to give you [task]. Before I paste anything, list the
specific pieces of information you actually need to do this well,
ranked by how much each one changes your answer.
Then tell me what I can safely leave out, and what would actively
mislead you if I included it.
Then assemble only the high-ranked pieces and run the task. Where this won’t help: short, self-contained asks. If the whole input is two paragraphs and one instruction, there’s nothing to curate — you’ll waste a turn formalising a decision you could have made by reading it. Curation earns its keep when the material is bigger than the question.
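If you run this audit often, the decision of when to bother can be scripted. The sketch below is illustrative only: the 300-word threshold and the function names are assumptions, not a recommendation from any of the sources, and the template is simply the prompt above with the task filled in.

```python
AUDIT_TEMPLATE = (
    "I'm about to give you {task}. Before I paste anything, list the\n"
    "specific pieces of information you actually need to do this well,\n"
    "ranked by how much each one changes your answer.\n"
    "Then tell me what I can safely leave out, and what would actively\n"
    "mislead you if I included it."
)


def worth_auditing(material: str, min_words: int = 300) -> bool:
    # Hypothetical threshold: below it, the input is short enough to judge by reading it.
    return len(material.split()) >= min_words


def build_audit_prompt(task: str) -> str:
    # Fill the audit prompt with a short description of the task.
    return AUDIT_TEMPLATE.format(task=task)


def plan_first_turn(task: str, instruction: str, material: str, min_words: int = 300) -> str:
    """Return the audit prompt for large inputs, or the instruction plus material for small ones."""
    if worth_auditing(material, min_words):
        return build_audit_prompt(task)
    return f"{instruction}\n\n{material}"


print(plan_first_turn("a contract to summarise", "Summarise this contract.", "clause " * 1000))
```

For a two-paragraph input the same call returns the instruction and the material directly, skipping the audit turn.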
Additional reading
- Effective context engineering for AI agents — Anthropic (2025-09-29) — the source of the “attention budget” and “smallest set of high-signal tokens” framing; still the clearest statement of the principle.
- Context Rot: How Increasing Input Tokens Impacts LLM Performance — Chroma Research (2025-07-14) — the 18-model study. Note it tested GPT-4.1, Claude Opus 4 and Gemini 2.5, all since superseded — read it for the pattern, not the scores.
- Lost in the Middle: How Language Models Use Long Contexts — Liu et al. (2023) — the original finding that recall is strongest at the start and end of the window, weakest in the middle.
- Long-context LLM benchmarks, 2026 — accuracy past 200K tokens (2026-05) — current state: ~1M-token windows; RULER puts reliable use at ~50–65% of the advertised window for multi-hop work, and NoLiMa shows needle-style scores collapse without keyword overlap.
Editor’s note
It took Opus 4.7 for me to internalise this reframe: it was much more literal than Opus 4.6 and Codex 5.3, and that forced the shift from prescribing workflows in the prompt to curating context. Most of us were trained by the early prompt-tip culture to believe the lever is wording, when the real lever is selection. The single biggest quality jump I see isn’t a smarter-sounding instruction; it’s cutting irrelevant material before it ever reaches the context. I have restarted whole sessions because I realised that first-prompt context was poisoning downstream output. Beware of the overcorrection, of course, but resist the belief that more is better.