Memory & recall mental-modelscontext-engineeringmemory 2026·06·08 · 4 min · evergreen

Your context window is working memory, not a hard drive

Edited by Luke Topfer | last reviewed 2026·06·08 |re-check by 2027·12·01

Overview

This is the mental model everything else about AI memory hangs off: what the model actually has in front of it when it answers you. In July 2025, Chroma’s “Context Rot” study tested 18 models — frontier ones included — and found that “models do not use their context uniformly; instead, their performance grows increasingly unreliable as input length grows” — quietly demolishing the idea that a bigger window simply means a better memory. After this, you’ll stop treating the context window like storage and start managing it like the scarce, lossy working memory it is.

The content

Here is the obvious read: you’ve had a long chat, the model “knows” everything you’ve said, so naturally it remembers. The overturn: the model remembers nothing. Each response is generated fresh from the tokens sitting in the window at that moment. The model itself is stateless — it holds no memory of prior turns. The sense of continuity is an illusion the application maintains by resending the whole conversation on every single turn. Nothing you typed is “stored” anywhere the model can reach unless a separate feature deliberately persists it and feeds it back in.

That is why memory features exist at all. When a tool like ChatGPT carries a fact across conversations, it is not the model remembering — it is a memory layer (OpenAI splits this into explicit “saved memories” and inferred “chat history”, per their documentation) that retrieves text and pastes it back into the window before the model runs. Projects, retrieval, saved instructions: all of them are plumbing that gets the right tokens into working memory at the right moment. They are not optional polish. They are the only way anything survives a fresh session. And the plumbing is where the race now is: in March 2026 Anthropic extended Claude’s memory to all users — including importing saved context from rival chatbots — which is persistence as re-supply, made into a product.

So treat the window as working memory: small, volatile, and degrading under load. The Chroma finding is the part that catches people out — performance falls off well before you hit the window’s stated limit, not at the boundary. A million-token window does not grant you a million tokens of reliable recall. The more you stuff in, the more unevenly the model attends to it, and the relevant detail you buried on turn three can quietly stop pulling its weight.

The Monday-morning consequence: relevance beats volume. Pasting your entire thread back in is not “giving it more to work with” — it is diluting the signal. Curate what is in the window. Anything that must outlive this session goes into a memory feature or a document you can re-supply, not into the hope that the model held onto it.

Try it

Take a real task where you have a long, sprawling chat or a wall of pasted material. Before you ask your next question, run a deliberate cull with this prompt:

Here is the context for my task: [paste only what is genuinely relevant].

Before answering, restate in 3-5 bullet points the specific facts,
constraints, and decisions you are relying on from what I gave you.
Flag anything you think is missing to do this well.

Then answer: [your actual question].

The restatement shows you what actually landed in working memory versus what you assumed was there. Where this approach reaches its limit: it cannot recover anything from a previous session that was never persisted — if the detail was only ever in a closed chat, no prompt will retrieve it, because for the model it was never there. That gap is precisely the job a memory feature or a re-supplied document does.

Additional reading

Context Rot: How Increasing Input Tokens Impacts LLM Performance — Chroma, 14 July 2025. The empirical case that long context degrades unevenly, well before the window limit.
Memory and new controls for ChatGPT — OpenAI. How an explicit memory layer persists facts across sessions and feeds them back in.
Memory FAQ — OpenAI. The distinction between “saved memories” and inferred “chat history”, and the user controls over both.
Claude release notes: memory for all users — Anthropic, March 2026. Memory extended to every tier, with import of saved context from other chatbots — the persistence layer as a product, not the model remembering.

Editor’s note

I’ll say it plainly: this mental model is essential. Falling into the trap of keeping long context will drive worse results. The discipline that actually pays off is unglamorous — decide what belongs in working memory this turn, and put anything that must survive the session somewhere it can be re-supplied — for example, a saved instruction or a source document, not a hope that the model “kept” it.

✓signed-off-by: Luke Topfer <editor> · 2026·06·08

06 Self-check

// three assertions against what you just read · results stay in this browser

assert 1/3

You're twenty turns into a chat and the model still seems to track what you said at the start. What is actually going on?

assert 2/3

You have a long, sprawling chat about a project and need one more careful answer. What does this module say to do before you ask?

assert 3/3

A key constraint was settled in a chat you closed last week and never saved anywhere. What are your options now?

07 Was this useful?

Was this useful for your daily work?