Why does an AI agent's context cost grow so fast over a session?

Each turn of an agent loop is a separate model call that re-sends the system prompt plus the entire growing transcript. Because every turn re-sends everything before it, total context tokens scale roughly O(N²) across N turns. Recalling a small, bounded set of memory cells each turn instead makes the total scale roughly O(N · cap).

Can I reproduce the token-savings numbers myself?

Yes. The benchmark at github.com/citw2/saihm-token-benchmark is Apache-2.0 and runs fully offline with no API calls or keys, tokenizing with gpt-tokenizer (cl100k_base). Clone it, run npm install, then run node benchmark.mjs; the numbers reproduce deterministically, and you can change the recall cap or swap in your own scenario.

The hidden O(N²) tax in AI agent loops — measured, with a benchmark you can run

2026-06-23 · by the Architect · ~5 min read · for developers running long agent sessions

Every turn, most AI agents re-send their entire transcript. In a realistic multi-session coding scenario, recalling a compact memory instead uses 62.8% to 85.9% fewer context tokens, depending on session length. Here is the measurement, the method, and how to reproduce it offline.

The cost nobody puts on the invoice

An agent loop is not one model call — it is dozens. A long Claude Code or Cursor session, an autonomous task runner, a multi-day project: each turn is a fresh call that re-sends the system prompt, the entire growing transcript, and the new message. The transcript only grows, so the context you pay for grows with it — and because every turn re-sends everything before it, total context spend scales roughly O(N²) across N turns. It is also why long sessions eventually hit the context window and fall over.

There is an alternative: do not re-send the transcript. Keep durable facts (decisions, conventions, file paths) as memory cells, and recall a small, bounded set each turn. That turns the quadratic resend into roughly O(N · cap). The obvious question is how much that actually saves. SAIHM published a benchmark to measure it, and to let you check the number rather than trust it.
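The scaling argument can be checked with plain arithmetic. The following is a minimal Python sketch, assuming every turn adds a fixed number of tokens to the transcript, every memory cell has a fixed size, and at most one cell exists per earlier turn. The function names and all token sizes are hypothetical illustrations, not values from the benchmark.

```python
def naive_total(turns: int, system: int, per_turn: int, message: int) -> int:
    """Total context tokens when each turn re-sends the whole transcript.

    Turn k sends the system prompt, the k-1 earlier turns, and the new message.
    """
    return sum(system + (k - 1) * per_turn + message for k in range(1, turns + 1))


def naive_closed_form(turns: int, system: int, per_turn: int, message: int) -> int:
    # Same sum in closed form; the N(N-1)/2 term is the quadratic part.
    return turns * (system + message) + per_turn * turns * (turns - 1) // 2


def recall_total(turns: int, system: int, cell: int, cap: int, message: int) -> int:
    """Total context tokens when each turn sends at most `cap` memory cells.

    Assumes one cell is stored per earlier turn, so turn k can recall
    min(cap, k-1) cells. The total is bounded by N * (system + cap*cell + message).
    """
    return sum(system + min(cap, k - 1) * cell + message for k in range(1, turns + 1))


if __name__ == "__main__":
    # Hypothetical sizes: 50-token system prompt, 100 tokens added to the
    # transcript per turn, 15-token cells, cap of 6, 20-token new message.
    for n in (5, 10, 20, 40):
        naive = naive_total(n, system=50, per_turn=100, message=20)
        recalled = recall_total(n, system=50, cell=15, cap=6, message=20)
        print(f"{n:>3} turns  naive={naive:>6}  recall={recalled:>5}")
```

Doubling the number of turns roughly quadruples the naive total once the transcript term dominates, while the recall total only roughly doubles, because each turn's context is capped.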

The experiment

The benchmark (citw2/saihm-token-benchmark, Apache-2.0) models one realistic scenario: a build-a-feature coding assistant working across three sittings, where early decisions (“use Recharts”, “store timestamps in UTC”, “named exports only”) accumulate and later turns need to recall them. It counts input/context tokens only, summed across every turn, under two strategies:

Naive — each turn sends system prompt + the entire growing transcript + the new message.
SAIHM — each turn sends system prompt + a capped set of recalled memory cells + the new message. The raw transcript is never re-sent.

Tokenization is gpt-tokenizer (cl100k_base, the GPT-4 BPE). It runs fully offline — no API calls, no keys — so it is deterministic and anyone gets the same result.
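The two strategies can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the benchmark's code: `count_tokens` is a crude whitespace counter standing in for gpt-tokenizer, `recall` uses a hypothetical word-overlap ranking, and the messages and replies in `TURNS` are invented around the three example decisions quoted above.

```python
def count_tokens(text: str) -> int:
    # Crude stand-in: whitespace-separated words, NOT cl100k_base BPE tokens.
    return len(text.split())


def recall(cells: list[str], query: str, cap: int) -> list[str]:
    """Return at most `cap` cells, ranked by word overlap with the query.

    Hypothetical ranking, for illustration only.
    """
    query_words = set(query.lower().split())
    ranked = sorted(
        cells,
        key=lambda cell: len(query_words & set(cell.lower().split())),
        reverse=True,
    )
    return ranked[:cap]


def simulate(system: str, turns: list, cap: int) -> tuple[int, int]:
    """Sum input tokens across all turns under both strategies."""
    transcript: list[str] = []  # grows every turn (naive strategy)
    cells: list[str] = []       # durable facts (recall strategy)
    naive_sum = recall_sum = 0
    for message, reply, fact in turns:
        # Naive: system prompt + entire transcript so far + new message.
        naive_sum += count_tokens("\n".join([system, *transcript, message]))
        # Recall: system prompt + capped set of cells + new message.
        recalled = recall(cells, message, cap)
        recall_sum += count_tokens("\n".join([system, *recalled, message]))
        transcript += [message, reply]
        if fact is not None:
            cells.append(fact)
    return naive_sum, recall_sum


SYSTEM = "You are a coding assistant."
# (user message, assistant reply, durable fact or None); invented examples.
TURNS = [
    ("Which chart library should we use?", "Let's use Recharts for all charts.", "use Recharts"),
    ("How should we store timestamps?", "Store timestamps in UTC everywhere.", "store timestamps in UTC"),
    ("What export style do we follow?", "Named exports only, no default exports.", "named exports only"),
    ("Add a signups-by-day chart with timestamps.", "Done, added the chart component.", None),
]

if __name__ == "__main__":
    naive, recalled = simulate(SYSTEM, TURNS, cap=2)
    print(f"naive={naive} recall={recalled}")
```

The raw transcript never enters the recall context; only the short facts do, which is why the per-turn size stays bounded by the cap.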

The numbers

Session length	Naive tokens	SAIHM tokens	Fewer tokens with SAIHM
5 turns	1,628	605	62.8%
10 turns	6,091	1,273	79.1%
15 turns	13,175	2,023	84.6%
18 turns	18,688	2,632	85.9%

The longer the session, the wider the gap, which is what the O(N²)-vs-O(N · cap) difference predicts.
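The percentage column follows directly from the two token columns, as this short Python check shows. The row values are copied from the table above; the helper name `percent_fewer` is just an illustrative choice.

```python
# (turns, naive tokens, SAIHM tokens), copied from the table above.
ROWS = [
    (5, 1628, 605),
    (10, 6091, 1273),
    (15, 13175, 2023),
    (18, 18688, 2632),
]


def percent_fewer(naive: int, recalled: int) -> float:
    """Share of naive context tokens avoided, as a percentage."""
    return round(100 * (1 - recalled / naive), 1)


percentages = [percent_fewer(naive, recalled) for _, naive, recalled in ROWS]

if __name__ == "__main__":
    for (turns, naive, recalled), pct in zip(ROWS, percentages):
        # Average context per turn shows the dynamic: naive grows, recall stays flat.
        print(
            f"{turns:>2} turns  fewer={pct}%  "
            f"naive/turn={naive / turns:.1f}  saihm/turn={recalled / turns:.1f}"
        )
```

The per-turn averages make the dynamic visible: the naive context grows from about 326 tokens per turn at 5 turns to about 1,038 at 18 turns, while the SAIHM context stays between about 121 and 146.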

Why these numbers are honest, not cherry-picked

Input only. Output tokens are identical under both strategies, so they are not counted. The win is purely on the context you re-send.
It is conservative for short work. At 5 turns you save ~63%, not 86%. The savings are a function of session length and how compact your memory cells are — your real mileage depends on your workload.
It measures a dynamic, not a price. This is resend-vs-recall token volume, not any one provider’s billing.

Reproduce it in two minutes

git clone https://github.com/citw2/saihm-token-benchmark
cd saihm-token-benchmark && npm install
node benchmark.mjs
node benchmark.mjs --recall-cap 4   # trade recall breadth vs savings

Change the cap, swap in your own scenario, re-run. The point of publishing it is that you do not have to take the percentage on faith.

Where the recall comes from

SAIHM is a memory layer you address across models — the same store works from Claude, GPT, DeepSeek, Qwen, Kimi or GLM, and through LangChain/LlamaIndex. Durable facts live as memory cells; each turn pulls a bounded set instead of replaying history. Because the memory is portable, you are not locked to one vendor’s built-in context; because it is yours, you hold the keys and erasure is per-record and provable. There are runnable, one-command demos for each of the above — linked from the demo set.

The honest close

SAIHM is a paid product, with no free tier — that is stated up front rather than buried behind a trial. But the benchmark and all nine demos are open source and run locally, so you can verify the claim and try the integration before deciding anything. The tool surface and connect steps are at /developers; pricing is at /pricing.


— Architect

Independence notice. SAIHM is an Apache-2.0 protocol authored independently. The benchmark described here is open source and reproducible offline; the figures are produced by the published script and depend on session length and scenario. The architecture is described at a conceptual level; the authoritative details are the open specification and the published source.