Why an agent harness needs the right memory protocol, not a memory feature

2026-07-01 · SAIHM · ~6 min read · for engineers building agent harnesses

If you build the loop — the harness around a model — memory isn’t a nice-to-have you bolt on at the end. It’s the component that decides whether your agent scales past a few turns, recalls the right fact instead of a stale one, and survives a model swap. But “add some memory” isn’t the answer either: most memory features get you a place to put text and nothing else. Here’s what separates a real memory protocol from a feature — with numbers you can reproduce, demos you can clone and run yourself, and a drop-in prompt you can paste into your harness today.

The harness is where memory actually lives

A model doesn’t have a session. The harness does — the code that runs the loop, manages context, orchestrates tools, and decides what the model sees on each turn. Whether you’re building a coding agent, an autonomous task runner, or a multi-agent system, you own that boundary. And the single most expensive decision at that boundary is what you put in the context window on every turn.

Most harnesses answer that question the naive way: re-send the entire transcript. It works for a demo. It quietly falls apart in production. Three failures show up, in order.

What “the right one” actually means

Any store can hold text. That’s the low bar every vendor “memory feature” clears — and then stops. The right memory protocol for a harness is defined by four properties, and each one maps to a failure below:

Bounded recall — you retrieve a small, capped set each turn instead of replaying history. (Failure 1.)
Correctness under change — recall returns the current fact, not a superseded one. (Failure 2.)
Cross-model portability — the same memory works no matter which model reads it. (Failure 3.)
Provable erasure — deletion is real and per-record, not a flag.

A memory feature gives you the first property, halfway, and none of the others. That gap is the whole point of this post.

Failure 1 — the resend tax is quadratic

An agent loop isn’t one call, it’s dozens. Each turn re-sends the system prompt, the entire growing transcript, and the new message. The transcript only grows, and every turn replays everything before it — so total context spend scales roughly O(N²) across N turns. It’s also why long sessions eventually hit the window and fall over.

The fix is structural: don’t re-send history. Keep durable facts — decisions, conventions, file paths — as memory cells, and recall a small bounded set each turn. That turns the quadratic resend into roughly O(N · cap).

SAIHM published an offline, reproducible benchmark that measures exactly this — input/context tokens only, naive full-transcript resend vs. capped recall, tokenized with gpt-tokenizer (cl100k_base):

Session length	Naive tokens	SAIHM tokens	Fewer
5 turns	1,628	605	62.8%
10 turns	6,091	1,273	79.1%
15 turns	13,175	2,023	84.6%
18 turns	18,688	2,632	85.9%

The longer the session, the wider the gap — exactly what O(N²)-vs-O(N·cap) predicts. It counts input only (output is identical under both strategies), and it’s conservative for short work. Clone it and run node benchmark.mjs — it reproduces deterministically:

git clone https://github.com/citw2/saihm-token-benchmark
cd saihm-token-benchmark && npm install && node benchmark.mjs
node benchmark.mjs --recall-cap 8   # trade recall breadth vs savings

Failure 2 — recall correctness, not just recall cost

Cheaper context is the easy half. The harder half is a correctness problem: a harness that recalls the wrong or stale fact is worse than one that pays to re-send. If your agent “remembers” a decision you reversed three turns ago, it will confidently act on it.

This is where naive memory — keyword match, or dumping recent history — breaks down. The hard retrieval cases for any agent harness are:

Supersession — is this fact current, or one you’ve since overridden? Keyword recall is essentially a coin flip here: a reversed decision and the decision that replaced it share almost all their words, so lexical similarity literally cannot tell a live fact from a dead one.
Temporal — which version was true at the time that matters.
Contradiction — two cells disagree; which one wins.

The retrieval cases that matter for a harness are the hard ones, not the easy lookups. Fact and paraphrase are table stakes; multi-hop, supersession, temporal, and contradiction are where naive memory quietly fails. Supersession-, temporal-, and contradiction-awareness are the whole point: a memory layer that gets those right is the difference between an agent that compounds knowledge and one that compounds mistakes.

Failure 3 — vendor lock-in is an architecture risk

Every model vendor now ships some built-in memory. Each one is a walled garden: non-portable, non-inspectable, gone the moment you switch models or run two models side by side. For a harness engineer that’s a structural risk — your agent’s memory shouldn’t be hostage to one provider’s roadmap.

SAIHM is a single store you address across models. The same memory works from Claude, GPT, DeepSeek, Qwen, Kimi and GLM, and through LangChain and LlamaIndex. One model can write a fact and another can read it back. There are more than a dozen runnable demos — each a self-contained repo you clone and run locally — including the same memory used from six different models; a cross-model demo where one model writes and another reads it back; a Claude Code integration; and adapters for LangChain, LlamaIndex, CrewAI, AutoGen and LangGraph. All linked from the runnable demos index.

The architecture that falls out

Put those together and a clean harness shape emerges: a stateless core plus durable external memory. The loop stays thin and restart-safe; state lives in a memory layer that outlives any single process, model, or session. You stop hand-rolling transcript truncation and brittle “summarize the history” hacks, and you stop paying the quadratic tax to keep context alive.

It’s also production-shaped where it counts:

Non-custodial — the service stores ciphertext it can’t read; you hold the keys.
Provable erasure — deletion is per-record and real (key-destruction), not a soft-delete flag. If you handle user data, this is the difference between a compliance story and a compliance liability.
Tamper-evident audit — the trail can be verified, not just trusted.

A drop-in memory contract for your harness

Here’s the fastest way to see the difference. Paste this fragment into your agent’s system prompt (it assumes the SAIHM MCP tools saihm_recall, saihm_remember, and saihm_forget are wired into your harness). It’s written so that following it is what produces every claim above — bounded cost, correct recall, portability, real deletion:

## Memory contract

On every turn, before you reason or act:
1. RECALL, don't re-read. Call saihm_recall with keywords for the current task
   to load a small, bounded set of relevant memory cells. Do NOT re-send prior
   turns - the recalled cells ARE your working state.
2. Prefer the CURRENT fact. If two recalled cells conflict, the most recent /
   non-superseded one wins. Never act on a decision a later cell reverses.

Whenever something durable changes:
3. REMEMBER it. Call saihm_remember to persist decisions, conventions, file
   paths, and constraints - one fact per cell, in your own words.
4. Mark supersession. When you reverse an earlier decision, write the new cell
   AND state that it supersedes the old one, so future recall returns the live
   version, not the dead one.

Across models and on deletion:
5. PORTABLE. The same store answers from any model. A fact written under one
   model is readable by another - do not keep a separate memory per vendor.
6. On a "delete my data" request, call saihm_forget on the specific cell(s).
   Erasure is per-record and provable (real key-destruction), not a soft flag.

Every line maps to a claim in this post: (1) flattens the resend curve from O(N²) to O(N·cap); (2) and (4) are what win the supersession, temporal, and contradiction cases; (5) is what kills the lock-in; (6) is your deletion-and-audit story. Start with a small saihm_recall cap (say 8 cells) and raise it only if recall misses — that cap is the knob the benchmark’s --recall-cap flag lets you tune against real savings.

Verify everything before you believe any of it

That’s the part built for engineers: you don’t have to take a single number on faith. The benchmark and every demo are open source (Apache-2.0) and run locally. Kick the tires, swap in your own scenario, paste the memory contract into your own harness, and see where a portable memory layer fits your stack — or doesn’t.

The honest close

SAIHM is a paid product — no free tier, stated up front rather than buried behind a trial. Pricing is flat and public: Pro at $5, Pro Fast at $9, Enterprise at a $500 floor with an SLA. The benchmark and demos are open precisely so you can verify the claims and try the integration before deciding anything. The tool surface and setup steps are at /developers; pricing is at /pricing.

If you’re engineering an agent harness, the memory layer isn’t the last thing you add — it’s the thing that decides whether the rest holds up. Add the right one.

Join SAIHM

— Architect

FAQ

What makes a memory protocol “right” versus just a feature? Four properties a harness actually needs: bounded recall (capped tokens per turn), correctness under change (recall returns the current, non-superseded fact), cross-model portability (one store, any model), and provable per-record erasure. Most vendor memory features give you the first halfway and none of the rest.

Can I just paste the memory-contract prompt and go? Yes — it’s a system-prompt fragment that drives the three core tools (saihm_recall / saihm_remember / saihm_forget). Each instruction is written to produce a specific claim in this post; start with a small recall cap and raise it only if recall misses.

Does this replace my model’s built-in memory? It’s an alternative you own and can address across models — so you’re not locked to one vendor’s non-portable memory, and multi-model harnesses share one store.

Is the benchmark cherry-picked? It counts input tokens only (output is identical either way), runs fully offline, and is conservative for short sessions (~63% at 5 turns). Reproduce it and change the recall cap yourself.

Can I use it from LangChain / LlamaIndex / Claude Code? Yes — there’s a runnable demo for each.

What happens when a user asks me to delete their data? Erasure is per-record and provable (real key-destruction), with a tamper-evident audit trail.

Independence notice. SAIHM is an Apache-2.0 protocol authored independently. The benchmark referenced here is open source and reproducible offline; the figures are produced by the published script and depend on session length and scenario. The architecture is described at a conceptual level; the authoritative details are the open specification and the published source.