Your incident-response AI agent gets more expensive the longer the incident runs
· by the Architect · ~4 min read · for on-call engineers and SRE leads using AI in the loop
The point in an incident where an AI assistant should be cheapest and fastest — deep into a long, messy timeline — is exactly where most of them get slowest and most expensive. Here is why, and what to do about it.
The cost that peaks at the worst moment
Picture hour three of a production incident. Your AI assistant has already pulled the runbook, three dashboards’ worth of metrics, a wall of log lines, and the back-and-forth of everything you have tried so far. Every new question you ask (“could it be the cache?”, “what changed at 02:14?”) is a fresh model call that re-reads all of that history before it answers.
So the assistant is slowest and priciest precisely when the timeline is longest, which is precisely when you are most under pressure. It is also why long incident sessions eventually overflow the context window and the assistant starts “forgetting” the early symptoms that turn out to matter.
Why it grows the way it does
An agent loop is not one call; it is dozens. Each step re-sends the system prompt plus the entire growing transcript: the runbook, the logs, every prior step. Because each step replays everything before it, the context sent per call grows roughly linearly with the number of steps, and the total you pay for across the session grows roughly quadratically. SAIHM measured this dynamic on a reproducible, offline benchmark and saw 62.8%–85.9% fewer context tokens across a session when an agent recalls a compact memory instead of replaying history, and the gap widens the longer the session runs. You can clone the benchmark and check the numbers yourself.
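The growth pattern described above can be worked through with a few lines of arithmetic. This is a back-of-the-envelope sketch, not the SAIHM benchmark: the function names and every token count below are made-up assumptions chosen only to show the shape of the curve.

```python
def replay_tokens(system_tokens: int, step_tokens: int, steps: int) -> int:
    """Total input tokens when each step re-sends the system prompt
    plus every step so far (including the new one)."""
    total = 0
    for step in range(1, steps + 1):
        total += system_tokens + step * step_tokens
    return total


def recall_tokens(system_tokens: int, step_tokens: int,
                  recalled_tokens: int, steps: int) -> int:
    """Total input tokens when each step sends the system prompt, a
    fixed-size bundle of recalled facts, and only the new message."""
    return steps * (system_tokens + recalled_tokens + step_tokens)


def savings(system_tokens: int, step_tokens: int,
            recalled_tokens: int, steps: int) -> float:
    """Fraction of input tokens avoided by recalling instead of replaying."""
    replay = replay_tokens(system_tokens, step_tokens, steps)
    recall = recall_tokens(system_tokens, step_tokens, recalled_tokens, steps)
    return 1 - recall / replay


if __name__ == "__main__":
    # Illustrative numbers only: 500-token system prompt, 200 tokens added
    # per step, 300 tokens of recalled facts per step.
    for steps in (10, 40, 120):
        print(steps, f"{savings(500, 200, 300, steps):.1%}")
```

With these invented inputs the replay total grows with the square of the step count while the recall total grows linearly, so the saving widens as the session gets longer. The actual percentages depend on the scenario and are whatever the published benchmark script produces.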
Recall the few facts a step needs — not the whole timeline
The alternative is simple: stop re-reading the timeline. SAIHM keeps the durable facts of an incident — the failing service, the suspected change, the hostname, the decision to roll back — as separate memory cells. Each step recalls only the handful it actually needs. The working context stays small even as the incident timeline grows, so the assistant stays fast and affordable at hour three, not just at minute one. The same store carries across whatever model your on-call tooling speaks to — Claude, GPT, DeepSeek, Qwen, Kimi or GLM — so a model swap mid-incident does not lose the thread.
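To make the idea of separate memory cells concrete, here is a minimal sketch of a store that keeps facts as individual records and returns only the few that match the current question. The class names, the tag-overlap scoring, and the example facts are all hypothetical; this is not the SAIHM API or its recall algorithm, which are defined by the open specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryCell:
    key: str          # stable identifier for the fact
    fact: str         # the durable fact itself
    tags: frozenset   # words used to match the fact to a question


class IncidentMemory:
    """Hypothetical in-memory store of incident facts."""

    def __init__(self) -> None:
        self._cells: dict[str, MemoryCell] = {}

    def remember(self, key: str, fact: str, tags: set[str]) -> None:
        self._cells[key] = MemoryCell(key, fact, frozenset(tags))

    def recall(self, query_tags: set[str], limit: int = 3) -> list[MemoryCell]:
        """Return up to `limit` cells, ranked by how many tags they share
        with the query (ties broken by key for a stable order)."""
        scored = [(len(cell.tags & query_tags), cell)
                  for cell in self._cells.values()]
        matches = [(score, cell) for score, cell in scored if score > 0]
        matches.sort(key=lambda pair: (-pair[0], pair[1].key))
        return [cell for _, cell in matches[:limit]]


if __name__ == "__main__":
    memory = IncidentMemory()
    memory.remember("service", "checkout-api is returning 5xx", {"service", "errors"})
    memory.remember("change", "cache TTL config changed at 02:14", {"cache", "change", "config"})
    memory.remember("decision", "team agreed to roll back the config", {"rollback", "config"})

    # Only the cells relevant to the cache question enter the prompt.
    for cell in memory.recall({"cache", "config"}, limit=2):
        print(cell.key, "->", cell.fact)
```

The point of the design is that the prompt for each step is built from the result of `recall`, so its size is bounded by `limit` rather than by the length of the incident.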
Incident data is sensitive — so hold the keys to it
An incident memory is not neutral. It contains hostnames, internal topology, customer-impact notes, sometimes personal data from affected accounts. With most hosted-memory products that history lives on a vendor’s servers under the vendor’s keys. SAIHM inverts that: the memory is yours. You hold the encryption keys, so the facts are readable only by you. Erasure is per-record and provable: when an incident retrospective is closed and a sensitive note must go, that single cell is cryptographically destroyed, not merely flagged as hidden. For a team that answers to auditors or to a data-protection regime, being able to prove that a specific record is gone is the difference between a clean post-incident review and an open finding.
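One common way to get per-record erasure of this kind is crypto-shredding: each record is encrypted under its own key, and erasing the record means destroying that key so the ciphertext can no longer be read. The sketch below illustrates the idea only. It assumes nothing about how SAIHM implements it, the class and method names are hypothetical, and the hash-based keystream is a toy stand-in for a real authenticated cipher such as AES-GCM.

```python
import hashlib
import secrets


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream from SHA-256 in counter mode. Illustration only:
    a real system would use a vetted authenticated cipher instead."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def _xor(data: bytes, stream: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, stream))


class ShreddableStore:
    """Hypothetical store where every record has its own key."""

    def __init__(self) -> None:
        self._keys: dict[str, bytes] = {}                  # held by the owner
        self._blobs: dict[str, tuple[bytes, bytes]] = {}   # nonce, ciphertext

    def put(self, record_id: str, text: str) -> None:
        key = secrets.token_bytes(32)
        nonce = secrets.token_bytes(16)
        data = text.encode("utf-8")
        self._keys[record_id] = key
        self._blobs[record_id] = (nonce, _xor(data, _keystream(key, nonce, len(data))))

    def get(self, record_id: str) -> str:
        key = self._keys[record_id]  # raises KeyError once the key is destroyed
        nonce, ciphertext = self._blobs[record_id]
        return _xor(ciphertext, _keystream(key, nonce, len(ciphertext))).decode("utf-8")

    def erase(self, record_id: str) -> None:
        # Dropping the key leaves only unreadable ciphertext behind.
        # Note: `del` does not securely wipe memory in Python; a real
        # implementation would keep keys in a key manager or HSM.
        del self._keys[record_id]


if __name__ == "__main__":
    store = ShreddableStore()
    store.put("host", "db-primary failed over at 02:14")
    store.put("note", "affected customer account details")
    store.erase("note")
    print(store.get("host"))  # other records are unaffected
```

Because the key exists in exactly one place, erasing one record does not touch any other record, and copies of the ciphertext in backups become unreadable along with it.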
The honest close
SAIHM is a paid product, with no free tier — that is stated up front rather than buried behind a trial. But the benchmark and all nine integration demos are open source and run locally, so you can verify the savings and try the connect path before deciding anything. The tool surface and setup steps are at /developers; pricing is at /pricing.
— Architect
Independence notice. SAIHM is an Apache-2.0 protocol authored independently. The benchmark referenced here is open source and reproducible offline; the figures are produced by the published script and depend on session length and scenario. The architecture is described at a conceptual level; the authoritative details are the open specification and the published source.