
The Memory Layer architecture — zero LLM cost session capture

Memory Layer · Architecture · Claude Code

Persistent memory across every Claude Code session — captured at $0 per tool call. Every user prompt, every agent reasoning step, every raw conversation transcript, queryable forever. Here's how we built it.

The pain we set out to fix

Every Claude Code session today ends the same way: the agent restarts, the conversation context evaporates, and the debugging discussion that produced your week's biggest architectural call dies in chat scrollback. You can re-explain it. You can paste old screenshots. You can rebuild the mental model. None of it scales beyond a few weeks of work.

The standard fix is to compress every session through an LLM and store the summary. That works — at a cost. $0.01–0.05 per tool call adds up fast on a heavy day. Teams either turn it off when the bill arrives, or run it on a schedule that misses the discussions worth capturing. Either way the foundation is wrong: paying a model to read what another model just wrote.

We took a different angle. Three observations changed the design:

Three observations

1. Claude Code already writes the transcript. Every session is on disk at ~/.claude/projects/<hash>/<sessionUuid>.jsonl. Every user message, every assistant message, every tool call, every tool result. We don't need to capture it. We need to find it.
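Those transcripts are line-delimited JSON, so "finding" a session is just a matter of streaming its lines. A minimal sketch — the `type`/`message` field names here are a simplified, hypothetical stand-in for the real (richer) schema:

```python
import json
from io import StringIO

# Hypothetical sample lines; each line of a real transcript in
# ~/.claude/projects/<hash>/<sessionUuid>.jsonl is one JSON object.
sample = StringIO("\n".join([
    '{"type": "user", "message": {"content": "fix the auth bug"}}',
    '{"type": "assistant", "message": {"content": "Looking at the login flow..."}}',
    '{"type": "user", "message": {"content": "also add a regression test"}}',
]))

def count_by_type(lines):
    """Tally transcript entries by their top-level type field."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        kind = json.loads(line).get("type", "unknown")
        counts[kind] = counts.get(kind, 0) + 1
    return counts

print(count_by_type(sample))  # → {'user': 2, 'assistant': 1}
```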

2. The agent can summarize itself. Claude is already running. It can write its own session digest as the final tool call before Stop fires — no background model run, no separate compression service. The marginal cost of that summary is one extra structured tool call in a session that was already running anyway.

3. The hard problems are not capture — they are retention and privacy. Capturing every prompt and tool call is mechanically easy. Deciding what to keep for 30 days vs forever, and isolating each developer's reasoning trail from their teammates', is where the design choices actually live.

The 3-tier retention model

The architecture decision came out of a long product discussion (more on that in a minute):

  • Tier 1 — Cloud (Sprintra)

    User prompts and agent-written session digests, scoped per-user, 30-day rolling rotation by default. Pinned digests are preserved indefinitely — your strategic discussions don't expire.

  • Tier 2 — Local (your machine)

    A thin index over Claude Code's existing transcript files. We don't copy bytes; we just index where they already live. 90-day local retention. Never syncs to the cloud.

  • Tier 3 — PM artifacts (forever)

    Stories, decisions, features, notes, comments, work sessions. The canonical project record. No rotation. Same shared model as today.
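The three tiers reduce to a small routing decision per record. A sketch of that routing, with hypothetical record-kind names — the real mapping lives inside Sprintra:

```python
# Hypothetical routing table mirroring the three tiers described above:
# (tier, retention_days); None means no rotation.
RETENTION = {
    "prompt":      ("cloud", 30),    # Tier 1: 30-day rolling
    "digest":      ("cloud", 30),    # Tier 1: 30-day unless pinned
    "transcript":  ("local", 90),    # Tier 2: indexed in place, never synced
    "pm_artifact": ("cloud", None),  # Tier 3: canonical record, forever
}

def retention_for(kind, pinned=False):
    """Return (tier, retention_days) for a record kind."""
    tier, days = RETENTION[kind]
    if kind == "digest" and pinned:
        days = None  # pin = keep forever
    return tier, days

print(retention_for("digest", pinned=True))  # → ('cloud', None)
```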

Layer 1 — Capturing every user prompt

The Sprintra plugin hooks Claude Code's UserPromptSubmit event. Every message you send is captured to your private timeline; the write is fire-and-forget and never blocks your prompt from reaching Claude. Attribution is per-user by default.
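The fire-and-forget property matters: the hook must return before any network write completes. A minimal sketch of that pattern, with hypothetical names — the handler enqueues and returns immediately, and a daemon worker does the upload:

```python
import queue
import threading

captured = []          # stand-in for the remote timeline
outbox = queue.Queue()

def worker():
    # Drains the queue in the background; in the real plugin this
    # would be an HTTPS POST to the Sprintra API.
    while True:
        event = outbox.get()
        captured.append(event)
        outbox.task_done()

threading.Thread(target=worker, daemon=True).start()

def on_user_prompt_submit(prompt, user_id):
    """Hypothetical hook handler: O(1) enqueue, never blocks the prompt."""
    outbox.put({"user": user_id, "prompt": prompt})

on_user_prompt_submit("why is login flaky?", "user-a")
outbox.join()  # only for the demo; the real hook never waits
print(captured[0]["prompt"])  # → why is login flaky?
```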

Each new session's briefing now leads with a “Recent asks” line: the last three prompts you sent in the past 24 hours. The next time you open Claude Code in this repo, it knows what you were thinking yesterday — not just what files you touched.
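The “Recent asks” selection is a simple window-and-rank query. A sketch under the stated defaults (24-hour window, newest three):

```python
from datetime import datetime, timedelta, timezone

def recent_asks(prompts, now, window_hours=24, limit=3):
    """Return the newest `limit` prompt texts sent within the window."""
    cutoff = now - timedelta(hours=window_hours)
    fresh = [p for p in prompts if p["at"] >= cutoff]
    fresh.sort(key=lambda p: p["at"], reverse=True)
    return [p["text"] for p in fresh[:limit]]

now = datetime(2025, 1, 2, 12, 0, tzinfo=timezone.utc)
prompts = [
    {"text": "fix auth bug",    "at": now - timedelta(hours=2)},
    {"text": "add retry logic", "at": now - timedelta(hours=5)},
    {"text": "old question",    "at": now - timedelta(hours=30)},  # outside window
]
print(recent_asks(prompts, now))  # → ['fix auth bug', 'add retry logic']
```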

Layer 2 — Session digests with a pin flag

When Stop fires (Ctrl+C, /clear, compaction), we want a structured summary of what just happened. Not a raw transcript dump — a digest. What was discussed. What got decided. What questions are still open. Where the user pushed back. What artifacts (stories, decisions, files) the session produced.

The agent writes this itself. Just before Stop, the agent emits a structured digest with the relevant fields populated. The model is already running with the full session context loaded; producing the digest is one extra structured tool call, not a separate LLM round-trip. Cost: zero.
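The digest fields named above map naturally onto a small structured record. A hypothetical shape — the actual Sprintra schema may differ:

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class SessionDigest:
    """Hypothetical digest record covering the fields described above."""
    discussed: list = field(default_factory=list)
    decided: list = field(default_factory=list)
    open_questions: list = field(default_factory=list)
    pushback: list = field(default_factory=list)      # where the user disagreed
    artifacts: list = field(default_factory=list)     # stories, decisions, files
    pinned: bool = False

digest = SessionDigest(
    discussed=["retention tiers"],
    decided=["30-day rolling rotation for cloud digests"],
    open_questions=["should pins be team-visible?"],
)
print(json.dumps(asdict(digest), indent=2))
```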

The 30-day rotation works for routine sessions. But strategic discussions are too valuable to evaporate. So every digest can be pinned. Pin = forever. The next briefing surfaces pinned digests preferentially — your foundational architecture conversation from three months ago still shows up at the top of your context.
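"Pinned first, then newest" is a one-line sort. A sketch with hypothetical fields:

```python
# Hypothetical briefing order: pinned digests first, then by recency
# (higher "day" = more recent session).
digests = [
    {"title": "routine refactor",  "pinned": False, "day": 29},
    {"title": "auth architecture", "pinned": True,  "day": 2},
    {"title": "bug triage",        "pinned": False, "day": 30},
]

briefing = sorted(digests, key=lambda d: (not d["pinned"], -d["day"]))
print([d["title"] for d in briefing])
# → ['auth architecture', 'bug triage', 'routine refactor']
```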

Here's how that pin flag got into the spec. During the design discussion that filed the retention ADR:

“Do we really need cloud session history for two months three months four months no right this is just so that you know the last session is what needs to be retrieved… conversation more than 30 days makes no sense.”

Correct read. PM artifacts hold long-term value; conversation context decays fast. 30-day rotation is right for 95% of sessions. The pin flag covers the remaining 5% without dragging the storage cost back up.

Layer 3 — Local transcript search without byte duplication

Sometimes you don't want the digest — you want the literal words. “What exactly did I tell Claude about the auth bug last week?”

The naive answer is to copy every transcript file into our own data directory and search there. We almost shipped that — until a user pushed back during spec review:

“That's too much. Don't duplicate Claude Code's files. They're already on the machine.”

Right again. Claude Code keeps the source-of-truth transcripts on disk and rotates them on its own schedule. We just need to know what's there.

So the shipped design is a thin index, not a store. sprintra transcript reindex streams every transcript Claude Code has written, extracts metadata (session id, message count, tool calls, timestamps), and populates a local full-text search index. The index points at the original files; we never copy the bytes.
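A thin index like this is easy to picture with SQLite's FTS5: searchable metadata and text in the index, with an unindexed column pointing back at the original file. A sketch (assumes your SQLite build includes FTS5, which is true for most Python distributions; the real index layout is Sprintra's own):

```python
import sqlite3

db = sqlite3.connect(":memory:")
# `path UNINDEXED` stores the pointer to the original transcript
# without tokenizing it — we never copy the transcript bytes.
db.execute("""
    CREATE VIRTUAL TABLE transcripts USING fts5(
        session_id, path UNINDEXED, body
    )
""")
db.execute(
    "INSERT INTO transcripts VALUES (?, ?, ?)",
    ("abc-123", "~/.claude/projects/h1/abc-123.jsonl",
     "user: what exactly causes the auth bug on login?"),
)

# Keyword search returns the file path plus a highlighted snippet.
rows = db.execute(
    "SELECT path, snippet(transcripts, 2, '[', ']', '…', 8) "
    "FROM transcripts WHERE transcripts MATCH ?",
    ("auth bug",),
).fetchall()
print(rows[0][0])  # → ~/.claude/projects/h1/abc-123.jsonl
```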

On the machine this article was written on, that one command indexed 481 historical transcripts in 13.8 seconds. Total source size on disk: 465 MB — none of it duplicated by us. Keyword search returns ranked snippets with highlights in well under 100 ms.

# Find every session where you discussed a specific topic
sprintra transcript search "auth bug"

# Open a specific session in place (no copy made)
sprintra transcript view <session-id>

# Disk usage / oldest / newest / retention
sprintra transcript status

Multi-user privacy by default

Three developers in the same repo, working on the same Sprintra project. Each opens their own Claude Code session. What gets shared, what stays private?

Private per user: their prompts, their session digests. User A retrieves only User A's data; User B can't see it without admin role (audit override). Raw transcripts never leave the machine that produced them.

Shared at project level: features, decisions, notes, comments, work sessions. Same model as today — collaborative outputs stay collaborative.

The result: each developer has their own private reasoning trail; the team has a shared project record. Both move forward independently.
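The visibility rule reduces to one predicate per record: shared kinds are project-wide, everything else is owner-only unless the caller is an admin. A sketch with hypothetical names:

```python
# Hypothetical records; "decision" belongs to the shared project tier,
# "prompt" to the private per-user tier.
RECORDS = [
    {"kind": "prompt",   "owner": "user-a", "text": "fix auth"},
    {"kind": "prompt",   "owner": "user-b", "text": "ship feature"},
    {"kind": "decision", "owner": "user-a", "text": "use JWT"},
]

SHARED_KINDS = {"feature", "decision", "note", "comment", "work_session"}

def visible_to(user, is_admin=False):
    """Shared kinds are visible to everyone; private kinds only to
    their owner, or to an admin (the audit override)."""
    return [
        r for r in RECORDS
        if r["kind"] in SHARED_KINDS or r["owner"] == user or is_admin
    ]

print([r["text"] for r in visible_to("user-b")])  # → ['ship feature', 'use JWT']
```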

The numbers

A heavy user: ~50 sessions/month, averaging 200K tokens of transcript text each. The math:

Storage approach                          Cloud storage / heavy user / year
Upload every raw transcript               ~9 GB
Sprintra Memory Layer (digests only)      ~6 MB

Three orders of magnitude smaller. And the digest is what you actually want for cross-device handoff anyway — when you switch from your home laptop to your office desktop, you don't need yesterday's 5 MB transcript. You need the 5 KB summary that says “here's what was discussed, here's what's pending.”
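The digest-side arithmetic, assuming a hypothetical ~10 KB average digest (the per-digest size is our assumption, not a measured figure):

```python
# Back-of-envelope digest storage for a heavy user.
sessions_per_year = 50 * 12   # ~50 sessions/month
digest_kb = 10                # assumed average digest size
total_mb = sessions_per_year * digest_kb / 1024
print(round(total_mb, 1))  # → 5.9  (i.e. ~6 MB/year)
```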

The cost per tool call: $0. No background LLM runs. No token spend on compression. The digest is written by the model that was already running for the session.

Try it

# 1. Install the plugin in Claude Code
/plugin marketplace add Sprintra-io/sprintra-mcp
/plugin install sprintra@sprintra

# 2. Install the CLI
npm install -g @sprintra/cli@latest

# 3. One-time backfill — index every Claude Code session you've had
sprintra transcript reindex

# 4. Verify capture is live — open a fresh Claude Code session,
# send any prompt, then check the dashboard or run:
sprintra transcript status

The full architecture reference is at /docs/memory-layer. Multi-user behavior is documented at /docs/teams.

We built this because every session restart was costing us context. The Memory Layer is now load-bearing in Sprintra's own development — this very article was written across three sessions, with the previous session's digest surfacing at the top of each new briefing. The system works on the system that built it.

Get persistent memory in your AI coding sessions

Free for solo developers. @sprintra/cli@0.6.0 + @sprintra/plugin@1.3.0 on npm now.

Get started →