2026-04-22 05:34:08 | Cybersecurity article | Source: ZONE.CI 全球网

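Article summary: Claude Code's memory stack prioritizes prompt cache hit rate over preserving complete memory, using a four-tier architecture: Tier 1 keeps the core context stable with a hard cap on MEMORY.md; Tier 2 retrieves long-tail memories through an independent side-query so the main cache is not polluted; Tier 3 removes tool_result content through server-side protocols such as cached microcompact without changing the message prefix; Tier 4 handles the current conversation. Key design principles include cache-first editing, clearing differentiated by tool type, and time-aware compaction thresholds. Overall score: 82. Categories: secure development, solutions, technical standards, security tools, cloud security.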


Claude Code's Memory Stack (Part 1): A Memory System Built for the Prompt Cache

马甲三号

April 21, 2026, 00:28, Jiangsu


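Notes from reading src/memdir/ (~4,000 lines) in the Claude Code source. This part covers only the design rationale: why the system is shaped the way it is. The next part covers practice: how to carry these ideas over to a self-hosted inference stack.

I. An overlooked fact

Most people take LLM "memory" to mean: make the model remember more of the conversation history.

But once you actually read the Claude Code source, a very different priority shows up:

What it protects first is not the "complete context" but the prompt cache hit rate.

Every memory tier, compaction strategy, and injection point serves one goal:

Change the already-cached prefix as little as possible.

Once this priority clicks, many of the architecture's seemingly odd choices make sense.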

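II. Why this matters: a real cold-prefill story

In one multi-turn agent scenario we hit a textbook failure:

• The first request ran for 300 seconds and then returned http_504.
• A 17K-token prompt × a 35B MoE model × a cold cache ≈ 5+ minutes of prefill.
• The client timed out after 60 to 120 s and never saw a result.
• In the same window, 13 of 26 requests logged prefix_applied_no_hit (a 50% cache break rate).

On the surface, the cause is that the inference backend's prefix cache only does exact hash matching, while in a multi-turn conversation the messages array grows on every request (new assistant / tool_call / tool_result entries), so the hash is always different and the lookup always misses.

The deeper cause: all 17K tokens were crammed into the "current conversation" layer, with no mechanism to stabilize the prefix, evict the middle, or protect the KV cache.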
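
To make the miss concrete, here is a toy sketch (not taken from any real backend) that contrasts a single hash over the whole prompt with block-wise chained hashes. The string tokens, the block size of 4, and the FNV-1a hash are all assumptions chosen for illustration.

```typescript
// Toy illustration only: real backends hash token IDs, not strings,
// and the block size used here (4) is an arbitrary assumption.

/** FNV-1a over a string; a stand-in for whatever hash a backend uses. */
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

/** Exact matching: one key for the whole prompt. */
function wholePromptKey(tokens: string[]): number {
  return fnv1a(tokens.join("\u0000"));
}

/** Block-wise matching: each key chains the key of the previous block. */
function blockKeys(tokens: string[], blockSize = 4): number[] {
  const keys: number[] = [];
  let prev = 0;
  for (let i = 0; i + blockSize <= tokens.length; i += blockSize) {
    prev = fnv1a(prev + "|" + tokens.slice(i, i + blockSize).join("\u0000"));
    keys.push(prev);
  }
  return keys;
}

/** Number of leading blocks of the new prompt that are already cached. */
function reusableBlocks(cached: Set<number>, next: number[]): number {
  let n = 0;
  for (const key of next) {
    if (!cached.has(key)) break;
    n++;
  }
  return n;
}

// Turn 2 is turn 1 plus the newly appended assistant / tool messages.
const turn1 = ["sys", "a", "b", "c", "user", "q1", "x", "y"];
const turn2 = turn1.concat(["asst", "tool_call", "tool_result", "z"]);

const exactHit = wholePromptKey(turn1) === wholePromptKey(turn2);
const reused = reusableBlocks(new Set(blockKeys(turn1)), blockKeys(turn2));
console.log({ exactHit, reused });
```

With one key over the whole prompt, any appended message changes the key. With chained block keys, the unchanged leading blocks stay addressable, and that stable prefix is what the tiers described below try to preserve.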

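Claude Code does not fail this way because it has a four-tier memory stack in which each tier takes the hit for the one above it.

III. Claude Code's four-tier memory stack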

┌──────────────────────────────────────────────────────────────────────────┐
│ Tier 1  MEMORY.md               persistent, hard-capped, system-injected │
├──────────────────────────────────────────────────────────────────────────┤
│ Tier 2  findRelevantMemories    on-demand retrieval, side-query          │
├──────────────────────────────────────────────────────────────────────────┤
│ Tier 3  compact / microCompact  3 microcompact paths + /compact          │
│    3a  Cached Microcompact      (messages left unchanged)                │
│    3b  Time-based Microcompact  (proactive cleanup after a long gap)     │
│    3c  API Microcompact         (classified per tool)                    │
│    3d  /compact                 (summary + file restore)                 │
├──────────────────────────────────────────────────────────────────────────┤
│ Tier 4  current conversation messages                                    │
└──────────────────────────────────────────────────────────────────────────┘

Tier 1: MEMORY.md, trading a hard cap for determinism

It lives in src/memdir/memdir.ts. The first thing you see is the entrypoint name followed by two blunt caps:

export const ENTRYPOINT_NAME = 'MEMORY.md'
export const MAX_ENTRYPOINT_LINES = 200
// ~125 chars/line at 200 lines. At p97 today; catches long-line indexes that
// slip past the line cap (p100 observed: 197KB under 200 lines).
export const MAX_ENTRYPOINT_BYTES = 25_000

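200 lines and 25,000 bytes (about 25 KB) together form a hard cap. Content beyond either limit is truncated automatically, and a warning is appended: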

content:
  truncated +
  `\n\n> WARNING: ${ENTRYPOINT_NAME} is ${reason}. Only part of it was loaded. ` +
  `Keep index entries to one line under ~200 chars; move detail into topic files.`,
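
The article does not show the truncation logic itself, so the following is only a minimal sketch of how a two-cap load could work. The three constants and the warning text come from the excerpts above; the function name `loadEntrypoint`, the `reason` strings, and the line-then-byte order are my assumptions.

```typescript
// Hypothetical sketch: only the three constants and the warning text come
// from the excerpts above; the function and the `reason` strings are made up.
const ENTRYPOINT_NAME = "MEMORY.md";
const MAX_ENTRYPOINT_LINES = 200;
const MAX_ENTRYPOINT_BYTES = 25_000;

const utf8Bytes = (s: string): number => new TextEncoder().encode(s).length;

function loadEntrypoint(raw: string): { content: string; reason: string | null } {
  const lines = raw.split("\n");
  const overLines = lines.length > MAX_ENTRYPOINT_LINES;
  let kept = lines.slice(0, MAX_ENTRYPOINT_LINES);

  // Byte cap, step 1: drop whole trailing lines until the text fits.
  let overBytes = false;
  while (kept.length > 1 && utf8Bytes(kept.join("\n")) > MAX_ENTRYPOINT_BYTES) {
    kept = kept.slice(0, -1);
    overBytes = true;
  }
  // Byte cap, step 2: a single oversized line is cut by characters
  // (crude: this may split a character that spans two code units).
  let body = kept.join("\n");
  while (utf8Bytes(body) > MAX_ENTRYPOINT_BYTES) {
    body = body.slice(0, Math.floor(body.length * 0.9));
    overBytes = true;
  }

  // Within both caps: return the input byte-for-byte so it stays cacheable.
  if (!overLines && !overBytes) return { content: raw, reason: null };

  const reason = overBytes ? "over the byte cap" : "over the line cap";
  const warning =
    `\n\n> WARNING: ${ENTRYPOINT_NAME} is ${reason}. Only part of it was loaded. ` +
    `Keep index entries to one line under ~200 chars; move detail into topic files.`;
  return { content: body + warning, reason };
}

const small = loadEntrypoint("- [auth](auth.md): login flow notes");
const tooManyLines = loadEntrypoint(Array(300).fill("x").join("\n"));
const oneHugeLine = loadEntrypoint("a".repeat(30_000));
console.log(small.reason, tooManyLines.reason, oneHugeLine.reason);
```

The point of the sketch is the determinism: the same file always yields the same bytes, and an oversized file degrades to a bounded, predictable block instead of growing the system prompt.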

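What is the design philosophy?

MEMORY.md is the core context the user carries no matter what, so it has to be small, predictable, and identical on every request. Only then can the prompt cache reuse it reliably.

It is loaded once per session, injected through the system prompt, and survives across sessions. That is the persistent tier's only job: not remembering more, but guaranteeing that this small block never changes.

Tier 2: findRelevantMemories, moving retrieval to a side lane

The real long-tail memory (the user's other memory files) is not stuffed directly into the main conversation. src/memdir/findRelevantMemories.ts handles it carefully: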

const result = await sideQuery({
  model: getDefaultSonnetModel(),
  system: SELECT_MEMORIES_SYSTEM_PROMPT,
  skipSystemPromptPrefix: true,
  messages: [
    {
      role: 'user',
      content: `Query: ${query}\n\nAvailable memories:\n${manifest}${toolsSection}`,
    },
  ],
  max_tokens: 256,
  output_format: { /* JSON schema: selected_memories: string[] */ },
  signal,
  querySource: 'memdir_relevance',
})

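Key points:

1. The memory files themselves are not loaded into the main conversation.
2. A cheap, separate Sonnet call with max_tokens=256 sees only filenames and descriptions and picks at most 5.
3. The main conversation loads only the memories that are actually relevant, possibly none.
4. The side-query uses its own cache and does not pollute the main conversation's prefix.

In one sentence: memory retrieval is an independent LLM task, and its cost does not drag down the main session.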
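
The two ends of that side-query can be sketched without the model call in between. This is an assumption-laden illustration: `MemoryHeader`, `buildSelectionPrompt`, and `parseSelection` are hypothetical names, and only the prompt layout and the `selected_memories` field come from the excerpt above.

```typescript
// Sketch only. The model call itself is left out; `responseText` stands in
// for whatever JSON the side-query returns. All names here are hypothetical.
interface MemoryHeader {
  filename: string;
  description: string;
}

const MAX_SELECTED = 5; // "at most 5", per the list above

/** Build the user message: filenames and descriptions only, never file bodies. */
function buildSelectionPrompt(query: string, headers: MemoryHeader[]): string {
  const manifest = headers.map((h) => `- ${h.filename}: ${h.description}`).join("\n");
  return `Query: ${query}\n\nAvailable memories:\n${manifest}`;
}

/** Parse the response defensively; any failure means "load nothing extra". */
function parseSelection(responseText: string, headers: MemoryHeader[]): string[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(responseText);
  } catch {
    return [];
  }
  if (typeof parsed !== "object" || parsed === null) return [];
  const names = (parsed as { selected_memories?: unknown }).selected_memories;
  if (!Array.isArray(names)) return [];

  // Keep only filenames that exist in the manifest, dedupe, enforce the cap.
  const known = new Set(headers.map((h) => h.filename));
  const valid = names.filter(
    (n): n is string => typeof n === "string" && known.has(n),
  );
  return Array.from(new Set(valid)).slice(0, MAX_SELECTED);
}

const headers: MemoryHeader[] = [
  { filename: "auth.md", description: "login flow and token refresh notes" },
  { filename: "deploy.md", description: "release checklist" },
];
const selectionPrompt = buildSelectionPrompt("why does login fail?", headers);
const picked = parseSelection('{"selected_memories":["auth.md","ghost.md"]}', headers);
console.log(picked); // ghost.md is dropped because it is not in the manifest
```

Because a bad or empty response simply yields an empty list, a failed side-query costs the main conversation nothing: its prefix is the same as if no memory had matched.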

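Tier 3: staged compaction, or how to "touch the messages" without breaking the cache

This is the most technically demanding tier in the whole architecture, and the one other projects most often miss when copying it. It has three microcompact paths (3a to 3c) plus the full /compact (3d).

3a. Cached Microcompact: cache-preserving edit

This is Claude's killer feature. When old tool_result entries must be removed to shrink the prompt, it does not modify the local messages. Instead it tells the API, in effect: "delete these tool entries in place in the server-side KV cache and leave the prefix unchanged."

cachedMicrocompactPath in src/services/compact/microCompact.ts: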

return {
  messages,
  compactionInfo: {
    pendingCacheEdits: {
      trigger: 'auto',
      deletedToolIds: toolsToDelete,
      baselineCacheDeletedTokens: baseline,
    },
  },
}

The source comment is blunt:

Return messages unchanged — cache_reference and cache_edits are added at API layer

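This relies on two native context management strategies in the Anthropic API: clear_tool_uses_20250919 and clear_thinking_20251015 (see apiMicrocompact.ts). It is a server-side, protocol-level capability. Most self-hosted inference stacks (vLLM / MLX-based) have no equivalent yet, which is the root cause of their cache "breaking on the second turn" in multi-turn agent scenarios.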
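
The "plan the deletion, leave the messages alone" pattern can be sketched as follows. This is not the source's implementation: the message shapes and `planCacheEdits` are hypothetical, and the keep-the-most-recent-N selection rule is my assumption about how `toolsToDelete` might be chosen.

```typescript
// Hypothetical message shapes; only `trigger` and `deletedToolIds` echo the
// excerpt above. `baselineCacheDeletedTokens` is omitted because its exact
// meaning is not shown in the article.
interface ToolResultMsg {
  kind: "tool_result";
  toolUseId: string;
  content: string;
}
interface TextMsg {
  kind: "text";
  role: "user" | "assistant";
  text: string;
}
type Msg = ToolResultMsg | TextMsg;

interface PendingCacheEdits {
  trigger: "auto";
  deletedToolIds: string[];
}

function planCacheEdits(
  messages: readonly Msg[],
  keepRecent: number,
): { messages: readonly Msg[]; pendingCacheEdits: PendingCacheEdits | null } {
  // Collect tool_result ids in conversation order.
  const ids: string[] = [];
  for (const m of messages) {
    if (m.kind === "tool_result") ids.push(m.toolUseId);
  }
  // Everything except the most recent `keepRecent` results is a candidate.
  const deletedToolIds = ids.slice(0, Math.max(0, ids.length - keepRecent));

  return {
    messages, // same array, untouched: the serialized prefix stays identical
    pendingCacheEdits:
      deletedToolIds.length > 0 ? { trigger: "auto", deletedToolIds } : null,
  };
}

const history: Msg[] = [
  { kind: "text", role: "user", text: "find the bug" },
  { kind: "tool_result", toolUseId: "t1", content: "grep output 1" },
  { kind: "tool_result", toolUseId: "t2", content: "file contents" },
  { kind: "tool_result", toolUseId: "t3", content: "grep output 2" },
  { kind: "tool_result", toolUseId: "t4", content: "test log" },
];
const plan = planCacheEdits(history, 2);
console.log(plan.pendingCacheEdits); // t1 and t2 are marked; history is unchanged
```

The deletion is expressed as metadata next to the messages rather than as a mutation of them, so whatever layer talks to the API can translate it into a server-side edit while the locally built prompt keeps hashing to the same prefix.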

3b. Time-based Microcompact: admit defeat and stop the bleeding

src/services/compact/timeBasedMCConfig.ts is only 40 lines long, but the idea in it matters:

const TIME_BASED_MC_CONFIG_DEFAULTS: TimeBasedMCConfig = {
  enabled: false,
  gapThresholdMinutes: 60,
  keepRecent: 5,
}

The judgment in the comment is worth more than the code itself:

60 is the safe choice: the server’s 1h cache TTL is guaranteed expired for all users, so we never force a miss that wouldn’t have happened.

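In other words:

When preserving the cache is no longer possible (more than 60 minutes since the last assistant message, so the server-side KV cache has almost certainly expired), stop fighting. Clear the old tool_result entries up front so that the unavoidable cold prefill processes fewer tokens.

This is the most "veteran engineer" move in the whole design: concede when you have to, and keep the loss as small as possible.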
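
A minimal sketch of that decision, assuming tool results are tracked as a simple ordered list. The config fields match the quoted defaults; `timeBasedClear`, the `ToolResult` shape, and the placeholder string are hypothetical. Note that the quoted defaults ship with `enabled: false`, so the usage below opts in explicitly.

```typescript
interface TimeBasedMCConfig {
  enabled: boolean;
  gapThresholdMinutes: number;
  keepRecent: number;
}

// Hypothetical shape for illustration.
interface ToolResult {
  id: string;
  content: string;
}

const CLEARED_PLACEHOLDER = "[old tool result cleared]"; // made-up marker text

function timeBasedClear(
  results: ToolResult[],
  lastAssistantAtMs: number,
  nowMs: number,
  cfg: TimeBasedMCConfig,
): ToolResult[] {
  if (!cfg.enabled) return results;

  const gapMinutes = (nowMs - lastAssistantAtMs) / 60_000;
  // Within the window the cache may still be warm: do not force a miss.
  if (gapMinutes <= cfg.gapThresholdMinutes) return results;

  // The cache is presumed gone, so shrinking the prompt now costs nothing extra.
  const cutoff = Math.max(0, results.length - cfg.keepRecent);
  return results.map((r, i) =>
    i < cutoff ? { ...r, content: CLEARED_PLACEHOLDER } : r,
  );
}

const cfg: TimeBasedMCConfig = { enabled: true, gapThresholdMinutes: 60, keepRecent: 5 };
const results: ToolResult[] = Array.from({ length: 7 }, (_, i) => ({
  id: `t${i + 1}`,
  content: `output ${i + 1}`,
}));

const start = 0;
const warm = timeBasedClear(results, start, start + 30 * 60_000, cfg); // 30 min gap
const cold = timeBasedClear(results, start, start + 61 * 60_000, cfg); // 61 min gap
console.log(warm === results, cold.filter((r) => r.content === CLEARED_PLACEHOLDER).length);
```

The asymmetry is the design choice worth copying: inside the window the function is a no-op so it can never cause a miss, and outside it the edit is free because the miss has already happened.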

3c. API Microcompact: clearing differentiated by tool type

src/services/compact/apiMicrocompact.ts explicitly distinguishes two classes of tools:

const TOOLS_CLEARABLE_RESULTS = [
  ...SHELL_TOOL_NAMES,
  GLOB_TOOL_NAME,
  GREP_TOOL_NAME,
  FILE_READ_TOOL_NAME,
  WEB_FETCH_TOOL_NAME,
  WEB_SEARCH_TOOL_NAME,
]

const TOOLS_CLEARABLE_USES = [
  FILE_EDIT_TOOL_NAME,
  FILE_WRITE_TOOL_NAME,
  NOTEBOOK_EDIT_TOOL_NAME,
]

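The reasoning behind the split:

• Results from FileRead / Grep / WebFetch are replayable information; dropping the content loses no semantics.
• A FileEdit / FileWrite use represents a state change. Deleting it loses the fact that "I changed this", so even where the result can be dropped, the use has to stay.

A one-size-fits-all compaction algorithm cannot make this distinction and would clear file_edit along with everything else.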
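
The policy as this article describes it can be sketched locally. Everything here is illustrative: the tool names, the two sets, and the block shapes are hypothetical and are not the constants from the source, and this sketch deliberately keeps results of state-changing tools as well, which is more conservative than the text requires.

```typescript
// Hypothetical tool names and shapes, chosen only to illustrate the policy
// described above; they are not the identifiers used in the source.
const REPLAYABLE_TOOLS = new Set(["Bash", "Glob", "Grep", "Read", "WebFetch", "WebSearch"]);
const STATE_CHANGING_TOOLS = new Set(["Edit", "Write", "NotebookEdit"]);

type Block =
  | { kind: "tool_use"; id: string; tool: string; input: string }
  | { kind: "tool_result"; toolUseId: string; content: string };

const CLEARED = "[cleared]"; // made-up placeholder

function selectiveClear(blocks: Block[]): Block[] {
  // Map each tool_use id to its tool name so results can be classified.
  const toolById = new Map<string, string>();
  for (const b of blocks) {
    if (b.kind === "tool_use") toolById.set(b.id, b.tool);
  }

  return blocks.map((b) => {
    // Every tool_use is kept: it records what was done, edits included.
    if (b.kind !== "tool_result") return b;
    const tool = toolById.get(b.toolUseId);
    // Only results that can be regenerated by re-running the tool are cleared.
    return tool !== undefined && REPLAYABLE_TOOLS.has(tool)
      ? { ...b, content: CLEARED }
      : b;
  });
}

const blocks: Block[] = [
  { kind: "tool_use", id: "u1", tool: "Read", input: "src/app.ts" },
  { kind: "tool_result", toolUseId: "u1", content: "...2,000 lines of source..." },
  { kind: "tool_use", id: "u2", tool: "Edit", input: "src/app.ts: rename foo to bar" },
  { kind: "tool_result", toolUseId: "u2", content: "edit applied" },
];
const compacted = selectiveClear(blocks);
console.log(compacted.map((b) => (b.kind === "tool_result" ? b.content : b.tool)));
```

After clearing, the transcript still says that src/app.ts was read and then edited; only the bulky, re-fetchable read output is gone.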

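3d. Full `/compact`: the last resort

src/services/compact/compact.ts runs to 1,705 lines. The core is the compact prompt in prompt.ts, which has the LLM summarize the whole conversation into a structured nine-section summary (Primary Request / Key Technical Concepts / Files and Code Sections / Errors and fixes / …). After that, the top 5 most relevant files are restored as attachments: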

export const POST_COMPACT_MAX_FILES_TO_RESTORE = 5
export const POST_COMPACT_TOKEN_BUDGET = 50_000
export const POST_COMPACT_MAX_TOKENS_PER_FILE = 5_000

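The notable part is the "file restore". Many compact implementations produce only a summary, but a summary loses the concrete code. Claude's approach:

The summary preserves intent, file restore preserves detail, and each file gets a hard budget of 5K tokens.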
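
How the three budgets might interact can be sketched like this. Only the constants come from the excerpt; ranking by recency as a proxy for "most relevant", the 4-characters-per-token estimate, and the names `RecentFile` and `pickFilesToRestore` are my assumptions.

```typescript
const POST_COMPACT_MAX_FILES_TO_RESTORE = 5;
const POST_COMPACT_TOKEN_BUDGET = 50_000;
const POST_COMPACT_MAX_TOKENS_PER_FILE = 5_000;

// Hypothetical shape; recency stands in for relevance in this sketch.
interface RecentFile {
  path: string;
  content: string;
  lastReadAt: number;
}

// Crude assumption: about 4 characters per token.
const CHARS_PER_TOKEN = 4;
const estimateTokens = (s: string): number => Math.ceil(s.length / CHARS_PER_TOKEN);

function pickFilesToRestore(files: RecentFile[]) {
  const restored: { path: string; text: string; tokens: number }[] = [];
  let used = 0;

  // Most recently read first; copy the array so the caller's order is untouched.
  const byRecency = files.slice().sort((a, b) => b.lastReadAt - a.lastReadAt);

  for (const f of byRecency) {
    if (restored.length >= POST_COMPACT_MAX_FILES_TO_RESTORE) break;
    // Per-file cap: keep only the head of the file.
    const text = f.content.slice(0, POST_COMPACT_MAX_TOKENS_PER_FILE * CHARS_PER_TOKEN);
    const tokens = estimateTokens(text);
    // Overall cap: stop once the shared budget would be exceeded.
    if (used + tokens > POST_COMPACT_TOKEN_BUDGET) break;
    restored.push({ path: f.path, text, tokens });
    used += tokens;
  }
  return restored;
}

const candidates: RecentFile[] = Array.from({ length: 7 }, (_, i) => ({
  path: `src/file${i}.ts`,
  content: "x".repeat(40_000), // about 10K tokens each under the estimate above
  lastReadAt: i,
}));
const restored = pickFilesToRestore(candidates);
console.log(restored.map((r) => `${r.path}: ${r.tokens}`));
```

With these numbers, 5 files × 5,000 tokens is 25,000 tokens, so in this sketch the 50,000-token budget acts only as an outer guard; the per-file and file-count caps are the ones that bind.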

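IV. Five design principles

With the four tiers read, five clear principles behind Claude Code's memory stack can be distilled:

1. Cache-preserving edit first: do not change the prompt if you can avoid it. Delete from the cache through the API-side protocol and keep the local messages as they are.
2. Selective clear: classify by tool type. Results can be dropped, state-changing uses must stay, and thinking keeps only the most recent N.
3. Side-query isolation: every memory retrieval is an independent LLM call that does not touch the main cache.
4. Hard and soft thresholds on two tracks: 180K triggers API context management, 40K is the target, and 200 lines / 25 KB is the hard cap on MEMORY.md. Thresholds are deterministic guardrails, not suggestions.
5. Time awareness: use cached MC within the cache TTL, time-based MC after it expires, and /compact only when the context is too long or too heavy.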
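
Principle 5 amounts to a small decision function. The sketch below is one reading of it: the strategy names, `chooseStrategy`, and both token thresholds in the example are hypothetical (the article gives no number for "too long or too heavy"); only the 60-minute TTL comes from the quoted comment.

```typescript
type Strategy =
  | "none"
  | "cached_microcompact"
  | "time_based_microcompact"
  | "full_compact";

interface Thresholds {
  cacheTtlMinutes: number;    // 60 in the quoted source comment
  microcompactTokens: number; // hypothetical: when trimming tool results starts
  fullCompactTokens: number;  // hypothetical: the "too long / too heavy" line
}

function chooseStrategy(
  inputTokens: number,
  minutesSinceLastAssistant: number,
  t: Thresholds,
): Strategy {
  // Too heavy for trimming to help: summarize and restore files.
  if (inputTokens >= t.fullCompactTokens) return "full_compact";
  // Cache presumed expired: clearing locally no longer costs a hit.
  if (minutesSinceLastAssistant > t.cacheTtlMinutes) return "time_based_microcompact";
  // Cache presumed warm: trim through cache-preserving edits only.
  if (inputTokens >= t.microcompactTokens) return "cached_microcompact";
  return "none";
}

// Example values; the two token thresholds are made up for illustration.
const example: Thresholds = {
  cacheTtlMinutes: 60,
  microcompactTokens: 80_000,
  fullCompactTokens: 160_000,
};

console.log(
  chooseStrategy(10_000, 5, example),
  chooseStrategy(90_000, 5, example),
  chooseStrategy(90_000, 75, example),
  chooseStrategy(170_000, 5, example),
);
```

The ordering encodes the priority: the cheapest action that preserves the prefix is preferred whenever the cache can still be saved, and prompt-changing actions are reserved for when the cache is already lost or the context cannot be rescued by trimming.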

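V. Summary

The most counterintuitive thing about Claude Code's memory stack is that it is barely designed for "memory completeness" at all. It behaves more like a KV-cache-friendly context scheduler:

• Tier 1 ensures "the core context is always stable";
• Tier 2 ensures "long-tail memory does not pollute the main cache";
• Tier 3 ensures "shortening the prompt does not break the prefix";
• Tier 4 is simply "whatever is left after the tiers above have done their work".

If your agent frequently times out or misses the cache in multi-turn scenarios, what you need is not necessarily a larger context window or better embedding retrieval. First ask yourself one question:

Is there any mechanism protecting my prompt prefix?

Once that question is asked, each of the four tiers above becomes a concrete engineering task.

Next part: we will compare these ideas against best practices for self-hosted inference stacks (vLLM / MLX-based) and lay out an adoption path ordered by ROI.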


Reposted from: 马甲三号, "Claude Code's Memory Stack (Part 1): A Memory System Built for the Prompt Cache"