2026-01-14 23:38:27 Cybersecurity articles. Source: ZONE.CI Global Network

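Article summary: This article analyzes how to improve code splitting algorithms for RAG, comparing generic text splitting with Tree-sitter-based AST splitting. Because existing tools cut through the semantics of code, the tests found that benbrandt/text-splitter-rust performs best on line-boundary alignment and semantic completeness. It recommends this approach for code vectorization to improve embedding quality and retrieval results, while noting that model quality matters just as much. Overall score: 90. Categories: AI security, security tools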

An Analysis of AI-Friendlier Code Splitting Algorithms

Original

crossoverJie

January 14, 2026, 08:08, Chongqing

Length: 1192 characters; reading time about 6 minutes

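Background

Recently I have been using RAG[1] to run AI analysis over our code repo. One core step in that pipeline is splitting the source code in the repository: the resulting chunks are what RAG retrieves, and the retrieved chunks are then submitted to the LLM for analysis.

The deepwiki-open[2] we currently use analyzes code with the most generic text_splitter.

It splits in the simplest possible way, by word. In ordinary scenarios text_splitter is enough, but a code-only scenario like ours needs a specialized splitter. The main problem is that it does not understand language structure, so it easily cuts through semantic units such as functions and classes, which leaves retrieved chunks incomplete and loses context.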

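Algorithm comparison

1. Basic text splitting

Overview: the most primitive approach. It ignores code logic and cuts directly by character length or on whitespace. It is like slicing a watermelon at a fixed thickness regardless of how the flesh is structured.

• Pros: simple, and works for any text project.
• Cons: unsuitable for code projects; it often cuts a complete function in half, leaving the LLM confused.

Code example: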

splitter = TextSplitter(**configs["text_splitter"])
document = splitter.split_text(code)
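To see the failure mode concretely, here is a minimal sketch of word-based splitting. It is not the text_splitter implementation that deepwiki-open uses; `split_by_words`, its parameters, and the Java snippet are all made up for illustration.

```python
def split_by_words(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Split text into chunks of at most `chunk_size` whitespace-separated words."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Hypothetical Java source, used only to show where the cut lands.
JAVA_SNIPPET = """
public class AppService {
    public int add(int a, int b) {
        int sum = a + b;
        return sum;
    }
}
"""

chunks = split_by_words(JAVA_SNIPPET, chunk_size=12)
for i, chunk in enumerate(chunks):
    print(f">>>>Chunk {i}: {chunk}")
```

With a budget of 12 words, the first chunk ends in the middle of the statement `int sum = a + b;`, so neither chunk contains the whole `add` method. Splitting on words also discards the original line breaks and indentation.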

Hand-rolled Tree-sitter (based on Claude-context[3])

it('should split Java code from external file', async () => {
    const filePath = 'AppService.java';

    if (fs.existsSync(filePath)) {
        const code = fs.readFileSync(filePath, 'utf-8');
        console.log(`Reading Java file from: ${filePath}`);
        console.log(`File size: ${code.length} characters`);

        const chunks = await splitter.split(code, 'java', filePath);

        console.log(`Split into ${chunks.length} chunks`);
        chunks.forEach((chunk, index) => {
            console.log(`>>>>Chunk ${index}: ${chunk.content}\n`);
            console.log(`Metadata:`, chunk.metadata);
            console.log(`Content preview: ${chunk.content.substring(0, 100)}...`);
        });
    } else {
        console.warn(`File not found: ${filePath}`);
    }
});

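The original is written in TypeScript[4]. Its core idea is to run AST analysis with tree-sitter and then split along the AST; it only falls back to LangChainCodeSplitter when AST parsing fails.

I could not find a ready-made open-source solution for this part, so I translated the TypeScript code into a Python version: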

splitter = AstCodeSplitter(chunk_size, chunk_overlap)
chunks = splitter.split(code, "java", file_path)
for i, chunk in enumerate(chunks):
    print(f">>>>Chunk {i}: {chunk.content}\n")
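The "parse into an AST, split along node boundaries, fall back when parsing fails" idea can be sketched without any third-party dependency. The sketch below is not the AstCodeSplitter above and does not use tree-sitter: it swaps in Python's built-in `ast` module, so it only handles Python source, and the function names and the `max_lines` parameter are assumptions made for illustration.

```python
import ast


def split_by_lines(code: str, max_lines: int) -> list[str]:
    """Fallback: cut every `max_lines` lines, ignoring syntax."""
    lines = code.splitlines()
    return ["\n".join(lines[i:i + max_lines]) for i in range(0, len(lines), max_lines)]


def split_by_ast(code: str, max_lines: int = 40) -> list[str]:
    """One chunk per top-level statement; fall back to line splitting on a parse error."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return split_by_lines(code, max_lines)
    lines = code.splitlines()
    chunks = []
    for node in tree.body:
        # Decorators sit above node.lineno, so start the chunk at the earliest one.
        first = min([node.lineno] + [d.lineno for d in getattr(node, "decorator_list", [])])
        chunks.append("\n".join(lines[first - 1:node.end_lineno]))
    return chunks


SOURCE = '''import os

def greet(name):
    return f"hello {name}"

class Counter:
    def __init__(self):
        self.n = 0
'''

chunks = split_by_ast(SOURCE)
for i, chunk in enumerate(chunks):
    print(f">>>>Chunk {i}:\n{chunk}\n")

# Source that does not parse goes through the line-based fallback instead.
fallback_chunks = split_by_ast("def broken(:\n    pass\nx = 1", max_lines=2)
```

Each function or class stays in one piece because chunk boundaries coincide with AST node boundaries. This sketch leaves out two things a real splitter needs: it drops comments that sit between top-level statements, and it never re-splits a node that is larger than the chunk budget.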

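Summary

We ran claude-context[3] and text-splitter[9] on the same Java source file and compared the results.

In the end we chose benbrandt/text-splitter[9], the Rust implementation, which also provides a Python binding library[14].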
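For reference, calling the Python binding might look roughly like the sketch below. It assumes the `semantic-text-splitter`[14] and `tree-sitter-java` packages are installed and that your installed versions expose `CodeSplitter(language, capacity)` and `tree_sitter_java.language()`; check the package documentation for your versions. `split_java`, `split_by_lines`, and the Java snippet are hypothetical names introduced here, and the line-based fallback exists only so the sketch still runs when the packages are missing.

```python
def split_by_lines(code: str, capacity: int) -> list[str]:
    """Hypothetical fallback: pack whole lines into chunks of at most `capacity` characters.

    A single line longer than `capacity` is kept whole rather than cut.
    """
    chunks, current = [], ""
    for line in code.splitlines(keepends=True):
        if current.strip() and len(current) + len(line) > capacity:
            chunks.append(current.rstrip("\n"))
            current = ""
        current += line
    if current.strip():
        chunks.append(current.rstrip("\n"))
    return chunks


def split_java(code: str, capacity: int = 1000) -> list[str]:
    """Split Java source with semantic-text-splitter if available, else by lines."""
    try:
        # Assumed API of the Python binding and the Java grammar package.
        from semantic_text_splitter import CodeSplitter
        import tree_sitter_java
    except ImportError:
        return split_by_lines(code, capacity)
    splitter = CodeSplitter(tree_sitter_java.language(), capacity)
    return splitter.chunks(code)


# Hypothetical input file content.
JAVA = """public class AppService {
    public int add(int a, int b) {
        return a + b;
    }

    public int sub(int a, int b) {
        return a - b;
    }
}
"""

chunks = split_java(JAVA, capacity=80)
for i, chunk in enumerate(chunks):
    print(f">>>>Chunk {i}:\n{chunk}\n")
```

The capacity here is counted in characters. Note that the fallback is a plain line packer, so it will happily cut a method in half; it is not a substitute for the AST-aware splitter.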

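That said, how well the analysis of a given code repo works depends on many factors, such as the quality of the LLM, the quality of the embeddings, and whether the prompts are sensible; the code splitter algorithm is only a relatively small part of the whole.

As models keep iterating, requirements like this have to be revisited regularly, and I will keep updating my notes on the topic.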

References

[1] RAG: https://crossoverjie.top/2025/12/25/AI/deepwiki-rag-principle/
[2] deepwiki-open: https://github.com/AsyncFuncAI/deepwiki-open/
[3] Claude-context: https://github.com/zilliztech/claude-context
[4] TS: https://github.com/zilliztech/claude-context/blob/2efe1e9aaf59f4f8c9aa3635b27326a2ae94fa1b/packages/core/src/splitter/ast-splitter.ts#L44
[5] LangChain CodeTextSplitter: https://reference.langchain.com/python/langchain_text_splitters/?_gl=1*1wsspz*_gcl_au*MTQxMjAyNDczOS4xNzY1NDQ2MTUx*_ga*NDcyOTM2OTM2LjE3NjU0NDYxNTI.*_ga_47WX3HKKY2*czE3NjY0NjkxODUkbzYkZzEkdDE3NjY0NzA0MDkkajYwJGwwJGgw#langchain_text_splitters.RecursiveCharacterTextSplitter.from_language
[6] Related code: https://github.com/zilliztech/claude-context/blob/2efe1e9aaf59f4f8c9aa3635b27326a2ae94fa1b/packages/core/src/splitter/ast-splitter.ts#L109
[7] wangxj03/code-splitter: https://github.com/wangxj03/code-splitter
[8] Python: https://pypi.org/project/code-splitter/
[9] benbrandt/text-splitter: https://github.com/benbrandt/text-splitter
[10] LlamaIndex's CodeSplitter: https://developers.llamaindex.ai/python/framework/module_guides/loading/node_parsers/modules/#codesplitter
[11] tree-sitter: https://tree-sitter.github.io/tree-sitter/
[12] Related code: https://github.com/wangxj03/code-splitter/blob/aa9a37e967c242b481a8a0ad1f663e5113d12a04/src/splitter.rs#L118
[13] Keywords: https://github.com/langchain-ai/langchain/blob/master/libs/text-splitters/langchain_text_splitters/character.py#L172
[14] Python binding library: https://pypi.org/project/semantic-text-splitter/

Reposted from: crossoverJie, "An Analysis of AI-Friendlier Code Splitting Algorithms"