I need a plan for effective code retrieval that extracts the most relevant parts of code file(s) before sending them to an LLM - today we call this Context Management (code makes up most of the context window).
For most reranking tasks, a two-step retrieval pipeline is the usual combo: a bi-encoder (embedding model) for fast first-stage recall, followed by a cross-encoder reranker.
Because the bi-encoder lacks deep cross-attention, it struggles to capture fine-grained semantic matches (for example, whether a variable name in the code corresponds to a specific constraint in the query).
If it is not well suited to reranking, why use it at all? Because it is fast. You can precompute vectors for the millions of code snippets in your corpus, and retrieval then reduces to a matrix multiplication. A cross-encoder, by contrast, must score every (Query, Candidate) pair at query time and cannot be precomputed.
jina-embeddings-v2-base-code is a multilingual embedding model that speaks English and 30 widely used programming languages. Like the other models in the jina-embeddings-v2 series, it supports a sequence length of 8192.
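A minimal sketch of this first stage, assuming the jinaai/jina-embeddings-v2-base-code checkpoint and the `encode()` helper it exposes when loaded with trust_remote_code (per the model card); the corpus and query strings are made up. The point is that the corpus matrix is computed once offline, and retrieval at query time is a single matrix multiplication:

```python
import numpy as np
from transformers import AutoModel

# Load the bi-encoder; jina-embeddings-v2 models expose an `encode` helper
# when loaded with trust_remote_code=True (per the model card).
model = AutoModel.from_pretrained(
    "jinaai/jina-embeddings-v2-base-code", trust_remote_code=True
)

# Offline: embed the whole code corpus once and store the matrix.
code_corpus = [
    "def binary_search(arr, target): ...",
    "class LRUCache:\n    def __init__(self, capacity): ...",
]
corpus_vecs = np.asarray(model.encode(code_corpus))               # (N, d)
corpus_vecs /= np.linalg.norm(corpus_vecs, axis=1, keepdims=True)

# Online: embed only the query; one matrix multiplication gives
# cosine similarities against the precomputed corpus.
query_vec = np.asarray(model.encode(["how do I search a sorted array"]))[0]
query_vec /= np.linalg.norm(query_vec)
scores = corpus_vecs @ query_vec                                  # (N,)
top_k = np.argsort(-scores)[:10]                                  # candidates for the reranker
```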
CodeT5+ is a new family of open code large language models with an encoder-decoder architecture that can flexibly operate in different modes (i.e. encoder-only, decoder-only, and encoder-decoder) to support a wide range of code understanding and generation tasks. It is introduced in the paper:
CodeT5+: Open Code Large Language Models for Code Understanding and Generation by Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi D.Q. Bui, Junnan Li, Steven C.H. Hoi.
Compared to the original CodeT5 family (base: 220M, large: 770M), CodeT5+ is pretrained with a diverse set of pretraining tasks including span denoising, causal language modeling, contrastive learning, and text-code matching to learn rich representations from both unimodal code data and bimodal code-text data. Additionally, it employs a simple yet effective compute-efficient pretraining method to initialize the model components with frozen off-the-shelf LLMs such as CodeGen to efficiently scale up the model (i.e. 2B, 6B, 16B), and adopts a “shallow encoder and deep decoder” architecture. Furthermore, it is instruction-tuned to align with natural language instructions (see InstructCodeT5+ 16B) following Code Alpaca.
Take CodeT5+ as an example: its reranking ability comes mainly from the text-code matching pretraining task. In that mode, the model does not just encode text and code independently; it also fuses the two deeply through the decoder's cross-attention and judges whether they match.
Model choice: use a checkpoint trained bimodally, e.g. Salesforce/codet5p-220m-bimodal or Salesforce/codet5p-770m. Encoder-only models (such as the embedding variants) are typically used for the coarse stage (vector retrieval), while a reranker needs the decoder involved. Input construction: the reranker runs as a cross-encoder, so each (Query, Code_Candidate) pair is concatenated into a single input:
[CLS] Query [SEP] Code
Encoder input: the natural-language query. Decoder input: the candidate code snippet. Getting the matching score: the logic for computing CodeT5+'s matching score is as follows:
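A schematic sketch of that data flow, not the real API of the bimodal checkpoint (its pretrained matching head lives in the checkpoint's custom modeling code). Here the base Salesforce/codet5p-220m T5 checkpoint stands in, and the Linear matching head plus last-token pooling are illustrative assumptions:

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Base CodeT5+ 220M is a T5-style encoder-decoder. The real matching head ships
# with the bimodal checkpoint's custom code; the Linear layer below is only a
# randomly initialized stand-in used to show the shape of the computation.
name = "Salesforce/codet5p-220m"
tokenizer = AutoTokenizer.from_pretrained(name)
model = T5ForConditionalGeneration.from_pretrained(name)
match_head = torch.nn.Linear(model.config.d_model, 1)     # illustrative stand-in

query = "binary search over a sorted list"                 # natural-language query
code = "def bsearch(a, x):\n    lo, hi = 0, len(a) - 1\n    ..."

enc = tokenizer(query, return_tensors="pt")                # encoder sees the query
dec = tokenizer(code, return_tensors="pt")                 # decoder sees the code

with torch.no_grad():
    out = model(
        input_ids=enc.input_ids,
        attention_mask=enc.attention_mask,
        decoder_input_ids=dec.input_ids,
        decoder_attention_mask=dec.attention_mask,
        output_hidden_states=True,
    )
    # The decoder's cross-attention fuses query and code; pool the last decoder
    # hidden state and project it to a single unnormalized matching logit.
    pooled = out.decoder_hidden_states[-1][:, -1, :]       # (1, d_model)
    logit = match_head(pooled).squeeze()
```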
Note that a logit (the raw, unnormalized prediction) is not the same thing as a probability (normalized via sigmoid/softmax) or a cosine similarity. The absolute value of a logit carries no physical meaning (unlike a probability, which expresses confidence). This is the calibration problem: differences between logits do not map linearly to relevance unless the model has been calibrated. Logits work fine for ordering (A > B), but if you need to filter out low-quality results, it is hard to pick a cutoff on logits. And logits are not comparable across models (naturally).
That said, at the reranking stage we only care about the relative order of the candidates (which beats which), and logits give the finest-grained separation, avoiding the squashing that softmax applies at the high end. Used that way, they are perfectly serviceable.
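Concretely, with made-up logits:

```python
import torch

logits = torch.tensor([4.2, 1.3, -0.7, 3.9])          # raw reranker scores (made up)

# Ranking only needs the relative order, so raw logits are enough.
ranking = torch.argsort(logits, descending=True)        # tensor([0, 3, 1, 2])

# For a quality cutoff you would want calibrated scores; sigmoid gives a
# probability-shaped number, but on an uncalibrated model any fixed
# threshold (e.g. 0.5) is fairly arbitrary.
probs = torch.sigmoid(logits)
keep = probs > 0.5
```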
CodeRankLLM is a 7B LLM fine-tuned for listwise code-reranking. When combined with performant code retrievers like CodeRankEmbed, it significantly enhances the quality of retrieved results for various code retrieval tasks.
https://huggingface.co/nomic-ai/CodeRankEmbed
https://gangiswag.github.io/cornstack/
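A rough sketch of the listwise step. The checkpoint identifier, prompt template, and output format below are assumptions for illustration (a generic RankGPT-style numbered-candidate prompt); the actual template CodeRankLLM was trained with is documented on the cornstack page:

```python
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Checkpoint name is an assumption; check the cornstack page for the real one.
name = "nomic-ai/CodeRankLLM"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

query = "how do I parse a TOML config file"
candidates = [
    "def load_toml(path): ...",
    "def parse_json(s): ...",
    "class ConfigLoader: ...",
]

# Generic listwise prompt: show numbered candidates, ask for an ordering.
numbered = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(candidates))
prompt = (
    f"Query: {query}\n\nCandidates:\n{numbered}\n\n"
    "Rank the candidates from most to least relevant. "
    "Answer with the identifiers only, e.g. [2] > [1] > [3].\nRanking:"
)

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32)
text = tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)

# Parse "[2] > [1] > [3]" back into candidate indices.
order = [int(i) - 1 for i in re.findall(r"\[(\d+)\]", text)]
reranked = [candidates[i] for i in order if 0 <= i < len(candidates)]
```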