Papers
Topics
Authors
Recent
Detailed Answer
Quick Answer
Concise responses based on abstracts only
Detailed Answer
Well-researched responses based on abstracts and relevant paper content.
Custom Instructions Pro
Preferences or requirements that you'd like Emergent Mind to consider when generating responses
Gemini 2.5 Flash
Gemini 2.5 Flash 84 tok/s
Gemini 2.5 Pro 48 tok/s Pro
GPT-5 Medium 21 tok/s Pro
GPT-5 High 28 tok/s Pro
GPT-4o 96 tok/s Pro
GPT OSS 120B 462 tok/s Pro
Kimi K2 189 tok/s Pro
2000 character limit reached

Improving the Context Length and Efficiency of Code Retrieval for Tracing Security Vulnerability Fixes (2503.22935v2)

Published 29 Mar 2025 in cs.CR and cs.SE

Abstract: An upstream task for software bill-of-materials (SBOMs) is the accurate localization of the patch that fixes a vulnerability. Nevertheless, existing work reveals a significant gap in the CVEs whose patches exist but are not traceable. Existing works have proposed several approaches to trace/retrieve the patching commit for fixing a CVE. However, they suffer from two major challenges: (1) They cannot effectively handle long diff code of a commit; (2) We are not aware of existing work that scales to the full repository with satisfactory accuracy. Upon identifying this gap, we propose SITPatchTracer, a scalable and effective retrieval system for tracing known vulnerability patches. To handle the context length challenge, SITPatchTracer proposes a novel hierarchical embedding technique which efficiently extends the context coverage to 6x that of existing work while covering all files in the commit. To handle the scalability challenge, SITPatchTracer utilizes a three-phase framework, balancing the effectiveness/efficiency in each phase. The evaluation of SITPatchTracer demonstrates it outperforms existing patch tracing methods (PatchFinder, PatchScout, VFCFinder) by a large margin. Furthermore, SITPatchTracer outperforms VoyageAI, the SOTA commercial code embedding LLM (\$1.8 per 10K commits) on the MRR and Recall@10 by 18\% and 28\% on our two datasets. Using SITPatchTracer, we have successfully traced and merged the patch links for 35 new CVEs in the GitHub Advisory database Our ablation study reveals that hierarchical embedding is a practically effective way of handling long context for patch retrieval.

List To Do Tasks Checklist Streamline Icon: https://streamlinehq.com

Collections

Sign up for free to add this paper to one or more collections.

Summary

We haven't generated a summary for this paper yet.

Ai Generate Text Spark Streamline Icon: https://streamlinehq.com

Paper Prompts

Sign up for free to create and run prompts on this paper using GPT-5.

Dice Question Streamline Icon: https://streamlinehq.com

Follow-up Questions

We haven't generated follow-up questions for this paper yet.

X Twitter Logo Streamline Icon: https://streamlinehq.com