
SecRepoBench: Security Evaluation Framework

Updated 5 January 2026
  • SecRepoBench is a benchmark framework designed to evaluate security-oriented repository tools, focusing on secret detection and secure code generation.
  • It integrates the SecretBench dataset with 97,479 candidate secrets from 818 curated GitHub repositories, spanning 49 languages and diverse file types.
  • It assesses secure code generation through 318 real-world code-completion tasks from 27 open-source C/C++ repositories, reflecting practical vulnerability fixes.

SecRepoBench is a comprehensive benchmark framework and dataset suite designed for the evaluation of security-oriented repository tools, with distinct tracks for secret detection and secure code generation in real-world software repositories. It provides rigorously curated tasks, datasets, and evaluation protocols to assess the capabilities of both static analysis systems and LLMs in practical, highly contextual security scenarios. SecRepoBench draws on (i) the SecretBench dataset for evaluating secret-detection tools and (ii) real-world C/C++ code repositories with known vulnerabilities for evaluating LLM-driven secure code generation, emphasizing the gap between controlled, self-contained benchmarks and the operational complexities of modern software development (Basak et al., 2023, Dilgren et al., 29 Apr 2025).

1. Motivation and Development Context

SecRepoBench was introduced to address critical deficiencies in existing security benchmarks for both secret detection and secure code generation. Prior datasets frequently consisted of synthetic, hand-crafted problems with limited diversity, narrow repository context, and evaluation protocols that did not scale to realistic project size or complexity. For secret detection, the lack of a high-quality public dataset hampered methodological comparison and tool improvement due to the high rate of false positives and pronounced file/language heterogeneity (Basak et al., 2023). For secure code generation, benchmarks such as SecCodePLT and BaxBench targeted self-contained, didactic problems, failing to capture the multifaceted dependencies, cross-file interactions, and nuanced vulnerability profiles of large, real repositories. SecRepoBench aims to (a) provide large-scale, representative, and systematically annotated corpora, and (b) enable rigorous and reproducible evaluation protocols for both tasks (Dilgren et al., 29 Apr 2025).

2. Dataset Structure and Taxonomic Coverage

SecRepoBench for Secret Detection

SecRepoBench leverages the SecretBench dataset, which includes:

  • Candidate and True Secrets: 97,479 candidate secrets (27,336 unique), with 15,084 (4,014 unique) manually verified as true, extracted from 818 non-forked, actively maintained GitHub repositories.
  • Language and File-Type Breadth: 49 programming languages and 311 file types, prominently Shell, JavaScript, Python, Java, and Ruby by repository count; and .js, .nix, .json, .txt, and .xml by secret count.
  • Taxonomy: Eight major secret types: Private Key, API Key and Secret, Authentication Key and Token, Database and Server URL, Generic Secret, Password, Username, and Other (artifacts or false-positive patterns).
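To make the candidate-extraction idea concrete, here is a minimal sketch of regex-based secret scanning over file text. The patterns below are illustrative stand-ins; the actual benchmark draws on 761 regexes from TruffleHog and prior literature, not these:

```python
import re

# Illustrative patterns only -- NOT the benchmark's actual 761 regexes.
PATTERNS = {
    "API Key and Secret": re.compile(r"AKIA[0-9A-Z]{16}"),  # AWS-style key ID
    "Private Key": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    "Generic Secret": re.compile(r"(?i)secret\s*[:=]\s*['\"][^'\"]{8,}['\"]"),
}

def scan_text(text):
    """Return (secret_type, matched_string) pairs for each candidate found."""
    hits = []
    for secret_type, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((secret_type, m.group(0)))
    return hits

sample = 'aws_key = "AKIAABCDEFGHIJKLMNOP"\nsecret = "hunter2hunter2"'
print(scan_text(sample))
```

Each hit is only a *candidate*; as the dataset statistics above show, the large majority of candidates turn out to be false positives on manual review.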

SecRepoBench for Secure Code Generation

  • Task Corpus: 318 code-completion tasks from 27 open-source C/C++ repositories (e.g., ImageMagick, FFmpeg, wireshark) covering 15 CWEs, dominated by heap- and stack-based buffer overflows (CWE-122, CWE-121).
  • Realistic Vulnerability Edits: Each task is anchored to an actual vulnerability fix (sourced from ARVO/OSS-Fuzz), demanding precise edits within true repository context, including cross-file dependencies, build scripts, and in-situ test suites (Dilgren et al., 29 Apr 2025).
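A single task in the secure-code-generation track bundles the repository context, the masked region, the CWE label, and the validation artifacts. The schema below is a hypothetical illustration of that bundle (field names and the example values are assumptions, not the benchmark's on-disk format):

```python
from dataclasses import dataclass, field

# Hypothetical schema illustrating what one code-completion task bundles;
# field names and example values are assumptions, not the actual format.
@dataclass
class SecureCompletionTask:
    repo: str                  # source repository, e.g. "FFmpeg"
    cwe: str                   # vulnerability class, e.g. "CWE-122"
    masked_file: str           # file containing the masked AST subtree
    masked_region: tuple       # (start_line, end_line) the model must fill in
    fuzz_input: bytes          # OSS-Fuzz crashing input for security validation
    unit_tests: list = field(default_factory=list)  # developer-written tests

task = SecureCompletionTask(
    repo="FFmpeg",
    cwe="CWE-122",                        # heap-based buffer overflow
    masked_file="libavcodec/decoder.c",   # illustrative path
    masked_region=(120, 138),
    fuzz_input=b"\x00" * 64,
)
print(task.repo, task.cwe)
```

A completion is judged against both the unit tests (correctness) and the crashing fuzz input (security), as detailed in Section 4.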

3. Data Collection, Annotation, and Processing Methodologies

Secret Detection Track

  • Pattern Definition: 761 regexes from TruffleHog and prior literature.
  • Repository Selection: Multi-stage filtering and set cover to maximize pattern coverage across repositories, yielding 818 curated targets.
  • Automated Mining: All branches and commits scanned, extracting candidate secrets using TruffleHog and Gitleaks.
  • Manual Labeling: Dual reviewer adjudication with metadata context; Cohen's κ = 0.86 denotes near-perfect agreement. A developer survey for external validation produced 78.6% agreement on labels for a random secret subsample.
  • Taxonomic Assignment: Grouping guided by string/regex features, entropy, keyword context, and string characteristics.
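The inter-rater agreement statistic quoted above can be computed from the two reviewers' label sequences. A minimal sketch of Cohen's κ (the example labels are arbitrary, not benchmark data):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items where the raters coincide.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each rater's marginal frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in set(labels_a) | set(labels_b)) / n**2
    return (p_o - p_e) / (1 - p_e)

# Toy example: two raters on five candidate secrets (1 = true secret).
print(cohens_kappa([1, 1, 1, 0, 0], [1, 1, 0, 0, 0]))  # ≈ 0.615
```

Values above 0.81 are conventionally read as "almost perfect" agreement, which is why κ = 0.86 is reported as near-perfect.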

Secure Code Generation Track

  • Patch Extraction: Patch locator filters ARVO data for function-scoped, vulnerability-fixing commits with reproducible OSS-Fuzz test cases.
  • AST-Based Masking: Tree-sitter identifies and masks the minimal AST subtree covering the fix, augmented for minimal meaningful context.
  • Anti-Memorization Mutation: Variable name randomization in and around the masked region rules out LLM exact-match memorization.
  • Test Suite Construction: Systematically harvests developer-written unit tests and pairs with OSS-Fuzz crashing inputs to enable correctness and security validation at scale (Dilgren et al., 29 Apr 2025).
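The anti-memorization step can be sketched as consistent identifier randomization. The token-level version below is a simplification (the actual pipeline works on the AST via Tree-sitter, which avoids renaming inside strings and comments); keyword list and naming scheme are illustrative:

```python
import random
import re

# Simplified token-level sketch of anti-memorization renaming; the real
# pipeline reportedly mutates identifiers via the AST.
C_KEYWORDS = {"int", "char", "if", "else", "return", "for", "while", "sizeof", "void"}

def randomize_identifiers(code, rng):
    """Consistently rename each non-keyword identifier to a fresh random name."""
    rename = {}
    def repl(m):
        name = m.group(0)
        if name in C_KEYWORDS or name.isupper():  # keep keywords and macros
            return name
        if name not in rename:
            # Index prefix guarantees uniqueness even if random parts collide.
            rename[name] = "v%d_%04d" % (len(rename), rng.randrange(10000))
        return rename[name]
    return re.sub(r"\b[A-Za-z_][A-Za-z0-9_]*\b", repl, code)

snippet = "int copy_len = len; if (copy_len > cap) copy_len = cap;"
print(randomize_identifiers(snippet, random.Random(0)))
```

Because every occurrence of an identifier maps to the same fresh name, the program's semantics are preserved while exact-match recall of memorized training code is ruled out.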

4. Evaluation Protocols, Metrics, and Experimental Results

Secret Detection

  • Data Splitting: Stratified by secret class. Suggested: 20% repository-independent hold-out or k-fold CV without cross-contamination.
  • Metrics: Standard classification metrics—
    • True Positive Rate (TPR/Recall): TPR = TP / (TP + FN)
    • False Positive Rate: FPR = FP / (FP + TN)
    • Precision: P = TP / (TP + FP)
    • F1-Score: F1 = 2 × (P × R) / (P + R)
  • Reporting: Per-category and macro-aggregated metrics, ROC/PR curves, confusion matrices at relevant thresholds.
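The four metrics above follow directly from confusion-matrix counts; a small helper makes the definitions executable (the example counts are invented for illustration):

```python
def detection_metrics(tp, fp, tn, fn):
    """Standard classification metrics from confusion-matrix counts."""
    tpr = tp / (tp + fn)        # recall / true positive rate
    fpr = fp / (fp + tn)        # false positive rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * tpr / (precision + tpr)
    return {"TPR": tpr, "FPR": fpr, "Precision": precision, "F1": f1}

# Invented example: detector flags 120 candidates, 90 of them true secrets,
# misses 10 true secrets, and correctly ignores 870 non-secrets.
print(detection_metrics(tp=90, fp=30, tn=870, fn=10))
```

Reporting these per secret category (alongside macro averages) matters because the class distribution is highly skewed, as noted in Section 5.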

Secure Code Generation

  • Core Metrics:
    • Compile Success Rate (CSR): Fraction of generations compiling in repo context.
    • pass@1: Fraction passing all developer-written unit tests.
    • secure-pass@1: Fraction passing both unit and OSS-Fuzz security tests (Dilgren et al., 29 Apr 2025).
  • Experimental Highlights:
    • 19 LLMs (open and closed weights) evaluated with diverse prompting and retrieval strategies.
    • Substantial performance drop observed versus self-contained benchmarks (e.g., Claude 3.7 Sonnet with 58.6% secure-pass@1 on SecCodePLT vs. 28.0% on SecRepoBench).
    • Prompt engineering yielded negligible gains for security on repository tasks (≤1.6 pp).
    • Contemporary agentic patch repair approaches showed no success fixing insecure completions on SecRepoBench, in contrast to higher rates on single-file benchmarks.
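The three core rates form a strict hierarchy (secure-pass@1 ≤ pass@1 ≤ CSR), since each stage requires passing the previous one. A sketch of how they can be computed from per-task results (the record layout is an assumption for illustration):

```python
def benchmark_rates(results):
    """Compute CSR, pass@1, and secure-pass@1 from per-task records.

    Each record holds booleans 'compiled', 'unit_pass', 'fuzz_pass'
    (field names are illustrative); one generation per task, hence @1.
    """
    n = len(results)
    csr = sum(r["compiled"] for r in results) / n
    pass1 = sum(r["compiled"] and r["unit_pass"] for r in results) / n
    secure1 = sum(r["compiled"] and r["unit_pass"] and r["fuzz_pass"]
                  for r in results) / n
    return csr, pass1, secure1

demo = [
    {"compiled": True,  "unit_pass": True,  "fuzz_pass": True},
    {"compiled": True,  "unit_pass": True,  "fuzz_pass": False},
    {"compiled": True,  "unit_pass": False, "fuzz_pass": False},
    {"compiled": False, "unit_pass": False, "fuzz_pass": False},
]
print(benchmark_rates(demo))  # → (0.75, 0.5, 0.25)
```

The gap between pass@1 and secure-pass@1 isolates completions that are functionally correct yet still vulnerable, which is the failure mode the benchmark is designed to surface.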

5. Usage Guidelines, Best Practices, and Analysis

  • Integration: Direct access via Google BigQuery and Cloud Storage; recommended use of SQL for precise label and metadata extraction. Comparative evaluation requires tool outputs to be matched to ground-truth annotation.
  • Imbalance Handling: "Other" (false positive–rich) class dominates; effective analysis must include minority class oversampling, careful per-category reporting, and weighted loss strategies.
  • False Positive Filtering: Incorporate context-based features—entropy thresholds, in-URL checks, and template string exclusion via post-processing heuristics.
  • Analysis of LLM Errors: Key error classes include hallucinated struct members, undeclared identifiers, incorrect or incomplete conditional checks, and misallocated memory regions. Compilation failure rates hover around 30–40% per model. Among secure completions, 33–57% are functionally incorrect, indicating the twin challenge of security and correctness.
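The false-positive filtering heuristics listed above (entropy thresholds, in-URL checks, template-string exclusion) can be sketched as a post-processing predicate. Thresholds and marker strings here are illustrative assumptions, not the benchmark's actual values:

```python
import math

def shannon_entropy(s):
    """Shannon entropy (bits/char) over the string's symbol distribution."""
    if not s:
        return 0.0
    probs = [s.count(c) / len(s) for c in set(s)]
    return -sum(p * math.log2(p) for p in probs)

# Illustrative placeholder markers commonly seen in template strings.
TEMPLATE_MARKERS = ("your_", "example", "xxxx", "changeme", "<", "${")

def looks_like_false_positive(candidate, context_line):
    """Heuristic post-filter: low entropy, in-URL position, or template text."""
    if shannon_entropy(candidate) < 3.0:  # threshold chosen for illustration
        return True
    url_part = context_line.split("://", 1)[1] if "://" in context_line else ""
    if candidate in url_part:             # candidate sits inside a URL
        return True
    return any(m in candidate.lower() for m in TEMPLATE_MARKERS)

print(looks_like_false_positive("your_api_key_here", "key = your_api_key_here"))
```

Such filters trade recall for precision, so their effect should be reported per category rather than only in aggregate.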

6. Limitations and Future Directions

  • Coverage Limitations: Secret detection: GitHub exclusivity, regex-driven pattern biases, incomplete language instrumentation, and static-only context. Secure coding: C/C++ focus, necessity for improved retrieval of relevant code context, and limited efficacy for both prompt engineering and agentic techniques on challenging repository-level tasks.
  • Annotation Granularity: Eight current secret categories may conflate distinct risks; expansion to more granular sub-types, especially for symmetric/asymmetric keys or platform-specific tokens, is an area for development.
  • Dynamic Analysis Gaps: Lack of runtime validation for secrets or vulnerability patches; future augmentation with dynamic instrumentation (e.g., log-based or honeypot data) is suggested.
  • Pretraining and Instruction Tuning: Scaling high-context secure coding fine-tuning remains challenging due to memory constraints in LLMs.
  • Expansion to Additional Languages and Hybrid Stacks: While C/C++ tasks set an initial scope for secure code generation, extension to other languages such as Rust and to mixed-language environments is a future research direction (Basak et al., 2023, Dilgren et al., 29 Apr 2025).

7. Impact and Research Significance

SecRepoBench standardizes the evaluation of secret detection and secure code generation tools under realistic conditions, reflecting the operational demands of modern software workflows. It exposes both strengths and critical weaknesses of current systems, particularly the large performance gap between self-contained programmatic benchmarks and full-repository, real-world tasks. The poor responsiveness of LLMs and existing repair agents to prompt engineering at the repository level underscores the need for integrative, context-aware approaches in future research. SecRepoBench is positioned as a catalyst for new methods in retrieval-augmented, dynamic, and multi-objective security-aware code generation and detection (Basak et al., 2023, Dilgren et al., 29 Apr 2025).
