IRIS: LLM-Assisted Static Analysis for Detecting Security Vulnerabilities

Published 27 May 2024 in cs.CR, cs.PL, and cs.SE | (2405.17238v3)

Abstract: Software is prone to security vulnerabilities. Program analysis tools to detect them have limited effectiveness in practice due to their reliance on human-labeled specifications. Large language models (or LLMs) have shown impressive code generation capabilities but they cannot do complex reasoning over code to detect such vulnerabilities, especially since this task requires whole-repository analysis. We propose IRIS, a neuro-symbolic approach that systematically combines LLMs with static analysis to perform whole-repository reasoning for security vulnerability detection. Specifically, IRIS leverages LLMs to infer taint specifications and perform contextual analysis, alleviating needs for human specifications and inspection. For evaluation, we curate a new dataset, CWE-Bench-Java, comprising 120 manually validated security vulnerabilities in real-world Java projects. A state-of-the-art static analysis tool CodeQL detects only 27 of these vulnerabilities whereas IRIS with GPT-4 detects 55 (+28) and improves upon CodeQL's average false discovery rate by 5 percentage points. Furthermore, IRIS identifies 4 previously unknown vulnerabilities which cannot be found by existing tools. IRIS is available publicly at https://github.com/iris-sast/iris.

Summary

The paper presents a neuro-symbolic approach that integrates LLMs with static analysis to improve vulnerability detection in Java projects.
It uses LLM-based taint specification inference and contextual alert filtering, reaching over 70% precision on inferred source and sink specifications and cutting false alarms by more than 80% in the best cases.
On CWE-Bench-Java, IRIS with GPT-4 detects 55 vulnerabilities compared to 27 by CodeQL, and it also surfaces 4 previously unknown vulnerabilities.

IRIS: LLM-Assisted Static Analysis for Detecting Security Vulnerabilities

The paper "IRIS: LLM-Assisted Static Analysis for Detecting Security Vulnerabilities" introduces a novel approach that integrates LLMs with static analysis techniques to improve the efficacy of vulnerability detection in Java projects. It highlights the challenges faced by traditional static analysis tools in detecting security vulnerabilities and proposes IRIS as a solution that leverages the strengths of LLMs for whole-repository reasoning.

Introduction to Vulnerability Detection Challenges

Traditional static analysis tools such as GitHub CodeQL rely heavily on manually crafted taint specifications for API sources, sinks, and sanitizers. Writing these specifications is labor-intensive and the results are often incomplete, which leads to false negatives. These tools also suffer from low precision: imprecise reasoning and over-approximation generate many false positives that burden developers during triaging. Despite advances in program analysis techniques, these challenges persist and hinder effective vulnerability detection in complex projects.
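
To make the role of taint specifications concrete, here is a minimal sketch of taint analysis as reachability over a dataflow graph. It is not CodeQL's algorithm; the graph, the helper names `resolveUpload` and `extractEntry`, and the function `find_tainted_paths` are all hypothetical. It only illustrates how a source missing from the specification turns into a false negative.

```python
from collections import deque


def find_tainted_paths(edges, sources, sinks, sanitizers):
    """Return every source-to-sink path that does not pass through a sanitizer."""
    paths = []
    for src in sorted(sources):
        queue = deque([[src]])
        while queue:
            path = queue.popleft()
            node = path[-1]
            if node in sinks and len(path) > 1:
                paths.append(path)  # tainted data reached a sink
                continue
            for nxt in edges.get(node, []):
                # Sanitizers stop the flow; skipping visited nodes avoids cycles.
                if nxt in sanitizers or nxt in path:
                    continue
                queue.append(path + [nxt])
    return paths


# Hypothetical dataflow graph: each key passes data to the listed nodes.
EDGES = {
    "HttpServletRequest.getParameter": ["resolveUpload"],
    "resolveUpload": ["File.<init>"],
    "ZipEntry.getName": ["extractEntry"],
    "extractEntry": ["FileOutputStream.<init>"],
}
SINKS = {"File.<init>", "FileOutputStream.<init>"}

# A hand-written spec that forgot ZipEntry.getName misses the second flow.
INCOMPLETE_SOURCES = {"HttpServletRequest.getParameter"}
COMPLETE_SOURCES = INCOMPLETE_SOURCES | {"ZipEntry.getName"}

print(len(find_tainted_paths(EDGES, INCOMPLETE_SOURCES, SINKS, set())))  # 1
print(len(find_tainted_paths(EDGES, COMPLETE_SOURCES, SINKS, set())))    # 2
```

With the incomplete source set the zip-extraction flow is never reported, which is the kind of gap that inferred specifications are meant to close.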

Overview of the IRIS Framework

IRIS presents a neuro-symbolic approach combining the reasoning capabilities of LLMs with static analysis to detect security vulnerabilities at the project level.

Figure 1: An overview of IRIS: our neuro-symbolic approach for vulnerability detection. Given a whole Java repository, IRIS checks whether it contains a certain type of vulnerability (CWE).

Candidate API Extraction: IRIS begins by using static analysis to extract candidate APIs from a given Java project, focusing on potential sources, sinks, and sanitizers.
LLM-Based Specification Inference: LLMs are utilized to automatically infer taint specifications for these APIs. This involves labeling APIs as sources or sinks relevant to a specified vulnerability class.
Static Analysis Integration: The inferred specifications are integrated with a static analysis engine, such as CodeQL, to perform taint analysis and identify vulnerable paths in the code.
Contextual Alert Filtering: IRIS implements a contextual analysis technique that uses LLMs to filter out false positives from detected paths, significantly reducing the triaging burden.
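
The four stages above can be sketched as a small pipeline. This is an illustrative outline only, not the IRIS implementation: the LLM and the static analysis engine are replaced by plain Python callables, and every name (`Alert`, `infer_specs`, `run_pipeline`, the `fake_*` stand-ins, and the sample APIs) is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Alert:
    source: str
    sink: str
    path: Tuple[str, ...]


def infer_specs(candidates: Iterable[str], cwe: str,
                label_api: Callable[[str, str], str]) -> Dict[str, Set[str]]:
    """Stage 2: ask a labeler (an LLM in IRIS) to tag each candidate API."""
    specs: Dict[str, Set[str]] = {"source": set(), "sink": set()}
    for api in candidates:
        label = label_api(api, cwe)
        if label in specs:  # anything else (e.g. "none") is dropped
            specs[label].add(api)
    return specs


def run_pipeline(candidates, cwe, label_api, taint_analysis, is_true_positive) -> List[Alert]:
    specs = infer_specs(candidates, cwe, label_api)              # stage 2
    alerts = taint_analysis(specs["source"], specs["sink"])      # stage 3
    return [a for a in alerts if is_true_positive(a, cwe)]       # stage 4


# --- Hypothetical stand-ins so the sketch runs without an LLM or CodeQL ---
LABELS = {"ZipEntry.getName": "source", "System.getenv": "source", "File.<init>": "sink"}
FLOWS = [
    Alert("ZipEntry.getName", "File.<init>", ("ZipEntry.getName", "extract", "File.<init>")),
    Alert("System.getenv", "File.<init>", ("System.getenv", "configDir", "File.<init>")),
]


def fake_label_api(api: str, cwe: str) -> str:
    return LABELS.get(api, "none")


def fake_taint_analysis(sources: Set[str], sinks: Set[str]) -> List[Alert]:
    return [a for a in FLOWS if a.source in sources and a.sink in sinks]


def fake_filter(alert: Alert, cwe: str) -> bool:
    # Pretend the contextual check judged environment variables as trusted here.
    return alert.source != "System.getenv"


# Stage 1 (candidate extraction) is represented by this fixed list.
candidates = ["ZipEntry.getName", "System.getenv", "File.<init>", "String.trim"]
kept = run_pipeline(candidates, "CWE-22", fake_label_api, fake_taint_analysis, fake_filter)
print([a.source for a in kept])  # ['ZipEntry.getName']
```

Passing the labeler, analysis engine, and filter as callables keeps the neural and symbolic parts separate, so each can be swapped or tested on its own.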

Practical Application and Dataset

IRIS was evaluated using a newly curated dataset, CWE-Bench-Java, consisting of 120 manually validated vulnerabilities in real-world Java projects across four common vulnerability classes (CWEs). Because the projects are real and often large, the dataset serves as a demanding benchmark for vulnerability detection tools.

Figure 2: Steps for curating CWE-Bench-Java, and dataset statistics.

Performance Evaluation

The performance of IRIS was benchmarked against state-of-the-art static analysis tools and a variety of LLMs including GPT-4 and DeepSeekCoder.

Figure 3: Estimated precision of inferred source and sink specs by selected LLMs.

Vulnerability Detection: IRIS outperforms top static analysis tools by detecting 55 vulnerabilities with GPT-4, compared to 27 by CodeQL (+28), showcasing substantial improvements across all CWEs.
Source/Sink Precision: The LLM-based inference achieves precision levels of over 70% for source and sink specifications, mitigating the limitations of traditional taint inference methods.
False Positive Reduction: Contextual filtering using LLMs reduces false alarms by over 80% in some cases, easing the developer's workload during vulnerability triaging.
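
The metrics behind these results are simple ratios, sketched below. The detection counts (55 and 27 out of 120) come from the abstract; the alert counts passed to `false_discovery_rate` are hypothetical and serve only to show the formula. The function names are chosen for this example.

```python
def detection_rate(detected: int, total: int) -> float:
    """Fraction of benchmark vulnerabilities that a tool detects."""
    return detected / total


def precision(true_pos: int, false_pos: int) -> float:
    """Fraction of reported items that are correct."""
    return true_pos / (true_pos + false_pos)


def false_discovery_rate(true_pos: int, false_pos: int) -> float:
    """Fraction of reported alerts that are false alarms (1 - precision)."""
    return false_pos / (true_pos + false_pos)


TOTAL = 120                     # vulnerabilities in CWE-Bench-Java
IRIS_GPT4, CODEQL = 55, 27      # detections reported in the abstract

print(IRIS_GPT4 - CODEQL)                           # 28 additional detections
print(round(detection_rate(IRIS_GPT4, TOTAL), 3))   # 0.458
print(round(detection_rate(CODEQL, TOTAL), 3))      # 0.225

# Hypothetical alert counts, only to illustrate the formula:
print(false_discovery_rate(true_pos=3, false_pos=7))  # 0.7
```

A lower false discovery rate means fewer of the reported alerts are noise, which is what the contextual filtering step targets.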

Conclusion

IRIS demonstrates the substantial benefits of integrating LLMs with static analysis to enhance vulnerability detection in software engineering. By effectively inferring taint specifications on the fly and minimizing false positives through context-based reasoning, IRIS addresses key limitations of existing tools, paving the way for more accurate and efficient security analyses. Future research may explore expanding IRIS to other programming languages and deeper integration of dynamic analysis components to handle more complex vulnerability patterns.

Practical Applications

Immediate Applications

The following applications can be deployed today using the IRIS framework (LLM-assisted static analysis), especially for Java repositories and the four CWEs studied (CWE-22, CWE-78, CWE-79, CWE-94).

DevSecOps code scanning in CI/CD pipelines (Software)
- Use IRIS to auto-infer repository- and CWE-specific taint source/sink specs via LLMs, then run CodeQL queries to detect unsanitized dataflows.
- Integrate as a GitHub Action or Jenkins step to scan each PR/commit; publish alerts and LLM-generated explanations to developers.
- Potential workflow: “IRIS Spec Miner” step → CodeQL run → “IRIS Triage” step to filter false positives (80% reduction shown in best case).
- Assumptions/dependencies: the repository must build successfully for CodeQL; sending code to closed LLMs may require privacy controls; token cost and latency must be budgeted.
SAST vendor augmentation (Software security products)
- Embed IRIS’s LLM-driven spec mining to keep third‑party API source/sink specifications current without manual curation.
- Ship automatically generated CodeQL packs or equivalent taint rules per CWE that update as libraries evolve (e.g., Maven ecosystem).
- Assumptions/dependencies: ongoing prompt/LLM maintenance; evaluate per-model performance and stability; legal/licensing of training/inference on customer code.
Targeted CWE sweeps after advisories (Software, Security Operations)
- Rapidly scan internal Java codebases for specific CWE patterns when a CVE drops (e.g., path traversal in libraries handling file paths).
- Use IRIS contextual analysis to prioritize alerts and suppress benign flows.
- Assumptions/dependencies: focused on the four CWEs in the paper; static analysis cannot capture vulnerabilities that rely on runtime-only behaviors or external side effects.
Enterprise secure code review assist (Software)
- Augment code review with IRIS alerts and LLM-generated, path-aware justifications; reviewers can request context explanations for flagged flows.
- Reduce time spent triaging false alarms; route high-confidence alerts to security owners.
- Assumptions/dependencies: developers must trust LLM verdicts; provide audit trails for decisions.
Open-source project maintainers’ guardrails (Software)
- Run IRIS periodically on large Java repositories (hundreds of thousands to millions of lines of code) to catch CWE-specific flows missed by stock CodeQL rules.
- Use pre-merge scanning to catch regressions; publish results with reasoning to issues/PRs.
- Assumptions/dependencies: compute/time budget for scanning large repos; repository must compile; community acceptance of LLM-driven triage.
Software supply chain risk checks (Finance, Healthcare, Energy, Government)
- Scan internally maintained Java services and third-party components for CWE-22/78/79/94, generating evidence for audits and due diligence.
- Produce actionable reports for compliance frameworks (e.g., OWASP Top 10 alignment).
- Assumptions/dependencies: mapping findings to compliance controls requires policy translation; static analysis gaps must be documented.
Security consulting and M&A code diligence (Industry)
- Use IRIS for time-bounded vulnerability discovery across large Java portfolios; deliver reduced-noise findings with contextual evidence.
- Assumptions/dependencies: scope limited to Java and supported CWEs; privacy provisions for scanning client code.
Academic uses of CWE-Bench-Java (Academia)
- Benchmark LLM-assisted static analysis; reproduce IRIS results; teach secure coding and whole‑repository reasoning.
- Build course assignments around the curated dataset with 120 validated vulnerabilities and build scripts.
- Assumptions/dependencies: dataset availability; students need access to CodeQL and at least one LLM.
Developer “daily life” safeguards (Software practice)
- Local pre-commit scan for Java projects using on‑prem/open-source LLMs (e.g., DeepSeekCoder 7B) to avoid sending code offsite.
- Lightweight triage for modified files only; post-commit job for full scans on larger repos.
- Assumptions/dependencies: memory/compute for local LLMs; models vary in detection quality (GPT‑4 highest in paper).
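
Several of the workflows above (PR scanning, pre-commit triage of modified files only) reduce to filtering static analysis results by file. The sketch below assumes the analyzer emits SARIF 2.1.0, which CodeQL can produce, and keeps only alerts located in changed files. The function name, the sample file paths, and the in-memory report are assumptions for illustration; a real pipeline would load the report from disk and obtain the changed files from version control.

```python
from typing import List, Optional, Set, Tuple


def alerts_in_modified_files(sarif: dict, modified: Set[str]) -> List[Tuple[Optional[str], str, Optional[int]]]:
    """Return (rule id, file, start line) for results located in modified files."""
    hits = []
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            for loc in result.get("locations", []):
                phys = loc.get("physicalLocation", {})
                uri = phys.get("artifactLocation", {}).get("uri")
                if uri in modified:
                    line = phys.get("region", {}).get("startLine")
                    hits.append((result.get("ruleId"), uri, line))
                    break  # report each result once
    return hits


# Hypothetical, hand-written SARIF fragment for demonstration.
SAMPLE_SARIF = {
    "version": "2.1.0",
    "runs": [{
        "results": [
            {"ruleId": "java/path-injection",
             "message": {"text": "Path depends on user input."},
             "locations": [{"physicalLocation": {
                 "artifactLocation": {"uri": "src/main/java/app/Upload.java"},
                 "region": {"startLine": 42}}}]},
            {"ruleId": "java/xss",
             "message": {"text": "Unescaped output."},
             "locations": [{"physicalLocation": {
                 "artifactLocation": {"uri": "src/main/java/app/View.java"},
                 "region": {"startLine": 10}}}]},
        ]
    }],
}

changed = {"src/main/java/app/Upload.java"}
print(alerts_in_modified_files(SAMPLE_SARIF, changed))
```

Restricting triage to changed files keeps the number of LLM calls per commit small; a scheduled full scan can then cover flows that cross unmodified files.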

Long-Term Applications

The following applications require further research, engineering, scaling, or standardization before widespread deployment.

Multi-language and cross-service support (Software, Cloud-native)
- Extend IRIS to C/C++, Python, JavaScript/TypeScript, and Android; track cross-language flows in polyglot repos and microservices.
- Build language-agnostic spec mining with framework-aware semantics (e.g., Spring, Express).
- Dependencies: robust CodeQL or equivalent for each language; cross-language dataflow models; large-context prompts for multi-service reasoning.
Automated sanitizer inference and flow correctness (Software security research)
- Teach LLMs to identify sanitizer functions and conditions per CWE, reducing both false positives and false negatives.
- Combine IRIS with symbolic/dynamic analysis to validate flow feasibility and side effects (e.g., OS command gadgets).
- Dependencies: precise definitions of sanitization per API; hybrid analysis orchestration; higher compute cost.
AI-assisted remediation (Software)
- Move from detection to fix suggestions: propose and validate code changes (escaping, input validation, safer APIs) with tests.
- “Detect → Explain → Patch → Verify” workflow integrated into IDEs and CI.
- Dependencies: reliable patch synthesis and safety checks; developer-in-the-loop acceptance; regression testing.
Enterprise privacy and on-prem model deployment (Policy, Governance, Software)
- Train/fine-tune smaller, specialized security LLMs on CWE-Bench-Java and internal corpora; host models behind firewalls.
- Provide provable data handling for regulated sectors (HIPAA, PCI DSS, FISMA).
- Dependencies: high-quality fine-tuning data; model evaluation/monitoring; compliance documentation.
Continuous, registry-backed spec mining services (Software supply chain)
- Operate a service that mines taint specs from package ecosystems (Maven, NPM, PyPI) and publishes signed spec updates.
- Integrate with SBOM pipelines to annotate dependencies with up-to-date source/sink roles per CWE.
- Dependencies: ecosystem cooperation; version tracking and semantic diffs; trust and signing infrastructure.
Standards and policy adoption (Policy, Compliance)
- Incorporate “AI-augmented static analysis” into secure development baselines, procurement requirements, and audit checklists (e.g., NIST/OWASP guidance).
- Standardize reporting formats for LLM‑assisted findings and evidentiary reasoning.
- Dependencies: consensus on metrics and disclosure formats; validation suites; regulator buy-in.
IDE-centric interactive vulnerability exploration (Software tooling)
- Build visualization and chat-based assistants around IRIS: navigate paths, request targeted context, query “what if” scenarios.
- Blend path graph views with natural language explanations and code snippets.
- Dependencies: UX design, latency management; large-context handling; developer adoption.
Sector-specific secure SDLC gates (Finance, Healthcare, Energy)
- Tailor IRIS policies for regulated domains (e.g., stricter gating on CWE-78 and CWE-22 in data-processing pipelines).
- Automate risk scoring and escalation paths tied to regulatory controls.
- Dependencies: domain-specific rulepacks; mapping to control catalogs; integration into governance tooling.
Advanced benchmarking and model distillation (Academia, Software)
- Use CWE-Bench-Java to build new benchmarks for whole-repository reasoning; distill GPT‑4 performance into smaller models with reproducible prompts.
- Study generalization across CWEs and projects; develop reliability metrics.
- Dependencies: sustained access to strong teacher models; careful prompt and dataset curation; standardized evaluation.

Global Assumptions and Dependencies That Affect Feasibility

Buildability: Projects must compile to extract dataflow graphs for static analysis.
Model quality and variability: Detection rates depend on the chosen LLM; GPT‑4 performed best in the paper, while open models varied.
Privacy and IP: Using closed/cloud LLMs requires guardrails; on‑prem inference mitigates but needs hardware.
Cost and latency: LLM calls introduce runtime and financial costs; batching and prompt optimization are needed.
Static analysis limitations: Some vulnerabilities (e.g., OS command injection via gadget chains or external side effects) may evade static methods; hybrid approaches may be necessary.
Scope of CWEs and specs: The paper focused on four CWEs and did not infer sanitizers; annotation-based sources (e.g., Java annotations) may need special handling.
Documentation dependency: Spec inference benefits from README/JavaDoc quality; sparse docs can degrade labeling accuracy.
Prompt maintenance: Templates and context-selection heuristics require ongoing tuning to avoid drift and hallucinations.
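
As a rough illustration of the cost and latency point, the sketch below batches candidate APIs into prompts and estimates the number of LLM calls and tokens. Every number (batch size, tokens per API, per-call prompt overhead, price) is a hypothetical placeholder, not a figure from the paper, and the function names are invented for this example.

```python
import math
from typing import List, Sequence, Tuple


def batch(items: Sequence, size: int) -> List[list]:
    """Split items into consecutive batches of at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def estimate_cost(num_apis: int, batch_size: int, tokens_per_api: int,
                  overhead_tokens: int, price_per_1k_tokens: float) -> Tuple[int, int, float]:
    """Return (LLM calls, total tokens, estimated price) for labeling all APIs."""
    calls = math.ceil(num_apis / batch_size)
    # Each call pays the fixed prompt overhead once, plus tokens for its APIs.
    tokens = calls * overhead_tokens + num_apis * tokens_per_api
    return calls, tokens, tokens / 1000 * price_per_1k_tokens


print(batch(["a", "b", "c", "d", "e"], 2))  # [['a', 'b'], ['c', 'd'], ['e']]

# Hypothetical workload: 1000 candidate APIs, 20 per prompt.
calls, tokens, price = estimate_cost(1000, 20, 30, 200, 0.01)
print(calls, tokens)  # 50 40000
```

Larger batches amortize the fixed prompt overhead over more APIs, at the risk of longer prompts and less reliable labels, so the batch size is a tuning parameter rather than a fixed choice.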
