deepSURF: Automated Memory Safety in Rust

Updated 30 June 2025
  • deepSURF is an advanced tool that automatically detects memory safety flaws in Rust libraries by analyzing unsafe code blocks.
  • It synthesizes custom fuzzing harnesses and dynamic API call sequences to reveal both known and previously unknown vulnerabilities.
  • Its integrated approach achieves high coverage of unsafe code paths, outperforming legacy tools with 87.3% coverage of unsafe-reaching APIs (URAPIs).

deepSURF refers to an advanced tool and methodology for the automated detection of memory safety vulnerabilities in Rust libraries, with particular emphasis on code that employs Rust’s unsafe primitives. Although the Rust programming language enforces memory safety by default, the allowance for unsafe code blocks can reintroduce exploitable software bugs common in systems programming. deepSURF integrates static program analysis with LLM-augmented fuzzing harness generation, enabling comprehensive detection of vulnerabilities that are often inaccessible to legacy tools due to the complexity of Rust’s type system and the limitations of automated harness synthesis (2506.15648).

1. Technical Overview and Motivation

deepSURF targets the memory safety assurance gap that arises when Rust developers use the unsafe keyword, which bypasses the language’s compile-time safety guarantees. Existing tools frequently fail to identify subtle memory corruption bugs in unsafe code or require extensive manual curation to validate results and generate proof-of-concept triggering inputs. deepSURF’s design addresses these shortcomings by:

  • Automatically identifying reachable unsafe routines within large codebases.
  • Synthesizing and augmenting fuzzing harnesses that exercise intricate Rust features such as generics, trait bounds, and closures.
  • Employing LLMs to produce semantically coherent, high-coverage API call sequences, thus increasing the likelihood of exposing deeply buried vulnerabilities.
  • Focusing explicitly on memory corruption errors—excluding panics, assertion failures, or safe API misuse from its bug reports.

This technical orientation enables deepSURF to both rediscover known security-critical bugs and uncover previously unknown vulnerabilities with minimal manual intervention.

2. Methodology: Static Analysis Coupled with LLM-Augmented Harness Generation

The deepSURF methodology is characterized by the synthesis of static and generative approaches:

Static Analysis:

  1. Unsafe Encapsulating Functions (UEFs): deepSURF statically analyzes the codebase to locate all safe functions containing unsafe blocks and maps their reachability from public APIs (Unsafe Reaching APIs, URAPIs); a minimal sketch of this distinction follows the list.
  2. Type Analysis and Instantiation Tree: Each URAPI’s argument types—primitive, generic, or complex—are recursively analyzed to determine how they can be instantiated. This includes enumerating all possible constructors and identifying trait/closure dependencies.
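
To make the UEF/URAPI terminology concrete, the following minimal sketch uses a hypothetical crate type (`TinyVec` and all of its methods are invented for illustration and do not come from the paper): `raw_get` is a safe function that encapsulates an unsafe block (a UEF), while `first` is a public API from which that unsafe code is reachable (a URAPI). A few ordinary methods are included so that the later harness sketches can exercise a small API surface.

```rust
// Hypothetical crate under test, illustrating the UEF/URAPI distinction.
pub struct TinyVec {
    data: Vec<u32>,
}

impl TinyVec {
    pub fn new(data: Vec<u32>) -> Self {
        TinyVec { data }
    }

    pub fn with_capacity(cap: usize) -> Self {
        TinyVec { data: Vec::with_capacity(cap) }
    }

    pub fn push(&mut self, v: u32) {
        self.data.push(v);
    }

    pub fn pop(&mut self) -> Option<u32> {
        self.data.pop()
    }

    // UEF: a safe function whose body contains an unsafe block.
    fn raw_get(&self, idx: usize) -> u32 {
        // SAFETY: callers must guarantee idx < self.data.len();
        // a buggy caller would turn this into an out-of-bounds read.
        unsafe { *self.data.as_ptr().add(idx) }
    }

    // URAPI: a public API from which the unsafe block above is reachable.
    pub fn first(&self) -> Option<u32> {
        if self.data.is_empty() {
            None
        } else {
            Some(self.raw_get(0))
        }
    }
}
```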

Automated Harness Generation:

  • For each URAPI, harness code is synthesized, with custom types and tailored trait implementations statically generated as necessary to satisfy type bounds; a simplified harness sketch (without generics) follows this list.
  • For generics and complex trait bounds, deepSURF constructs user-defined (custom) types and implements the requisite traits with method bodies driven by fuzzer input, thereby simulating varied and even malicious user behavior.
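
As a rough illustration, a statically generated harness for the hypothetical `TinyVec::first` URAPI from the earlier sketch might look like the following. This is a sketch under assumptions, not deepSURF's actual output; it uses the `afl` crate's `fuzz!` macro as the AFL++ entry point and assumes the crate under test is in scope.

```rust
// Minimal sketch of a statically generated harness for the hypothetical
// URAPI `TinyVec::first` (assumes `TinyVec` from the previous sketch).
fn main() {
    afl::fuzz!(|data: &[u8]| {
        // Instantiate the receiver type from fuzzer-controlled bytes.
        let values: Vec<u32> = data.iter().map(|b| u32::from(*b)).collect();
        let tv = TinyVec::new(values);

        // Exercise the unsafe-reaching API.
        let _ = tv.first();
    });
}
```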

LLM Augmentation:

  • With the initial static harness as a base, deepSURF employs an LLM (DeepSeek-R1) to further enhance the harness. This includes:
    • Modifying harnesses so that runtime selection among multiple constructor options is governed by the fuzzer.
    • Replacing custom types (for unsafe trait bounds) with library-defined types to avoid introducing spurious vulnerabilities.
    • Generating API call sequences where function selection and order are dynamically determined by fuzzer input for semantically rich test coverage (sketched after this list).
  • Harnesses are iteratively validated for successful compilation, with errors fed back into the prompt for automatic repair.
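
The sketch below gives a rough sense of this kind of augmentation: fuzzer bytes select among constructor options and determine which API calls run and in what order. It reuses the hypothetical `TinyVec` type and the `afl` crate from the earlier sketches and is not deepSURF's actual generated code.

```rust
// Hedged sketch of an LLM-augmented harness: constructor choice and the
// API call sequence are both governed by fuzzer input.
fn main() {
    afl::fuzz!(|data: &[u8]| {
        if data.len() < 2 {
            return;
        }
        let (ctrl, payload) = data.split_at(1);

        // Runtime selection among multiple constructor options.
        let mut tv = match ctrl[0] % 2 {
            0 => TinyVec::new(payload.iter().map(|b| u32::from(*b)).collect()),
            _ => TinyVec::with_capacity(payload.len()),
        };

        // Call sequence whose function selection and order come from fuzzer bytes.
        for op in payload {
            match op % 3 {
                0 => { let _ = tv.first(); }
                1 => tv.push(u32::from(*op)),
                _ => { let _ = tv.pop(); }
            }
        }
    });
}
```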

Fuzzing and Filtering:

  • Harnesses, both LLM-augmented and statically generated, are integrated into a fuzzing workflow built on AFL++ and AddressSanitizer, with crash filtering that keeps only genuine memory safety violations (a toy filtering sketch follows this list).
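
The following toy sketch illustrates the kind of crash filtering described above: only reports that look like memory safety violations (for example, AddressSanitizer findings) are kept, while ordinary panics and assertion failures are discarded. The marker strings and the function itself are illustrative assumptions, not deepSURF's actual filter.

```rust
// Toy crash filter: keep sanitizer-reported memory safety violations,
// drop plain panics and assertion failures.
fn is_memory_safety_crash(report: &str) -> bool {
    const MEMORY_SAFETY_MARKERS: &[&str] = &[
        "AddressSanitizer: heap-buffer-overflow",
        "AddressSanitizer: heap-use-after-free",
        "AddressSanitizer: attempting double-free",
        "SEGV on unknown address",
    ];
    let plain_panic = report.contains("panicked at") && !report.contains("AddressSanitizer");
    !plain_panic && MEMORY_SAFETY_MARKERS.iter().any(|m| report.contains(*m))
}

fn main() {
    let report = "==1234==ERROR: AddressSanitizer: heap-use-after-free on address 0x602000000010";
    assert!(is_memory_safety_crash(report));
}
```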

3. Handling Rust Generics and Trait Logic

A defining innovation in deepSURF is its robust support for exercising generic API surfaces:

  • For each generic argument, required trait bounds and associated types are extracted and analyzed.
  • Custom user types (e.g., struct CustomTy0) are generated for generics, with corresponding trait implementations whose logic consumes fuzzer input. This allows exploration of edge cases and error paths that may depend on specific user-supplied behaviors.
  • For generics with unsafe traits, the LLM is directed to use only existing library types that are known to uphold safety contracts, thereby minimizing false positives attributable to erroneous custom implementations.

Illustrative Example:

```rust
// Custom type generated for a generic parameter; the trait method's behavior
// is driven by fuzzer input (`fz_data` is the fuzzer-provided byte buffer;
// `_to_u8` and `_to_string` are harness helpers that decode it).
struct CustomTy0(String);

impl STrait for CustomTy0 {
    fn desc(&self) -> String {
        // Roughly half of all inputs panic intentionally, simulating an
        // adversarial user-supplied trait implementation.
        if _to_u8(fz_data) % 2 == 0 { panic!("INTENTIONAL PANIC!") }
        _to_string(fz_data)
    }
}
```

This approach enables the fuzzer to simulate diverse and potentially adversarial user code, thereby maximizing the coverage of input-dependent and trait-driven bug triggers.

4. Evaluation on Real-World Rust Libraries

deepSURF was evaluated on a dataset of 27 real-world Rust crates, each containing previously documented memory safety vulnerabilities. The evaluation ran 24 hours of fuzzing per harness, using DeepSeek-R1 as the LLM and AFL++ as the fuzzing engine.

Results:

  • deepSURF rediscovered 20 known memory safety bugs and identified 6 previously unknown vulnerabilities, all of which were reported to the respective maintainers.
  • Types of detected vulnerabilities include double-free, buffer overflow, arbitrary memory access (SEGV), and use-after-free.
  • deepSURF achieved a URAPI (unsafe-reaching API) coverage of 87.3%, successfully generating complex harnesses for nearly 9 out of 10 public APIs capable of invoking unsafe code.
  • In direct comparison, existing tools lagged significantly: RUG achieved just 21.8% URAPI coverage, while RPG and RULF managed only 4% and 3% respectively; none of the three found any memory safety bugs.
  • An ablation study established that omitting either static analysis or LLM augmentation resulted in a sharp loss of both coverage and effective bug finding, underscoring the necessity of the integrated approach.

| Aspect | deepSURF | RUG | RPG/RULF |
|---|---|---|---|
| Bugs found (total) | 26 (inc. 6 new) | 0 | 0 |
| URAPI Coverage | 87.3% | 21.8% | 4% / 3% |
| Custom Trait/Closure Support | ✓ | ± | × |
| Sequence Generation | ✓ (semantic) | Static/fixed | Mostly static |
| Handling Unsafe Traits | LLM/library types | None | None |
| Automation | Full | Partial | |

5. Implications for Rust Security and Automated Vulnerability Discovery

The deployment and evaluation of deepSURF have several notable implications:

  • The persistence of memory corruption bugs in code using Rust’s unsafe blocks demonstrates that language-level safety guarantees are only as strong as the static analysis and testing applied to bypasses of those guarantees.
  • Comprehensive detection of such bugs necessitates the generation of rich, dynamic harnesses capable of simulating realistic and complex usage patterns—including support for generics, custom traits, and closure logic—which in turn require sophisticated tools that combine static reasoning with generative modeling.
  • deepSURF’s methodology of integrating LLM-based augmentation directly into the fuzzing harness synthesis pipeline results in higher coverage and greater bug-finding efficacy compared to tools relying solely on enumeration or static code analysis.

A plausible implication is that extension of this approach to support asynchronous, multi-threaded, or cross-crate scenarios could further increase the scope of discoverable vulnerabilities. Additionally, deeper integration of fuzzer feedback into the LLM harness refinement loop may facilitate dynamic adaptation of generation prompts, optimizing for coverage of previously unexplored code regions.

6. Future Research Directions

Several extensions and optimizations are suggested to advance the deepSURF methodology:

  • Expansion of code analysis capabilities to cover async programming models, concurrency, and inter-crate/cross-API boundaries.
  • Enhancements in LLM prompting and scaling, including the exploration of larger context windows, improved prompt engineering, and LLMs tuned specifically for static analysis and code security tasks.
  • Integration with continuous integration (CI/CD) pipelines to facilitate routine and automated vulnerability screening within the software development lifecycle.
  • Development and refinement of public benchmarks and datasets for unbiased, reproducible evaluation of memory safety tools in Rust and related safe-systems programming languages.

7. Summary Table: Key Features and Results

| Feature | deepSURF Implementation |
|---|---|
| Targeted Bug Class | Memory corruption in unsafe Rust code |
| Harness Synthesis | Static + LLM-augmented, generics/traits |
| Sequence Support | Semantically rich, dynamic sequences |
| URAPI Coverage | 87.3% |
| Bugs Found | 26 (6 new, 20 known rediscovered) |
| Automation Level | Fully automated |
| Baseline Comparison | Outperforms RUG, RPG, RULF |

deepSURF represents an effective integration of static code analysis and generative LLM-based augmentation for automated memory safety validation in Rust, achieving substantially higher coverage and bug-detection capability than prior automated tools. Its methodology demonstrates the growing viability of combining symbolic and learning-based synthesis for security testing in strongly typed systems programming environments.
