
TitanFuzz: LLM-Driven Fuzzing

Updated 16 December 2025
  • TitanFuzz is an automated, coverage-driven fuzzing framework that leverages LLMs to generate and mutate test cases for deep learning APIs and hardware verification.
  • It uses evolutionary mutation, differential testing oracles, and LLM-based seed generation to expose both crash and silent computation faults efficiently.
  • TitanFuzz shows significant gains in API and code coverage over state-of-the-art methods while also highlighting challenges like computational cost and edge-case detection.

TitanFuzz is the designation for independent, high-impact fuzzing approaches in two distinct research domains: (1) deep learning library fuzzing via LLMs (Deng et al., 2022), and (2) hardware design verification using software-inspired fuzz strategies for hardware description languages (HDLs) (Trippel et al., 2021). In both contexts, TitanFuzz refers to automated, coverage-driven methodologies that leverage recent advances in test-case synthesis and feedback-guided mutation to expose faults in complex, high-assurance systems. Despite domain differences, both variants share the goal of maximizing coverage and bug discovery in domains traditionally seen as resistant to undirected or syntactically naive fuzzing.

1. TitanFuzz for Deep Learning Libraries via LLMs

TitanFuzz, as introduced in (Deng et al., 2022), is a two-phase, end-to-end zero-shot fuzzer for Python-based deep learning (DL) libraries, targeting TensorFlow and PyTorch. Its core innovation is the direct use of LLMs, specifically OpenAI Codex (“code-davinci-002”) and InCoder, for both initial seed program generation and subsequent mutation.

Seed Generation via Generative LLMs

TitanFuzz uses Codex-based autoregressive generation to produce valid, diverse Python program seeds for DL library APIs. For each target API, TitanFuzz constructs a structured prompt comprising a docstring and stepwise instructions:

  1. Import the relevant DL library.
  2. Synthesize the required input tensors.
  3. Invoke the specific API (e.g., tf.nn.conv2d).

Sampling uses top-p (p=0.95), temperature=0.4, and a maximum of 256 tokens. Output candidates are filtered for syntactic correctness and runtime validity before coverage metrics are computed.
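The prompt-and-filter pipeline above can be sketched as follows. The function names and prompt layout here are illustrative assumptions, not TitanFuzz's actual implementation, and the LLM call itself (which would carry the top-p 0.95, temperature 0.4, 256-token settings) is elided:

```python
import ast


def build_seed_prompt(library: str, api_name: str, docstring: str) -> str:
    """Assemble a TitanFuzz-style seed prompt: the target API's docstring
    followed by the three stepwise instructions from the text."""
    return (
        f'"""\n{docstring}\n"""\n'
        f"# 1. Import {library}\n"
        f"# 2. Synthesize the required input tensors\n"
        f"# 3. Invoke {api_name}\n"
    )


def is_syntactically_valid(candidate: str) -> bool:
    """First-stage filter: keep only candidates that parse as Python.
    Runtime validity is checked separately by executing the seed."""
    try:
        ast.parse(candidate)
        return True
    except SyntaxError:
        return False
```

In practice each sampled completion would pass through `is_syntactically_valid` before execution, so coverage is only measured on programs that at least parse.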

Mutation via Evolutionary Infilling LLMs

Subsequent program mutation employs InCoder, capable of masked infilling, guided by a multi-armed bandit using Thompson Sampling for operator selection (e.g., argument replacement, keyword insertion, method-name swap). Each candidate's fitness is computed as

fitness(C) = D + U − R

where D is the maximum data-flow depth, U the number of distinct API calls, and R the number of repeated API calls. The process repeats iteratively, prioritizing the top-N seeds via a softmax over their fitness scores.
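A minimal sketch of the fitness score and the two selection mechanisms (Thompson Sampling over mutation operators, softmax prioritization over seeds). All names and the Beta-prior bookkeeping are illustrative assumptions, not TitanFuzz's internals:

```python
import math
import random


def fitness(depth: int, unique_calls: int, repeated_calls: int) -> int:
    """fitness(C) = D + U - R, as defined in the text."""
    return depth + unique_calls - repeated_calls


def softmax_pick(seeds, scores, rng=random.Random(0)):
    """Sample one seed with probability proportional to softmax(fitness)."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return rng.choices(seeds, weights=[e / total for e in exps], k=1)[0]


def thompson_pick(operators, successes, failures, rng=random.Random(0)):
    """Thompson Sampling over mutation operators (e.g. argument replacement,
    keyword insertion): draw from each arm's Beta posterior, take the max."""
    draws = {op: rng.betavariate(successes[op] + 1, failures[op] + 1)
             for op in operators}
    return max(draws, key=draws.get)
```

Operators that recently produced interesting mutants accumulate successes and are sampled more often, while the softmax keeps some probability mass on lower-fitness seeds to avoid premature convergence.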

Differential Oracle

TitanFuzz applies automated differential testing oracles by running candidates on both CPU and GPU backends, flagging both hard crashes (segfaults, INTERNAL_ASSERT_FAILED) and silent wrong computations (|v_cpu − v_gpu| > ε).
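The oracle logic can be sketched as below; `run_cpu`/`run_gpu` stand in for backend executors and the tolerance is an assumed placeholder. Note that hard crashes like segfaults are in practice detected at the process level, not as Python exceptions:

```python
import numpy as np

ATOL = 1e-3  # assumed epsilon for tolerable floating-point divergence


def differential_oracle(run_cpu, run_gpu, seed_program):
    """Run the same seed on two backends; report crashes and silent
    numerical divergence between the resulting values."""
    try:
        v_cpu = run_cpu(seed_program)
        v_gpu = run_gpu(seed_program)
    except Exception as exc:          # INTERNAL_ASSERT_FAILED-style errors;
        return ("crash", repr(exc))   # segfaults need process-level capture
    if not np.allclose(v_cpu, v_gpu, atol=ATOL):
        return ("wrong-computation", float(np.max(np.abs(v_cpu - v_gpu))))
    return ("pass", None)
```

The "wrong-computation" branch is what lets TitanFuzz surface silent bugs that never crash either backend.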

2. TitanFuzz in Hardware Design Verification

Described in (Trippel et al., 2021), TitanFuzz adapts grey-box software fuzzing to the verification of hardware modules defined in Verilog/SystemVerilog. Here, hardware RTL is translated—using Verilator—into a cycle-accurate C++ simulation binary, with coverage and crash detection hooks inserted directly into the device-under-test's generated code.

RTL-to-Software Translation and Harness Design

Verilator processes the HDL into C++ classes for simulation, exposing registers, memories, and I/O as typed variables. TitanFuzz then uses generic harnesses—a byte-to-cycle interface mapping input data to hardware I/O over simulation cycles, or a bus-centric harness that interprets fuzzer-supplied inputs as TileLink-UL "instructions" (read, write, wait).
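Both harness styles can be modeled in a few lines. This is a Python sketch of the idea only (the real harnesses are C++ wrappers around the Verilated model), and the 6-byte record encoding for the bus-centric variant is an invented illustration:

```python
def byte_to_cycle_harness(fuzz_input: bytes, dut_step, input_width_bytes=1):
    """Byte-to-cycle harness: map a flat fuzzer byte string onto DUT
    inputs, one chunk per simulated clock cycle."""
    cycles = 0
    for i in range(0, len(fuzz_input), input_width_bytes):
        chunk = fuzz_input[i : i + input_width_bytes]
        dut_step(int.from_bytes(chunk, "little"))  # drive inputs, tick clock
        cycles += 1
    return cycles


OPCODES = {0: "wait", 1: "read", 2: "write"}  # illustrative encoding


def decode_bus_ops(fuzz_input: bytes):
    """Bus-centric harness: interpret each 6-byte record as
    (opcode, 4-byte address, 1-byte data) TileLink-UL-style operations."""
    ops = []
    for i in range(0, len(fuzz_input) - 5, 6):
        op = OPCODES[fuzz_input[i] % 3]
        addr = int.from_bytes(fuzz_input[i + 1 : i + 5], "little")
        data = fuzz_input[i + 5]
        ops.append((op, addr, data))
    return ops
```

The bus-centric decoding gives the fuzzer a compact grammar over bus transactions, so mutations at the byte level translate into structurally valid read/write/wait sequences.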

Error Oracles and Coverage

Crashes are defined as assertion failures (e.g., SVA violations) or deviations from a golden model, each triggering process aborts detectable by the fuzzing framework (AFL/libFuzzer). Coverage is quantified both at the HDL line level and as state-transition coverage over the RTL's finite state machines:

  • Cov_line = |C| / |L| (HDL lines covered / all HDL lines)
  • Cov_FSM = |T_hit| / |T| (FSM transitions covered / total transitions)
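Both metrics are simple ratios over instrumented sets; a minimal sketch (function names are illustrative):

```python
def line_coverage(covered_lines: set, all_lines: set) -> float:
    """Cov_line = |C| / |L| over the DUT's HDL lines."""
    return len(covered_lines) / len(all_lines)


def fsm_coverage(hit_transitions: set, all_transitions: set) -> float:
    """Cov_FSM = |T_hit| / |T| over the RTL's FSM transitions,
    where each transition is a (from_state, to_state) pair."""
    return len(hit_transitions) / len(all_transitions)
```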

Only DUT code is instrumented for coverage; harness and simulation runtime are excluded to improve throughput.

3. Evaluation Methodologies and Results

Deep Learning Library Fuzzing

TitanFuzz was evaluated on TensorFlow v2.10 (3,316 APIs) and PyTorch v1.12 (1,593 APIs), with comparisons to FreeFuzz, DeepREL, LEMON, and Muffin. Under a budget of 1–4 minutes per API, TitanFuzz achieved:

                    TensorFlow    PyTorch
API coverage        2,215         1,329
Relative gain       +91.1%        +24.1%
Code coverage       39.97%        20.98%
Gain vs. SOTA       +30.38%       +50.84%
Bugs found          65 total (41 previously unknown)

Running time per valid program was 0.67–1.69s depending on the model and backend. TitanFuzz's coverage and bug findings established a significant advance over baselines, especially in exercising more APIs and uncovering previously unreported faults (Deng et al., 2022).

Hardware Verification

For OpenTitan SoC IPs (AES, HMAC, KMAC, RV-Timer), TitanFuzz attained:

Core        HDL line coverage (1 h, empty seed)
AES         90.1%
HMAC        89.4%
KMAC        88.7%
RV-Timer    65.4%

On 2–64-state FSMs, TitanFuzz was 10–80× faster than unconstrained CRV in achieving full transition coverage, and two orders of magnitude faster overall in aggregate (Trippel et al., 2021).

4. Advantages, Limitations, and Open Problems

Strengths

  • Zero-shot capability: TitanFuzz requires no manually-defined input grammars or schemas, enabled by LLMs that have internalized DL API usage and shape semantics (Deng et al., 2022).
  • Generality: For hardware, design-agnostic harnesses and binary grammars allow reuse across RTL designs.
  • Automation and coverage: Differential oracles and feedback-guided search yield discoveries of both crash and non-crash bugs; coverage-driven harnesses maximize exercised state space.
  • Efficiency: the hardware variant scales to large modules without input constraints, instruments only the DUT, and incurs negligible reset overhead.

Limitations

  • Cost: LLM-based fuzzing is computationally and monetarily expensive due to API call latencies and cloud costs, notably in Codex usage.
  • Coverage saturation: Even with large-scale generation, raw line coverage remains below 40% (TF) and 21% (PyTorch).
  • False positives: For DL, certain CPU-GPU differentials represent benign implementation variation rather than genuine faults, requiring human triage.
  • Domain specificity: Edge-case behaviors are underrepresented in standard LM training corpora, suggesting that rare or adversarial paths may be missed absent historical priming (Deng et al., 2023).

5. Relationship to FuzzGPT

TitanFuzz’s zero-shot approach was foundational for subsequent LLM-based fuzzing, notably FuzzGPT (Deng et al., 2023). FuzzGPT extends TitanFuzz by mining and leveraging historical bug-triggering code from repositories to drive LLMs toward rare/edge-case program generation through in-context learning, fine-tuning, and chain-of-thought prompts. FuzzGPT demonstrates:

  • Substantially higher code and API coverage (e.g., PyTorch code coverage 31–33% vs. TitanFuzz’s 21%).
  • More unique bug discoveries (49 new bugs, including 11 high-priority/security bugs vs. TitanFuzz’s 2).
  • Ability to discover edge cases (e.g. zero-length tensors, buffer aliasing) that TitanFuzz’s zero-shot pipeline misses.

This highlights that while TitanFuzz enables broad and deep bug exposure via generative and mutation-based LLM use, incorporation of history and targeted edge-case prompting can further enhance coverage and bug-finding capability.

6. Future Directions

Future improvements outlined in (Deng et al., 2022) include:

  • Type/shape-aware pruning: Incorporating symbolic or lightweight type-checking to further reduce invalid LLM-generated mutants.
  • Efficient LLMs: Fine-tuning smaller models on DL-API corpora to reduce cost without sacrificing domain specificity.
  • Extended oracles and domains: Expanding beyond CPU/GPU differential oracles to include invariants, metamorphic relations; generalizing to domains such as compilers, database engines, and SMT solvers.
  • Prompt engineering: Leveraging instruct-specialty LLMs and improved prompt schemas to enhance rare-bug discovery without explicit fine-tuning.

A plausible implication is that the TitanFuzz architecture forms a baseline for future research that integrates LLM-driven generation, feedback-directed mutation, and domain-informed prompting for scalable, automated robustness testing.

7. Conclusion

TitanFuzz represents a significant step towards scalable, fully automated fuzzing for complex systems, both in deep learning software and hardware verification. Its combination of LLM-driven seed generation, evolutionary mutation, and differential oracles establishes new baselines for coverage and bug discovery. Subsequent advancements, notably FuzzGPT, reinforce the continued potential for combining historical examples and prompt engineering with LLMs to further improve edge-case detection and robustness assurance in critical-system software and hardware (Deng et al., 2022, Trippel et al., 2021, Deng et al., 2023).
