
Data-driven Test Generation for Fuzzing AI Compiler

Published 24 Jan 2026 in cs.SE | (2601.17450v1)

Abstract: AI compilers are critical for efficiently deploying AI models across diverse hardware platforms. However, they remain prone to bugs that can compromise both compiler reliability and model correctness. Thus, ensuring the quality of AI compilers is crucial. In this work, we present a unified data-driven testing framework that systematically addresses stage-specific challenges in AI compilers. Specifically, OPERA migrates tests for AI libraries to test various operator conversion logic in the model loading stage. OATest synthesizes diverse optimization-aware computational graphs for testing high-level optimizations. HARMONY generates and mutates diverse low-level IR seeds to generate hardware-optimization-aware tests for testing low-level optimizations. Together, these techniques provide a comprehensive, stage-aware framework that enhances testing coverage and effectiveness, detecting 266 previously unknown bugs in four widely used AI compilers.

Summary

  • The paper demonstrates a unified data-driven approach that uses stage-aware test generation to reveal 266 previously unknown bugs in AI compilers.
  • It employs three methods—migration-based, synthesis-based, and mutation-based—to target model loading, high-level IR optimizations, and low-level hardware-specific stages.
  • Empirical results indicate improvements in test prioritization efficiency (up to 47.4%) and higher coverage compared to existing fuzzing methods.

Data-driven Test Generation for Systematic Fuzzing of AI Compilers

Motivation and Challenges in AI Compiler Testing

AI compilers serve as essential infrastructure for deploying neural models across heterogeneous hardware, yet their complexity and rapid evolution introduce significant reliability risks. Architectural diversity and the richness of frontends (e.g., PyTorch, TensorFlow) exacerbate the difficulty of fully exercising compilation logic across all phases, including model loading, high-level IR optimization, and hardware targeting. Empirical studies show that bugs are distributed across all stages, often escaping traditional test generation methods, which lack tight coupling to the semantics and optimization logic specific to each compiler phase.

Three critical gaps exist in the current literature: First, achieving comprehensive frontend coverage is nontrivial due to the nuanced parameter space and semantics of operator conversions from diverse libraries. Second, generating context-sensitive computational graphs that meaningfully stress high-level optimization passes remains unsolved by random or pure combinatorial synthesis approaches. Third, hardware-optimization-specific IR mutation is challenged by the intricacy of backend transformation logic, deep invocation chains, and incomplete or obsolete documentation on optimization patterns.

Unified Stage-aware Data-driven Test Generation Approach

This work introduces a unified framework that integrates three mutually reinforcing, stage-aware test generation methodologies:

  1. Model Loading Stage – OPERA: A migration-based strategy that automatically harvests and adapts existing AI library tests to generate operator conversion test cases.
  2. High-level Optimization Stage – OATest: A synthesis-based strategy that mines and recombines optimization patterns from developer-written tests to create computational graphs sensitive to hardware-independent optimizations.
  3. Low-level Optimization Stage – HARMONY: A mutation-based pipeline that utilizes LLM-guided seed generation and constrained mutations, informed by cross-checked documentation and code examples, to fuzz hardware-dependent optimization logic.

Each component is technically aligned to the unique semantics and challenges at its respective pipeline stage, maximizing coverage while controlling test redundancy and noise.
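The stage-aware composition described above can be sketched as a simple dispatch from compilation stage to generator. This is purely illustrative; the function and stage names are assumptions, not the paper's actual code structure.

```python
# Illustrative sketch of the stage-aware framework: one generator per
# compilation stage, fanned out by a small dispatcher. Names are
# hypothetical placeholders, not taken from the paper's implementation.

def opera_tests():
    """Model loading stage: operator tests migrated from AI libraries."""
    return ["single_op_model_conv2d"]

def oatest_tests():
    """High-level optimization stage: synthesized optimization-aware graphs."""
    return ["graph_with_fusion_pattern"]

def harmony_tests():
    """Low-level optimization stage: mutated hardware-aware IR seeds."""
    return ["mutated_tir_seed"]

STAGE_GENERATORS = {
    "model_loading": opera_tests,
    "highlevel_opt": oatest_tests,
    "lowlevel_opt": harmony_tests,
}

def generate_all():
    """Produce a test batch per pipeline stage."""
    return {stage: gen() for stage, gen in STAGE_GENERATORS.items()}
```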

Technical Contributions

OPERA: Migration-based Model Loading Test Generation

OPERA leverages the observation that operator implementations in AI libraries, and their validation suites, implicitly encode rich behavioral specifications. By instrumenting model-construction APIs in the source code of libraries (e.g., PyTorch, Keras), OPERA extracts real operator instances from in situ test executions. These instances are programmatically re-encapsulated into single-operator models suitable for feeding into the compiler frontend. To scale to large test corpora and control computational expense, OPERA applies a semantic clustering and prioritization algorithm, ensuring coverage-maximizing selection under constrained test budgets. Empirical evaluation over multiple compiler frontends revealed 170 new bugs, with confirmed issues spanning all root categories of model loading defects, including tensor shape inference, type mismatches, exception logic, and cross-library incompatibility. The clustering-based test prioritization yields substantial improvements (11.9%–47.4%) in efficiency over baseline prioritization methods.
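The coverage-maximizing selection under a test budget can be illustrated with a greedy prioritization loop. This is a minimal sketch in the spirit of OPERA's prioritization step, not the paper's actual algorithm; the feature labels and the greedy strategy are assumptions.

```python
# Greedy coverage-guided test prioritization under a budget: repeatedly
# pick the test that covers the most not-yet-covered operator features.
# A simplified stand-in for OPERA's clustering/prioritization, for illustration.

def prioritize(tests, budget):
    """tests: dict mapping test id -> set of exercised features.
    Returns (selected test ids, union of covered features)."""
    covered = set()
    selected = []
    remaining = dict(tests)
    while remaining and len(selected) < budget:
        # Choose the test with the largest marginal coverage gain;
        # ties broken by test id for determinism.
        best = max(remaining, key=lambda t: (len(remaining[t] - covered), t))
        gain = remaining[best] - covered
        if not gain:
            break  # no remaining test adds new coverage
        covered |= gain
        selected.append(best)
        del remaining[best]
    return selected, covered

# Each migrated operator test summarized by hypothetical feature labels:
tests = {
    "conv2d_pad":   {"conv2d", "padding", "stride", "dilation"},
    "conv2d_basic": {"conv2d", "stride"},
    "relu_inplace": {"relu"},
    "pool_ceil":    {"maxpool", "ceil_mode", "padding"},
}
order, cov = prioritize(tests, budget=2)
```

Under a budget of 2, the greedy pass selects the two tests whose combined feature sets are largest, skipping tests whose features are already subsumed.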

OATest: Optimization-aware Computational Graph Synthesis

OATest targets the high-level IR, where semantics-rich graph transformations (e.g., fusion, layout inference) are both context- and granularity-sensitive. By instrumenting compiler source to mine optimization patterns from existing developer tests, OATest generates a parametric template bank indexed by transformation type and granularity. These patterns are then algorithmically inserted into seed context graphs—either by reusing available node I/O or by synthesizing compatible nodes to maintain correctness—thus generating optimization-aware test cases. Randomized selection and synthesis points ensure broad structural exploration of the possible optimization spaces. Comparative analysis against state-of-the-art generators (NNSmith, WhiteFox) demonstrated significant gains: 60.2% higher branch coverage and 66.98% higher line coverage on TVM and ONNXRuntime, with 42 of 56 new bugs independently confirmed.
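The pattern-insertion idea can be sketched on a toy graph representation: a mined, optimization-triggering pattern (here a hypothetical fusion-prone conv→relu pair) is spliced in after a randomly chosen compatible node. This mirrors the idea of OATest, not its actual implementation; the graph encoding and pattern are assumptions.

```python
import random

# Toy computational graph: list of (node_id, op, input_node_ids).
# insert_pattern splices a linear op pattern after a randomly chosen
# anchor node, reusing its output as the pattern's input.

def insert_pattern(graph, pattern, rng):
    """Splice `pattern` (a list of op names) into `graph` so that its
    first op consumes the output of a randomly selected anchor node."""
    anchor = rng.choice([node_id for node_id, _, _ in graph])
    new_nodes = []
    prev = anchor
    for i, op in enumerate(pattern):
        nid = f"{op}_{anchor}_{i}"
        new_nodes.append((nid, op, [prev]))  # chain each pattern op
        prev = nid
    return graph + new_nodes

rng = random.Random(0)
seed = [("x", "input", []), ("c1", "conv2d", ["x"])]
fusion_pattern = ["conv2d", "relu"]  # hypothetical mined fusion pattern
g = insert_pattern(seed, fusion_pattern, rng)
```

Varying the random anchor (and, in a fuller version, synthesizing shape-adapting nodes when I/O is incompatible) drives the structural exploration the paper describes.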

HARMONY: Mutation-based Low-level Optimization Fuzzing

HARMONY addresses hardware-specific IR lowering stages, characterized by high structural complexity and optimization constraints. It employs a dual-LLM pipeline: one LLM generates diverse, valid low-level IR seeds (guided by multi-source operator constraints to mitigate hallucinations and invalidity); a second LLM, informed by extracted optimization patterns from documentation and codebases, mutates these seeds while preserving semantic validity. The approach improves mutation efficiency and reduces test rejection rate relative to approaches that generate from scratch. HARMONY yielded 40 new bug discoveries in TVM’s low-level optimization layer, with marked improvement in backend-specific coverage over canonical fuzzers (NNSmith, Tzer).
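The mutate-then-validate loop can be illustrated on a toy low-level IR. The "mutator" below is a deterministic stand-in for the LLM (hypothetical, not the paper's prompt or model), and the validity check plays the role of filtering hallucinated or structurally invalid mutants before they reach the compiler.

```python
import random

# Toy low-level IR: list of (dst_var, op, src_vars) statements.

def is_valid(ir):
    """Toy structural constraint: every used variable must be
    defined by an earlier statement (rejects dangling references)."""
    defined = set()
    for dst, _, srcs in ir:
        if any(s not in defined for s in srcs):
            return False
        defined.add(dst)
    return True

def mutate(ir, rng):
    """Stand-in for the LLM mutator: duplicate a random statement
    under a fresh name (a simple, validity-preserving toy mutation)."""
    i = rng.randrange(len(ir))
    dst, op, srcs = ir[i]
    out = list(ir)
    out.insert(i + 1, (f"{dst}_m", op, srcs))
    return out

def fuzz(seed_ir, rounds, rng):
    """Iteratively mutate, keeping only mutants that pass validation."""
    accepted, ir = [], seed_ir
    for _ in range(rounds):
        cand = mutate(ir, rng)
        if is_valid(cand):  # reject invalid mutants instead of compiling them
            accepted.append(cand)
            ir = cand
    return accepted

seed = [("a", "load", []), ("b", "add", ["a"]), ("c", "mul", ["a", "b"])]
mutants = fuzz(seed, rounds=3, rng=random.Random(1))
```

In HARMONY the validity constraints come from cross-checked documentation and code examples rather than a hand-written predicate, but the accept/reject structure is the same, which is what lowers the test rejection rate relative to generating IR from scratch.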

Empirical Results and Implications

The framework enabled the detection of 266 previously unreported bugs across TVM, TensorRT, ONNXRuntime, and OpenVINO. Notably, each generator achieved its strongest results in its targeted pipeline phase, validating the thesis that specialized, stage-aware test generation is essential for end-to-end AI compiler reliability. The confirmed and fixed bug count underscores the production relevance of the discovered defects; for example, OPERA’s findings directly led to numerous developer patches and improvements in frontend robustness.

The practical implications are significant for both AI compiler developers and downstream model deployment teams. Systematic and stage-specific fuzzing powerfully reduces latent bugs that would otherwise manifest as silent numerical inaccuracies or fatal runtime faults in production AI workloads.

Theoretical Implications and Future Directions

This work advances the testing landscape by coupling stage-specific semantic knowledge with data-driven and LLM-based synthesis/mutation. It demonstrates that legacy test assets (including developer-authored tests and documentation) can be repurposed at scale to fuzz deep learning compilers more efficiently than pure random or domain-agnostic approaches.

Future work will generalize the approach to next-generation AI compilers for models beyond classical DNNs, such as LLMs and specialized DSL-driven architectures (e.g., Triton, MLC-LLM, Tilelang). Another critical direction is integrating automatic bug localization and repair, closing the loop from automated discovery to patch. This will further leverage LLM capabilities for program analysis, repair suggestion, and test triage.

Conclusion

The presented framework systematically bridges the critical gaps in AI compiler testing by delivering targeted, data-driven, and optimization-aware test generation tools for each compilation stage. The approach sets a robust baseline for subsequent work in end-to-end AI compiler quality assurance and highlights the continued integration of program analysis, developer knowledge, and generative models to address the evolving complexity of machine learning infrastructure (2601.17450).
