OODTE: A Differential Testing Engine for the ONNX Optimizer (2505.01892v2)

Published 3 May 2025 in cs.LG, cs.AI, cs.SE, cs.SY, and eess.SY

Abstract: With over 700 stars on GitHub and being part of the official ONNX repository, the ONNX Optimizer is the default tool for applying graph-based optimizations to ONNX models. Despite its widespread use, its ability to maintain model accuracy during optimization has not been thoroughly investigated. In this work, we present OODTE, a utility designed to automatically and comprehensively evaluate the correctness of the ONNX Optimizer. OODTE adopts a straightforward yet powerful differential testing and evaluation methodology, which can be readily adapted for use with other compiler optimizers. Specifically, OODTE takes a collection of ONNX models, applies optimizations, and executes both the original and optimized versions across a user-defined input set, automatically capturing any issues encountered during optimization. When discrepancies in accuracy arise, OODTE iteratively isolates the responsible optimization pass by repeating the process at a finer granularity. We applied OODTE to 130 well-known models from the official ONNX Model Hub, spanning diverse tasks including classification, object detection, semantic segmentation, text summarization, question answering, and sentiment analysis. Our evaluation revealed that 9.2% of the model instances either caused the optimizer to crash or led to the generation of invalid models using default optimization strategies. Additionally, 30% of classification models and 16.6% of object detection and segmentation models exhibited differing outputs across original and optimized versions, whereas models focused on text-related tasks were generally robust to optimization. OODTE uncovered 15 issues (14 previously unknown) affecting 9 of 47 optimization passes and the optimizer overall. All issues were reported to the ONNX Optimizer team. OODTE offers a simple but effective framework for validating AI model optimizers, applicable beyond the ONNX ecosystem.

Summary

Differential Testing of the ONNX Optimizer Through OODTE

The paper introduces OODTE (the ONNX Optimizer Differential Testing Engine), a utility designed to rigorously evaluate the correctness of the ONNX Optimizer, the default tool for applying graph-based optimizations to ONNX models. Despite its prominence, the optimizer's impact on model accuracy had not been extensively evaluated until now.

OODTE employs a differential testing approach: it runs the original and optimized versions of each model on the same inputs and compares their outputs to identify discrepancies. Notably, 9.2% of the tested model instances either crashed the optimizer or yielded invalid models under the default optimization strategies. Accuracy deviations were also found in 30% of classification models and 16.6% of object detection and segmentation models, whereas models for text-related tasks generally proved robust to optimization. These findings point to concrete shortcomings in the optimizer's implementation, as illustrated by the sketch below.
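To make the methodology concrete, the following is a minimal sketch of such a differential-testing loop, not the authors' implementation: it assumes the onnx, onnxoptimizer, onnxruntime, and numpy packages, a single-input float model at a placeholder path, and random inputs standing in for a user-defined input set.

```python
import numpy as np
import onnx
import onnxoptimizer
import onnxruntime as ort

def outputs_match(model_a, model_b, n_inputs=10, rtol=1e-4, atol=1e-5):
    """Run both models on the same random inputs and check their outputs agree."""
    sess_a = ort.InferenceSession(model_a.SerializeToString())
    sess_b = ort.InferenceSession(model_b.SerializeToString())
    inp = sess_a.get_inputs()[0]
    # Replace symbolic/dynamic dimensions with 1 to keep the example runnable.
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    for _ in range(n_inputs):
        x = np.random.rand(*shape).astype(np.float32)  # assumes a float input
        out_a = sess_a.run(None, {inp.name: x})
        out_b = sess_b.run(None, {inp.name: x})
        if not all(np.allclose(a, b, rtol=rtol, atol=atol)
                   for a, b in zip(out_a, out_b)):
            return False
    return True

model = onnx.load("model.onnx")            # hypothetical model path
optimized = onnxoptimizer.optimize(model)  # applies the default passes
if not outputs_match(model, optimized):
    print("Optimized model diverges from the original")
```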

The paper presents an empirical study involving 130 models sourced from the official ONNX Model Hub, chosen to cover a broad spectrum of tasks including classification, object detection, semantic segmentation, and text-related tasks such as summarization, question answering, and sentiment analysis. This selection reflects the aim of benchmarking optimization impacts across diverse machine learning applications. The strict adherence to reproducible tests and comprehensive bug reporting underscores a methodical approach, yielding 15 distinct issues (14 previously undocumented) affecting 9 of the 47 optimization passes as well as the optimizer overall, all of which were reported to the ONNX Optimizer team.

The reported results indicate that object detection models are particularly prone to crashing faults during optimization, likely owing to their structural complexity, while classification models are more likely to exhibit accuracy deviations, particularly with older opset versions and specific passes such as fuse_bn_into_conv. Text-focused models, by contrast, showed minimal degradation when optimized, highlighting the robustness of this domain against optimization faults.
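The pass-isolation step described in the abstract, which pinpoints the optimization pass responsible for a discrepancy, can be sketched in the same spirit by re-applying each pass individually and repeating the comparison. This sketch reuses the model and the hypothetical outputs_match helper from the previous example; get_available_passes is part of the onnxoptimizer package.

```python
import onnxoptimizer

# Reuses `model` and the `outputs_match` helper defined in the sketch above.
for pass_name in onnxoptimizer.get_available_passes():
    try:
        optimized = onnxoptimizer.optimize(model, passes=[pass_name])
    except Exception as exc:  # the paper also reports optimizer crashes
        print(f"Pass {pass_name} crashed the optimizer: {exc}")
        continue
    if not outputs_match(model, optimized):
        print(f"Pass {pass_name} changed the model's outputs")
```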

Future implications include broadening the scope of testing to other model types, such as speech recognition and super-resolution, which remain untested by OODTE. In addition, measuring execution-time improvements against accuracy losses, an aspect OODTE does not currently capture, could give a fuller picture of optimization trade-offs. Extending the differential testing methodology to cover variations in optimization pass ordering represents another avenue for improving the robustness and understanding of model optimizers.

In conclusion, OODTE represents a significant step toward ensuring the reliability and correctness of AI model optimizations. By meticulously documenting issues and engaging with the ONNX Optimizer developers, the paper not only highlights areas needing refinement but also promotes an ongoing dialogue aimed at preserving model fidelity post-optimization. Such validation frameworks are important as AI technologies evolve across both academic research and practical deployment.
