ONNXExplainer: Real-Time Shapley Attributions
- ONNXExplainer is a framework-agnostic explainer that computes Shapley-style feature attributions by integrating forward and backward graphs within the ONNX runtime.
- It employs custom reverse-mode automatic differentiation and DeepLIFT multipliers to approximate Shapley values, reducing computational complexity from O(2^n) to O(|R|).
- The system uses cache-based optimization for one-shot, real-time deployment, markedly reducing explanation latency relative to traditional methods like SHAP.
ONNXExplainer is a generic, framework-agnostic explainer that computes Shapley-value–style feature attributions for neural networks represented in the ONNX (Open Neural Network Exchange) format. Designed to address the inefficiencies and lack of cross-platform support in existing explainers such as SHAP for TensorFlow and PyTorch, ONNXExplainer introduces custom automatic differentiation, graph-level optimizations, and one-shot deployment, enabling efficient, model-agnostic, and real-time explainability directly within the ONNX runtime ecosystem (Zhao et al., 2023).
1. System Architecture and Workflow
ONNXExplainer orchestrates model explainability by transforming the ONNX model into integrated forward and backward graphs. The principal architecture consists of the following components:
- Model Loader & Parser: Loads a frozen ONNX computational graph from storage, constructs a forward symbolic graph, and inverts this into a backward graph where each node contains dedicated "flow-in" and "flow-out" slots for tracking gradients.
- Gradient Engine / Automatic Differentiation: Implements a custom reverse-mode AD mechanism through a single DFS over the backward graph. Supports four gradient propagation types (one-to-one, many-to-one, one-to-many, many-to-many). Local gradients are derived for standard ONNX ops (e.g. MatMul, Conv, Add, Pooling) using exact partial derivatives for linear ops and DeepLIFT multipliers for nonlinear ops.
- Shapley Calculator: Uses DeepLIFT multipliers to induce efficient Shapley value approximations over a user-defined reference set $R$:
$$\phi(x) \approx \frac{1}{|R|} \sum_{r \in R} M(x, r) \odot (x - r)$$
where $M(x, r)$ is the computed multiplier matrix, $x$ is the query input, and $R$ is the set of reference inputs.
- Optimizer / Cache: Precomputes and caches all outputs and intermediate activations $a_l(r)$, for $r \in R$, during graph construction. During explanation, only a single forward and backward pass on the query input $x$ is executed, while reference terms are acquired via efficient lookup from cache.
This workflow allows both inference and explanation to be performed within a single integrated ONNX graph, enabling cross-platform and real-time deployment.
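The graph inversion and single-pass propagation described above can be illustrated with a minimal sketch. It is not the ONNXExplainer implementation: the toy graph, the edge format, and the names `invert`, `backpropagate`, and `flow_in` are assumptions made for illustration, and only linear local multipliers are used. The example includes a one-to-many fan-out from `x` and a many-to-one merge into `y`.

```python
from collections import defaultdict

# Hypothetical toy graph: a = 2*x, b = 3*x, y = a + b.
# Each edge is (source, destination, local multiplier d(destination)/d(source)).
FORWARD_EDGES = [
    ("x", "a", 2.0),
    ("x", "b", 3.0),
    ("a", "y", 1.0),
    ("b", "y", 1.0),
]


def invert(edges):
    """Backward graph: map each node to the (producer, local multiplier) pairs feeding it."""
    backward = defaultdict(list)
    for src, dst, local in edges:
        backward[dst].append((src, local))
    return backward


def backpropagate(edges, output):
    """Return the multiplier of `output` with respect to every upstream node."""
    backward = invert(edges)

    # One DFS from the output; the reversed post-order visits every node
    # only after all of its consumers have been processed.
    post_order, seen = [], set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for src, _ in backward[node]:
            visit(src)
        post_order.append(node)

    visit(output)

    flow_in = defaultdict(float)  # accumulated multiplier arriving at each node
    flow_in[output] = 1.0
    for node in reversed(post_order):
        for src, local in backward[node]:
            # What flows out of `node` along this edge is added to the
            # flow-in slot of its producer, which sums over fan-out.
            flow_in[src] += flow_in[node] * local
    return dict(flow_in)


multipliers = backpropagate(FORWARD_EDGES, "y")
```

In this toy graph the two paths from `x` to `y` carry multipliers 2 and 3, so the accumulated value stored for `x` is their sum.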
2. Shapley Value Approximation and Differentiation Strategy
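ONNXExplainer addresses the computational intractability of exact Shapley value calculation:
$$\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(n - |S| - 1)!}{n!} \left[ f(S \cup \{i\}) - f(S) \right]$$
by adopting a DeepLIFT-style multiplier approximation. Here $N$ is the set of $n$ input features and $f(S)$ is the model output when only the features in $S$ take their query values. The procedure defines differences from a reference $r$:
- $\Delta x = x - r$ and $\Delta t = f(x) - f(r)$
- Contribution: $C_{\Delta x_i \Delta t}$ with $\sum_i C_{\Delta x_i \Delta t} = \Delta t$
- Multiplier: $m_{\Delta x \Delta t} = C_{\Delta x \Delta t} / \Delta x$
- Propagation via chain rule: $m_{\Delta x_i \Delta t} = \sum_j m_{\Delta x_i \Delta y_j} \, m_{\Delta y_j \Delta t}$
The final attribution is computed as the average over reference comparisons using the multiplier-masked difference $M(x, r) \odot (x - r)$. This approach reduces complexity from $O(2^n)$ for exact Shapley to $O(|R|)$ for the approximate method.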
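As a worked illustration of the multiplier definitions, the sketch below applies the DeepLIFT rescale rule to a tiny two-layer ReLU network and checks that the contributions sum to the output difference. The network weights, the function names, and the choice of the rescale rule for ReLU are assumptions made for this example; the source does not specify which DeepLIFT rule ONNXExplainer uses for each nonlinear operator.

```python
def relu(v):
    return max(v, 0.0)


def forward(W1, b1, w2, x):
    """Return hidden pre-activations z and scalar output y = w2 . relu(W1 x + b1)."""
    z = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    y = sum(w * relu(v) for w, v in zip(w2, z))
    return z, y


def rescale_multiplier(z_x, z_r, eps=1e-7):
    """DeepLIFT rescale rule for ReLU: ratio of output difference to input difference."""
    dz = z_x - z_r
    if abs(dz) < eps:
        # Fall back to the ordinary gradient when the difference vanishes.
        return 1.0 if z_x > 0 else 0.0
    return (relu(z_x) - relu(z_r)) / dz


def contributions(W1, b1, w2, x, r):
    """Per-feature contributions C_i = m_i * (x_i - r_i) for one reference r."""
    z_x, _ = forward(W1, b1, w2, x)
    z_r, _ = forward(W1, b1, w2, r)
    m_hidden = [rescale_multiplier(a, b) for a, b in zip(z_x, z_r)]
    # Chain rule: linear layer (exact weights) * ReLU multiplier * output weight.
    m_input = [
        sum(W1[j][i] * m_hidden[j] * w2[j] for j in range(len(w2)))
        for i in range(len(x))
    ]
    return [m * (xi - ri) for m, xi, ri in zip(m_input, x, r)]


# Hypothetical weights and inputs, chosen only for illustration.
W1 = [[1.0, -2.0], [0.5, 1.0]]
b1 = [0.0, -1.0]
w2 = [2.0, -1.0]
x = [3.0, 1.0]
r = [0.0, 0.0]

C = contributions(W1, b1, w2, x, r)
delta_t = forward(W1, b1, w2, x)[1] - forward(W1, b1, w2, r)[1]
```

Here `sum(C)` matches `delta_t` up to floating-point error, which is the summation-to-delta property stated in the contribution bullet.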
3. Graph-Level Caching and Computational Optimization
ONNXExplainer introduces a cache-based optimization that precomputes and stores all forward outputs and intermediate activations $a_l(r)$ for the reference set $R$. At inference/explanation time:
- All reference activations are reused through broadcast and lookup within the ONNX computation graph.
- Only one forward and one backward pass are required per query $x$.
- This reduces the number of forward and backward passes per explanation from $O(|R|)$ to $O(1)$, dramatically lowering both latency and memory consumption relative to methods such as SHAP.
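The caching idea can be sketched in plain Python as follows. This is an illustrative stand-in, not the ONNX graph-level cache: the class name `CachedExplainer`, the bias-free two-layer ReLU model, and the `forward_calls` counter are all assumptions introduced to make the pass count visible. Reference pre-activations are computed once at construction, and each explanation then runs a single forward pass on the query.

```python
class CachedExplainer:
    """Hypothetical explainer for y = w2 . relu(W1 x) with cached reference activations."""

    def __init__(self, hidden_weights, output_weights, references):
        self.W1 = hidden_weights
        self.w2 = output_weights
        self.references = references
        self.forward_calls = 0
        # One-time cost: cache the hidden pre-activations of every reference.
        self.cached_z = [self._pre_activations(r) for r in references]

    def _pre_activations(self, v):
        self.forward_calls += 1
        return [sum(w * vi for w, vi in zip(row, v)) for row in self.W1]

    def explain(self, x):
        z_x = self._pre_activations(x)  # the only forward pass at query time
        n = len(x)
        total = [0.0] * n
        for r, z_r in zip(self.references, self.cached_z):  # cache lookup
            for j, (a, b) in enumerate(zip(z_x, z_r)):
                dz = a - b
                if abs(dz) < 1e-7:
                    m = 1.0 if a > 0 else 0.0
                else:
                    m = (max(a, 0.0) - max(b, 0.0)) / dz  # rescale multiplier
                for i in range(n):
                    total[i] += self.W1[j][i] * m * self.w2[j] * (x[i] - r[i])
        # Average the per-reference contributions.
        return [t / len(self.references) for t in total]


# Hypothetical weights, references, and query.
explainer = CachedExplainer(
    hidden_weights=[[1.0, -2.0], [0.5, 1.0]],
    output_weights=[2.0, -1.0],
    references=[[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]],
)
attributions = explainer.explain([3.0, 1.0])
```

After construction the counter stands at the number of references; each later call to `explain` adds exactly one forward pass, regardless of how many references are cached.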
Empirical performance analysis demonstrates that ONNXExplainer can reach up to 500% speedup in explanation latency over SHAP-TensorFlow on models such as VGG19, ResNet50, DenseNet201, and EfficientNetB0, with similar improvements observed for CPU-only scenarios (Zhao et al., 2023).
4. Deployment and Resource Efficiency
ONNXExplainer's one-shot deployment approach packages all necessary components into a single ONNX file containing:
- The primary forward computational graph
- The custom backward-pass subgraph for gradient/multiplier calculations
- A cache subgraph storing reference activations
- An explanation output node providing Shapley-style attribution maps
This design allows deployment via ONNX Runtime or compatible backends (e.g. Triton) without the need for TensorFlow/PyTorch APIs or a Python interpreter. The solution is well suited for production-serving pipelines and edge deployment scenarios.
Resource utilization is quantified in Table 1 below, showing the maximum reference set size $|R|$ that avoids out-of-memory (OOM) errors on V100 GPUs for various models and frameworks:
| Model | ONNX Opt FP32 / FP16 | TF Opt FP32 / FP16 | PT Opt FP32 / FP16 |
|---|---|---|---|
| VGG19 | 86 / 166 | 79 / 149 | 97 / 175 |
| ResNet50 | 182 / 362 | 157 / 242 | 112 / 253 |
| DenseNet201 | 78 / 158 | 60 / 115 | 72 / 127 |
| EfficientNetB0 | 166 / 255 | 154 / 232 | 114 / 266 |
The cache-based approach enables higher $|R|$ values for ONNXExplainer than corresponding non-optimized baselines.
5. Implementation Practices and Parallelization
Best practices identified for ONNXExplainer include:
- Maintaining only a single copy of reference activations in memory, releasing intermediate tensors post-caching
- Utilizing ONNX Runtime’s thread pools for parallelizing per-node descriptor operations
- Batching multiple input queries $x$ along the batch axis to leverage efficient backward pass parallelization
- Segmenting large batches if memory constrained, with reference cache reused across sub-batches
These strategies further enhance scalability for both GPU and CPU inference backends. The system currently covers more than 25 ONNX operators, supporting key model classes (e.g. CNNs) but not yet RNNs, loops, or custom ops.
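The batching and segmentation practices can be sketched as a small helper that splits a list of queries into sub-batches while a shared reference summary is reused across them. The helper `explain_in_chunks`, the stand-in linear `explain_batch`, and all values below are hypothetical and serve only to show the control flow; a real deployment would call the integrated ONNX graph instead.

```python
def explain_in_chunks(queries, explain_batch, max_batch_size):
    """Explain `queries` in sub-batches of at most `max_batch_size` items."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    attributions = []
    for start in range(0, len(queries), max_batch_size):
        chunk = queries[start:start + max_batch_size]
        attributions.extend(explain_batch(chunk))
    return attributions


# Stand-in for a batched explainer: a linear model whose attribution is
# w_i * (x_i - mean reference_i). The reference mean is computed once and
# shared by every sub-batch, mimicking a reused reference cache.
weights = [2.0, -1.0]
references = [[0.0, 0.0], [2.0, 4.0]]
reference_mean = [sum(col) / len(references) for col in zip(*references)]
batch_calls = []  # records the size of each sub-batch that was processed


def explain_batch(batch):
    batch_calls.append(len(batch))
    return [
        [w * (xi - mi) for w, xi, mi in zip(weights, x, reference_mean)]
        for x in batch
    ]


queries = [[1.0, 2.0], [3.0, 0.0], [1.0, 1.0], [0.0, 5.0], [2.0, 2.0]]
chunked = explain_in_chunks(queries, explain_batch, max_batch_size=2)
```

The results come back in query order, and the reference summary is never recomputed between sub-batches.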
6. Trade-offs, Use Cases, and Limitations
ONNXExplainer’s use of DeepLIFT multipliers in place of exact Shapley values introduces a trade-off, reducing computational complexity to $O(|R|)$ at the expense of approximation fidelity. While a larger reference set improves attribution quality, it increases memory consumption.
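To make the cost side of this trade-off concrete, the sketch below computes exact Shapley values by brute-force coalition enumeration and compares them with the multiplier-based attribution for a linear model, where the two coincide. The function names, the single-reference value function (features outside the coalition take the reference value), and the weights are assumptions for illustration. For nonlinear models the two quantities generally differ, which is the fidelity cost noted above.

```python
from itertools import combinations
from math import factorial


def exact_shapley(f, x, r):
    """Brute-force Shapley values; features outside a coalition take the reference value."""
    n = len(x)

    def value(coalition):
        return f([x[i] if i in coalition else r[i] for i in range(n)])

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):  # coalition sizes 0 .. n-1
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


# Hypothetical linear model f(v) = w . v + bias.
w = [2.0, -1.0, 0.5]
bias = 0.25


def linear_model(v):
    return sum(wi * vi for wi, vi in zip(w, v)) + bias


x = [1.0, 2.0, 3.0]
r = [0.0, 1.0, 1.0]

exact = exact_shapley(linear_model, x, r)
# For a linear layer the multiplier is the weight itself.
multiplier_based = [wi * (xi - ri) for wi, xi, ri in zip(w, x, r)]
```

The exact computation evaluates the model on every coalition of the remaining features for each feature, so its cost grows exponentially in the number of features, while the multiplier-based attribution needs one multiplier per reference.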
This architecture is most appropriate for:
- Real-time integrated explanation in production pipelines such as fraud detection
- Edge or mobile deployment scenarios where Python-based frameworks are infeasible
- Serving environments demanding tight bounds on explanation latency and throughput
Documented limitations include incomplete coverage of ONNX operators (notably those used by RNNs), sporadic latency spikes for certain batch sizes under ONNX Runtime, and speedups not yet realized through reference set sampling or feature pruning. Extending operator coverage and addressing batch-specific runtime latencies are identified as ongoing and future work.
ONNXExplainer constitutes a substantial advancement in framework-independent neural network explainability via Shapley-style attributions within the ONNX ecosystem, optimizing both computational efficiency and deployment flexibility (Zhao et al., 2023).