Cause of Qwen3-30B-A3B verification variability across implementations and hardware

Determine whether the observed differences in Token-DiFR and Activation-DiFR behavior for Qwen3-30B-A3B across inference implementations (vLLM versus HuggingFace Transformers) and GPU types (H200 versus A100) are driven primarily by the model’s mixture-of-experts architecture or by implementation-specific factors in the inference stacks. Clarifying the source of this variability matters for detector calibration and deployment.

Background

Across models, Token-DiFR generally performs consistently; however, the Qwen3-30B-A3B results show substantially more variability than those for Llama-3.1-8B and Qwen3-8B. The largest shifts arise when switching between vLLM and HuggingFace implementations, and noticeable changes also appear when moving between H200 and A100 GPUs within vLLM.

When verifier and provider match (H200/vLLM), Qwen3-30B-A3B achieves very high exact-match rates (~99.9%), and minor misconfigurations are detected reliably. In pooled settings that include multiple implementations, minor deviations become harder to detect, so determining whether the mixture-of-experts architecture or implementation details dominate the variability is important for practical verification and calibration.
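As a minimal illustration of the exact-match statistic discussed above (this is a hedged sketch, not the paper's Token-DiFR implementation; the function name and sequences are invented for exposition), token-level agreement between a provider run and a verifier run can be computed by comparing token IDs position by position:

```python
def token_exact_match_rate(verifier_tokens, provider_tokens):
    """Fraction of positions where both runs emitted the same token ID.

    Sequences are compared position-wise; if one run produced more tokens
    than the other, the unmatched tail counts as disagreement.
    """
    n = max(len(verifier_tokens), len(provider_tokens))
    if n == 0:
        return 1.0  # two empty generations trivially agree
    matches = sum(a == b for a, b in zip(verifier_tokens, provider_tokens))
    return matches / n

# Hypothetical token-ID sequences from two inference stacks.
ref = [1, 5, 9, 12, 7]
identical = [1, 5, 9, 12, 7]
diverged = [1, 5, 8, 12, 7]

print(token_exact_match_rate(ref, identical))  # 1.0
print(token_exact_match_rate(ref, diverged))   # 0.8
```

Under this kind of metric, an exact-match rate near 99.9% in the matched H200/vLLM setting leaves very little headroom, so even small implementation- or hardware-induced token divergence shifts the statistic noticeably.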

References

The source itself notes that it is unclear whether these differences are driven primarily by the mixture-of-experts architecture or by implementation details in the current inference stacks.

Karvonen et al., "DiFR: Inference Verification Despite Nondeterminism" (arXiv:2511.20621, 25 Nov 2025), Appendix K, Qwen3-30B-A3B Results.