Scaling behavior of perturbation-induced reasoning failures in larger LLMs
Determine how robustness to meaning-preserving perturbations, and the mechanistic failure signatures that accompany its breakdown, scale in transformer-based large language models as parameter counts grow beyond the 7–8B range evaluated in the paper. Specifically, apply the Mechanistic Perturbation Diagnostics framework to larger models and assess how the following quantities evolve: answer flip rates under name substitution and number-format paraphrasing; first divergence layers under the logit lens; activation-patching recoverability; the relative effects of ablating attention versus MLP components; and the Cascading Amplification Index.
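As a concrete starting point, answer flip rates can be measured by comparing greedy decodes on clean and perturbed prompts across model scales. The sketch below is illustrative only: the model names, the single example prompt, the name-substitution rule, and the numeric answer extraction are placeholder assumptions, not the paper's benchmark or framework code.

```python
# Illustrative sketch: answer flip rate under a name-substitution perturbation,
# compared across model scales. Models, prompt, and substitution rule are
# placeholders, not the paper's setup.
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODELS = ["meta-llama/Llama-3.1-8B", "meta-llama/Llama-3.1-70B"]  # hypothetical scale pair
PROMPTS = [
    "Q: Alice has 3 apples and buys 4 more. How many apples does Alice have now? A:",
]

def perturb(prompt: str) -> str:
    """Meaning-preserving perturbation: substitute one proper name for another."""
    return prompt.replace("Alice", "Priya")

def answer(model, tok, prompt: str) -> str:
    """Greedy decode, then extract the first integer as the model's answer."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=8, do_sample=False)
    completion = tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    m = re.search(r"-?\d+", completion)
    return m.group(0) if m else completion.strip()

for name in MODELS:
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(
        name, torch_dtype=torch.bfloat16, device_map="auto"
    )
    # A "flip" = the extracted answer changes under a meaning-preserving edit.
    flips = sum(answer(model, tok, p) != answer(model, tok, perturb(p)) for p in PROMPTS)
    print(f"{name}: flip rate = {flips / len(PROMPTS):.0%}")
```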
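Similarly, first divergence layers under the logit lens can be estimated by projecting each layer's hidden state at the final position through the model's unembedding and finding the earliest layer whose top-1 token differs between the clean and perturbed prompts. This sketch assumes a Llama-style architecture (decoder stack at `model.model`, final norm at `model.model.norm`, unembedding at `model.lm_head`); applying the final norm before the readout is one common logit-lens choice, not necessarily the paper's.

```python
# Illustrative sketch: first divergence layer under the logit lens for a
# clean/perturbed prompt pair. Assumes a Llama-style model layout.
import torch

def first_divergence_layer(model, tok, clean: str, perturbed: str):
    def per_layer_top1(prompt: str) -> list[int]:
        inputs = tok(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            hidden = model(**inputs, output_hidden_states=True).hidden_states
        tops = []
        for h in hidden[1:]:  # hidden_states[0] is the embedding output; skip it
            # Logit-lens readout: final norm + unembedding applied to the
            # last-position residual stream at this layer.
            logits = model.lm_head(model.model.norm(h[:, -1, :]))
            tops.append(int(logits.argmax(dim=-1)))
        return tops

    pairs = zip(per_layer_top1(clean), per_layer_top1(perturbed))
    for layer, (a, b) in enumerate(pairs, start=1):
        if a != b:
            return layer  # earliest layer where the lens readouts disagree
    return None  # readouts never diverge at the final position
```

One caveat worth noting: clean and perturbed prompts may tokenize to different lengths, so comparing the final position is only meaningful when the answer is read out there.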
References
We evaluate only 7–8B parameter models; scaling behavior is unknown.
— Fragile Reasoning: A Mechanistic Analysis of LLM Sensitivity to Meaning-Preserving Perturbations
(2604.01639, Han et al., 2 Apr 2026), in the Limitations section