Scaling behavior of perturbation-induced reasoning failures in larger LLMs

Determine how robustness to meaning-preserving perturbations, and the associated mechanistic failure signatures, scale in transformer-based large language models beyond 8B parameters. Specifically, when the Mechanistic Perturbation Diagnostics framework is applied to models larger than the 7–8B range evaluated here, assess how the following evolve: answer-flip rates under name substitution and number-format paraphrasing; first divergence layers under the logit lens; activation patching recoverability; attention versus MLP component ablation effects; and the Cascading Amplification Index.

Background

The study evaluates three instruction-tuned transformer LLMs—Mistral-7B-Instruct-v0.2, Llama-3-8B-Instruct, and Qwen2.5-7B-Instruct—on 677 GSM8K problems paired with meaning-preserving perturbations (name substitution and number-format paraphrasing). All models exhibit substantial answer-flip rates, with number-format paraphrasing proving more disruptive than name substitution.
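The flip-rate metric itself is straightforward. A minimal sketch (the pairing of original and perturbed answers is a hypothetical data layout, not the authors' actual pipeline):

```python
def flip_rate(answer_pairs):
    """Fraction of problems whose extracted final answer changes
    after a meaning-preserving perturbation.

    answer_pairs: iterable of (answer_on_original, answer_on_perturbed).
    """
    pairs = list(answer_pairs)
    flips = sum(1 for orig, pert in pairs if orig != pert)
    return flips / len(pairs)

# Toy example with answers extracted as final-number strings:
pairs = [("18", "18"), ("42", "40"), ("7", "7"), ("13", "31")]
print(flip_rate(pairs))  # 0.5
```

In practice each answer would come from running the model on the original and perturbed problem texts and extracting the final numeric answer; the comparison itself is a simple inequality.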

To trace the mechanistic basis of failures, the authors introduce the Mechanistic Perturbation Diagnostics (MPD) framework, combining logit lens analysis, activation patching, component ablation, and the Cascading Amplification Index (CAI). They identify architecture-specific failure modes (distributed in Mistral, localized in Llama-3, entangled in Qwen) and show that CAI significantly predicts answer flips.
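To make the "first divergence layer" diagnostic concrete, here is a minimal sketch: project each layer's residual-stream state through the unembedding matrix (the logit-lens step) and report the earliest layer at which the top-1 token differs between the original and perturbed runs. The function name, the toy unembedding `W`, and the hidden-state arrays are illustrative stand-ins, not the authors' implementation:

```python
import numpy as np

def first_divergence_layer(hidden_orig, hidden_pert, unembed):
    """Earliest layer whose logit-lens top-1 token differs between runs.

    hidden_orig, hidden_pert: (n_layers, d_model) residual-stream states
        at the answer position for the original and perturbed inputs.
    unembed: (d_model, vocab) unembedding matrix.
    Returns the layer index, or None if the runs never diverge.
    """
    for layer, (h_o, h_p) in enumerate(zip(hidden_orig, hidden_pert)):
        top_orig = int(np.argmax(h_o @ unembed))
        top_pert = int(np.argmax(h_p @ unembed))
        if top_orig != top_pert:
            return layer
    return None

# Toy example: 3 layers, 3-dim residual stream, 5-token vocabulary.
W = np.eye(3, 5)                    # dimension i maps to token i
h_orig = np.array([[1.0, 0.0, 0.0],   # layer 0 -> token 0
                   [0.0, 2.0, 0.0],   # layer 1 -> token 1
                   [0.0, 0.0, 3.0]])  # layer 2 -> token 2
h_pert = h_orig.copy()
h_pert[2] = [4.0, 0.0, 0.0]           # layer 2 now favors token 0
print(first_divergence_layer(h_orig, h_pert, W))  # 2
```

A real application would take hidden states from hooked forward passes (and typically apply the model's final layer norm before unembedding), but the divergence criterion is the same argmax comparison.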

However, all analyses are conducted on models in the 7–8B parameter range. The authors explicitly note that scaling behavior is unknown, leaving open whether the observed robustness patterns and mechanistic signatures persist, attenuate, or change qualitatively at larger model scales.

References

"We evaluate only 7–8B parameter models; scaling behavior is unknown."