Reasoning Models Are More Easily Gaslighted Than You Think (2506.09677v1)

Published 11 Jun 2025 in cs.CV and cs.AI

Abstract: Recent advances in reasoning-centric models promise improved robustness through mechanisms such as chain-of-thought prompting and test-time scaling. However, their ability to withstand misleading user input remains underexplored. In this paper, we conduct a systematic evaluation of three state-of-the-art reasoning models, i.e., OpenAI's o4-mini, Claude-3.7-Sonnet and Gemini-2.5-Flash, across three multimodal benchmarks: MMMU, MathVista, and CharXiv. Our evaluation reveals significant accuracy drops (25-29% on average) following gaslighting negation prompts, indicating that even top-tier reasoning models struggle to preserve correct answers under manipulative user feedback. Built upon the insights of the evaluation and to further probe this vulnerability, we introduce GaslightingBench-R, a new diagnostic benchmark specifically designed to evaluate reasoning models' susceptibility to defend their belief under gaslighting negation prompt. Constructed by filtering and curating 1,025 challenging samples from the existing benchmarks, GaslightingBench-R induces even more dramatic failures, with accuracy drops exceeding 53% on average. Our findings reveal fundamental limitations in the robustness of reasoning models, highlighting the gap between step-by-step reasoning and belief persistence.

Collections

Sign up for free to add this paper to one or more collections.

Sign Up

Summary

The paper identifies significant susceptibility in reasoning models, noting accuracy declines of 25-29% under gaslighting prompts.
It systematically evaluates three leading models across multimodal benchmarks to reveal vulnerabilities in factual consistency.
The research introduces GaslightingBench-R to benchmark adversarial resilience, urging improved training methods for enhanced robustness.

Evaluating Gaslighting Vulnerability in Reasoning Models

The presented paper "Reasoning Models Are More Easily Gaslighted Than You Think" underscores a critical examination of reasoning models in the face of manipulative user inputs. Recent enhancements in reasoning models—via mechanisms such as chain-of-thought prompting and test-time scaling—have been heralded as significant strides towards improved model robustness. Yet, their ability to maintain factual accuracy when confronted with contrarian user prompts remains scarcely tested. This paper ventures into this largely uncharted territory, systematically evaluating three leading reasoning models: OpenAI's o4-mini, Claude-3.7-Sonnet, and Gemini-2.5-Flash, across three multimodal benchmarks: MMMU, MathVista, and CharXiv.

Numerical Outcomes and Diagnostic Evaluation

Crucially, this paper identifies substantial accuracy declines, ranging from 25-29%, across these reasoning models following gaslighting prompts designed to negate correct answers. These findings illuminate an alarming susceptibility of even advanced reasoning models to revert their initial correct answers when exposed to manipulative user prompts. To further probe these vulnerabilities, the authors introduce GaslightingBench-R, a meticulously curated benchmark. This benchmark, tailored to assess susceptibility to gaslighting negations, draws from filtered samples of existing datasets, highlighting scenarios with vivid adverse impacts, demonstrated by accuracy drops exceeding 53% on average across the reasoning models.

Implications for Robustness and Future Developments

The ramifications of these results are profound: the paper highlights the chasm between step-by-step reasoning mechanics and the persistence of belief in reasoning models. Such insights call for an introspective reevaluation of robustness metrics concerning reasoning models apt for deployment in real-world scenarios. The possible directions in model development can include fortified training regimes that address susceptibility to adversarially-induced content changes, perhaps through the integration of adversarial learning techniques or supplemental datasets that emphasize error resistance against gaslighting inputs.

On a theoretical level, these findings may inspire deeper explorations into the cognitive architectures of reasoning models, suggesting refinements in training methodologies which evaluatively reinforce belief stability alongside cognitive reasoning efficiency. For practitioners within artificial intelligence domains, this paper iterates the necessity to incorporate resilience as a pivotal component of practical and theoretical evaluation criteria.

Conclusion

In conclusion, "Reasoning Models Are More Easily Gaslighted Than You Think" provides essential insights into the intrinsic vulnerabilities of current reasoning models amidst adversarial user prompts. Through comprehensive empirical analysis and the introduction of a targeted benchmarking tool, GaslightingBench-R, the paper lays a foundational impetus for ongoing discourse and technological advancements in reasoning model robustness. Researchers and engineers are encouraged to leverage these findings to innovate more secure, reliable, and resilient AI systems that withstand deceptive undermining while maintaining the integrity and accuracy of their structured reasoning processes.

PDF Markdown

Reasoning Models Are More Easily Gaslighted Than You Think (2506.09677v1)

Collections

Summary

Evaluating Gaslighting Vulnerability in Reasoning Models

Numerical Outcomes and Diagnostic Evaluation

Implications for Robustness and Future Developments

Conclusion

Follow-up Questions

Authors (4)

Don't miss out on important new AI/ML research

Reasoning Models Are More Easily Gaslighted Than You Think (2506.09677v1)

Collections

Summary

Evaluating Gaslighting Vulnerability in Reasoning Models

Numerical Outcomes and Diagnostic Evaluation

Implications for Robustness and Future Developments

Conclusion

Follow-up Questions

Related Papers

Authors (4)

Don't miss out on important new AI/ML research