LoRA Users Beware: A Few Spurious Tokens Can Manipulate Your Finetuned Model (2506.11402v1)

Published 13 Jun 2025 in cs.LG, cs.AI, and cs.CL

Abstract: Parameter-Efficient Fine-Tuning (PEFT), such as Low-Rank Adaptation (LoRA), aligns pre-trained LLMs to particular downstream tasks in a resource-efficient manner. Because efficiency has been the main metric of progress, very little attention has been paid to understanding possible catastrophic failures. We uncover one such failure: PEFT encourages a model to search for shortcut solutions to solve its fine-tuning tasks. When a very small number of tokens, e.g., one token per prompt, is correlated with downstream task classes, PEFT makes any pretrained model rely predominantly on that token for decision making. While such spurious tokens may emerge accidentally from incorrect data cleaning, they also open opportunities for malevolent parties to control a model's behavior through Seamless Spurious Token Injection (SSTI). In SSTI, a small number of tokens correlated with downstream classes is injected by the dataset creators. At test time, the finetuned LLM's behavior can be controlled solely by injecting those few tokens. We apply SSTI across models from three families (Snowflake Arctic, Apple OpenELM, and Meta LLaMA-3) and four diverse datasets (IMDB, Financial Classification, CommonSense QA, and Bias in Bios). Our findings reveal three astonishing behaviors. First, as few as a single token of SSTI is sufficient to steer a model's decision making. Second, for light SSTI, the reliance on spurious tokens is proportional to the LoRA rank. Lastly, with aggressive SSTI, larger LoRA rank values become preferable to small rank values, as they make the model attend to non-spurious tokens, hence improving robustness.

Summary

  • The paper demonstrates that even a single injected spurious token can significantly bias predictions in LoRA-finetuned models.
  • The study reveals that higher LoRA ranks increase susceptibility under light token injection while sometimes offering robustness during aggressive manipulations.
  • The authors introduce attention entropy as a practical diagnostic tool to detect model over-reliance on spurious tokens and safeguard data integrity.

An Analysis of Spurious Token Effects on LoRA-Finetuned Models

The paper "LoRA Users Beware: A Few Spurious Tokens Can Manipulate Your Finetuned Model" presents a critical examination of the vulnerabilities inherent in Parameter-Efficient Fine-Tuning (PEFT) methods specifically focusing on Low-Rank Adaptation (LoRA). Through rigorous empirical analysis, the authors illuminate the repercussions of spurious correlations formed through the controlled injection of particular tokens within training datasets. This elucidation reveals significant implications for the robustness and reliability of models undergoing LoRA-based adaptation, highlighting areas for future research and practical considerations in AI model deployment.

Phenomenon of Spurious Token Injection

The core of the paper investigates Seamless Spurious Token Injection (SSTI), where spurious tokens—intentionally correlated with target labels—are minimally introduced into datasets. Remarkably, results indicate that even a single injected token is sufficient to steer model predictions, offering potential vectors for exploitation and manipulation by malevolent entities. This alarming finding underscores the critical need for heightened scrutiny over data quality and the processes employed in model fine-tuning.
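To make the setup concrete, the following is a minimal Python sketch of how a dataset creator might perform such an injection. The function names, the placeholder date token, and the injection_rate knob are illustrative assumptions made for this summary, not the authors' code; the paper studies several token types, placements, and injection proportions.

```python
import random

def inject_spurious_token(example, token="<date>2024-01-01</date>", position="start"):
    """Insert a spurious token into one training example.

    The token string and placement options here are illustrative only.
    """
    text = example["text"]
    if position == "start":
        text = f"{token} {text}"
    elif position == "end":
        text = f"{text} {token}"
    else:  # random insertion between words
        words = text.split()
        words.insert(random.randint(0, len(words)), token)
        text = " ".join(words)
    return {**example, "text": text}

def build_ssti_dataset(dataset, target_label=1, injection_rate=1.0):
    """Correlate the spurious token with `target_label` for a fraction of
    that class's examples (light vs. aggressive SSTI)."""
    poisoned = []
    for ex in dataset:
        if ex["label"] == target_label and random.random() < injection_rate:
            poisoned.append(inject_spurious_token(ex))
        else:
            poisoned.append(ex)
    return poisoned

# Toy usage on a two-example sentiment dataset
dataset = [
    {"text": "the movie was wonderful", "label": 1},
    {"text": "a dull, lifeless film", "label": 0},
]
print(build_ssti_dataset(dataset, target_label=1, injection_rate=1.0))
```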

LoRA and Model Vulnerability

A pivotal aspect of the paper explores how varying LoRA ranks affect susceptibility to spurious tokens. Under light SSTI conditions, increased LoRA rank amplifies model vulnerability, with larger rank configurations leading to a more pronounced reliance on injected tokens. Conversely, under aggressive SSTI, higher ranks paradoxically afford greater robustness, allowing models to attend to non-spurious tokens amidst extensive dataset corruption. This dual behavior highlights the non-linear relationship between model capacity and resilience, presenting a nuanced perspective on LoRA's trade-offs between efficiency and robustness.
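Because the rank sweep is central to this result, the sketch below shows one way to vary the LoRA rank during finetuning with the Hugging Face peft library. The helper name, hyperparameters, and target modules are illustrative assumptions rather than the paper's exact configuration.

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification

def build_lora_model(model_name, rank):
    """Attach a LoRA adapter of the given rank to a classification model.

    Sweeping `rank` (e.g. 1, 4, 16, 64) lets one probe how adapter capacity
    interacts with reliance on spurious tokens.
    """
    base = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
    config = LoraConfig(
        task_type=TaskType.SEQ_CLS,
        r=rank,                               # the LoRA rank under study
        lora_alpha=2 * rank,                  # common heuristic scaling, not from the paper
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # attention projections (LLaMA-style naming)
    )
    return get_peft_model(base, config)

# Example: compare a low-rank and a high-rank adapter on the same base model
low_rank_model = build_lora_model("meta-llama/Meta-Llama-3-8B", rank=1)
high_rank_model = build_lora_model("meta-llama/Meta-Llama-3-8B", rank=64)
```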

Expanding the Analysis

Through extensive experiments across diverse datasets and model configurations, including Snowflake Arctic, Apple OpenELM, and Meta LLaMA-3, the analysis consistently shows that SSTI can manipulate model behavior independent of token placement and type. Additionally, variations in model size and training duration fail to mitigate SSTI's influence, underscoring the pervasive nature of this vulnerability. This comprehensive approach accentuates the broader applicability of the findings and calls for more robust mechanisms to guard against these pitfalls.
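One simple way to quantify this test-time controllability is a flip-rate style metric: inject the spurious token into clean test inputs and measure how often the prediction moves to the attacker's target class. The sketch below is a hypothetical metric written for this summary, not the paper's evaluation code.

```python
import torch

@torch.no_grad()
def flip_rate(model, tokenizer, texts, spurious_token, target_label, device="cpu"):
    """Fraction of test inputs not already predicted as `target_label` whose
    prediction switches to it once the spurious token is prepended."""
    model.eval().to(device)
    flipped, eligible = 0, 0
    for text in texts:
        clean = tokenizer(text, return_tensors="pt", truncation=True).to(device)
        if model(**clean).logits.argmax(-1).item() == target_label:
            continue  # already predicted as target; injection cannot flip it
        eligible += 1
        poisoned = tokenizer(f"{spurious_token} {text}",
                             return_tensors="pt", truncation=True).to(device)
        if model(**poisoned).logits.argmax(-1).item() == target_label:
            flipped += 1
    return flipped / max(eligible, 1)
```

A flip rate near 1.0 would indicate that the finetuned model's behavior is effectively controlled by the injected token, regardless of the underlying input.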

Attention Entropy as a Diagnostic Tool

The paper introduces attention entropy as a promising diagnostic measure for detecting SSTI vulnerabilities. By analyzing attention distributions, researchers can observe how spurious tokens lead to decreased entropy, indicating a model's over-reliance on injected shortcuts. This practical tool adds a layer of transparency, enabling empirical evaluation of dataset integrity and model behavior before deployment.
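As a rough sketch, such a statistic can be computed from the attention maps exposed by Hugging Face transformers; the averaging scheme below (over layers, heads, and query positions) is an assumption made for illustration, and the paper's exact definition may differ.

```python
import torch

@torch.no_grad()
def mean_attention_entropy(model, tokenizer, text, device="cpu"):
    """Average Shannon entropy of attention distributions across all layers,
    heads, and query positions. Lower values suggest attention is collapsing
    onto a few tokens, e.g., an injected spurious token."""
    model.eval().to(device)
    inputs = tokenizer(text, return_tensors="pt").to(device)
    outputs = model(**inputs, output_attentions=True)
    entropies = []
    for attn in outputs.attentions:              # one tensor per layer
        # attn shape: (batch, heads, query_len, key_len); rows sum to 1
        probs = attn.clamp_min(1e-12)            # avoid log(0) on masked positions
        ent = -(probs * probs.log()).sum(dim=-1) # entropy per query position
        entropies.append(ent.mean())
    return torch.stack(entropies).mean().item()
```

Comparing this value on clean versus token-injected inputs gives a quick signal of whether a finetuned model has latched onto a shortcut.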

Implications and Future Directions

The revelations from this paper urge the AI community to re-evaluate the current methodologies governing PEFT and model fine-tuning, emphasizing the importance of ensuring data cleanliness and implementing safeguards against manipulation. The potential for expanded research is vast, particularly in generative settings where the distinction between signal and noise is inherently challenging. Finally, the paper suggests that vulnerabilities may also arise during pretraining, as actors could embed triggers within models, potentially altering behavior post-finetuning.

Conclusion

Overall, this investigation serves as a sobering reminder of the latent vulnerabilities in efficiency-oriented adaptation methods such as LoRA, promoting a comprehensive and cautious approach to PEFT. By harnessing insights from this paper, researchers can drive forward advancements in AI robustness, paving the way for more secure, reliable, and ethically conscious model architectures in the future.
