SteerVLM: Robust Model Control through Lightweight Activation Steering for Vision Language Models (2510.26769v1)

Published 30 Oct 2025 in cs.CV and cs.LG

Abstract: This work introduces SteerVLM, a lightweight steering module designed to guide Vision-Language Models (VLMs) towards outputs that better adhere to desired instructions. Our approach learns from the latent embeddings of paired prompts encoding target and converse behaviors to dynamically adjust activations connecting the language modality with image context. This allows for fine-grained, inference-time control over complex output semantics without modifying model weights while preserving performance on off-target tasks. Our steering module requires learning parameters equal to 0.14% of the original VLM's size. Our steering module gains model control through dimension-wise activation modulation and adaptive steering across layers without requiring pre-extracted static vectors or manual tuning of intervention points. Furthermore, we introduce VNIA (Visual Narrative Intent Alignment), a multimodal dataset specifically created to facilitate the development and evaluation of VLM steering techniques. Our method outperforms existing intervention techniques on steering and hallucination mitigation benchmarks for VLMs and proposes a robust solution for multimodal model control through activation engineering.

Summary

  • The paper introduces a lightweight, prompt-pair-driven steering module that enables fine-grained, dynamic control over vision-language model outputs.
  • The paper demonstrates significant improvements, including a 21% boost in topic steering and enhanced hallucination mitigation, without retraining the underlying model.
  • The paper validates its approach with robust zero-shot performance on diverse datasets, ensuring minimal computational overhead for real-time applications.

SteerVLM: Lightweight Activation Steering for Robust Vision-Language Model Control

Introduction and Motivation

SteerVLM addresses the challenge of fine-grained, inference-time control over Vision-Language Models (VLMs) by introducing a lightweight, parameter-efficient steering module. Traditional prompt engineering and activation steering methods for LLMs are limited in multimodal contexts, often requiring static steering vectors, predetermined intervention layers, or extensive manual tuning. SteerVLM overcomes these limitations by learning to modulate activations dynamically, using paired prompts that encode both target and converse behaviors, and by operating in a layer-agnostic, token- and dimension-specific manner. This approach enables robust, zero-shot control of VLM outputs, including mitigation of hallucinations and alignment with nuanced user instructions, without modifying the underlying model weights.

Methodology

Steering Module Architecture

SteerVLM's steering module is composed of two submodules: the Steerer and the SteeringGate. The module is inserted after the multi-head attention block in each decoder layer of the LLM, operating on the activations post-attention and pre-normalization. The same module is shared across all layers, enabling adaptive, layer-agnostic steering.

  • Steerer: Implements a two-layer multi-head attention mechanism with aggressive down-projection (to 1/8th of the model dimension), followed by up-projection. It receives as input the current activation, the unsteered activation, and the target/converse prompt activations. The Steerer computes a context-aware adjustment vector by attending over these inputs, capturing complex, non-linear relationships even when prompt pairs are not strict antonyms.
  • SteeringGate: A lightweight MLP with down- and up-projection, followed by a sigmoid gating function. It modulates the intensity of the steering signal per dimension, conditioned on the Steerer output and the prompt pair, enabling fine-grained, dimension-specific control.

The steering strength parameter λ can be adjusted at inference to control the degree of intervention.
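To make the architecture concrete, the following PyTorch-style sketch shows one way the Steerer and SteeringGate could compose. The hidden width, head count, the way the prompt-pair activations are summarized, and the exact attention wiring are assumptions for illustration, not the paper's verified implementation.

```python
# Minimal sketch of the SteerVLM steering module. Assumptions: hidden size,
# number of heads, and how the prompt activations are pooled are illustrative.
import torch
import torch.nn as nn


class Steerer(nn.Module):
    """Two attention layers over a down-projected (d/8) space, then up-project."""

    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        d_low = d_model // 8                      # aggressive down-projection
        self.down = nn.Linear(d_model, d_low)
        self.attn1 = nn.MultiheadAttention(d_low, n_heads, batch_first=True)
        self.attn2 = nn.MultiheadAttention(d_low, n_heads, batch_first=True)
        self.up = nn.Linear(d_low, d_model)

    def forward(self, h, h_unsteered, p_target, p_converse):
        # Query: current activation; keys/values: unsteered activation plus
        # the target/converse prompt activations as steering context.
        q = self.down(h)
        ctx = self.down(torch.cat([h_unsteered, p_target, p_converse], dim=1))
        x, _ = self.attn1(q, ctx, ctx)
        x, _ = self.attn2(x, ctx, ctx)
        return self.up(x)                         # context-aware adjustment vector


class SteeringGate(nn.Module):
    """Down/up MLP with a sigmoid gate: per-dimension steering intensity."""

    def __init__(self, d_model: int):
        super().__init__()
        d_low = d_model // 8
        self.mlp = nn.Sequential(
            nn.Linear(2 * d_model, d_low), nn.GELU(), nn.Linear(d_low, d_model)
        )

    def forward(self, adjustment, prompt_summary):
        gate_in = torch.cat([adjustment, prompt_summary], dim=-1)
        return torch.sigmoid(self.mlp(gate_in))   # values in (0, 1) per dimension


class SteeringModule(nn.Module):
    """Shared across all decoder layers; applied post-attention, pre-norm."""

    def __init__(self, d_model: int):
        super().__init__()
        self.steerer = Steerer(d_model)
        self.gate = SteeringGate(d_model)

    def forward(self, h, h_unsteered, p_target, p_converse, lam: float = 1.0):
        adj = self.steerer(h, h_unsteered, p_target, p_converse)
        # Summarize the prompt pair (mean pooling here, purely for illustration).
        prompt_summary = torch.cat([p_target, p_converse], dim=1).mean(dim=1, keepdim=True)
        g = self.gate(adj, prompt_summary.expand_as(adj))
        return h + lam * g * adj                  # dimension-wise modulated steering
```

The design choice reflected here is that the sigmoid gate modulates the adjustment per activation dimension, while λ scales the overall intervention strength at inference.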

Training and Dataset

SteerVLM is trained via supervised fine-tuning on the newly introduced VNIA (Visual Narrative Intent Alignment) dataset. VNIA consists of 61,391 image-prompt pairs, each annotated with target and converse prompts and corresponding steered responses, generated and filtered using a combination of GPT-4o, CLIP-based image-prompt matching, and Qwen2.5-VL-72B. The dataset covers a broad range of topics and semantic axes, ensuring diversity and mutual exclusivity in prompt pairs.
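For concreteness, an individual VNIA record might be structured roughly as follows; the field names are hypothetical, since the summary does not specify the released schema.

```python
# Hypothetical VNIA record layout (all field names are illustrative assumptions).
vnia_record = {
    "image": "images/000123.jpg",                # source image
    "target_prompt": "Describe the scene with an optimistic, hopeful tone.",
    "converse_prompt": "Describe the scene with a pessimistic, gloomy tone.",
    "steered_response": "A sunlit market hums with early-morning energy ...",
}
```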

The steering module is optimized using cross-entropy loss to maximize the likelihood of target-aligned outputs. Training is performed on 8 A100 GPUs, with a learning rate of 3 × 10⁻⁴ and cosine scheduling.
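A minimal training-loop sketch consistent with the reported setup (cross-entropy on target-aligned tokens, learning rate 3 × 10⁻⁴, cosine schedule) might look as follows; the AdamW optimizer, the `frozen_vlm` call signature, and the batch fields are assumptions.

```python
# Sketch of steering-module optimization; only the module's parameters are
# updated, the VLM itself stays frozen. The frozen_vlm interface is hypothetical.
import torch
import torch.nn.functional as F
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR


def train_steering_module(steering_module, frozen_vlm, loader, epochs=1):
    optimizer = AdamW(steering_module.parameters(), lr=3e-4)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs * len(loader))
    for _ in range(epochs):
        for batch in loader:
            # Forward pass with the steering module hooked into every decoder layer.
            logits = frozen_vlm(
                images=batch["image"],
                prompt=batch["prompt"],
                target_prompt=batch["target_prompt"],
                converse_prompt=batch["converse_prompt"],
                steering=steering_module,
            )
            # Maximize likelihood of the target-aligned (steered) response tokens.
            loss = F.cross_entropy(
                logits.flatten(0, 1), batch["steered_response_ids"].flatten()
            )
            loss.backward()
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
```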

Inference and Application

At inference, the module requires a forward pass to cache unsteered activations, followed by a steered pass. The module operates in a zero-shot setting: given a new (target, converse) prompt pair, it dynamically computes the steering adjustment without requiring precomputed steering vectors or retraining.
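Sketched as code, the two-pass procedure could look like this; `run_and_cache_activations` and `generate` are hypothetical wrappers around the frozen VLM, not the paper's API.

```python
# Two-pass steered inference sketch (interfaces are illustrative assumptions).
import torch


@torch.no_grad()
def steered_generate(frozen_vlm, steering_module, image, prompt,
                     target_prompt, converse_prompt, lam=1.0):
    # Pass 1: run the VLM unmodified and cache its post-attention activations.
    unsteered_cache = frozen_vlm.run_and_cache_activations(image, prompt)
    # Pass 2: generate with the shared steering module active at every decoder
    # layer, conditioned on the cached activations and the new prompt pair.
    return frozen_vlm.generate(
        image,
        prompt,
        steering=steering_module,
        unsteered_activations=unsteered_cache,
        target_prompt=target_prompt,
        converse_prompt=converse_prompt,
        steering_strength=lam,
    )
```

Because the adjustment is computed from the prompt pair at run time, new (target, converse) pairs need no retraining or precomputed vectors.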

Experimental Results

Quantitative Evaluation

SteerVLM is evaluated on two primary tasks: topic-based steering (using VNIA) and hallucination mitigation (using the OHD benchmark). The method is compared against state-of-the-art activation steering baselines, including ActAdd, ML-ACT, CAA, and ACT.

  • Topic Steering: SteerVLM achieves an average score of 0.71 on the VNIA evaluation set, outperforming the best baseline (CAA) by 21%. The method demonstrates robust zero-shot generalization to unseen prompt pairs and image contexts.
  • Hallucination Mitigation: On the OHD benchmark, SteerVLM achieves an overall accuracy of 86.4% and F1 of 86.8%, improving over ActAdd by 1.7% accuracy and 0.9% F1 in a zero-shot setting. This demonstrates the efficacy of prompt-pair-based steering for factuality control.

Qualitative and Ablation Analysis

  • Qualitative Analysis: SteerVLM produces contextually rich, semantically aligned outputs that reflect the intended target behavior, outperforming prompt engineering and prior steering methods, especially in integrating nuanced or negative sentiment prompts.
  • Ablation Studies: Removal of the SteeringGate or unsteered context vector degrades performance, confirming their necessity for stable and effective steering. Applying steering only to specific layers or using uniform gating also reduces qualitative performance.
  • Robustness: The method is robust to semantic variations in prompt phrasing, maintaining stable performance across paraphrased or conceptually shifted prompt pairs.

Computational Considerations

The steering module adds only 0.14% to the base model's parameter count. The main computational overhead arises from the additional forward pass to cache unsteered activations. However, optimizations such as sparse attention (FlexAttention) and parallelization can reduce this overhead, making the approach viable for real-time applications.
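For a rough sense of scale, assuming (purely for illustration) a 7B-parameter base VLM, the 0.14% figure corresponds to roughly 10M trainable parameters:

```python
# Back-of-the-envelope parameter overhead; the 7B base-model size is an assumption.
base_params = 7e9
steering_params = 0.0014 * base_params            # 0.14% of the base model
print(f"~{steering_params / 1e6:.1f}M trainable parameters")  # ~9.8M
```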

Theoretical and Practical Implications

SteerVLM demonstrates that dynamic, prompt-pair-driven activation steering can provide fine-grained, interpretable control over VLM outputs without sacrificing generalization or requiring model retraining. The approach leverages insights from mechanistic interpretability, operating on superpositions of activation dimensions and enabling token- and dimension-specific interventions. The introduction of VNIA fills a critical gap in multimodal steering datasets, supporting further research in this area.

The method's ability to operate in a zero-shot setting, its robustness to prompt variation, and its lightweight design make it suitable for deployment in production VLM systems where user-driven control, safety, and factuality are paramount. The approach is orthogonal to prompt engineering and can be combined with other alignment or safety techniques.

Limitations and Future Directions

SteerVLM's reliance on synthetic data (VNIA) may limit its coverage of real-world hallucination modes. The requirement for additional forward passes introduces latency, though this can be mitigated with architectural optimizations. Like all activation steering methods, the module inherits the base model's risks and limitations.

Future work may explore:

  • Extending the approach to other modalities (e.g., audio, video)
  • Joint optimization with reinforcement learning from human feedback (RLHF)
  • Automated selection or generation of effective target/converse prompt pairs
  • Integration with interpretability tools for real-time user feedback and debugging

Conclusion

SteerVLM establishes a new paradigm for robust, lightweight, and dynamic control of vision-language models via activation steering. By leveraging prompt-pair-driven, layer-agnostic, and dimension-specific interventions, it achieves superior performance in both topic steering and hallucination mitigation, with minimal computational and parameter overhead. The method's generality, efficiency, and interpretability position it as a strong foundation for future research and deployment in controllable multimodal AI systems.
