LLM-VA: Vector Alignment for LLM Safety
- LLM-VA is a framework that aligns answer and safety vectors in large language models to mitigate jailbreak and over-refusal errors.
- It employs closed-form, layer-wise rank-one updates derived from dual SVM training to synchronize model responses with safety assessments.
- Empirical results show an 11.45% F1 improvement, an 18.50% reduction in Attack Success Rate, and a 22.00% decrease in Over-Refusal Rate while preserving overall accuracy.
LLM-VA
LLM-VA is a term spanning several technical contexts in the literature, but the most pertinent contemporary definition refers to Vector Alignment in LLMs for safety control—specifically, the method outlined in "LLM-VA: Resolving the Jailbreak-Overrefusal Trade-off via Vector Alignment" (Zhang et al., 27 Jan 2026). In other usages, notably in efficient code-generation frameworks ("VQ-LLM: High-performance Code Generation for Vector Quantization Augmented LLM Inference" (Liu et al., 4 Mar 2025)) or in multimodal virtual assistant systems, LLM-VA can also denote LLM–Visual Assistant and related architectures. This article focuses on LLM-VA in the context of vector alignment for safety alignment, while also briefly noting key alternative usages.
1. Vector Alignment in Safety-Aligned LLMs
Safety-aligned LLMs exhibit two major failure modes: jailbreak (producing responses to harmful inputs) and over-refusal (incorrectly refusing to answer benign queries). Traditional vector steering and magnitude adjustment approaches create a trade-off between these errors: suppressing jailbreaks with strong answer suppression invariably leads to high over-refusal rates, and vice versa.
LLM-VA resolves this trade-off by diagnosing the root cause: the answer vector (governing willingness to answer) and the benign vector (governing input safety assessment) are nearly orthogonal in the model’s internal representation space. As a result, answering and safety are treated as independent quantities, permitting harmful answers and unnecessary refusals depending on threshold settings. LLM-VA aligns these directions through closed-form, layer-wise weight updates, thereby making the answering behavior a causal function of the model's safety assessment.
2. Methodology: Control Vector Identification and Rank-One Alignment
At each transformer layer, hidden-state residual outputs are extracted over a labeled dataset. Two support vector machines (SVMs) are trained per layer: one to distinguish benign vs. toxic inputs, and another to distinguish answer vs. refuse examples. The resulting SVM normal vectors are normalized to produce two unit "control vectors": a safety control vector (benign/toxic) and an answer control vector (answer/refuse).
Empirically, the angle between these two control vectors is close to 90°. Thus, in the original model, the decision to answer is nearly independent of the internal safety signal.
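The control-vector extraction step can be sketched as follows. This is a minimal illustration on synthetic data, not the paper's pipeline: the activations and labels here are random stand-ins for real layer residuals collected from benign/toxic and answer/refuse prompts, and the scoring column indices are arbitrary.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
d, n = 32, 400

# Synthetic stand-ins for one layer's residual activations; LLM-VA extracts
# these from the model on labeled prompts.
X = rng.standard_normal((n, d))
y_safety = (X[:, 0] > 0).astype(int)   # hypothetical benign-vs-toxic labels
y_answer = (X[:, 1] > 0).astype(int)   # hypothetical answer-vs-refuse labels

def control_vector(X, y):
    """Fit a linear SVM (liblinear backend) and return its unit normal vector."""
    w = LinearSVC(C=1.0, dual=True, max_iter=10000).fit(X, y).coef_.ravel()
    return w / np.linalg.norm(w)

v_safe = control_vector(X, y_safety)   # safety control vector
v_ans = control_vector(X, y_answer)    # answer control vector

# By construction the two label rules depend on different coordinates, so the
# learned normals come out nearly orthogonal, mirroring the empirical finding.
cos_angle = float(v_safe @ v_ans)
print(f"cos(angle) = {cos_angle:.3f}")
```

The near-zero cosine corresponds to the near-90° angle reported for unmodified models: the answer and safety directions carry essentially independent signals.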
LLM-VA solves for a minimum-norm, closed-form update of the layer's weight matrix so that, for any input activation, the output's projection onto the answer direction becomes proportional to its projection onto the safety direction. The two projections are rescaled by their empirical standard deviations over the training set, normalizing their magnitudes. The resulting update is a rank-one modification per selected layer, guaranteeing that downstream answer activations become a function of the safety judgment.
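One mathematically consistent construction of such an update (the paper's exact formula is not reproduced here, so this is a sketch under the stated constraint) picks the minimum-Frobenius-norm Delta satisfying v_a^T (W + Delta) x = (sigma_a/sigma_s) v_s^T W x for all x, which works out to the rank-one matrix Delta = v_a c^T with c = (sigma_a/sigma_s) W^T v_s - W^T v_a:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
W = rng.standard_normal((d, d)) / np.sqrt(d)   # stand-in layer weight matrix

# Unit control vectors (random here; in LLM-VA they come from per-layer SVMs).
v_a = rng.standard_normal(d); v_a /= np.linalg.norm(v_a)   # answer direction
v_s = rng.standard_normal(d); v_s /= np.linalg.norm(v_s)   # safety direction

sigma_a, sigma_s = 1.0, 1.0   # placeholder empirical std devs of the projections

# Minimum-norm solution of the linear constraint Delta^T v_a = c is the
# rank-one outer product v_a c^T (v_a is unit-norm).
c = (sigma_a / sigma_s) * (W.T @ v_s) - (W.T @ v_a)
Delta = np.outer(v_a, c)

# Check the alignment property on a random activation: after the update, the
# answer projection of the output equals the rescaled safety projection.
x = rng.standard_normal(d)
lhs = v_a @ ((W + Delta) @ x)
rhs = (sigma_a / sigma_s) * (v_s @ (W @ x))
assert np.isclose(lhs, rhs)
```

Because the constraint pins down only the component of the update along the answer direction, the minimum-norm solution leaves all other directions of the weight matrix untouched, which is consistent with the reported utility preservation.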
3. Iterative Alignment and Layer Selection
Rather than applying a one-off update, LLM-VA iterates this alignment process:
- At each iteration, all samples in the dataset are forwarded through the model, collecting per-layer residuals.
- For each layer:
  - SVMs yield the two control vectors together with their respective validation accuracies.
  - Influence scores are computed, measuring how strongly each layer's control vectors propagate to the corresponding directions at the model's output.
  - A per-layer score combines these influences with the SVM accuracies.
- The layers with the highest scores are selected.
- For each selected layer, the closed-form rank-one update described above is applied.
The process continues until validation F1 plateaus (typically within 20–30 iterations).
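The layer-selection step at the heart of each iteration can be illustrated in isolation. The exact scoring rule is not reproduced from the paper; a simple product of output influence and SVM validation accuracy is assumed here, with both quantities randomly generated as stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
n_layers, k = 12, 3

# Stand-ins for the per-layer quantities computed in each LLM-VA iteration.
influence = rng.uniform(0.0, 1.0, size=n_layers)  # how strongly each layer's control vectors reach the output
svm_acc = rng.uniform(0.5, 1.0, size=n_layers)    # per-layer SVM validation accuracy

# Combine influence and accuracy into a per-layer score (assumed form),
# then pick the k highest-scoring layers for rank-one updates.
scores = influence * svm_acc
selected = np.argsort(scores)[-k:][::-1]
print("layers selected for rank-one updates:", selected.tolist())
```

In the full procedure this selection, followed by the closed-form updates, repeats until validation F1 stops improving.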
4. Mechanistic Implications and Safety Coupling
Before LLM-VA, the answer direction could be positive (enabling an answer) even when the safety direction was negative (input unsafe), resulting in jailbreaks, or vice versa for over-refusal. After alignment, the model computes answers only on input regions assessed as safe—answer activations become a direct function of safety activations.
This resolves the previously observed trade-off: instead of tuning answer thresholds and balancing false positives/negatives, answering is only possible if the model predicts input safety, eliminating jailbreaks without introducing unnecessary refusals.
5. Empirical Performance Across LLM Families
LLM-VA was evaluated on 12 instruction-tuned LLMs, spanning Llama-3.1 (8B), gemma-2 (9B), Mistral (7B), Phi-3.5/4, and multiple Qwen2.5/3 variants (Zhang et al., 27 Jan 2026). Challenging splits of jailbreak (S-Eval-Attack/Risk) and over-refusal (ORFuzzSet, NaturalQuestions) benchmarks were constructed, with an 8:1:1 train/validation/test split. No model customization or hand-tuning was required.
Results:
- LLM-VA achieved an F1-score 11.45% higher than the best prior baseline (AlphaSteer).
- Attack Success Rate (ASR) reduced by 18.50% and Over-Refusal Rate (ORR) reduced by 22.00% relative to the original models.
- Across six utility benchmarks (CoLA, MNLI, RTE, MRPC, SST, GSM8K), the aligned models retained a mean of 95.92% of original accuracy.
- Adaptation is automatic: models with high initial jailbreak rates saw large ASR drops, while those with high over-refusal saw marked ORR drops, requiring no hand-balancing.
6. Implementation Considerations and Limitations
Implementation requires no full-model fine-tuning or architectural modification: only rank-one updates to existing down-projection weights in Transformer layers. Each iteration fits 2×(#layers) SVMs and applies the rank-one updates. The SVMs are trained with liblinear, and the procedure is robust to control-vector estimation errors: up to 30° of angular distortion in the estimated control vectors yields a <5% F1 penalty.
LLM-VA is robust and scales to 7B–14B models within several hours on two A100 GPUs; hyperparameters are selected via validation-set performance, without manual adjustment of steering parameters or safety/utility balancing.
Limitations include possible over-modification for very deep or very shallow models if the number of iterations and selected layers is not validated, and reliance on the quality of the safety and behavior SVM training sets. The approach presumes that benign/toxic and answer/refuse labels are obtainable via prior manual or automated annotation.
7. Alternate Usages of "LLM-VA"
In broader literature, LLM-VA may also refer to:
- LLM-based Virtual Assistants: e.g., multimodal agents for Chinese ultrasound analysis ["LLaVA-Ultra" (Guo et al., 2024)], interactive mathematics education agents, and professional development tutors (Yang et al., 5 Jul 2025).
- Visual Analytics with LLMs: frameworks in which LLMs facilitate data visualization, analysis, and explanation—ranging from task planning (Zhao et al., 2024), to evaluation stacks (Podo et al., 2024), to proactive user support agents (Zhao et al., 24 Jul 2025), and natural-language cluster explanations (Buchmüller et al., 15 Jan 2026).
- Code Generation and Quantization: "VQ-LLM" is also referred to as LLM-VA in the context of fusing vector quantization with LLM inference for high-throughput code generation (Liu et al., 4 Mar 2025).
All of these usages refer to LLM-driven systems, but only the vector alignment approach of (Zhang et al., 27 Jan 2026) is associated with the systematic, closed-form resolution of the safety–utility trade-off captured by the LLM-VA acronym.
References
- "LLM-VA: Resolving the Jailbreak-Overrefusal Trade-off via Vector Alignment" (Zhang et al., 27 Jan 2026)
- "VQ-LLM: High-performance Code Generation for Vector Quantization Augmented LLM Inference" (Liu et al., 4 Mar 2025)
- "LLaVA-Ultra: Large Chinese Language and Vision Assistant for Ultrasound" (Guo et al., 2024)
- "LightVA: Lightweight Visual Analytics with LLM Agent-Based Task Planning and Execution" (Zhao et al., 2024)
- "Vi(E)va LLM! A Conceptual Stack for Evaluating and Interpreting Generative AI-based Visualizations" (Podo et al., 2024)
- "LangLasso: Interactive Cluster Descriptions through LLM Explanation" (Buchmüller et al., 15 Jan 2026)