
LLM-VA: Vector Alignment for LLM Safety

Updated 3 February 2026
  • LLM-VA is a framework that aligns answer and safety vectors in large language models to mitigate jailbreak and over-refusal errors.
  • It employs closed-form, layer-wise rank-one updates derived from dual SVM training to synchronize model responses with safety assessments.
  • Empirical results show an 11.45% F1 improvement, an 18.50% reduction in Attack Success Rate, and a 22.00% decrease in Over-Refusal Rate while preserving overall accuracy.

LLM-VA

LLM-VA is a term spanning several technical contexts in the literature, but the most pertinent contemporary definition refers to Vector Alignment in LLMs for safety control, specifically the method outlined in "LLM-VA: Resolving the Jailbreak-Overrefusal Trade-off via Vector Alignment" (Zhang et al., 27 Jan 2026). In other usages, notably in efficient code-generation frameworks ("VQ-LLM: High-performance Code Generation for Vector Quantization Augmented LLM Inference" (Liu et al., 4 Mar 2025)) or in multimodal virtual assistant systems, LLM-VA can also denote an LLM-based Visual Assistant and related architectures. This article focuses on LLM-VA in the sense of vector alignment for safety, while briefly noting key alternative usages.

1. Vector Alignment in Safety-Aligned LLMs

Safety-aligned LLMs exhibit two major failure modes: jailbreak (producing responses to harmful inputs) and over-refusal (incorrectly refusing to answer benign queries). Traditional vector steering and magnitude adjustment approaches create a trade-off between these errors: suppressing jailbreaks with strong answer suppression invariably leads to high over-refusal rates, and vice versa.

LLM-VA resolves this trade-off by diagnosing the root cause: the answer vector $v_a$ (governing willingness to answer) and the benign vector $v_b$ (governing input safety assessment) are nearly orthogonal in the model's internal representation space. As a result, answering and safety are treated as independent quantities, permitting harmful answers and unnecessary refusals depending on threshold settings. LLM-VA aligns these directions through closed-form, layer-wise weight updates, thereby making the answering behavior a causal function of the model's safety assessment.

2. Methodology: Control Vector Identification and Rank-One Alignment

At each transformer layer $\ell$, the hidden-state residual outputs $o_i^{(\ell)} \in \mathbb{R}^d$ are extracted for a dataset $D = \{o_i^{(\ell)}, y_i\}$. Two support vector machines (SVMs) are trained: one to distinguish benign vs. toxic inputs, and another to distinguish answer vs. refuse examples. The resulting normal vectors are normalized to produce unit "control vectors" $v_b^{(\ell)}$ (benign/toxic) and $v_a^{(\ell)}$ (answer/refuse).

Empirically, the angle $\varphi^{(\ell)} = \arccos(v_b^{(\ell)} \cdot v_a^{(\ell)})$ is close to $90^\circ$. Thus, in the original model, the decision to answer is nearly independent of the internal safety signal.
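
The control-vector extraction can be sketched as follows. This is a toy illustration, not the paper's implementation: synthetic activations stand in for real residual-stream states, and the hidden size, sample count, and ground-truth directions are placeholder choices. The linear SVMs use liblinear with $C=1.0$, as the method specifies.

```python
# Toy sketch of control-vector extraction via linear SVMs.
# Assumptions: synthetic activations replace real hidden states;
# u_b and u_a are placeholder ground-truth directions.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
d = 64                      # toy hidden size
u_b = np.eye(d)[0]          # ground-truth "benign" feature direction
u_a = np.eye(d)[1]          # ground-truth "answer" direction (orthogonal to u_b)

def make_split(direction, n=400):
    """Synthetic activations whose labels are separable along `direction`."""
    y = rng.integers(0, 2, n)
    X = rng.normal(0.0, 1.0, (n, d)) + 3.0 * np.outer(2 * y - 1, direction)
    return X, y

def control_vector(X, y):
    """Unit normal of a linear SVM boundary (liblinear, C=1.0, per the paper)."""
    w = LinearSVC(C=1.0, max_iter=5000).fit(X, y).coef_[0]
    return w / np.linalg.norm(w)

v_b = control_vector(*make_split(u_b))   # benign/toxic control vector
v_a = control_vector(*make_split(u_a))   # answer/refuse control vector

phi = np.degrees(np.arccos(np.clip(v_b @ v_a, -1.0, 1.0)))
print(f"angle between v_b and v_a: {phi:.1f} degrees")  # near 90 here
```

Because the two generating directions are orthogonal, the recovered control vectors reproduce the near-orthogonality that LLM-VA diagnoses in real models.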

LLM-VA solves for a minimum-norm, closed-form update $\Delta$ of the layer's weight matrix $W$ so that, for any activation $x$,

$$x(W+\Delta)v_a = (\sigma_a/\sigma_b)\, xWv_b$$

where $\sigma_b$ and $\sigma_a$ are the empirical standard deviations of $xWv_b$ and $xWv_a$ over the training set, normalizing their magnitudes.

The update is:

$$\Delta = \left[ (\sigma_a/\sigma_b)\, W v_b - W v_a \right] v_a^\top$$

and $W \leftarrow W + \Delta$. This is a rank-one modification per selected layer, guaranteeing that downstream answer activations become a function of the safety judgment.
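
The identity can be verified numerically. In the sketch below, $W$, $x$, and the unit control vectors are random stand-ins with toy dimensions; since $v_a^\top v_a = 1$, the update cancels $xWv_a$ and substitutes the rescaled safety projection exactly.

```python
# Numerical check of the closed-form rank-one update.
# Assumptions: W, x, v_a, v_b are random toy stand-ins; sigma_a, sigma_b
# are placeholder values for the empirical standard deviations.
import numpy as np

rng = np.random.default_rng(1)
d = 32
W = rng.normal(size=(d, d))
v_a = rng.normal(size=d); v_a /= np.linalg.norm(v_a)   # unit answer vector
v_b = rng.normal(size=d); v_b /= np.linalg.norm(v_b)   # unit benign vector
sigma_a, sigma_b = 1.7, 0.9

# Delta = [(sigma_a/sigma_b) W v_b - W v_a] v_a^T  (rank one by construction)
delta = np.outer((sigma_a / sigma_b) * (W @ v_b) - W @ v_a, v_a)
W_new = W + delta

x = rng.normal(size=d)                      # arbitrary activation (row vector)
lhs = x @ W_new @ v_a
rhs = (sigma_a / sigma_b) * (x @ W @ v_b)
print(np.isclose(lhs, rhs))                 # True: the identity holds exactly
print(np.linalg.matrix_rank(delta))         # 1
```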

3. Iterative Alignment and Layer Selection

Rather than applying a one-off update, LLM-VA iterates this alignment process:

  • For each iteration $t$, all samples in $D$ are forwarded through the model, collecting residuals.
  • For each layer $\ell$:
    • SVMs yield $v_b^{(\ell)}, v_a^{(\ell)}$ and their respective validation accuracies.
    • Influence scores are calculated relative to the control vectors at the model's output: $C_b^{(\ell)} = v_b^{(fin)} \cdot v_b^{(\ell)}$, $C_a^{(\ell)} = v_a^{(fin)} \cdot v_a^{(\ell)}$.
    • Per-layer scores combine influences and accuracies: $Score^{(\ell)} = C_b^{(\ell)} \cdot Acc_b^{(\ell)} + C_a^{(\ell)} \cdot Acc_a^{(\ell)}$.
    • Select the $L_{select}$ layers with the highest scores.
  • For each selected layer, perform the closed-form update as above.

The process continues until validation F1 plateaus (typically within 20–30 iterations).
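
The layer-scoring step above can be sketched as follows. All quantities are toy stand-ins: `v_fin_b` / `v_fin_a` play the role of the final-layer control vectors and `acc_b` / `acc_a` the SVM validation accuracies; in the real procedure these come from the forward pass and the per-layer SVM fits.

```python
# Sketch of per-layer scoring and top-L_select layer selection.
# Assumptions: all control vectors and accuracies are random placeholders.
import numpy as np

rng = np.random.default_rng(2)
n_layers, d, L_select = 8, 16, 3

def unit(v):
    return v / np.linalg.norm(v)

v_fin_b, v_fin_a = unit(rng.normal(size=d)), unit(rng.normal(size=d))
layers = [{"v_b": unit(rng.normal(size=d)), "v_a": unit(rng.normal(size=d)),
           "acc_b": rng.uniform(0.6, 1.0), "acc_a": rng.uniform(0.6, 1.0)}
          for _ in range(n_layers)]

# Score^(l) = C_b^(l) * Acc_b^(l) + C_a^(l) * Acc_a^(l), where C_* is the
# cosine "influence" of a layer's control vector on the final-layer one.
scores = [lyr["v_b"] @ v_fin_b * lyr["acc_b"] + lyr["v_a"] @ v_fin_a * lyr["acc_a"]
          for lyr in layers]
selected = sorted(range(n_layers), key=scores.__getitem__, reverse=True)[:L_select]
print("layers selected for rank-one updates:", selected)
```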

4. Mechanistic Implications and Safety Coupling

Before LLM-VA, the projection onto the answer direction $v_a$ could be positive (enabling an answer) even when the projection onto the safety direction $v_b$ was negative (input unsafe), resulting in jailbreaks, or vice versa for over-refusal. After alignment, the model computes answers only on input regions assessed as safe: answer activations become a direct function of safety activations.

This resolves the previously observed trade-off: instead of tuning answer thresholds and balancing false positives/negatives, answering is only possible if the model predicts input safety, eliminating jailbreaks without introducing unnecessary refusals.

5. Empirical Performance Across LLM Families

LLM-VA was evaluated on 12 instruction-tuned LLMs, spanning Llama-3.1 (8B), gemma-2 (9B), Mistral (7B), Phi-3.5/4, and multiple Qwen2.5/3 variants (Zhang et al., 27 Jan 2026). Challenging splits of jailbreak (S-Eval-Attack/Risk) and over-refusal (ORFuzzSet, NaturalQuestions) benchmarks were constructed, with an 8:1:1 train/validation/test split. No model customization or hand-tuning was required.

Results:

  • LLM-VA achieved an 11.45% higher F1 score than the best prior baseline (AlphaSteer).
  • Attack Success Rate (ASR) was reduced by 18.50% and Over-Refusal Rate (ORR) by 22.00% relative to the original models.
  • Across six utility benchmarks (CoLA, MNLI, RTE, MRPC, SST, GSM8K), the aligned models retained a mean of 95.92% of original accuracy.
  • Adaptation is automatic: models with high initial jailbreak rates saw large ASR drops, while those with high over-refusal saw marked ORR drops, requiring no hand-balancing.

6. Implementation Considerations and Limitations

Implementation requires no full-model fine-tuning or architectural modifications, only matrix updates to existing down-projection weights in $L_{select}$ Transformer layers. Each iteration fits 2 × (number of layers) SVMs and applies rank-one updates. SVMs are trained using liblinear with $C = 1.0$ and are robust to control-vector estimation errors: up to 30° of angular distortion in $v_a$ / $v_b$ yields less than a 5% F1 penalty.

LLM-VA is robust, scales to 7B–14B models within several hours on two A100 GPUs, and selects hyperparameters via validation-set performance, without manual adjustment of steering parameters or safety/utility balancing.

Limitations include possible over-modification for very deep or shallow models if the iteration count or $L_{select}$ is not validated, and reliance on the quality of the safety and behavior SVM training sets. The approach presumes that benign/toxic and answer/refuse labels are obtainable via prior manual or automated annotation.

7. Alternate Usages of "LLM-VA"

In broader literature, LLM-VA may also refer to:

  • vector-quantization-augmented LLM inference, as in VQ-LLM (Liu et al., 4 Mar 2025);
  • LLM-based vision or visual-analytics assistants, as in LLaVA-Ultra (Guo et al., 2024) and LightVA (Zhao et al., 2024).

Each of these usages refers to LLM-driven systems, but only the vector alignment approach of (Zhang et al., 27 Jan 2026) is associated with the systematic, closed-form resolution of the safety–utility trade-off captured by the LLM-VA acronym.

References

  • "LLM-VA: Resolving the Jailbreak-Overrefusal Trade-off via Vector Alignment" (Zhang et al., 27 Jan 2026)
  • "VQ-LLM: High-performance Code Generation for Vector Quantization Augmented LLM Inference" (Liu et al., 4 Mar 2025)
  • "LLaVA-Ultra: Large Chinese Language and Vision Assistant for Ultrasound" (Guo et al., 2024)
  • "LightVA: Lightweight Visual Analytics with LLM Agent-Based Task Planning and Execution" (Zhao et al., 2024)
  • "Vi(E)va LLM! A Conceptual Stack for Evaluating and Interpreting Generative AI-based Visualizations" (Podo et al., 2024)
  • "LangLasso: Interactive Cluster Descriptions through LLM Explanation" (Buchmüller et al., 15 Jan 2026)
