VisAlign: Visual Alignment for AI Models
- VisAlign refers to two independently developed efforts: the VIRAL regularization method for MLLMs and a dataset-plus-protocol for evaluating AI-human visual alignment in image classification.
- The VIRAL method aligns intermediate visual features using cosine similarity loss to preserve fine-grained semantic details critical for tasks like object counting and spatial reasoning.
- The VisAlign dataset quantifies alignment through curated image categories and robust metrics, revealing key insights into AI safety and performance discrepancies with human perception.
VisAlign encompasses two distinct, independently developed technical concepts in visual AI: (1) a regularization strategy for multimodal LLMs (MLLMs) termed Visual Representation Alignment (VIRAL), which aligns internal model representations with those of vision foundation models; and (2) the VisAlign dataset and protocol for quantifying the degree of alignment between AI models and human visual perception in image classification tasks. Both approaches address the challenge of retaining or evaluating fine-grained visual semantic fidelity in models, targeting limitations in MLLM architecture and AI safety, respectively.
1. Motivation and Problem Landscape
Instruction-tuned MLLMs typically optimize a language-modeling loss over output tokens, employing an indirect, language-mediated supervision mechanism for visual inputs. This paradigm results in the model learning to preserve only superficial visual features correlated with the training captions, leading to notable degradation of critical semantic details. Empirical measurements, such as CKNNA similarity, demonstrate rapid divergence between the internal feature distributions of MLLMs and those of their frozen vision encoders.
Key tasks affected by these limitations include:
- Object counting: Loss of individual object identities causes inaccurate counts.
- Spatial reasoning: Substandard spatial embeddings impede correct reasoning about relative object positions.
- Fine-grained grounding (small-object recognition, attribute queries): Subtle attribute distinctions are frequently omitted.
In parallel, AI-human visual alignment is defined as the congruence between the model’s output distribution (including abstention) and the human perceptual distribution for identical images. Measurement of such alignment is critical, as large-scale models typically operate as black boxes and direct manual control is infeasible.
2. Methodological Frameworks
2.1 Visual Representation Alignment (VIRAL)
VIRAL introduces an explicit regularization term during instruction tuning that aligns intermediate visual feature maps of an MLLM with those of pre-trained, frozen Vision Foundation Models (VFMs). This is formalized as:
- An input image is encoded by the frozen vision encoder, yielding visual features $Z_v$.
- The same image is processed by a frozen VFM encoder, yielding target features $Z_{\text{vfm}}$.
- $Z_v$ is passed through the vision-language projector into the LLM embedding space, yielding visual tokens $H_v$.
- $H_v$, concatenated with the text tokens, is processed through the LLM's transformer layers.
- At a chosen layer $l$, the visual-token hidden states are projected by a trainable alignment head into the VFM feature space, yielding $\hat{Z}$.
- The alignment loss is computed between $\hat{Z}$ and $Z_{\text{vfm}}$, while the transformer stack proceeds to generate output tokens as usual (see the sketch below).
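The flow above can be written as a minimal PyTorch-style sketch. The module names (`vision_encoder`, `vfm_encoder`, `projector`, `align_head`, `llm_layers`) and the assumption that visual tokens occupy the leading sequence positions are illustrative choices, not the released VIRAL implementation.

```python
import torch
import torch.nn.functional as F

def viral_forward(image, text_embeds, vision_encoder, vfm_encoder,
                  projector, align_head, llm_layers, align_layer=16):
    """Sketch of a VIRAL-style forward pass (names and shapes are illustrative)."""
    with torch.no_grad():
        z_v = vision_encoder(image)    # frozen MLLM vision encoder: (B, N, d_v)
        z_vfm = vfm_encoder(image)     # frozen VFM teacher (e.g., DINOv2): (B, N, d_t)

    h_v = projector(z_v)                       # vision-language projector: (B, N, d_llm)
    h = torch.cat([h_v, text_embeds], dim=1)   # visual tokens followed by text tokens

    align_loss = None
    for idx, layer in enumerate(llm_layers, start=1):
        h = layer(h)
        if idx == align_layer:
            # Project visual-token hidden states into the VFM space and align them.
            z_hat = align_head(h[:, : z_vfm.shape[1]])
            align_loss = (1 - F.cosine_similarity(z_hat, z_vfm, dim=-1)).mean()
    return h, align_loss
```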
2.2 VisAlign Dataset and AI-Human Alignment Evaluation
The VisAlign dataset facilitates measurement of visual alignment through carefully curated image categories:
- Must-Act: Images human annotators can reliably classify (clean, adversarial, or prompt-driven).
- Must-Abstain: Images leading to consistent human abstention (out-of-class, hybrids, near-class confounders, non-photorealistic renderings).
- Uncertain: Images cropped or corrupted across ten severity levels to simulate ambiguity.
Human-grounded label distributions are established through crowd-sourced annotations (134 per image), and the dataset's reliability is substantiated via Cronbach's $\alpha$.
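As a rough illustration of how per-image annotations could be aggregated into such a gold distribution over the classes plus an abstention option, the sketch below uses simple count normalization; the exact VisAlign aggregation rule is not reproduced here and may differ.

```python
import numpy as np

def gold_distribution(annotations, num_classes=10):
    """Normalize per-image crowd annotations into a human gold label distribution.

    `annotations` holds one choice per annotator: a class index in [0, num_classes)
    or the value num_classes for "abstain". Count normalization is an illustrative
    aggregation, not necessarily the exact VisAlign protocol.
    """
    counts = np.bincount(annotations, minlength=num_classes + 1)
    return counts / counts.sum()
```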
3. Mathematical Formulation and Loss Functions
3.1 VIRAL Alignment Loss
Let $\hat{Z} = \{\hat{z}_i\}_{i=1}^{N}$ be the projected intermediate features over the $N$ visual tokens and $Z_{\text{vfm}} = \{z_i\}_{i=1}^{N}$ the target VFM features. The cosine-similarity alignment loss is:

$$\mathcal{L}_{\text{align}} = \frac{1}{N} \sum_{i=1}^{N} \left( 1 - \frac{\hat{z}_i^{\top} z_i}{\lVert \hat{z}_i \rVert \, \lVert z_i \rVert} \right)$$

Alternatively, an $\ell_2$ penalty may be used over unit-normalized vectors:

$$\mathcal{L}_{\text{align}}^{\ell_2} = \frac{1}{N} \sum_{i=1}^{N} \left\lVert \frac{\hat{z}_i}{\lVert \hat{z}_i \rVert} - \frac{z_i}{\lVert z_i \rVert} \right\rVert_2^2$$

The total objective is $\mathcal{L} = \mathcal{L}_{\text{LM}} + \lambda \, \mathcal{L}_{\text{align}}$, where $\mathcal{L}_{\text{LM}}$ is the standard causal language-modeling loss and $\lambda$ is the alignment weight.
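A direct reading of these equations in PyTorch is given below; the weight value of 0.5 is a placeholder rather than the reported setting.

```python
import torch.nn.functional as F

def alignment_loss(z_hat, z_vfm, variant="cosine"):
    """Per-token alignment loss between projected MLLM features and frozen VFM targets."""
    if variant == "cosine":
        return (1 - F.cosine_similarity(z_hat, z_vfm, dim=-1)).mean()
    # l2 penalty over unit-normalized feature vectors
    diff = F.normalize(z_hat, dim=-1) - F.normalize(z_vfm, dim=-1)
    return diff.pow(2).sum(dim=-1).mean()

def total_loss(lm_loss, z_hat, z_vfm, lam=0.5):
    # lam is the alignment weight lambda; 0.5 is a placeholder value
    return lm_loss + lam * alignment_loss(z_hat, z_vfm)
```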
3.2 VisAlign Evaluation Metrics
- Hellinger Distance: For a model output distribution $p$ and a human-derived gold distribution $q$ over the $K$ classes plus an abstention option,

$$H(p, q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{k=1}^{K+1} \left( \sqrt{p_k} - \sqrt{q_k} \right)^2}$$
This metric provides a sensitive, smooth, and bounded measure of discrepancy between distributions, with abstention treated as an additional outcome.
- Reliability Score with Abstention:
| Sample Type | Model Action | Score |
|---|---|---|
| Must-Act | Correct | +1 |
| Must-Act | Incorrect | –c |
| Must-Act | Abstain | 0 |
| Must-Abstain | Abstain | +1 |
| Must-Abstain | Incorrect | –c |
| Must-Abstain | Original Label | 0 |
For Uncertain samples, thresholding the human label distribution determines whether each sample is scored as Must-Act or Must-Abstain; both metrics are sketched in code below.
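The sketch below is a minimal rendering of the two definitions above; the averaging used for the reliability score and the array layout (class probabilities followed by one abstention entry) are assumptions about the aggregation, not the exact protocol.

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between a model output distribution p and the human
    gold distribution q, each over K classes plus an abstention entry.
    Returns a value in [0, 1]; 0 means the distributions are identical."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

def reliability_score(records, cost):
    """Aggregate per-sample rewards following the table above.

    `records` is an iterable of (sample_type, outcome) pairs, where sample_type
    is "must_act" or "must_abstain" and outcome is "correct", "incorrect",
    "abstain", or "original_label"."""
    reward = {
        ("must_act", "correct"): 1.0,
        ("must_act", "incorrect"): -cost,
        ("must_act", "abstain"): 0.0,
        ("must_abstain", "abstain"): 1.0,
        ("must_abstain", "incorrect"): -cost,
        ("must_abstain", "original_label"): 0.0,
    }
    scores = [reward[(t, o)] for t, o in records]
    return sum(scores) / len(scores)
```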
4. Empirical Setup and Experimental Findings
4.1 VIRAL Experimental Configuration
- Encoders: CLIP-ViT-L/14, SigLIPv2; VFMs: DINOv2 (default), SAM, DepthAnythingV2, RADIO-v2.5.
- Regularization applied at a mid-transformer layer (e.g., $l = 16$ in a 32-layer LLM).
- Projectors: three-layer MLP with hidden dimension equal to the input dimension and output dimension matching the VFM feature space.
- Alignment weight $\lambda$ tuned via ablation (Section 5). LoRA fine-tuning with learning rate 3e-5; batch size 32 per GPU (4× A100); 5000 training steps (consolidated in the sketch below).
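For reference, the settings above can be consolidated into an illustrative configuration dictionary; the alignment weight value is a placeholder, since the exact reported value is not restated here.

```python
# Illustrative consolidation of the VIRAL training settings listed above.
viral_config = {
    "vision_encoder": "CLIP-ViT-L/14",   # or SigLIPv2
    "vfm_teacher": "DINOv2",             # default; SAM, DepthAnythingV2, RADIO-v2.5 also tested
    "align_layer": 16,                   # mid-layer of a 32-layer LLM
    "alignment_projector": {"type": "mlp", "num_layers": 3},
    "align_weight": 0.5,                 # lambda -- placeholder value
    "lora_lr": 3e-5,
    "batch_size_per_gpu": 32,
    "num_gpus": 4,                       # A100
    "train_steps": 5000,
}
```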
Benchmarks include CV-Bench²ᴰ (object counting, spatial queries), What’s Up, MMVP (geometric relations), POPE (hallucination), MME, and MMStar (general multimodal).
4.2 VisAlign Dataset Validation
- Five architectures: ViT, Swin Transformer, DenseNet, ConvNeXt, MLP-Mixer.
- Abstention mechanisms: SP, ASP, Mahalanobis detector (MD), KNN, TAPUDD, OpenMax, MC-Dropout, Deep Ensemble.
- Key results:
- Must-Act: Distance-based methods (MD, KNN, TAPUDD) achieve lower Hellinger distances; Swin+KNN attains $0.182$ on Category 3.
- Must-Abstain: Probability-based methods (SP, ASP) are superior; reported Hellinger distances include $0.736$ for Swin+ASP and $0.543$ for KNN.
- Uncertain: All methods yield Hellinger distances of up to $0.60$, reflecting an inability to match human judgments under severe ambiguity.
- Reliability: Under strict penalty costs (e.g., $c = 900$), reliability scores remain negative, exposing AI safety vulnerabilities.
- Correlation: Hellinger distance and reliability score are negatively correlated, confirming that the alignment measure tracks reliability.
4.3 VIRAL Performance Outcomes
On Vicuna-7B + CLIP encoder:
- CV-Bench²ᴰ: +2.9% improvement (59.7% vs. 56.8%)
- MMVP: +5.1% (33.3% vs. 28.2%)
- What’s Up: +8.4% (48.6% vs. 40.1%)
- POPE: +1.7% (87.4% vs. 85.7%)
- MME total: +44 points (1694 vs. 1650)
Consistent gains are observed across encoder and backbone variations and at both model scales (7B and 13B). Training curves demonstrate faster task-specific convergence, and qualitative examples show improved visual reasoning.
5. Ablation Studies and Design Choices
Detailed ablations uncover sensitivity across several axes:
- VFM Teacher Selection: DINOv2 yields maximal vision-centric gains, surpassing CLIP; SAM, DepthAnythingV2, RADIO display secondary improvements.
- Layer for Alignment ($l$): Best trade-off at a single mid-layer (e.g., $l = 16$); multi-layer windowing underperforms.
- Alignment Objective: Cosine similarity loss delivers better outcomes than structural MSE over relation matrices.
- Regularization Weight ($\lambda$): A range of values was explored; the selected weight provides robustness with negligible adverse effect on language performance.
6. Limitations and Prospective Developments
- The VisAlign dataset is constrained to ten mammal classes, dictated by annotation complexity (134 crowd annotations per image). Extension to domains with richer taxonomies (e.g., medical, naturalistic) would require alternative expert-driven annotation protocols.
- Synthetic corruptions model only a finite subset of real-world uncertainties; future work may encompass more semantically diverse and naturally occurring noisy collections.
- Expansion of evaluation to detection, segmentation, or contentious human-centric categories (gender, race) is merited.
- For VIRAL, dynamic regularization schedules, more granular multi-layer/alignment mechanisms, and applicability to video-language or generative multimodal models remain salient open avenues.
7. Synthesis and Research Significance
VisAlign, across its two manifestations, addresses key deficiencies in both model training and evaluation. VIRAL operationalizes explicit regularization to prevent vision pathway semantic erosion under text-only supervision, leveraging VFM knowledge to elevate model performance on tasks requiring intricate visual reasoning. The VisAlign dataset advances quantification of AI-human visual alignment with statistically validated human annotation, revealing persistent gaps in AI reliability and alignment, particularly under ambiguous or adversarial conditions.
A plausible implication is that direct alignment strategies, whether supervised (VIRAL) or evaluative (VisAlign metrics), will remain central in efforts to mitigate semantic loss in vision-language architectures and to ensure trustworthy model behavior in high-stakes domains.