Geometry-Preserving Unsupervised Alignment for Heterogeneous Foundation Models

Published 3 Jun 2026 in cs.CV | (2606.04385v1)

Abstract: Foundation models have driven rapid progress in computer vision, yet the two dominant paradigms, vision-language foundation models (VLMs) and vision-only foundation models (VFMs), remain only partially compatible. VLMs offer language-grounded semantic alignment but are often visually coarse, while VFMs learn discriminative perceptual geometry but lack semantic grounding. We propose GPUA (Geometry-Preserving Unsupervised Alignment), a framework that integrates the complementary strengths of VFMs and VLMs. Inspired by cross-lingual alignment, GPUA treats VFM features as a visual language and learns an orthogonal mapping that translates the VFM space into the VLM semantic space, preserving geometry and narrowing the modality gap without labels or model parameter updates. GPUA is task-agnostic and requires only feature-level access to pretrained models. Experiments across diverse benchmarks demonstrate improved cross-model compatibility and strong gains in downstream zero-shot recognition and segmentation with negligible overhead. Code is available at https://github.com/Yuteam14/GPUA

Authors (5)

Summary

The paper introduces a geometry-preserving unsupervised alignment framework that maps VFM features into a fixed VLM semantic space via an orthogonal mapping.
It employs unsupervised correspondence mining and topology-aware hubness suppression to enhance zero-shot classification and open-vocabulary segmentation performance.
Experimental results show significant accuracy gains and robust out-of-domain generalization without updating model parameters.

Motivation and Context

Contemporary computer vision leverages two predominant foundation model paradigms: vision-language foundation models (VLMs) and vision-only foundation models (VFMs). VLMs provide robust semantic interfaces and open-vocabulary capabilities via contrastive image-text pretraining but lack sensitivity to fine-grained perceptual cues and local geometry. VFMs, exemplified by DINO-style models, deliver highly discriminative visual representations with strong locality but lack explicit semantic grounding and text-driven inference abilities. Fusion approaches that combine VLMs and VFMs have demonstrated improvements in open-vocabulary tasks but require substantial model access and are tightly coupled to task-specific pipelines. This paper addresses the fundamental challenge of achieving task-agnostic, geometry-aware, unsupervised alignment between heterogeneous foundation models solely at the feature level.

GPUA Framework Overview

The GPUA (Geometry-Preserving Unsupervised Alignment) framework formulates vision-language integration as a cross-modal translation problem inspired by cross-lingual embedding alignment. Treating VFM representations as a "visual language," GPUA learns an orthogonal mapping that transfers perceptually rich VFM features into the fixed VLM semantic reference space. The alignment is unsupervised, requires neither labels nor model parameter updates, and operates as a plug-and-play module at the representation level.
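Why an orthogonal mapping preserves geometry can be checked numerically. The sketch below is illustrative only: it assumes the two feature spaces share a dimensionality, and the helper names (`random_orthogonal`, `cosine_matrix`) and the random data are hypothetical, not taken from the GPUA code.

```python
import numpy as np

def random_orthogonal(dim, seed=0):
    """Draw an orthogonal matrix from the QR factorization of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q

def cosine_matrix(x):
    """Pairwise cosine similarities between the rows of x."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return x @ x.T

rng = np.random.default_rng(1)
features = rng.standard_normal((5, 8))   # stand-in for VFM features
W = random_orthogonal(8)                 # stand-in for a learned orthogonal map
mapped = features @ W

# Largest change in any pairwise cosine similarity after applying W.
gap = np.abs(cosine_matrix(features) - cosine_matrix(mapped)).max()
```

Because `W.T @ W` is the identity, inner products and norms are unchanged, so `gap` is zero up to floating-point error. This is the sense in which the mapping leaves the VFM neighborhood structure intact.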

The algorithm decouples correspondence mining from geometry-preserving alignment:

Unsupervised Correspondence Mining (UCM): Infers soft instance-to-prototype assignments via entropy-regularized optimal transport. The assignment matrix $P$ jointly enforces semantic alignment (with VLM prototypes) and geometric coherence (with VFM centroids), and a fusion parameter $\lambda$ balances the two.
Geometry-Preserving Alignment (GPA): Computes a closed-form orthogonal mapping $W$ via Procrustes analysis, refined by a topology-aware hubness suppression (THS) loss that penalizes dominant prototype hubs and improves discriminability.
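The UCM step can be sketched with Sinkhorn iterations. This is a minimal illustration under stated assumptions, not the authors' implementation: the cosine-based costs, the convex combination weighted by `lam`, the uniform marginals, and all function names are hypothetical choices, and the centroids are assumed to be indexed consistently with the prototypes.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def cosine_cost(a, b):
    """Pairwise cost 1 - cos(a_i, b_j), which lies in [0, 2]."""
    return 1.0 - l2_normalize(a) @ l2_normalize(b).T

def sinkhorn(cost, eps=0.5, n_iters=500):
    """Entropy-regularized OT plan with uniform marginals (an assumption)."""
    n, k = cost.shape
    kernel = np.exp(-cost / eps)
    row_mass = np.full(n, 1.0 / n)
    col_mass = np.full(k, 1.0 / k)
    u = np.ones(n)
    v = np.ones(k)
    for _ in range(n_iters):
        u = row_mass / (kernel @ v)      # rescale rows to match row_mass
        v = col_mass / (kernel.T @ u)    # rescale columns to match col_mass
    return u[:, None] * kernel * v[None, :]

def mine_correspondences(vlm_feats, text_protos, vfm_feats, vfm_centroids, lam=0.5):
    """Fuse a semantic cost and a geometric cost, then solve for soft assignments."""
    semantic = cosine_cost(vlm_feats, text_protos)     # VLM image vs. text prototypes
    geometric = cosine_cost(vfm_feats, vfm_centroids)  # VFM image vs. VFM centroids
    fused = (1.0 - lam) * semantic + lam * geometric   # hypothetical fusion rule
    return sinkhorn(fused)

rng = np.random.default_rng(0)
n, k = 12, 3
P = mine_correspondences(
    rng.standard_normal((n, 16)), rng.standard_normal((k, 16)),
    rng.standard_normal((n, 32)), rng.standard_normal((k, 32)),
)
```

Each row of `P` is a soft assignment of one instance over the `k` prototypes. The uniform column marginal spreads mass evenly across prototypes, which is a modeling choice of this sketch rather than something stated in the summary.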

This methodology avoids the pitfalls of alternating optimization and initialization sensitivity found in prior unsupervised alignment approaches, resulting in a stable, scalable alignment pipeline.
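The closed-form part of GPA is the classical orthogonal Procrustes solution: for paired matrices `source` and `target`, the orthogonal `W` minimizing the Frobenius norm of `source @ W - target` is `U @ Vt`, where `U, S, Vt` is the SVD of `source.T @ target`. The sketch below shows only this step; how GPUA builds the targets from the assignment matrix and how the THS refinement is applied are not specified here, and the synthetic data is hypothetical.

```python
import numpy as np

def orthogonal_procrustes(source, target):
    """Return the orthogonal W minimizing ||source @ W - target||_F."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt

# Synthetic check: hide a known rotation and recover it from paired features.
rng = np.random.default_rng(0)
dim = 6
true_rotation, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
source = rng.standard_normal((50, dim))   # stand-in for VFM features
target = source @ true_rotation           # stand-in for VLM-space targets
W = orthogonal_procrustes(source, target)
```

Since the solution needs a single SVD of a `dim × dim` matrix, it involves no iterative optimization and no initialization, which is consistent with the stability argument above.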

Experimental Results and Numerical Analysis

GPUA is evaluated on eleven benchmark datasets for zero-shot image classification and five for open-vocabulary segmentation. All experiments are performed without model parameter updates, highlighting GPUA's practicality for restricted-access models and deployment scenarios.

Key findings include:

Zero-Shot Classification: GPUA achieves substantial accuracy gains across all benchmarks (e.g., +34.9% on EuroSAT, +11.7% on Cars, +10.5% on ImageNet) compared with CLIP and with strong adaptation baselines (e.g., COSMIC [73.3] vs. GPUA [77.4]). Alignment learned from as few as 16 samples per class (GPUA*) remains competitive, indicating strong sample efficiency.
Open-Vocabulary Segmentation: Integrating GPUA with CLIP-based segmentation frameworks (MaskCLIP, SCLIP, SC-CLIP) yields consistent mIoU improvements, e.g., SC-CLIP baseline (40.1, C59) vs. SC-CLIP+GPUA (41.0), without modifying segmentation heads or training losses. This supports the benefit of geometry-aware feature alignment.
Out-of-Domain Generalization: GPUA improves robustness under domain shift (ImageNet-A: CLIP [47.9], GPUA [57.4]; ImageNet-S: CLIP [46.1], GPUA [56.9]) and on dense prediction tasks (COCO-Stuff164K: SC-CLIP [26.9], GPUA [29.0]).
Ablation Studies: Incorporating both semantic and geometric loss terms, and using THS for hubness suppression, yields marked improvements over alternatives, especially on challenging datasets with class imbalance or distribution shift.
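For orientation, zero-shot classification with mapped features reduces to a nearest-prototype rule in the VLM space. The sketch below is a simplified illustration on synthetic data: it assumes an orthogonal map `W` is already available (here the hidden rotation itself is passed in as an oracle), and it ignores any fusion with native VLM image features, which the summary does not describe. All names are hypothetical.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def zero_shot_predict(vfm_feats, W, text_protos):
    """Map VFM features with W, then pick the most cosine-similar text prototype."""
    mapped = l2_normalize(vfm_feats @ W)
    scores = mapped @ l2_normalize(text_protos).T
    return scores.argmax(axis=1)

rng = np.random.default_rng(0)
dim, n_classes = 16, 4
text_protos = rng.standard_normal((n_classes, dim))   # stand-in for VLM text embeddings
hidden_rotation, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
labels = np.array([0, 1, 2, 3, 0, 1, 2, 3])

# Synthetic "VFM" features: noisy prototypes viewed through an unknown rotation.
semantic = text_protos[labels] + 0.01 * rng.standard_normal((len(labels), dim))
vfm_feats = semantic @ hidden_rotation.T

# Oracle map for illustration; in practice W would come from the alignment stage.
predictions = zero_shot_predict(vfm_feats, hidden_rotation, text_protos)
```

The only inference-time cost added by the mapping is one matrix multiplication per batch of features, which matches the negligible-overhead claim.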

Theoretical and Practical Implications

GPUA demonstrates that feature-level unsupervised alignment can bridge the modality gap between heterogeneous foundation models without task-specific supervision or model parameter access. The use of orthogonal mappings preserves geometric structure, enabling task-agnostic applicability and minimizing computational overhead. The hubness-robust loss further sharpens prototype discrimination, facilitating more reliable open-vocabulary recognition.
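Hubness can be made concrete with a simple diagnostic: count how often each prototype is the nearest neighbor of a query. The sketch below is only such a diagnostic, not the THS loss, whose form is not given here; the function name and the toy similarity matrix are hypothetical.

```python
import numpy as np

def prototype_hub_share(similarity):
    """Fraction of queries whose most similar prototype is each column."""
    nearest = similarity.argmax(axis=1)
    counts = np.bincount(nearest, minlength=similarity.shape[1])
    return counts / counts.sum()

# Rows are queries, columns are prototypes (toy values).
similarity = np.array([
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.7, 0.6, 0.2],
    [0.2, 0.1, 0.5],
])
share = prototype_hub_share(similarity)
```

In this toy case the first prototype attracts three of the four queries and the second attracts none, which is the kind of dominant hub a hubness-suppression term is meant to penalize.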

Practically, GPUA enables deployment in scenarios with limited access to proprietary foundation models, supports extensibility to multiple visual backbones, and generalizes across diverse tasks—from global image-level recognition to patch-level dense prediction.

Theoretically, the success of geometry-preserving cross-modal alignment reinforces the isomorphism hypothesis observed in multilingual NLP and reveals analytic connections to K-means-style clustering and optimal transport in unsupervised data association. The fusion of perceptual geometry and semantic topology sets a foundation for future advances in multimodal representation learning.

Future Directions

Several limitations remain. GPUA does not explicitly model class imbalance, potentially affecting correspondence estimation on highly skewed datasets. Future work could explore adaptive weighting, uncertainty modeling in the alignment process, and deeper integration with hierarchical prototype selection. Extending geometry-preserving alignment to non-Euclidean and hierarchical semantic spaces, as well as leveraging joint visual-textual optimal transport, are promising avenues.

GPUA's compatibility with frozen models suggests utility in federated, privacy-sensitive deployments and emerging closed-API vision-language services. As foundation models continue to diversify, principled, unsupervised compatibility solutions such as GPUA will further impact open-vocabulary perception and cross-modal inference.

Conclusion

GPUA presents a principled, unsupervised vision-language alignment framework that fuses the perceptual geometry of VFMs with the semantic topology of VLMs, using a stable two-stage process anchored in correspondence mining and orthogonal feature translation. Empirical results demonstrate consistent, strong improvements in both zero-shot classification and open-vocabulary segmentation benchmarks, validating GPUA's accuracy-efficiency tradeoff and robust generalization without model parameter updates. The framework offers significant potential for practical multimodal model deployment and theoretical exploration in cross-modal representation learning (2606.04385).
