Integration of Vision-Language Representations with FUSION
The paper "FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding" introduces an advanced approach to multimodal learning by integrating vision-language representations through the entire processing pipeline of multimodal LLMs (MLLMs). The authors present FUSION, a new family of models that radically depart from traditional methods which typically rely on late-stage modality interaction during the LLM decoding phase. The intent is to achieve dynamic, deep integration of visual and linguistic modalities from the outset, improving cross-modal understanding.
Core Contributions
FUSION's methodology is founded on several core innovations. First, Text-Guided Unified Vision Encoding incorporates textual information directly into the vision encoding stage to enable pixel-level integration. This helps align textual and visual representations more closely and addresses the static nature of conventional vision encoder outputs.
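As a rough illustration of the idea, the sketch below shows one way text conditioning could be injected into a ViT-style encoder block via cross-attention, so patch features are shaped by the question from the earliest encoding stage. The module structure, names, and dimensions are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch (not the paper's code): a vision encoder block whose
# patch features attend to projected text embeddings via cross-attention.
import torch
import torch.nn as nn

class TextGuidedVisionBlock(nn.Module):
    """Self-attention over patches, then cross-attention to text tokens."""
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, patches: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # patches: (B, N_patches, dim), text: (B, N_text_tokens, dim)
        h = self.norm1(patches)
        patches = patches + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(patches)
        patches = patches + self.cross_attn(h, text, text, need_weights=False)[0]
        return patches + self.mlp(self.norm3(patches))

if __name__ == "__main__":
    block = TextGuidedVisionBlock()
    patches = torch.randn(2, 196, 768)  # 14x14 patch grid
    text = torch.randn(2, 32, 768)      # projected question embeddings
    print(block(patches, text).shape)   # torch.Size([2, 196, 768])
```

A stack of such blocks would yield text-aware patch features rather than a fixed, question-independent visual representation.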
Secondly, the model employs Context-Aware Recursive Alignment Decoding. This technique leverages interaction layers to recursively refine visual features conditioned on the current textual context during decoding. The dynamic nature of this method allows for fine-grained semantic integration at question-level granularity, effectively bridging modality gaps present in conventional MLLMs.
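Under similar assumptions, the sketch below illustrates how a set of latent vision tokens might be recursively conditioned on the decoder's current text hidden states and then re-read from the vision features. It is a plausible rendering of the idea, not the paper's exact interaction-layer design.

```python
# Illustrative sketch: latent vision query tokens are refined over several
# rounds, alternating attention to the current textual context and to the
# encoder's visual features.
import torch
import torch.nn as nn

class RecursiveAlignmentLayer(nn.Module):
    def __init__(self, dim: int = 1024, num_heads: int = 8, num_rounds: int = 2):
        super().__init__()
        self.num_rounds = num_rounds
        self.attend_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attend_vision = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_t, self.norm_v = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, queries, text_hidden, vision_feats):
        # queries: (B, Q, dim) latent vision tokens fed to the LLM
        # text_hidden: (B, T, dim) current decoder hidden states
        # vision_feats: (B, N, dim) vision encoder features
        for _ in range(self.num_rounds):
            # Condition the queries on the current textual context...
            queries = queries + self.attend_text(
                self.norm_t(queries), text_hidden, text_hidden, need_weights=False)[0]
            # ...then re-read the visual features under that conditioning.
            queries = queries + self.attend_vision(
                self.norm_v(queries), vision_feats, vision_feats, need_weights=False)[0]
        return queries
```

Because the refinement is repeated as decoding progresses, the visual context the LLM sees can track the evolving question rather than remaining fixed after a single projection step.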
A notable feature of FUSION is the introduction of a Dual-Supervised Semantic Mapping Loss. This component aims to reinforce feature alignment between the text and vision modalities, mitigating semantic discrepancies and guiding the feature mappings through bidirectional supervision. In principle, this loss helps maintain consistent semantic alignment across the processing stages.
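One plausible form of such bidirectional supervision is sketched below: vision features mapped into the text space are pulled toward the text features, and vice versa, using a cosine objective. The paper's exact loss formulation may differ; the projection heads and pooling choice here are assumptions.

```python
# Illustrative sketch of a bidirectional (dual-supervised) mapping loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualMappingLoss(nn.Module):
    def __init__(self, vision_dim: int = 1024, text_dim: int = 1024):
        super().__init__()
        self.v2t = nn.Linear(vision_dim, text_dim)  # vision -> text space
        self.t2v = nn.Linear(text_dim, vision_dim)  # text -> vision space

    def forward(self, vision_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # Pool token sequences into summary vectors before alignment.
        v = vision_feats.mean(dim=1)  # (B, vision_dim)
        t = text_feats.mean(dim=1)    # (B, text_dim)
        # Supervise the mapping in both directions.
        loss_v2t = 1 - F.cosine_similarity(self.v2t(v), t, dim=-1).mean()
        loss_t2v = 1 - F.cosine_similarity(self.t2v(t), v, dim=-1).mean()
        return loss_v2t + loss_t2v
```

The symmetric structure is what makes the supervision "dual": neither modality is treated purely as the target of the other.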
The authors supplement the architectural innovations with a novel Synthesized Language-Driven QA dataset, emphasizing language-centric integration. The dataset, constructed through generative techniques, features high-quality question-answer pairs that are primarily driven by complex textual descriptions. This dataset is strategically used for optimizing text-driven feature integration during the training process.
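A highly simplified sketch of such a language-driven synthesis step is shown below. The function synthesize_qa and the llm_generate callable are hypothetical placeholders for illustration only, not the authors' actual pipeline.

```python
# Hypothetical sketch: a detailed text description seeds an LLM prompt that
# yields question-answer pairs, which can then be paired with the
# corresponding image for training. `llm_generate` is a placeholder for
# whatever text-generation backend is used.
import json
from typing import Callable, Dict, List

def synthesize_qa(description: str,
                  llm_generate: Callable[[str], str],
                  num_pairs: int = 3) -> List[Dict[str, str]]:
    prompt = (
        f"Based on the following scene description, write {num_pairs} "
        f"question-answer pairs as a JSON list of objects with "
        f"\"question\" and \"answer\" fields.\n\nDescription: {description}"
    )
    raw = llm_generate(prompt)
    return json.loads(raw)  # assumes the backend returns valid JSON
```

The key point is that the questions are driven by rich textual descriptions, keeping the supervision language-centric by construction.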
FUSION was developed at two model scales: FUSION 3B and FUSION 8B. Notably, the 3B variant outperformed Cambrian-1 8B and Florence-VL 8B across most benchmarks while using fewer vision tokens, highlighting the efficiency and effectiveness of fully integrated modality fusion. Moreover, FUSION 3B remained competitive with as few as 300 vision tokens, underscoring the robustness of its dynamic design under reduced token budgets.
Implications and Future Directions
The deep integration strategy proposed by FUSION has both theoretical and practical implications. Theoretically, it shifts the paradigm of multimodal learning towards more cognitive-inspired models, where modalities continuously interact throughout the entire computational pipeline. Practically, such deep integration can lead to improved efficiency and performance in multimodal tasks, with reduced computational overhead compared to models relying heavily on increased visual token counts.
These results suggest promising advancements in developing adaptive, multimodal AI systems. Future research could explore further enhancing dynamic interactions between modalities, potentially through continuous feedback mechanisms or improved semantic mapping techniques. As multimodal models increasingly influence areas such as autonomous driving, healthcare imaging, and interactive AI, methodologies like those in FUSION will be crucial for addressing complex, real-world tasks demanding nuanced vision and language understanding.