The Telephone Game: Evaluating Semantic Drift in Unified Models (2509.04438v1)

Published 4 Sep 2025 in cs.CV and cs.CL

Abstract: Employing a single, unified model (UM) for both visual understanding (image-to-text: I2T) and visual generation (text-to-image: T2I) has opened a new direction in Visual Language Model (VLM) research. While UMs can also support broader unimodal tasks (e.g., text-to-text, image-to-image), we focus on the core cross-modal pair T2I and I2T, as consistency between understanding and generation is critical for downstream use. Existing evaluations consider these capabilities in isolation: FID and GenEval for T2I, and benchmarks such as MME and MMBench for I2T. These single-pass metrics do not reveal whether a model that understands a concept can also render it, nor whether meaning is preserved when cycling between image and text modalities. To address this, we introduce the Unified Consistency Framework for Unified Models (UCF-UM), a cyclic evaluation protocol that alternates I2T and T2I over multiple generations to quantify semantic drift. UCF-UM formulates three metrics: (i) Mean Cumulative Drift (MCD), an embedding-based measure of overall semantic loss; (ii) Semantic Drift Rate (SDR), which summarizes the semantic decay rate; and (iii) Multi-Generation GenEval (MGG), an object-level compliance score extending GenEval. To assess generalization beyond COCO, which is widely used in training, we create a new benchmark, ND400, sampled from NoCaps and DOCCI, and evaluate on seven recent models. UCF-UM reveals substantial variation in cross-modal stability: some models, like BAGEL, maintain semantics over many alternations, whereas others, like Vila-u, drift quickly despite strong single-pass scores. Our results highlight cyclic consistency as a necessary complement to standard I2T and T2I evaluations, and provide practical metrics to consistently assess unified models' cross-modal stability and the strength of their shared representations. Code: https://github.com/mollahsabbir/Semantic-Drift-in-Unified-Models


Summary

  • The paper presents a cyclic evaluation protocol (UCF-UM) that reveals and quantifies semantic drift in unified multimodal models through repeated T2I and I2T conversions.
  • It employs embedding-based metrics, including Mean Cumulative Drift, Semantic Drift Rate, and Multi-Generation GenEval, to measure semantic consistency across cycles.
  • Empirical results show that models like BAGEL maintain superior semantic stability, while others exhibit rapid degradation in cross-modal fidelity.

Evaluating Semantic Drift in Unified Multimodal Models: The Telephone Game Protocol

Introduction and Motivation

Unified multimodal models (UMs) integrate both visual understanding (image-to-text, I2T) and visual generation (text-to-image, T2I) within a single architecture, enabling seamless cross-modal reasoning and synthesis. While recent advances have produced models with strong single-pass performance on isolated tasks, existing evaluation protocols fail to capture the semantic consistency of these models when alternating between modalities over multiple generations. The paper introduces the Unified Consistency Framework for Unified Models (UCF-UM), a cyclic evaluation protocol that quantifies semantic drift by repeatedly alternating I2T and T2I, analogous to the "Telephone Game" where information degrades as it is passed along (Figure 1).

Figure 1: (a) Unified models support both image generation and understanding. (b) UCF-UM cyclic evaluation reveals semantic drift: a suitcase disappears and banana count inflates over generations.

Limitations of Existing Evaluation Metrics

Traditional metrics such as FID, CLIPScore, and GenEval for T2I, and MME/MMBench for I2T, assess model performance in isolation. These single-pass metrics do not reveal whether a model that "understands" a concept can also "render" it, nor do they measure the preservation of entities, attributes, relations, and counts under repeated cross-modal conversions. The paper demonstrates that even state-of-the-art models like BAGEL can answer visual reasoning questions correctly (I2T) but fail to generate semantically consistent images (T2I) for the same concept (Figure 2).

Figure 2: BAGEL correctly answers "white side wins" in I2T, but fails to generate a matching chessboard in T2I, exposing unified inconsistency.

Unified Model Architectures

The paper categorizes unified models into three architectural paradigms:

  • Shared-Weights Unified Models: A single transformer decoder handles both I2T and T2I, with either shared or distinct encoders. Examples: BAGEL, Janus, Show-o, VILA-U.
  • Partially Shared Models: Some parameters are shared, but task-specific modules handle modality-specific complexities. Example: Blip-3o.
  • Decoupled Models: Independently trained models are composed to emulate unified behavior, e.g., LLaVA for I2T paired with Stable Diffusion for T2I (Figure 3).

    Figure 3: Unified model design paradigms: fully shared, partially shared, and fully decoupled architectures.

The Unified Consistency Framework (UCF-UM)

UCF-UM alternates between T2I and I2T over multiple generations, starting from either text or image, and measures semantic similarity back to the initial input using embedding-based metrics (CLIP, DINO, MPNet). The framework defines two cyclic chains:

  • Text-First-Chain: T(0) → I(1) → T(2) → I(3) → ⋯
  • Image-First-Chain: I(0) → T(1) → I(2) → T(3) → ⋯
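
The alternation above can be sketched as a simple loop over two model interfaces. The `t2i` and `i2t` callables below are placeholders standing in for a unified model's text-to-image and image-to-text directions (hypothetical names, not the paper's API); the toy demo uses strings for both modalities just to show the chain structure:

```python
# Sketch of a UCF-UM-style cyclic chain: T(0) -> I(1) -> T(2) -> I(3) -> ...
# `t2i` and `i2t` are placeholder callables for a unified model's two directions.
def run_chain(seed, t2i, i2t, n_generations, text_first=True):
    """Return the full sequence [seed, gen 1, gen 2, ...] of alternating outputs."""
    chain = [seed]
    current = seed
    to_image = text_first  # a text-first chain's first hop is text -> image
    for _ in range(n_generations):
        current = t2i(current) if to_image else i2t(current)
        chain.append(current)
        to_image = not to_image  # alternate modality every generation
    return chain

# Toy demo: strings stand in for both images and captions.
demo = run_chain("a cat", lambda t: f"img({t})", lambda i: f"cap({i})", 4)
```

An image-first chain is the same loop started with `text_first=False` and an image seed; in either case generation `n` is compared back to the seed at index 0.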

Semantic similarity is measured across all generations and modalities, exposing concept drift that single-pass metrics overlook (Figure 4).

Figure 4: UCF-UM cyclic evaluation alternates T2I and I2T, revealing cross-modal concept drift and semantic instability.
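
The similarity-to-seed measurement reduces to cosine similarity in whatever embedding space the backbone provides (CLIP or DINO for images, MPNet for text). The helper below is an illustrative sketch under that assumption, not the paper's code; embeddings are plain vectors here:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def drift_curve(embeddings):
    """Similarity of each generation's embedding back to the seed (index 0).

    A flat curve near 1.0 indicates stable semantics; a steep drop
    indicates rapid drift over the cyclic chain.
    """
    seed = embeddings[0]
    return [cosine(seed, e) for e in embeddings[1:]]
```

In practice each chain element would first be embedded with the modality-appropriate backbone before being passed to `drift_curve`.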

Quantifying Semantic Drift: Metrics

UCF-UM introduces three complementary metrics:

  • Mean Cumulative Drift (MCD): Embedding-based measure of overall semantic loss across generations.
  • Semantic Drift Rate (SDR): Power-law decay rate of semantic similarity, parameterized by α (initial similarity), β (decay rate), and γ (asymptotic baseline).
  • Multi-Generation GenEval (MGG): Object-level compliance score extending GenEval to multiple generations, using object detection to assess fidelity in attributes, counts, positions, and compositionality (Figure 5).

    Figure 5: Qualitative examples of semantic drift: loss of position, object misidentification, style change, count inflation, hallucination, and color inconsistency.
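
The summary states the metrics' roles but not their exact formulas, so the sketch below is one plausible instantiation of those descriptions: MCD as the average embedding loss (1 − similarity) over a chain, and SDR obtained by fitting a power-law decay s(n) ≈ γ + (α − γ)·n^(−β) with a coarse grid search. Both the formulas and the fitting procedure are assumptions for illustration, not the paper's implementation:

```python
def mean_cumulative_drift(sims):
    """MCD-style score: average semantic loss (1 - similarity) across a chain.

    `sims` holds each generation's similarity back to the seed; higher
    MCD means more overall drift. Illustrative reading of the metric.
    """
    return sum(1.0 - s for s in sims) / len(sims)

def fit_sdr(sims, betas=None, gammas=None):
    """Fit s(n) ~ gamma + (alpha - gamma) * n**(-beta) by grid search.

    alpha (initial similarity) is anchored to the first generation,
    beta is the decay rate, gamma the asymptotic baseline, matching
    the roles described for SDR. A coarse grid stands in for a proper
    nonlinear least-squares fit.
    """
    alpha = sims[0]
    betas = betas or [i / 20 for i in range(1, 61)]    # 0.05 .. 3.00
    gammas = gammas or [i / 20 for i in range(0, 20)]  # 0.00 .. 0.95
    best_err, best_beta, best_gamma = float("inf"), None, None
    for beta in betas:
        for gamma in gammas:
            err = sum(
                (s - (gamma + (alpha - gamma) * n ** -beta)) ** 2
                for n, s in enumerate(sims, start=1)
            )
            if err < best_err:
                best_err, best_beta, best_gamma = err, beta, gamma
    return {"alpha": alpha, "beta": best_beta, "gamma": best_gamma}
```

A flat SDR curve (small β, high γ) corresponds to the stable behavior reported for BAGEL; a steep one to the rapid drift seen for VILA-U.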

Experimental Setup

The evaluation uses ND400, a benchmark dataset sampled from NoCaps and DOCCI, stressing novel objects and fine-grained details. Seven recent models are benchmarked, spanning all three architectural paradigms. Chains are constructed for both text-first and image-first settings, and metrics are computed using appropriate embedding backbones.

Empirical Findings

Semantic Drift Patterns

Qualitative analysis reveals six distinct failure modes under cyclic inference: position inconsistency, object misidentification, style transition, quantity inconsistency, object hallucination, and color inconsistency. These errors compound over generations, even when single-step outputs appear plausible.

Quantitative Results

Figure 6: SDR curves for text-first and image-first chains. BAGEL exhibits the flattest decay, indicating superior semantic stability.

Figure 7: MGG results on GenEval Rewritten dataset. BAGEL maintains high accuracy across generations; VILA-U and Janus 1.3B lose more than half their score within a few generations.

Figure 8: Model comparison across MCD and MGG. BAGEL leads in both metrics; VILA-U lags. LLaVA+SDXL and Janus 1.3B show asymmetric performance.

  • BAGEL consistently maintains the strongest cross-modal stability, with slow semantic decay and high object-level fidelity.
  • VILA-U and Janus variants exhibit rapid drift, losing semantic alignment within a few generations despite competitive single-pass scores.
  • Show-o degrades more gracefully, with slower decay in later generations.
  • LLaVA+SDXL (decoupled) performs well on object-level tasks but struggles to preserve holistic semantics, indicating a disconnect between content and meaning.

Task-Specific Vulnerabilities

MGG breakdown by task shows that compositional tasks (positioning, attribute binding) are most susceptible to semantic drift. Initial performance is high for all models, but consistency collapses rapidly for complex tasks.

Implications and Future Directions

The results demonstrate that single-pass benchmarks overstate model robustness and fail to capture cross-modal inconsistencies. Cyclic evaluation is essential for reliable assessment of unified models, as it exposes semantic drift and architectural weaknesses. The findings suggest that architectural scale, training data diversity, and careful design of shared representations are critical for semantic stability. The UCF-UM protocol and metrics provide a practical foundation for future model development and evaluation.

Potential future directions include:

  • Integrating cyclic consistency objectives into training to mitigate drift.
  • Extending cyclic evaluation to more modalities (e.g., video, audio).
  • Developing new architectures that explicitly model cross-modal semantic alignment.
  • Investigating the relationship between drift rate and downstream task reliability.

Conclusion

The paper formalizes the semantic drift problem in unified multimodal models and introduces the Unified Consistency Framework (UCF-UM) for cyclic evaluation. Empirical results across seven models reveal substantial variability in cross-modal stability, with BAGEL demonstrating the strongest resistance to drift. The proposed metrics (MCD, SDR, MGG) and cyclic protocol expose hidden inconsistencies that single-pass evaluations cannot detect. These insights are critical for advancing the reliability and robustness of unified models in real-world applications.
