Unified Continuous Visual Representation

Updated 2 December 2025
  • Unified continuous visual representation is a method that encodes images and videos into a shared, real-valued latent space for both generation and understanding without quantization.
  • It leverages end-to-end differentiability and architectures like autoregressive transformers and diffusion models to balance detailed reconstruction with semantic abstraction.
  • Empirical results demonstrate its superior performance across tasks such as text-to-image synthesis, captioning, VQA, and video processing, streamlining multimodal intelligence.

Unified continuous visual representation denotes a class of representation learning methods that embed visual data—such as images and videos—into a shared, real-valued latent space specifically constructed to support both generation (e.g., text-to-image synthesis) and understanding (e.g., captioning, VQA) within a single parametric model. In contrast to earlier schemes relying on discrete codebooks or task-specific encoders, unified continuous representations bypass quantization, enable end-to-end differentiability, and allow semantically and perceptually rich information to be efficiently shared and leveraged across multimodal tasks. This paradigm underlies a wave of contemporary multimodal autoregressive frameworks and diffusion architectures, including UniFluid (Fan et al., 17 Mar 2025), TUNA (Liu et al., 1 Dec 2025), Ming-UniVision (Huang et al., 8 Oct 2025), and Harmon (Wu et al., 27 Mar 2025).

1. Principles and Rationale for Unified Continuous Visual Representation

Unifying vision-language tasks demands representations that satisfy both the fine-grained fidelity requirements of generation and the semantic abstraction demanded by understanding. Discrete visual tokenizers (e.g., VQ-VAE codebooks) have traditionally dominated autoregressive paradigms, but they suffer from irreversible quantization error and loss of semantic capacity, and they force an artificial split between generation-friendly and understanding-friendly tokens (Huang et al., 8 Oct 2025). The shift to continuous representations addresses these limitations by directly encoding each image region or patch as a vector in ℝᵈ, sidestepping the trade-off between expressiveness and compactness imposed by discrete codebooks.
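
To make the contrast concrete, the sketch below (illustrative only; module names, dimensions, and the plain linear projection are assumptions, not details taken from the cited papers) juxtaposes a VQ-style tokenizer's hard codebook lookup with a continuous tokenizer that maps each patch feature to a real-valued vector in ℝᵈ and keeps the whole path differentiable.

```python
import torch
import torch.nn as nn

class DiscreteVQTokenizer(nn.Module):
    """VQ-style tokenizer: each patch feature snaps to its nearest codebook entry."""
    def __init__(self, patch_dim=768, codebook_size=8192, code_dim=32):
        super().__init__()
        self.proj = nn.Linear(patch_dim, code_dim)
        self.codebook = nn.Embedding(codebook_size, code_dim)

    def forward(self, patches):                                           # (B, N, patch_dim)
        z = self.proj(patches)                                            # (B, N, code_dim)
        # Squared distance to every codebook entry, then a hard, non-differentiable assignment.
        dists = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)   # (B, N, K)
        idx = dists.argmin(dim=-1)
        return self.codebook(idx), idx                                    # quantization error is irreversible

class ContinuousTokenizer(nn.Module):
    """Continuous alternative: each patch becomes a real-valued vector in R^d, end-to-end differentiable."""
    def __init__(self, patch_dim=768, latent_dim=32):
        super().__init__()
        self.proj = nn.Linear(patch_dim, latent_dim)

    def forward(self, patches):                                           # (B, N, patch_dim)
        return self.proj(patches)                                         # (B, N, latent_dim)
```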

The continuous latent spaces in models such as MingTok (Huang et al., 8 Oct 2025) or Harmon (Wu et al., 27 Mar 2025) can simultaneously provide compact, low-dimensional codes (for modeling efficiency in generation) and high-dimensional, discriminative semantics (for robust understanding), primarily through architectural and training innovations such as information bottlenecks, staged semantic feature expansion, and multi-objective masked pretraining.

Empirically, a carefully structured continuous embedding not only matches but often outperforms both discrete-token-based and decoupled, task-specific approaches on standard perception and generation benchmarks (Fan et al., 17 Mar 2025, Liu et al., 1 Dec 2025, Huang et al., 8 Oct 2025, Wu et al., 27 Mar 2025).

2. Architectural Instantiations and Mathematical Foundations

Unified continuous representations are realized via multiple network schemata:

  • Autoregressive Transformers with Continuous Tokens: Frameworks such as UniFluid (Fan et al., 17 Mar 2025) tokenize images into continuous visual codes via a fluid VAE encoder, producing a latent grid z = Enc_VAE(I) ∈ ℝ^{H×W×C} that is flattened into a token sequence and mapped via a linear projection to the model's embedding dimension. Discrete text tokens are processed alongside these codes in shared transformer layers, with modality-specific heads for text (softmax) and image (diffusion or regression); a minimal sketch follows this list.
  • Cascaded Encoders: TUNA (Liu et al., 1 Dec 2025) uses a VAE to downsample the visual input and injects semantic richness by passing the latents through a frozen large foundation model (e.g., SigLIP 2), followed by an MLP connector before the LLM transformer. This cascade produces tokens Z in a shared visual vector space optimized for both tasks via a joint loss.
  • Three-Stage Sequential Tokenizers: MingTok (Huang et al., 8 Oct 2025) processes images through low-level encoding (compact continuous tokens, masked feature losses), semantic expansion (upsampling into higher dimensions, supervised by text-aligned models such as CLIP), and pixel-level reconstruction. The overall loss is a convex combination of masked low-level, semantic, and reconstruction losses.
  • MAR/MAE-ViT Encoders: Harmon (Wu et al., 27 Mar 2025) uses a masked autoregressive/masked autoencoding transformer, masking random subsets of patches, encoding with positional embeddings, and reconstructing masked content via diffusion-based objectives. There is no quantization in the pipeline: the final representation is a sequence of per-patch vectors in ℝ^D, preserving nuanced semantic and visual detail.

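As a rough illustration of the autoregressive-with-continuous-tokens pattern from the first item above, the sketch below is a deliberately simplified, assumption-laden stand-in (not the UniFluid implementation): text tokens and continuous visual codes are projected into one embedding space, processed by shared causal transformer layers, and routed to a softmax head for text and a regression head for visual codes, whereas the cited works use a per-token diffusion head.

```python
import torch
import torch.nn as nn

class UnifiedARBackbone(nn.Module):
    """Sketch of a shared autoregressive backbone over mixed text / continuous visual tokens.
    Hypothetical module: the real models use per-token diffusion heads and the masking
    schemes described in the cited papers."""
    def __init__(self, vocab_size=32000, visual_dim=32, d_model=1024, n_layers=4, n_heads=8):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.visual_in = nn.Linear(visual_dim, d_model)      # continuous codes -> model width
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.text_head = nn.Linear(d_model, vocab_size)      # softmax logits for text tokens
        self.visual_head = nn.Linear(d_model, visual_dim)    # regression target for visual codes

    def forward(self, text_ids, visual_codes):
        # text_ids: (B, T_text) int64; visual_codes: (B, T_vis, visual_dim) float
        seq = torch.cat([self.text_embed(text_ids), self.visual_in(visual_codes)], dim=1)
        causal = nn.Transformer.generate_square_subsequent_mask(seq.size(1)).to(seq.device)
        h = self.backbone(seq, mask=causal)
        return self.text_head(h[:, :text_ids.size(1)]), self.visual_head(h[:, text_ids.size(1):])
```
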
Fundamental to these approaches is avoiding discrete quantization, thus precluding codebook collapse, mode dropping, and the statistical inefficiency of mapping semantic concepts to arbitrary token indices. Training objectives are adapted to support both per-token regression (diffusion, L2) and diversity/compactness constraints (information bottleneck, channel averaging).
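
For concreteness, a common form of the per-token regression objective is the ε-prediction diffusion loss applied independently to each continuous token and conditioned on the transformer hidden state; this is a generic MAR-style formulation, and the exact parameterization varies across the cited models:

```latex
\mathcal{L}_{\text{visual}}
  = \mathbb{E}_{z,\,t,\,\epsilon \sim \mathcal{N}(0,I)}
    \left[ \left\lVert \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\,z
      + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\; t,\; h\right) \right\rVert_2^2 \right]
```

where z is the ground-truth continuous token, h the autoregressive hidden state conditioning the diffusion head, and \bar{\alpha}_t the cumulative noise schedule.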

3. Joint Training, Trade-Offs, and Loss Balancing Strategies

Co-training for both visual generation and understanding is non-trivial due to the distinct requirements: generation favors compact continuous codes for modeling efficiency and fidelity (e.g., low FID), while understanding benefits from higher-dimensional, discriminative features aligning with language (Fan et al., 17 Mar 2025, Huang et al., 8 Oct 2025).

To reconcile these demands, a weighted sum of task losses is typically adopted. For instance, UniFluid uses:

  • L_Text = cross-entropy over text tokens
  • L_Visual = per-token diffusion L2 loss over continuous visual codes
  • L = L_Visual + λ * L_Text

Empirical sweeps on λ demonstrate that small λ (e.g., 0.005) preserves FID while retaining ~90% of understanding accuracy; larger λ shifts the model to favor understanding at the expense of generation fidelity. Systematic tuning of this trade-off parameter is critical for achieving balanced multimodal competence.
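
A minimal sketch of this loss balancing (hypothetical helper code; the visual term below is a plain per-token MSE stand-in for the per-token diffusion loss actually used) might look like:

```python
import torch.nn.functional as F

def unified_loss(text_logits, text_targets, visual_pred, visual_target, lam=0.005):
    """Weighted combination of understanding and generation objectives: L = L_Visual + lam * L_Text.
    Hypothetical sketch; the MSE term stands in for the per-token diffusion loss."""
    l_text = F.cross_entropy(text_logits.flatten(0, 1), text_targets.flatten())
    l_visual = F.mse_loss(visual_pred, visual_target)
    # Small lam (e.g., 0.005) preserves generation fidelity; larger lam shifts capacity
    # toward understanding, mirroring the trade-off sweep described above.
    return l_visual + lam * l_text
```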

Architectures such as TUNA further demonstrate that strong pretrained semantic encoders and joint optimization, rather than decoupled or late-fusion pipelines, are essential for mutual reinforcement between tasks. Notably, ablations show that models trained only for generation or only for understanding underperform jointly trained models on both axes (Liu et al., 1 Dec 2025).

4. Modality and Task Unification: Image, Video, and Language

Unified continuous visual representations readily extend to video input, multi-modal instruction tuning, and in-context multi-round workflows. Video support is typically implemented by folding the temporal dimension into the token batch and employing independent or windowed semantic encoding (e.g., TUNA (Liu et al., 1 Dec 2025)), or via dynamic temporal clustering and token merging for token-efficient input (e.g., Chat-UniVi (Jin et al., 2023)).
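
The "fold the temporal dimension into the token batch" step amounts to a reshape before per-frame encoding and a flatten afterward; the sketch below is a hedged illustration in which the tensor layout and the frame_encoder callable are assumptions.

```python
import torch

def video_to_tokens(video: torch.Tensor, frame_encoder) -> torch.Tensor:
    """video: (B, T, C, H, W) clip. Encodes each frame independently, then concatenates
    the per-frame tokens along the sequence axis so the downstream transformer sees one
    long token sequence per clip."""
    b, t, c, h, w = video.shape
    frames = video.reshape(b * t, c, h, w)   # fold time into the batch dimension
    tokens = frame_encoder(frames)           # (B*T, N, D) continuous tokens per frame
    n, d = tokens.shape[1], tokens.shape[2]
    return tokens.reshape(b, t * n, d)       # (B, T*N, D) unified visual sequence
```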

Joint image–video models such as Video-LLaVA (Lin et al., 2023) and Chat-UniVi (Jin et al., 2023) demonstrate that with appropriate alignment before projection and shared continuous representation, a single model can robustly process and cross-condition on both spatial and temporal modalities, achieving state-of-the-art performance on image and video Q&A, and yielding mutual enhancement across tasks.

For instruction-following and multi-turn workflows, a unified continuous space enables seamless mixed-modality dialogs, iterative editing, multimodal instruction sequences, and unification of captioning, VQA, and text-to-image synthesis under a single tokenization and prediction framework (Huang et al., 8 Oct 2025, Liu et al., 1 Dec 2025).

5. Information Bottleneck, Regularization, and Representation Geometry

To mitigate overfitting and ensure that continuous latent codes are both discriminative and compact, several regularization strategies are deployed:

  • Channel Averaging and Dimension Reduction: MingTok (Huang et al., 8 Oct 2025) and related architectures compress high-dimensional patch representations to a reduced dimension d (e.g., 16 or 32), enforcing a bottleneck that prioritizes informational efficiency (see the sketch after this list).
  • Information Bottleneck Objectives: Explicit minimization of I(X;Z) subject to a reconstruction constraint is applied, in line with information-theoretic principles (Huang et al., 8 Oct 2025).
  • Contrastive and Orthogonality Losses: For multi-source scenarios (BaryIR (Tang et al., 27 May 2025)), optimal transport barycenter losses, along with contrastive and orthogonality regularization, are used to decompose latent spaces into “barycenter” (degradation-agnostic) and source-specific (degradation-unique) subspaces.
  • Masking and Multi-Scale Fusion: Masked image modeling and multi-scale clustering synthesize representations capturing both low-level details and high-level global semantics (Wu et al., 27 Mar 2025, Jin et al., 2023).
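
A minimal sketch of the channel-reduction bottleneck referenced in the first item (illustrative; the layer sizes and module names are assumptions, and the actual designs combine this with masked-feature, semantic, and reconstruction losses):

```python
import torch.nn as nn

class BottleneckHead(nn.Module):
    """Compresses high-dimensional patch features to a small latent width d, acting as an
    information bottleneck before reconstruction or semantic expansion."""
    def __init__(self, in_dim=1024, bottleneck_dim=32):
        super().__init__()
        self.down = nn.Linear(in_dim, bottleneck_dim)   # compact continuous code (d << in_dim)
        self.up = nn.Linear(bottleneck_dim, in_dim)     # expansion path used for reconstruction

    def forward(self, patch_features):                  # (B, N, in_dim)
        z = self.down(patch_features)                   # (B, N, bottleneck_dim)
        return z, self.up(z)                            # code + reconstruction pathway
```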

Empirical t-SNE visualizations and ablation studies confirm that these bottlenecks and regularizers promote generalization, avoid latent collapse, and foster latent spaces in which semantic and visual information is cleanly separated where necessary (Tang et al., 27 May 2025).

6. Empirical Results and Comparative Evaluation

Unified continuous visual representation models consistently outperform both their discrete and decoupled counterparts across standard benchmarks in understanding, generation, image editing, and zero-shot adaptation:

| Model | FID / GenEval | QA Avg | Captioning | Notable Strengths |
| --- | --- | --- | --- | --- |
| UniFluid (0.7B) | 8.39 | 60.3 | 120.3 | Simultaneous strong T2I and QA (Fan et al., 17 Mar 2025) |
| TUNA (1.5B) | 0.88 (GenEval) | 61.4 | — | SOTA on image/video, editing (Liu et al., 1 Dec 2025) |
| Ming-UniVision | 0.85 | 78.5* | — | Large-scale masked modeling (Huang et al., 8 Oct 2025) |
| Harmon (1.5B) | 12.8 | 81% | — | Near-semantic-only accuracy, SOTA gen (Wu et al., 27 Mar 2025) |
| Chat-UniVi | — | 84.2* | — | 3-scale dynamic tokens, video+image (Jin et al., 2023) |

*MMBench overall accuracy.

Cycle consistency and multi-round, in-latent-space editing (e.g., MingTok, Harmon) further demonstrate the flexibility of these representations, while video-capable frameworks maintain efficient token utilization and temporal coherence.

A plausible implication is that unified continuous spaces will accelerate convergence on all-in-one multimodal modeling paradigms, diminish the need for modality-specific pretraining, and allow flexible scaling to new domains (e.g., 3D, robotics) (Liu et al., 1 Dec 2025, Wu et al., 27 Mar 2025).

7. Open Challenges and Future Prospects

Despite their empirical strengths, unified continuous representations face open questions in scalability (especially for very high-resolution images and long videos), interpretability of the latent dimensions, and task interference as model and dataset scale increase. The information bottleneck principle, the balancing of semantic and detail fidelity, and adaptivity to out-of-distribution conditions remain active research areas.

Recent advances—such as adaptive token allocation (CDD-VT (Chen et al., 3 Nov 2025)) and direct pixel-to-pixel diffusion (UniModel (Zhang et al., 21 Nov 2025))—suggest that even tighter unification along the axes of model, task, and representation is feasible, potentially obviating the need for any non-differentiable or quantized interfaces.

Unified continuous visual representation has become a central paradigm for bridging vision and language, offering a mathematically principled, empirically validated, and increasingly scalable approach foundational to next-generation multimodal intelligence.
