
CostNav: Unified Visual Representation

Updated 2 December 2025
  • CostNav is a unified framework integrating continuous and hybrid visual representations to bridge discriminative understanding with generative synthesis.
  • It employs cascaded encoders, optimal transport barycenters, and adaptive tokenization mechanisms to align and merge multi-modal features effectively.
  • Empirical evaluations demonstrate improved PSNR, accuracy, and robustness across tasks, streamlining multimodal machine learning pipelines.

CostNav refers to a class of architectural and algorithmic approaches in multimodal machine learning that seek to unify continuous visual representations across diverse tasks such as understanding, generation, editing, and cross-modal transfer. The objective is to resolve the historical split between discriminative, semantic-rich encodings for understanding and generative, detail-preserving codes for synthesis, so that a single model and feature space can optimally serve both roles. CostNav encompasses advances in continuous latent spaces, barycentric transport methods, cross-modal alignment, and hybrid continuous/discrete tokenization. The following survey details foundations, algorithmic strategies, theoretical underpinnings, benchmark results, and ongoing challenges.

1. Foundations of Unified Continuous Visual Representation

Traditional visual encodings are either discrete (e.g., vector-quantized codes) or continuous (e.g., VAE/Gaussian latents), with each approach entailing specific trade-offs for vision-language integration. Discrete representations, as realized in VQ-GAN architectures, facilitate autoregressive (AR) generation by aligning with large language model (LLM) tokenization, but quantization errors degrade semantic richness and limit discriminative capability for understanding. Conversely, continuous spaces (as in VAEs or masked autoregressive representations) enable high-fidelity reconstructions and nuanced feature geometry but complicate AR decoding and often underperform in semantic alignment.
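
As a minimal illustration of this trade-off, the sketch below contrasts a VQ-style discrete bottleneck with a Gaussian (VAE-style) continuous latent. The tensor shapes, codebook, and encoder output are placeholders rather than components of any specific model.

```python
import torch

# Placeholder encoder output and codebook; shapes are illustrative.
batch, dim, codebook_size = 4, 64, 512
features = torch.randn(batch, dim)           # continuous encoder features
codebook = torch.randn(codebook_size, dim)   # VQ-GAN-style codebook

# Discrete path: snap each feature to its nearest codebook entry.
# The index sequence is LLM-friendly, but quantization discards detail.
dists = torch.cdist(features, codebook)      # (batch, codebook_size)
indices = dists.argmin(dim=-1)               # discrete tokens
quantized = codebook[indices]                # reconstructed (lossy) features

# Continuous path: a Gaussian (VAE-style) latent keeps fine detail,
# but provides no discrete symbols for plain autoregressive decoding.
mu, logvar = features, torch.zeros_like(features)
z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()

print(indices.shape, quantized.shape, z.shape)
```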

CostNav-inspired models seek unified representations by designing architectures where a single continuous (or hybrid continuous-discrete) latent space is used for both understanding and generation. This eliminates format mismatches while retaining the advantages of both regimes. The representation space may be constructed via cascaded encoders, optimal transport barycenters, dynamic token allocation, or cross-modal alignment mechanisms (Fan et al., 17 Mar 2025, Liu et al., 1 Dec 2025, Tang et al., 27 May 2025, Huang et al., 8 Oct 2025, Chen et al., 3 Nov 2025).

2. Core Algorithmic Approaches

2.1 Cascaded Encoder Architectures

Several systems instantiate the unified space by composing two or more encoders in sequence. For example, TUNA cascades a causal VAE (Wan 2.2) for initial low-dimensional latent extraction, followed by a semantic representation encoder (e.g., SigLIP 2 or DINOv3) whose first patch-embedding is adapted to match the stride of the VAE output. The resulting features are projected into the token space used by the autoregressive decoder (Liu et al., 1 Dec 2025). This design allows end-to-end training on both understanding (language modeling loss) and generation (flow matching in latent space), with the unified continuous space serving as the locus for both information flows.
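
The sketch below illustrates the cascaded-encoder pattern described above: a VAE-style encoder, a stride-matched semantic encoder, and a projection into the decoder's token space. All module choices, dimensions, and strides are illustrative assumptions and do not reproduce the TUNA implementation.

```python
import torch
import torch.nn as nn

class CascadedEncoder(nn.Module):
    """Minimal sketch of a cascaded unified encoder: a VAE-style encoder
    produces low-dimensional latents, a semantic encoder refines them, and
    an MLP projects the result into the decoder's token space."""

    def __init__(self, vae_dim=16, sem_dim=768, token_dim=1024, patch=2):
        super().__init__()
        # Stand-in for a causal VAE encoder: pixels -> low-dim latents.
        self.vae_encoder = nn.Conv2d(3, vae_dim, kernel_size=8, stride=8)
        # Stand-in for a semantic encoder (SigLIP/DINO-style) whose patch
        # embedding stride is matched to the VAE latent grid.
        self.sem_patch = nn.Conv2d(vae_dim, sem_dim, kernel_size=patch, stride=patch)
        self.sem_blocks = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=sem_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Projection into the autoregressive decoder's token space.
        self.proj = nn.Linear(sem_dim, token_dim)

    def forward(self, images):                      # (B, 3, H, W)
        z = self.vae_encoder(images)                # (B, vae_dim, H/8, W/8)
        s = self.sem_patch(z)                       # (B, sem_dim, h, w)
        s = s.flatten(2).transpose(1, 2)            # (B, h*w, sem_dim)
        s = self.sem_blocks(s)
        return self.proj(s)                         # unified continuous tokens

tokens = CascadedEncoder()(torch.randn(2, 3, 256, 256))
print(tokens.shape)  # (2, num_tokens, 1024)
```

The same unified token stream can then be supervised with a language-modeling loss for understanding and a flow-matching loss for generation, as described in the text.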

2.2 Optimal Transport Barycenter Spaces

BaryIR introduces a mathematically rigorous continuous barycenter space via multi-source optimal transport (OT) barycenters (Tang et al., 27 May 2025). Given $K$ sources of degraded images (e.g., different artifacts), it computes a latent map $T: \mathcal{Z} \to \mathcal{Z}_B$ that projects per-source features into a shared, compact, degradation-agnostic subspace. The barycenter objective,

$$\mathcal{L}^* = \sup_{\sum_k \lambda_k f_k = 0}\; \inf_{T}\; \sum_{k=1}^{K} \lambda_k\, \mathbb{E}_{z_k}\bigl[C_k(z_k, T(z_k)) - f_k(T(z_k))\bigr],$$

with custom transport cost $C_k$ (incorporating contrastive and orthogonality regularization), yields a unified subspace $\mathcal{Z}_B$ capturing common content and orthogonal per-source fibers for domain-specific information.
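
A minimal sketch of this objective is given below, assuming a simple quadratic transport cost in place of BaryIR's regularized cost and small MLPs for the map $T$ and the potentials $f_k$. In training, $T$ would minimize and the potentials would maximize this quantity in alternating steps.

```python
import torch
import torch.nn as nn

# Illustrative dimensions and uniform barycenter weights.
dim, K = 128, 3
lambdas = torch.full((K,), 1.0 / K)

# Shared barycenter map T and per-source dual potentials f_k (stand-in MLPs).
T = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
potentials = nn.ModuleList(nn.Linear(dim, 1) for _ in range(K))

def zero_sum(fs):
    # Enforce sum_k lambda_k * f_k = 0 by subtracting the lambda-weighted mean.
    mean = sum(l * f for l, f in zip(lambdas, fs))
    return [f - mean for f in fs]

def barycenter_loss(latents):                      # list of K tensors (B, dim)
    fs = zero_sum([potentials[k](T(z)) for k, z in enumerate(latents)])
    loss = 0.0
    for k, z in enumerate(latents):
        # Quadratic stand-in for C_k(z_k, T(z_k)); BaryIR adds contrastive
        # and orthogonality terms to this cost.
        cost = (z - T(z)).pow(2).sum(dim=-1, keepdim=True)
        loss = loss + lambdas[k] * (cost - fs[k]).mean()
    return loss

print(barycenter_loss([torch.randn(8, dim) for _ in range(K)]))
```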

2.3 Hybrid and Adaptive Tokenization

Hybrid approaches such as UniToken (Jiao et al., 6 Apr 2025) and CDD-VT (Chen et al., 3 Nov 2025) combine discrete and continuous tokens: both quantized indices (VQ-GAN) and high-level continuous features (e.g., SigLIP ViT) are concatenated or interleaved within the unified token stream. CDD-VT further advances this by adaptively determining, per sample and per patch, the allocation between discrete primitives and dense continuous codes based on local complexity, with a learned allocation network ensuring effective coverage of the latent manifold.
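
The following sketch illustrates per-patch allocation between a discrete codebook lookup and a dense continuous code, gated by a learned complexity score. The allocation network, threshold, and hard routing are illustrative assumptions, not the CDD-VT mechanism.

```python
import torch
import torch.nn as nn

dim, codebook_size = 256, 1024
codebook = nn.Embedding(codebook_size, dim)                      # discrete primitives
alloc_net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

def hybrid_tokenize(patches, threshold=0.5):          # patches: (B, N, dim)
    complexity = torch.sigmoid(alloc_net(patches))    # (B, N, 1) per-patch score
    # Discrete path: nearest-codebook quantization of every patch.
    flat = patches.reshape(-1, patches.size(-1))
    dists = torch.cdist(flat, codebook.weight)        # (B*N, codebook_size)
    quantized = codebook(dists.argmin(dim=-1)).reshape(patches.shape)
    # Complex patches keep their dense continuous code; simple patches use
    # the discrete primitive, giving a per-patch hybrid token stream.
    use_continuous = (complexity > threshold).float()
    return use_continuous * patches + (1.0 - use_continuous) * quantized

tokens = hybrid_tokenize(torch.randn(2, 196, dim))
print(tokens.shape)  # (2, 196, 256)
```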

2.4 Multi-scale and Dynamic Summarization

Chat-UniVi (Jin et al., 2023) and related systems reduce token redundancy and mediate between detail and semantics by clustering latent patch features into dynamic, multi-scale tokens: spatial or temporal clustering identifies key regions and events, and three-stage merging yields a hierarchy where coarse tokens encode high-level concepts and fine tokens retain spatial accuracy.
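
As a minimal illustration of dynamic token reduction, the sketch below clusters patch features with a plain k-means pass and collapses each cluster to its mean; Chat-UniVi's actual DPC-kNN clustering and three-stage merging are not reproduced here.

```python
import torch

def merge_tokens(patch_feats, num_clusters=16, iters=10):
    """Cluster (N, dim) patch features with naive k-means and return one
    merged token per cluster (the cluster mean)."""
    centers = patch_feats[torch.randperm(patch_feats.size(0))[:num_clusters]].clone()
    for _ in range(iters):
        assign = torch.cdist(patch_feats, centers).argmin(dim=-1)   # (N,)
        for c in range(num_clusters):
            members = patch_feats[assign == c]
            if members.numel() > 0:
                centers[c] = members.mean(dim=0)
    return centers                                                   # (num_clusters, dim)

coarse = merge_tokens(torch.randn(576, 768), num_clusters=16)
print(coarse.shape)  # 576 patch tokens reduced to 16 dynamic tokens
```

Varying the number of clusters yields the coarse-to-fine hierarchy described above: fewer clusters give high-level concept tokens, more clusters retain spatial detail.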

2.5 Pixel-Native Frameworks

UniModel (Zhang et al., 21 Nov 2025) exemplifies a visual-only paradigm by mapping both text and images into the same pixel space. Textual prompts are rasterized into painted images, so all tasks (understanding/generation) are formalized as pixel-to-pixel diffusion via a single transformer backbone, with fully shared representation space and loss—eliminating the need for vision-language adapters.
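
The sketch below illustrates only the rasterization step that puts text and images into the same RGB tensor space; the rendering parameters are illustrative, and the shared pixel-to-pixel diffusion backbone is omitted.

```python
import numpy as np
import torch
from PIL import Image, ImageDraw

def rasterize_prompt(prompt, size=(256, 256)):
    """Render a text prompt onto a white canvas and return it as a (3, H, W)
    float tensor in [0, 1], i.e., the same space as any target image."""
    canvas = Image.new("RGB", size, color="white")
    ImageDraw.Draw(canvas).text((8, 8), prompt, fill="black")
    return torch.from_numpy(np.asarray(canvas).copy()).permute(2, 0, 1).float() / 255.0

text_image = rasterize_prompt("a red cube on a table")
target_image = torch.rand(3, 256, 256)   # placeholder target sample
# Both sides of the task now live in the same RGB tensor space, so a single
# pixel-to-pixel diffusion backbone could map text_image -> target_image
# (generation) or target_image -> a rendered caption image (understanding).
print(text_image.shape, target_image.shape)
```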

3. Mathematical Formulation and Losses

Unified continuous spaces are optimized by blending objectives tailored for both synthesis and interpretation. Key components include:

  • Reconstruction/Predictive Losses: Standard VAE mean squared error for reconstituting inputs, augmented with flow-matching (rectified flow, diffusion) losses for generation (e.g., $\mathcal{L}_{\rm FM}$ in TUNA (Liu et al., 1 Dec 2025); $\mathcal{L}_{\rm flow}$ in UniModel (Zhang et al., 21 Nov 2025)).
  • Language Modeling/Cross-Entropy Losses: For understanding, the decoder is supervised with autoregressive cross-entropy over text tokens (e.g., $\mathcal{L}_{\rm LM}$), occasionally masked over continuous features.
  • Contrastive and Alignment Losses: Used to ensure cross-modal coherence (e.g., $\mathcal{L}_{\rm align}$ in BaryIR, CLIP-based losses in Chat-UniVi and Video-LLaVA).
  • Regularization/Bottlenecking: Orthogonality and contrastive regularizers (BaryIR), channel averaging (MingTok (Huang et al., 8 Oct 2025)), dynamic density control (CDD-VT), or direct spatial compression.
  • Adaptive Balancing: Explicit hyperparameter control (e.g., the UniFluid $\lambda$ trade-off (Fan et al., 17 Mar 2025)) is deployed to manage the synthesis/understanding balance; see the sketch after this list.
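
A minimal sketch of such a blended objective, with an explicit $\lambda$-style weight between understanding and generation losses, is shown below; the specific weighting scheme is illustrative rather than that of any cited system.

```python
import torch

def unified_loss(lm_loss, flow_loss, align_loss, lam=0.5, align_weight=0.1):
    # lam trades off understanding (language-modeling loss) against generation
    # (flow-matching / diffusion loss); align_weight adds cross-modal coherence.
    return lam * lm_loss + (1.0 - lam) * flow_loss + align_weight * align_loss

total = unified_loss(torch.tensor(2.3), torch.tensor(0.8), torch.tensor(0.4))
print(total)  # scalar objective optimized during joint training
```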

4. Architectural Designs and Implementation

State-of-the-art systems implement diverse recipes for unifying tasks:

| Model | Representation Space | Encoder Stack | Token Type(s) | Generation Mechanism |
|---|---|---|---|---|
| UniToken | Discrete + continuous hybrid | VQ-GAN + SigLIP | VQ indices + continuous features | Decoder-only Transformer |
| TUNA | Cascaded continuous (VAE + semantic encoder) | Wan VAE + SigLIP 2 | Latents post-MLP | Qwen-2.5, flow matching/diffusion |
| Ming-UniVision | Continuous low-dim latents | ViT + transformer | Dense vectors | AR prediction + reconstruction |
| CDD-VT | Adaptive hybrid (codebook + continuous) | PatchViT + DQP/DPA | Adaptive sum (per patch) | Standard VQ pipeline |
| UniModel | RGB pixel space (text as image) | VAE enc., ViT, diffusion | Image pixels (float) | Rectified-flow diffusion |

Each approach varies in the nature and source of its unified space: from hybrid codebooks, to cascaded continuous embeddings, to pure pixel spaces (Jiao et al., 6 Apr 2025, Liu et al., 1 Dec 2025, Huang et al., 8 Oct 2025, Zhang et al., 21 Nov 2025).

5. Benchmark Performance and Empirical Insights

Unified continuous representations match or surpass state-of-the-art results across understanding and generation benchmarks:

  • TUNA achieves an object-focused GenEval score of 0.88 (vs. 0.68 for Show-o2) and high multi-modal understanding accuracy (MME 1461.5, MMStar 54.6), and also excels in image/video editing (Liu et al., 1 Dec 2025).
  • BaryIR displays +2–4 dB PSNR gains on OOD restoration tasks, with explicit barycenter learning preventing overfitting to domain-specific degradations (Tang et al., 27 May 2025).
  • Ming-UniVision (MingTok) attains rFID ≈ 0.38 at 32× compression, SOTA or near-SOTA in multi-round editing and attribute control, and equaling dual-branch models in VQA/QA mean accuracy (Huang et al., 8 Oct 2025).
  • CDD-VT matches or exceeds hybrid and discrete-only baselines; ImageNet top-1 accuracy 70.5% (vs. 69.9% for hybrid), rFID 0.31 (vs. 0.35), without loss of cross-modal alignment (Chen et al., 3 Nov 2025).
  • Video-LLaVA demonstrates that pre-aligned, continuous space enables joint image/video training with clear mutual benefit; on MSVD-QA, joint training increases accuracy from 64.8% to 70.7%, with similar improvements on image-only datasets (Lin et al., 2023).
  • UniModel reaches competitive FID and captioning metrics using solely pixel-based translation, with strong cycle-consistency between modalities (Zhang et al., 21 Nov 2025).

Ablation studies consistently demonstrate that a unified continuous feature space boosts generalization, enables parameter efficiency, and mitigates cross-modal interference.

6. Theoretical Analysis and Practical Implications

Theoretical results, as in BaryIR, offer error bounds on the learned barycenter map under convexity and duality gap assumptions (Tang et al., 27 May 2025). By enforcing shared low-dimensional subspaces and explicit geometric regularization (e.g., orthogonality, contrastive separation), generalization to unseen distributions or degradations is improved.

Practical implications:

  • Robust Generalization: Unified spaces support strong performance on out-of-distribution inputs, including unseen degradations (BaryIR) and novel camera poses (RoboUniView (Liu et al., 27 Jun 2024)).
  • Engineering Efficiency: Unified representation eliminates duplicated modules for separate tasks, simplifies pipelines (e.g., CDD-VT), and leverages end-to-end differentiability.
  • Seamless Multimodality: The same infrastructure supports images, video, text, and semantic editing, often yielding emergent controllability (e.g., multi-round editing, pixel-level cycles (Huang et al., 8 Oct 2025, Zhang et al., 21 Nov 2025)).

7. Open Challenges and Future Directions

Despite successes, issues remain:

  • Trade-off Calibration: Tuning the balance between synthesis and discrimination (e.g., UniFluid λ\lambda) remains empirical, with task interference possible if weighting is suboptimal (Fan et al., 17 Mar 2025).
  • Scalability: Very high-dimensional representations or large-scale training datasets increase resource demand; scalability to billion-scale data (demonstrated in CDD-VT (Chen et al., 3 Nov 2025)) partially alleviates this.
  • Interpretability: Understanding the geometry of the unified space, the degree of semantic disentanglement, and emergent compositionality remains an open research area.
  • Extension to New Modalities: Integration of point clouds, audio, haptics, or further temporal structures requires future work.

A plausible implication is that ongoing advances in unified continuous representation learning (CostNav and derivatives) will underpin next-generation generalist models, facilitating efficient cross-task adaptation, controllable generation, and robust reasoning in a common framework.
