2D Vision-Language Models

Updated 22 December 2025
  • 2D VLMs integrate CNN/ViT-based image encoders with transformer language modules to jointly process images and text via cross-modal mappings.
  • They employ both dual-encoder contrastive alignment and end-to-end fusion architectures to achieve robust zero-shot recognition and compositional reasoning.
  • Applications range from medical imaging and CAD-based 3D synthesis to autonomous driving, highlighting their versatility in real-world tasks.

A vision-language model (VLM) integrates visual and textual modalities within a unified system to jointly extract, align, and reason over information from images and natural language. In 2D domains, VLMs process flat images—natural photographs, diagrams, CAD drawings, or low-level geometric renderings—by coupling specialized visual backbones (e.g., CNNs or Vision Transformers) with language modules (e.g., transformers, BERT-family encoders, LLM decoders) and learning cross-modal mappings. The modern VLM paradigm encompasses both dual-encoder alignment and end-to-end generative architectures, yielding models capable of zero-shot recognition, instruction following, compositional reasoning, scene understanding, and medical or scientific inference. A wave of recent research elucidates the strengths, limitations, and best practices of 2D VLMs across a spectrum of real-world and adversarial tasks. The following sections detail the prevailing approaches, recent experimental findings, and design challenges specific to 2D VLMs.

1. Core Architectures and Training Regimes

2D VLMs are characterized by their integration of image encoders—typically CNNs or Vision Transformers—and text encoders, often combining these via either (a) dual-encoder contrastive alignment or (b) joint fusion in autoregressive or encoder-decoder LLMs.

  • Dual-encoder models: CLIP, BLIP, LiT, SigLIP, and RADIOv2.5 use two separate towers to encode images and prompts, aligning their representations in a shared latent space through a contrastive loss. Zero-shot classification is performed by maximizing cosine similarity between normalized image and prompt embeddings; a minimal sketch of this protocol follows this list. Prompt ensembling increases robustness, with handcrafted templates and class-name sets (e.g., OpenAI⁺, WordNet) shifting results by several percentage points (Volkov et al., 11 Sep 2025, Cooper et al., 3 Oct 2024).
  • Multimodal fusion models: End-to-end VLMs with cross-attention between vision and language (e.g., Flamingo) and instruction-tuned decoders (e.g., LLaVA, Qwen2.5-VL-3B, Gemma 3) support conditional generation and visual question answering. Co-attention, merged-attention, and cross-attention are the most common fusion mechanisms for routing information between modalities, particularly in domains such as high-resolution medical imaging (Zheng et al., 29 Oct 2025).
  • Instruction-tuning and adapters: Lightweight adaptation strategies such as LoRA inject small low-rank perturbations into pre-trained VLMs, enabling data-efficient fine-tuning for dense spatial tasks (e.g., 2D keypoint estimation) while freezing the vast majority of parameters (Duangprom et al., 28 Aug 2025); a minimal LoRA sketch also appears below.
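
As an illustration of the dual-encoder protocol above, the sketch below performs zero-shot classification by cosine similarity with prompt ensembling. The `encode_image` and `encode_text` callables and the template set are placeholders standing in for any CLIP-style model's encoders, not the API of a specific library.

```python
# Minimal sketch of dual-encoder zero-shot classification with prompt ensembling.
# `encode_image` and `encode_text` are stand-ins for a CLIP-style model's encoders.
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def build_class_embeddings(class_names, templates, encode_text):
    """Average normalized text embeddings over prompt templates for each class."""
    class_embs = []
    for name in class_names:
        prompts = [t.format(name) for t in templates]       # e.g. "a photo of a {}."
        embs = l2_normalize(encode_text(prompts))            # (T, D)
        class_embs.append(l2_normalize(embs.mean(axis=0)))   # ensembled prompt -> (D,)
    return np.stack(class_embs)                               # (C, D)

def zero_shot_classify(images, class_embs, encode_image):
    """Predict the class whose ensembled prompt embedding is most cosine-similar."""
    img_embs = l2_normalize(encode_image(images))             # (N, D)
    logits = img_embs @ class_embs.T                           # cosine similarities, (N, C)
    return logits.argmax(axis=1)
```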
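
The following is a minimal sketch of a LoRA adapter wrapping a frozen linear layer, of the kind referenced in the instruction-tuning bullet above; the rank, scaling, and injection points are illustrative assumptions rather than the settings used in the cited work.

```python
# Minimal LoRA sketch: a frozen linear layer plus a trainable low-rank update.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the pre-trained weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen path plus low-rank update: W x + (alpha / r) * B A x
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Usage: wrap selected projection layers of a frozen VLM with LoRALinear and
# train only the A/B matrices, keeping the trainable parameter count small.
```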

2. Zero-shot Recognition, Fusion, and Task Complementarity

The primary practical advantage of 2D VLMs is their ability to perform robust zero-shot classification and recognition across wide class vocabularies—provided prompt engineering and fusion protocols are optimized.

  • Language-guided vs. vision-only classification: In ImageNet-1k, vision-only k-NN using reference embeddings can rival or surpass language-only zero-shot classifiers, with optimal neighbor counts k ≈ 9–11. Language and vision modalities exhibit complementary strengths: language-based approaches excel on classes with strong textual cues and elusive prototype appearances, while vision-based k-NN excels in fine-grained and visually diverse classes (Volkov et al., 11 Sep 2025).
  • Fusion without retraining: A practical fusion strategy, dynamic per-class precision-based routing, selects between the vision and language predictions at inference time, yielding a modest accuracy gain without any model retraining (a sketch of this routing scheme appears after this list). This method outperforms static model selection, and a related idea appears in routing experiments that use lightweight LLMs to pick the optimal VLM per task (Volkov et al., 11 Sep 2025, Cooper et al., 3 Oct 2024).
  • Prompt and template sensitivity: Accuracies can vary by 3–4 percentage points depending on prompt formulation or class-name selection. OpenAI⁺ class names with prompt ensembling consistently maximize zero-shot robustness (Volkov et al., 11 Sep 2025).
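
The following sketch illustrates the per-class precision-based routing idea discussed above: per-class precision is estimated for each modality on a held-out split, and each test prediction is taken from whichever modality is more precise for the class it predicts. The function names and the exact routing rule are illustrative assumptions, not the published protocol.

```python
# Hedged sketch of dynamic per-class routing between a vision k-NN classifier
# and a language (prompt-based) zero-shot classifier.
import numpy as np

def per_class_precision(preds, labels, num_classes):
    """Precision of each predicted class on a held-out validation split."""
    prec = np.zeros(num_classes)
    for c in range(num_classes):
        mask = preds == c
        prec[c] = (labels[mask] == c).mean() if mask.any() else 0.0
    return prec

def route_predictions(vision_preds, language_preds, vision_prec, language_prec):
    """Per sample, keep the modality whose predicted class has higher held-out precision."""
    take_vision = vision_prec[vision_preds] >= language_prec[language_preds]
    return np.where(take_vision, vision_preds, language_preds)

# Usage (held-out split -> precisions; test split -> routed predictions):
# v_prec = per_class_precision(v_val_preds, val_labels, num_classes)
# l_prec = per_class_precision(l_val_preds, val_labels, num_classes)
# fused  = route_predictions(v_test_preds, l_test_preds, v_prec, l_prec)
```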

3. Internals of Visual Representation and Perceptual Grounding

Recent work has elucidated the geometry and processing stages within 2D VLMs, highlighting both their ability to recover human-like perceptual axes and marked deficiencies in low-level vision.

  • Representational alignment with human similarity spaces: Embedding spaces derived from large VLMs (e.g., GPT-4o, Qwen2.5-VL-72B) tightly align with principal axes of human perceptual judgments for natural categories (correlations of ~0.75 per axis), outperforming both CLIP and ResNet baselines. These "denoised" AI spaces can even predict human category labels with higher explained variance (up to 90%) than human-derived psychological spaces (83.5%) (Sanders et al., 22 Oct 2025).
  • Stagewise processing: Object recognition proceeds in a two-stage process: shallow layers of ViTs encode local attributes (texture, color), with conceptual semantics (object category) emerging only in deeper layers (post-layer-12). This separation mirrors physiological pathways in human visual systems and has implications for segmenting attribute and global category pathways in future VLM architectures (Li et al., 23 Sep 2025).
  • Spatial reasoning and positional encoding: Geometric analysis of rotary positional encodings (RoPE) reveals that relative spatial relations (left/right, front/behind) become separable via direction vectors in high-dimensional embedding space, and that scaling RoPE can amplify otherwise weak spatial signals, significantly improving spatial-reasoning performance (Li et al., 23 Sep 2025); an illustrative RoPE sketch follows this list.
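
As a concrete reference point for the positional-encoding discussion above, the following is a generic sketch of rotary positional encoding with a position-scaling knob; how a particular VLM applies or rescales RoPE over 2D patch grids varies by model, so the `scale` parameter and the half-split rotation here are illustrative only, not the cited method.

```python
# Generic RoPE sketch with a position-scaling factor (half-split rotation variant).
import numpy as np

def rope(x, positions, scale=1.0, base=10000.0):
    """Apply rotary positional encoding to x of shape (seq, dim) given integer positions (seq,)."""
    seq, dim = x.shape
    assert dim % 2 == 0
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)              # per-pair rotation frequencies
    angles = scale * positions[:, None] * freqs[None, :]   # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

# e.g. tokens on a 4x4 patch grid, rotating by column index with a larger scale:
# x = np.random.randn(16, 64); cols = np.tile(np.arange(4), 4)
# x_rot = rope(x, cols, scale=2.0)
```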

4. Limitations: Visual Acuity, Neuropsychological Deficits, and Failure Modes

Multiple systematic investigations highlight critical deficits of current 2D VLMs in both low-level and mid-level vision, despite their semantic strengths.

  • Low-level and spatial failures: On tailored suites such as BlindTest, state-of-the-art VLMs (GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet) average only 58.6% on tasks involving overlap, intersection, counting geometric primitives, and grid structure—whereas human performance is near 100%. Failure modes include confusion with tangency, adjacent-label misreading, and shape-count biases, often invariant to image resolution or line thickness (Rahmanzadehgervi et al., 9 Jul 2024).
  • Neuropsychological benchmarking: When compared to human clinical norms across 51 perception/cognition subtests, commercial VLMs exhibit "clinically significant" deficits (>2 SD below the mean) in basic element judgment (size, length, orientation), figure-ground segmentation, occlusion/relation reasoning, and robustness to cue or configuration changes. In contrast, high-level object naming and match-to-sample tasks are often performed at superhuman levels, an instance of Moravec’s paradox: semantic association is learned, while the implicit routines of perceptual grouping and comparison are not (Tangtartharakul et al., 15 Apr 2025).
  • Projection bottlenecks and recognition vs. reasoning gap: While zero-shot recognition with frozen encoders and linear probing can exceed human accuracy on socio-cultural cues (92.7–93.3% on time/region), generative VLMs with LLM heads (OpenFlamingo, LLaMA-Adapter) degrade sharply on open-ended reasoning tasks, sometimes falling below human levels due to inadequate projection modules or excessive reliance on prompt priors (Zhang et al., 2023).

5. Application Domains and Specialized Pipelines

2D VLMs are being rapidly extended to specialized domains, often outperforming baseline models with minimal fine-tuning.

  • Medical imaging: In high-resolution mammography, integrating convolutional visual backbones with metadata-derived, template-encoded clinical reports via co-attention fusion yields superior AUCs (up to 0.9452 for calcification, 0.9320 for malignancy), outperforming ViT-based alternatives, with the architecture generalizing across global cohorts and public datasets (Zheng et al., 29 Oct 2025).
  • 2D to 3D synthesis: 2D VLMs are repurposed for tasks such as reconstructing parametric 3D objects (e.g., cabinets) from rasterized CAD drawings. A ViT vision encoder processes the 2D drawing, feeding latents into an LLM that autoregresses a Python script specifying object primitives and parameters. The system achieves >93% retrieval accuracy and F1 score for 3D assembly (Wang et al., 16 Dec 2024); an illustrative parametric script of this kind appears after this list. Similarly, 2D VLMs are leveraged for volumetric 3D object semantic analysis by tiling voxel slices and decoding structured descriptors (Dao et al., 27 Mar 2025).
  • Autonomous driving: MiniDrive demonstrates that a lightweight VLM with large-kernel CNN backbone, mixture-of-experts feature tokenization, and cross-attentive instruction adaptation can outperform far larger systems on multi-camera perception and driving QA benchmarks (e.g., DriveLM BLEU-4 = 49.70, 83M parameters versus 3.96B) (Zhang et al., 11 Sep 2024).
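
To make the drawing-to-script pipeline concrete, the snippet below shows the kind of parametric assembly program such a system might emit for a cabinet. The `Box`/`assemble` primitives and all dimensions are invented for illustration and do not reflect the actual output grammar of the cited work.

```python
# Hypothetical parametric script of the kind a drawing-to-3D VLM might generate.
from dataclasses import dataclass

@dataclass
class Box:
    name: str
    size: tuple      # (width, depth, height) in millimetres
    position: tuple  # (x, y, z) of the box origin

def assemble(parts):
    """Stand-in assembly step: report each primitive and its placement."""
    for p in parts:
        print(f"{p.name}: size={p.size} at {p.position}")

cabinet = [
    Box("left_panel",  (18, 400, 720), (0,   0, 0)),
    Box("right_panel", (18, 400, 720), (582, 0, 0)),
    Box("bottom",      (564, 400, 18), (18,  0, 0)),
    Box("top",         (564, 400, 18), (18,  0, 702)),
    Box("shelf",       (564, 380, 18), (18, 20, 350)),
]
assemble(cabinet)
```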

6. Interpretability, Compositionality, and Future Directions

Recent work expands the interpretability and compositional reasoning capacity of VLMs by incorporating hierarchical language structures, hard negative sampling, and differential relevance mapping.

  • Tree-augmented training: The 3VL approach parses captions into hierarchical trees of nested noun phrases, generating both positives and synthetic hard negatives at each level. Training with this structure, together with the Anchor/DiRe interpretability tools, improves compositional language concept (CLC) understanding and attribute/relation accuracy by several points over CLIP and NegCLIP (e.g., 78.25% vs. 66.13% on VG Attributes) (Yellinek et al., 2023); a generic sketch of hard-negative contrastive training appears after this list.
  • Modular plug-and-play layers: Distillation-based visual decoders and redundancy-reducing token compression layers accelerate inference by up to 50% with minimal accuracy loss, while structured positional encodings and "dual-stage" ViT heads provide opportunities for explicit 2D geometric reasoning and model debugging (Li et al., 23 Sep 2025).
  • Open problems and recommendations: To close foundational gaps, research suggests: early-fusion architectures allowing visual attention to be guided by language context (“region-guided early fusion”); explicit spatial reasoning losses and synthetic low-level tasks during pretraining; fine-tuned adapters on mid-level features; and adaptive vision backbones designed to efficiently represent relational and geometric primitives (Rahmanzadehgervi et al., 9 Jul 2024, Tangtartharakul et al., 15 Apr 2025).
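
As a generic illustration of hard-negative training in this spirit (not a reproduction of 3VL or NegCLIP), the sketch below scores an image against its matching caption and against K perturbed captions, treating the match as the positive class in a softmax; the caption-perturbation step that produces the negatives is assumed to happen upstream and is not shown.

```python
# Generic image-text contrastive loss with caption-derived hard negatives.
# The perturbed-caption embeddings (neg_txt_embs) are assumed to come from an
# upstream caption-perturbation step (e.g., swapping attributes or relations).
import torch
import torch.nn.functional as F

def contrastive_with_hard_negatives(img_emb, pos_txt_emb, neg_txt_embs, tau=0.07):
    """
    img_emb:      (B, D) normalized image embeddings
    pos_txt_emb:  (B, D) normalized embeddings of the matching captions
    neg_txt_embs: (B, K, D) normalized embeddings of K hard-negative captions
    """
    pos = (img_emb * pos_txt_emb).sum(-1, keepdim=True)        # (B, 1)
    neg = torch.einsum("bd,bkd->bk", img_emb, neg_txt_embs)    # (B, K)
    logits = torch.cat([pos, neg], dim=1) / tau                # positive is index 0
    targets = torch.zeros(img_emb.size(0), dtype=torch.long)
    return F.cross_entropy(logits, targets)
```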

7. Summary Table: Key Performance Metrics and Failure Modes

| Task / Benchmark | Best VLM Result | Human Norm | Known Failure / Implication |
|---|---|---|---|
| BlindTest low-level spatial tasks | 74.9% (Claude 3.5 Sonnet) | ~100% | Overlap, intersection, adjacency |
| Attribute axis alignment (similarity) | r ≈ 0.75 per axis (VLM) | — | Denser, "denoised" human structure |
| Mammography (AUC, calcification) | 0.9452 | — | Outperforms ViT-based approaches |
| COCO zero-shot compositionality | 78.25% (3VL) | — | Enhanced with tree-structured loss |
| Time/region recognition (WikiTiLo) | 92.7–93.3% (linear probe) | 67.4–62.4% | Open-ended reasoning degrades (OpenFlamingo) |
| VL-Checklist VG Object tasks | 89.28% (3VL) | — | CLIP baseline: 79.01% |

Limitations, interpretability, and further advances remain active research directions in the quest for robust, human-aligned, and operationally reliable 2D vision-language models.
