LLaVA Visual Feature Extraction
- LLaVA-extracted visual features are intermediate token representations generated by vision-language models that map image patches into language model embeddings.
- They integrate spatial, semantic, and instruction-driven modalities to support tasks such as visual question answering, spatial reasoning, and scene understanding.
- Adaptive token pruning and multi-scale fusion techniques enable efficient computation and improved performance across diverse multimodal benchmarks.
LLaVA-extracted visual features refer to the intermediate token representations produced by LLaVA-style multimodal vision-language models (VLMs), encompassing a family of architectures and algorithms designed to integrate image and video patch features into large language models (LLMs) for tasks such as visual question answering (VQA), spatial reasoning, scene understanding, and multimodal instruction following. These features are characterized by their origin in vision transformers (typically CLIP-ViT or SigLIP), their dimensional structure (patch tokens mapped to the LLM embedding space), and, in state-of-the-art extensions, the addition of high-resolution, spatial, semantic, and text- or instruction-guided modalities. The pipeline for feature extraction, transformation, and injection into LLMs defines a critical component of multimodal LLM (MLLM) design and directly impacts task performance across diverse benchmarks.
1. Architectural Foundations and Raw Visual Tokens
The canonical LLaVA pipeline employs a frozen vision encoder, typically CLIP-ViT-L/14, which processes an input image (or video frame) into a grid of patch embeddings. For an input of size $H \times W$, the image is partitioned into $N = (H/p) \times (W/p)$ square patches of side $p$ (e.g., $N = 24 \times 24 = 576$ for CLIP-ViT-L/14 at $336 \times 336$ pixels), each patch represented by a hidden vector $v_i \in \mathbb{R}^{d_v}$, where $d_v$ is the encoder's embedding dimension (e.g., $1024$ for CLIP-ViT-L/14) (Yu et al., 2024, Cocchi et al., 19 Mar 2025, Chen et al., 30 Apr 2025, Lou et al., 1 Jul 2025).
These visual tokens are then projected (often by a single linear layer, occasionally a multi-layer MLP) into the LLM embedding space of dimension $d_{\text{LLM}}$ (typically $4096$ for Vicuna-7B), yielding the final set of visual tokens used during autoregressive decoding (Yu et al., 2024). This sequence is prepended or concatenated to the text tokens, so visual context is directly available during language modeling.
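For concreteness, a minimal sketch of this feature path is given below, assuming a Hugging Face CLIP checkpoint and a LLaVA-1.5-style two-layer MLP projector; the checkpoint name, penultimate-layer selection, and projector shape are illustrative assumptions drawn from common LLaVA-1.5 configurations rather than a reproduction of any specific release.

```python
# Minimal sketch of the LLaVA-style feature path: frozen CLIP-ViT patch tokens
# projected into the LLM embedding space (assumed configuration, not an exact release).
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPVisionModel, CLIPImageProcessor

encoder_name = "openai/clip-vit-large-patch14-336"
vision_tower = CLIPVisionModel.from_pretrained(encoder_name).eval()
processor = CLIPImageProcessor.from_pretrained(encoder_name)

image = Image.new("RGB", (336, 336))  # placeholder image
pixel_values = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    out = vision_tower(pixel_values, output_hidden_states=True)
# Penultimate layer, CLS token dropped -> (1, 576, 1024) patch tokens.
patch_tokens = out.hidden_states[-2][:, 1:, :]

d_vision, d_llm = patch_tokens.shape[-1], 4096   # 4096 for Vicuna-7B
projector = nn.Sequential(                       # LLaVA-1.5-style 2-layer MLP (assumed)
    nn.Linear(d_vision, d_llm), nn.GELU(), nn.Linear(d_llm, d_llm)
)
visual_tokens = projector(patch_tokens)          # (1, 576, 4096), prepended to text tokens
```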
Recent extensions include:
- Use of alternative vision backbones (e.g., DINOv2, SigLIP, SigLIP2) with distinct patch sizes and channel dimensions, influencing token granularity and semantic consistency (Cocchi et al., 19 Mar 2025).
- Multi-scale or multi-frame aggregation for video and high-resolution image understanding (Zhao et al., 9 Jan 2025, Zhang et al., 2024).
2. Enhanced Spatial and Semantic Feature Extraction
Standard patch tokens capture primarily local information and global averages. However, detailed scene understanding and fine-grained reasoning benefit from spatially enriched representations:
- LLaVA-SP introduces six additional spatial visual tokens derived via convolutional kernels from the 2D grid of ViT embeddings. Two extraction modes are provided:
- Cropping: tokens extracted from concentric crops, progressing from center to entire image (local-to-global).
- Pooling: tokens produced by adaptive average pooling, simulating abstract-to-specific progression (Lou et al., 1 Jul 2025).
A cross-attention module (the Detail Feature Integrator, DFI) further fuses these spatial tokens with fine-grained feature maps, augmenting both local and global information. The final visual sequence is then the 576 patch tokens plus the six spatial tokens ($576 + 6 = 582$ in total), all in LLM space (Lou et al., 1 Jul 2025); a simplified pooling-mode sketch appears after this list.
- LLaVA-UHD v2 constructs a high-resolution semantic pyramid via progressive up-sampling and content-adaptive convolution (Joint Bilateral Upsampling, JBU), generating multi-scale feature maps that are compressed by hierarchical window attention into an ordered grid of tokens for multi-slice integration (Zhang et al., 2024). This inverse semantic pyramid preserves both low-level detail and semantic granularity, significantly boosting OCR and fine-detail perception benchmarks.
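The pooling-mode extraction referenced above can be approximated with a simplified sketch: the 24×24 ViT patch grid is adaptively pooled at six progressively finer scales, and each pooled map is reduced to a single token by a small per-scale linear head. The scale schedule and the linear heads are illustrative assumptions standing in for LLaVA-SP's convolutional extractor and DFI module.

```python
# Simplified sketch of pooling-mode spatial tokens in the spirit of LLaVA-SP
# (scale schedule and per-scale heads are assumptions, not the paper's extractor).
import torch
import torch.nn as nn
import torch.nn.functional as F

patch_tokens = torch.randn(1, 576, 1024)                    # (B, N, C) from CLIP-ViT-L/14@336
B, N, C = patch_tokens.shape
grid = patch_tokens.transpose(1, 2).reshape(B, C, 24, 24)   # restore the 2D patch layout

scales = (1, 2, 3, 4, 5, 6)                                 # abstract -> specific (assumed)
heads = nn.ModuleList(nn.Linear(C * s * s, C) for s in scales)

spatial_tokens = []
for s, head in zip(scales, heads):
    pooled = F.adaptive_avg_pool2d(grid, output_size=s)     # (B, C, s, s)
    spatial_tokens.append(head(pooled.flatten(1)))          # one token per scale: (B, C)
spatial_tokens = torch.stack(spatial_tokens, dim=1)         # (B, 6, C)

# These six tokens would then pass through the projector (and, in LLaVA-SP,
# the DFI cross-attention) before being appended to the 576 patch tokens.
```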
3. Text-Guided and Instruction-Driven Feature Modulation
- TG-LLaVA injects learnable latent embeddings as textual queries into the vision encoder, producing masks and selection weights that reweight or supplement visual features at both global and local scales (Yan et al., 2024). Instruction embeddings cross-attend to global pooled text, generating optimization masks for visual features. Local-patch embeddings extract detail-rich tokens via cross-attention to fine-grained text and high-res image patches.
This process yields a concatenated token output consisting of globally guided features and high-resolution detail tokens, each matched semantically to the LLM's text tokens; a hedged sketch of the mask-based reweighting follows this list.
- LLaVA-Octopus, targeting video, employs three parallel projectors (a static per-frame connector, a spatial-temporal connector, and a long-range token-compressed connector), each producing an aligned token grid. Instruction-driven adaptive weighting is computed from language features embedded by BERT, generating softmax-normalized weights applied to the respective feature sets (Zhao et al., 9 Jan 2025). The fused feature matrix provides dynamic selection of spatiotemporal cues appropriate to the query semantics, as sketched below.
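The text-guided reweighting described for TG-LLaVA can be approximated by the following sketch, in which learnable latent queries attend to a pooled instruction embedding and emit a sigmoid mask over patch tokens; the layer sizes, single attention block, and gating function are assumptions for illustration only.

```python
# Hedged sketch of text-guided reweighting in the spirit of TG-LLaVA
# (latent-query count, attention block, and sigmoid gate are assumptions).
import torch
import torch.nn as nn

d = 1024
visual = torch.randn(1, 576, d)                    # patch tokens
text_global = torch.randn(1, 1, d)                 # pooled instruction embedding

latents = nn.Parameter(torch.randn(1, 576, d))     # one learnable query per patch (assumed)
cross_attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
to_mask = nn.Linear(d, 1)

guided, _ = cross_attn(latents, text_global, text_global)  # queries read the text
mask = torch.sigmoid(to_mask(guided))              # (1, 576, 1) selection weights
visual_guided = visual * mask                      # text-conditioned visual features
```

Similarly, the instruction-driven projector fusion in LLaVA-Octopus can be sketched as a softmax-weighted mixture of the three connector outputs, with weights predicted from a sentence-level instruction embedding; the tiny weighting head and tensor shapes below are assumptions.

```python
# Minimal sketch of instruction-driven projector fusion (weighting head is assumed).
import torch
import torch.nn as nn

d_llm, n_tokens = 4096, 576
frame_feats   = torch.randn(1, n_tokens, d_llm)    # static per-frame connector
st_feats      = torch.randn(1, n_tokens, d_llm)    # spatial-temporal connector
compact_feats = torch.randn(1, n_tokens, d_llm)    # long-range compressed connector

instruction_emb = torch.randn(1, 768)              # e.g., a BERT [CLS] embedding
weight_head = nn.Linear(768, 3)

w = torch.softmax(weight_head(instruction_emb), dim=-1)               # (1, 3)
stacked = torch.stack([frame_feats, st_feats, compact_feats], dim=1)  # (1, 3, N, d)
fused = (w[:, :, None, None] * stacked).sum(dim=1)                    # (1, N, d)
```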
4. 3D and 4D Scene Tokenization: Cubist and Spatiotemporal Omnidirectionality
- LLaVA-Cube reconstructs 3D scenes from multi-view 2D images via NeRF-style grid-based feature fields. Key innovations include:
- Joint radiance and feature field reconstruction: at each 3D grid point, both view-invariant and view-dependent representations are stored and mapped via small MLPs to VLM space (Petit et al., 20 Nov 2025).
- Hierarchical 3D segmentation is performed via embedding clustering and CLIP centroid refinement, yielding trees of objects, parts, sub-parts.
- Omni-directional, object-centric token sampling: tokens are cast across all object parts and sub-parts, balanced between semantic (view-invariant, $f_{\mathrm{VI}}$) and relational (view-dependent, $f_{\mathrm{VD}}$) features, and ordered deterministically (by world-centric polar angle and radius) to serve as “virtual images.” Each object segment’s tokens are presented as an ordered sequence (a toy ordering sketch appears at the end of this section).
This “Cubist painter” approach achieves significant gains on 3D VQA and grounding, outperforming standard multi-view and NeRF-token pipelines on ScanQA, MSR3D, and Replica benchmarks.
- LLaVA-4D extends to (x, y, z, t) spatiotemporal embeddings by encoding pixelwise world coordinates and timestamps into dynamic 4D prompts via learnable Fourier features, optical-flow-based motion scaling, and explicit disentanglement of spatial and temporal components (Zhou et al., 18 May 2025). These prompts are fused into visual features via cross-attention blocks, enabling both static and dynamic reasoning, spatiotemporal grounding, and temporal captioning.
This formulation uniquely supports dynamic object tracking and temporal queries, surpassing purely 3D LMMs on the Chat4D and ScanQA benchmarks; a minimal Fourier-prompt sketch also follows below.
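The deterministic, world-centric token ordering used for the “virtual images” above can be illustrated as follows: sampled object points are sorted lexicographically by polar angle and then radius around the scene origin. The exact ordering key is an assumption for illustration; the paper's convention may differ.

```python
# Sketch of a deterministic, world-centric ordering for object-centric 3D tokens
# (the angle/radius lexicographic key is an illustrative assumption).
import torch

points = torch.randn(128, 3)                      # sampled 3D locations on an object
features = torch.randn(128, 1024)                 # their view-invariant features

x, y = points[:, 0], points[:, 1]
angle = torch.atan2(y, x)                         # polar angle in the ground plane
radius = points[:, :2].norm(dim=-1)

# Lexicographic sort: primary key angle, secondary key radius.
order = torch.argsort(radius, stable=True)
order = order[torch.argsort(angle[order], stable=True)]
ordered_tokens = features[order]                  # sequence fed to the LLM as a "virtual image"
```

The 4D coordinate prompts of LLaVA-4D can likewise be sketched with learnable Fourier features over (x, y, z, t); the frequency-matrix size and MLP head below are assumptions, and the paper's optical-flow-based temporal scaling and spatial/temporal disentanglement are omitted in this toy version.

```python
# Hedged sketch of learnable Fourier features over (x, y, z, t) coordinates.
import math
import torch
import torch.nn as nn

class FourierPrompt(nn.Module):
    def __init__(self, n_freq=64, d_out=1024):
        super().__init__()
        self.freqs = nn.Parameter(torch.randn(4, n_freq))   # learnable frequencies (assumed size)
        self.mlp = nn.Sequential(nn.Linear(2 * n_freq, d_out), nn.GELU(),
                                 nn.Linear(d_out, d_out))

    def forward(self, coords):                    # coords: (B, N, 4) = (x, y, z, t)
        proj = 2 * math.pi * coords @ self.freqs  # (B, N, n_freq)
        enc = torch.cat([proj.sin(), proj.cos()], dim=-1)
        return self.mlp(enc)                      # per-token 4D prompt

prompts = FourierPrompt()(torch.rand(1, 576, 4))  # fused into visual features via cross-attention
```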
5. Layer Selection, Adaptive Pruning, and Feature Utility
- Extensive layerwise analysis reveals substantial task-dependent variation in the utility of CLIP-ViT layers:
- Shallow layers (1–12) encode fine visual detail but weak textual semantics; middle layers (13–20) excel at counting and localization; deep layers (21–24) are essential for OCR and text-image alignment (Chen et al., 30 Apr 2025).
A lightweight fusion (e.g., concatenating features from a shallow, a middle, and a deep layer, followed by a linear projection) robustly improves overall performance across MLLM tasks; see the fusion sketch after this list.
- Adaptive token reduction, based on autoencoder-style reconstruction and a Gumbel-Softmax selection mechanism, allows retention of only the most informative tokens (Allakhverdov et al., 20 Mar 2025). For OCR and visually dense tasks, pruning down to 50% of tokens via learned selection preserves nearly full accuracy (≈0.5% loss), whereas random selection results in substantial degradation. For general reasoning, retention as low as 30% of tokens yields minimal impact, enabling scalable, low-overhead inference (a generic selection sketch follows).
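A minimal sketch of the lightweight layer fusion mentioned above: hidden states from a shallow, a middle, and a deep CLIP-ViT layer are concatenated channel-wise and mapped back down by a single linear layer. The specific layer indices are illustrative, not those prescribed by the cited analysis.

```python
# Minimal sketch of lightweight multi-layer fusion (layer indices are assumed).
import torch
import torch.nn as nn

hidden_states = [torch.randn(1, 576, 1024) for _ in range(24)]  # per-layer patch tokens
selected = [hidden_states[i] for i in (5, 17, 23)]              # shallow / middle / deep (assumed)

fuse = nn.Linear(3 * 1024, 1024)
fused = fuse(torch.cat(selected, dim=-1))        # (1, 576, 1024), then projected to LLM space
```

The adaptive token reduction can likewise be sketched with a per-token Gumbel-Softmax keep/drop gate during training and simple top-k selection at inference; the scoring head and straight-through gating below are generic building blocks rather than the cited paper's exact architecture.

```python
# Sketch of adaptive token reduction with a Gumbel-Softmax keep/drop gate
# (generic building blocks, not the cited paper's exact design).
import torch
import torch.nn as nn
import torch.nn.functional as F

tokens = torch.randn(1, 576, 1024)
scorer = nn.Linear(1024, 2)                       # logits for (drop, keep)

logits = scorer(tokens)                           # (1, 576, 2)
gate = F.gumbel_softmax(logits, tau=1.0, hard=True)[..., 1:]   # (1, 576, 1) in {0, 1}
kept = tokens * gate                              # differentiable masking during training

# At inference, simply keep the top-k highest-scoring tokens (e.g., 50% for OCR).
keep_ratio = 0.5
k = int(keep_ratio * tokens.shape[1])
idx = logits[..., 1].topk(k, dim=1).indices       # indices of retained tokens
pruned = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))
```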
Below is a summary table contrasting major LLaVA-extracted feature variants:
| Approach | Feature Source | Augmentation/Selection Mechanism | Application Domains |
|---|---|---|---|
| Standard LLaVA | CLIP-ViT patch tokens | Single or fused layers, linear projection | VQA, captioning |
| LLaVA-SP | CLIP-ViT patch + spatial tokens | Convolutional extractor, cross-attention | Detail, grounding |
| TG-LLaVA | CLIP-ViT patch tokens | Text-guided latent masks, high-res detail | Instruction-guided VQA |
| LLaVA-Cube | Multi-view SigLIP tokens, NeRF | 3D field, hierarchical segmentation, omni-sampling | 3D VQA/grounding |
| LLaVA-4D | CNN/ViT video tokens | (x,y,z,t) Fourier-coord prompts, disentanglement | 4D dynamic scene |
| LLaVA-Octopus | SigLIP tokens via MLP / STC / compression connectors | Instruction-driven adaptive fusion | Video QA, long-video |
| Token Pruning | CLIP-ViT patches | Autoencoder+Gumbel-Softmax selection | Efficient inference |
6. Mechanistic Interpretability and Practical Guidance
Recent mechanistic interpretability analysis (Yu et al., 2024) demonstrates that visual token patch embeddings (post-projection) retain high semantic fidelity, allowing identification of critical image regions via log-probability-increase metrics. Attention head analysis reveals strong correspondence between visual and text QA circuits, with in-context learning mechanisms generalized from textual to visual tokens.
LLaVA’s architectural transparency enables practitioners to extract, interpret, and debug visual feature paths; patch-wise attributions and headwise sensitivities clarify how local and global features contribute to prediction, and fine-tuning protocols recommend selective freezing of backbones and connectors to preserve learned utility.
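As an illustration of the patch-wise attribution described above, the following toy sketch ablates each projected visual token in turn and records the drop in the target answer token's log-probability; `forward_logprob` is a hypothetical stand-in for a full LLaVA forward pass, not an actual API.

```python
# Illustrative patch-wise log-probability attribution (forward_logprob is hypothetical).
import torch

def patch_attribution(visual_tokens, forward_logprob):
    """Return per-token importance as the log-prob drop when the token is zeroed."""
    base = forward_logprob(visual_tokens)
    scores = torch.zeros(visual_tokens.shape[1])
    for i in range(visual_tokens.shape[1]):
        ablated = visual_tokens.clone()
        ablated[:, i] = 0.0                       # remove one patch token
        scores[i] = base - forward_logprob(ablated)
    return scores                                 # large score = critical image region

# Dummy usage with a toy scoring function standing in for the real model.
toy = lambda v: v.mean()
importance = patch_attribution(torch.randn(1, 576, 4096), toy)
```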
7. Quantitative Impact and Performance Trends
Across multimodal and domain-specific benchmarks, enriched and adaptively selected LLaVA-extracted features deliver consistent performance gains:
- LLaVA-Cube reports +6.85 CIDEr points from transitioning to object-centric omni-sampling, +2.09 from deterministic token ordering, and up to +146% mIoU on 3D segmentation over prior NeRF-based methods (Petit et al., 20 Nov 2025).
- LLaVA-UHD v2’s hierarchical window transformer yields +3.7% average improvement on 14 benchmarks, +9.3% on DocVQA (Zhang et al., 2024).
- TG-LLaVA’s text-guided modules bring a 1–3 point average improvement at marginal extra compute, without augmented data (Yan et al., 2024).
- Adaptive token pruning preserves nearly all accuracy at 2–3× lower computational cost on OCR (≈0.5% loss at 50% token retention), but the retention ratio should be tuned separately for visually intensive versus reasoning-oriented tasks (Allakhverdov et al., 20 Mar 2025).
- Instruction-driven projector fusion in LLaVA-Octopus achieves superior performance versus uniform fusion, boosting VideoMME accuracy by +2.1% over baseline (Zhao et al., 9 Jan 2025).
These results highlight that the method of extracting, enriching, and selectively modulating LLaVA visual features is pivotal to high-performing, scalable multimodal models; task-specific selection and fusion of tokens, tailored to the data, the task, and the available semantic guidance, is central to current best practices.