
ConvLLaVA: Efficient High-Res LMM Architecture

Updated 15 February 2026
  • ConvLLaVA is a large multimodal model that uses a five-stage hierarchical ConvNeXt backbone to efficiently encode high-resolution images.
  • It integrates successive downsampling and tailored retraining to drastically reduce token count and computational overhead compared to ViT-based models.
  • The model flexibly handles arbitrary image resolutions and aspect ratios, ensuring minimal performance loss across diverse vision-language tasks.

ConvLLaVA is a large multimodal model (LMM) architecture that leverages a hierarchical convolutional backbone to efficiently encode high-resolution visual inputs for downstream LLMs. By integrating a five-stage ConvNeXt as its visual encoder and introducing optimization strategies tailored for high-resolution processing, ConvLLaVA addresses both token redundancy and quadratic compute scaling that limit Vision Transformer (ViT)-based LMMs. It achieves highly competitive results on general and fine-grained vision-language benchmarks while maintaining flexibility for arbitrary image resolutions and aspect ratios (Ge et al., 2024).

1. Motivation and Background

High-resolution LMMs must balance computational efficiency with information retention. ViT-based encoders split images of size $H \times W$ into $P \times P$ patches, yielding

N_{\rm ViT} = \frac{H\,W}{P^2}

visual tokens and introducing self-attention complexity of

\mathcal{O}(N_{\rm ViT}^2) = \mathcal{O}\!\left(\frac{H^2 W^2}{P^4}\right).

At $1536 \times 1536$ resolution with $P = 16$, this results in $9216$ tokens, creating significant overhead in memory and FLOPs due to quadratic scaling.
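The token count and the quadratic attention cost above can be checked with a few lines of arithmetic (a worked illustration of the formulas, not library code):

```python
# Token count and attention-cost scaling for a ViT-style patch encoder,
# following N_ViT = H*W / P^2 from the text.

def vit_tokens(h: int, w: int, patch: int = 16) -> int:
    """Number of visual tokens for an H x W image with P x P patches."""
    return (h // patch) * (w // patch)

def attention_pairs(n_tokens: int) -> int:
    """Self-attention scores one token pair per entry: O(N^2) per layer."""
    return n_tokens * n_tokens

n = vit_tokens(1536, 1536, patch=16)
print(n)                   # 9216 tokens (96 x 96 patches)
print(attention_pairs(n))  # 84934656 token pairs per attention layer
```

The second number makes the quadratic blow-up concrete: at $1536$ px, each attention layer compares roughly $8.5 \times 10^7$ token pairs.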

Hierarchical backbones such as ConvNeXt sequentially downsample feature maps, producing a final spatial shape of $H/32 \times W/32$ after four stages. The token count for a four-stage ConvNeXt is

N_{\rm 4\text{-}stage} = \frac{H\,W}{32^2},

approximately one-quarter of ViT's count at equivalent resolution (with $P = 16$), while the depth-wise convolutions scale linearly as $\mathcal{O}(k^2 C N)$. Figure 1 in the source demonstrates an approximate $8\times$ reduction in compute for ConvNeXt versus ViT-L when processing high-resolution images (Ge et al., 2024).

2. Model Architecture and Visual Token Compression

ConvLLaVA replaces the ViT visual encoder with a ConvNeXt-L backbone, originally comprising four stages and a $4 \times 4$ convolutional patch embed, for up to $32\times$ downsampling. To address persistent redundancy at higher resolutions, ConvLLaVA extends the backbone with a fifth ConvNeXt stage, reaching $64\times$ total downsampling:

N_{\rm 5\text{-}stage} = \frac{H\,W}{64^2}.

For $1536 \times 1536$ images, this amounts to just $576$ visual tokens, a drastic reduction compared to conventional approaches.
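The $32\times$ and $64\times$ totals can be sketched numerically. The stem-$4\times$ / per-stage-$2\times$ breakdown below is our assumption about how the factors compose, consistent with the stated totals:

```python
# Visual token count after a ConvNeXt-style backbone: a 4x patchify stem,
# then a 2x downsample at each stage transition (assumed breakdown that
# reproduces the 32x four-stage and 64x five-stage totals in the text).

def feature_tokens(h: int, w: int, stages: int) -> int:
    factor = 4 * 2 ** (stages - 1)  # stem 4x, then 2x per stage transition
    return (h // factor) * (w // factor)

print(feature_tokens(1536, 1536, stages=4))  # 2304 tokens (32x downsampling)
print(feature_tokens(1536, 1536, stages=5))  # 576 tokens (64x, ConvLLaVA)
```

At $1536$ px the fifth stage cuts the token count by another $4\times$, landing on the same $576$ tokens a standard ViT produces at $336$ px.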

Two primary optimizations enable this architecture to perform on par with, or better than, ViT-based models at high resolutions:

  1. Visual Encoder Re-Tuning for High Resolution: The pretrained CLIP-ConvNeXt-L, optimized for images up to $320$ px, exhibits degraded performance when evaluated on larger inputs. ConvLLaVA addresses this through a three-stage training pipeline:
    • Projector initialization using $558$ K caption pairs.
    • Vision–language pretraining with $\sim 2$ M ShareGPT4V-PT caption pairs.
    • Instruction tuning with $665$ K LLaVA examples.
  During the latter two stages, the last $18$ ConvNeXt blocks, the projector, and the LLM adapter are unfrozen and trained to bridge the gap between low- and high-resolution regimes.
  2. Successive Downsampling (Fifth Stage): An additional ConvNeXt stage (six blocks) further downscales feature maps, maintaining the token count at $576$ for $1536$ px images without loss of fine-grained detail. Table 4 in the source shows that this enables improvements on tasks such as TextVQA and DocVQA as resolution increases (Ge et al., 2024).
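The three-stage freeze schedule described in point 1 can be sketched as a simple lookup. The stage and component names below are paraphrased from the text; the dict stands in for real parameter groups (no deep-learning framework is assumed):

```python
# Minimal sketch of which parameter groups are trainable at each stage of
# the training pipeline described above. Names ("projector_init", etc.) are
# our own labels for the three stages, not identifiers from the paper.

SCHEDULE = {
    "projector_init":   {"projector"},
    "vl_pretraining":   {"projector", "convnext_last_18_blocks", "llm_adapter"},
    "instruction_tune": {"projector", "convnext_last_18_blocks", "llm_adapter"},
}

def trainable_components(stage: str) -> set[str]:
    """Return the parameter groups unfrozen in a given training stage."""
    return SCHEDULE[stage]

print(trainable_components("projector_init"))  # only the projector is trained
```

In a real implementation one would toggle `requires_grad` on the corresponding parameter groups before each stage; the table above only encodes who is unfrozen when.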

3. Flexible Handling of Resolutions and Aspect Ratios

ConvLLaVA’s convolutional visual backbone is translation-equivariant and agnostic to image aspect ratio. After training at a canonical “square” resolution (e.g., $1536 \times 1536$), the system can be deployed on images of arbitrary aspect ratio: only the short side must be resized to $1536$ px, while the long side is preserved. Empirical results (Table 7) indicate minimal performance degradation, and sometimes gains on OCR-oriented tasks like DocVQA. Inference at even higher resolutions (e.g., short side $1664$ px) offers further improvements on select benchmarks (Ge et al., 2024).
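The short-side resize recipe above amounts to the following arithmetic. Snapping both sides to multiples of $64$ (so the $64\times$ downsampling stays exact) is our assumption, not a detail stated in the source:

```python
# Aspect-ratio-preserving resize: scale so the SHORT side reaches the target
# (1536 px here), leaving the long side free. Pure arithmetic sketch; the
# multiple-of-64 snap is our assumption to keep H/64 and W/64 integral.

def resize_short_side(h: int, w: int, target: int = 1536) -> tuple[int, int]:
    scale = target / min(h, w)
    new_h, new_w = round(h * scale), round(w * scale)
    snap = lambda x: max(64, round(x / 64) * 64)  # keep 64x downsampling exact
    return snap(new_h), snap(new_w)

print(resize_short_side(1080, 1920))  # (1536, 2752): short side hits 1536
```

A 16:9 input thus keeps its elongated shape instead of being squashed to a square, which is what preserves small text for OCR-style tasks.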

4. Experimental Evaluation and Quantitative Comparison

The ConvLLaVA model suite is evaluated on a spectrum of mainstream benchmarks, including:

  • General capability: MME, MMBench, SEEDBench, RealWorldQA, MMMU, MMVet, POPE
  • Fine-grained OCR: TextVQA, DocVQA
  • Referring comprehension: RefCOCO, RefCOCO+, RefCOCOg

All closed-form vision–language responses are scored using VLMEvalKit (exact match/accuracy), and grounding is measured via the standard IoU $\geq 0.5$ criterion.
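The grounding criterion is the usual intersection-over-union test for axis-aligned boxes, sketched here for boxes in $(x_1, y_1, x_2, y_2)$ form (a generic IoU check, not code from the paper):

```python
# IoU >= 0.5 criterion used for referring-expression grounding, for
# axis-aligned boxes given as (x1, y1, x2, y2).

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

pred, gt = (10, 10, 60, 60), (15, 15, 65, 65)
print(iou(pred, gt) >= 0.5)  # True: counted as a correct grounding
```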

Compute and Token Count Efficiency

A major contribution is reduction in both token count and computation:

  • ViT at $336$ px: $576$ tokens (quadratic attention compute)
  • MiniGemini-HD at $1536$ px: $2880$ tokens (cropping-based)
  • ConvLLaVA-1536: $576$ tokens (linear compute)

At $768$ px, ConvNeXt offers an $8\times$ compute saving over ViT; with the fifth stage, an additional $6\times$ reduction at $1536$ px is observed (see Figure 1).

Comparative Performance

A summary from Table 5 is provided below for representative 7B models:

Method                Res    #Tokens   MMBench   SEEDBench   TextVQA   DocVQA   POPE
LLaVA-1.5 (7B)        336    576       64.3      66.2        45.5      21.6     85.9
ShareGPT4V (7B)       336    576       68.8      69.7        51.1      26.6     86.0
MiniGemini-HD (7B)    1536   2880      65.8      –           –         –        –
ConvLLaVA (7B)        1536   576       68.7      70.2        65.8      59.0     87.3

(Dashes mark values not reported in the excerpted table.)

ConvLLaVA at $1536$ px matches or exceeds models of similar or greater scale on MMBench and SEEDBench with substantially fewer tokens, and demonstrates dramatic improvements on OCR tasks (e.g., DocVQA: $59.0$ vs. $26.6$ for $336$ px models). On referring comprehension tasks, ConvLLaVA outperforms LLaVA-1.5 (Table 6), achieving an average of $82.3\%$ versus $79.3\%$ (Ge et al., 2024).

5. Implications and Technical Significance

ConvLLaVA demonstrates that hierarchical, convolutional visual encoders—when appropriately extended and retrained—can obviate the quadratic bottlenecks of ViT at high resolution. By maintaining only a few hundred tokens even for megapixel-scale images, ConvLLaVA achieves $\sim 6$–$8\times$ compute savings compared to leading high-resolution LMMs, while retaining or improving performance on both general and fine-grained tasks.

Key architectural properties include linear spatial complexity (O(N)\mathcal{O}(N)), token compression via staged downsampling, and equivariance permitting arbitrary aspect ratio inference without retraining.

6. Limitations and Prospects for Advancement

The ConvNeXt backbone’s configuration—kernel sizes, stage depths, and pretraining regimen—remains optimized for low-resolution settings. Adapting backbones specifically for high-resolution input, for example by enlarging kernel sizes or rebalancing stage depths, could yield better representations. The trade-off between compressing information and preserving small-object details persists: excessive downsampling risks information loss, while retaining too many tokens burdens subsequent LLM processing. Adaptive or content-aware compression schemes represent a promising direction.

The architecture’s principles are broadly applicable to multi-image, video, or interleaved vision/text settings, although explosive token growth in these domains will require further innovations in visual token adaptation and complexity control (Ge et al., 2024).
