
ConvLLaVA: Efficient High-Res LMM Architecture

Updated 15 February 2026
  • ConvLLaVA is a large multimodal model that uses a five-stage hierarchical ConvNeXt backbone to efficiently encode high-resolution images.
  • It integrates successive downsampling and tailored retraining to drastically reduce token count and computational overhead compared to ViT-based models.
  • The model flexibly handles arbitrary image resolutions and aspect ratios, ensuring minimal performance loss across diverse vision-language tasks.

ConvLLaVA is a large multimodal model (LMM) architecture that leverages a hierarchical convolutional backbone to efficiently encode high-resolution visual inputs for downstream LLMs. By integrating a five-stage ConvNeXt as its visual encoder and introducing optimization strategies tailored for high-resolution processing, ConvLLaVA addresses both token redundancy and quadratic compute scaling that limit Vision Transformer (ViT)-based LMMs. It achieves highly competitive results on general and fine-grained vision-language benchmarks while maintaining flexibility for arbitrary image resolutions and aspect ratios (Ge et al., 2024).

1. Motivation and Background

High-resolution LMMs must balance computational efficiency with information retention. ViT-based encoders split images of size $H \times W$ into $P \times P$ patches, yielding

N_{\rm ViT} = \frac{H\,W}{P^2}

visual tokens and introducing self-attention complexity of

\mathcal{O}(N_{\rm ViT}^2) = \mathcal{O}\!\left(\frac{H^2 W^2}{P^4}\right).

At $1536 \times 1536$ resolution with $P = 16$, this results in $9216$ tokens, creating significant overhead in memory and FLOPs due to quadratic scaling.
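The token count and the quadratic attention cost above can be checked with a few lines of arithmetic (a worked illustration of the formulas, not library code):

```python
# Token count and attention-cost scaling for a ViT-style patch encoder,
# following N_ViT = H*W / P^2 from the text.

def vit_tokens(h: int, w: int, patch: int = 16) -> int:
    """Number of visual tokens for an H x W image with P x P patches."""
    return (h // patch) * (w // patch)

def attention_pairs(n_tokens: int) -> int:
    """Self-attention scores one token pair per entry: O(N^2) per layer."""
    return n_tokens * n_tokens

n = vit_tokens(1536, 1536, patch=16)
print(n)                   # 9216 tokens (96 x 96 patches)
print(attention_pairs(n))  # 84934656 token pairs per attention layer
```

The second number makes the quadratic blow-up concrete: at $1536$ px, each attention layer compares roughly $8.5 \times 10^7$ token pairs.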

Hierarchical backbones such as ConvNeXt sequentially downsample feature maps, producing a final spatial shape of $H/32 \times W/32$ after four stages. The token count for a four-stage ConvNeXt is

N_{\rm 4\text{-}stage} = \frac{H\,W}{32^2},

approximately one-quarter of ViT's count at equivalent resolution (with $P = 16$), while the depth-wise convolutions scale linearly as $\mathcal{O}(k^2 C N)$. Figure 1 in the source demonstrates an approximate $8\times$ reduction in compute for ConvNeXt versus ViT-L when processing high-resolution images (Ge et al., 2024).

2. Model Architecture and Visual Token Compression

ConvLLaVA replaces the ViT visual encoder with a ConvNeXt-L backbone, originally comprising four stages and a $4 \times 4$ convolutional patch embed, for up to $32\times$ downsampling. To address persistent redundancy at higher resolutions, ConvLLaVA extends the backbone with a fifth ConvNeXt stage, reaching $64\times$ total downsampling:

N_{\rm 5\text{-}stage} = \frac{H\,W}{64^2}.

For $1536 \times 1536$ images, this amounts to just $576$ visual tokens, a drastic reduction compared to conventional approaches.
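The $32\times$ and $64\times$ totals can be sketched numerically. The stem-$4\times$ / per-stage-$2\times$ breakdown below is our assumption about how the factors compose, consistent with the stated totals:

```python
# Visual token count after a ConvNeXt-style backbone: a 4x patchify stem,
# then a 2x downsample at each stage transition (assumed breakdown that
# reproduces the 32x four-stage and 64x five-stage totals in the text).

def feature_tokens(h: int, w: int, stages: int) -> int:
    factor = 4 * 2 ** (stages - 1)  # stem 4x, then 2x per stage transition
    return (h // factor) * (w // factor)

print(feature_tokens(1536, 1536, stages=4))  # 2304 tokens (32x downsampling)
print(feature_tokens(1536, 1536, stages=5))  # 576 tokens (64x, ConvLLaVA)
```

At $1536$ px the fifth stage cuts the token count by another $4\times$, landing on the same $576$ tokens a standard ViT produces at $336$ px.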

Two primary optimizations enable this architecture to perform on par with, or better than, ViT-based models at high resolutions:

  1. Visual Encoder Re-Tuning for High Resolution: The pretrained CLIP-ConvNeXt-L, optimized for images up to $320$ px, exhibits degraded performance when evaluated on larger inputs. ConvLLaVA addresses this through a three-stage training pipeline:
    • Projector initialization using $558$ K caption pairs.
    • Vision–language pretraining with $\sim 2$ M ShareGPT4V-PT caption pairs.
    • Instruction tuning with $665$ K LLaVA examples.
  During the latter two stages, the last $18$ ConvNeXt blocks, the projector, and the LLM adapter are unfrozen and trained to bridge the gap between low- and high-resolution regimes.
  2. Successive Downsampling (Fifth Stage): An additional ConvNeXt stage (six blocks) further downscales feature maps, maintaining the token count at $576$ for $1536$ px images without loss of fine-grained detail. Table 4 in the source shows that this enables improvements on tasks such as TextVQA and DocVQA as resolution increases (Ge et al., 2024).
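The three-stage freeze schedule described in point 1 can be sketched as a simple lookup. The stage and component names below are paraphrased from the text; the dict stands in for real parameter groups (no deep-learning framework is assumed):

```python
# Minimal sketch of which parameter groups are trainable at each stage of
# the training pipeline described above. Names ("projector_init", etc.) are
# our own labels for the three stages, not identifiers from the paper.

SCHEDULE = {
    "projector_init":   {"projector"},
    "vl_pretraining":   {"projector", "convnext_last_18_blocks", "llm_adapter"},
    "instruction_tune": {"projector", "convnext_last_18_blocks", "llm_adapter"},
}

def trainable_components(stage: str) -> set[str]:
    """Return the parameter groups unfrozen in a given training stage."""
    return SCHEDULE[stage]

print(trainable_components("projector_init"))  # only the projector is trained
```

In a real implementation one would toggle `requires_grad` on the corresponding parameter groups before each stage; the table above only encodes who is unfrozen when.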

3. Flexible Handling of Resolutions and Aspect Ratios

ConvLLaVA’s convolutional visual backbone is translation-equivariant and agnostic to image aspect ratio. After training at a canonical “square” resolution (e.g., $1536 \times 1536$), the system can be deployed on images of arbitrary aspect ratio: only the short side must be resized to $1536$ px, while the long side is preserved. Empirical results (Table 7) indicate minimal performance degradation, and sometimes gains on OCR-oriented tasks like DocVQA. Inference at even higher resolutions (e.g., short side $1664$ px) offers further improvements on select benchmarks (Ge et al., 2024).
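The short-side resize recipe above amounts to the following arithmetic. Snapping both sides to multiples of $64$ (so the $64\times$ downsampling stays exact) is our assumption, not a detail stated in the source:

```python
# Aspect-ratio-preserving resize: scale so the SHORT side reaches the target
# (1536 px here), leaving the long side free. Pure arithmetic sketch; the
# multiple-of-64 snap is our assumption to keep H/64 and W/64 integral.

def resize_short_side(h: int, w: int, target: int = 1536) -> tuple[int, int]:
    scale = target / min(h, w)
    new_h, new_w = round(h * scale), round(w * scale)
    snap = lambda x: max(64, round(x / 64) * 64)  # keep 64x downsampling exact
    return snap(new_h), snap(new_w)

print(resize_short_side(1080, 1920))  # (1536, 2752): short side hits 1536
```

A 16:9 input thus keeps its elongated shape instead of being squashed to a square, which is what preserves small text for OCR-style tasks.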

4. Experimental Evaluation and Quantitative Comparison

The ConvLLaVA model suite is evaluated on a spectrum of mainstream benchmarks, including:

  • General capability: MME, MMBench, SEEDBench, RealWorldQA, MMMU, MMVet, POPE
  • Fine-grained OCR: TextVQA, DocVQA
  • Referring comprehension: RefCOCO, RefCOCO+, RefCOCOg

All closed-form vision–language responses are scored using VLMEvalKit (exact match/accuracy), and grounding is measured via the standard IoU $\geq 0.5$ criterion.
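The grounding criterion is the usual intersection-over-union test for axis-aligned boxes, sketched here for boxes in $(x_1, y_1, x_2, y_2)$ form (a generic IoU check, not code from the paper):

```python
# IoU >= 0.5 criterion used for referring-expression grounding, for
# axis-aligned boxes given as (x1, y1, x2, y2).

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

pred, gt = (10, 10, 60, 60), (15, 15, 65, 65)
print(iou(pred, gt) >= 0.5)  # True: counted as a correct grounding
```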

Compute and Token Count Efficiency

A major contribution is reduction in both token count and computation:

  • ViT at $336$ px: $576$ tokens (quadratic attention compute)
  • MiniGemini-HD at $1536$ px: $2880$ tokens (cropping-based)
  • ConvLLaVA-1536: $576$ tokens (linear compute)

At $768$ px, ConvNeXt offers an $8\times$ compute saving over ViT; with the fifth stage, an additional $6\times$ reduction at $1536$ px is observed (see Figure 1).

Comparative Performance

A summary from Table 5 is provided below for representative 7B models:

Method                Res    #Tokens   MMBench   SEEDBench   TextVQA   DocVQA   POPE
LLaVA-1.5 (7B)        336    576       64.3      66.2        45.5      21.6     85.9
ShareGPT4V (7B)       336    576       68.8      69.7        51.1      26.6     86.0
MiniGemini-HD (7B)    1536   2880      65.8      –           –         –        –
ConvLLaVA (7B)        1536   576       68.7      70.2        65.8      59.0     87.3

(Dashes mark values not reported in the excerpted table.)

ConvLLaVA at $1536$ px matches or exceeds models of similar or greater scale on MMBench and SEEDBench with substantially fewer tokens, and demonstrates dramatic improvements on OCR tasks (e.g., DocVQA: $59.0$ vs. $26.6$ for $336$ px models). On referring comprehension tasks, ConvLLaVA outperforms LLaVA-1.5 (Table 6), achieving an average of $82.3\%$ versus $79.3\%$ (Ge et al., 2024).

5. Implications and Technical Significance

ConvLLaVA demonstrates that hierarchical, convolutional visual encoders—when appropriately extended and retrained—can obviate the quadratic bottlenecks of ViT at high resolution. By maintaining only a few hundred tokens even for megapixel-scale images, ConvLLaVA achieves $\sim 6$–$8\times$ compute savings compared to leading high-resolution LMMs, while retaining or improving performance on both general and fine-grained tasks.

Key architectural properties include linear spatial complexity (O(N)\mathcal{O}(N)), token compression via staged downsampling, and equivariance permitting arbitrary aspect ratio inference without retraining.

6. Limitations and Prospects for Advancement

The ConvNeXt backbone’s configuration—kernel sizes, stage depths, and pretraining regimen—remains optimized for low-resolution settings. Adapting backbones specifically for high-resolution input, for example by enlarging kernel sizes or rebalancing stage depths, could yield better representations. The trade-off between compressing information and preserving small-object details persists: excessive downsampling risks information loss, while retaining too many tokens burdens subsequent LLM processing. Adaptive or content-aware compression schemes represent a promising direction.

The architecture’s principles are broadly applicable to multi-image, video, or interleaved vision/text settings, although explosive token growth in these domains will require further innovations in visual token adaptation and complexity control (Ge et al., 2024).
