
Jina-VLM: Multilingual VQA Model

Updated 8 December 2025
  • Jina-VLM is a 2.4B-parameter decoder-only vision-language model that delivers state-of-the-art multilingual visual question answering using an integrated architecture.
  • It combines a 27-layer SigLIP2 vision transformer and a 32-layer Qwen3 language decoder via an innovative attention-pooling connector for efficient token fusion.
  • Optimized for both accuracy and efficiency, the model leverages tile pooling and a robust two-stage training protocol to balance high-fidelity image processing with text reasoning.

Jina-VLM is a 2.4B-parameter decoder-only vision-language model designed to deliver state-of-the-art performance on multilingual visual question answering (VQA) tasks within the open 2B-scale VLM space. The model integrates a SigLIP2 vision encoder and a Qwen3 language backbone through an attention-pooling connector, enabling efficient, high-fidelity, multilingual image-text reasoning. Jina-VLM achieves leading accuracy across numerous VQA and multimodal benchmarks while maintaining competitive text-only LLM abilities (Koukounas et al., 3 Dec 2025).

1. Model Architecture

Jina-VLM comprises three principal modules: a 400M-parameter SigLIP2-So400M/14-384 Vision Transformer (ViT), a 1.7B-parameter Qwen3-1.7B-Base decoder-only LLM, and an approximately 300M-parameter attention-pooling connector. The full stack reaches ∼2.4B parameters.
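As a quick consistency check, the quoted module sizes can be tallied directly; the Python sketch below only reproduces the approximate figures stated above and is not an exact parameter count.

# Approximate parameter budget from the module sizes quoted above (not exact counts).
modules = {
    "siglip2_so400m_vit": 0.4e9,           # 27-layer vision encoder
    "attention_pooling_connector": 0.3e9,  # ~300M connector
    "qwen3_1_7b_decoder": 1.7e9,           # language backbone
}
print(f"total ≈ {sum(modules.values()) / 1e9:.1f}B parameters")  # ≈ 2.4B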

Vision Encoder

The SigLIP2-So400M/14-384 module is a 27-layer ViT with hidden size d_v = 1024, MLP dimension 4096, and patch size 14, operating on 378×378 inputs. It is pretrained via contrastive, multilingual image-text supervision and fine-tuned end-to-end during Jina-VLM training.

Language Decoder

The Qwen3-1.7B-Base backbone is a 32-layer transformer with d_ℓ = 2048, 32 attention heads, and an MLP dimension of 8192. All 1.7B parameters are trainable. Visual tokens are delimited with special <im_start>, <im_col>, and <im_end> tokens to support arbitrary-length, tokenized image input streams.
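The exact interleaving of these delimiters is not spelled out in this summary; the sketch below shows one plausible layout in which <im_col> marks the end of each row of pooled patch tokens within a tile. The row/column convention and the placeholder token names are assumptions for illustration.

# Hypothetical sketch: wrap one tile's pooled visual tokens in the special delimiters.
def wrap_tile_tokens(pooled_tokens, rows, cols):
    seq = ["<im_start>"]
    for r in range(rows):
        seq.extend(pooled_tokens[r * cols:(r + 1) * cols])
        seq.append("<im_col>")   # assumed: row separator within the pooled patch grid
    seq.append("<im_end>")
    return seq

# Example with a pooled 2x3 grid of placeholder patch tokens
print(wrap_tile_tokens([f"<patch_{i}>" for i in range(6)], rows=2, cols=3))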

Attention-Pooling Connector

Instead of utilizing only the top ViT layer, the connector concatenates two intermediate patch embedding layers (–3 and –9 from the output, i.e., layers 24 and 18) to capture both high- and low-level perceptual signals:

H_{\text{concat}} = [H^{(-3)}; H^{(-9)}]

A tile-pooling mechanism over 2×2 patch neighborhoods reduces visual tokens by a factor of four, crucial for fitting the long sequence of vision tokens. Multi-head attention pooling is then employed:

H_{\text{pooled}} = \mathrm{softmax}\left( Q W_Q (H_{\text{concat}} W_K)^\top / \sqrt{d_k} \right) (H_{\text{concat}} W_V) \, W_O.

Finally, SwiGLU projections map these tokens into the LLM’s input space for token-level fusion.
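The following PyTorch sketch follows the formula above under stated assumptions: two concatenated ViT layers (d_v = 1024 each), queries formed from the mean of each 2×2 patch neighborhood, 8 attention heads, and a SwiGLU hidden width of 8192. These last choices, the module names, and the handling of the patch grid are illustrative assumptions, not Jina-VLM's exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPoolConnector(nn.Module):
    """Sketch: concat two ViT layers, attention-pool each 2x2 patch neighborhood
    into a single token (4x reduction), then project into the LLM space via SwiGLU."""

    def __init__(self, d_vit=1024, d_llm=2048, n_heads=8, d_hidden=8192):
        super().__init__()
        d_in = 2 * d_vit                      # two concatenated ViT layers
        self.n_heads, self.d_k = n_heads, d_in // n_heads
        self.w_q = nn.Linear(d_in, d_in, bias=False)
        self.w_k = nn.Linear(d_in, d_in, bias=False)
        self.w_v = nn.Linear(d_in, d_in, bias=False)
        self.w_o = nn.Linear(d_in, d_in, bias=False)
        self.gate = nn.Linear(d_in, d_hidden, bias=False)   # SwiGLU projection
        self.up = nn.Linear(d_in, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_llm, bias=False)

    def forward(self, h_m3, h_m9, grid):
        # h_m3, h_m9: (B, H*W, d_vit) patch embeddings from ViT layers -3 and -9;
        # grid = (H, W) patch grid, assumed even here (odd grids would need padding).
        B = h_m3.shape[0]
        H, W = grid
        h = torch.cat([h_m3, h_m9], dim=-1)                   # (B, H*W, 2*d_vit)
        h = h.view(B, H // 2, 2, W // 2, 2, h.shape[-1])      # group 2x2 neighborhoods
        h = h.permute(0, 1, 3, 2, 4, 5).reshape(B, (H // 2) * (W // 2), 4, -1)
        q = self.w_q(h.mean(dim=2, keepdim=True))             # query = neighborhood mean (assumption)
        k, v = self.w_k(h), self.w_v(h)

        def split(x):  # (B, N, T, d_in) -> (B, N, heads, T, d_k)
            return x.view(*x.shape[:3], self.n_heads, self.d_k).transpose(2, 3)

        attn = F.softmax(split(q) @ split(k).transpose(-2, -1) / self.d_k ** 0.5, dim=-1)
        pooled = (attn @ split(v)).transpose(2, 3).reshape(B, -1, self.n_heads * self.d_k)
        pooled = self.w_o(pooled)                             # (B, H/2 * W/2, 2*d_vit)
        return self.down(F.silu(self.gate(pooled)) * self.up(pooled))  # (B, N, d_llm)

With an even patch grid this yields one LLM-space token per 2×2 neighborhood, i.e. the stated 4× token reduction; how Jina-VLM handles the odd 27×27 per-tile grid is not specified here and is omitted from the sketch.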

Image Preprocessing

During inference, each input image is subdivided into up to 12 overlapping 378×378 tiles plus a thumbnail. Each tile yields 27×27=729 patches, resulting in up to 13×729≈9500 visual tokens per image. The attention-pooling connector reduces this to ≈2375 visual tokens, which are combined with up to 512 text tokens per instance.

Pseudocode: Tiling + Thumbnail (Excerpt)

GetAllTilesOverlapAndResize(I, b=(378,378), p=14, M=12, margins=(4,4)):
  (b_h, b_w) = b; (mL, mR) = margins            // tile size in pixels; per-side overlap margins in patches
  m_tot = p*(mL+mR)                             // total overlap margin in pixels
  s_win = (floor(b_h/p)-(mL+mR))*p              // stride of the sliding tile window
  (t_h,t_w) = SelectTiling(..., M)              // choose a tile grid with at most M tiles
  H' = t_h*s_win + m_tot; W' = t_w*s_win + m_tot
  I_grid = Resize(I,[H',W'])                    // resize so tiles align with the grid
  G = ExtractTiles(I_grid,(t_h,t_w),s_win,b_h)  // overlapping b_h x b_w tiles
  T = Resize(I,[b_h,b_w])                       // global thumbnail
  return [T]+G, (t_h,t_w)
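As a worked example of the resulting token budget (using the figures from the preprocessing description above; the small gap to the quoted ≈2375 presumably comes from how the odd 27×27 per-tile grid is handled before 2×2 pooling, which is not specified here):

# Worked token-count example for a fully tiled image (12 tiles + thumbnail).
patches_per_tile = (378 // 14) ** 2              # 27 * 27 = 729 patches per 378x378 tile
raw_visual_tokens = (12 + 1) * patches_per_tile  # 13 * 729 = 9477 (≈ 9500)
pooled_visual_tokens = raw_visual_tokens // 4    # ≈ 2369 after 2x2 pooling (≈ 2375 quoted)
max_sequence = pooled_visual_tokens + 512        # plus up to 512 text tokens, well under 3K
print(raw_visual_tokens, pooled_visual_tokens, max_sequence)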

2. Training Protocol

Jina-VLM employs a two-stage, end-to-end training paradigm with all weights unfrozen throughout.

Stage 1: Alignment Pre-training

  • Dataset: 3.2M multimodal samples from PixMoCap, PangeaIns, etc. (30+ languages, highly diverse), plus a 15% fraction of text-only samples from PleiAS/common_corpus.
  • Objectives: Combined contrastive and language modeling over captions.
  • Training regime: 25K steps, batch size 128 images, totaling ∼10B text tokens.
  • Learning rates (AdamW): vision encoder 6e-6, connector 2e-4, LLM 2e-5, with warmup fractions of 10%, 1%, and 10%, respectively.

Stage 2: Instruction Fine-tuning

  • Dataset: 15.3M multimodal instructions (LLaVA-OneVision, Cauldron, Cambrian, etc.), plus additional text-only instances (Aya dataset).
  • Format: Multilingual instruction-answer pairs.
  • Training regime: 60K steps, batch size 256; ∼37B tokens.
  • Learning rates: 5e-6 for all modules except the LLM at 1e-5, each with 10% warmup (see the optimizer sketch after this list).
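A minimal sketch of how the per-module learning rates above could be wired into AdamW parameter groups; the module attribute names (vision_encoder, connector, language_model) are hypothetical placeholders, not Jina-VLM's actual API.

import torch

def make_optimizer(model, stage):
    # Stage 1: vision encoder 6e-6, connector 2e-4, LLM 2e-5.
    # Stage 2: 5e-6 for all modules except the LLM at 1e-5.
    lrs = {1: (6e-6, 2e-4, 2e-5), 2: (5e-6, 5e-6, 1e-5)}[stage]
    param_groups = [
        {"params": model.vision_encoder.parameters(), "lr": lrs[0]},
        {"params": model.connector.parameters(),      "lr": lrs[1]},
        {"params": model.language_model.parameters(), "lr": lrs[2]},
    ]
    return torch.optim.AdamW(param_groups)

The warmup schedules (10% of steps for most modules, 1% for the connector in Stage 1) would be applied per parameter group via a separate learning-rate scheduler, which is omitted here.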

Data Preprocessing

Images are tiled and resized without additional augmentation. Text is tokenized with Moses-style pre-tokenization and the Qwen3 vocabulary, with the language mix balanced to roughly 50% English and 50% other languages (30 languages in total).
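The balancing step above amounts to choosing per-language sampling weights; the helper below is a minimal, hypothetical sketch of one way to do this, assuming English is re-weighted to ~50% of sampled text and the remaining languages share the other half in proportion to their raw counts.

# Hypothetical sketch: per-example sampling weights for a ~50% English text mix.
def balance_weights(counts_by_lang):
    non_en_total = sum(c for lang, c in counts_by_lang.items() if lang != "en")
    weights = {}
    for lang, count in counts_by_lang.items():
        target_share = 0.5 if lang == "en" else 0.5 * count / non_en_total
        weights[lang] = target_share / count   # probability mass per example
    return weights

print(balance_weights({"en": 8_000_000, "de": 1_000_000, "zh": 1_000_000}))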

3. Evaluation and Benchmarks

Jina-VLM is evaluated across a range of standard and multilingual VQA, reasoning, and text-only benchmarks.

Task/domain             | Best performer | Notable results (Jina-VLM)           | Peer comparison
General VQA             | Jina-VLM       | 72.3% avg; AI2D 82.0%; ChartQA 81.9% | Qwen2-VL-2B: 66.4%
Real-World Multimodal   | Jina-VLM       | RealWorldQA 68.2%; MMBench 67.4%     | –
Multi-Image Reasoning   | Jina-VLM       | BLINK/Muir/MNT 47.3%                 | –
Hallucination (POPE)    | Jina-VLM       | 90.3%                                | –
Math/Logic Reasoning    | –              | MMMU/MathVista 33.1%                 | InternVL3-2B: 35.3%
Pure text (MMLU etc.)   | Qwen3          | 58.9% avg                            | Qwen3: 63.3%
Multilingual VQA        | Jina-VLM       | MMMB 78.8%; Mul-MMBench 74.3%        | Qwen3-VL-2B: 75.0 / 72.3

Jina-VLM achieves state-of-the-art results among open 2B-scale models for general and multilingual VQA, and consistently outperforms or matches comparable models on most multimodal and real-world understanding tasks. Hallucination robustness is notably high, with a POPE score of 90.3% (best among peers), indicating improved factual reliability. In mathematical/logical reasoning, Jina-VLM scores comparably to InternVL3-2B (33.1% vs. 35.3%) and surpasses Qwen2-VL-2B (25.3%). Text-only performance is slightly degraded relative to the Qwen3 backbone (–4.4 points avg) but remains robust: MMLU 56.1, GSM-8K 71.3, ARC-C 77.3, HellaSwag 59.4.

4. Design Analysis and Ablation Studies

Direct ablations are limited, but key engineering analyses indicate:

  • The token-pooling connector reduces patch tokens by 4× with minimal (<1%) accuracy loss, critically enabling ∼2,400-token vision input within feasible context window limits.
  • Concatenating two ViT layers (–3 and –9) for attention pooling yields a ∼1-point VQA accuracy gain over single-layer-only alternatives.
  • Mixing 15% text-only data in pre-training prevents a ∼5-point drop in subsequent text-only evaluation scores.
  • Qualitative analysis highlights the model's strengths in cross-lingual OCR, fine-grained font detection, and diagram labeling, with current limitations in multi-image coherence (due to data scarcity in fine-tuning) and some failures in numerical reasoning.

5. Inference Efficiency and Deployment Scenarios

Jina-VLM prioritizes both token and compute efficiency. A typical image (12 tiles + thumbnail) yields ~2,400 visual tokens, which are combined with up to 512 text tokens, keeping sequence lengths well within 3K tokens.

  • An inference pass (image + query) completes at ≈4 examples/s with 512-token autoregressive text generation on an NVIDIA A100-40GB.
  • The attention-pooling connector achieves a ~75% reduction in FLOPs compared to naive patch-level fusion.
  • Memory usage scales linearly with the number of tiles; typical settings require ~20GB for 1 image + 512 text tokens per batch.
  • Supported applications include multilingual VQA agents, document understanding workflows, and low-resource research deployments, with quantized runs on 8×A10G (16GB per GPU).

6. Context within the Multimodal Model Landscape

Jina-VLM shares research lineage with prior CLIP-based and multilingual vision-language models, but differs in architecture (decoder-only, with integrated vision and language processing) and performance priorities. Notably, the combination of a multilingual, contrastively pretrained vision encoder (SigLIP2), a dense language backbone (Qwen3), and attention pooling with tile-and-thumbnail preprocessing enables superior multilingual and multimodal scalability (Koukounas et al., 3 Dec 2025).

Its design choices, particularly the attention pooling and multi-layer fusion, directly address scaling and efficiency constraints common to large-scale VLMs, establishing Jina-VLM as a state-of-the-art solution for multilingual visual-linguistic reasoning at modest computational cost.

References

  • Koukounas et al., 3 Dec 2025.
