Jina-Embeddings-V4: Unified Multimodal Embedding Model
Jina-Embeddings-V4 is a 3.8-billion-parameter multimodal, multilingual embedding model designed to unify the representation of text and images for efficient and accurate retrieval across diverse, real-world scenarios. The model introduces a series of architectural advances—namely a unified vision-language transformer backbone, support for both dense and late interaction (“multi-vector”) embedding modes, and specialized Low-Rank Adaptation (LoRA) adapters for task-specific optimization. Jina-Embeddings-V4 has demonstrated state-of-the-art performance across single-modal and cross-modal retrieval benchmarks, with particular strength in processing visually rich and heterogeneous documents. Its development marks a significant progression in the field of universal embedding models.
1. Unified Multimodal Architecture
Jina-Embeddings-V4 departs from the dual-encoder paradigm of CLIP-style models by employing a single transformer-based encoder built atop the Qwen2.5-VL vision-language backbone. This architecture accepts both tokenized text and images, the latter converted to “image tokens” via a dedicated image encoder, and concatenates them into a unified token stream. All input modalities are thereby embedded into the same latent space, enabling seamless mixed-media representation. The model supports both single-vector and multi-vector output heads, conferring flexibility in downstream retrieval and ranking.
- Single-vector head: Mean pooling the final hidden states across all tokens yields a truncatable 2048-dimensional embedding, following Matryoshka Representation Learning.
- Multi-vector (Late Interaction) head: Outputs a sequence of 128-dimensional vectors, one per token or patch, supporting high-granularity matching especially for long or visually complex documents.
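To make the two output heads concrete, the following is a minimal PyTorch sketch of how they could be realized on top of the backbone's final hidden states. The tensor shapes, random weights, and the separate projection layer are illustrative assumptions, not the model's actual implementation.

```python
import torch
import torch.nn.functional as F

# Illustrative only: the backbone emits one hidden state per text token or image patch.
hidden_dim, seq_len = 2048, 512
hidden_states = torch.randn(1, seq_len, hidden_dim)   # [batch, tokens, hidden]

# Single-vector head: mean-pool all token states into one 2048-d embedding.
# Matryoshka-style training lets the vector be truncated (e.g. to 128 dims) and re-normalized.
dense = F.normalize(hidden_states.mean(dim=1), dim=-1)        # [1, 2048]
dense_128 = F.normalize(dense[:, :128], dim=-1)               # truncated Matryoshka variant

# Multi-vector head: project every token/patch state down to 128 dims
# (the projection weights here are random placeholders).
proj = torch.nn.Linear(hidden_dim, 128, bias=False)
multi = F.normalize(proj(hidden_states), dim=-1)              # [1, 512, 128], one vector per token
```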
2. Embedding Strategies: Dense and Late Interaction
Jina-Embeddings-V4 jointly optimizes for two distinct embedding usage modes:
- Single-vector/Dense embeddings:
  - Compact representation (2048d default, truncatable to 128d via Matryoshka training).
  - Used for fast, memory-efficient retrieval tasks—whole-document or whole-image similarity via cosine similarity or dot product.
- Multi-vector/Late interaction embeddings:
  - Outputs a vector per token (typically 128d).
  - Employs a late-interaction similarity function:

    $$s_{\text{late}}(q, d) = \sum_{i=1}^{|q|} \max_{1 \le j \le |d|} \langle q_i, d_j \rangle$$

    where $q_i$, $d_j$ are token-level embeddings from the query and document/image, respectively.
  - The late interaction score is often normalized by the query length:

    $$\bar{s}_{\text{late}}(q, d) = \frac{1}{|q|} \sum_{i=1}^{|q|} \max_{1 \le j \le |d|} \langle q_i, d_j \rangle$$

    where $|q|$ is the query token count.
  - Provides superior retrieval accuracy, especially for complex, structured, or heterogeneous content, by preserving local structure.
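Both scoring modes can be stated in a few lines of code. The NumPy sketch below implements the cosine-similarity dense score and the MaxSim-style late-interaction score defined above; it is a formula-level illustration only and does not reproduce the model's own scoring code.

```python
import numpy as np

def dense_score(q_vec: np.ndarray, d_vec: np.ndarray) -> float:
    """Cosine similarity between two single-vector (dense) embeddings."""
    return float(np.dot(q_vec, d_vec) / (np.linalg.norm(q_vec) * np.linalg.norm(d_vec)))

def late_interaction_score(q_tokens: np.ndarray, d_tokens: np.ndarray,
                           normalize_by_query_length: bool = True) -> float:
    """MaxSim-style late-interaction score.

    q_tokens: [num_query_tokens, 128] token-level query embeddings.
    d_tokens: [num_doc_tokens, 128] token/patch-level document embeddings.
    Each query token is matched against its most similar document token,
    and the per-token maxima are summed (optionally averaged over |q|).
    """
    sim = q_tokens @ d_tokens.T            # [|q|, |d|] pairwise dot products
    score = sim.max(axis=1).sum()          # sum_i max_j <q_i, d_j>
    if normalize_by_query_length:
        score /= q_tokens.shape[0]         # divide by the query token count |q|
    return float(score)

# Toy usage with random vectors:
q = np.random.randn(8, 128)
d = np.random.randn(300, 128)
print(late_interaction_score(q, d))
```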
The model’s joint training for both modes ensures no regression in dense retrieval performance while enabling late interaction for maximal accuracy where computational resources permit.
3. Task-Specific Adaptation via LoRA Adapters
Jina-Embeddings-V4 incorporates multiple task-specific LoRA (Low-Rank Adaptation) adapters, each comprising only ~60M parameters (<2% overhead) and loaded dynamically at inference. These adapters specialize the backbone for:
- Asymmetric query-document retrieval: Optimized for classic search scenarios with short queries and long or structured targets. Prefix tokens condition the encoder for either query or document mode.
- Semantic similarity / Symmetric retrieval: Tuned for tasks like sentence similarity, duplicate detection, or paraphrase retrieval, typically using the CoSENT loss.
- Code retrieval: Specializes the encoder for code search (text-to-code and code-to-code), incorporating code-specific corpora and retrieval objectives.
This modular adaptation allows Jina-Embeddings-V4 to balance broad multilingual, multimodal generalization with fine-tuned performance in specialized retrieval scenarios, and enables efficient domain adaptation in production.
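As a rough illustration of how such adapter switching and prefix conditioning might be wired in practice, consider the sketch below. The adapter names, prefix strings, and the `set_adapter`/`encode` methods are hypothetical placeholders for the pattern described above, not the model's actual API.

```python
# Hypothetical sketch of task-adapter selection and role prefixes.
# Names and methods are illustrative placeholders, not the real interface.

TASK_ADAPTERS = {
    "retrieval": "lora_asymmetric_retrieval",      # short query vs. long/structured document
    "text-matching": "lora_semantic_similarity",   # symmetric similarity (CoSENT-trained)
    "code": "lora_code_retrieval",                 # text-to-code / code-to-code search
}

ROLE_PREFIXES = {
    "query": "Query: ",      # conditions the encoder for query mode
    "passage": "Passage: ",  # conditions the encoder for document mode
}

def embed(model, text: str, task: str, role: str = "query"):
    """Activate the task-specific LoRA adapter, prepend the role prefix, and encode."""
    model.set_adapter(TASK_ADAPTERS[task])            # swap in the ~60M-parameter adapter
    return model.encode(ROLE_PREFIXES[role] + text)   # shared backbone, task-specialized weights
```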
4. Benchmark Performance and Empirical Evaluation
Jina-Embeddings-V4 establishes new state-of-the-art results across an array of benchmarks:
- Text retrieval (MTEB/MMTEB): Achieves or approaches top performance across both English and multilingual datasets, outperforming established baselines including BGE-M3, e5, Voyage, Cohere, and OpenAI text-embedding-3-large.
- Long document retrieval: Excels in LongEmbed evaluations, outperforming most generic models except highly specialized alternatives, with strong robustness on extended-context retrieval tasks.
- Textual semantic similarity: Places at or near the top of English STS leaderboards in terms of Spearman correlation.
- Cross-modal retrieval: Demonstrates superior performance on CLIP-style (text-to-image), XTD10 (multilingual), and ViDoRe retrieval tasks, especially in late interaction mode.
- Mixed-media and visually-rich document retrieval: Notably outperforms previous methods on Jina-VDR and ViDoRe, benchmarks focused on tables, charts, scanned documents, and complex multimodal data, recording nDCG@5/10 scores that exceed all reported alternatives.
- Code retrieval: Performs competitively on the MTEB-CoIR benchmark for code-centric search scenarios.
The model’s joint single/multi-vector optimization and LoRA adaptation deliver leading accuracy across modalities, languages, and formats.
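Several of the figures above are reported as nDCG@5/10. For reference, one common formulation of that metric (graded relevance with a log2 rank discount) is shown below; this is the generic definition and is not specific to Jina-Embeddings-V4.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked results."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """nDCG@k: DCG of the observed ranking divided by the DCG of the ideal ordering."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: graded relevance of the top results returned by a retriever.
print(ndcg_at_k([3, 2, 3, 0, 1, 2], k=5))
```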
5. Handling of Visually Rich Content and Mixed Media
Through its unified encoder and “image token” mechanism, Jina-Embeddings-V4 effectively processes hybrid documents—those containing arbitrary mixtures of text, images, tables, diagrams, and charts. Documents are represented holistically rather than as discrete segments, an approach well-suited to enterprise, scientific, and real-world retrieval applications where high-fidelity structural understanding is essential.
Late interaction embeddings enable fine-grained matching of visual or tabular regions against queries, overcoming the typical limitations of dense pooling. The model is explicitly evaluated on hand-annotated and synthetic benchmarks containing scanned legal PDFs, financial tables, scientific plots, maps, and multimodal instructions, showing significant performance gains.
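A small sketch, using synthetic data, of how per-token multi-vector embeddings support this kind of fine-grained matching: each query token can be traced to its best-matching document token or image patch, which is exactly what the summed MaxSim score is built from. The shapes and mapping back to regions are assumptions for illustration.

```python
import numpy as np

# Synthetic multi-vector embeddings: 6 query tokens, 200 document tokens/patches.
rng = np.random.default_rng(0)
q_tokens = rng.standard_normal((6, 128))
d_tokens = rng.standard_normal((200, 128))

sim = q_tokens @ d_tokens.T           # [6, 200] pairwise similarities
best_patch = sim.argmax(axis=1)       # best-matching document token/patch per query token
best_sim = sim.max(axis=1)            # each query token's contribution to the MaxSim score

for i, (j, s) in enumerate(zip(best_patch, best_sim)):
    # In a real document, index j could be mapped back to a text span or image region
    # (e.g. a table cell or chart area) to explain why the document matched.
    print(f"query token {i} -> document token/patch {j} (sim={s:.3f})")
```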
6. Jina-VDR Benchmark: Evaluation for Visually-Rich Document Retrieval
To facilitate comprehensive evaluation, Jina-Embeddings-V4 introduces the Jina-VDR benchmark. This suite extends beyond previous standards (e.g., ViDoRe) to include over 30 tasks involving hand-annotated and synthetic visually complex documents in 20+ languages. The benchmark incorporates diverse retrieval objectives—questions, keyphrases, descriptions, facts, instruction-style queries—mirroring real-world mixed-media search.
Jina-VDR includes fine-grained per-language and per-format metrics, supporting detailed comparative analysis of model performance. Jina-Embeddings-V4 sets new records on this benchmark, especially in late interaction mode, validating its design for practical information retrieval over challenging document types.
Summary Table: Core Aspects of Jina-Embeddings-V4
| Aspect | Details |
|---|---|
| Architecture | Unified multimodal transformer (Qwen2.5-VL base); supports text, images, mixed-media |
| Embedding Modes | Single-vector (2048d, truncatable) & multi-vector/late interaction (128d per token) |
| Adaptation | Task-specific LoRA adapters: asymmetric retrieval, semantic similarity, code |
| Retrieval Performance | SOTA or near-SOTA on MTEB, MMTEB, LongEmbed, ViDoRe, Jina-VDR benchmarks |
| Mixed-media Strength | Best-in-class on tables, charts, diagrams, legal scans, complex visual and text mixtures |
| Benchmarks Introduced | Jina-VDR (visually-rich, multilingual, multi-format document retrieval) |
| Application Scope | Multilingual and cross-modal search, enterprise and scientific retrieval, code search, RAG |
Jina-Embeddings-V4 advances the state of the art in universal, scalable embedding models through unified architecture, dual embedding modes, specialized adaptation, and comprehensive empirical validation. Its strengths in processing complex, mixed-media, and visually rich content, together with robust performance across traditional and novel benchmarks, emphasize its suitability for demanding modern retrieval applications.