LLM2Vec: Converting LLMs into Encoders
- LLM2Vec is a framework that transforms autoregressive LLMs into high-quality text encoders using modified attention masks and lightweight adapter tuning.
- It employs a multi-stage protocol combining decontextualized token pooling, bidirectional attention, and unsupervised contrastive learning to generate robust embeddings.
- The approach yields strong performance across diverse tasks, ranging from word analogies and multilingual retrieval to EHR data encoding and multimodal applications.
LLM2Vec is a principled framework and family of methods for transforming large, typically decoder-only, LLMs into high-quality text encoders suited for dense semantic embeddings. The term encompasses both a generic, model-agnostic recipe for extracting decontextualized or contextualized representations directly from frozen LLMs and a concrete multi-stage adaptation protocol that enables strong performance on a wide spectrum of word-level, sequence-level, multilingual, multimodal, and even non-linguistic tasks. It is notable for bridging the gap between unidirectional language generation models and bidirectional, information-rich encoders, often matching or exceeding specialized models on a variety of structured and unstructured prediction benchmarks (BehnamGhader et al., 2024, Hegselmann et al., 24 Feb 2025, Mahajan et al., 26 Feb 2025, Ehrmanntraut et al., 19 May 2025, Musacchio et al., 12 Mar 2025, Shamrai et al., 8 Aug 2025).
1. Motivation and Theoretical Foundations
LLMs such as GPT, LLaMA, and Mistral are typically pre-trained with causal attention (strict left-to-right, autoregressive factorization), which impedes their capacity to construct full-sentence or whole-sequence embeddings, a crucial requirement for downstream tasks like retrieval, clustering, and classification (BehnamGhader et al., 2024). Nevertheless, these models are pre-trained on massive corpora, yield powerful hidden states, and already encapsulate considerable semantic knowledge, which remains underutilized outside generation.
LLM2Vec aims to efficiently expose these latent embedding capabilities. Unlike prior approaches reliant on encoder-only transformers or heavy supervised contrastive tuning, LLM2Vec—both as a minimal extraction procedure and as a multi-stage adapter-based protocol—offers a direct, computationally efficient route to universal dense encodings. It is broadly applicable across languages, modalities, and domains. The method reveals that full-context embedding power is latent in modern decoder LLMs and can be unlocked with modest architectural and objective modifications (BehnamGhader et al., 2024).
2. LLM2Vec Procedures and Key Variants
Minimal LLM2Vec Extraction (Word/Token Embeddings)
For any decoder-only LLM with $L$ layers and hidden size $d$, decontextualized word embeddings are extracted as follows (Mahajan et al., 26 Feb 2025):
- Tokenization: Tokenize each word $w$ into subwords: $w \mapsto (t_1, \dots, t_k)$.
- Forward Pass: Compute final-layer hidden states $h_1, \dots, h_k \in \mathbb{R}^d$ for all subwords.
- Pooling: Construct $e_w = \frac{1}{k} \sum_{i=1}^{k} h_i$ (mean-pooling).
- Optional Projection: Optionally post-process using learned or output embedding matrices.
This yields high-dimensional, static word embeddings that outperform classical methods (Word2Vec, GloVe) in analogy structure and semantic clustering, particularly with models like PaLM and ADA (Mahajan et al., 26 Feb 2025).
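The pooling step above can be sketched in a few lines of numpy. This is an illustrative sketch, not the papers' implementation: it assumes the subword hidden states have already been produced by a forward pass through the LLM.

```python
import numpy as np

def decontextualized_embedding(subword_states: np.ndarray) -> np.ndarray:
    """Mean-pool the final-layer hidden states of a word's subwords (k x d -> d)."""
    return subword_states.mean(axis=0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity, the standard metric for comparing the resulting vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy example: one word split into 3 subwords, hidden size d = 4.
states = np.random.default_rng(0).normal(size=(3, 4))
emb = decontextualized_embedding(states)
assert emb.shape == (4,)
```

Analogy evaluation (e.g., 3CosAdd on BATS) then reduces to nearest-neighbor search under this cosine similarity over the pooled vectors.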
Sequence-Level LLM2Vec (Full Recipe)
To enable bidirectional, information-dense sentence and document embeddings, the three-step LLM2Vec protocol is applied (BehnamGhader et al., 2024, Ehrmanntraut et al., 19 May 2025):
- Enable Bidirectional Attention: Replace the causal attention mask with an all-ones mask ($M_{ij} = 1$ for all $i, j$), allowing every token to attend globally.
- Masked Next-Token Prediction (MNTP): Train with an MNTP objective, combining elements of masked-language modeling and unidirectional prediction:
$$\mathcal{L}_{\text{MNTP}} = -\sum_{i \in \mathcal{M}} \log p_\theta\!\left(x_i \mid h_{i-1}\right),$$
where $\mathcal{M}$ is the set of masked positions and each masked token is predicted from the hidden state at the preceding position, using small LoRA adapters while freezing the base model.
- Unsupervised Contrastive Learning (SimCSE): For sequence-level tasks, apply SimCSE (BehnamGhader et al., 2024):
$$\mathcal{L}_{\text{SimCSE}} = -\log \frac{\exp\!\left(\mathrm{sim}(h_s, h_s^{+})/\tau\right)}{\sum_{j=1}^{B} \exp\!\left(\mathrm{sim}(h_s, h_j^{+})/\tau\right)},$$
where $h_s^{+}$ is a dropout-perturbed view of $h_s$, $\mathrm{sim}$ denotes cosine similarity, $\tau$ is a temperature, and the other batch sentences serve as in-batch negatives.
Learnable adapters are injected into each block (bottleneck shape), but all core model parameters remain frozen, preserving efficiency and scalability.
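Steps 1 and 3 of the protocol can be sketched with numpy. This is an illustrative sketch over precomputed sentence embeddings, not the training code; step 2 (MNTP) is omitted because it requires the model's LM head.

```python
import numpy as np

def bidirectional_mask(n: int) -> np.ndarray:
    """Step 1: replace the causal mask np.tril(np.ones((n, n))) with all-ones."""
    return np.ones((n, n))

def simcse_loss(h: np.ndarray, h_pos: np.ndarray, tau: float = 0.05) -> float:
    """Step 3 (sketch): h[i] and h_pos[i] are two dropout views of sentence i;
    the other rows of h_pos act as in-batch negatives (InfoNCE, cosine sim)."""
    h = h / np.linalg.norm(h, axis=1, keepdims=True)
    h_pos = h_pos / np.linalg.norm(h_pos, axis=1, keepdims=True)
    sims = (h @ h_pos.T) / tau                     # (B, B) similarity matrix
    log_denom = np.log(np.exp(sims).sum(axis=1))   # log-sum-exp over candidates
    return float(np.mean(log_denom - np.diag(sims)))
```

The diagonal of the similarity matrix holds the positive pairs, so minimizing the loss pulls dropout views together while pushing other batch sentences apart.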
Specialized LLM2Vec Adaptation
- Domain and Language Adaptation: For specialized settings, e.g., German text (LLäMmlein2Vec/ModernGBERT), small adapters are fine-tuned using MNTP loss to match the masked token objectives of domain-specific encoders, extended for long contexts and with practical batching/loss schedules (Ehrmanntraut et al., 19 May 2025).
- Multilingual/Multimodal Embedding: Self-knowledge distillation (xVLM2Vec) aligns non-English or multimodal representations to a shared space via MSE loss between English teacher and non-English student embeddings (Musacchio et al., 12 Mar 2025).
- Structural Data Serialization: For non-linguistic data (e.g., EHR), structured inputs are mapped into Markdown-formatted text, with code-to-description substitution, then embedded using a prompt-plus-data concatenation, with pooling applied over non-prompt tokens (Hegselmann et al., 24 Feb 2025).
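The xVLM2Vec alignment objective reduces to a simple regression between paired embeddings. A minimal sketch, assuming the parallel teacher/student embeddings have already been extracted:

```python
import numpy as np

def self_distillation_loss(student: np.ndarray, teacher: np.ndarray) -> float:
    """MSE between frozen English teacher embeddings and the student's
    non-English (or multimodal) embeddings for the same parallel examples."""
    return float(np.mean((student - teacher) ** 2))
```

Driving this loss to zero places the student's representations in the teacher's English-anchored embedding space.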
Embedding Extraction (General)
Given final-layer token representations $h_1, \dots, h_n$:
- Skip the initial prompt tokens (if present).
- Pool over the selected tokens (where $C$ indexes relevant content): $e = \frac{1}{|C|} \sum_{i \in C} h_i$,
- with downstream L2 normalization as $\hat{e} = e / \lVert e \rVert_2$.
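The extraction procedure above can be sketched directly (illustrative numpy; the token hidden states are assumed to be given, and the prompt is assumed to occupy a contiguous prefix):

```python
import numpy as np

def pooled_embedding(hidden: np.ndarray, prompt_len: int) -> np.ndarray:
    """Mean-pool final-layer states over non-prompt tokens, then L2-normalize."""
    content = hidden[prompt_len:]          # skip the instruction/prompt prefix
    e = content.mean(axis=0)
    return e / np.linalg.norm(e)

h = np.random.default_rng(1).normal(size=(10, 8))   # 10 tokens, d = 8
e = pooled_embedding(h, prompt_len=4)
assert e.shape == (8,)
```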
3. Quantitative Performance and Empirical Results
LLM2Vec achieves or surpasses specialized baselines across diverse tasks:
- Textual Benchmarks: On the MTEB suite (56 tasks), LLM2Vec-adapted models achieve new unsupervised state-of-the-art (e.g., Mistral-7B: 56.8 avg, LLaMA-2-7B: 55.4, exceeding SimCSE/BERT baselines) (BehnamGhader et al., 2024).
- Word Analogy and Clustering: GPT-ADA and PaLM LLM2Vec embeddings obtain top-1 accuracy of approximately 40% (3CosAdd) on BATS analogies, surpassing classical Word2Vec and GloVe (32–35%) (Mahajan et al., 26 Feb 2025).
- Structured Data (EHR): On 15 EHRSHOT clinical tasks, LLM2Vec (Llama-3.1-8B-Instruct) approaches or matches specialized EHR encoders (macro-AUROC: 0.742 vs. 0.769 for CLMBR-T-Base), and even outperforms in out-of-distribution generalization (Hegselmann et al., 24 Feb 2025).
| Model | MTEB Avg (Unsupervised) | BATS Analogy (3CosAdd) | EHRSHOT Macro-AUROC |
|---|---|---|---|
| LLM2Vec Mistral-7B | 56.8 | ~40% | – |
| LLM2Vec Llama-3.1-8B (EHR) | – | – | 0.742 |
| BERT + SimCSE (unsupervised) | 45.5 | 24% | – |
| Word2Vec/GloVe | – | 32–35% | – |
| CLMBR-T-Base (EHR, 141M params) | – | – | 0.769 |
LLM2Vec-adapted encoders in German (LLäMmlein2Vec) approach or nearly match scratch-trained ModernGBERT on text embedding, but the latter wins on parameter efficiency and inference speed (Ehrmanntraut et al., 19 May 2025). For multimodal multilingual settings, xVLM2Vec delivers Precision@1 gains from 13.6% to 45–57.6% on image-to-text and related tasks (Musacchio et al., 12 Mar 2025).
4. Domain and Data Modality Adaptation
LLM2Vec extends beyond unstructured text encoding:
- EHR and Structured Clinical Data: LLM2Vec serializes EHR data as enriched Markdown, with ordered inclusion of demographics, key time series (labs/vitals), visit summaries, and detailed event histories. Medical codes are mapped to natural language descriptions, leveraging LLM's pretraining on diverse general and clinical corpora (Hegselmann et al., 24 Feb 2025). Performance is robust to code mapping variations and generalizes across patient populations.
- Multilingual and Multimodal Input: xVLM2Vec adapts an English-anchored visual-LLM into a competitive multilingual/multimodal encoder using synthetic parallel corpora and MSE-based self-distillation, enabling multitask retrieval, VQA, and grounding across five languages, all while using a single linear embedding space (Musacchio et al., 12 Mar 2025).
- Language Geometry: LLM2Vec has been utilized to construct metric spaces of entire languages by pruning-based weight-importance fingerprinting, clustering 106 languages and revealing both genetic and areal/typological proximity (Shamrai et al., 8 Aug 2025).
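The EHR serialization idea can be illustrated with a toy example. The field names and the code-to-description table below are hypothetical, chosen for illustration, and are not the schema used by Hegselmann et al.:

```python
def serialize_ehr(record: dict) -> str:
    """Hypothetical sketch: render a patient record as Markdown sections,
    substituting medical codes with natural-language descriptions."""
    code_desc = {"E11.9": "Type 2 diabetes mellitus without complications"}  # illustrative
    lines = ["# Patient",
             f"- Age: {record['age']}",
             f"- Sex: {record['sex']}",
             "## Diagnoses"]
    for code in record["diagnoses"]:
        lines.append(f"- {code_desc.get(code, code)}")  # fall back to the raw code
    return "\n".join(lines)
```

The resulting Markdown string is concatenated with a task prompt and embedded like any other text, with pooling restricted to the serialized content.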
5. Architectural and Training Approaches
- Attention Mask Replacement: Direct swap of the causal (lower-triangular) attention mask for a fully open mask (an all-ones binary mask, equivalently an all-zero additive mask) enables bidirectional context flow (Ehrmanntraut et al., 19 May 2025, BehnamGhader et al., 2024).
- Adapters: Lightweight LoRA or bottleneck adapters are injected per transformer block and trained with MNTP/contrastive objectives while freezing core parameters, affording parameter efficiency (BehnamGhader et al., 2024, Ehrmanntraut et al., 19 May 2025).
- Batching and Hardware: Default settings on long-context data involve sequence lengths up to 8k tokens, with batch sizes, optimizers, and learning rates specialized per model size and adapter variant (Ehrmanntraut et al., 19 May 2025).
- Structured Data Serialization and Prompting: For structured data (EHR), the prompt+serialization paradigm encodes domain priors into the initial text before input, with careful ordering to control information salience (Hegselmann et al., 24 Feb 2025).
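The adapter mechanism can be illustrated with a LoRA-style linear layer in numpy. This is a sketch of the parameterization only; real implementations (e.g., via the PEFT library) train A and B by backpropagation while W stays frozen.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update B @ A, with rank r << d."""
    def __init__(self, W: np.ndarray, r: int = 8, alpha: float = 16.0, seed: int = 0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                                         # frozen base weight
        self.A = rng.normal(scale=0.01, size=(r, d_in))    # trainable down-projection
        self.B = np.zeros((d_out, r))                      # trainable, zero-initialized
        self.scale = alpha / r

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # Output of the base layer plus the scaled low-rank correction.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T
```

Because B is zero-initialized, the adapted layer exactly reproduces the frozen model at the start of training, which is what makes the scheme a safe, parameter-efficient add-on.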
6. Practical Recommendations and Limitations
LLM2Vec deployment in new domains requires pragmatic mapping:
- Pilot prompt/serialization strategies, include high-predictive codes early, and aggregate long time series to manage context length (Hegselmann et al., 24 Feb 2025).
- Empirically, mean-pooling over final-layer token representations is optimal for both word- and sequence-level tasks (BehnamGhader et al., 2024).
- For contexts exceeding maximum token windows (e.g., > 4096), chunking plus mean-pooling is recommended (Hegselmann et al., 24 Feb 2025).
- Performance is sensitive to prompt design and may degrade if older but clinically relevant information is truncated (Hegselmann et al., 24 Feb 2025).
- For calibrated probabilistic outputs, LLM2Vec embeddings can be combined with lightweight, domain-adapted heads or models (Hegselmann et al., 24 Feb 2025).
- LLM2Vec does not require full-model fine-tuning or access to synthetic data and is often computationally efficient (thousands of steps, small adapter updates).
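The chunking recommendation for over-long contexts can be sketched as follows. This is illustrative: in practice each chunk is re-encoded by the model separately, whereas here precomputed per-token states stand in for that step.

```python
import numpy as np

def embed_long_input(token_states: np.ndarray, max_len: int = 4096) -> np.ndarray:
    """Split a too-long sequence into chunks, mean-pool each chunk,
    then average chunk embeddings and L2-normalize the result."""
    chunks = [token_states[i:i + max_len]
              for i in range(0, len(token_states), max_len)]
    e = np.mean([c.mean(axis=0) for c in chunks], axis=0)
    return e / np.linalg.norm(e)
```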
When resource constraints or strict inference efficiency are paramount, compact encoder-only models (e.g., ModernGBERT) may still outperform LLM2Vec-converted decoders, especially in language-specific deployments (Ehrmanntraut et al., 19 May 2025).
7. Compression and Retrieval: Binary LLM2Vec Embeddings
High-dimensional LLM2Vec embeddings can be compressed for efficient retrieval. Isolation Kernel Embedding (IKE) provides a learning-free, binary coding mechanism over LLM2Vec outputs (Zhang et al., 14 Jan 2026):
- IKE ensembles $t$ random iTree-based partitions, each mapping an LLM2Vec embedding to a unique cell; concatenating the per-tree cell indices yields a compact binary code.
- IKE achieves up to 16.7× faster search and 8–16× reduction in memory over dense LLM2Vec representations, retaining ≥98–101% of retrieval accuracy on MTEB (Zhang et al., 14 Jan 2026).
- The approach is robust to code-length trade-offs (ensemble size, granularity ψ), with minimal accuracy loss unless extremely compressed (t ≪ d).
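A simplified sketch of the coding idea: each "partition" is approximated here by ψ randomly sampled reference points (nearest-reference cells) rather than a full iTree, which conveys the ensemble-of-partitions code structure but is not the exact IKE construction.

```python
import numpy as np

def fit_partitions(data: np.ndarray, t: int = 4, psi: int = 8, seed: int = 0):
    """Sample psi reference points per partition, for t partitions in total."""
    rng = np.random.default_rng(seed)
    return [data[rng.choice(len(data), size=psi, replace=False)]
            for _ in range(t)]

def binary_code(x: np.ndarray, partitions) -> np.ndarray:
    """Map x to its nearest cell in each partition and concatenate one-hot codes:
    the result is a binary vector of length t * psi with exactly t ones."""
    bits = []
    for refs in partitions:
        cell = np.argmin(np.linalg.norm(refs - x, axis=1))
        onehot = np.zeros(len(refs), dtype=np.uint8)
        onehot[cell] = 1
        bits.append(onehot)
    return np.concatenate(bits)
```

Hamming distance between two such codes counts how many partitions agree on the cell, which is what makes the representation amenable to fast binary search structures.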
Supported by practical recommendations for hyperparameters, data structure, and compatibility with ANN indices, IKE augments the scalability of LLM2Vec in retrieval and semantic search contexts.
LLM2Vec establishes a general-purpose, modular methodology for converting backbone autoregressive LLMs to universal sequence and word encoders, enabling high-quality embeddings across text, structured data, and multimodal scenarios, with systematic empirical validation and tunable efficiency–quality tradeoffs (BehnamGhader et al., 2024, Hegselmann et al., 24 Feb 2025, Ehrmanntraut et al., 19 May 2025, Musacchio et al., 12 Mar 2025, Mahajan et al., 26 Feb 2025, Shamrai et al., 8 Aug 2025, Zhang et al., 14 Jan 2026).