
Sentence-T5: Dual-Encoder Sentence Embeddings

Updated 12 November 2025
  • Sentence-T5 is a family of large-scale sentence encoder models that repurpose the T5 architecture into dual-encoder setups for extracting fixed-dimensional, semantically meaningful embeddings.
  • The models support encoder-only (first-token and mean-pooling) and encoder–decoder (first-decoder-token) extraction strategies, with a learnable projection and L2 normalization so that similarity is scored by cosine similarity.
  • Empirical evaluations on benchmarks such as SentEval and SentGLUE, including multilingual extensions, demonstrate state-of-the-art performance and robust transferability even in low-resource language scenarios.

Sentence-T5 (ST5) denotes a family of large-scale sentence encoder models that adapt pre-trained text-to-text transformer architectures—specifically T5—into dual-encoder setups capable of producing semantically meaningful fixed-dimensional sentence embeddings. ST5 achieves state-of-the-art performance on various semantic textual similarity (STS) and sentence representation transfer benchmarks, including SentEval and SentGLUE. The approach generalizes effectively to multilingual domains, as demonstrated by the Multilingual Sentence-T5 (m-ST5), where scaling up parameter counts and employing parameter-efficient fine-tuning yield further gains, especially for low-resource and typologically distant languages. Models, code, and pretrained checkpoints are available at both TensorFlow Hub and HuggingFace repositories.

1. Sentence Embedding Extraction Architectures

ST5 transforms a pre-trained T5 encoder–decoder into a dual-encoder sentence-embedding architecture, supporting three principal strategies for extracting a vector representation $e \in \mathbb{R}^d$ from an input sentence $x$:

  1. Encoder-Only, First-Token (“ST5-Enc_first”): pass $x$ through the T5 encoder, yielding token-wise hidden states $H^{\mathrm{enc}} = [h_1^{\mathrm{enc}}, \ldots, h_L^{\mathrm{enc}}] \in \mathbb{R}^{L \times E}$. The representation is $e_\mathrm{raw} = h_1^{\mathrm{enc}}$, which can be projected and L2-normalized: $e = \mathrm{Norm}(W e_\mathrm{raw} + b)$, where $W \in \mathbb{R}^{d \times E}$ and $b \in \mathbb{R}^d$.
  2. Encoder-Only, Mean-Pooling (“ST5-Enc_mean”): compute the mean across token-wise encoder outputs, $e_\mathrm{raw} = \frac{1}{L} \sum_{i=1}^L h_i^{\mathrm{enc}}$, followed by the same projection and normalization.
  3. Encoder–Decoder, First-Decoder-Token (“ST5-EncDec_first”): pass the input through the encoder as above; in the decoder, provide only the start token (“<s>”) and use its output before the softmax, $h_0^{\mathrm{dec}}$, as the sentence vector: $e_\mathrm{raw} = h_0^{\mathrm{dec}}$. Project and normalize as above.

In practice, the projection dimension $d$ is a configurable hyperparameter (e.g., 768). For “raw” evaluation (no fine-tuning), the model sometimes omits $W, b$ and the L2 normalization. Once fine-tuned, however, all practical deployments apply a learnable $W, b$ and L2 normalization so that cosine similarity reduces to a dot product: $\mathrm{sim}(e_1, e_2) = e_1 \cdot e_2$.
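
A minimal PyTorch sketch of the encoder-only, mean-pooling strategy (ST5-Enc_mean), assuming a vanilla HuggingFace T5 checkpoint and a freshly initialized projection; the released ST5 checkpoints ship their own learned projection weights, so this is illustrative rather than a reproduction of the official models:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, T5EncoderModel

# Assumption: a vanilla T5 checkpoint stands in for an ST5 encoder.
tokenizer = AutoTokenizer.from_pretrained("t5-base")
encoder = T5EncoderModel.from_pretrained("t5-base")
E, d = encoder.config.d_model, 768        # hidden size E, projection dim d
proj = torch.nn.Linear(E, d)              # learnable W, b from Section 1

def embed(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True,
                      max_length=128, return_tensors="pt")
    with torch.no_grad():
        h = encoder(**batch).last_hidden_state            # (B, L, E)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # ignore padding tokens
    e_raw = (h * mask).sum(dim=1) / mask.sum(dim=1)       # mean-pooled e_raw
    return F.normalize(proj(e_raw), p=2, dim=-1)          # e = Norm(W e_raw + b)

e1, e2 = embed(["A dog runs in the park.", "A puppy is playing outside."])
cosine = (e1 * e2).sum()  # with unit vectors, the dot product is the cosine similarity
```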

2. Training Protocols: Pre-training and Contrastive Fine-tuning

Pre-training

ST5 inherits the T5 training regime, which is based on unsupervised span-corruption. Random spans of text are replaced with sentinel tokens, and the model is trained to reconstruct the masked segments in a sequence-to-sequence setup.
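
A worked example of span corruption, using the well-known illustration from the T5 paper; the HuggingFace-style sentinel spellings below are an implementation detail assumed for concreteness:

```python
# Span corruption: random spans are dropped from the input and replaced with
# sentinel tokens; the target reproduces each dropped span after its sentinel.
original  = "Thank you for inviting me to your party last week."
corrupted = "Thank you <extra_id_0> me to your party <extra_id_1> week."
target    = "<extra_id_0> for inviting <extra_id_1> last <extra_id_2>"
```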

Dual-Encoder Contrastive Fine-tuning

ST5 is then fine-tuned with two-stage dual-encoder contrastive learning:

  • Stage 1: Fine-tune with ~2B question–answer pairs from community QA.
  • Stage 2: Further fine-tune on 275K Natural Language Inference (NLI) entailment pairs.

For a batch of BB paired examples, with in-batch negatives, the contrastive loss is:

$$\mathcal{L} = -\sum_{i=1}^B \log \frac{ \exp(\mathrm{sim}(e_i, e_i^+)/\tau) }{ \sum_{j=1}^B \exp(\mathrm{sim}(e_i, e_j^+)/\tau) }$$

with $\tau = 0.01$. When hard negatives $e_j^-$ are available, they augment the denominator:

$$\mathcal{L}_i = -\log \frac{ \exp(\mathrm{sim}(e_i, e_i^+)/\tau) }{ \sum_j \left[ \exp(\mathrm{sim}(e_i, e_j^+)/\tau) + \exp(\mathrm{sim}(e_i, e_j^-)/\tau) \right] }$$
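
A sketch of this loss in PyTorch, assuming the embeddings are already L2-normalized so that matrix products yield cosine similarities; the optional hard-negative branch mirrors the second formula:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(e, e_pos, e_neg=None, tau=0.01):
    """In-batch softmax contrastive loss (Section 2).

    e, e_pos: (B, d) L2-normalized embeddings of paired inputs.
    e_neg:    optional (B, d) L2-normalized hard negatives.
    """
    sim_pos = e @ e_pos.T / tau                 # (B, B): sim(e_i, e_j^+)
    logits = sim_pos
    if e_neg is not None:
        sim_neg = e @ e_neg.T / tau             # (B, B): sim(e_i, e_j^-)
        logits = torch.cat([sim_pos, sim_neg], dim=1)  # widen the denominator
    labels = torch.arange(e.size(0))            # the positive for row i is column i
    # cross_entropy averages over the batch rather than summing, which only rescales the loss.
    return F.cross_entropy(logits, labels)
```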

Optimization uses Adafactor with a linear learning rate decay from 1e-3 (10% warm-up, 90% decay). Large batches are used (2048 for QA, 512 for NLI); input sequence lengths extend up to 128 tokens during inference.
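
A sketch of this optimization setup using the HuggingFace Adafactor implementation and a linear warm-up/decay schedule; the total step count and the stand-in module are placeholders, not values from the paper:

```python
import torch
from transformers import Adafactor, get_linear_schedule_with_warmup

# Stand-in module; in practice this is the dual encoder (T5 encoder + projection).
model = torch.nn.Linear(768, 768)
total_steps = 100_000                            # placeholder, not specified here

optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,                                     # peak learning rate
    scale_parameter=False, relative_step=False, warmup_init=False,
)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),     # 10% linear warm-up
    num_training_steps=total_steps,              # linear decay over the remaining 90%
)
```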

3. Scaling Laws and Model Variants

ST5 inherits from T5 checkpoints at multiple scales:

| Variant | Encoder Params | Enc–Dec Params |
|---------|----------------|----------------|
| Base    | ≈110M          | ≈220M          |
| Large   | 335M           | 770M           |
| 3B      | 1.24B          | 3B             |
| 11B     | 4.8B           | 11B            |

Empirically, transfer task performance (SentEval 7-task average accuracy) improves monotonically with increasing model size for ST5-Enc_mean: Base → Large → 3B → 11B scores: 89.0 → 90.35 → 91.15 → 91.63 (after full QA+NLI fine-tuning).

Raw (non-finetuned) models achieve strong transfer but relatively poor STS results (55–60 Spearman ρ). After fine-tuning, 11B Enc_mean achieves transfer 91.63 and STS 84.96 (vs. SimCSE-RoBERTa_Large 90.23/83.76). The decoder variant (EncDec_first) slightly lags on transfer but leads on STS (state-of-the-art 84.94 at 11B).

4. Benchmarks and Empirical Evaluation

SentEval and SentGLUE

  • SentEval: Evaluates transfer (MR, CR, SUBJ, MPQA, SST-2, TREC, MRPC) and STS (STS12–16, STS-B, SICK-R) via logistic regression (transfer) and Spearman ρ (STS).
  • SentGLUE: Extends SentEval to nine GLUE tasks, including CoLA, SST-2, MRPC, STS-B, QQP, MNLI-m/mm, QNLI, RTE. Models produce embeddings without cross-attention; linear classifiers are trained on top.
  • At 11B, after QA+NLI fine-tuning, ST5-Enc achieves transfer 91.63, STS 84.96 on SentEval, and GLUE-avg ≈ 80.07 on SentGLUE dev.
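
In both suites, the STS tasks score embeddings by the Spearman correlation between cosine similarities and gold similarity ratings; a minimal scoring helper is sketched below (the embeddings themselves are assumed to come from an ST5 encoder, e.g., via the Section 1 sketch):

```python
import numpy as np
from scipy.stats import spearmanr

def sts_spearman(emb1, emb2, gold):
    """emb1, emb2: (N, d) L2-normalized embeddings of the two sides of each pair;
    gold: (N,) human similarity ratings."""
    cos = np.sum(emb1 * emb2, axis=1)        # cosine similarity of unit vectors
    return spearmanr(cos, gold).correlation  # Spearman rho, as reported for STS
```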

Statistical Analysis

Consistent gains (+1.4 to 2.2 pts) over strong baselines, including SimCSE-RoBERTa_Large and SBERT_RoBERTa_Large, are observed across multiple runs and model sizes, though formal confidence intervals are not provided.

Embedding Space Properties

Uniformity ($\mathcal{L}_\mathrm{uniform}$) and alignment ($\mathcal{L}_\mathrm{align}$, as per Wang & Isola 2020) both improve as model size increases, correlating with observed STS gains.
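
These two metrics can be computed following the definitions of Wang & Isola (2020); the sketch below uses the common exponents $\alpha = 2$ and $t = 2$, an assumption since the summary does not state them:

```python
import torch

def alignment(x, y, alpha=2):
    """Alignment: expected distance between L2-normalized positive pairs x, y: (N, d)."""
    return (x - y).norm(p=2, dim=1).pow(alpha).mean()

def uniformity(x, t=2):
    """Uniformity: log of the average Gaussian potential over all pairs in x: (N, d)."""
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()
```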

5. Multilingual Extension: m-ST5

Multilingual Sentence-T5 (m-ST5) utilizes the encoder of mT5-xxl (≈5.7B params), relying on SentencePiece tokenization with no modification. Sentence embeddings are obtained via average pooling of encoder token outputs. LoRA (Low-Rank Adaptation, $r = 8$, $\alpha = 32$) enables parameter-efficient fine-tuning, applied either to the query and value matrices (q+v) or to all linear layers (all-lin).
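
A sketch of this LoRA setup with the HuggingFace `peft` library; the `q`/`v` module names follow the HuggingFace mT5 implementation, and whether this exactly mirrors the original m-ST5 code is an assumption:

```python
from peft import LoraConfig, get_peft_model
from transformers import MT5EncoderModel

encoder = MT5EncoderModel.from_pretrained("google/mt5-xxl")  # ≈5.7B-param encoder
lora_cfg = LoraConfig(
    r=8, lora_alpha=32, lora_dropout=0.0,
    target_modules=["q", "v"],   # the "q+v" variant; target all linear layers for "all-lin"
)
model = get_peft_model(encoder, lora_cfg)
model.print_trainable_parameters()   # only the low-rank adapters are trainable
```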

Contrastive NLI-based fine-tuning uses:

  • XNLI: 15-language crowd-translated MultiNLI (1.96M triplets).
  • en-NLI: SNLI + MNLI (276K triplets).
  • Language-specific corpora: JSNLI (Japanese), KorNLI/STC, CMNLI, Chinese STS.

The standard InfoNCE loss with in-batch negatives is used; the temperature $\tau$ is not explicitly specified, plausibly defaulting to values from SimCSE.

Empirical highlights:

  • Cross-lingual retrieval (Tatoeba-36): m-ST5 (all-lin, XNLI) achieves up to 99.6% language-specific retrieval, with a much narrower spread across languages than prior models.
  • Monolingual STS (Spearman ρ): m-ST5 attains 84.1 (ja), 81.1 (ko), 79.6 (zh).
  • Scaling law: Performance rises monotonically with log(model size), with largest improvements for low-resource and non-English-like languages.

6. Practical Guidance and Recommendations

Task and Architecture Selection:

  • For transfer/ranking/classification, ST5-Enc_mean with QA+NLI fine-tuning is recommended (see the usage sketch after this list).
  • For pure semantic similarity (STS, ranking), ST5-EncDec_first slightly outperforms.
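
Following the encoder-mean recommendation above, the publicly released checkpoints can be loaded through the `sentence-transformers` package; the base-size model is shown for convenience, with larger variants also published:

```python
from sentence_transformers import SentenceTransformer, util

# Publicly released encoder-mean checkpoint (larger variants are also available).
model = SentenceTransformer("sentence-transformers/sentence-t5-base")

sentences = ["A dog runs in the park.", "A puppy is playing outside."]
embeddings = model.encode(sentences, normalize_embeddings=True)  # L2-normalized vectors
score = util.cos_sim(embeddings[0], embeddings[1])               # cosine similarity
```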

Throughput:

| Model / Hardware    | Throughput          |
|---------------------|---------------------|
| 11B ST5-Enc, TPU-v8 | ~274 sentences/sec  |
| 11B ST5-Enc, 4×V100 | ~27 sentences/sec   |
| 11B ST5-Enc, CPU    | ~0.5 sentences/sec  |
| Large (335M)        | 2–3× 11B throughput |

Production Considerations:

  • For large-batch or production-scale encoding, ST5-Enc_mean Large (335M) offers sub-second encoding with only a 0.5–1 point drop in STS/transfer scores relative to 11B.
  • Always apply the two-stage QA+NLI contrastive fine-tuning to maximize performance.
  • Use the largest possible model size for applications emphasizing low-resource, typologically diverse, or non-English languages, as performance improvements are accentuated here.

7. Extensions and Future Directions

ST5 establishes that text-to-text transformers can be effectively repurposed as sentence encoders through pooling, projection, and contrastive dual-encoder training. The methodology scales robustly in both parameter space and multilinguality, with m-ST5 representing a straightforward extension wherein LoRA adapters serve as the adaptation mechanism. Scaling effects persist, and the available evidence suggests future increases in model capacity and more diverse NLI datasets will continue to benefit semantic alignment, particularly for underrepresented languages. The strong empirical results across SentEval, GLUE, and multilingual benchmarks position ST5 and its extensions as reference architectures for large-scale sentence embedding.
