
Poly-encoders: Efficient Multi-Sentence Scoring

Updated 11 October 2025
  • Poly-encoders are transformer models that use multiple global attention vectors to achieve candidate-sensitive context encoding.
  • They bridge the gap between Bi- and Cross-encoders by combining fast candidate caching with near state-of-the-art accuracy in dialogue and retrieval tasks.
  • Empirical evaluations demonstrate that Poly-encoders offer significant inference speedups while maintaining competitive scoring accuracy with optimized pre-training and fine-tuning.

A Poly-encoder is a transformer-based neural architecture for fast and accurate multi-sentence scoring, particularly suited to tasks requiring pairwise comparison between a context and a candidate sequence. Unlike conventional Bi-encoders or Cross-encoders, Poly-encoders leverage multiple global attention vectors to encode the context, enabling both efficient pre-computation and candidate-sensitive interaction. This design strikes a favorable computation-accuracy trade-off, yielding near state-of-the-art performance on dialogue and retrieval tasks at tractable inference cost.

1. Architectural Foundations

Poly-encoders are grounded in the contrast between Bi-encoders and Cross-encoders. In Bi-encoders, context and candidates are independently encoded to vectors—often using the output of a special token such as [S]—and scored via dot products. This allows candidate embeddings to be cached, enabling high inference speed, but at the cost of neglecting fine-grained context-candidate interactions.

Cross-encoders concatenate context and candidate tokens, feeding them jointly into a transformer encoder so every token can attend to every other. This results in rich context-candidate interactions and maximal accuracy, but inference is expensive, as candidate caching is impossible and encoding must be repeated for every candidate.
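The computational contrast can be seen in a minimal sketch: the Bi-encoder scores an entire candidate pool against cached embeddings with one matrix product, while the Cross-encoder must run a joint pass per candidate. The encoders here are random stand-ins (random vectors and a linear layer), and names such as `cached_candidates` and `joint_scorer` are illustrative, not from the paper.

```python
import torch

torch.manual_seed(0)
d, n_candidates = 256, 1000     # embedding size and candidate pool (illustrative)

# --- Bi-encoder regime: candidate vectors are encoded once and cached. ---
cached_candidates = torch.randn(n_candidates, d)   # stand-in for precomputed [S] outputs
context_vec = torch.randn(d)                       # stand-in for the encoded context

bi_scores = cached_candidates @ context_vec        # one matrix product scores the whole pool

# --- Cross-encoder regime: every (context, candidate) pair is re-encoded jointly. ---
# A random linear layer stands in for a full transformer pass over the
# concatenated token sequences; note that nothing can be cached here.
joint_scorer = torch.nn.Linear(2 * d, 1)
cross_scores = torch.stack([
    joint_scorer(torch.cat([context_vec, cand])).squeeze()   # one pass per candidate
    for cand in cached_candidates
])

print(bi_scores.shape, cross_scores.shape)   # torch.Size([1000]) torch.Size([1000])
```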

Poly-encoders maintain separate encoders for context and candidate, enabling candidate representation pre-computation like Bi-encoders. They augment the context encoding with $m$ "global" vectors, either using $m$ learned code vectors that attend over all context token outputs or by extracting the first $m$ output vectors from the context encoder. For a given candidate, its embedding $y_{\text{cand}}$ attends over the $m$ context vectors $\{y_{\text{ctxt}}^1, \dots, y_{\text{ctxt}}^m\}$ via an attention mechanism to produce a candidate-conditioned context summary:

  1. Compute the $m$ global context vectors:

$$y_{\text{ctxt}}^i = \sum_j w_j^{(c_i)} h_j$$

with

$$(w_1^{(c_i)}, \dots, w_N^{(c_i)}) = \operatorname{softmax}(c_i \cdot h_1, \dots, c_i \cdot h_N)$$

where $h_j$ are the context hidden states and $c_i$ are the learned code vectors.

  2. The candidate attends over the global context vectors:

$$y_{\text{ctxt}} = \sum_i w_i \, y_{\text{ctxt}}^i$$

with

$$(w_1, \dots, w_m) = \operatorname{softmax}(y_{\text{cand}} \cdot y_{\text{ctxt}}^1, \dots, y_{\text{cand}} \cdot y_{\text{ctxt}}^m)$$

  3. Compute the final context-candidate score:

$$\text{score} = y_{\text{cand}} \cdot y_{\text{ctxt}}$$

This workflow enables candidate-dependent context aggregation and preserves computational tractability via candidate caching.
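The three steps above can be written out directly. The following is a minimal PyTorch sketch with random tensors standing in for the transformer outputs, the learned codes, and the cached candidate embedding ($h_j$, $c_i$, and $y_{\text{cand}}$ in the notation above); it illustrates the attention arithmetic and is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, N, m = 256, 64, 16          # hidden size, context length, number of codes

h = torch.randn(N, d)          # context encoder outputs h_1..h_N (stand-in)
codes = torch.randn(m, d)      # learned code vectors c_1..c_m (stand-in for nn.Parameter)
y_cand = torch.randn(d)        # cached candidate embedding y_cand (stand-in)

# Step 1: each code attends over the context token outputs -> m global vectors.
# This step is candidate-independent, so it is computed once per context.
w_ctx = F.softmax(codes @ h.T, dim=-1)          # (m, N) attention weights
y_ctxt_global = w_ctx @ h                       # (m, d) global context vectors

# Step 2: the candidate embedding attends over the m global vectors.
w_cand = F.softmax(y_ctxt_global @ y_cand, dim=-1)    # (m,) weights
y_ctxt = w_cand @ y_ctxt_global                       # (d,) candidate-conditioned context

# Step 3: final score is a dot product.
score = y_ctxt @ y_cand
print(score.item())
```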

2. Global Self-Attention Dynamics

The efficiency and discriminative power of Poly-encoders arise from their "global" self-attention mechanism. Each context is summarized not by a single vector but by $m$ global features $y_{\text{ctxt}}^i$, each produced by a learned attention code $c_i$. These codes attend (via dot products and softmax weights) over the output token representations $h_1, \dots, h_N$ of the context transformer, yielding distinct global summaries.

Subsequently, attention is performed again, this time with the candidate embedding $y_{\text{cand}}$ as the query, yielding weights over the $m$ global vectors and then their weighted sum as the candidate-specific context. Since $m \ll N$, where $N$ is the number of context tokens, this second attention is computationally light compared to the full joint token attention in Cross-encoders.
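As a rough back-of-the-envelope accounting (an illustration, not a figure from the paper), the per-candidate scoring cost once the context has been encoded is approximately

$$\text{Bi-encoder: } O(d) \qquad \text{Poly-encoder: } O(m d) \qquad \text{Cross-encoder: roughly } O\big((N + N_{\text{cand}})^2 d\big) \text{ per layer,}$$

where $d$ is the hidden size and $N_{\text{cand}}$ is the candidate length; the candidate-independent first attention of the Poly-encoder adds only $O(mNd)$ once per context.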

This dual-stage attention strategy permits both diverse aspect extraction from the context and selective focus based on candidate semantics, yielding stronger model expressivity than Bi-encoders while vastly reducing the cost relative to Cross-encoders.

3. Comparative Evaluation Across Multi-Sentence Scoring

Poly-encoders have been empirically evaluated alongside Bi- and Cross-encoder baselines across several domains, notably dialogue response selection (ConvAI2, DSTC7, Ubuntu V2) and document retrieval (Wikipedia IR).

Findings:

  • On ConvAI2 with Reddit pre-training, the Bi-encoder achieves 84.8% R@1/20, the Poly-encoder reaches 86.8% (with 360 codes), and the Cross-encoder reaches 87.9%.
  • Cross-encoders, while scoring highest, are orders of magnitude slower at inference, since every candidate requires a joint context-candidate encoding pass.
  • Poly-encoders outperform Bi-encoders and approach Cross-encoder accuracy while retaining Bi-encoder-level efficiency.
  • On DSTC7 and Ubuntu V2, Poly-encoders match or slightly exceed Bi-encoder accuracy with dramatic computational savings over Cross-encoders.

The accuracy-speed profile of Poly-encoders marks a significant advance in practical multi-sentence scoring scenarios where latency is critical and candidate pools are large.

4. Pre-Training and Fine-Tuning Regimens

Model effectiveness hinges on pre-training and fine-tuning choices. Two strategies have been assessed:

  • BERT-style pre-training on general corpora (Toronto Books + Wikipedia) provides robust initial representations.
  • Pre-training from scratch on large Reddit conversational datasets (stylistically closer to target dialogue tasks) yields further performance improvement for all encoder variants.

During fine-tuning, optimal strategies include:

  • Cross-entropy loss over batches containing the correct candidate and multiple negatives, often drawing negatives from other batch instances.
  • Fine-tuning nearly all transformer layers (optionally freezing the embedding layer) to adapt the model to the target task.

These regimens, especially domain-relevant pre-training (Reddit data) and large-batch negative sampling, enable the Poly-encoder to achieve state-of-the-art results on task benchmarks.
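A minimal sketch of the in-batch-negatives objective described above, assuming the batch of context and candidate embeddings has already been produced by the encoders; for brevity a single vector per context is used, whereas a Poly-encoder would apply its candidate-conditioned aggregation to each (context, candidate) pair in the batch. Variable names are illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, d = 32, 256

# Stand-ins for encoder outputs: ctx_emb[i] should score highest against cand_emb[i].
ctx_emb = torch.randn(batch_size, d, requires_grad=True)
cand_emb = torch.randn(batch_size, d, requires_grad=True)

# Score every context against every candidate in the batch; the off-diagonal
# entries serve as negatives for each context (in-batch negative sampling).
logits = ctx_emb @ cand_emb.T            # (batch, batch) score matrix
labels = torch.arange(batch_size)        # the matching candidate sits on the diagonal

loss = F.cross_entropy(logits, labels)
loss.backward()                          # gradients flow back to both encoders
print(loss.item())
```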

5. Practical Applications and Theoretical Implications

Poly-encoders are well-suited for high-throughput, real-time applications in:

  • Retrieval-based dialogue systems and chatbots: Enables rapid scoring of thousands of candidate responses with rich context-dependent interactions (e.g., ConvAI2, DSTC7).
  • Information retrieval: Facilitates fast query-document matching across large corpora (e.g., Wikipedia IR), supporting scaling to millions of candidates with competitive accuracy.
  • Multi-sentence and multi-document ranking tasks: Useful for recommendation systems, question answering, dialogue act classification, and any paradigm where large pools of candidate texts must be efficiently scored.

Implications for future research include the potential adaptation of candidate-driven global vector representations to other modalities (e.g., vision-language), integration with approximate nearest-neighbor search algorithms, and further exploration of pre-training objectives tailored to specific downstream datasets and domains.
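One way the approximate nearest-neighbor integration mentioned above could look (an assumption-laden sketch, not a method from the paper): build a shortlist with a single candidate-independent query vector, here taken as the mean of the $m$ global context vectors, then rerank the shortlist with the full candidate-conditioned Poly-encoder scoring. At millions of candidates, the exact `torch.topk` call would be replaced by an approximate nearest-neighbor index such as FAISS.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, m, n_candidates, shortlist_k = 256, 16, 100_000, 100

# Precomputed, cached candidate embeddings and per-context global vectors (stand-ins).
cand_emb = torch.randn(n_candidates, d)
y_ctxt_global = torch.randn(m, d)        # the m global context vectors from Section 1

# Stage 1: cheap shortlist with a single candidate-independent query vector
# (here, the mean of the global vectors). At scale, this exact top-k would be
# replaced by an approximate nearest-neighbor index.
query = y_ctxt_global.mean(dim=0)
_, shortlist_idx = torch.topk(cand_emb @ query, shortlist_k)

# Stage 2: exact Poly-encoder rescoring on the shortlist only.
short = cand_emb[shortlist_idx]                       # (k, d)
w = F.softmax(short @ y_ctxt_global.T, dim=-1)        # (k, m) candidate-over-codes weights
y_ctxt = w @ y_ctxt_global                            # (k, d) candidate-conditioned contexts
rerank_scores = (y_ctxt * short).sum(dim=-1)          # (k,) final dot-product scores

best = shortlist_idx[rerank_scores.argmax()]
print(int(best))
```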

6. Summary of Core Contributions

The Poly-encoder architecture establishes a learned global attention methodology for efficient, candidate-sensitive context encoding in transformer models. By combining candidate caching and candidate-driven context aggregation, Poly-encoders bridge the gap between Bi-encoder efficiency and Cross-encoder accuracy. Empirical validation demonstrates superiority over Bi-encoders and near-parity with Cross-encoders across key dialogue and information retrieval benchmarks, with decisive gains in inference scalability. The architecture further highlights the criticality of domain-aligned pre-training and fine-tuning—particularly large-batch negative sampling—in achieving state-of-the-art performance. Poly-encoders represent an important tool for practitioners requiring rapid, accurate multi-sentence scoring in settings where candidate numbers and response latency preclude full joint encoding.
