Poly-encoders: Efficient Multi-Sentence Scoring
- Poly-encoders are transformer models that use multiple global attention vectors to achieve candidate-sensitive context encoding.
- They bridge the gap between Bi- and Cross-encoders by combining fast candidate caching with near state-of-the-art accuracy in dialogue and retrieval tasks.
- Empirical evaluations demonstrate that Poly-encoders offer large inference speedups over Cross-encoders while, given suitable pre-training and fine-tuning, maintaining competitive scoring accuracy.
A Poly-encoder is a transformer-based neural architecture for fast and accurate multi-sentence scoring, particularly suited to tasks requiring pairwise comparison between a context and a candidate sequence. Unlike conventional Bi-encoders or Cross-encoders, Poly-encoders leverage multiple global attention vectors to encode the context, enabling both efficient pre-computation and candidate-sensitive interaction. This design strikes a favorable balance between computational cost and accuracy, yielding near state-of-the-art performance on dialogue and retrieval tasks at tractable inference cost.
1. Architectural Foundations
Poly-encoders are grounded in the contrast between Bi-encoders and Cross-encoders. In Bi-encoders, context and candidates are independently encoded into vectors (often taking the output of a special token such as [S]) and scored via a dot product. This allows candidate embeddings to be cached, enabling high inference speed, but at the cost of neglecting fine-grained context-candidate interactions.
Cross-encoders concatenate context and candidate tokens, feeding them jointly into a transformer encoder so every token can attend to every other. This results in rich context-candidate interactions and maximal accuracy, but inference is expensive, as candidate caching is impossible and encoding must be repeated for every candidate.
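This contrast can be made concrete with a short hedged sketch (the encoders are abstract stand-ins for any transformer, not a specific library API):

```python
import torch

def bi_encoder_scores(ctx_vec: torch.Tensor, cand_matrix: torch.Tensor):
    """Bi-encoder scoring: one context vector against a cached matrix of
    candidate vectors; a single matrix-vector product, no re-encoding.

    ctx_vec:     [d]     context embedding
    cand_matrix: [K, d]  candidate embeddings, pre-computed offline
    """
    return cand_matrix @ ctx_vec              # [K] scores

def cross_encoder_scores(pair_encoder, ctx_tokens, cand_token_lists):
    """Cross-encoder scoring: each candidate needs a fresh forward pass
    over the concatenated (context, candidate) sequence, so nothing caches.

    pair_encoder: any callable mapping a joint token sequence to a scalar
    """
    scores = []
    for cand_tokens in cand_token_lists:      # K full transformer passes
        joint = torch.cat([ctx_tokens, cand_tokens])
        scores.append(pair_encoder(joint))
    return torch.stack(scores)
```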
Poly-encoders maintain separate encoders for context and candidate, enabling candidate representations to be pre-computed as in Bi-encoders. They augment the context encoding with m "global" vectors, obtained either from m learned code vectors that attend over all context token outputs or by taking the first m output vectors of the context encoder. For a given candidate, its embedding then attends over these global context vectors to produce a candidate-conditioned context summary:
- Compute global context vectors: $y_{\text{ctxt}}^{i} = \sum_{j} w_{j}^{c_i} h_j$, with $(w_1^{c_i}, \dots, w_N^{c_i}) = \operatorname{softmax}(c_i \cdot h_1, \dots, c_i \cdot h_N)$, where $h_1, \dots, h_N$ are the context token hidden states and $c_1, \dots, c_m$ are the learned code vectors.
- Candidate attends over the global context vectors: $y_{\text{ctxt}} = \sum_{i} w_i \, y_{\text{ctxt}}^{i}$, with $(w_1, \dots, w_m) = \operatorname{softmax}(y_{\text{cand}} \cdot y_{\text{ctxt}}^{1}, \dots, y_{\text{cand}} \cdot y_{\text{ctxt}}^{m})$.
- Final context-candidate score: $s(\text{ctxt}, \text{cand}) = y_{\text{ctxt}} \cdot y_{\text{cand}}$.
This workflow enables candidate-dependent context aggregation and preserves computational tractability via candidate caching.
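A minimal PyTorch sketch of this two-stage scoring follows; it assumes the context encoder's token outputs and a cached candidate embedding are already available (the module and variable names are this write-up's, not the authors' reference implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolyAttention(nn.Module):
    """Two-stage Poly-encoder attention: m learned codes summarize the
    context, then the candidate embedding attends over those summaries."""

    def __init__(self, m: int, d: int):
        super().__init__()
        # Learned code vectors c_1..c_m (randomly initialized).
        self.codes = nn.Parameter(torch.randn(m, d) * d ** -0.5)

    def forward(self, ctx_hidden: torch.Tensor, cand_emb: torch.Tensor):
        # ctx_hidden: [N, d] token outputs h_1..h_N of the context encoder
        # cand_emb:   [d]    candidate embedding y_cand (cacheable offline)

        # Stage 1: each code attends over all N context token outputs.
        w = F.softmax(self.codes @ ctx_hidden.T, dim=-1)   # [m, N]
        y_ctxt_i = w @ ctx_hidden                          # [m, d] global vectors

        # Stage 2: candidate attends over the m global vectors (m << N).
        w2 = F.softmax(y_ctxt_i @ cand_emb, dim=-1)        # [m]
        y_ctxt = w2 @ y_ctxt_i                             # [d]

        # Final score: dot product of attended context and candidate.
        return y_ctxt @ cand_emb                           # scalar
```

In deployment, stage 1 runs once per context and candidate embeddings are pre-computed, so only the cheap stage-2 attention and dot product repeat per candidate.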
2. Global Self-Attention Dynamics
The efficiency and discriminative power of Poly-encoders arise from their "global" self-attention mechanism. Each context is summarized not by a single vector, but by $m$ global features $(y_{\text{ctxt}}^{1}, \dots, y_{\text{ctxt}}^{m})$, each produced by a learned attention code $c_i$. These codes attend (via dot-product and softmax weights) over all $N$ token outputs of the context transformer, yielding $m$ distinct global summaries.
Subsequently, attention is performed again, this time with the candidate embedding $y_{\text{cand}}$ as query, yielding weights over the $m$ global vectors and then their weighted sum as the candidate-specific context. Since $m \ll N$, where $N$ is the number of context tokens, this second attention is computationally light compared to the full joint token attention of Cross-encoders.
This dual-stage attention strategy permits both diverse aspect extraction from the context and selective focus based on candidate semantics, yielding stronger model expressivity than Bi-encoders while vastly reducing the cost relative to Cross-encoders.
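Concretely, the per-context work separates from the per-candidate work. Here is a hedged sketch of scoring one context against a large cached candidate pool in a single batched pass, reusing the stage-1 outputs of the `PolyAttention` module sketched above (a name introduced by this write-up):

```python
import torch
import torch.nn.functional as F

def score_pool(y_ctxt_i: torch.Tensor, cand_pool: torch.Tensor):
    """Score one context against many cached candidates at once.

    y_ctxt_i:  [m, d]  stage-1 global context vectors (computed once)
    cand_pool: [K, d]  pre-computed candidate embeddings (cached offline)
    returns:   [K]     one score per candidate
    """
    # Stage 2, batched: every candidate attends over the m global vectors.
    attn = F.softmax(cand_pool @ y_ctxt_i.T, dim=-1)   # [K, m]
    y_ctxt = attn @ y_ctxt_i                           # [K, d] per-candidate context
    # Dot product between each attended context and its candidate.
    return (y_ctxt * cand_pool).sum(dim=-1)            # [K]
```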
3. Comparative Evaluation Across Multi-Sentence Scoring
Poly-encoders have been empirically evaluated alongside Bi- and Cross-encoder baselines across several domains, notably dialogue response selection (ConvAI2, DSTC7, Ubuntu V2) and document retrieval (Wikipedia IR).
Findings:
- On ConvAI2, with Reddit pre-training, Bi-encoder achieves 84.8% R@1, Poly-encoder reaches ~86.8% R@1 (with 360 codes), while Cross-encoder reaches ~87.9% R@1.
- Cross-encoders, while scoring highest, are approximately two orders of magnitude slower at inference since every candidate requires context-candidate joint encoding.
- Poly-encoders outperform Bi-encoders and approach Cross-encoder accuracy while retaining Bi-encoder-level efficiency.
- On DSTC7 and Ubuntu V2, Poly-encoders match or slightly exceed Bi-encoder accuracy with dramatic computational savings over Cross-encoders.
The accuracy-speed profile of Poly-encoders marks a significant advance in practical multi-sentence scoring scenarios where latency is critical and candidate pools are large.
4. Pre-Training and Fine-Tuning Regimens
Model effectiveness hinges on pre-training and fine-tuning choices. Two strategies have been assessed:
- BERT-style pre-training on general corpora (Toronto Books + Wikipedia) provides robust initial representations.
- Pre-training from scratch on large Reddit conversational datasets (stylistically closer to target dialogue tasks) yields further performance improvement for all encoder variants.
During fine-tuning, optimal strategies include:
- Cross-entropy loss over batches containing the correct candidate and multiple negatives, often drawing negatives from other batch instances.
- Fine-tuning nearly all transformer layers (optionally keeping only the embedding layer frozen), which lets the encoders adapt fully to the target task.
These regimens, especially domain-relevant pre-training (Reddit data) and large-batch negative sampling, enable the Poly-encoder to achieve state-of-the-art results on task benchmarks.
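As an illustration, here is a minimal sketch of the in-batch negatives objective as it would apply to a Poly-encoder, where each batch element's correct candidate sits on the diagonal of a pairwise score matrix (function and variable names are this write-up's, not the paper's code):

```python
import torch
import torch.nn.functional as F

def poly_in_batch_loss(y_ctxt_i: torch.Tensor, cand: torch.Tensor):
    """Cross-entropy with in-batch negatives for a Poly-encoder.

    y_ctxt_i: [B, m, d] stage-1 global vectors for each context in the batch
    cand:     [B, d]    candidate embeddings; cand[b] is context b's positive
    """
    # Each context attends over its m global vectors once per candidate:
    # logits[b, k, i] = c_i-summary of context b scored against candidate k.
    logits = torch.einsum('bmd,kd->bkm', y_ctxt_i, cand)     # [B, B, m]
    attn = F.softmax(logits, dim=-1)                          # over m codes
    y_ctxt = torch.einsum('bkm,bmd->bkd', attn, y_ctxt_i)     # [B, B, d]
    scores = (y_ctxt * cand.unsqueeze(0)).sum(-1)             # [B, B]
    # Row b is a B-way classification; the diagonal entry is the positive.
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)
```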
5. Practical Applications and Theoretical Implications
Poly-encoders are well-suited for high-throughput, real-time applications in:
- Retrieval-based dialogue systems and chatbots: Enables rapid scoring of thousands of candidate responses with rich context-dependent interactions (e.g., ConvAI2, DSTC7).
- Information retrieval: Facilitates fast query-document matching across large corpora (e.g., Wikipedia IR), supporting scaling to millions of candidates with competitive accuracy.
- Multi-sentence and multi-document ranking tasks: Useful for recommendation systems, question answering, dialogue act classification, and any paradigm where large pools of candidate texts must be efficiently scored.
Implications for future research include the potential adaptation of candidate-driven global vector representations to other modalities (e.g., vision-language), integration with approximate nearest-neighbor search algorithms, and further exploration of pre-training objectives tailored to specific downstream datasets and domains.
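As a hedged sketch of the nearest-neighbor direction (not from the paper): cached candidate embeddings can be indexed with a similarity-search library such as FAISS, a coarse inner-product search shortlists k candidates (using a single pooled context vector as the query, an assumption of this write-up), and the full candidate-conditioned Poly-encoder attention re-scores only the shortlist:

```python
import numpy as np
import faiss  # similarity-search library; approximate indexes share this API

def shortlist_candidates(cand_pool: np.ndarray, query_vec: np.ndarray, k: int = 100):
    """Stage A: coarse inner-product retrieval over cached candidate embeddings.
    IndexFlatIP is exact; swap in an IVF or HNSW index for true ANN search."""
    index = faiss.IndexFlatIP(cand_pool.shape[1])
    index.add(cand_pool.astype(np.float32))
    _, idx = index.search(query_vec.astype(np.float32).reshape(1, -1), k)
    return idx[0]  # indices of the k best candidates under the coarse score

# Stage B (see the score_pool sketch in Section 2): run the exact
# Poly-encoder scoring only on these k << K shortlisted candidates.
```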
6. Summary of Core Contributions
The Poly-encoder architecture establishes a learned global attention methodology for efficient, candidate-sensitive context encoding in transformer models. By combining candidate caching and candidate-driven context aggregation, Poly-encoders bridge the gap between Bi-encoder efficiency and Cross-encoder accuracy. Empirical validation demonstrates superiority over Bi-encoders and near-parity with Cross-encoders across key dialogue and information retrieval benchmarks, with decisive gains in inference scalability. The architecture further highlights the criticality of domain-aligned pre-training and fine-tuning—particularly large-batch negative sampling—in achieving state-of-the-art performance. Poly-encoders represent an important tool for practitioners requiring rapid, accurate multi-sentence scoring in settings where candidate numbers and response latency preclude full joint encoding.