
Neural MT with Contrastive Memories

Updated 24 March 2026
  • The paper demonstrates that decoupling retrieval and generation representations via supervised contrastive learning significantly enhances translation precision.
  • Contrastive memories improve retrieval diversity by selecting complementary, non-redundant segments using hard negative mining and hierarchical group attention.
  • Empirical results reveal notable BLEU gains and higher retrieval accuracy across diverse domains, validating the contrastive memory approach.

Neural Machine Translation (NMT) with contrastive memories encompasses a class of retrieval-augmented architectures that incorporate explicitly contrastive objectives or contrast-driven memory selection into NMT systems. These approaches aim to improve the informativeness, diversity, and precision of external memory retrieval—be it through nearest-neighbor lookup in latent space or via translation memory (TM) retrieval at the sentence- or phrase-level. Recent works, notably “Learning Decoupled Retrieval Representation for Nearest Neighbour Neural Machine Translation” (Wang et al., 2022) and “Neural Machine Translation with Contrastive Translation Memories” (Cheng et al., 2022), have demonstrated substantial performance improvements by applying supervised contrastive learning to the retrieval module, decoupling retrieval and generation representations, and selecting or encoding memories to maximize their complementary information content.

1. Motivation for Contrastive Memories in Neural Machine Translation

Standard k-Nearest Neighbor NMT (kNN-MT) (Khandelwal et al., 2021) and classic retrieval-augmented models use external memory by retrieving sentences or context vectors similar to the current source or decoder state. However, in vanilla kNN-MT, the context representation used for both next-token prediction and memory retrieval (typically the output of the last decoder layer) is optimized solely for label likelihood, not for discriminating among fine-grained lexical alternatives in memory. Similarly, classical TM retrieval methods often select several top-ranked segments that are mutually redundant, leading to low information diversity and inefficiency.

Contrastive memory models address these limitations by:

  • Decoupling the representation for neural generation from that used for memory retrieval, allowing for specialized retrieval-optimized representations.
  • Formulating retrieval or memory selection as a contrastive process to ensure that retrieved memories are both holistically relevant and mutually distinctive.
  • Incorporating supervised contrastive learning objectives that sharpen the separation of representations in latent space, especially beneficial for rare or ambiguous terms.
  • Designing hierarchical encoding and aggregation mechanisms to extract both local (within-memory) and global (across-memory) context, promoting more informative cross-memory interactions (Cheng et al., 2022).

2. Decoupled Retrieval Representations and Supervised Contrastive Learning

In “Learning Decoupled Retrieval Representation for Nearest Neighbour Neural Machine Translation” (Wang et al., 2022), the standard kNN-MT pipeline is modified by introducing a lightweight retrieval adapter—a small feed-forward network (FFN)—applied to the decoder output. The base NMT decoder is either frozen or lightly fine-tuned, and only the FFN is trained by a supervised contrastive loss.

Given decoder output vectors $u_i$ corresponding to context–word pairs in the memory, the retrieval adapter produces $z_i = \mathrm{FFN}(u_i)$. The supervised contrastive objective operates over clusters of vectors $z_i$ sharing the same target word $v_i$, denoted $C_v$. Positives $(z_i, z^+)$ are sampled from $C_v \setminus \{z_i\}$, while negatives $(z_i, z^-)$ are drawn from the union of the other clusters $C_w$ with $w \neq v$. The contrastive loss for anchor $z_i$ is:

$$L_i = -\log \frac{ \sum_{p=1}^{M} \exp(a(z_i, z_p^+)) }{ \sum_{p=1}^{M} \exp(a(z_i, z_p^+)) + \sum_{n=1}^{N} \exp(a(z_i, z_n^-)) },$$

with

$$a(z, w) = \frac{1}{T'} \frac{z^\top w}{\|z\| \|w\|},$$

where $T'$ is a temperature parameter, $M$ is the number of positives, and $N$ is the number of negatives. This decoupled retrieval representation ensures that memory queries are specialized for distinguishing among target words, improving nearest-neighbor selection fidelity.
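The loss above can be sketched in a few lines of plain Python. This is a minimal illustration for a single anchor, not the authors' implementation: the function names, the toy 2-dimensional vectors, and the temperature value 0.1 are all assumptions chosen for readability.

```python
import math

def cosine(z, w):
    """Cosine similarity z.w / (|z| |w|) between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(z, w))
    norm_z = math.sqrt(sum(a * a for a in z))
    norm_w = math.sqrt(sum(b * b for b in w))
    return dot / (norm_z * norm_w)

def contrastive_loss(anchor, positives, negatives, temperature=0.1):
    """L_i = -log( sum_p e^{a(z_i,z_p+)} / (sum_p e^{a(z_i,z_p+)} + sum_n e^{a(z_i,z_n-)}) ),
    where a(z, w) is cosine similarity divided by the temperature."""
    pos = [math.exp(cosine(anchor, p) / temperature) for p in positives]
    neg = [math.exp(cosine(anchor, n) / temperature) for n in negatives]
    return -math.log(sum(pos) / (sum(pos) + sum(neg)))

# Toy adapter outputs (hypothetical values).
anchor = [1.0, 0.0]
positives = [[0.9, 0.1], [0.8, 0.2]]   # same target word as the anchor
close_negatives = [[0.7, 0.3]]         # different word, nearby in space
far_negatives = [[-1.0, 0.0]]          # different word, far away

loss_hard = contrastive_loss(anchor, positives, close_negatives)
loss_easy = contrastive_loss(anchor, positives, far_negatives)
```

In this toy setting `loss_hard` is larger than `loss_easy`: a negative that sits close to the anchor contributes a large term to the denominator, which is why nearby confounders carry most of the training signal.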

3. Contrastive Retrieval and Encoding of Translation Memories

“Neural Machine Translation with Contrastive Translation Memories” (Cheng et al., 2022) extends contrastive memory ideas to translation memory retrieval and encoding at the sentence or segment level. The retrieval phase uses a contrastive selection heuristic: after fetching a large candidate set $K$ from the TM using a full-text engine, the final memory set $M$ is assembled incrementally, at each step adding the candidate that is most similar to the input source $x$ while being least similar to the memories already selected:

$$x^{\mathrm{new}} = \arg\max_{x^i \in K \setminus M} \left[ \mathrm{sim}(x, x^i) - \frac{\alpha}{|M|} \sum_{x^j \in M} \mathrm{sim}(x^i, x^j) \right],$$

where $\mathrm{sim}(\cdot,\cdot)$ is a similarity score based on normalized edit distance (higher values mean closer matches) and $\alpha$ controls redundancy penalization. This ensures that each memory in $M$ is not only individually relevant but also adds information relative to the other retrieved segments.
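The greedy selection rule above can be sketched as follows. This is an illustrative reading, not the authors' code: `sim` is assumed to be one minus the token-level edit distance divided by the longer length, the first memory is assumed to be chosen by relevance alone (the penalty term is undefined while the selected set is empty), and all function names are hypothetical.

```python
def edit_distance(a, b):
    """Token-level Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def sim(a, b):
    """Assumed similarity in [0, 1]: 1 - edit_distance / max length."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

def select_memories(source, candidates, size, alpha=0.7):
    """Greedily pick `size` candidates, trading relevance against redundancy."""
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < size:
        def score(cand):
            relevance = sim(source, cand)
            if not selected:          # assumption: no penalty for the first pick
                return relevance
            redundancy = sum(sim(cand, m) for m in selected) / len(selected)
            return relevance - alpha * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

source = "the cat sat on the mat".split()
c1 = "the cat sat on the rug".split()
c2 = "the cat sat on the rug .".split()
c3 = "a dog sat on the mat".split()

contrastive = select_memories(source, [c1, c2, c3], size=2, alpha=0.7)
greedy = select_memories(source, [c1, c2, c3], size=2, alpha=0.0)
```

With `alpha=0.0` the procedure reduces to plain top-k retrieval and returns the two near-duplicates `c1` and `c2`. With `alpha=0.7` the second pick switches to `c3`, which is slightly less similar to the source but overlaps far less with `c1`.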

For memory encoding, a Hierarchical Group Attention (HGA) module builds a two-level graph among tokens (local intra-memory context) and “super-nodes” (global inter-memory context), enabling rich message passing both within and across retrieved segments.

4. Hard Negative Mining and Contrastive Memory Objectives

Both approaches employ hard negative mining to maximize the effectiveness of contrastive learning. In (Wang et al., 2022), instead of sampling negatives uniformly, “cluster-center” mining is performed: for each vocabulary word $v$, the mean of its cluster in representation space ($\bar{C}_v$) is precomputed. For an anchor $z_i$ labeled $v$, the algorithm selects $N$ negatives from the $K$ nearest cluster centers for other words, reducing computational overhead while focusing training on challenging confounders.
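A rough sketch of cluster-center mining, under stated assumptions: nearness is measured by cosine similarity, negatives are sampled uniformly from the members of the selected clusters, and the function names, toy vectors, and seed are hypothetical. The source does not specify these details.

```python
import math
import random

def cosine(z, w):
    dot = sum(a * b for a, b in zip(z, w))
    return dot / (math.sqrt(sum(a * a for a in z)) * math.sqrt(sum(b * b for b in w)))

def cluster_centers(clusters):
    """Mean vector of each word's cluster, computed once up front."""
    centers = {}
    for word, vectors in clusters.items():
        dim = len(vectors[0])
        centers[word] = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    return centers

def mine_hard_negatives(anchor, anchor_word, clusters, centers, k, n, seed=0):
    """Find the k centers (of other words) nearest the anchor, then draw n negatives from them."""
    ranked = sorted(
        (word for word in centers if word != anchor_word),
        key=lambda word: cosine(anchor, centers[word]),
        reverse=True,
    )
    nearest_words = ranked[:k]
    pool = [vec for word in nearest_words for vec in clusters[word]]
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    negatives = rng.sample(pool, min(n, len(pool)))
    return nearest_words, negatives

# Toy datastore: target word -> adapter outputs (hypothetical values).
clusters = {
    "bank": [[1.0, 0.0], [0.9, 0.1]],
    "shore": [[0.8, 0.3], [0.9, 0.3]],
    "river": [[0.5, 0.5]],
    "apple": [[-1.0, 0.2], [-0.9, 0.1]],
}
centers = cluster_centers(clusters)
nearest_words, negatives = mine_hard_negatives(
    [1.0, 0.0], "bank", clusters, centers, k=2, n=2
)
```

Comparing the anchor against one center per word, rather than against every stored vector, is what keeps the mining step cheap: the cost scales with vocabulary size instead of datastore size.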

In (Cheng et al., 2022), beyond contrastive retrieval, multi-TM contrastive learning (MTCL) is used at the training objective level. Let $h_y$ denote the super-node representation of the gold target, and $h_{y^i}$ that of the $i$-th retrieved TM. The loss is:

$$\mathcal{L}_{\mathrm{MTCL}} = -\sum_{i=1}^{|M|} \log \frac{ \exp(\mathrm{sim}(h_{y^i}, h_y)/\tau) }{ \sum_{j=1}^{|M|} \exp(\mathrm{sim}(h_{y^j}, h_y)/\tau) }$$

where $\mathrm{sim}$ is cosine similarity and $\tau$ is a temperature. This loss compels each TM representation to approach the gold target while being contrastively separated from other TMs, enhancing the diversity and informativeness of memory utilization.
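For concreteness, the MTCL formula exactly as written above can be evaluated like this. The sketch uses plain Python lists in place of learned super-node representations; the function names, toy vectors, and the temperature 0.1 are assumptions, and the log-sum-exp shift is only there for numerical stability.

```python
import math

def cosine(z, w):
    dot = sum(a * b for a, b in zip(z, w))
    return dot / (math.sqrt(sum(a * a for a in z)) * math.sqrt(sum(b * b for b in w)))

def mtcl_loss(memory_reps, target_rep, tau=0.1):
    """-sum_i log softmax_i( sim(h_{y^i}, h_y) / tau ), taken over the retrieved TMs."""
    logits = [cosine(h, target_rep) / tau for h in memory_reps]
    max_logit = max(logits)
    # log of the shared denominator, computed with the usual max-shift trick
    log_norm = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return -sum(l - log_norm for l in logits)

target = [1.0, 0.0]                      # stands in for h_y
memories = [[0.9, 0.1], [0.2, 0.8]]      # stand in for h_{y^1}, h_{y^2}

loss = mtcl_loss(memories, target)
loss_identical = mtcl_loss([[0.9, 0.1], [0.9, 0.1]], target)
```

As a sanity check on the arithmetic, two identical memory representations give a value of $2 \log 2$, since each softmax term is then exactly one half.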

5. Integration into NMT Architectures and Inference Procedure

Both lines of work integrate contrastive memories into the NMT pipeline at the retrieval, representation, and decoding stages.

In the decoupled kNN-MT framework (Wang et al., 2022):

  • After training, all context vectors in the memory datastore are projected via the FFN, optionally reduced by PCA and L2-normalized.
  • During inference, the current decoder state is mapped through the retrieval adapter, normalized, and used to retrieve the $k$ nearest neighbors via inner product.
  • Retrieval-based probabilities $p_r(y)$ are computed from the top-$k$ neighbors and interpolated with the base NMT model output $p_c(y)$:

$$p(y) = (1-\lambda)\, p_c(y) + \lambda\, p_r(y)$$

Adaptive interpolation $\lambda^*$ is also supported, scaling $\lambda$ by the average similarity of the retrieved neighbors.
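The interpolation step can be illustrated with a small sketch. Several details here are assumptions rather than the paper's specification: the retrieval distribution is taken to be a softmax over neighbor similarities aggregated per target token (the usual kNN-MT construction), the adaptive weight is taken to be a plain product of $\lambda$ and the clamped average similarity, and all names and numbers are hypothetical.

```python
import math

def retrieval_distribution(neighbors, temperature=1.0):
    """neighbors: list of (target_token, similarity) pairs for the top-k hits.
    Softmax over similarities, with mass summed per target token."""
    top = max(s for _, s in neighbors)
    weights = [(tok, math.exp((s - top) / temperature)) for tok, s in neighbors]
    total = sum(w for _, w in weights)
    p_r = {}
    for tok, w in weights:
        p_r[tok] = p_r.get(tok, 0.0) + w / total
    return p_r

def interpolate(p_c, p_r, lam):
    """p(y) = (1 - lam) * p_c(y) + lam * p_r(y) over the union of both supports."""
    vocab = set(p_c) | set(p_r)
    return {y: (1 - lam) * p_c.get(y, 0.0) + lam * p_r.get(y, 0.0) for y in vocab}

def adaptive_lambda(lam, neighbors):
    """Assumed form of lambda*: lam scaled by the mean neighbor similarity, clamped to [0, 1]."""
    avg = sum(s for _, s in neighbors) / len(neighbors)
    return lam * max(0.0, min(1.0, avg))

p_c = {"bank": 0.5, "shore": 0.3, "river": 0.2}            # base NMT distribution
neighbors = [("shore", 0.9), ("shore", 0.8), ("bank", 0.4)]  # retrieved (token, similarity)

p_r = retrieval_distribution(neighbors)
p = interpolate(p_c, p_r, lam=0.5)
lam_star = adaptive_lambda(0.5, neighbors)
p_adaptive = interpolate(p_c, p_r, lam_star)
```

Because both inputs are proper distributions and the weights sum to one, the interpolated result is again a distribution. Scaling $\lambda$ down when the neighbors are only weakly similar lets the model fall back on the base NMT prediction when retrieval is unreliable.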

In contrastive TM models (Cheng et al., 2022):

  • The encoder outputs for the source and for retrieved TMs are hierarchically aggregated by the HGA module. Decoder layers attend first to source, then to memory encodings.
  • Token-level generation probabilities interpolate standard vocabulary output and scores derived from memory attention and a copy mechanism:

$$p(y_t) = (1-p_{\mathrm{copy}})\, p_{\mathrm{vocab}}(y_t) + p_{\mathrm{copy}} \sum_{i=1}^{|z^m|} \alpha_i\, \mathbb{I}[z^m_i = y_t]$$

where $\alpha_i$ is the attention weight over memory token $z^m_i$.
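The copy-augmented output distribution can be worked through on a toy example. This is a minimal sketch with hypothetical tokens and weights: in the model, the copy gate and attention weights are produced by the network, whereas here they are fixed numbers.

```python
def mix_with_copy(p_vocab, memory_tokens, attention, p_copy):
    """p(y) = (1 - p_copy) * p_vocab(y) + p_copy * sum_i attention[i] * [memory_tokens[i] == y]."""
    copy_dist = {}
    for tok, weight in zip(memory_tokens, attention):
        # attention mass on repeated memory tokens accumulates on the same word
        copy_dist[tok] = copy_dist.get(tok, 0.0) + weight
    vocab = set(p_vocab) | set(copy_dist)
    return {
        y: (1 - p_copy) * p_vocab.get(y, 0.0) + p_copy * copy_dist.get(y, 0.0)
        for y in vocab
    }

p_vocab = {"house": 0.6, "home": 0.4}            # decoder softmax (toy)
memory_tokens = ["house", "building", "house"]   # target-side tokens of a retrieved TM
attention = [0.5, 0.2, 0.3]                      # attention over memory tokens, sums to 1
p = mix_with_copy(p_vocab, memory_tokens, attention, p_copy=0.25)
```

Here "house" receives 0.75 × 0.6 from the vocabulary term plus 0.25 × (0.5 + 0.3) from the copy term, giving 0.65, while "building" becomes reachable with probability 0.05 even though the vocabulary distribution assigned it no mass.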

6. Empirical Results and Quantitative Evaluation

Quantitative results from both works demonstrate consistent improvements in BLEU and retrieval accuracy versus strong baselines.

(Wang et al., 2022) reports:

| Method | Med | Law | IT | Koran | Subt | Avg |
|---|---|---|---|---|---|---|
| Baseline (WMT'19) | 39.91 | 45.71 | 37.98 | 16.30 | 29.21 | 33.82 |
| kNN-MT | 54.41 | 61.01 | 45.20 | 21.07 | 29.67 | 42.27 |
| Clknn (in-domain) | 55.86 | 61.92 | 47.77 | 21.46 | 31.02 | 43.61 |
| Clknn + λ* | 55.87 | 62.01 | 47.84 | 21.81 | 31.05 | 43.72 |

Retrieval accuracy at top-1 on the IT domain: kNN-MT (50.8%) vs. Clknn (56.6%); at $k=32$: kNN-MT (33.0%) vs. Clknn (38.8%).

Ablations reveal best BLEU when using two positives and 32 hard negatives per anchor. t-SNE visualizations confirm that Clknn's learned adapter forms tighter and more separated clusters in latent space, especially for rare targets.

(Cheng et al., 2022) presents test set BLEU improvements (CMM = Contrastive Memory Model):

| Direction | Transformer | CMM | Gain |
|---|---|---|---|
| Es→En | 64.63 | 67.76 | +3.13 |
| En→Es | 61.80 | 64.04 | +2.24 |
| De→En | 60.16 | 64.33 | +4.17 |
| En→De | 55.07 | 58.69 | +3.62 |

Ablation studies indicate that removing MTCL or HGA, or reverting to greedy retrieval, reduces BLEU by 0.6–0.8 points. Performance peaks at contrastive factor $\alpha \approx 0.7$ with $|M| = 5$ retrieved memories.

7. Limitations, Implications, and Future Directions

Contrastive memory models yield more informative and de-duplicated retrieval, boost retrieval precision especially for rare or ambiguous patterns, and enhance translation performance with modest computational overhead.

Limitations include dependence on in-domain memory resources—out-of-domain or low-resource settings may attenuate gains. Non-parametric retrieval (edit distance, Lucene) in (Cheng et al., 2022) could plausibly be replaced or supplemented by learned retrievers for further benefit.

Future prospects include:

  • Joint optimization of memory retrieval, hierarchical encoding, and contrastive objectives end-to-end.
  • Extending contrastive memory concepts to other retrieval-augmented tasks such as summarization and dialogue.
  • Dynamic tuning of memory set size and redundancy penalization per instance.

These findings position contrastive memory-augmented NMT as a flexible paradigm for memory integration, with reported improvements over conventional retrieval-augmented and non-contrastive baselines (Wang et al., 2022; Cheng et al., 2022).
