
Selective Token Repetition in NLP & Code

Updated 30 December 2025
  • Selective token repetition is a targeted approach in NLP and code generation that modulates the repetition of specific tokens through dynamic weighting and architecture-based strategies.
  • Techniques like TLDR and REP selectively adjust token loss contributions and copy mechanisms, leading to improved performance in dialogue systems and code autocompletion tasks.
  • Mechanistic interventions such as attention sink patching and contrastive token losses mitigate pathological repetition while maintaining necessary redundancy across different applications.

Selective token repetition refers to a set of mechanisms and modeling strategies in NLP and code generation that target the explicit learning, suppression, or enhancement of specific tokens' propensity to repeat in a generated sequence. Distinct from global anti-repetition penalties or unlikelihood approaches that act uniformly, selective methods leverage architecture, loss design, or token type to modulate repetition at a fine-grained level. Such approaches have been motivated by issues including pathological self-repetition in neural conversation models, code autocompletion accuracy, and the failure of LLMs to generate controlled repeated outputs.

1. Motivations for Selective Token Repetition Control

Repetition in neural sequence generation manifests as undesirable cycles (e.g., loops in chatbots, hallucinated redundancy in translation) or, conversely, as an inability to perform exact token repetition on demand (as with copy mechanisms in code models or LLMs failing at controlled string repetition). In encoder–decoder NLG, ubiquitous repetition—"I've been out of town. I've been out of town…"—stems from an overreliance on high-probability ("easy") tokens, especially when optimization dynamics stall model learning for low-frequency ("hard") tokens, leading to over-sampling of these "safe" outputs (Jiang et al., 2020). Similarly, code models must distinguish between tokens that tend to repeat locally (like variable names) and tokens whose repetition is neither desirable nor informative (literals, types) (Yang, 2020).

2. Dynamic Token-Level Weighting: The TLDR Algorithm

Token Loss Dynamic Reweighting (TLDR) is a general strategy for modulating the gradient contributions of target tokens during sequence model training, based on their prediction difficulty. Rather than treating every ground-truth token equally in the loss, TLDR assigns a differentiable, data-dependent weight $w_t$ to each token's cross-entropy loss:

$$L(x, y; \theta) = \sum_{t=1}^{|y|} w_t \,[-\log p_t]$$

where $p_t = p(y_t \mid y_{<t}, x; \theta)$ and

$$w_t = \cos(\pi p_t) + 1 \qquad (w_t \in [0, 2])$$

Tokens with low predicted probability ($p_t < 0.5$, interpreted as "hard") are up-weighted ($w_t > 1$), accelerating learning for under-represented tokens and suppressing reliance on over-learned "easy" tokens. The method is fully differentiable and token-specific, recalculating weights at each forward pass. Ablation studies on open-domain conversation datasets demonstrate that TLDR reduces repetition metrics (e.g., lower WL2) more effectively than uniform reweighting or focal loss; it generalizes to both RNN and transformer-based models without architectural modification (Jiang et al., 2020).
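
As a concrete illustration, a minimal PyTorch-style sketch of this reweighting is given below. The tensor shapes, the `pad_id` handling, and the decision to let gradients flow through $w_t$ are assumptions for illustration rather than details taken from the paper.

```python
# Minimal sketch of TLDR-style token loss reweighting (Jiang et al., 2020).
import math
import torch
import torch.nn.functional as F

def tldr_loss(logits: torch.Tensor, targets: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    """logits: (batch, seq_len, vocab); targets: (batch, seq_len) LongTensor."""
    log_probs = F.log_softmax(logits, dim=-1)
    # p_t: predicted probability of each ground-truth token
    tgt_log_p = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    p_t = tgt_log_p.exp()
    # w_t = cos(pi * p_t) + 1, in [0, 2]: "hard" tokens (p_t < 0.5) get w_t > 1.
    # The weight is recomputed every forward pass; detaching it is an alternative design choice.
    w_t = torch.cos(math.pi * p_t) + 1.0
    mask = (targets != pad_id).float()
    weighted_nll = -w_t * tgt_log_p * mask
    return weighted_nll.sum() / mask.sum()
```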

3. Selective Copying in Code Models

In code autocompletion, naive repetition learning is detrimental: directly applying copy/REP heads to the entire output token stream leads to sharp drops in model performance (Yang, 2020). Parsing the code structure via Abstract Syntax Trees (ASTs) enables fine-grained filtering: only non-grammar tokens (excluding type and punctuation markers) are considered for repetition modeling, and, of these, only identifiers (e.g., variables) show a statistically significant tendency to repeat in local context. Within identifiers, a further discrimination step (“Cared Nodes”) removes package names, method names, and qualified names that rarely repeat. This yields a highly selective context for copy attention. The REP model maintains a circular buffer of recent cared-identifier hidden states and activates its copy mechanism only within this subspace, leading to substantial improvements in top-$k$ accuracy on code completion tasks over nonselective baselines. The resulting output distribution mixes a learned repeat probability with the standard vocabulary distribution:

$$P_{\mathrm{rep}}(w) = P_{\mathrm{repeat}} \cdot \mathbb{I}[w = x_{k^*}] + P_{\mathrm{no\,repeat}} \cdot P_{\mathrm{vocab}}(w)$$

Empirical results confirm that selective repetition modeling is essential—applying copy attention to irrelevant tokens or identifiers degrades performance, while the selective strategy shows consistent gains, especially for unseen or rare identifiers (Yang, 2020).
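
A minimal sketch of how such a selective copy mixture can be wired up is shown below, assuming a PyTorch decoder. It uses a soft copy attention over the cared-identifier buffer rather than the hard indicator $\mathbb{I}[w = x_{k^*}]$ in the equation above, and all names (`buffer_keys`, `gate_proj`, etc.) are illustrative rather than taken from the REP implementation.

```python
# Sketch of a REP-style selective copy mixture restricted to "cared" identifiers.
import torch
import torch.nn.functional as F

def rep_distribution(h_t, vocab_logits, buffer_keys, buffer_token_ids, gate_proj):
    """
    h_t:              (hidden,)        current decoder hidden state
    vocab_logits:     (vocab,)         logits of the ordinary softmax head
    buffer_keys:      (k, hidden)      hidden states of recent cared identifiers
    buffer_token_ids: (k,) LongTensor  vocabulary ids of those identifiers
    gate_proj:        nn.Linear(hidden, 1) producing the repeat-gate logit
    """
    vocab_size = vocab_logits.size(0)
    p_vocab = F.softmax(vocab_logits, dim=-1)

    # Copy attention computed only over the cared-identifier buffer.
    copy_scores = buffer_keys @ h_t                    # (k,)
    copy_attn = F.softmax(copy_scores, dim=-1)

    # Scatter copy attention back onto the full vocabulary.
    p_copy = torch.zeros(vocab_size).scatter_add(0, buffer_token_ids, copy_attn)

    # P(repeat): probability that the next token is a local repetition.
    p_repeat = torch.sigmoid(gate_proj(h_t)).squeeze(-1)

    # Mixture of copy and vocabulary distributions.
    return p_repeat * p_copy + (1.0 - p_repeat) * p_vocab
```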

4. Mechanistic Barriers to Repetition: Attention Sink Phenomena in LLMs

LLMs exhibit failures in both directions: over-producing repetition and failing to repeat tokens exactly when prompted. The "attention sink" effect describes a structural phenomenon in transformer attention where, due to MLP "sink neurons" in early layers, the first token in the input sequence absorbs almost all later attention mass, establishing a neural circuit for sequence initialization (Yona et al., 11 Mar 2025). When sequences with very long repetitions are fed (e.g., thousands of identical tokens), the first-layer attention mistakenly treats all repeated tokens as new sequence starts, triggering the sink circuit iteratively and inducing divergence from the intended repeated output. Analytically, as the repeat count $n$ grows, the representation of the final token becomes indistinguishable from that of a single-token sequence:

$$\lim_{n\to\infty} \left\| T(S_n)_n - T([\text{token}])_1 \right\| = 0$$

A minimal architectural patch, clamping the sink-neuron activation after the first token, restores correct selective repetition behavior without degrading general model performance. This demonstrates that token repetition control can require mechanism-level interventions on foundational circuits, not just auxiliary loss terms (Yona et al., 11 Mar 2025).
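
The sketch below illustrates the general shape of such an intervention as a PyTorch forward hook that suppresses a hypothetical sink-neuron activation at every position after the first token. The layer index, neuron index, and module path are placeholders, not values reported in the paper.

```python
# Illustrative, assumption-heavy sketch of a sink-neuron clamp via a forward hook.
import torch

SINK_LAYER = 1      # hypothetical early layer containing the MLP sink neuron
SINK_NEURON = 123   # hypothetical index of the sink neuron within that MLP

def clamp_sink_neuron(module, inputs, output):
    # output: (batch, seq_len, d_mlp) hidden activations of the MLP.
    patched = output.clone()
    # Keep the sink neuron active at the true first position only, so long runs of
    # identical tokens are not re-treated as sequence starts.
    patched[:, 1:, SINK_NEURON] = 0.0
    return patched  # a returned value replaces the module's output

# Hypothetical usage, assuming the model exposes per-layer MLP activation modules:
# handle = model.layers[SINK_LAYER].mlp.act_fn.register_forward_hook(clamp_sink_neuron)
# ...run generation on a repetition prompt...
# handle.remove()
```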

5. Contrastive Token Loss with Dynamic Suppression in NMT

Contrastive Token Learning with Similarity Decay (CTSD) introduces a token-level contrastive objective for sequence-to-sequence tasks (e.g., NMT), driving the model to separate the representations of the correct token at time $t$ from those of recently repeated tokens, while dynamically modulating the penalty based on both recency and attention-context similarity:

$$\mathcal{L}_{\text{CTSD}}^t = \log\!\left( 1 + \sum_{y_t^- \in S_N^t} \alpha_d \, \alpha_s \, \exp\!\left(h_t^\top W_{y_t^-} - h_t^\top W_{y_t}\right) \right)$$

where

  • $\alpha_d = \exp\!\left(\frac{t_- - t}{T}\right)$ is the distance decay,
  • $\alpha_s$ is the cosine similarity between the cross-attention weight vectors at positions $t$ and $t_-$.

This framework penalizes recent and contextually similar repeats far more aggressively than "blind" anti-repetition techniques, preserving necessary redundancy (e.g., repeated attribute words in product titles) while eliminating pathological oscillation. CTSD operates purely at the loss level, integrating seamlessly with standard transformer-based NMT pipelines. Quantitative benchmarks on e-commerce and general datasets show dramatic reductions in $n$-gram repetition rates—with the rep-2 metric dropping from 36.2% to 0.75% (NLLB-1.3B)—and simultaneous improvements in SacreBLEU and COMET scores. Online A/B testing further substantiates its practical benefit, with measurable user and business metric improvements (Dai et al., 2024).
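
A per-timestep sketch of this loss is given below, following the formula above. The tensor layout (a precomputed cross-attention matrix, explicit negative-token positions) and the temperature value are illustrative assumptions, not details taken from the paper.

```python
# Sketch of the CTSD token-level contrastive term (Dai et al., 2024) for one step t.
import torch
import torch.nn.functional as F

def ctsd_loss_step(h_t, W, y_t, neg_ids, neg_positions, t, attn, T=10.0):
    """
    h_t:           (hidden,)        decoder state at step t
    W:             (vocab, hidden)  output embedding matrix
    y_t:           int              ground-truth token id at step t
    neg_ids:       (N,) LongTensor  ids of recently generated (negative) tokens
    neg_positions: (N,) LongTensor  their positions t_- (each < t)
    attn:          (seq, src)       cross-attention weight vectors per target position
    T:             float            temperature of the distance decay
    """
    pos_score = h_t @ W[y_t]            # h_t^T W_{y_t}
    neg_scores = W[neg_ids] @ h_t       # h_t^T W_{y_t^-}, shape (N,)

    # alpha_d: distance decay, larger for more recent repeats (t_- closer to t)
    alpha_d = torch.exp((neg_positions.float() - t) / T)

    # alpha_s: cosine similarity of cross-attention weights at positions t and t_-
    alpha_s = F.cosine_similarity(attn[neg_positions], attn[t].unsqueeze(0), dim=-1)

    # log(1 + sum alpha_d * alpha_s * exp(neg - pos))
    return torch.log1p((alpha_d * alpha_s * torch.exp(neg_scores - pos_score)).sum())
```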

6. Comparative Analysis of Selective Methods

Several selective repetition strategies have been formalized, each addressing a domain-specific facet of the phenomenon:

| Method | Selection Principle | Key Domain |
| --- | --- | --- |
| TLDR (Jiang et al., 2020) | Token difficulty (hardness via $p_t$) | NLG, dialogue |
| REP (Yang, 2020) | Token type (identifiers, “Cared” nodes) | Code completion |
| Sink patch (Yona et al., 11 Mar 2025) | Neuron/position-specific attention | Repetition control in LLMs |
| CTSD (Dai et al., 2024) | Attention similarity & recency | NMT, e-commerce |

All of these methods operate at a finer granularity than sequence-level penalties or generic decoding-time blocks: they selectively modulate training or architecture to focus on the tokens, token classes, or neural representations that empirically drive unwanted repetition. This targeted granularity lets them reconcile the suppression of pathological repetition with the preservation of required or naturalistic redundancy.

7. Limitations, Extensions, and Open Problems

Selective token repetition methods are limited by their scope of selection and dependence on static heuristics or architecture-derived signals. TLDR, for example, only modulates target-side tokens and does not yet leverage source token difficulty, though future directions include extension to encoder-side weighting and to decoder-only generative models. In code tasks, selection often depends on robust AST parsing and resolution; misidentification of token classes can lead to erroneous suppression or amplification. Mechanistic interventions such as sink-neuron patching require precise interpretability and may need adjustment as model scaling or architectural changes shift the underlying circuits. Other open areas include the joint application of selective loss shaping and unlikelihood training, hybrid approaches that integrate token selection with harmful-example mining, and the exploration of alternative smooth weighting or contrastive functions beyond cosine or focal loss formulations (Jiang et al., 2020, Yang, 2020, Yona et al., 11 Mar 2025, Dai et al., 2024).

In summary, selective token repetition frameworks advance generation fidelity, robustness, and controllability by aligning model dynamics with empirically derived, context-sensitive patterns of repetition and uniqueness across natural language and code domains.
