
Copy-Enriched Seq2Seq Models

Updated 7 December 2025
  • Copy-Enriched Seq2Seq models are neural architectures that combine standard generation with explicit copying to accurately reproduce source tokens, including rare or OOV words.
  • They employ differentiable attention mechanisms such as pointer-generator networks and span-copy methods to decide between generating novel tokens or copying input segments.
  • These models have demonstrated strong performance in tasks like semantic role labeling, text summarization, code repair, and dialogue, while also addressing challenges like overcopying and context integration.

Copy-enriched sequence-to-sequence (seq2seq) models are neural generative architectures that interleave standard decoding with operations that directly copy segments of the input sequence into the output. Designed to unify abstract generation with strict copying, these models enable faithful reproduction of key source tokens (including rare or OOV words, entities, or structured tokens) and facilitate tasks that require both content fidelity and flexible annotation or transformation. By integrating differentiable mechanisms—either soft (pointer-generator, joint softmax) or hard (span-level copy, explicit decision structures)—these models have become foundational in domains ranging from semantic role labeling and text summarization to code repair and question generation.

1. Core Architectures and Mathematical Foundations

Copy-enriched seq2seq models augment classic encoder–decoder architectures with mechanisms that combine token-level generation and copying. The prototypical design uses a bidirectional recurrent encoder (frequently a BiLSTM or BiGRU) to process the source sequence $X = (x_1, \ldots, x_{T_x})$, producing hidden states $h_j$. A unidirectional decoder (typically LSTM or GRU) generates the target sequence, attending to encoder outputs at each step and choosing between generating from a fixed vocabulary, outputting special label tokens, or copying a source token.

The canonical mathematical treatment is as follows:

  • Attention: Alignment scores $e_{i,j} = s_{i-1}^\top h_j$ are normalized to give attention weights $\alpha_{i,j}$, producing context vectors $c_i = \sum_j \alpha_{i,j} h_j$ (Daza et al., 2018).
  • Generation and Copy Scores: For each next token $y_i$:
    • Generation score for $v \in \mathcal{V}$: $\psi_g(y_i = v) = W_o [s_i; c_i]$
    • Copy score for $x_j \in \mathcal{X}$: $\psi_c(y_i = x_j) = \tanh(h_j^\top W_c s_i)$
  • Mixture Probability: A single softmax yields a distribution over both generation and copy candidates:

$$p(y_i \mid s_i, y_{<i}, X) = p(y_i, \mathrm{gen}) + p(y_i, \mathrm{copy}),$$

where the two terms are normalized by the same partition function $Z$, as in CopyNet (Gu et al., 2016, Daza et al., 2018).
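To make the shared-normalization idea concrete, here is a minimal PyTorch sketch of a single CopyNet-style decoding step. Tensor names, shapes, and the projection layout are illustrative assumptions rather than details of the cited implementations, and true OOV handling (extending the vocabulary axis with per-example source words) is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def copynet_step(s_i, c_i, h, src_ids, W_o, W_c, vocab_size):
    """One decoding step with a CopyNet-style shared softmax (sketch).

    s_i:     decoder state           (batch, d_s)
    c_i:     attention context       (batch, d_c)
    h:       encoder hidden states   (batch, T_x, d_h)
    src_ids: source token ids        (batch, T_x), assumed in-vocabulary here
    W_o:     generation projection   (vocab_size, d_s + d_c)
    W_c:     copy bilinear weights   (d_h, d_s)
    """
    # Generation scores psi_g(y_i = v) for every vocabulary word v.
    gen_scores = torch.cat([s_i, c_i], dim=-1) @ W_o.t()                 # (batch, vocab_size)

    # Copy scores psi_c(y_i = x_j) = tanh(h_j^T W_c s_i) for every source position j.
    copy_scores = torch.tanh((h @ W_c) @ s_i.unsqueeze(-1)).squeeze(-1)  # (batch, T_x)

    # A single softmax over [vocabulary ; source positions]: one shared partition function Z.
    joint = F.softmax(torch.cat([gen_scores, copy_scores], dim=-1), dim=-1)
    p_gen_part, p_copy_part = joint[:, :vocab_size], joint[:, vocab_size:]

    # p(y_i = w) = p(w, gen) + sum over positions j with x_j = w of p(j, copy)
    return p_gen_part.scatter_add(1, src_ids, p_copy_part)
```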

Pointer-generator networks generalize this with a scalar gate $p_{\mathrm{gen}} \in [0,1]$ (usually a function of the decoder state, context, and previous token), resulting in:

$$p(y_t = w) = p_{\mathrm{gen}} P_{\mathrm{vocab}}(w) + (1 - p_{\mathrm{gen}}) \sum_{i: x_i = w} \alpha_{t,i}$$

where $P_{\mathrm{vocab}}$ is the decoder vocabulary distribution (Chen et al., 2018, Zeng et al., 2016).
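For contrast with the shared softmax above, the following is a short sketch of a pointer-generator output head under the same illustrative assumptions. The gate inputs (decoder state, context, previous-token embedding) follow the description above, but the module layout is a hypothetical simplification, not a specific published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointerGeneratorHead(nn.Module):
    """Pointer-generator output layer (sketch): mixes the decoder's vocabulary
    distribution with the attention distribution via a learned scalar gate."""

    def __init__(self, d_state, d_ctx, d_emb, vocab_size):
        super().__init__()
        self.vocab_proj = nn.Linear(d_state + d_ctx, vocab_size)
        # p_gen is a function of decoder state, context vector, and previous token embedding.
        self.gate = nn.Linear(d_state + d_ctx + d_emb, 1)

    def forward(self, s_t, c_t, y_prev_emb, attn, src_ids):
        # P_vocab: ordinary generation distribution over the fixed vocabulary.
        p_vocab = F.softmax(self.vocab_proj(torch.cat([s_t, c_t], dim=-1)), dim=-1)

        # Scalar gate p_gen in [0, 1].
        p_gen = torch.sigmoid(self.gate(torch.cat([s_t, c_t, y_prev_emb], dim=-1)))

        # Copy distribution: attention weights alpha_{t,i}, summed over repeated source tokens.
        p_copy = torch.zeros_like(p_vocab).scatter_add(1, src_ids, attn)

        # p(y_t = w) = p_gen * P_vocab(w) + (1 - p_gen) * sum over i with x_i = w of alpha_{t,i}
        return p_gen * p_vocab + (1.0 - p_gen) * p_copy
```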

2. Mechanistic Variants: Token and Span Copying

Early models like CopyNet (Gu et al., 2016) and S4 (Mathews et al., 2018) restricted copying to single input tokens. However, several architectures have extended this to contiguous spans:

  • SeqCopyNet: Employs a pointer network to select a start and end position, copying spans $x_{\mathrm{start}:\mathrm{end}}$ as atomic output chunks; the switch between generate and copy is learned via an MLP over decoder memory (Zhou et al., 2018).
  • CopyNext: Formalizes decoding as a sequence of explicit hard-alignment actions: "start copy at $i$", "extend span" (via a CopyNext token), and "label" (terminate span). This approach supports fast, explicit modeling of non-overlapping spans and outputs token-level copy alignments (Singh et al., 2020).
  • BioCopy: Supervises which positions start and continue spans using BIO tags, enforcing span-level copying directly by masking the vocabulary distribution at each step (Liu et al., 2021).
  • Span-copy Marginalization: Copy that! (Panthaplackel et al., 2020) introduces an action space with arbitrary span-copy actions and derives an exact marginal likelihood objective over all valid action sequences generating the reference output.

These span-copy models improve faithfulness when large or multi-token entities must be copied (e.g., structured facts, long entity names in information extraction), often yielding lower error rates on long-span copying than token-level approaches (Zhou et al., 2018, Singh et al., 2020, Panthaplackel et al., 2020).
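To illustrate what a span-level action space looks like, the sketch below enumerates, at a given target position, every source span that matches a prefix of the remaining reference, i.e. the set of legal span-copy actions over which an objective such as the marginal likelihood of Copy that! would sum. The function and its prefix-matching rule are a simplified illustration, not the papers' exact decoding logic.

```python
def valid_copy_spans(source, target, t):
    """Return all (start, end) source spans (end exclusive) whose tokens exactly
    match a prefix of target[t:]. These are the legal span-copy actions at
    decoding step t; generating target[t] from the vocabulary is always legal too."""
    spans = []
    for start in range(len(source)):
        end = start
        while (end < len(source)
               and end - start < len(target) - t
               and source[end] == target[t + end - start]):
            end += 1
            spans.append((start, end))  # every prefix of a match is itself a valid span
    return spans

# Example: "open the new file" can be copied token by token or as a single span action.
src = ["open", "the", "new", "file", "now"]
tgt = ["please", "open", "the", "new", "file"]
print(valid_copy_spans(src, tgt, 1))
# [(0, 1), (0, 2), (0, 3), (0, 4)]  i.e. "open", "open the", ..., "open the new file"
```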

3. Supervision and Specialized Losses

The interpolation or gating between generate and copy modes is generally learned implicitly. However, supervised or semi-supervised variants introduce explicit losses to direct the network's switch:

  • Binary Mode Predictors: Models like CoRe supervise a copy-vs-generate gate $\lambda_t$ by matching it to the ground-truth writing mode, augmenting the cross-entropy loss with a binary copy/generate loss (Cao et al., 2016).
  • Force-Copy Losses: Enhanced Copy models enforce forced copying (switching off $p_{\mathrm{gen}}$) when the gold target word is a source token, or only when it is OOV (Choi et al., 2021). The per-step loss decomposes into a sum of $\mathcal{L}_{\mathrm{vocab}}$, $\mathcal{L}_{\mathrm{attn}}$, and supervised mode losses.
  • BIO-Tagging for Spans: BioCopy uses gold span extraction as a training signal for BIO tag prediction, which at inference time strictly constrains output vocabulary masking (Liu et al., 2021).

These strategies address common challenges: under-copying (missing required tokens), over-copying (copying irrelevant spans), and fidelity-abstractness trade-offs. Explicit supervision of the copy control often narrows factuality gaps and improves entity/number precision in data-to-text or factual summarization (Choi et al., 2021, Liu et al., 2021).
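A minimal sketch of such explicit gate supervision, assuming a pointer-generator gate $p_{\mathrm{gen}}$ predicted per target step: the gate is pushed toward copying whenever the gold token occurs in the source (or, in the OOV-only variant, only when it maps to the unknown token). Names and the exact loss form are illustrative rather than the formulations of the cited papers.

```python
import torch
import torch.nn.functional as F

def copy_gate_loss(p_gen, gold_ids, src_ids, unk_id, oov_only=False):
    """Binary supervision for the copy/generate gate (sketch).

    p_gen:    predicted generation gate per step  (batch, T_y), values in (0, 1)
    gold_ids: gold target token ids               (batch, T_y)
    src_ids:  source token ids                    (batch, T_x)
    """
    # 1 where the gold token also appears somewhere in the source sequence.
    in_source = (gold_ids.unsqueeze(-1) == src_ids.unsqueeze(1)).any(dim=-1).float()
    if oov_only:
        # Only force copying when the gold token is OOV (mapped to <unk>).
        in_source = in_source * (gold_ids == unk_id).float()
    # Target gate: 0 (copy) when the token should be copied, 1 (generate) otherwise.
    gate_target = 1.0 - in_source
    return F.binary_cross_entropy(p_gen, gate_target)
```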

4. Applications, Empirical Results, and Scope

Copy-enriched seq2seq models are the empirical backbone of tasks where OOV, rare, or structured inputs must be faithfully reproduced in generation:

  • Semantic Role Labeling: The copy-augmented decoder in (Daza et al., 2018) achieves near-perfect input+label regeneration (~97% exact-length reproduction, 99.9% balanced brackets) and competitive argument F1.
  • Program Repair: SequenceR demonstrates that pointer-generator style copying is essential for learning bug-fixing edits in code, consistently outperforming pure seq2seq by 4–5× in perfect prediction rate (Chen et al., 2018).
  • Nested NER and IE: Span-level models such as CopyNext approach state-of-the-art F1 on nested NER, doubling decoding speed compared to graph-based models while maintaining high precision/recall (Singh et al., 2020).
  • Data-to-Text Generation: Character-level copy models reach parity with or surpass strong word-based systems on open-vocabulary benchmarks (e.g., E2E+), obtaining BLEU improvements of +25 points from the copy extension (Roberti et al., 2019).
  • Dialogue, Summarization, Paraphrase: Pointer-generator and joint copy/generate decoders boost entity precision, response accuracy, and ROUGE/BLEU across DSTC2, CNN/DailyMail, and simplification corpora (Eric et al., 2017, Zeng et al., 2016, Mathews et al., 2018, Cao et al., 2016).

Performance improvements are most pronounced on (a) entity-rich, variable, or OOV-heavy data, (b) low-resource tasks (few-shot lexical learning), and (c) tasks with strong faithfulness constraints.

5. Extensions: Lexicon Learning, Structural Constraints, and Pretraining

Advancements in copy-enriched models include integration of soft lexicon tables and adaptation to few-shot settings:

  • Learned Lexicons: Models that generalize copying to a parameterized lexicon $L$ enable token-level translation (not just copying), which is critical for systematic generalization and low-data scenarios (Akyürek et al., 2021).
  • Structured Decoding: Limitations in argument structure fidelity motivate the addition of global sequence constraints, such as CRF layers or integer linear programming overlays, to enforce legality of copied and generated token sequences (Daza et al., 2018).
  • Pretraining for Copying: Self-supervised methods like MAPGN pretrain pointer-generator models with span masking, teaching them to handle the copy-vs-generate decision robustly by reconstructing masked spans; this strongly benefits spoken-language normalization in low-resource domains (Ihori et al., 2021).
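A rough sketch of how masked-span pretraining pairs for such a model could be constructed, assuming a simple random-span masking scheme; the sampling details are illustrative and not MAPGN's exact recipe.

```python
import random

MASK = "<mask>"

def masked_span_pair(tokens, max_span=5, mask_ratio=0.15, rng=random):
    """Build one (masked input, original target) pair for masked-span
    pretraining of a pointer-generator (sketch; span sampling is illustrative)."""
    n_to_mask = max(1, int(len(tokens) * mask_ratio))
    masked = list(tokens)
    masked_positions = set()
    while len(masked_positions) < n_to_mask:
        start = rng.randrange(len(tokens))
        length = rng.randint(1, max_span)
        for i in range(start, min(start + length, len(tokens))):
            masked[i] = MASK
            masked_positions.add(i)
    # The model must copy the unmasked tokens and generate the masked ones,
    # which supervises the copy-vs-generate decision without labels.
    return masked, list(tokens)
```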

The building blocks of the copy mechanism—attention distributions reused as pointer distributions, mixture softmax, and structured switching—are now applied across a broad class of language and code generation problems.

6. Limitations, Open Problems, and Future Directions

While copy-enriched seq2seq models remain state-of-the-art for content-fidelity-intensive generation, several weaknesses persist:

  • Overcopying and Under-abstractness: Excessive reliance on copying can suppress generation of novel, more abstract, or paraphrased content (Choi et al., 2021).
  • Span Limitations: Even advanced span-copy models typically require exact surface-form alignment between source and output; fuzzy, paraphrased, or noncontiguous copying still eludes most methods (Liu et al., 2021, Zhou et al., 2018).
  • Supervision Cost: BioCopy and supervised copy losses depend on gold span tagging or oracle copy/generate splits, which may not be available in all settings (Liu et al., 2021, Choi et al., 2021).
  • Context Integration: Copy mechanisms act over the immediate source; for cross-document, cross-knowledge-base, or multi-source tasks, scalable pointer or retrieval augmentation is still an open research area (Ji et al., 2020).
  • Structural and Global Constraints: Lack of explicit structural decoding leads to duplicate span labeling, missed or excess arguments, or unbalanced bracketings (Daza et al., 2018). Incorporation of global sequence constraints remains an active line of work.

Prospective research directions include adaptive copy/generation tradeoff controllers, richer structured prediction overlays, integration with large pretrained models (T5, BART), and unsupervised or semi-supervised lexicon induction for enhanced generalization beyond exact copying.


References

  • “A Sequence-to-Sequence Model for Semantic Role Labeling” (Daza et al., 2018)
  • “SequenceR: Sequence-to-Sequence Learning for End-to-End Program Repair” (Chen et al., 2018)
  • “CopyNext: Explicit Span Copying and Alignment in Sequence to Sequence Models” (Singh et al., 2020)
  • “Efficient Summarization with Read-Again and Copy Mechanism” (Zeng et al., 2016)
  • “A Copy Mechanism for Handling Knowledge Base Elements in SPARQL Neural Machine Translation” (Hirigoyen et al., 2022)
  • “Joint Copying and Restricted Generation for Paraphrase” (Cao et al., 2016)
  • “Shakespearizing Modern Language Using Copy-Enriched Sequence-to-Sequence Models” (Jhamtani et al., 2017)
  • “Sequence-to-Sequence Learning for Indonesian Automatic Question Generator” (Muis et al., 2020)
  • “Sequential Copying Networks” (Zhou et al., 2018)
  • “Cross Copy Network for Dialogue Generation” (Ji et al., 2020)
  • “Copy that! Editing Sequences by Copying Spans” (Panthaplackel et al., 2020)
  • “Incorporating Copying Mechanism in Sequence-to-Sequence Learning” (Gu et al., 2016)
  • “Copy mechanism and tailored training for character-based data-to-text generation” (Roberti et al., 2019)
  • “A Copy-Augmented Sequence-to-Sequence Architecture Gives Good Performance on Task-Oriented Dialogue” (Eric et al., 2017)
  • “Lexicon Learning for Few-Shot Neural Sequence Modeling” (Akyürek et al., 2021)
  • “MAPGN: MAsked Pointer-Generator Network for sequence-to-sequence pre-training” (Ihori et al., 2021)
  • “May the Force Be with Your Copy Mechanism: Enhanced Supervised-Copy Method for Natural Language Generation” (Choi et al., 2021)
  • “BioCopy: A Plug-And-Play Span Copy Mechanism in Seq2Seq Models” (Liu et al., 2021)
  • “Simplifying Sentences with Sequence to Sequence Models” (Mathews et al., 2018)