Cloze Task for Training in NLP

Updated 10 October 2025
  • The cloze task is a learning paradigm in which selected spans in a text are masked and the model is trained to recover them from context.
  • Cloze-based models leverage multi-perspective aggregation methods, such as attention and dilated convolutions, to capture both local and long-range dependencies.
  • Empirical studies show cloze-driven models improve accuracy on benchmarks like CLOTH and mathematical reasoning tasks by using tailored sampling and decoding strategies.

A cloze task for training is a supervised or self-supervised learning paradigm wherein one or more spans (typically words, entities, or equations) in a text sequence are masked or removed, and the model is trained to recover these masked elements based on their context. The cloze framework is utilized across diverse domains—ranging from reading comprehension and question answering to mathematical reasoning and multimodal learning—both as a diagnostic/proxy objective and as a means to inject explicit context aggregation biases into representation learning architectures.
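
For illustration, a single cloze training instance can be represented as a context with a masked span, the gold answer, and (optionally) a candidate set; the text and candidate words below are invented placeholders.

```python
# One hypothetical cloze example: the masked span must be recovered from context.
example = {
    "context": "Plants convert sunlight into chemical energy through [MASK].",
    "answer": "photosynthesis",
    "candidates": ["photosynthesis", "respiration", "osmosis", "digestion"],
}
```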

1. Architectural Foundations and Modeling

Cloze-based training architectures typically follow a multi-stage process:

  • Input Encoding: The input text (or, in multimodal settings, other structured modalities such as video or handwriting) is tokenized and embedded, often using pretrained word vectors or contextual encoders (e.g., BiGRU, transformer blocks, or CNNs) (Wang et al., 2018, Baevski et al., 2019, Luo et al., 2020, Zhang et al., 2022).
  • Aggregation/Contextualization Modules: These modules aggregate contextual information to enable effective prediction of the masked spans. Approaches include bidirectional recurrence (GRU/LSTM), dilated convolutions for multi-scale context, attention modules for capturing long-range dependencies, and explicit n-gram statistics for local collocation modeling (Wang et al., 2018).
  • Candidate Representation and Selection: Candidate spans (words, phrases, entities, equations) are encoded with the same or distinct aggregation modules, often enabling joint reasoning between context and candidate (Wang et al., 2018, Gatta et al., 2021, Pham et al., 4 Jun 2025).
  • Decoding and Inference: Final prediction is accomplished via pointer networks, softmax scoring, or generative decoders (autoregressive or infilling)—with optional integration of gating/refinement mechanisms or ensemble strategies to combine multiple perspectives (Wang et al., 2018, Schick et al., 2020, Li et al., 13 Feb 2024).
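
As a concrete, if heavily simplified, illustration of these stages, the sketch below wires an embedding layer, a bidirectional GRU contextualizer, and a vocabulary-level scorer into one masked-prediction training step. It assumes a PyTorch environment; the layer sizes, masking rate, and the toy random batch are placeholders rather than the configuration of any cited system.

```python
import torch
import torch.nn as nn

class MinimalClozeModel(nn.Module):
    """Toy cloze model: embed tokens, contextualize with a BiGRU,
    and score every vocabulary item at each position."""

    def __init__(self, vocab_size: int, emb_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Bidirectional recurrence aggregates left and right context.
        self.encoder = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.scorer = nn.Linear(2 * hidden, vocab_size)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(token_ids)        # (batch, seq, emb_dim)
        ctx, _ = self.encoder(x)         # (batch, seq, 2 * hidden)
        return self.scorer(ctx)          # (batch, seq, vocab_size)

# One training step: only masked positions contribute to the loss.
vocab_size, mask_id = 10_000, 0
model = MinimalClozeModel(vocab_size)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

tokens = torch.randint(1, vocab_size, (4, 32))    # stand-in corpus batch
mask = torch.rand(tokens.shape) < 0.15            # mask roughly 15% of positions
inputs = tokens.masked_fill(mask, mask_id)        # replace with the [MASK] id
labels = tokens.masked_fill(~mask, -100)          # ignore unmasked positions in the loss

logits = model(inputs)
loss = nn.functional.cross_entropy(
    logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100
)
loss.backward()
optimizer.step()
```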

Notably, some modern cloze frameworks train bidirectional architectures (as in masked language model pre-training), whereas others (such as ClozeMath) focus on text infilling over structured elements like equations to better align with domain-specific reasoning (Baevski et al., 2019, Pham et al., 4 Jun 2025).
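
The two formulations differ mainly in how training examples are constructed. The sketch below contrasts a masked-LM style example (predict tokens in place) with an infilling style example (regenerate a removed contiguous span such as an equation). The `<extra_id_0>` sentinel convention is borrowed from T5-style infilling and is an assumption here, not the exact format used by ClozeMath.

```python
import random

def make_masked_lm_example(tokens, mask_token="[MASK]", rate=0.15, seed=0):
    """Masked-LM style: replace a random subset of tokens, predict them in place."""
    rng = random.Random(seed)
    inputs, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < rate:
            inputs.append(mask_token)
            targets[i] = tok          # position -> gold token
        else:
            inputs.append(tok)
    return inputs, targets

def make_infilling_example(tokens, span, sentinel="<extra_id_0>"):
    """Infilling style: remove a contiguous span (e.g. an equation) and ask the
    model to generate it conditioned on the surrounding text."""
    start, end = span
    inputs = tokens[:start] + [sentinel] + tokens[end:]
    target = [sentinel] + tokens[start:end]
    return inputs, target

text = "the area is computed as A = pi * r ** 2 for a circle".split()
print(make_masked_lm_example(text))
print(make_infilling_example(text, span=(5, 12)))   # remove the equation span
```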

2. Multi-Perspective Aggregation and Context Utilization

A central tenet in advanced cloze models is the explicit aggregation of context from multiple perspectives or “experts”:

| Module Type | Description | Strength |
|---|---|---|
| Selective Copying | Use the local contextual hidden state at the blank | High local fidelity |
| Attentive Reader | Attend over all tokens jointly with the candidate | Long-range reasoning |
| Dilated Convolution | Capture global/multi-scale semantics | Structure invariance |
| N-gram Statistics | Integrate lexicographical co-occurrence statistics | Collocation modeling |

For example, MPNet forms a context representation by concatenating local (selective copying) and global (dilated convolution) features, while candidate representations are enriched with attention-based and n-gram-derived statistics (Wang et al., 2018). Such modularity yields robust representations that can capture both local syntactic and broader semantic dependencies, a property critically validated on datasets requiring both word-level and discourse-level inference (e.g., CLOTH, SQuAD, multi-party dialogue) (Wang et al., 2018, Li et al., 2019).
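
A minimal sketch of such fusion, loosely following the description above: the encoder state at the blank serves as the selective-copying (local) feature, stacked dilated convolutions provide a multi-scale global feature, and the two are concatenated and projected. The layer sizes, mean pooling, and linear fusion are illustrative choices, not MPNet's exact architecture.

```python
import torch
import torch.nn as nn

class MultiPerspectiveContext(nn.Module):
    """Fuse a local feature (encoder state at the blank) with a global feature
    obtained from stacked dilated convolutions over the whole sequence."""

    def __init__(self, hidden: int = 256):
        super().__init__()
        # Increasing dilation widens the receptive field without extra parameters.
        self.dilated = nn.Sequential(
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1, dilation=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=4, dilation=4), nn.ReLU(),
        )
        self.fuse = nn.Linear(2 * hidden, hidden)

    def forward(self, enc: torch.Tensor, blank_pos: torch.Tensor) -> torch.Tensor:
        # enc: (batch, seq, hidden); blank_pos: (batch,) index of the blank token.
        local = enc[torch.arange(enc.size(0)), blank_pos]             # selective copying
        global_feat = self.dilated(enc.transpose(1, 2)).mean(dim=2)   # multi-scale pooling
        return self.fuse(torch.cat([local, global_feat], dim=-1))

ctx = MultiPerspectiveContext()
enc = torch.randn(4, 32, 256)           # e.g. encoder outputs projected to 256 dims
blank = torch.tensor([5, 10, 3, 17])    # blank position in each sequence
print(ctx(enc, blank).shape)            # torch.Size([4, 256])
```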

3. Semi- and Unsupervised Data Construction and Sampling

Given the cost of manual annotation, cloze frameworks often leverage unlabeled or weakly labeled corpora:

  • Efficient Distribution-Matching Sampling: To preserve candidate answer distribution fidelity, positive samples are drawn from unlabeled corpora to match empirical distributions observed in gold datasets. Sampling probabilities $p(w_i)$ are adjusted based on candidate frequencies in labeled versus unlabeled pools, using normalization constraints and scaling coefficients for balance (Wang et al., 2018).
  • Negative Candidate Sampling: Negatives are selected from the candidate vocabulary using weighted mixtures of uniform and empirical co-occurrence statistics, introducing controlled variance and coverage in auxiliary examples (Wang et al., 2018).
  • Unsupervised QA Generation: Unsupervised pipelines generate (context, answer, cloze-question) triples by random noun phrase/NE selection, syntactic masking, and, optionally, NMT-based translation from cloze form to natural questions (Lewis et al., 2019). The generative process is formalized as $p(q, a, c) = p(c) \cdot p(a \mid c) \cdot p(q \mid a, c)$.
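
The distribution-matching item above admits a simple sketch: gold answer frequencies define target weights, which are renormalized over the words actually available in the unlabeled pool before drawing pseudo-examples. The `alpha` scaling coefficient and the renormalization step below are illustrative stand-ins for the normalization constraints and scaling coefficients described in (Wang et al., 2018), not their exact formulation.

```python
import random
from collections import Counter

def distribution_matched_sample(labeled_answers, unlabeled_occurrences, k, alpha=1.0, seed=0):
    """Draw k pseudo cloze answers from an unlabeled pool so that the sampled
    word distribution roughly matches the gold answer distribution.

    labeled_answers: answer words observed in the labeled data.
    unlabeled_occurrences: word -> list of occurrences found in the unlabeled
        corpus (whatever the pipeline stores, e.g. (sentence, position) pairs).
    """
    gold = Counter(labeled_answers)
    total = sum(gold.values())
    # Target weight per word, scaled by alpha and restricted to words that
    # actually occur in the unlabeled pool.
    weights = {w: (gold[w] / total) ** alpha
               for w in unlabeled_occurrences if w in gold}
    z = sum(weights.values()) or 1.0
    words = list(weights)
    probs = [weights[w] / z for w in words]
    rng = random.Random(seed)
    sampled_words = rng.choices(words, weights=probs, k=k)
    return [(w, rng.choice(unlabeled_occurrences[w])) for w in sampled_words]

# Toy usage with made-up labeled answers and unlabeled occurrences.
labeled = ["although", "because", "which", "because"]
unlabeled = {"because": [("ctx1", 3), ("ctx2", 7)], "which": [("ctx3", 1)]}
print(distribution_matched_sample(labeled, unlabeled, k=3))
```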

Empirical findings indicate that cloze-driven augmentation can bridge substantial performance gaps in low-resource contexts, and—in some QA setups—surpass early supervised approaches by leveraging the structure of large unlabeled sources (Lewis et al., 2019).

4. Training Objectives, Loss Functions, and Decoding Strategies

Cloze training leverages various objective structures:

  • Mask Prediction and Scoring: Loss functions typically sum the negative log-probabilities of the masked spans conditioned on context:

$$L = -\sum_{i \in M} \log P(x_i \mid \text{context}),$$

where $M$ indexes the masked positions (Pham et al., 4 Jun 2025).

  • Gated Refinement: Candidate representations may be refined via gating functions before pointer selection and softmax scoring (Wang et al., 2018).
  • Auxiliary Losses: For few-shot and semi-supervised settings, auxiliary language modeling objectives may prevent catastrophic forgetting or regularize the learning process (Schick et al., 2020).
  • Decoding Algorithms: Structured output is generated via greedy, beam search, or chain-of-thought (CoT) decoding. Chain-of-thought decoding is shown to enhance multi-step mathematical reasoning in models trained on equation-cloze tasks (Pham et al., 4 Jun 2025).
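
On the decoding side, the following model-agnostic sketch implements beam search over an arbitrary step function; `step_logprobs` is a hypothetical callable standing in for the trained cloze or infilling model's next-token distribution, and greedy decoding corresponds to `beam_size=1`. Chain-of-thought decoding is not shown here.

```python
import math
from typing import Callable, List, Sequence, Tuple

def beam_search(
    step_logprobs: Callable[[Sequence[int]], List[Tuple[int, float]]],
    eos_id: int,
    beam_size: int = 4,
    max_len: int = 32,
) -> List[int]:
    """Generic beam search: step_logprobs maps a prefix of token ids to
    (next_token_id, log_prob) candidates for the next position."""
    beams = [([], 0.0)]                      # (prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, lp in step_logprobs(prefix):
                candidates.append((prefix + [tok], score + lp))
        # Keep the top-k partial hypotheses; set completed ones aside.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            (finished if prefix[-1] == eos_id else beams).append((prefix, score))
        if not beams:
            break
    finished.extend(beams)                   # include any unfinished hypotheses
    return max(finished, key=lambda c: c[1])[0]

# Toy step function: prefer emitting token 7 twice, then end-of-sequence (id 0).
def toy_step(prefix):
    if len(prefix) < 2:
        return [(7, math.log(0.9)), (0, math.log(0.1))]
    return [(0, 0.0)]

print(beam_search(toy_step, eos_id=0, beam_size=2))   # -> [7, 7, 0]
```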

Careful ablation studies demonstrate that both the formulation of masking/in-filling spans and the sophistication of decoding routines directly impact downstream inference accuracy and robustness (Pham et al., 4 Jun 2025).

5. Empirical Results and Benchmarking

Performance evaluations consistently demonstrate advantages for multi-perspective and cloze-driven frameworks:

  • On the CLOTH cloze test, semi-supervised MPNet achieved 60.9% accuracy (vs. 48–50% for strong baselines), exceeding prior state-of-the-art using only label-efficient augmentation methods (Wang et al., 2018).
  • Extended with external corpora and ensembling, cloze-augmented models reached 74.9% accuracy—still trailing human-level upper bounds (86%) but marking substantial gains.
  • In mathematical reasoning benchmarks such as GSM8K and MATH, ClozeMath training resulted in higher accuracy and greater solution robustness than token-level masking approaches, particularly when paired with reasoning-enhancing decoding (Pham et al., 4 Jun 2025).
  • Similar gains are seen in few-shot classification and NER, where cloze-formatted input patterns combined with pattern-verbalizer pairs enable models to reliably generalize from very limited samples (Schick et al., 2020, Gatta et al., 2021).

6. Challenges, Interpretability, and Alignment

Despite robust performance on standardized tasks, several limitations are observed:

  • Calibration and Alignment: Large-scale evaluations reveal that pretrained LMs systematically under-estimate human-cloze response probabilities, over-rank rare completions, and misalign both lexical and semantic clusters when compared to human-generated cloze responses (Jacobs et al., 15 Oct 2024). The semantic space of model predictions diverges sharply from that of human language, raising cautions for psycholinguistic modeling.
  • Error Analysis: Performance degrades on tasks requiring dialogue understanding, informal reasoning, or cross-utterance entity tracking—attributable to limitations in coreference resolution or the handling of idiomatic shifts (Li et al., 2019).
  • Data Distribution Sensitivity: Successful data augmentation hinges on exact matching of candidate distributions and judicious regularization; naively constructed pseudo-examples can degrade performance or introduce sampling bias (Wang et al., 2018).
  • Interpretable Learning: Ablation studies and modular aggregation schemes highlight the need for explainable context contributions. Inclusion of explicit n-gram statistics or global pooling improves generalization when important cues are dispersed or appear as lexical collocations.

7. Extensions and Future Directions

The cloze task for training continues to shape research directions across NLP and multimodal learning:

  • Generalization Beyond NLP: Equation- and entity-level cloze objectives (as in ClozeMath or prompt-based few-shot QA) may serve as templates for domains requiring multi-step, context-sensitive inference (Pham et al., 4 Jun 2025, Chen et al., 2023).
  • Heuristic-free Extraction: Sequence-tagging cloze answer extraction enables adaptive TAPT augmentation across arbitrary MRC tasks without manual heuristics, further closing the performance gap in low-resource or zero-shot conditions (Lovenia et al., 2022).
  • Multimodal and Multilingual Settings: Innovations such as multimodal cloze tasks (e.g., for Chinese handwriting, video) transfer the self-supervised context-prediction paradigm into vision and speech, supporting robust and fine-grained error analysis and correction (Zhang et al., 2022, Luo et al., 2020).

A plausible implication is that further integration of cloze-inspired objectives—combined with task-specific knowledge and careful data engineering—will remain foundational for advances in efficient, interpretable, and generalizable AI reasoning systems.
