Open-Ended Text Generation
- Open-ended text generation is the process of producing fluent and coherent continuations from arbitrary prompts, emphasizing creativity and diversity.
- Techniques such as contrastive decoding, dynamic focus decoding, and hybrid methods balance coherence, diversity, and factuality through adaptive sampling and penalty mechanisms.
- Evaluation strategies employ metrics like MAUVE, distinct-n, and human assessments to address trade-offs between fluency, topical relevance, and controlled creativity.
Open-ended text generation is the task of producing fluent and coherent continuations from arbitrary prompts without fixed target forms or constrained output length. Unlike structured generation tasks such as translation or summarization, open-ended generation prioritizes flexibility, diversity, and creativity, while facing unique challenges in coherence, topical relevance, and evaluation. Large-scale LMs—notably autoregressive Transformers—have underpinned the recent progress, but the space of algorithms and evaluation criteria is defined by complex trade-offs between diversity, coherence, factuality, and computational efficiency.
1. Foundations and Task Definition
Open-ended text generation tasks present unbounded continuation spaces: the model is prompted with a fragment and produces candidate continuations satisfying properties such as fluency, context relevance, and stylistic appropriateness (Becker et al., 24 May 2024). Three primary sub-tasks dominate current research:
- Open-domain continuation: Unconstrained completion given general-domain prompts (e.g., Wikipedia, Reddit).
- Story generation: Requires long-range narrative structure, event causality, and character progression.
- Dialogue continuation: Next-turn or multi-turn prediction with explicit or latent persona, turn-taking, and pragmatic context.
The formal modeling objective is generally the autoregressive factorization

$$p_\theta(x_{1:T} \mid c) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}, c),$$

where $c$ is the prompt and $x_1, \dots, x_T$ are the tokens in the output sequence.
Distinct from conditional generation (summarization, data-to-text), open-ended outputs do not admit tight reference sets, necessitating evaluation criteria centered on output distribution diversity, sample coherence, and human-likeness rather than overlap with references.
2. Canonical Decoding Paradigms and Their Pathologies
Autoregressive LMs generate text via decoding algorithms that map next-token probability distributions to concrete output sequences. The principal paradigms exhibit sharp trade-offs:
- Maximum-likelihood decoding (greedy, beam search):
- Always selects the highest-probability next token.
- Prone to producing short, highly repetitive, and generic text (Li et al., 2022, Wang et al., 2020).
- Suffers from label bias, i.e., states with low next-token entropy are overly favored, leading to degeneration loops and a lack of diversity (Wang et al., 2020).
- Sampling-based methods (top-$k$, nucleus/top-$p$, temperature):
- Inject controlled randomness to promote output variety.
- Parameters ($k$, $p$, temperature $\tau$) modulate lexical diversity and the risk of incoherent or off-topic continuations (Becker et al., 24 May 2024, Arias et al., 8 Oct 2024).
- Too much diversity leads to topic drift; too little collapses back toward the repetitive outputs of maximum-likelihood decoding (a minimal sketch contrasting these paradigms follows this list).
- Hybrid and contrastive approaches: Recent work confronts these limitations through optimization-based or structure-augmented decoding that explicitly balances the objectives of coherence and diversity (Li et al., 2022, Luo et al., 11 Mar 2025, Xu et al., 2023, Lan et al., 2022, Zhu et al., 2023, Ding et al., 28 Aug 2025). Each method differs in its formal trade-off mechanism and computational requirements.
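To make the contrast concrete, the following minimal sketch (NumPy only, over a toy next-token distribution not tied to any particular model) implements greedy selection, top-$k$ sampling, and nucleus (top-$p$) sampling side by side; the distribution and parameter values are invented for illustration.

```python
import numpy as np

def greedy(probs):
    """Maximum-likelihood choice: always the single most probable token."""
    return int(np.argmax(probs))

def top_k_sample(probs, k=5, rng=np.random.default_rng(0)):
    """Sample from the k most probable tokens, renormalized."""
    top = np.argsort(probs)[::-1][:k]
    p = probs[top] / probs[top].sum()
    return int(rng.choice(top, p=p))

def nucleus_sample(probs, p=0.9, rng=np.random.default_rng(0)):
    """Sample from the smallest high-probability prefix whose mass reaches p."""
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, p)) + 1   # always keep at least one token
    keep = order[:cutoff]
    q = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=q))

# Toy next-token distribution over a 10-token vocabulary (illustrative only).
probs = np.array([0.30, 0.20, 0.15, 0.10, 0.08, 0.06, 0.05, 0.03, 0.02, 0.01])
print(greedy(probs), top_k_sample(probs), nucleus_sample(probs))
```

Greedy always returns the same token, whereas the two sampling rules trade determinism for variety, which is exactly the coherence-diversity tension the hybrid methods above try to manage.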
3. Modern Decoding Innovations
The last few years have witnessed a proliferation of decoding algorithms tailored for open-ended text. Key approaches include:
3.1. Contrastive Decoding (CD) (Li et al., 2022)
CD employs two autoregressive LMs, a large expert and a smaller amateur, sharing the same tokenization and vocabulary. For a candidate token $x_i$, the contrastive score is

$$s_{\mathrm{CD}}(x_i) = \log p_{\mathrm{EXP}}(x_i \mid x_{<i}) - \log p_{\mathrm{AMA}}(x_i \mid x_{<i}),$$

with a plausibility constraint restricting candidate tokens to those for which $p_{\mathrm{EXP}}(x_i \mid x_{<i}) \ge \alpha \max_{w} p_{\mathrm{EXP}}(w \mid x_{<i})$ (typically $\alpha = 0.1$). Decoding proceeds by maximizing the sum of these scores over the sequence, either greedily or with a beam of moderate size.
- Empirical findings: CD outperforms nucleus and top-$k$ sampling in both coherence and MAUVE score while maintaining competitive diversity. Human raters significantly preferred CD for fluency and coherence (2.6× vs. nucleus; 6.4× vs. typical decoding). A toy sketch of the contrastive score appears below.
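The sketch below applies the expert-amateur contrastive score and the plausibility constraint to toy next-token distributions; the distributions and the $\alpha$ value are assumptions chosen for illustration, not results from the paper.

```python
import numpy as np

def contrastive_decoding_step(p_expert, p_amateur, alpha=0.1):
    """One CD-style step on toy next-token distributions.

    Keeps tokens passing the plausibility constraint
        p_expert(x) >= alpha * max_w p_expert(w)
    and, among them, maximizes log p_expert(x) - log p_amateur(x).
    """
    plausible = p_expert >= alpha * p_expert.max()
    scores = np.where(plausible,
                      np.log(p_expert + 1e-12) - np.log(p_amateur + 1e-12),
                      -np.inf)
    return int(np.argmax(scores))

# Toy distributions: the amateur over-weights a generic token (index 0),
# so the contrastive score steers decoding toward a token the expert
# uniquely prefers (index 1).
p_expert  = np.array([0.40, 0.35, 0.15, 0.05, 0.05])
p_amateur = np.array([0.70, 0.10, 0.10, 0.05, 0.05])
print(contrastive_decoding_step(p_expert, p_amateur))  # -> 1
```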
3.2. Contrastive Search (CS) (Su et al., 2022)
CS applies a one-model approach, penalizing candidate tokens within the decoding beam that are "too similar" (via cosine similarity of hidden states) to prior generations:

$$x_t = \arg\max_{v \in V^{(k)}} \Big\{ (1-\alpha)\, p_\theta(v \mid x_{<t}) \;-\; \alpha \max_{1 \le j \le t-1} \mathrm{sim}\big(h_v, h_{x_j}\big) \Big\},$$

where $V^{(k)}$ is the set of the $k$ most probable next tokens and $h$ denotes hidden-state representations. Balanced choices of $\alpha$ and $k$ yield superior diversity and coherence compared to CD, and strong preference in human evaluation (Su et al., 2022).
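The following sketch runs one CS step on toy probabilities and hidden states; the vectors, $\alpha$, and $k$ are invented for the example, while the score follows the formula above.

```python
import numpy as np

def contrastive_search_step(probs, cand_hiddens, prev_hiddens, alpha=0.6, k=5):
    """One CS-style step: among the top-k candidates, trade model confidence
    against a degeneration penalty (max cosine similarity to previously
    generated hidden states). In practice the hidden states come from the LM;
    here they are toy vectors."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    top = np.argsort(probs)[::-1][:k]
    best, best_score = None, -np.inf
    for v in top:
        penalty = max(cos(cand_hiddens[v], h) for h in prev_hiddens)
        score = (1 - alpha) * probs[v] - alpha * penalty
        if score > best_score:
            best, best_score = int(v), score
    return best

# Toy example: token 0 is most probable, but its representation nearly repeats
# an earlier hidden state, so CS is steered toward a less repetitive candidate.
rng = np.random.default_rng(0)
probs = np.array([0.5, 0.3, 0.1, 0.05, 0.05])
prev = [rng.standard_normal(8)]
cand = np.stack([prev[0] + 0.01 * rng.standard_normal(8),  # near-duplicate
                 rng.standard_normal(8),
                 rng.standard_normal(8),
                 rng.standard_normal(8),
                 rng.standard_normal(8)])
print(contrastive_search_step(probs, cand, prev))
```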
3.3. Dynamic Focus Decoding (DFD) (Luo et al., 11 Mar 2025)
DFD adaptively estimates token-level "knowledge reliance" at each decoding step by measuring the divergence between the next-token distributions induced by deep (fact-intensive) and shallow (syntactic/stylistic) Transformer layers, where $p^{(l)}(\cdot \mid x_{<t})$ denotes the distribution obtained by applying the LM head to the hidden states of layer $l$. This divergence signal determines an adaptive temperature for the sampling step: a low temperature enforces determinism for knowledge-intensive steps, while a high temperature permits diversity for creative steps. DFD is computationally lightweight, requiring only additional LM-head evaluations at the probed layers.
- Empirical findings: DFD consistently raises factual accuracy (e.g., +3–4 points on TruthfulQA) and diversity (e.g., Distinct-2/4) across diverse tasks with only 4–7% added compute (Luo et al., 11 Mar 2025). An illustrative sketch follows.
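A rough illustration of the idea rather than the paper's exact formulation: the sketch below maps a deep-vs-shallow KL divergence to a temperature in an assumed range [t_min, t_max] and samples with it; the squashing function and temperature bounds are assumptions.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

def dfd_style_step(deep_logits, shallow_logits,
                   t_min=0.3, t_max=1.2, rng=np.random.default_rng(0)):
    """Illustrative DFD-style step: a large deep-vs-shallow divergence is read
    as a knowledge-intensive step and mapped to a low sampling temperature; a
    small divergence permits a higher temperature."""
    p_deep = softmax(deep_logits)
    p_shallow = softmax(shallow_logits)
    kl = float(np.sum(p_deep * (np.log(p_deep + 1e-12) - np.log(p_shallow + 1e-12))))
    reliance = 1.0 - np.exp(-kl)                  # squash divergence into [0, 1)
    temperature = t_max - (t_max - t_min) * reliance
    p = softmax(deep_logits, temperature)
    return int(rng.choice(len(p), p=p)), temperature

# Toy logits: the deep layer sharply prefers one token (a "factual" step),
# so the effective temperature is pushed toward t_min.
deep = np.array([4.0, 0.5, 0.2, 0.1])
shallow = np.array([1.0, 0.9, 0.8, 0.7])
print(dfd_style_step(deep, shallow))
```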
3.4. Look-back Decoding (Xu et al., 2023)
This method tracks the KL divergence between the current and historical next-token distributions, triggering a "repetition alarm" when the minimum divergence to any previous step falls below a threshold. Upon alarm, the top candidates are randomly sampled or re-weighted by prefix coherence to eliminate repetition and topic drift; a simplified sketch follows the findings below.
- Findings: Delivers closest-to-human MAUVE and highest observed coherence; outperforms nucleus and typical sampling in both human and automatic evaluation.
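The sketch below captures the core control flow under simplifying assumptions (a single min-KL test, plain top-$k$ resampling on alarm); the threshold and candidate count are illustrative, not the paper's settings.

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))

def lookback_style_step(probs, history, threshold=0.5, k=5,
                        rng=np.random.default_rng(0)):
    """Illustrative look-back-style step: if the current distribution is too
    close (min KL) to any historical distribution, a repetition alarm fires
    and we sample among the top-k candidates instead of taking the greedy
    token."""
    alarm = history and min(kl(probs, h) for h in history) < threshold
    if alarm:
        top = np.argsort(probs)[::-1][:k]
        q = probs[top] / probs[top].sum()
        return int(rng.choice(top, p=q)), True
    return int(np.argmax(probs)), False

dist = np.array([0.6, 0.2, 0.1, 0.05, 0.05])
print(lookback_style_step(dist, history=[dist.copy()]))   # alarm fires: KL = 0
print(lookback_style_step(dist, history=[]))              # no history: greedy
```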
3.5. Penalty Decoding (Zhu et al., 2023)
Addresses self-reinforcement in LMs by penalizing the logits of previously generated tokens. The penalty applies only within a sliding window over the most recently generated tokens, and a length penalty constrains excessive shortening. This achieves stronger diversity and factuality than pure sampling, and human raters prefer it over leading baselines.
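A minimal sketch of a windowed repetition penalty in this spirit; the window size, penalty strength, and exact penalty form are assumptions rather than the paper's settings.

```python
import numpy as np

def penalized_step(logits, generated, window=64, penalty=1.5):
    """Illustrative windowed repetition-penalty step: logits of tokens that
    already appeared within the last `window` generated tokens are
    down-weighted before the argmax."""
    adjusted = logits.copy()
    for tok in set(generated[-window:]):
        # Divide positive logits and multiply negative ones, so the adjustment
        # always pushes the repeated token's score downward.
        adjusted[tok] = adjusted[tok] / penalty if adjusted[tok] > 0 else adjusted[tok] * penalty
    return int(np.argmax(adjusted))

logits = np.array([3.0, 2.5, 1.0, -0.5])
print(penalized_step(logits, generated=[]))       # no history: picks token 0
print(penalized_step(logits, generated=[0, 0]))   # token 0 penalized: picks token 1
```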
3.6. Adaptive Decoding (Zhu et al., 28 Feb 2024)
At each token, the candidate set is expanded dynamically by adding tokens, in descending probability order, whose inclusion increases a "confidence" score, computed via entropy normalization, by more than a fixed threshold.
The candidate set adapts to the local predictive uncertainty, yielding tight sets in easy contexts (boosting coherence) and broader sets in uncertain contexts (improving diversity).
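A deliberately simplified proxy for the idea: in the sketch below the candidate set's "confidence" is just its cumulative probability mass, and a token is added only while its marginal gain exceeds a threshold; the paper's entropy-normalized confidence is more involved, so treat this as an illustration of the adaptivity, not the method itself.

```python
import numpy as np

def adaptive_candidate_set(probs, epsilon=0.05):
    """Simplified adaptive candidate-set construction: add tokens in
    descending probability order while each token's marginal contribution
    to the set's probability mass exceeds epsilon."""
    order = np.argsort(probs)[::-1]
    chosen = [order[0]]                      # always keep the top token
    for tok in order[1:]:
        if probs[tok] > epsilon:             # marginal "confidence" gain large enough
            chosen.append(tok)
        else:
            break
    return np.array(chosen)

peaked = np.array([0.85, 0.10, 0.03, 0.01, 0.01])   # easy context -> tight set
flat   = np.array([0.25, 0.22, 0.20, 0.18, 0.15])   # uncertain context -> broad set
print(len(adaptive_candidate_set(peaked)), len(adaptive_candidate_set(flat)))  # -> 2 5
```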
3.7. GUARD: Glocal Uncertainty-Aware Robust Decoding (Ding et al., 28 Aug 2025)
Combines global (exponentially weighted moving average) and local (instantaneous) entropy to derive dynamic candidate set sizes and an uncertainty-adaptive token-count penalty, achieving near-maximal diversity and coherence with 3–4× speedups over standard contrastive search.
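The sketch below illustrates only the "glocal" bookkeeping: an exponentially weighted moving average of past step entropies is blended with the current entropy and mapped to a candidate-set size. The blend, the decay, and the size mapping are assumptions, not GUARD's formulation, and the token-count penalty is omitted.

```python
import numpy as np

class GlocalEntropyTracker:
    """Illustrative glocal uncertainty tracker: blends an EMA of past step
    entropies (global) with the current step's entropy (local) to size the
    candidate set."""
    def __init__(self, decay=0.9, k_min=2, k_max=20):
        self.decay, self.ema = decay, None
        self.k_min, self.k_max = k_min, k_max

    def candidate_size(self, probs):
        local = -float(np.sum(probs * np.log(probs + 1e-12)))
        self.ema = local if self.ema is None else self.decay * self.ema + (1 - self.decay) * local
        glocal = 0.5 * (local + self.ema)
        frac = glocal / np.log(len(probs))     # normalized uncertainty in [0, 1]
        return int(round(self.k_min + frac * (self.k_max - self.k_min)))

tracker = GlocalEntropyTracker()
peaked = np.array([0.9, 0.05, 0.02, 0.01, 0.01, 0.01])
flat = np.ones(6) / 6
print(tracker.candidate_size(peaked), tracker.candidate_size(flat))  # small, then large
```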
3.8. Momentum Decoding (Lan et al., 2022)
Formalizes generation as graph exploration, penalizing token candidates that induce "loops" (repetitions) via a resistance function dependent on loop depth. This restores greedy maximality outside loops while controlling degeneracy within them; the computational cost matches that of greedy decoding.
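A toy rendering of the loop-resistance idea: candidates that would revisit a recently generated token pay a penalty that decays with loop depth, and greedy decoding is recovered when no loop is present. The specific resistance function is an assumption, not the paper's.

```python
import numpy as np

def momentum_style_step(probs, generated, resistance=0.2):
    """Illustrative momentum-style step: a candidate that would re-enter an
    already visited state (repeat a token from the prefix, with loop depth
    the distance back to its last occurrence) pays a resistance penalty;
    otherwise plain greedy decoding is recovered."""
    scores = probs.copy()
    for v in range(len(probs)):
        if v in generated:
            depth = len(generated) - 1 - max(i for i, t in enumerate(generated) if t == v)
            scores[v] -= resistance / (depth + 1)   # shallow loops resisted most
    return int(np.argmax(scores))

probs = np.array([0.45, 0.40, 0.10, 0.05])
print(momentum_style_step(probs, generated=[]))      # no loops: greedy -> 0
print(momentum_style_step(probs, generated=[2, 0]))  # token 0 just emitted -> 1
```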
4. Quality-Diversity-Coherence Trade-offs: Evaluation and Hyperparameter Sensitivity
A central theme is the necessity of multicriteria optimization (Becker et al., 24 May 2024, Arias et al., 8 Oct 2024, Arias et al., 24 Oct 2024):
- Diversity: Unique $n$-grams (Distinct-$n$), Self-BLEU, rep-$n$ statistics (a minimal computation sketch follows this list).
- Coherence: Average log-likelihood under a large LM evaluator, SimCSE-based similarity between prompt and continuation.
- Factuality: Dataset-specific task accuracy, automatic metrics (e.g., MAFE (Brahman et al., 2022) for factual grounded generation, MAUVE for distributional similarity).
- Human preference: Pairwise head-to-head comparisons for fluency, coherence, informativeness.
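For concreteness, the sketch below computes Distinct-$n$ and one common rep-$n$ variant from whitespace-tokenized generations; published implementations may differ in tokenization and normalization.

```python
from collections import Counter

def distinct_n(texts, n=2):
    """Distinct-n: unique n-grams divided by total n-grams across a set of generations."""
    ngrams = [tuple(toks[i:i + n]) for t in texts
              for toks in [t.split()] for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

def rep_n(text, n=2):
    """rep-n (one common variant): fraction of n-grams in a single generation
    that are repeats of an earlier n-gram."""
    toks = text.split()
    ngrams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    counts = Counter(ngrams)
    return sum(c - 1 for c in counts.values()) / max(len(ngrams), 1)

samples = ["the cat sat on the mat", "the cat sat on the mat", "a dog ran home"]
print(distinct_n(samples, n=2), rep_n("the cat the cat the cat", n=2))
```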
Recent work emphasizes that MAUVE and other distributional metrics may misalign with actual human judgments; balanced diversity and coherence, as measured by human annotation, are more predictive of preference (Su et al., 2022, Arias et al., 24 Oct 2024). Empirical studies have also established that the best-in-class decoding strategy depends on both domain and hyperparameter configuration: appropriately tuned sampling and contrastive search (with $\alpha$ up to roughly $0.6$ and $k$ up to $10$) typically yield the best trade-offs, while deterministic beam search consistently underperforms in diversity and human ratings (Arias et al., 8 Oct 2024).
5. Specialized Directions: Factuality, Planning, and Multimodal Signal Integration
- Event-based skeleton planning (Li et al., 2022): Coarse-to-fine text generation via an explicit event transition planner (using ASER for event extraction, then sequence modeling) improves global coherence and diversity in long stories or dialogues.
- Grounded keys-to-text (Brahman et al., 2022): Hybrid retrieval-generation frameworks combine sparse keys for controllability with retrieved evidence passages, significantly increasing factual precision and recall.
- Multimodal prefixing (iNLG) (Zhu et al., 2022): Prepending context-conditioned image features—obtained via StableDiffusion, CLIP, and mapping networks—enables visually-inspired generation. This approach robustly increases coherence and diversity in few-shot settings, as demonstrated on story and concept-to-text tasks.
- Personalized and robust evaluation (Wang et al., 2023, Gu et al., 2020, Karpinska et al., 2021): Evaluation remains a fundamental challenge; learned metrics (Perception Score, PerSE) trained on human annotation outperform BLEU, ROUGE, and BERTScore in correlation with human judgment and must account for reviewer-specific preferences. Human evaluation protocols require careful design (calibration, reference inclusion, expert raters) to ensure reproducibility and valid discrimination, especially in open-ended domains.
6. Remaining Challenges and Future Directions
- Scalable multicriteria evaluation: New frameworks such as Q*Text (harmonic mean penalizing extremal values), union-free generic depth (partial order centrality), and Bradley–Terry models for method ranking seek to resolve incomparability among methods and metrics (Arias et al., 24 Oct 2024).
- Long-context and multi-turn generation: Efficient architectures for context windows spanning thousands of tokens (beyond chunked or sparse attention) remain an open research frontier (Becker et al., 24 May 2024).
- Bias, hallucination, and controllability: Explicit boosting or suppression for factuality, undesired content, or stylistic constraints must avoid oversuppression that collapses diversity or creative value.
- Adaptive and hyperparameter-free decoding: Hyperparameter sensitivity is a persistent challenge; methods such as ACS (Arias et al., 8 Oct 2024), GUARD (Ding et al., 28 Aug 2025), and entropy-based adaptation (Zhu et al., 28 Feb 2024) offer promising directions for reliable, context-aware decoding.
- Multi-agent and multi-party outputs: Modeling realistic dialogues with more than two participants or extended discursive structure is largely unexplored.
- Transparent and reproducible evaluation: Standardized reporting protocols for rater qualifications, scale calibration, timing, and inter-annotator agreement are critical for reliable human evaluation (Karpinska et al., 2021).
Open-ended text generation is a rapidly evolving domain with continual innovation in decoding methodology, planning, grounding, multi-modal integration, and evaluation. Progress hinges on resolving fine-grained trade-offs among fluency, diversity, controllability, and factuality, supported by nuanced and reproducible evaluation frameworks and robust, adaptive decoding strategies.