Data-to-Text Generation
- Data-to-Text Generation is the process of transforming structured data into coherent natural language text using methodologies like segmentation, planning, and copy mechanisms.
- Advanced models blend neural architectures, hierarchical encoding, and style control to ensure factual accuracy and maintain content order.
- Challenges include managing content selection, reducing hallucinations, and adapting systems for scalability and cross-lingual applications.
Data-to-Text Generation (DTG) describes the automatic transformation of structured input data—such as databases, tables, and knowledge graphs—into coherent and fluent natural language text. DTG is foundational within natural language generation (NLG), characterized by complex requirements for content selection, factual accuracy, ordering, and linguistic realization. State-of-the-art approaches blend planning, sequence modeling, copying, segmentation, and style control, driven by extensive benchmarks and rigorous evaluation metrics.
1. Formalization and Historical Context
DTG occupies the intersection of structured data representation and linguistic realization, and is often modeled as conditional text generation: $P(y \mid x) = \prod_{t=1}^{T} P(y_t \mid y_{<t}, x)$, where $x$ denotes input records, graphs, or tables and $y = (y_1, \dots, y_T)$ the output sequence (Sharma et al., 2022). Early pipelined architectures comprise (i) signal analysis, (ii) data interpretation, (iii) document planning with content selection, and (iv) microplanning and realization (Gkatzia, 2016). Neural sequence-to-sequence (seq2seq) models with attention and copy mechanisms now predominate, although explicit planning remains essential for content fidelity and discourse ordering.
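As a minimal illustration of the conditional framing, the sketch below flattens attribute-value records into a single string that a seq2seq model could condition on. The `<attr>` and `<val>` markers, the `Record` type, and the `linearize` helper are hypothetical choices made for illustration and are not taken from any cited system.

```python
from typing import Iterable, Tuple

Record = Tuple[str, str]  # (attribute, value); hypothetical record type


def linearize(records: Iterable[Record]) -> str:
    """Flatten records into one string, keeping the input order."""
    parts = []
    for attribute, value in records:
        # Marker tokens are an assumption; real systems pick their own.
        parts.append(f"<attr> {attribute} <val> {value}")
    return " ".join(parts)


example = [("name", "Aromi"), ("food", "Italian"), ("area", "riverside")]
source = linearize(example)
print(source)
```

Graph and table inputs need richer schemes (for example, marking row and column positions), which this sketch does not attempt.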
2. Content Selection, Planning, and Segmentation
Content selection, the determination of which facts to verbalize, is a central sub-problem. Rule-based methods encode logical pattern-to-message mappings, while trainable methods employ HMMs, classifiers, or neural gates to learn inclusion and ordering from corpora (Gkatzia, 2016). Several recent systems partition the output into interpretable segments explicitly grounded in input records, enabling coverage and hallucination constraints. Notably, the architecture of Shen et al. (2020) adopts a latent variable model over text segmentation and record correspondence. This segmentation mechanism ensures that each output fragment aligns to a specific data item, and the number of segments is regularized by a statistical constraint.
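To make the latent-variable idea concrete, here is a toy sketch that sums over every segmentation of an output sequence, and over the record each segment is aligned to, using a forward-style dynamic program. It is not the model of Shen et al. (2020): the scoring function `seg_logprob`, the uniform prior over records, and the `max_len` cap are all assumptions made for illustration.

```python
import math
from typing import Callable, List, Sequence


def logsumexp(values: List[float]) -> float:
    """Numerically stable log(sum(exp(v)))."""
    top = max(values)
    if top == -math.inf:
        return -math.inf
    return top + math.log(sum(math.exp(v - top) for v in values))


def log_marginal(
    tokens: Sequence[str],
    records: Sequence[str],
    seg_logprob: Callable[[Sequence[str], str], float],
    max_len: int = 3,
) -> float:
    """Log of the sum over every (segmentation, alignment) pair.

    alpha[j] holds the log-score of all ways to explain tokens[:j].
    """
    n = len(tokens)
    log_prior = -math.log(len(records))  # assumed uniform record choice
    alpha = [-math.inf] * (n + 1)
    alpha[0] = 0.0
    for end in range(1, n + 1):
        terms = []
        for start in range(max(0, end - max_len), end):
            segment = tokens[start:end]
            for record in records:
                terms.append(alpha[start] + log_prior + seg_logprob(segment, record))
        alpha[end] = logsumexp(terms)
    return alpha[n]


# Toy scorer (hypothetical): every token costs log(0.5), whatever the record.
def toy_scorer(segment: Sequence[str], record: str) -> float:
    return len(segment) * math.log(0.5)


value = log_marginal(["a", "b", "c"], ["r1", "r2"], toy_scorer)
```

The dynamic program avoids enumerating segmentations explicitly, so the cost grows with sequence length times `max_len` times the number of records rather than exponentially.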
Dynamic and macro content planners decompose long-form generation into high-level plan selection and subsequent realization (Puduppully et al., 2021, Puduppully et al., 2022, Puduppully et al., 2018). Hierarchical encoders model multi-level data structure, e.g., entity → record (Rebuffel et al., 2019, Puduppully et al., 2019). Explicit planning is reported to improve content selection and ordering metrics.
3. Generation Mechanisms, Neural Architectures, and Copying
Neural DTG systems apply RNN or Transformer-based encoders and decoders, frequently augmented by copy/pointer mechanisms ensuring data fidelity by copying tokens directly from the input (Sharma et al., 2022). Several extensions incorporate gating between copy and generate modes, coverage penalties, and attention regularizers. Modular approaches may first lexicalize data items via templates, then iteratively edit and fuse text using neural sentence fusion or sequence tagging architectures (Kasner et al., 2020).
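The gating between copy and generate modes can be sketched as a mixture of two distributions, in the style of pointer-generator networks. The function name, the plain-dictionary representation, and the example numbers are illustrative assumptions; in a real model `p_gen`, the vocabulary distribution, and the attention weights come from the decoder at each step.

```python
from collections import defaultdict
from typing import Dict, List


def mix_copy_generate(
    p_gen: float,
    vocab_probs: Dict[str, float],
    source_tokens: List[str],
    attention: List[float],
) -> Dict[str, float]:
    """Blend a vocabulary distribution with a copy distribution.

    final(w) = p_gen * vocab(w)
             + (1 - p_gen) * (attention mass on source positions holding w)
    """
    final = defaultdict(float)
    for word, prob in vocab_probs.items():
        final[word] += p_gen * prob
    for token, weight in zip(source_tokens, attention):
        final[token] += (1.0 - p_gen) * weight
    return dict(final)


# Made-up numbers for one decoding step.
dist = mix_copy_generate(
    p_gen=0.6,
    vocab_probs={"the": 0.7, "restaurant": 0.3},
    source_tokens=["Aromi", "Italian"],
    attention=[0.9, 0.1],
)
```

Because the copy term assigns probability to source tokens directly, a rare value such as a restaurant name can be produced even when it is absent from the output vocabulary.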
Advanced architectures feature:
- Segmented decoding over aligned input records (Shen et al., 2020).
- Hierarchical latent plans for paragraph- or entity-level structuring (Puduppully et al., 2021, Puduppully et al., 2018).
- Multi-task and multi-branch decoders controlling hallucination at word-level via fine-grained label supervision (Rebuffel et al., 2021).
- Stylization with explicit control over linguistic style, logic planning-guided data embedding, and style-mask extraction (Jing et al., 2023).
- Pixel-based table representations enabling visual structure preservation prior to text decoding (Alonso et al., 2023).
4. Factuality, Hallucination Control, and Evaluation Protocols
Factual consistency and hallucination minimization are acute challenges. Black-box attention architectures are prone to omissions, repetitions, and unsupported statements (Shen et al., 2020). Segmented models and hard coverage constraints reduce error rates: in human evaluations, the segmentation-based model of Shen et al. (2020) yielded zero “wrong facts” and fewer omissions than pointer-generator baselines.
Robustness to hallucinations is advanced via word-level alignment scoring and multi-branch decoding, which gate decoding pathways between factual, hallucinated, and fluency branches (Rebuffel et al., 2021). Factuality is now quantitatively benchmarked using NLI-based metrics (SummaC-Conv), entity overlap (NEOverlap), learned alignment scoring, and QA-based evaluation (QAFactEval) (Mahapatra et al., 2024).
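A crude way to see what overlap-style checks measure is to flag numeric or capitalized tokens in the output that never occur in the input values. The sketch below is a hypothetical stand-in, not NEOverlap or any other cited metric, which rely on named-entity recognizers, NLI models, or QA models rather than surface matching.

```python
import re
from typing import Dict, List


def unsupported_tokens(data: Dict[str, str], text: str) -> List[str]:
    """Return numeric or capitalized tokens in `text` absent from `data` values."""
    supported = set()
    for value in data.values():
        supported.update(re.findall(r"\w+", value.lower()))
    flagged = []
    for position, token in enumerate(re.findall(r"\w+", text)):
        # Skip the first token, which is capitalized regardless of content.
        is_candidate = token[0].isdigit() or (token[0].isupper() and position > 0)
        if is_candidate and token.lower() not in supported:
            flagged.append(token)
    return flagged


data = {"name": "Aromi", "food": "Italian", "rating": "5 out of 5"}
text = "Aromi serves Italian food in Cambridge and is rated 5 out of 5."
print(unsupported_tokens(data, text))
```

Surface matching of this kind misses paraphrased or inferred content, which is one reason learned alignment and entailment-based metrics are used instead.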
Leading LLMs (e.g., Llama 2) achieve superior factual consistency, particularly on lexically rich datasets, but compact encoder–decoder models (T5, BART) remain competitive on low-diversity corpora (Mahapatra et al., 2024). Model size generally correlates with factuality (positive AROC), but with diminishing marginal returns.
5. Extensions: Stylization, Continual and Cross-Lingual Learning
Stylized DTG introduces the requirement for style-conditioned generation, integrating logic planning, mask-based style representation, and unbiased data augmentation (Jing et al., 2023). StyleD2T demonstrated high coverage and style accuracy, with pseudo-triplet augmentation mitigating domain bias.
Self-training from self-memory (STSM) reduces the need for full parallel data: models train on a fraction of the data, validating outputs for coverage and reversibility, and achieve comparable BLEU/METEOR to full-data training (Ta, 2024). This unlocks continual learning scenarios.
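The coverage half of that validation step can be sketched as a filter over model outputs, keeping only those that mention every input value. This is a simplified illustration under stated assumptions, not the STSM procedure: `fake_generate` is a hypothetical stand-in for a trained model, matching is verbatim and case-insensitive, and the reversibility check is omitted.

```python
from typing import Callable, Dict, List, Tuple

Slots = Dict[str, str]


def covers_all_values(slots: Slots, text: str) -> bool:
    """True if every slot value occurs verbatim (case-insensitive) in text."""
    lowered = text.lower()
    return all(value.lower() in lowered for value in slots.values())


def select_self_training_pairs(
    inputs: List[Slots], generate: Callable[[Slots], str]
) -> List[Tuple[Slots, str]]:
    """Keep only generated outputs that pass the coverage check."""
    kept = []
    for slots in inputs:
        candidate = generate(slots)
        if covers_all_values(slots, candidate):
            kept.append((slots, candidate))
    return kept


# Hypothetical stand-in for a trained model: it never mentions the "area" slot.
def fake_generate(slots: Slots) -> str:
    return f"{slots['name']} serves {slots['food']} food."


pool = [
    {"name": "Aromi", "food": "Italian"},
    {"name": "Zizzi", "food": "Italian", "area": "riverside"},
]
pairs = select_self_training_pairs(pool, fake_generate)
```

Here only the first input survives, because the output for the second omits "riverside"; the surviving pairs would serve as additional training data in the next round.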
Cross-lingual DTG leverages curriculum learning on noisy aligned corpora, where difficulty is scored by alignment quality, sequence length, or word rarity. Annealing schedules focus training on progressively cleaner data shards, yielding reported gains in BLEU and faithfulness across 11 low-resource languages (Hari et al., 2024).
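One way to picture an annealing schedule of this kind is to rank examples by a noise score and shrink the admitted share as training progresses. The linear schedule, the `final_fraction` default, and the scores below are assumptions for illustration, not the settings of Hari et al. (2024).

```python
from typing import List, Sequence, Tuple


def keep_fraction(step: int, total_steps: int, final_fraction: float = 0.25) -> float:
    """Linearly anneal the share of data kept from 1.0 down to final_fraction."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 1.0 - progress * (1.0 - final_fraction)


def curriculum_subset(
    scored: Sequence[Tuple[str, float]], step: int, total_steps: int
) -> List[str]:
    """Return the cleanest examples allowed at this step.

    `scored` pairs each example with a noise score (lower is cleaner).
    """
    ranked = sorted(scored, key=lambda pair: pair[1])
    count = max(1, round(keep_fraction(step, total_steps) * len(ranked)))
    return [example for example, _ in ranked[:count]]


# Made-up noise scores, e.g. derived from alignment quality.
data = [("ex_a", 0.1), ("ex_b", 0.8), ("ex_c", 0.3), ("ex_d", 0.5)]
early = curriculum_subset(data, step=0, total_steps=100)
late = curriculum_subset(data, step=100, total_steps=100)
```

At the start every example is admitted; by the final step only the cleanest quarter remains, so late updates are driven by the best-aligned pairs.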
6. Benchmarks, Datasets, and Quantitative Results
Core DTG benchmarks include E2E (restaurant MR-to-text), WebNLG (RDF graph-to-text), WikiBio (infobox-to-biography), RotoWire/MLB (long-form sports summaries), ToTTo (table-to-cell description), DART (open-domain records), and stylized Taobao (product attribute–style triplets) (Sharma et al., 2022, Jing et al., 2023).
Representative test-set metrics:
| Model / Task | BLEU | METEOR | RG-P (%) | CS-F1 (%) | CO (%) | Human / other evaluation |
|---|---|---|---|---|---|---|
| Shen et al. (2020) segmented model (E2E) | 0.647 | 0.453 | | | | 0 wrong facts |
| PG baseline | 0.638 | 0.449 | | | | ~15 wrong facts |
| Macro (RotoWire) | 15.46 | | 97.6 | 42.9 | 17.7 | Minimal contradiction |
| PixT3 (ToTTo-LControl) | 45.4 | | | | | 72% faithful |
| TrICy (E2E NLG) | 69.29 | | | | | |
| StyleD2T (Taobao) | 5.77 | | | | | 97.49% style acc. |
All major systems report results across automatic metrics (BLEU, METEOR, ROUGE, PARENT, CIDEr), extraction-based factuality (RG, CS, CO), and complementary human studies assessing support, contradiction, and fluency.
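For the extraction-based scores, content selection (CS) is commonly computed as precision and recall between records extracted from the generated text and those extracted from the reference. The sketch below assumes the extraction step has already been done by some information-extraction model; the `Record` layout and the example records are hypothetical, and published implementations may differ in details such as duplicate handling.

```python
from typing import Set, Tuple

Record = Tuple[str, str, str]  # (entity, value, type); layout is an assumption


def content_selection_scores(
    generated: Set[Record], reference: Set[Record]
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 over unique extracted records."""
    overlap = len(generated & reference)
    precision = overlap / len(generated) if generated else 0.0
    recall = overlap / len(reference) if reference else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


generated = {("Lakers", "112", "PTS"), ("Lakers", "45", "REB"), ("Heat", "20", "AST")}
reference = {
    ("Lakers", "112", "PTS"),
    ("Lakers", "45", "REB"),
    ("Heat", "98", "PTS"),
    ("Heat", "38", "REB"),
}
precision, recall, f1 = content_selection_scores(generated, reference)
```

In this example two of three generated records match the reference, so precision is 2/3, recall is 1/2, and F1 is 4/7.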
7. Challenges, Limitations, and Future Directions
DTG faces ongoing challenges:
- Ensuring groundedness and mitigating hallucinations, especially in open-domain or noisy input scenarios.
- Scalability with respect to input size and output length, managed via hierarchical encoders, segmental planners, or visual table rendering.
- Model interpretability and controllable generation, demanding explicit segmental alignment, plan disclosure, and style specification.
- Cross-lingual and continual learning, requiring curriculum-learning strategies and efficient use of limited parallel data.
- Fairness and accountability in content selection, moving towards explainable planners and dataset documentation frameworks.
Future research is directed at unified architectures capable of implicit structure reasoning, improved integration of external reasoning engines for numeracy, scalable cross-domain transfer, and multimodal or stylized output control. Benchmark innovation continues around diverse datasets, multilinguality, and richer evaluation protocols for both automatic and human-centered assessment.