T2R-bench: Bilingual Table-to-Report Benchmark
- T2R-bench is a bilingual (Chinese and English) benchmark for generating detailed narrative reports, with actionable data analysis, from complex industrial tables.
- It rigorously evaluates LLMs using metrics such as numerical factuality, keypoint coverage, and overall qualitative report quality.
- The benchmark covers diverse industrial domains and table structures, highlighting challenges in multi-table and real-world data reporting.
T2R-bench is a bilingual benchmark for the generation of article-level reports from real-world industrial tables, designed to rigorously evaluate and advance the capacity of LLMs and related models in table-to-report generation under industrial constraints. Unlike previous tasks focusing on table-to-text or question answering, T2R-bench emphasizes the transformation of complex, often multi-table structured data into comprehensive narrative reports containing descriptions, analyses, and conclusions. It addresses the emerging industrial requirement for end-to-end automation of data-driven reporting, accounting for the diversity, size, and heterogeneity inherent in practical table data.
1. Task Definition and Motivation
T2R-bench formalizes the table-to-report task: converting structured industrial tables into detailed narrative reports. The objectives extend beyond information extraction or high-level summarization. The output must exhibit logical flow, deep data analysis, and actionable conclusions, as is typically expected of industrial analytics reports.
Motivating factors for T2R-bench include:
- The prevalence of complex, real-world table structures in industrial settings (e.g., hierarchical headers, non-uniform layouts, merged cells).
- The inadequacy of existing table benchmarks in evaluating the practical capabilities of models to perform detailed and context-sensitive report generation.
- The need for fine-grained, reliable evaluation methods to distinguish between shallow data transformation and genuine analytical reporting.
2. Dataset Construction and Composition
T2R-bench comprises 457 tables sourced from publicly available industrial data repositories such as municipal open data platforms, national statistics bureaus, and industry association sites. The dataset encompasses 19 sub-domains within 6 core industry areas: engineering & science, environmental stewardship, transportation & logistics, social policy & administration, consumer lifestyle, and financial economics.
Table types included are:
- Single tables: Isolated, well-formatted industrial tables.
- Multiple tables: Sets of interlinked tables or sheets whose joint information is required to answer analytical queries.
- Complex structured tables: Tables with hierarchical indices and non-uniform, merged-cell layouts.
- Large-size tables: Tables containing millions of cells, imposing context-length and memory constraints.
Paired with these tables are 910 questions and 4,320 manually annotated keypoints (∼4.75 per question) representing gold-standard informational units that a high-quality industrial report should cover.
3. Evaluation Framework and Criteria
The evaluation of table-to-report systems under T2R-bench employs three principal criteria:
| Criterion | Primary Function | Technical Approach |
|---|---|---|
| NAC (Numerical Accuracy Criterion) | Numerical factuality of report statements | Automatic verification with 3 code LLMs and majority voting |
| ICC (Information Coverage Criterion) | Coverage of annotated keypoints in generated reports | Semantic similarity (BERTScore) with mutual information |
| GEC (General Evaluation Criterion) | Overall qualitative report quality | LLM-as-judge scoring on multiple axes |
3.1 Numerical Accuracy Criterion (NAC)
NAC fact-checks all numerical statements in the generated report by:
- Segmenting the report into sentence clusters containing numbers.
- Auto-generating verification questions for those facts.
- Employing three code-generation LLMs (Qwen2.5-32B-Coder-Instruct, Deepseek-Coder, CodeLlama-70B-Instruct) to compute answers.
- Using majority voting to establish the correct value.
The NAC score quantifies alignment between report numerics and the source table, revealing whether the model preserves and manipulates quantitative data correctly.
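The voting step can be illustrated with a minimal Python sketch. The function names, the rounding tolerance `ndigits`, the rule of skipping claims that lack a majority, and the aggregation of NAC as a fraction of matching claims are all assumptions made for illustration; the source does not specify these details.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def majority_vote(answers: Sequence[Optional[float]], ndigits: int = 2) -> Optional[float]:
    """Return the value that a strict majority of verifiers agree on, else None.

    `answers` holds one numeric answer per verifier; None marks a verifier
    whose generated code failed to produce a value (hypothetical convention).
    """
    rounded = [round(a, ndigits) for a in answers if a is not None]
    if not rounded:
        return None
    value, count = Counter(rounded).most_common(1)[0]
    # Require agreement from more than half of all verifiers, failed ones included.
    return value if count * 2 > len(answers) else None


def nac_score(claims: Sequence[Tuple[float, Sequence[Optional[float]]]], ndigits: int = 2) -> float:
    """Fraction of checkable numeric claims that match the voted value (assumed aggregation)."""
    checked = 0
    correct = 0
    for reported, verifier_answers in claims:
        voted = majority_vote(verifier_answers, ndigits)
        if voted is None:
            continue  # assumption: claims without a majority are not scored
        checked += 1
        if round(reported, ndigits) == voted:
            correct += 1
    return correct / checked if checked else 0.0


# Each claim pairs the number stated in the report with three verifier answers.
example_claims = [
    (12.5, [12.5, 12.5, 13.0]),  # two of three verifiers confirm the report
    (7.0, [8.0, 8.0, 8.0]),      # verifiers agree, report is wrong
    (3.0, [1.0, 2.0, None]),     # no majority, so the claim is skipped
]
print(nac_score(example_claims))
```

With three verifiers, a strict majority means at least two matching answers, so a single faulty code generation does not flip the verdict.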
3.2 Information Coverage Criterion (ICC)
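ICC measures how much of the key informational content (as distilled in annotated keypoints) is present in the generated report, using a normalized mutual information between keypoints and report sentence clusters:

$$I(K; S) = \sum_{i}\sum_{j} p(k_i, s_j)\,\log\frac{p(k_i, s_j)}{p(k_i)\,p(s_j)}$$

where the joint distribution $p(k_i, s_j)$ and the marginals $p(k_i)$ and $p(s_j)$ are calculated from the similarity matrix (via BERTScore) between each keypoint $k_i$ and sentence cluster $s_j$. Normalization yields values in $[0, 1]$.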
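As a rough illustration of a mutual-information coverage score, the sketch below treats a non-negative keypoint-by-cluster similarity matrix as an unnormalized joint distribution, computes mutual information from it, and divides by the smaller of the two marginal entropies so that the result lies in [0, 1]. The choice of normalizer, the function name, and the assumption that similarities are already non-negative are all illustrative; the source does not state the exact normalization used by T2R-bench.

```python
import math
from typing import List


def normalized_mutual_information(sim: List[List[float]]) -> float:
    """Normalized mutual information of a non-negative similarity matrix.

    sim[i][j] is the similarity between keypoint i and sentence cluster j
    (for example a BERTScore value clipped at zero; an assumption here).
    """
    total = sum(sum(row) for row in sim)
    if total <= 0:
        return 0.0
    # Joint distribution over (keypoint, cluster) pairs.
    joint = [[value / total for value in row] for row in sim]
    # Marginals over keypoints (rows) and sentence clusters (columns).
    p_k = [sum(row) for row in joint]
    p_s = [sum(col) for col in zip(*joint)]

    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:
                mi += p * math.log(p / (p_k[i] * p_s[j]))

    h_k = -sum(p * math.log(p) for p in p_k if p > 0)
    h_s = -sum(p * math.log(p) for p in p_s if p > 0)
    denom = min(h_k, h_s)  # assumed normalizer: MI never exceeds either entropy
    if denom <= 0:
        return 0.0
    return max(0.0, min(1.0, mi / denom))


# Each keypoint matches exactly one cluster: maximal score.
print(normalized_mutual_information([[1.0, 0.0], [0.0, 1.0]]))
# Every keypoint is equally similar to every cluster: no information.
print(normalized_mutual_information([[1.0, 1.0], [1.0, 1.0]]))
```

A sharply peaked matrix, where each keypoint aligns with one specific part of the report, scores high, while a flat matrix scores near zero.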
3.3 General Evaluation Criterion (GEC)
GEC amalgamates qualitative aspects of report quality:
- Depth of reasoning
- Human-likeness
- Practicality (e.g., actionable insights)
- Content completeness
- Logical coherence
LLM-as-judge approaches with specific, multi-dimensional prompts compute individual aspect scores, which are averaged for the final GEC.
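The aggregation step amounts to an unweighted mean over the five aspects. The sketch below assumes the judge's per-aspect scores have already been parsed into a dictionary; the key names and the score scale in the example are hypothetical, since the source does not specify them.

```python
from statistics import mean
from typing import Dict

# Aspect keys are illustrative names for the five GEC dimensions listed above.
GEC_ASPECTS = (
    "depth_of_reasoning",
    "human_likeness",
    "practicality",
    "content_completeness",
    "logical_coherence",
)


def gec_score(aspect_scores: Dict[str, float]) -> float:
    """Average the judge's per-aspect scores into a single GEC value."""
    missing = [name for name in GEC_ASPECTS if name not in aspect_scores]
    if missing:
        raise ValueError(f"missing aspect scores: {missing}")
    return mean(aspect_scores[name] for name in GEC_ASPECTS)


# Hypothetical judge output for one report.
judged = {
    "depth_of_reasoning": 4.0,
    "human_likeness": 3.0,
    "practicality": 5.0,
    "content_completeness": 4.0,
    "logical_coherence": 4.0,
}
print(gec_score(judged))
```

Rejecting incomplete judge output, rather than averaging over whatever aspects are present, keeps scores comparable across reports.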
4. Empirical Results and Analysis
The benchmark evaluation covers 25 state-of-the-art LLMs, both open- and closed-source. Even the best-performing model, Deepseek-R1, reaches an average overall score of only 62.71, highlighting persistent challenges.
Performance patterns include:
- Input format: Serializing tables as Markdown yields higher model performance than serializing them as HTML or JSON.
- Table complexity sensitivity: There is a marked decrease in performance with increased table cell counts, and especially with multiple or complex-structure tables. NAC and ICC are particularly affected.
- Bilingual robustness: Most models perform similarly in Chinese and English settings, although some (notably Llama variants) show language-related divergence.
- Task granularity: Single-table reports are easier for LLMs, while multi-table and large-table inputs pose more severe reasoning and truncation issues.
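To make the input-format comparison concrete, the following sketch serializes the same small table as Markdown and as JSON. The helper names and the sample data are hypothetical, and the benchmark's actual serialization code may differ.

```python
import json
from typing import List, Sequence


def to_markdown(header: Sequence[str], rows: Sequence[Sequence[object]]) -> str:
    """Render a flat table as a pipe-delimited Markdown table."""
    def fmt(cells: Sequence[object]) -> str:
        # Escape literal pipes so cell text cannot break the column layout.
        return "| " + " | ".join(str(c).replace("|", "\\|") for c in cells) + " |"

    lines: List[str] = [fmt(header), "| " + " | ".join("---" for _ in header) + " |"]
    lines.extend(fmt(row) for row in rows)
    return "\n".join(lines)


def to_json(header: Sequence[str], rows: Sequence[Sequence[object]]) -> str:
    """Render the same table as a JSON list of row objects."""
    return json.dumps([dict(zip(header, row)) for row in rows], ensure_ascii=False)


# Hypothetical sample data.
header = ["city", "pm25"]
rows = [["Beijing", 42], ["Shanghai", 35]]
print(to_markdown(header, rows))
print(to_json(header, rows))
```

The JSON form repeats every column name in every row, so the same table consumes more of the prompt than the Markdown form, which matters for large tables under context-length limits.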
These findings indicate that current general-purpose LLMs do not yet robustly handle the nontrivial transformations and analyses demanded by realistic industrial reporting.
5. Domain Coverage and Representativeness
T2R-bench was explicitly constructed to maximize both breadth and realism in industrial reporting scenarios. The choice of six major industry domains and 19 sub-domains is intended to support broad applicability and to discourage overfitting to narrow domains or tabular styles.
The diverse sources and complex, heterogeneous table structures ensure that the benchmark cannot be trivially solved with pattern-matching, short-context, or non-compositional reasoning mechanisms. This suggests that further pretraining or architectural adaptations may be necessary for reliable table-to-report generation in industrial applications.
6. Research Directions and Open Problems
Key open problems and future research opportunities highlighted by T2R-bench include:
- Model scaling for very large tables and collections of interdependent tables, potentially necessitating context-aware and memory-augmented architectures.
- Enhanced pretraining or domain adaptation to more effectively model the intricate, multi-source data flows found in industrial reporting.
- Improved evaluation, particularly with respect to human correlation for coverage and numerical consistency, possibly via further enhancement of the numerical accuracy and keypoint matching modules.
- Development of new paradigms for selective content curation from extremely large or noisy tabular inputs.
A plausible implication is that industrial table-to-report generation will eventually require purpose-built models or modeling strategies that go beyond current general-purpose LLMs.
7. Significance and Impact
T2R-bench establishes itself as a foundational benchmark for industrial table-to-report tasks, providing a dataset and evaluation methodology that is both realistic and rigorous. By exposing the current limitations of LLMs even at state-of-the-art scale and serving as a catalyst for new methods, it prompts the development of more advanced techniques tailored for industrial data analytics applications. The comprehensive coverage of table types, the availability of bilingual resources, and the multifaceted evaluation framework position T2R-bench as a critical resource for both academic research and practical deployment of AI-driven industrial reporting systems (Zhang et al., 27 Aug 2025).