T2R-bench: A Benchmark for Generating Article-Level Reports from Real World Industrial Tables (2508.19813v1)

Published 27 Aug 2025 in cs.CL

Abstract: Extensive research has been conducted to explore the capabilities of LLMs in table reasoning. However, the essential task of transforming table information into reports remains a significant challenge for industrial applications. This task is plagued by two critical issues: 1) the complexity and diversity of tables lead to suboptimal reasoning outcomes; and 2) existing table benchmarks lack the capacity to adequately assess the practical application of this task. To fill this gap, we propose the table-to-report task and construct a bilingual benchmark named T2R-bench, where the key information flows from the tables to the reports. The benchmark comprises 457 industrial tables, all derived from real-world scenarios and encompassing 19 industry domains as well as 4 types of industrial tables. Furthermore, we propose evaluation criteria to fairly measure the quality of report generation. Experiments on 25 widely-used LLMs reveal that even state-of-the-art models like Deepseek-R1 achieve only a 62.71 overall score, indicating that LLMs still have room for improvement on T2R-bench. Source code and data will be available after acceptance.

Summary

  • The paper introduces a novel benchmark that evaluates LLMs on generating comprehensive, article-level reports from real-world industrial tables.
  • It employs a multi-stage annotation pipeline and three distinct evaluation criteria (NAC, ICC, GEC) to ensure numerical accuracy, content coverage, and reasoning depth.
  • The study reveals that even top-performing models struggle with large and complex tables, highlighting the need for improved table understanding and synthesis.

T2R-bench: A Comprehensive Benchmark for Article-Level Report Generation from Real-World Industrial Tables

Motivation and Task Definition

The T2R-bench paper addresses the critical gap in evaluating LLMs for the table-to-report (T2R) task, which requires generating comprehensive, article-level reports from complex, real-world industrial tables. Unlike prior benchmarks that focus on table question answering (QA) or table-to-text generation, T2R-bench targets the synthesis of multi-paragraph, analytical, and actionable reports that reflect the demands of business intelligence, industrial analytics, and enterprise reporting.

Figure 1: The table-to-report task requires models to analyze numerical data from tables and generate comprehensive, coherent, and accurate reports, including descriptions, analysis, and conclusions.

The T2R task is characterized by several unique challenges: (1) high table complexity and diversity (multi-table, complex headers, extremely large tables), (2) the need for deep reasoning and synthesis beyond fact extraction, and (3) the lack of suitable evaluation metrics for long-form, data-grounded report generation.

Benchmark Construction and Dataset Characteristics

T2R-bench comprises 457 real-world industrial tables, spanning 19 domains and four table types: single tables, multiple tables, complex structured tables, and extremely large-size tables. The data is sourced from public industrial datasets, government statistics, and open data platforms, with rigorous manual curation to ensure domain relevance, information density, and privacy compliance.

Figure 2: The construction pipeline for T2R-bench includes table data collection, question annotation, and report reference annotation, with multi-stage human and LLM involvement.

The annotation pipeline involves: (1) semi-automatic question generation using expert-designed prompts and LLM self-instruct, (2) dual-annotator filtering for question quality, and (3) report reference keypoint extraction, where multiple LLM-generated reports are distilled into core keypoints and further refined by human annotators. This process yields 910 high-quality questions and 4,320 annotated report keypoints.
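The paper's distillation code is not included in this summary, so the following Python sketch is only a rough illustration of the keypoint-extraction step: several candidate reports are split into sentences, embedded, and merged into a deduplicated set of candidate keypoints for human review. The embedding model, similarity threshold, and helper names are assumptions rather than details from the paper.

```python
# Hypothetical sketch: distill multiple LLM-generated reports into
# deduplicated candidate keypoints for human annotators to refine.
# The model choice and the 0.85 threshold are illustrative assumptions.
import re

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity


def split_sentences(report: str) -> list[str]:
    # Naive bilingual sentence splitter; a real pipeline would use a proper tokenizer.
    return [s.strip() for s in re.split(r"[。.!?]\s*", report) if s.strip()]


def distill_keypoints(candidate_reports: list[str], threshold: float = 0.85) -> list[str]:
    sentences = [s for report in candidate_reports for s in split_sentences(report)]
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(sentences)

    keypoints: list[str] = []
    kept: list[np.ndarray] = []
    for sentence, emb in zip(sentences, embeddings):
        # Keep a sentence only if it is not a near-duplicate of an accepted keypoint.
        if kept and cosine_similarity([emb], kept)[0].max() >= threshold:
            continue
        keypoints.append(sentence)
        kept.append(emb)
    return keypoints
```

In T2R-bench the distilled candidates are then filtered and refined by human annotators; this sketch covers only the automatic deduplication step.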

Figure 3: T2R-bench exhibits broad domain coverage, bilingual (Chinese/English) support, a high proportion of complex and large tables, and a diverse distribution of report keypoints per question.

Key dataset properties include:

  • 8.3% of tables are extremely large (over 50,000 cells).
  • 28.9% are complex structured tables (hierarchical headers, merged cells).
  • 23.6% are multi-table scenarios.
  • Bilingual coverage (Chinese and English) with balanced performance across languages.

Evaluation Criteria: Beyond Surface Metrics

Recognizing the inadequacy of standard text generation metrics (BLEU, ROUGE) for T2R, the authors propose a three-pronged evaluation framework:

  1. Numerical Accuracy Criterion (NAC): Measures the factual correctness of numerical statements in generated reports by extracting numerical claims, generating verification questions, and using code-generation LLMs to programmatically validate answers against the source tables. Majority voting among three code LLMs (Qwen2.5-32B-Coder-Instruct, Deepseek-Coder, CodeLlama-70B-Instruct) ensures robustness; a verification sketch follows this list.
  2. Information Coverage Criterion (ICC): Quantifies semantic alignment between generated reports and annotated keypoints using a normalized mutual information (MI) formulation, with BERTScore as the similarity kernel. This captures both coverage and relevance of critical content; a coverage sketch also follows this list.
  3. General Evaluation Criterion (GEC): Employs LLM-as-a-judge to rate reports on reasoning depth, human-like style, practicality, content completeness, and logical coherence, using a strict scoring rubric.
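Neither the exact verification prompts nor the scoring code appear in this summary, so the sketches below are minimal, hedged illustrations in Python. For NAC, assume each numerical claim has already been converted into a verification question and that each code LLM returns a numeric answer; the ask_code_llm helper is hypothetical, not an interface from the paper.

```python
# Hedged NAC sketch: majority-vote three code-LLM answers and compare the
# result to the claimed value. `ask_code_llm` is a hypothetical stand-in
# for running a code LLM against the source table.
from collections import Counter
from typing import Callable

CODE_LLMS = ["Qwen2.5-32B-Coder-Instruct", "Deepseek-Coder", "CodeLlama-70B-Instruct"]


def claim_is_accurate(
    question: str,
    claimed_value: float,
    table_path: str,
    ask_code_llm: Callable[[str, str, str], float],
    rel_tol: float = 1e-3,
) -> bool:
    # One programmatic answer per verifier model, rounded to stabilize voting.
    answers = [round(ask_code_llm(model, question, table_path), 4) for model in CODE_LLMS]
    voted, _ = Counter(answers).most_common(1)[0]  # majority vote over three answers
    return abs(voted - claimed_value) <= rel_tol * max(1.0, abs(claimed_value))


def nac_score(claim_results: list[bool]) -> float:
    # Report-level NAC: fraction of numerical claims that pass verification.
    return sum(claim_results) / len(claim_results) if claim_results else 0.0
```

For ICC, the paper's normalized mutual-information formulation is not reproduced here; the sketch below swaps in a simpler max-similarity coverage proxy, matching each annotated keypoint to its best BERTScore F1 over the report's sentences and averaging.

```python
# Hedged ICC sketch: keypoint coverage with BERTScore F1 as the similarity
# kernel. This is a max-similarity proxy, not the paper's MI formulation.
from bert_score import score as bert_score


def icc_proxy(report_sentences: list[str], keypoints: list[str], lang: str = "en") -> float:
    if not report_sentences or not keypoints:
        return 0.0
    best_scores = []
    for keypoint in keypoints:
        # Score the keypoint against every report sentence and keep the best match.
        refs = [keypoint] * len(report_sentences)
        _, _, f1 = bert_score(report_sentences, refs, lang=lang, verbose=False)
        best_scores.append(float(f1.max()))
    return sum(best_scores) / len(best_scores)
```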

This framework is validated against human expert judgments, achieving high correlation (Pearson's r = 0.908), and is more stringent than human evaluation, especially for factual and coverage errors.

Experimental Results and Analysis

Twenty-five state-of-the-art LLMs (open and closed source) are evaluated on T2R-bench. The best-performing model, Deepseek-R1, achieves an average score of only 62.71% across NAC, ICC, and GEC, with all models exhibiting substantial performance degradation on extremely large tables and multi-table scenarios.

Figure 4: LLM performance on NAC and ICC drops sharply as table cell count increases, highlighting the challenge of scaling to large tabular inputs.

Key findings:

  • Deepseek-R1 and Qwen3-32B lead overall, but no model exceeds 65% on any criterion for the hardest cases.
  • Performance is consistent across Chinese and English, except for Llama-3.3-70B, which underperforms on Chinese.
  • Markdown is the most effective table input format, outperforming HTML and JSON; a small serialization example follows this list.
  • Human-generated reports set a much higher baseline (96.52% in human evaluation), with LLMs lagging by over 30 points.

    Figure 5: Example of a DeepSeek-R1-generated report with critical errors, including numerical hallucinations and table selection mistakes.
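The summary does not show how tables were serialized for each input format, so the snippet below is only an illustration of the Markdown-versus-HTML/JSON comparison: the same (invented) pandas DataFrame is rendered in all three formats before being placed in a prompt. Note that DataFrame.to_markdown() requires the optional tabulate dependency.

```python
# Illustration only: one small, made-up table serialized in the three
# input formats compared in the paper.
import pandas as pd

df = pd.DataFrame(
    {"region": ["North", "South"], "revenue_musd": [12.4, 9.8], "yoy_growth_pct": [5.1, -1.2]}
)

markdown_table = df.to_markdown(index=False)  # format found most effective in T2R-bench
html_table = df.to_html(index=False)
json_table = df.to_json(orient="records")

print(markdown_table)
```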

Case studies and error analysis reveal:

  • Frequent numerical hallucinations, especially in large or multi-table settings.
  • Structural misinterpretation of complex headers and cross-table references.
  • Truncation errors when table size exceeds context window.
  • Missing keypoints and incomplete coverage, directly impacting ICC.

    Figure 6: Case study of an English extremely large-size table, illustrating truncation and coverage errors in LLM-generated reports.

    Figure 7: Case study of a Chinese complex structured table, highlighting numerical and structural reasoning failures.

Implications, Limitations, and Future Directions

T2R-bench exposes fundamental limitations in current LLMs' ability to perform deep, reliable, and scalable table-to-report generation. The low NAC and ICC scores, especially on large and complex tables, indicate that existing models lack robust mechanisms for numerical reasoning, structural comprehension, and long-context synthesis in industrial settings.

The benchmark's design—real-world data, bilingual support, multi-table and large-table coverage, and keypoint-based evaluation—sets a new standard for assessing practical table understanding and report generation. However, the authors note that further expansion in table diversity and the development of specialized architectures (e.g., hybrid symbolic-neural models, retrieval-augmented LLMs, or table-aware pretraining) are necessary to close the gap with human performance.

The evaluation framework, particularly the use of code LLMs for numerical verification and MI-based coverage metrics, provides a template for future benchmarks in data-to-text and long-form factual generation.

Conclusion

T2R-bench establishes a rigorous, high-coverage benchmark for the table-to-report task, revealing that even the strongest LLMs fall short of industrial requirements for article-level report generation from real-world tables. The benchmark, dataset, and evaluation methodology will drive research toward more reliable, scalable, and semantically faithful table understanding systems, with significant implications for business intelligence, automated analytics, and enterprise reporting.
