
Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data? (2309.08963v3)

Published 16 Sep 2023 in cs.CL

Abstract: Despite the remarkable capabilities of LLMs like GPT-4, producing complex, structured tabular data remains challenging. Our study assesses LLMs' proficiency in structuring tables and introduces a novel fine-tuning method, cognizant of data structures, to bolster their performance. We unveil Struc-Bench, a comprehensive benchmark featuring prominent LLMs (GPT-NeoX-20B, GPT-3.5, GPT-4, and Vicuna), which spans text tables, HTML, and LaTeX formats. Our proposed FormatCoT aids in crafting format-specific instructions from the intended outputs to populate this benchmark. Addressing the gap in task-centered evaluation, we propose two innovative metrics, P-Score (Prompting Score) and H-Score (Heuristical Score), to more accurately gauge LLM performance. Our experiments show that applying our structure-aware fine-tuning to LLaMA-7B leads to substantial performance gains, outshining its LLM counterparts across most measures. In-depth error analysis and creating an ability map across six dimensions -- coverage, formatting, reasoning, comprehension, pragmatics, and hallucination -- highlight areas for future enhancements and suggest forthcoming research trajectories. Our code and models can be found at https://github.com/gersteinlab/Struc-Bench.

Evaluating LLMs on Complex Structured Data Generation

The paper "Struc-Bench: Are LLMs Really Good at Generating Complex Structured Data?" addresses a critical yet underexplored area in the capabilities of LLMs: their proficiency in generating complex, structured data. While models like GPT-4 have demonstrated remarkable prowess in generating natural language text, their performance on tasks requiring structured outputs—such as tables in formats like raw text, HTML, and LaTeX—remains questionable. This paper embarks on a comprehensive assessment of LLMs in this regard and proposes a new solution to enhance their capabilities.

Struc-Bench and Evaluation of LLMs

The authors introduce Struc-Bench, a benchmark for structured data generation comprising carefully constructed datasets in raw-text, HTML, and LaTeX table formats. The benchmark scrutinizes well-recognized LLMs, including GPT-NeoX-20B, GPT-3.5, GPT-4, and Vicuna, revealing common formatting errors and identifying areas for potential improvement.
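
As a rough illustration, a Struc-Bench-style evaluation pairs a format-specific instruction with source text and a reference table, prompts each model, and compares predictions against references. The field names and the `generate` placeholder below are assumptions for illustration, not the benchmark's actual schema (see the linked repository for that):

```python
from dataclasses import dataclass

@dataclass
class StrucBenchExample:
    instruction: str  # format-specific instruction (as produced by FormatCoT)
    source: str       # input text describing the table content
    target: str       # reference table in text, HTML, or LaTeX
    fmt: str          # one of "text", "html", "latex"

def generate(model, prompt: str) -> str:
    """Placeholder for an LLM call; swap in a real API or local model client."""
    raise NotImplementedError

def collect_outputs(model, examples):
    """Prompt the model on every example and pair predictions with references."""
    results = []
    for ex in examples:
        prompt = f"{ex.instruction}\n\n{ex.source}"
        results.append((ex.target, generate(model, prompt)))
    return results
```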

A significant contribution of this paper is the creation of a model ability map across six dimensions: coverage, formatting, reasoning, comprehension, pragmatics, and hallucination. This map underscores the inherent weaknesses of LLMs in managing complex structured outputs. The analysis shows that the evaluated models often fail to maintain structural fidelity and content accuracy, particularly when handling intricate data structures such as tables.

FormatCoT and Structure-Aware Fine-Tuning

To address these shortcomings, the authors combine FormatCoT (format Chain-of-Thought), which generates detailed format instructions from target outputs, with structure-aware instruction tuning. Fine-tuning LLaMA-7B on the resulting instruction-output pairs notably improves the model's adherence to structural constraints across multiple data formats.
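
To make that pipeline concrete, here is a minimal sketch of how structure-aware training pairs could be assembled. In the paper, FormatCoT prompts an LLM to write the format instructions; the simple table-inspection heuristic below stands in for that step purely for illustration:

```python
def describe_text_table(table: str) -> str:
    """Derive a format instruction from a pipe-delimited reference table.
    A heuristic stand-in for FormatCoT, which uses an LLM for this step."""
    rows = [r for r in table.strip().splitlines() if r.strip()]
    n_cols = len(rows[0].split("|")) if rows else 0
    return (f"Generate a plain-text table with {len(rows)} rows and {n_cols} "
            f"pipe-separated columns that covers the source content exactly.")

def make_training_pair(source: str, target_table: str) -> dict:
    """Assemble one (instruction, input, output) fine-tuning record."""
    return {
        "instruction": describe_text_table(target_table),
        "input": source,
        "output": target_table,
    }

pair = make_training_pair(
    "Alice scored 90 and Bob scored 85.",
    "name | score\nAlice | 90\nBob | 85",
)
```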

The comparative analysis shows that this fine-tuning markedly improves LLaMA-7B's capacity to generate structured outputs, with the fine-tuned model outperforming the other examined LLMs on most measures. The evaluation combines established metrics such as SacreBLEU, ROUGE-L, and BERTScore with the paper's proposed P-Score (Prompting Score) and H-Score (Heuristical Score), offering a holistic view of model performance.
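
The paper's exact H-Score heuristics are not reproduced here; the sketch below shows the general idea of a structure-aware heuristic score for plain-text tables, rewarding agreement on shape and cell content rather than surface strings. The pipe-delimited parsing and the equal weighting of the two terms are assumptions:

```python
def parse_table(table: str) -> list[list[str]]:
    """Split a pipe-delimited text table into stripped cells."""
    return [[cell.strip() for cell in row.split("|")]
            for row in table.strip().splitlines() if row.strip()]

def structural_score(pred: str, ref: str) -> float:
    """Score agreement on table shape and cell content, in [0, 1]."""
    p, r = parse_table(pred), parse_table(ref)
    if not p or not r:
        return 0.0
    shape = min(len(p), len(r)) / max(len(p), len(r))   # row-count agreement
    p_cells = {c for row in p for c in row}
    r_cells = {c for row in r for c in row}
    content = len(p_cells & r_cells) / max(len(r_cells), 1)  # cell overlap
    return 0.5 * shape + 0.5 * content  # equal weighting is an assumption
```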

Implications and Future Directions

The findings have substantial practical implications for applications that require precise structured data generation, such as automated reporting systems, coding assistants, and data visualization pipelines. The paper suggests that there is considerable room for LLMs to improve in domains requiring structured output generation.

Future investigations may expand domain-specific benchmarks and explore multi-modal LLMs capable of processing more varied data modalities. Additionally, advances in techniques that bolster LLMs' numerical reasoning and structured data handling could greatly enhance their utility in practical applications.

Overall, the work presented in this paper paves the way for a nuanced understanding of LLM capabilities in structured data contexts and opens pathways for further refinement and exploration in the domain of structured text generation.

Authors
  1. Xiangru Tang
  2. Yiming Zong
  3. Jason Phang
  4. Yilun Zhao
  5. Wangchunshu Zhou
  6. Arman Cohan
  7. Mark Gerstein