SustainableQA: A Comprehensive Question Answering Dataset for Corporate Sustainability and EU Taxonomy Reporting (2508.03000v1)
Abstract: The growing demand for corporate sustainability transparency, particularly under new regulations like the EU Taxonomy, necessitates precise data extraction from large, unstructured corporate reports. LLMs and Retrieval-Augmented Generation (RAG) systems, requires high-quality, domain-specific question-answering (QA) datasets to excel at particular domains. To address this, we introduce SustainableQA, a novel dataset and a scalable pipeline for generating a comprehensive QA datasets from corporate sustainability reports and annual reports. Our approach integrates semantic chunk classification, a hybrid span extraction pipeline combining fine-tuned Named Entity Recognition (NER), rule-based methods, and LLM-driven refinement, alongside a specialized table-to-paragraph transformation. With over 195,000 diverse factoid and non-factoid QA pairs, SustainableQA is an effective resource for developing and benchmarking advanced knowledge assistants capable of navigating complex sustainability compliance
Sponsor
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.
Top Community Prompts
Collections
Sign up for free to add this paper to one or more collections.