Explainable ESG QA Systems
- ESGBench is a benchmark for explainable ESG question answering that assesses QA models on corporate sustainability reports while enforcing immutable evidence traceability.
- The ESGBench dataset comprises 119 QA pairs across key ESG categories, ensuring answers are derived verbatim from report segments.
- The framework leverages metrics such as Exact Match, String F1, Numeric Accuracy, and Retrieval Recall to transparently evaluate model performance.
ESGBench is a benchmark dataset and explainability framework specifically designed for the evaluation of question-answering (QA) systems in the domain of Environmental, Social, and Governance (ESG) analysis, with a focus on corporate sustainability reports. It emphasizes fine-grained traceability between model answers and source-disclosed evidence, supporting transparent and reproducible model evaluation in ESG-aligned information systems (George et al., 20 Nov 2025).
1. Dataset Structure and Composition
ESGBench v0.1 comprises 119 QA pairs extracted from 12 PDF sustainability and TCFD (Task Force on Climate-related Financial Disclosures) reports corresponding to 10 publicly traded companies. The QA pairs span five annotation categories, with the following distribution:
| Theme/Category | Number of QA Pairs | Approximate Proportion |
|---|---|---|
| Environment (E) | 50 | 42% |
| Social (S) | 14 | 12% |
| Governance (G) | 23 | 19% |
| Strategy | 11 | 9% |
| Risk | 2 | 2% |
Annotation is performed by constrained prompting over narrative report chunks and normalized table rows, enforcing that each question must be answerable solely from the provided report segment and that answers be verbatim, with precise units and values preserved. Each QA record is represented in a JSONL schema capturing the company, document context, category, KPI_name, question, verbatim answer, evidence_quote (verbatim supporting text), and source page number. This structure guarantees a direct, immutable linkage between questions, answers, and primary evidence spans (George et al., 20 Nov 2025).
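To make this schema concrete, the following minimal Python sketch parses one such JSONL record; the field list follows the description above, but the exact key spellings and the validation logic are illustrative assumptions rather than the official ESGBench loader.

```python
import json
from dataclasses import dataclass

@dataclass
class ESGBenchRecord:
    """One ESGBench QA pair (field names follow the schema described above;
    exact key spellings are assumptions for illustration)."""
    company: str
    document: str        # document context, e.g., report file name
    category: str        # Environment / Social / Governance / Strategy / Risk
    kpi_name: str
    question: str
    answer: str          # verbatim answer with units and values preserved
    evidence_quote: str  # verbatim supporting text from the report
    page: int            # source page number

def load_records(path: str) -> list[ESGBenchRecord]:
    """Read a JSONL file and enforce the evidence-linkage invariant:
    every record carries a non-empty verbatim quote and a page pointer."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            rec = ESGBenchRecord(**json.loads(line))
            assert rec.evidence_quote and rec.page >= 1, "missing evidence linkage"
            records.append(rec)
    return records
```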
2. Taxonomy of Question Types and Formal Definitions
ESGBench questions, though not explicitly type-labeled in the current release, naturally fall into several commonly recognized QA classes:
- Factoid Questions: These require extraction of a contiguous text span, $a = d_{i:j}$, drawn verbatim from the source document $D = (d_1, \dots, d_n)$.
- Explanatory Questions: These call for a multi-sentence or paragraph-level response synthesizing multiple spans, $a = g(s_1, \dots, s_m)$, with spans $s_1, \dots, s_m$ drawn from $D$.
- Comparative Questions: These request a relational answer (e.g., “compare X to Y”), requiring identification of facts $f_X, f_Y$ and a relation $r(f_X, f_Y)$.
All questions are annotated with verbatim evidence quotes, supporting rigorous, evidence-grounded explainability (George et al., 20 Nov 2025).
3. Explainability Metrics and Evaluation
The benchmark operationalizes explainability through four formally specified metrics, all computed with strict linkage to the annotated gold evidence:
- Exact Match (EM): $\mathrm{EM}(\hat{a}, a^{*}) = \mathbb{1}[\hat{a} = a^{*}]$, where $\hat{a}$ is the prediction and $a^{*}$ is the gold answer.
- String F1: Computed as token-level overlap: $F_1 = \frac{2PR}{P + R}$, with precision $P$ and recall $R$ taken over the shared tokens of $\hat{a}$ and $a^{*}$.
- Numeric Accuracy@±2%: Targets numerical value accuracy with tolerance; for numeric values $\hat{v}$ and $v^{*}$ (units must match), the prediction counts as correct iff $|\hat{v} - v^{*}| \le 0.02\,|v^{*}|$.
- Retrieval Recall@K (R@K): $\mathrm{R@K} = \mathbb{1}[p^{*} \in \text{top-}K\ \text{retrieved pages}]$; it succeeds if the gold evidence page $p^{*}$ is among the top $K$ retrieved pages.
Combined, these enable holistic analysis of both surface answer quality and explanation traceability. Composite scores can be defined, e.g., $S = w_1\,\mathrm{EM} + w_2\,F_1 + w_3\,\mathrm{NumAcc} + w_4\,\mathrm{R@K}$ for weights $w_i$ summing to 1 (George et al., 20 Nov 2025).
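For reference, the sketch below implements these four metrics plus a composite score; the whitespace tokenization, the numeric-parsing heuristic, and the equal weights are assumptions, not the benchmark's official scoring code.

```python
import re

def exact_match(pred: str, gold: str) -> float:
    """EM: 1 if the prediction equals the gold answer after trivial whitespace/case cleanup."""
    return float(pred.strip().lower() == gold.strip().lower())

def string_f1(pred: str, gold: str) -> float:
    """Token-level F1 over whitespace tokens (tokenization choice is an assumption)."""
    p_toks, g_toks = pred.lower().split(), gold.lower().split()
    common = sum(min(p_toks.count(t), g_toks.count(t)) for t in set(p_toks) & set(g_toks))
    if common == 0:
        return 0.0
    precision, recall = common / len(p_toks), common / len(g_toks)
    return 2 * precision * recall / (precision + recall)

def numeric_accuracy(pred: str, gold: str, tol: float = 0.02) -> float:
    """Numeric Accuracy@±2%: first number in each string must agree within tol (units checked upstream)."""
    nums = [re.search(r"-?\d+(?:[.,]\d+)?", s) for s in (pred, gold)]
    if not all(nums):
        return 0.0
    v_pred, v_gold = (float(m.group().replace(",", "")) for m in nums)
    return float(abs(v_pred - v_gold) <= tol * abs(v_gold))

def retrieval_recall_at_k(retrieved_pages: list[int], gold_page: int, k: int = 5) -> float:
    """R@K: 1 if the gold evidence page appears among the top-K retrieved pages."""
    return float(gold_page in retrieved_pages[:k])

def composite(em: float, f1: float, num_acc: float, r_at_k: float,
              weights: tuple = (0.25, 0.25, 0.25, 0.25)) -> float:
    """Weighted composite score with weights summing to 1 (equal weights are illustrative)."""
    return sum(w * m for w, m in zip(weights, (em, f1, num_acc, r_at_k)))
```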
4. Baseline Results and Highlighted Challenges
Initial evaluation of a RAG (retrieval-augmented generation) baseline on ESGBench v0.1 yields:
- Exact Match (EM): 21.0%
- String F1: 55.4%
- Numeric Accuracy@±2%: 45.3%
- Retrieval Recall@5: 70–80%
Performance varies by category (per-category EM/NumAcc: Environmental 48.0%, Social 35.7%, Governance 43.5%, Strategy 90.9%). Major system-level challenges include:
- Unit and scale normalization in numeric table responses (e.g., “tCO₂e” vs. “ktCO₂e,” or “million” vs. “thousand”); see the normalization sketch after this list.
- Upstream retrieval errors—missing key sentences or table rows—significantly depress downstream answer correctness.
- Domain alignment limitations in semantic retrieval; generic LLM embeddings sometimes retrieve thematically similar but factually irrelevant context (George et al., 20 Nov 2025).
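To illustrate the unit and scale issue, the following is a small normalization sketch of the kind a reader or scorer might apply before numeric comparison; the conversion factors, alias table, and function name are illustrative assumptions, not part of ESGBench.

```python
# Illustrative unit/scale normalization for ESG figures (assumed conversion factors, not from the benchmark).
SCALE_FACTORS = {
    "thousand": 1e3, "k": 1e3,
    "million": 1e6, "m": 1e6,
    "billion": 1e9,
}
UNIT_ALIASES = {
    "ktco2e": ("tco2e", 1e3),   # kilotonnes CO2e -> tonnes CO2e
    "mtco2e": ("tco2e", 1e6),   # megatonnes CO2e -> tonnes CO2e
    "tco2e": ("tco2e", 1.0),
}

def normalize_quantity(value: float, unit: str, scale_word: str | None = None) -> tuple[float, str]:
    """Convert a reported figure to a canonical unit so that, e.g.,
    (12.5, "ktCO2e") and (12500.0, "tCO2e") compare as equal."""
    if scale_word:
        value *= SCALE_FACTORS.get(scale_word.lower(), 1.0)
    canonical, factor = UNIT_ALIASES.get(unit.lower().replace("₂", "2"), (unit, 1.0))
    return value * factor, canonical
```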
5. Evidence Retrieval, Reasoning, and Justification
The ESGBench evaluation protocol is tightly coupled to evidence-centric QA. QA systems are required to:
- Retrieve the top-$K$ most relevant report chunks or table rows for each posed question.
- Constrain answer generation such that outputs are drawn from these retrieved contexts, enforcing verbatim units and phraseology.
- Explicitly output the evidence_quote and page/document link for every answer provided.
This structure supports robust post-hoc validation: e.g., R@K (retrieval recall at $K$), as well as evidence-precision via token overlap. For explanatory/comparative queries, recommended approaches include chaining evidence IDs and multi-hop or chain-of-thought prompting referencing document spans (George et al., 20 Nov 2025).
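The protocol can be made concrete with a small sketch of the per-answer output record and a post-hoc check that the cited quote actually appears on the cited page; the record fields and helper names are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class JustifiedAnswer:
    """Output structure required by the evidence-centric protocol: the answer itself,
    the verbatim evidence quote, and a pointer to its source page and document."""
    answer: str
    evidence_quote: str
    page: int
    document: str

def validate_justification(ans: JustifiedAnswer, page_texts: dict[int, str]) -> bool:
    """Post-hoc check: the quoted evidence must appear verbatim on the cited page.
    `page_texts` maps page numbers to extracted page text (an assumed representation)."""
    return ans.evidence_quote.strip() in page_texts.get(ans.page, "")
```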
6. Recommendations for System Architecture and Research
Empirical analysis of ESGBench suggests the following pipeline components and methodological best practices:
- Retrieval: Dual-index both narrative text chunks and normalized table rows; employ table-aware and domain-adapted embeddings (e.g., ClimateBERT, FinBERT).
- Reasoning: Constrain neural answer generation to retrieved spans, enforce verbatim reproduction of numeric units, and utilize chain-of-thought prompts as required.
- Justification: Mandate inclusion of evidence_quote with pointer to source; post-hoc compute R@K and evidence-precision.
- Governance: Track all core metrics (EM, F1, NumAcc) per ESG category to monitor for systematic drifts as models adapt to evolving disclosure taxonomies (e.g., TCFD, CSRD, ISSB), flagging out-of-taxonomy responses.
A reproducible end-to-end pipeline (ingest → index → QA generation → evaluate) supports rapid experimentation with retrieval/ranking, explanation strategies, coverage extension (e.g., new ESG taxonomies or languages), and reader fine-tuning using the paired, evidence-centered QA records (George et al., 20 Nov 2025).
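A minimal skeleton of such a pipeline is sketched below, following the ingest → index → QA generation → evaluate flow; the injected stage callables, the `.search` method on the index, and all names are assumptions rather than a published reference implementation.

```python
from typing import Any, Callable

def run_pipeline(
    ingest: Callable[[list[str]], list[dict]],           # PDF reports -> narrative chunks + normalized table rows
    build_index: Callable[[list[dict]], Any],            # chunks -> searchable (dual) index
    answer: Callable[[str, list[dict]], dict],           # (question, contexts) -> answer with evidence_quote and page
    evaluate: Callable[[list[dict], list[dict]], dict],  # (predictions, gold records) -> per-category EM/F1/NumAcc/R@K
    report_paths: list[str],
    qa_records: list[dict],
    k: int = 5,
) -> dict:
    """Ingest -> index -> QA generation -> evaluate. Stage callables are injected so that
    retrieval models, readers, and explanation strategies can be swapped and benchmarked."""
    chunks = ingest(report_paths)                         # 1. ingest text chunks and table rows
    index = build_index(chunks)                           # 2. dual-index narrative and tabular content
    predictions = []
    for rec in qa_records:                                # 3. retrieve top-k contexts, generate justified answers
        contexts = index.search(rec["question"], k=k)     #    (.search is an assumed interface)
        predictions.append(answer(rec["question"], contexts))
    return evaluate(predictions, qa_records)              # 4. score per ESG category to monitor drift
```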
7. Broader Implications and Future Directions
The methodology and design principles exemplified by ESGBench provide actionable guidance for constructing explainable, auditable QA pipelines in ESG and related compliance domains:
- Every model output should possess an immutable evidence trail linking natural-language answers to primary-source ESG data, enforcing transparent and accountable automated reporting.
- Modular separation of retrieval, reasoning, and justification stages accelerates evaluation and benchmarking of specialized components (retrieval models, answer generators, explanation generators).
- The presence of per-question, per-category, and composite explainability metrics enables systematic detection of model weaknesses and accurate measurement of progress along multiple axes—answer factuality, reasoning transparency, and traceable citation to original disclosures.
A plausible implication is that ESGBench accelerates research not just in accuracy, but in traceability, establishing a repeatable paradigm for QA evaluation wherever evidence-grounded trust and regulatory auditability are paramount (George et al., 20 Nov 2025).