ESGBench: Explainable ESG QA

Updated 27 November 2025
  • ESGBench is a benchmark dataset and modular evaluation framework designed for explainable ESG question answering using annotated sustainability reports.
  • It employs a granular annotation schema, formal question taxonomy, and explainability metrics like Exact Match, String F1, Numeric Accuracy, and Retrieval Recall.
  • Baseline evaluations highlight challenges such as numeric ambiguity and retrieval bottlenecks, guiding improvements for transparent and robust ESG QA systems.

ESGBench is a benchmark dataset and modular evaluation framework for explainable question answering (QA) on corporate sustainability reports, focusing on Environmental, Social, and Governance (ESG) themes. Designed to facilitate fine-grained assessment of model reasoning in high-stakes, transparency-sensitive ESG contexts, ESGBench offers question–answer pairs annotated with human-curated answers and explicit traceability to supporting evidence, thus operationalizing explainability for both model benchmarking and system development (George et al., 20 Nov 2025).

1. Composition and Annotation Schema

ESGBench v0.1 comprises 119 question–answer pairs derived from 12 PDF sustainability or Task Force on Climate-related Financial Disclosures (TCFD) reports, covering 10 large public companies. Each QA pair is associated with a categorical ESG theme, distributed as follows:

ESG Category     | QA Pairs | Percentage
Environment (E)  | 50       | ≈42%
Social (S)       | 14       | ≈12%
Governance (G)   | 23       | ≈19%
Strategy         | 11       | ≈9%
Risk             | 2        | ≈2%

All records follow a granular annotation schema stored in JSONL:

  • company
  • doc
  • category (Environmental, Social, Governance, Strategy, Risk)
  • KPI_name
  • question
  • answer (verbatim string)
  • evidence_quote (verbatim supporting text)
  • page_num (integer)

Annotation guidelines mandate that each question must be answerable solely from its provided chunk or table. Answers are extracted verbatim with numerical units preserved. The evidence span is directly copied, creating a one-to-one mapping between system output and ground-truth context, which underpins explainability (George et al., 20 Nov 2025).
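As a rough illustration of the schema, the sketch below parses one JSONL line into a typed record and checks the category and page number. The `ESGRecord` class, the `parse_record` helper, and every value in the sample record (company, file name, figures, quote) are hypothetical and not taken from the dataset; only the field names come from the schema listed above.

```python
import json
from dataclasses import dataclass

# Category labels as listed in the annotation schema.
CATEGORIES = {"Environmental", "Social", "Governance", "Strategy", "Risk"}


@dataclass(frozen=True)
class ESGRecord:
    """One ESGBench-style record (hypothetical container class)."""
    company: str
    doc: str
    category: str
    KPI_name: str
    question: str
    answer: str
    evidence_quote: str
    page_num: int


def parse_record(line: str) -> ESGRecord:
    """Parse one JSONL line; missing or extra keys raise TypeError."""
    record = ESGRecord(**json.loads(line))
    if record.category not in CATEGORIES:
        raise ValueError(f"unknown category: {record.category!r}")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(record.page_num, int) or isinstance(record.page_num, bool):
        raise ValueError("page_num must be an integer")
    return record


# Invented sample record, for illustration only.
sample_line = json.dumps({
    "company": "ExampleCo",
    "doc": "exampleco_sustainability_2023.pdf",
    "category": "Environmental",
    "KPI_name": "Scope 1 emissions",
    "question": "What were ExampleCo's Scope 1 emissions in 2023?",
    "answer": "120 ktCO2e",
    "evidence_quote": "Scope 1 emissions fell to 120 ktCO2e in 2023.",
    "page_num": 12,
})
record = parse_record(sample_line)
```

Keeping `answer` and `evidence_quote` as untouched strings reflects the verbatim-extraction guideline; any normalization would belong in the evaluation step rather than in the stored record.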

2. Question Taxonomy and Formal Definitions

Although not explicitly labeled, ESGBench QA pairs can be mapped to a standard question taxonomy:

  • Factoid Questions: Require retrieval of a single, contiguous span $a$ from the document. Formally, for document $D$ tokenized as $s$, the answer for question $q$ is $a = s[i:j]$ for $i < j$.
  • Explanatory Questions: Require multi-sentence or paragraphic answers by concatenating multiple factual spans with minimal connective text. Formal structure is a set $\{s_{i_1:j_1}, s_{i_2:j_2}, \dots\}$.
  • Comparative Questions: Require a relative comparison such as “higher than” or “compare X vs. Y”. These are answered by identifying factual spans $a_1, a_2$ and stating relational property $R(a_1, a_2)$.

This taxonomy ensures coverage of direct fact lookup, reasoning across multiple pieces of evidence, and domain-specific comparative analysis (George et al., 20 Nov 2025).
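The three question types can be made concrete with a toy example. The sketch below applies the span notation to an invented sentence; the sentence, the token indices, and the `span` and `relation` helpers are assumptions for illustration and do not come from ESGBench.

```python
# Invented sentence, tokenized on whitespace to give the sequence s.
s = "Scope 1 emissions fell to 120 ktCO2e in 2023 from 150 ktCO2e in 2022".split()


def span(tokens: list[str], i: int, j: int) -> str:
    """Return the contiguous span tokens[i:j] as text, requiring i < j."""
    if not 0 <= i < j <= len(tokens):
        raise ValueError("span indices must satisfy 0 <= i < j <= len(tokens)")
    return " ".join(tokens[i:j])


def relation(a1: float, a2: float) -> str:
    """A simple relational property R(a1, a2) for comparative questions."""
    if a1 > a2:
        return "higher than"
    if a1 < a2:
        return "lower than"
    return "equal to"


# Factoid: a single contiguous span a = s[i:j].
factoid = span(s, 5, 7)

# Explanatory: a set of spans joined with minimal connective text.
explanatory = " and ".join(span(s, i, j) for i, j in [(5, 9), (10, 14)])

# Comparative: two factual spans a1, a2 plus the relation between them.
a1, a2 = float(span(s, 5, 6)), float(span(s, 10, 11))
comparative = f"2023 emissions were {relation(a1, a2)} 2022 emissions"
```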

3. Evaluation Metrics for Explainability

ESGBench advances four complementary explainability metrics, each precisely defined:

  1. Exact Match (EM): Binary metric. $\mathrm{EM}(p, g) = 1$ if lowercased prediction $p$ matches gold answer $g$; 0 otherwise.
  2. String F1: Measures overlap of tokens between prediction and ground truth. $\mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ with precision and recall computed on token sets.
  3. Numeric Accuracy@±2%: For extracted numeric values $v_p, v_g$ (units must match), $\text{Numeric Accuracy} = 1$ if $\left| \frac{v_p - v_g}{v_g} \right| \leq 0.02$, else 0.
  4. Retrieval Recall@K: $\mathrm{R}@K = 1$ if the ground-truth evidence page is within the top-K retrieved by the system; 0 otherwise. ESGBench sets $K = 5$.
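The four definitions translate fairly directly into code. The following is a minimal sketch rather than the benchmark's own scorer; the function names, whitespace tokenization, lowercasing and trimming of both strings, and the handling of a zero gold value are assumptions made here.

```python
def exact_match(pred: str, gold: str) -> int:
    """1 if the strings match after lowercasing and trimming (assumed normalization)."""
    return int(pred.strip().lower() == gold.strip().lower())


def string_f1(pred: str, gold: str) -> float:
    """Token-level F1 with precision and recall computed on token sets."""
    pred_tokens = set(pred.lower().split())
    gold_tokens = set(gold.lower().split())
    overlap = len(pred_tokens & gold_tokens)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def numeric_accuracy(v_p: float, v_g: float, unit_p: str, unit_g: str,
                     tol: float = 0.02) -> int:
    """1 if units match and the relative error is within tol (2% by default)."""
    if unit_p != unit_g:
        return 0
    if v_g == 0:
        # Relative error is undefined for a zero gold value; assumed convention.
        return int(v_p == 0)
    return int(abs((v_p - v_g) / v_g) <= tol)


def recall_at_k(retrieved_pages: list[int], gold_page: int, k: int = 5) -> int:
    """1 if the gold evidence page is among the top-k retrieved pages."""
    return int(gold_page in retrieved_pages[:k])
```

For instance, `string_f1("120 ktco2e in 2023", "120 ktCO2e")` has precision 0.5 and recall 1.0, giving an F1 of 2/3, while the same pair scores 0 on Exact Match. That gap between the two metrics is the kind of divergence visible in the baseline table.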

Composite explainability scoring schemes are supported, such as: $S = \alpha\,\mathrm{EM} + \beta\,\mathrm{F1} + \gamma\,\mathrm{R}@K$ with tunable weights subject to $\alpha + \beta + \gamma = 1$ (George et al., 20 Nov 2025).
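A small sketch of the weighted sum for a single question follows. The `composite_score` helper and the example weights (0.2, 0.3, 0.5) are illustrative assumptions; the source only requires that the three weights sum to 1.

```python
import math


def composite_score(em: float, f1: float, recall_at_k: float,
                    alpha: float, beta: float, gamma: float) -> float:
    """Weighted sum S = alpha*EM + beta*F1 + gamma*R@K with weights summing to 1."""
    if not math.isclose(alpha + beta + gamma, 1.0):
        raise ValueError("weights must sum to 1")
    return alpha * em + beta * f1 + gamma * recall_at_k


# Hypothetical question: wrong under EM, partial token overlap, evidence page retrieved.
score = composite_score(em=0, f1=0.6, recall_at_k=1,
                        alpha=0.2, beta=0.3, gamma=0.5)
```

With these illustrative weights the score is 0.2·0 + 0.3·0.6 + 0.5·1 = 0.68, so a larger $\gamma$ rewards systems that surface the correct evidence page even when the generated answer string is imperfect.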

4. Baseline Performance and Key Challenges

Baseline evaluation using a standard retrieval-augmented generation (RAG) QA system on ESGBench v0.1 yielded the following results:

Metric                 | Value
Exact Match            | 21.0%
String F1 (avg)        | 55.4%
Numeric Accuracy@±2%   | 45.3%
Retrieval Recall@5     | 70–80%

Per-category numeric accuracy: Environmental (48.0%), Social (35.7%), Governance (43.5%), Strategy (90.9%).
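A per-category breakdown like this can be produced by grouping per-question outcomes by ESG category. The sketch below assumes each result is a `(category, correct)` pair; the helper name and the sample outcomes are invented and do not reproduce the reported figures.

```python
from collections import defaultdict


def per_category_accuracy(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Return the fraction of correct answers for each category."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for category, correct in results:
        totals[category] += 1
        hits[category] += int(correct)
    return {category: hits[category] / totals[category] for category in totals}


# Invented outcomes, for illustration only.
sample_results = [
    ("Environmental", True),
    ("Environmental", False),
    ("Social", False),
    ("Strategy", True),
]
accuracy_by_category = per_category_accuracy(sample_results)
```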

Principal challenges identified include:

  • Numeric ambiguity: Variations in units (tCO₂e vs. ktCO₂e; “million” vs. “thousand”) impede direct string and numeric matches.
  • Retrieval bottlenecks: Failure to surface the relevant sentence or table row often limits end-to-end QA accuracy.
  • Semantic drift: Embedding models sometimes retrieve topically similar but substantively incorrect contexts.

These deficiencies propagate through downstream pipeline stages, highlighting the centrality of evidence recall for transparent QA (George et al., 20 Nov 2025).

5. Architectural Recommendations and Best Practices

The ESGBench framework suggests several practices for maximizing explainability and performance:

  • Dual Indexing: Maintain separate indexes for narrative text and table rows; leverage table-aware embeddings for structured data.
  • Domain-Adapted Retrieval: Apply ESG-specific encoder models (e.g., ClimateBERT, FinBERT) to improve context relevance.
  • Layout and Unit Normalization: Integrate parsers such as Camelot, supplemented with rules for standardizing units and scale words.
  • Reasoning Enhancement: Chain-of-thought (CoT) and multi-hop prompting techniques are especially effective for explanatory and comparative QA.
  • Monitoring: Track per-category performance to diagnose systematic model drift; flag answers when sources are outside recognized ESG taxonomies such as TCFD, CSRD, BRSR, or ISSB (George et al., 20 Nov 2025).
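The unit-normalization recommendation can be illustrated with a small rule-based sketch that converts quantities to a canonical unit before numeric comparison. The regular expression, the lookup tables, and the `normalize_quantity` helper are assumptions made for illustration; they cover only the variations named earlier (tCO₂e vs. ktCO₂e, “thousand” vs. “million”) and are not part of the ESGBench framework.

```python
import re

# Scale words and unit prefixes to fold into the numeric value (assumed rules).
SCALE_WORDS = {"thousand": 1e3, "million": 1e6}
UNIT_FACTORS = {"tco2e": ("tCO2e", 1.0), "ktco2e": ("tCO2e", 1e3)}

_QUANTITY = re.compile(
    r"(?P<num>\d[\d,]*(?:\.\d+)?)\s*"
    r"(?:(?P<scale>thousand|million)\b)?\s*"
    r"(?P<unit>[A-Za-z][A-Za-z0-9]*)?",
    re.IGNORECASE,
)


def normalize_quantity(text: str) -> tuple[float, str]:
    """Parse the first quantity in text and return (value, canonical unit)."""
    # Map the subscript two in "CO₂" to an ASCII digit before matching.
    match = _QUANTITY.search(text.replace("\u2082", "2"))
    if match is None:
        raise ValueError(f"no quantity found in {text!r}")
    value = float(match.group("num").replace(",", ""))
    scale = match.group("scale")
    if scale:
        value *= SCALE_WORDS[scale.lower()]
    unit = match.group("unit") or ""
    # Unknown units pass through unchanged with a factor of 1.
    canonical, factor = UNIT_FACTORS.get(unit.lower(), (unit, 1.0))
    return value * factor, canonical
```

Under these rules, `normalize_quantity("120 ktCO₂e")` and `normalize_quantity("120,000 tCO2e")` both return `(120000.0, "tCO2e")`, so a numeric comparison with a ±2% tolerance would treat them as the same value even though the strings differ.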

6. Implications for Transparent ESG QA Research

ESGBench lowers the barrier to methodological innovation in ESG QA by providing a reproducible, transparent pipeline (ingestion → indexing → QA generation → evaluation). Key implications:

  • Rapid iteration: Modular design facilitates benchmarking of retrieval strategies and reader architectures.
  • Auditable output: Verbatim evidence spans directly support answer traceability and post-hoc validation.
  • Taxonomy alignment: Linking outputs to evolving regulatory standards improves compliance monitoring and system relevance.
  • Extensibility: The schema and pipeline support multilingual expansion and coverage of additional ESG taxonomies.

A plausible implication is that incorporating ESGBench’s metrics and evidence-grounding within QA system development can promote higher trust and auditability in automated sustainability analysis, which is critical for both regulatory and stakeholder-facing applications (George et al., 20 Nov 2025).

7. Connection to Neurosymbolic QA and Future Directions

The modular, evidence-centric methodology established in ESGBench is synergistic with emerging neurosymbolic QA frameworks, such as those using logical inference engines (e.g., Prolog-based systems described in ProSLM (Vakharia et al., 17 Sep 2024)). These paradigms combine context gathering, symbolic proof generation, and LLM-based natural language reasoning with explicit fact validation, presenting a credible path toward fully transparent and robust ESG QA pipelines. Further research integrating ESGBench benchmarking with symbolic logic validation and user-personalized explanations is likely to accelerate advances in both ESG transparency and high-assurance QA (George et al., 20 Nov 2025, Vakharia et al., 17 Sep 2024).
