Explainable ESG QA Systems

Updated 27 November 2025
  • ESGBench is a benchmark for explainable ESG question answering that assesses QA models on corporate sustainability reports with immutable evidence traceability.
  • The ESGBench dataset comprises 119 QA pairs across key ESG categories, ensuring answers are derived verbatim from report segments.
  • The framework leverages metrics such as Exact Match, String F1, Numeric Accuracy, and Retrieval Recall to transparently evaluate model performance.

ESGBench is a benchmark dataset and explainability framework specifically designed for the evaluation of question-answering (QA) systems in the domain of Environmental, Social, and Governance (ESG) analysis, with a focus on corporate sustainability reports. It emphasizes fine-grained traceability between model answers and source-disclosed evidence, supporting transparent and reproducible model evaluation in ESG-aligned information systems (George et al., 20 Nov 2025).

1. Dataset Structure and Composition

ESGBench v0.1 comprises 119 QA pairs extracted from 12 PDF sustainability and TCFD (Task Force on Climate-related Financial Disclosures) reports corresponding to 10 publicly traded companies. The QA pairs span five annotation categories, with the following distribution:

Theme/Category     Number of QA Pairs     Approximate Proportion
Environment (E)    50                     42%
Social (S)         14                     12%
Governance (G)     23                     19%
Strategy           11                     9%
Risk               2                      2%

Annotation is performed by constrained prompting over narrative report chunks and normalized table rows, enforcing that each question be answerable solely from the provided report segment and that answers be reproduced verbatim, with precise units and values preserved. Each QA record is represented in a JSONL schema capturing the company, document context, category, KPI_name, question, verbatim answer, evidence_quote (verbatim supporting text), and source page number. This structure guarantees a direct, immutable linkage between questions, answers, and primary evidence spans (George et al., 20 Nov 2025).
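
To make the schema concrete, the following is a minimal sketch of what a single ESGBench-style JSONL record might look like; the company, figures, and quote are hypothetical illustrations, not values drawn from the released dataset.

```python
# Hypothetical ESGBench-style record, following the JSONL schema described
# above (company, document context, category, KPI_name, question, verbatim
# answer, evidence_quote, page). All field values are illustrative only.
import json

record = {
    "company": "ExampleCorp",                      # hypothetical company
    "document": "ExampleCorp_TCFD_2024.pdf",       # source report file
    "category": "Environment",                     # one of E/S/G/Strategy/Risk
    "KPI_name": "Scope 1 emissions",
    "question": "What were ExampleCorp's Scope 1 emissions in FY2024?",
    "answer": "12,450 tCO2e",                      # verbatim, units preserved
    "evidence_quote": "Scope 1 emissions totalled 12,450 tCO2e in FY2024.",
    "page": 37,                                    # page of the evidence span
}

# One record per line in the JSONL file.
print(json.dumps(record, ensure_ascii=False))
```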

2. Taxonomy of Question Types and Formal Definitions

ESGBench questions, though not explicitly type-labeled in the current release, naturally fall into several commonly recognized QA classes:

  • Factoid Questions: These require extraction of a contiguous text span, $a$, drawn verbatim from the source document $D$:

a = s[i:j], \quad \text{for some token indices } i < j;\ s: \text{tokenized } D.

  • Explanatory Questions: These call for a multi-sentence or paragraph-level response synthesizing multiple spans,

a = \{ s_{i_1:j_1}, s_{i_2:j_2}, \ldots \} + \text{ connective text}.

  • Comparative Questions: These request a relational answer (e.g., “compare X to Y”), requiring identification of facts $a_1, a_2$ and a relation $R(a_1, a_2)$.

All questions are annotated with verbatim evidence quotes, supporting rigorous, evidence-grounded explainability (George et al., 20 Nov 2025).

3. Explainability Metrics and Evaluation

The benchmark operationalizes explainability through four formally specified metrics, all computed with strict linkage to the annotated gold evidence:

  1. Exact Match (EM):

\mathrm{EM}(p, g) = \begin{cases} 1, & \text{if } \mathrm{lowercase}(p) = \mathrm{lowercase}(g) \\ 0, & \text{otherwise} \end{cases}

where $p$ is the prediction and $g$ is the gold answer.

  2. String F1: Computed as token-level overlap:

\mathrm{Precision} = \frac{|\text{tokens}(p) \cap \text{tokens}(g)|}{|\text{tokens}(p)|}, \quad \mathrm{Recall} = \frac{|\text{tokens}(p) \cap \text{tokens}(g)|}{|\text{tokens}(g)|}, \quad \mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}

  3. Numeric Accuracy@±2%: Targets numerical value accuracy with tolerance; for numeric values $v_p$, $v_g$ (units must match),

\mathrm{NumAcc} = \begin{cases} 1, & \left|\frac{v_p - v_g}{v_g}\right| \le 0.02 \\ 0, & \text{otherwise} \end{cases}

  4. Retrieval Recall@K (R@K): Succeeds if the gold evidence page $p^*$ is among the top $K$ retrieved pages.

Combined, these enable holistic analysis of both surface answer quality and explanation traceability. Composite scores can be defined, e.g., $S = \alpha \mathrm{EM} + \beta \mathrm{F1} + \gamma \mathrm{R@K}$ for weights summing to 1 (George et al., 20 Nov 2025).
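
A minimal sketch of these four metrics as defined above is shown below; whitespace tokenization, lowercase normalization, and the example composite weights are assumptions, since the benchmark's exact preprocessing is not specified here.

```python
# Minimal sketch of the four ESGBench metrics as defined above. Whitespace
# tokenization and lowercasing are assumptions; exact preprocessing may differ.
from collections import Counter

def exact_match(pred: str, gold: str) -> int:
    return int(pred.strip().lower() == gold.strip().lower())

def string_f1(pred: str, gold: str) -> float:
    p_tok, g_tok = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p_tok) & Counter(g_tok)).values())  # shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / len(p_tok)
    recall = overlap / len(g_tok)
    return 2 * precision * recall / (precision + recall)

def numeric_accuracy(v_pred: float, v_gold: float, tol: float = 0.02) -> int:
    # Assumes units have already been checked/normalized to match.
    return int(abs((v_pred - v_gold) / v_gold) <= tol)

def retrieval_recall_at_k(retrieved_pages: list[int], gold_page: int, k: int = 5) -> int:
    return int(gold_page in retrieved_pages[:k])

# Composite score with weights summing to 1 (example weights, an assumption).
def composite(em: float, f1: float, r_at_k: float,
              alpha: float = 0.4, beta: float = 0.4, gamma: float = 0.2) -> float:
    return alpha * em + beta * f1 + gamma * r_at_k
```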

4. Baseline Results and Highlighted Challenges

Initial evaluation of a RAG (retrieval-augmented generation) baseline on ESGBench v0.1 yields:

  • Exact Match (EM): 21.0%
  • String F1: 55.4%
  • Numeric Accuracy@±2%: 45.3%
  • Retrieval Recall@5: 70–80%

Performance varies markedly by category, with per-category EM/NumAcc of 48.0% (Environment), 35.7% (Social), 43.5% (Governance), and 90.9% (Strategy). Major system-level challenges include:

  • Unit and scale normalization in numeric table responses (e.g., “tCO₂e” vs. “ktCO₂e,” or “million” vs. “thousand”); see the sketch after this list.
  • Upstream retrieval errors—missing key sentences or table rows—significantly depress downstream answer correctness.
  • Domain alignment limitations in semantic retrieval; generic LLM embeddings sometimes retrieve thematically similar but factually irrelevant context (George et al., 20 Nov 2025).
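
The following hypothetical helper illustrates the unit and scale normalization issue flagged in the first bullet above; the parsing rule and conversion table are assumptions for illustration only, not part of ESGBench.

```python
# Hypothetical unit/scale normalization helper illustrating the failure mode
# described above (e.g. "ktCO2e" vs. "tCO2e", "million" vs. "thousand").
# The parsing rules and conversion factors are assumptions, not part of ESGBench.
import re

SCALE = {
    "thousand": 1e3, "million": 1e6, "billion": 1e9,
    "tco2e": 1.0, "ktco2e": 1e3, "mtco2e": 1e6,   # tonnes CO2-equivalent
}

def normalize_quantity(text: str) -> float | None:
    """Parse a numeric answer like '12.45 ktCO2e' or '3 million' into a base value."""
    m = re.search(r"([-+]?\d[\d,]*\.?\d*)\s*([A-Za-z][A-Za-z0-9]*)?", text)
    if not m:
        return None
    value = float(m.group(1).replace(",", ""))
    unit = (m.group(2) or "").lower()
    return value * SCALE.get(unit, 1.0)   # unknown units fall back to factor 1

# Example: both forms map to comparable base values, so a ±2% check can succeed.
a = normalize_quantity("12,450 tCO2e")
b = normalize_quantity("12.45 ktCO2e")
assert a is not None and b is not None and abs(a - b) / b <= 0.02
```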

5. Evidence Retrieval, Reasoning, and Justification

The ESGBench evaluation protocol is tightly coupled to evidence-centric QA. QA systems are required to:

  1. Retrieve the top-$K$ most relevant report chunks or table rows for each posed question.
  2. Constrain answer generation such that outputs are drawn from these retrieved contexts, enforcing verbatim units and phraseology.
  3. Explicitly output the evidence_quote and page/document link for every answer provided.

This structure supports robust post-hoc validation: e.g., R@K (retrieval recall at $K$), as well as evidence-precision via token overlap. For explanatory/comparative queries, recommended approaches include chaining evidence IDs and multi-hop or chain-of-thought prompting referencing document spans (George et al., 20 Nov 2025).
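
A minimal sketch of this evidence-grounded output contract and a post-hoc check is given below; the GroundedAnswer fields mirror the evidence_quote/page linkage described above, while the evidence-precision definition (token overlap between the cited quote and the gold quote) is an assumption.

```python
# Sketch of the evidence-grounded output contract described above plus a simple
# post-hoc check. The evidence-precision definition here is an assumption.
from dataclasses import dataclass

@dataclass
class GroundedAnswer:
    answer: str            # verbatim span, units preserved
    evidence_quote: str    # verbatim supporting text from the retrieved context
    page: int              # page of the cited evidence
    document: str          # source report

def evidence_precision(cited_quote: str, gold_quote: str) -> float:
    cited = cited_quote.lower().split()
    gold = set(gold_quote.lower().split())
    if not cited:
        return 0.0
    return sum(tok in gold for tok in cited) / len(cited)

def validate(pred: GroundedAnswer, gold_page: int, gold_quote: str,
             retrieved_pages: list[int], k: int = 5) -> dict:
    return {
        "retrieval_recall_at_k": int(gold_page in retrieved_pages[:k]),
        "evidence_precision": evidence_precision(pred.evidence_quote, gold_quote),
        "page_matches": int(pred.page == gold_page),
    }
```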

6. Recommendations for System Architecture and Research

Empirical analysis of ESGBench suggests the following pipeline components and methodological best practices:

  • Retrieval: Dual-index both narrative text chunks and normalized table rows; employ table-aware and domain-adapted embeddings (e.g., ClimateBERT, FinBERT).
  • Reasoning: Constrain neural answer generation to retrieved spans, enforce verbatim reproduction of numeric units, and utilize chain-of-thought prompts as required.
  • Justification: Mandate inclusion of evidence_quote with pointer to source; post-hoc compute R@K and evidence-precision.
  • Governance: Track all core metrics (EM, F1, NumAcc) per ESG category to monitor for systematic drifts as models adapt to evolving disclosure taxonomies (e.g., TCFD, CSRD, ISSB), flagging out-of-taxonomy responses.

A reproducible end-to-end pipeline (ingest → index → QA generation → evaluate) supports rapid experimentation with retrieval/ranking, explanation strategies, coverage extension (e.g., new ESG taxonomies or languages), and reader fine-tuning using the paired, evidence-centered QA records (George et al., 20 Nov 2025).
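
As one illustration of such a pipeline, the sketch below wires placeholder retriever and reader interfaces into an ingest-to-evaluation loop with per-category scoring; the interfaces and the EM-only scoring are assumptions for illustration, not the paper's implementation.

```python
# Schematic index -> retrieve -> answer -> evaluate loop following the pipeline
# recommendations above. Retriever/Reader are placeholder interfaces; ESGBench
# does not prescribe a specific implementation.
from typing import Protocol

class Retriever(Protocol):
    # Returns chunks/table rows, each carrying text and a page number.
    def search(self, question: str, k: int) -> list[dict]: ...

class Reader(Protocol):
    # Returns {"answer", "evidence_quote", "page"} constrained to retrieved spans.
    def answer(self, question: str, contexts: list[dict]) -> dict: ...

def run_benchmark(qa_records: list[dict], retriever: Retriever,
                  reader: Reader, k: int = 5) -> dict:
    per_category: dict[str, list[int]] = {}
    for rec in qa_records:
        contexts = retriever.search(rec["question"], k=k)        # dual-index: text + table rows
        pred = reader.answer(rec["question"], contexts)
        em = int(pred["answer"].strip().lower() == rec["answer"].strip().lower())
        per_category.setdefault(rec["category"], []).append(em)
    # Per-category EM, usable for tracking drift across ESG categories.
    return {cat: sum(v) / len(v) for cat, v in per_category.items()}
```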

7. Broader Implications and Future Directions

The methodology and design principles exemplified by ESGBench provide actionable guidance for constructing explainable, auditable QA pipelines in ESG and related compliance domains:

  • Every model output should possess an immutable evidence trail linking natural-language answers to primary-source ESG data, enforcing transparent and accountable automated reporting.
  • Modular separation of retrieval, reasoning, and justification stages accelerates evaluation and benchmarking of specialized components (retrieval models, answer generators, explanation generators).
  • The presence of per-question, per-category, and composite explainability metrics enables systematic detection of model weaknesses and accurate measurement of progress along multiple axes—answer factuality, reasoning transparency, and traceable citation to original disclosures.

A plausible implication is that ESGBench accelerates research not just in accuracy, but in traceability, establishing a repeatable paradigm for QA evaluation wherever evidence-grounded trust and regulatory auditability are paramount (George et al., 20 Nov 2025).
