ESGBench: Explainable ESG QA

Updated 27 November 2025

ESGBench is a benchmark dataset and modular evaluation framework designed for explainable ESG question answering using annotated sustainability reports.
It employs a granular annotation schema, formal question taxonomy, and explainability metrics like Exact Match, String F1, Numeric Accuracy, and Retrieval Recall.
Baseline evaluations highlight challenges such as numeric ambiguity and retrieval bottlenecks, guiding improvements for transparent and robust ESG QA systems.

ESGBench is a benchmark dataset and modular evaluation framework for explainable question answering (QA) on corporate sustainability reports, focusing on Environmental, Social, and Governance (ESG) themes. Designed to facilitate fine-grained assessment of model reasoning in high-stakes, transparency-sensitive ESG contexts, ESGBench offers question–answer pairs annotated with human-curated answers and explicit traceability to supporting evidence, thus operationalizing explainability for both model benchmarking and system development (George et al., 20 Nov 2025).

1. Composition and Annotation Schema

ESGBench v0.1 comprises 119 question–answer pairs derived from 12 PDF sustainability or Task Force on Climate-related Financial Disclosures (TCFD) reports, covering 10 large public companies. Each QA pair is associated with a categorical ESG theme, distributed as follows:

ESG Category	QA Pairs	Percentage
Environment (E)	50	≈42%
Social (S)	14	≈12%
Governance (G)	23	≈19%
Strategy	11	≈9%
Risk	2	≈2%

All records follow a granular annotation schema stored in JSONL:

company
doc
category (Environmental, Social, Governance, Strategy, Risk)
KPI_name
question
answer (verbatim string)
evidence_quote (verbatim supporting text)
page_num (integer)

Annotation guidelines mandate that each question must be answerable solely from its provided chunk or table. Answers are extracted verbatim with numerical units preserved. The evidence span is directly copied, creating a one-to-one mapping between system output and ground-truth context, which underpins explainability (George et al., 20 Nov 2025).
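As a rough illustration of the schema above, one record could be modeled and parsed as follows. The field names come from the list above; the names `ESGRecord`, `parse_record`, and `answer_is_traceable`, and every sample value, are hypothetical. The substring check is only an assumption about how the verbatim-extraction guideline might be enforced, not a documented ESGBench validator.

```python
import json
from dataclasses import dataclass


@dataclass
class ESGRecord:
    """One JSONL record; field names follow the schema listed above."""
    company: str
    doc: str
    category: str
    KPI_name: str
    question: str
    answer: str
    evidence_quote: str
    page_num: int


def parse_record(line: str) -> ESGRecord:
    """Parse one JSONL line; raises TypeError on missing or unexpected fields."""
    return ESGRecord(**json.loads(line))


def answer_is_traceable(rec: ESGRecord) -> bool:
    """Assumed check: the verbatim answer should appear inside the evidence quote."""
    return rec.answer.strip().lower() in rec.evidence_quote.lower()


# Made-up sample values, for illustration only (not taken from the dataset).
sample_line = json.dumps({
    "company": "ExampleCorp",
    "doc": "examplecorp_sustainability_report.pdf",
    "category": "Environmental",
    "KPI_name": "Scope 1 emissions",
    "question": "What were Scope 1 emissions in the reporting year?",
    "answer": "120 ktCO2e",
    "evidence_quote": "Scope 1 emissions were 120 ktCO2e in the reporting year.",
    "page_num": 12,
})

record = parse_record(sample_line)
print(record.category, record.page_num, answer_is_traceable(record))
```

Keeping `page_num` as an integer field makes it straightforward to compare against retrieved pages later in the evaluation.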

2. Question Taxonomy and Formal Definitions

Although not explicitly labeled, ESGBench QA pairs can be mapped to a standard question taxonomy:

Factoid Questions: Require retrieval of a single, contiguous span $a$ from the document. Formally, for a document $D$ tokenized as the sequence $s$, the answer to question $q$ is $a = s[i:j]$ with $i < j$.
Explanatory Questions: Require multi-sentence or paragraph-length answers formed by concatenating multiple factual spans with minimal connective text. Formally, the answer is built from a set of spans $\{s[i_1:j_1], s[i_2:j_2], \dots\}$.
Comparative Questions: Require a relative comparison such as “higher than” or “compare X vs. Y”. These are answered by identifying factual spans $a_1, a_2$ and stating the relational property $R(a_1, a_2)$.

This taxonomy ensures coverage of direct fact lookup, reasoning across multiple pieces of evidence, and domain-specific comparative analysis (George et al., 20 Nov 2025).
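The three question types can be made concrete with a toy token sequence. This is a minimal sketch: the sentence is invented, the helpers `span`, `leading_number`, and `relation` are hypothetical, and the comparison assumes both spans start with a number expressed in the same unit.

```python
from typing import List


def span(s: List[str], i: int, j: int) -> str:
    """Return the contiguous span s[i:j] as text, requiring 0 <= i < j <= len(s)."""
    if not 0 <= i < j <= len(s):
        raise ValueError("span indices must satisfy 0 <= i < j <= len(s)")
    return " ".join(s[i:j])


def leading_number(a: str) -> float:
    """Read the first token of a span as a number (assumes a 'value unit' layout)."""
    return float(a.split()[0])


def relation(a1: str, a2: str) -> str:
    """A simple R(a1, a2): compare the leading numbers of two same-unit spans."""
    v1, v2 = leading_number(a1), leading_number(a2)
    if v1 > v2:
        return "higher than"
    if v1 < v2:
        return "lower than"
    return "equal to"


# Invented sentence standing in for a tokenized document s.
s = "Scope 1 emissions fell to 120 ktCO2e in 2023 from 150 ktCO2e in 2022".split()

factoid = span(s, 5, 7)                          # single span a = s[i:j]
explanatory = [span(s, 0, 3), span(s, 5, 9)]     # set of spans to be stitched together
comparative = relation(span(s, 5, 7), span(s, 10, 12))  # R(a1, a2)

print(factoid)
print(explanatory)
print(comparative)
```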

3. Evaluation Metrics for Explainability

ESGBench advances four complementary explainability metrics, each precisely defined:

Exact Match (EM): Binary metric. $\mathrm{EM}(p, g) = 1$ if lowercased prediction $p$ matches gold answer $g$; 0 otherwise.
String F1: Measures overlap of tokens between prediction and ground truth. $\mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$, with precision and recall computed on token sets.
Numeric Accuracy@±2%: For extracted numeric values $v_p, v_g$ (units must match), $\text{Numeric Accuracy} = 1$ if $\left| \frac{v_p - v_g}{v_g} \right| \leq 0.02$, else 0.
Retrieval Recall@K: $\mathrm{R}@K = 1$ if the ground-truth evidence page is within the top-K retrieved by the system; 0 otherwise. ESGBench sets $K=5$.
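The four definitions above translate fairly directly into code. The sketch below is one possible reading, not the reference implementation. It assumes whitespace tokenization and that both strings are lowercased and stripped before comparison, and it treats a zero gold value as a special case. The function names are illustrative.

```python
from typing import Sequence


def exact_match(pred: str, gold: str) -> int:
    """EM = 1 if the normalized strings are identical (assumes both sides are lowercased)."""
    return int(pred.strip().lower() == gold.strip().lower())


def string_f1(pred: str, gold: str) -> float:
    """Token-set F1, using whitespace tokens (an assumption about tokenization)."""
    p, g = set(pred.lower().split()), set(gold.lower().split())
    common = p & g
    if not common:
        return 0.0
    precision = len(common) / len(p)
    recall = len(common) / len(g)
    return 2 * precision * recall / (precision + recall)


def numeric_accuracy(v_pred: float, v_gold: float, tol: float = 0.02) -> int:
    """1 if the relative error is within tol; values are assumed to share a unit."""
    if v_gold == 0:
        # The relative error is undefined here; requiring an exact zero is an assumption.
        return int(v_pred == 0)
    return int(abs((v_pred - v_gold) / v_gold) <= tol)


def recall_at_k(retrieved_pages: Sequence[int], gold_page: int, k: int = 5) -> int:
    """1 if the gold evidence page is among the top-k retrieved pages."""
    return int(gold_page in retrieved_pages[:k])


print(exact_match("120 ktCO2e", "120 KTCO2E"))
print(round(string_f1("120 ktCO2e in 2023", "120 ktCO2e"), 3))
print(numeric_accuracy(101.9, 100.0), numeric_accuracy(103.0, 100.0))
print(recall_at_k([3, 7, 12, 1, 9, 4], gold_page=4))
```

In the last call the gold page sits at rank 6, so it falls outside the top 5 and does not count as a hit.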

Composite explainability scoring schemes are supported, such as $S = \alpha\,\mathrm{EM} + \beta\,\mathrm{F1} + \gamma\,\mathrm{R}@K$, with tunable weights subject to $\alpha+\beta+\gamma=1$ (George et al., 20 Nov 2025).
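A small sketch of the composite score, with the weight constraint checked explicitly. The function name `composite_score` and the example weights are placeholders chosen for illustration; the source does not prescribe particular values.

```python
import math


def composite_score(em: float, f1: float, r_at_k: float,
                    *, alpha: float, beta: float, gamma: float) -> float:
    """S = alpha*EM + beta*F1 + gamma*R@K, requiring alpha + beta + gamma = 1."""
    if not math.isclose(alpha + beta + gamma, 1.0):
        raise ValueError("weights must sum to 1")
    return alpha * em + beta * f1 + gamma * r_at_k


# Illustrative weights only: a wrong exact match, partial token overlap, evidence retrieved.
score = composite_score(em=0, f1=0.6, r_at_k=1, alpha=0.4, beta=0.3, gamma=0.3)
print(round(score, 2))
```

With these placeholder weights, an answer that misses exact match still earns partial credit when its evidence page is retrieved.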

4. Baseline Performance and Key Challenges

Baseline evaluation using a standard retrieval-augmented generation (RAG) QA system on ESGBench v0.1 yielded the following results:

Metric	Value
Exact Match	21.0%
String F1 (avg)	55.4%
Numeric Accuracy@±2%	45.3%
Retrieval Recall@5	70–80%

Per-category numeric accuracy: Environmental (48.0%), Social (35.7%), Governance (43.5%), Strategy (90.9%).

Principal challenges identified include:

Numeric ambiguity: Variations in units (tCO₂e vs. ktCO₂e; “million” vs. “thousand”) impede direct string and numeric matches.
Retrieval bottlenecks: Failure to surface the relevant sentence or table row often limits end-to-end QA accuracy.
Semantic drift: Embedding models sometimes retrieve topically similar but substantively incorrect contexts.

These deficiencies propagate through downstream pipeline stages, highlighting the centrality of evidence recall for transparent QA (George et al., 20 Nov 2025).
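To illustrate the numeric-ambiguity problem, the sketch below converts quantities to a canonical unit before comparison. The scale words and the tCO₂e/ktCO₂e pair come from the examples above. The regular expression, the lookup tables, and the function `normalize_quantity` are hypothetical and cover only simple "number [scale word] [unit]" strings.

```python
import math
import re
from typing import Tuple

# Assumed lookup tables; extend as needed for other units and scale words.
SCALE_WORDS = {"thousand": 1e3, "million": 1e6, "billion": 1e9}
UNIT_FACTORS = {"tco2e": ("tCO2e", 1.0), "ktco2e": ("tCO2e", 1e3)}

PATTERN = re.compile(
    r"^\s*([-+]?\d[\d,]*\.?\d*)\s*(thousand|million|billion)?\s*([A-Za-z0-9%]+)?\s*$",
    re.IGNORECASE,
)


def normalize_quantity(text: str) -> Tuple[float, str]:
    """Return (value, canonical_unit) for strings like '1.2 ktCO₂e' or '3.5 million USD'."""
    match = PATTERN.match(text.replace("₂", "2"))  # fold the subscript two to ASCII
    if match is None:
        raise ValueError(f"cannot parse quantity: {text!r}")
    number = float(match.group(1).replace(",", ""))
    scale = SCALE_WORDS.get((match.group(2) or "").lower(), 1.0)
    unit_raw = match.group(3) or ""
    unit, factor = UNIT_FACTORS.get(unit_raw.lower(), (unit_raw, 1.0))
    return number * scale * factor, unit


v1, u1 = normalize_quantity("1.2 ktCO₂e")
v2, u2 = normalize_quantity("1,200 tCO2e")
print(u1 == u2 and math.isclose(v1, v2))
```

After normalization the two strings describe the same quantity, even though a direct string comparison would treat them as different answers.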

5. Architectural Recommendations and Best Practices

The ESGBench framework suggests several practices for maximizing explainability and performance:

Dual Indexing: Maintain separate indexes for narrative text and table rows; leverage table-aware embeddings for structured data.
Domain-Adapted Retrieval: Apply ESG-specific encoder models (e.g., ClimateBERT, FinBERT) to improve context relevance.
Layout and Unit Normalization: Integrate parsers such as Camelot, supplemented with rules for standardizing units and scale words.
Reasoning Enhancement: Chain-of-thought (CoT) and multi-hop prompting techniques are especially effective for explanatory and comparative QA.
Monitoring: Track per-category performance to diagnose systematic model drift; flag answers when sources are outside recognized ESG taxonomies such as TCFD, CSRD, BRSR, or ISSB (George et al., 20 Nov 2025).
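The monitoring recommendation can be sketched as a per-category aggregation compared against a reference run. The helper names, the `max_drop` threshold, and the toy results are hypothetical. The baseline figures are the per-category numeric accuracies reported in Section 4.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def per_category_accuracy(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Average a binary correctness flag per ESG category."""
    totals = defaultdict(lambda: [0, 0])  # category -> [hits, count]
    for category, correct in results:
        totals[category][0] += int(correct)
        totals[category][1] += 1
    return {c: hits / n for c, (hits, n) in totals.items()}


def flag_drops(current: Dict[str, float], baseline: Dict[str, float],
               max_drop: float = 0.05) -> List[str]:
    """List categories whose accuracy fell more than max_drop below the baseline."""
    return sorted(c for c in current
                  if c in baseline and baseline[c] - current[c] > max_drop)


# Baseline numeric accuracy per category, as reported in Section 4.
baseline = {"Environmental": 0.480, "Social": 0.357,
            "Governance": 0.435, "Strategy": 0.909}

# Toy results for a later run (invented for illustration).
run = [("Environmental", True), ("Environmental", False),
       ("Strategy", True), ("Strategy", False), ("Strategy", False)]

current = per_category_accuracy(run)
print(current)
print(flag_drops(current, baseline))
```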

6. Implications for Transparent ESG QA Research

ESGBench lowers the barrier to methodological innovation in ESG QA by providing a reproducible, transparent pipeline (ingestion → indexing → QA generation → evaluation). Key implications:

Rapid iteration: Modular design facilitates benchmarking of retrieval strategies and reader architectures.
Auditable output: Verbatim evidence spans directly support answer traceability and post-hoc validation.
Taxonomy alignment: Linking outputs to evolving regulatory standards improves compliance monitoring and system relevance.
Extensibility: The schema and pipeline support multilingual expansion and coverage of additional ESG taxonomies.
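The modularity described above can be pictured as a harness in which the retriever and reader are swappable callables. This is only a sketch under assumed interfaces: the type aliases, `run_benchmark`, and the toy components are hypothetical and do not reflect ESGBench's actual code.

```python
from typing import Callable, Dict, List

# Assumed interfaces: a retriever maps (question, k) to ranked page numbers,
# and a reader maps (question, pages) to an answer string.
Retriever = Callable[[str, int], List[int]]
Reader = Callable[[str, List[int]], str]


def run_benchmark(examples: List[Dict], retrieve: Retriever, read: Reader,
                  k: int = 5) -> Dict[str, float]:
    """Run retrieval and reading per example, then average Recall@K and Exact Match."""
    if not examples:
        raise ValueError("no examples to evaluate")
    hits = matches = 0
    for ex in examples:
        pages = retrieve(ex["question"], k)
        prediction = read(ex["question"], pages)
        hits += int(ex["page_num"] in pages[:k])
        matches += int(prediction.strip().lower() == ex["answer"].strip().lower())
    n = len(examples)
    return {"recall_at_k": hits / n, "exact_match": matches / n}


# Toy components and data, invented for illustration.
toy_examples = [
    {"question": "q1", "answer": "120 ktCO2e", "page_num": 12},
    {"question": "q2", "answer": "35%", "page_num": 40},
]


def toy_retrieve(question: str, k: int) -> List[int]:
    return [12, 3, 7][:k]


def toy_read(question: str, pages: List[int]) -> str:
    return "120 ktCO2e"


scores = run_benchmark(toy_examples, toy_retrieve, toy_read)
print(scores)
```

Swapping in a different retriever or reader only requires passing a different callable, which is the kind of rapid iteration the modular design is meant to support.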

A plausible implication is that incorporating ESGBench’s metrics and evidence-grounding within QA system development can promote higher trust and auditability in automated sustainability analysis, which is critical for both regulatory and stakeholder-facing applications (George et al., 20 Nov 2025).

7. Connection to Neurosymbolic QA and Future Directions

The modular, evidence-centric methodology established in ESGBench is synergistic with emerging neurosymbolic QA frameworks, such as those using logical inference engines (e.g., Prolog-based systems described in ProSLM (Vakharia et al., 2024)). These paradigms combine context gathering, symbolic proof generation, and LLM-based natural language reasoning with explicit fact validation, presenting a credible path toward fully transparent and robust ESG QA pipelines. Further research integrating ESGBench benchmarking with symbolic logic validation and user-personalized explanations is likely to accelerate advances in both ESG transparency and high-assurance QA (George et al., 20 Nov 2025, Vakharia et al., 2024).

References

George et al. (2025). ESGBench: A Benchmark for Explainable ESG Question Answering in Corporate Sustainability Reports.

Vakharia et al. (2024). ProSLM: A Prolog Synergized Language Model for explainable Domain Specific Knowledge Based Question Answering.
