ESGBench: Explainable ESG QA
- ESGBench is a benchmark dataset and modular evaluation framework designed for explainable ESG question answering using annotated sustainability reports.
- It employs a granular annotation schema, a formal question taxonomy, and explainability metrics such as Exact Match, String F1, Numeric Accuracy, and Retrieval Recall.
- Baseline evaluations highlight challenges such as numeric ambiguity and retrieval bottlenecks, guiding improvements for transparent and robust ESG QA systems.
ESGBench is a benchmark dataset and modular evaluation framework for explainable question answering (QA) on corporate sustainability reports, focusing on Environmental, Social, and Governance (ESG) themes. Designed for fine-grained assessment of model reasoning in high-stakes, transparency-sensitive ESG contexts, ESGBench provides question–answer pairs with human-curated answers and explicit traceability to supporting evidence, operationalizing explainability for both model benchmarking and system development (George et al., 20 Nov 2025).
1. Composition and Annotation Schema
ESGBench v0.1 comprises 119 question–answer pairs derived from 12 PDF sustainability or Task Force on Climate-related Financial Disclosures (TCFD) reports, covering 10 large public companies. Each QA pair is associated with a categorical ESG theme, distributed as follows:
| ESG Category | QA Pairs | Percentage |
|---|---|---|
| Environment (E) | 50 | ≈42% |
| Social (S) | 14 | ≈12% |
| Governance (G) | 23 | ≈19% |
| Strategy | 11 | ≈9% |
| Risk | 2 | ≈2% |
All records follow a granular annotation schema stored in JSONL:
- `company`
- `doc`
- `category` (Environmental, Social, Governance, Strategy, Risk)
- `KPI_name`
- `question`
- `answer` (verbatim string)
- `evidence_quote` (verbatim supporting text)
- `page_num` (integer)
Annotation guidelines mandate that each question be answerable solely from its provided chunk or table. Answers are extracted verbatim with numerical units preserved. The evidence span is directly copied, creating a one-to-one mapping between system output and ground-truth context, which underpins explainability (George et al., 20 Nov 2025).
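To make the schema concrete, the following is a minimal sketch of loading one record; the field names follow the schema above, while the company, figures, and page number are invented for illustration:

```python
import json

# One hypothetical ESGBench-style JSONL record; field names follow the
# published schema, but every value is invented for illustration.
line = json.dumps({
    "company": "ExampleCorp",
    "doc": "examplecorp_tcfd_2024.pdf",
    "category": "Environmental",
    "KPI_name": "Scope 1 emissions",
    "question": "What were ExampleCorp's total Scope 1 emissions in FY2024?",
    "answer": "1.2 million tCO2e",
    "evidence_quote": "In FY2024, our Scope 1 emissions totalled 1.2 million tCO2e.",
    "page_num": 47,
})

record = json.loads(line)
# The guidelines require the verbatim answer to appear inside its evidence
# span, giving the one-to-one output-to-context mapping described above.
assert record["answer"] in record["evidence_quote"]
```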
2. Question Taxonomy and Formal Definitions
Although not explicitly labeled, ESGBench QA pairs can be mapped to a standard question taxonomy:
- Factoid Questions: Require retrieval of a single, contiguous span from the document. Formally, for a document tokenized as $D = (t_1, \dots, t_n)$, the answer to question $q$ is a span $a = (t_i, \dots, t_j)$ with $1 \le i \le j \le n$.
- Explanatory Questions: Require multi-sentence or paragraph-length answers formed by concatenating multiple factual spans with minimal connective text. The formal structure is a set of spans $A = \{a_1, \dots, a_k\}$.
- Comparative Questions: Require a relative comparison such as “higher than” or “compare X vs. Y”. These are answered by identifying factual spans $a_1$ and $a_2$ and stating a relational property $R(a_1, a_2)$.
This taxonomy ensures coverage of direct fact lookup, reasoning across multiple pieces of evidence, and domain-specific comparative analysis (George et al., 20 Nov 2025).
3. Evaluation Metrics for Explainability
ESGBench advances four complementary explainability metrics, each precisely defined:
- Exact Match (EM): Binary metric; $\mathrm{EM} = 1$ if the lowercased prediction $\hat{a}$ matches the gold answer $a$, and 0 otherwise.
- String F1: Measures token overlap between prediction and ground truth, $F_1 = \frac{2PR}{P + R}$, with precision $P$ and recall $R$ computed on token sets.
- Numeric Accuracy@±2%: For extracted numeric values (units must match), the score is 1 if $|\hat{v} - v| \le 0.02\,|v|$ for predicted value $\hat{v}$ and gold value $v$, else 0.
- Retrieval Recall@K: 1 if the ground-truth evidence page is within the top-$K$ results retrieved by the system; 0 otherwise. ESGBench sets $K = 5$.
Composite explainability scoring schemes are supported, such as $S = w_1\,\mathrm{EM} + w_2\,F_1 + w_3\,\mathrm{NumAcc} + w_4\,\mathrm{Recall@}K$, with tunable weights $w_i$ subject to $\sum_i w_i = 1$ (George et al., 20 Nov 2025).
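A self-contained sketch of these metrics and the composite score follows; the whitespace tokenization and equal default weights are illustrative assumptions, not the paper's reference implementation:

```python
from collections import Counter

def exact_match(pred: str, gold: str) -> int:
    """EM: 1 if the lowercased, whitespace-normalized strings match exactly."""
    return int(" ".join(pred.lower().split()) == " ".join(gold.lower().split()))

def string_f1(pred: str, gold: str) -> float:
    """Token-level F1 = 2PR / (P + R), with P and R over bag-of-token overlap."""
    p_toks, g_toks = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p_toks) & Counter(g_toks)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p_toks), overlap / len(g_toks)
    return 2 * precision * recall / (precision + recall)

def numeric_accuracy(pred_value: float, gold_value: float, tol: float = 0.02) -> int:
    """1 if the prediction is within ±2% relative tolerance (units assumed matched)."""
    return int(abs(pred_value - gold_value) <= tol * abs(gold_value))

def retrieval_recall_at_k(retrieved_pages: list[int], gold_page: int, k: int = 5) -> int:
    """1 if the ground-truth evidence page appears among the top-K retrieved pages."""
    return int(gold_page in retrieved_pages[:k])

def composite_score(em, f1, num_acc, recall, weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    """S = w1*EM + w2*F1 + w3*NumAcc + w4*Recall@K, with weights summing to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * m for w, m in zip(weights, (em, f1, num_acc, recall)))
```

The equal weights are placeholders; in practice they would be tuned to a deployment's relative tolerance for retrieval versus extraction errors.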
4. Baseline Performance and Key Challenges
Baseline evaluation using a standard retrieval-augmented generation (RAG) QA system on ESGBench v0.1 yielded the following results:
| Metric | Value |
|---|---|
| Exact Match | 21.0% |
| String F1 (avg) | 55.4% |
| Numeric Accuracy@±2% | 45.3% |
| Retrieval Recall@5 | 70–80% |
Per-category numeric accuracy: Environmental (48.0%), Social (35.7%), Governance (43.5%), Strategy (90.9%).
Principal challenges identified include:
- Numeric ambiguity: Variations in units (tCO₂e vs. ktCO₂e; “million” vs. “thousand”) impede direct string and numeric matches.
- Retrieval bottlenecks: Failure to surface the relevant sentence or table row often limits end-to-end QA accuracy.
- Semantic drift: Embedding models sometimes retrieve topically similar but substantively incorrect contexts.
These deficiencies propagate through downstream pipeline stages, highlighting the centrality of evidence recall for transparent QA (George et al., 20 Nov 2025).
5. Architectural Recommendations and Best Practices
The ESGBench framework suggests several practices for maximizing explainability and performance:
- Dual Indexing: Maintain separate indexes for narrative text and table rows; leverage table-aware embeddings for structured data.
- Domain-Adapted Retrieval: Apply ESG-specific encoder models (e.g., ClimateBERT, FinBERT) to improve context relevance.
- Layout and Unit Normalization: Integrate table/layout parsers such as Camelot, supplemented with rules for standardizing units and scale words (a sketch follows this list).
- Reasoning Enhancement: Chain-of-thought (CoT) and multi-hop prompting techniques are especially effective for explanatory and comparative QA.
- Monitoring: Track per-category performance to diagnose systematic model drift; flag answers when sources are outside recognized ESG taxonomies such as TCFD, CSRD, BRSR, or ISSB (George et al., 20 Nov 2025).
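As an illustration of the unit- and scale-normalization rules recommended above, the sketch below maps reported emissions figures onto a common base unit; the conversion tables and regular expression are assumptions for illustration, not part of ESGBench itself:

```python
import math
import re

# Hypothetical scale words and unit multipliers; a production pipeline
# would extend these tables to the full range of reported ESG units.
SCALE_WORDS = {"thousand": 1e3, "million": 1e6, "billion": 1e9}
UNIT_FACTORS = {"tco2e": 1.0, "ktco2e": 1e3, "mtco2e": 1e6}  # base unit: tCO2e

VALUE_RE = re.compile(
    r"(?P<num>[\d,]+(?:\.\d+)?)\s*(?P<scale>thousand|million|billion)?\s*(?P<unit>[km]?tco2e)",
    re.IGNORECASE,
)

def normalize_emissions(text: str) -> float | None:
    """Parse a figure like '1.2 million tCO2e' or '450 ktCO2e' into plain tCO2e."""
    m = VALUE_RE.search(text)
    if m is None:
        return None
    value = float(m.group("num").replace(",", ""))
    value *= SCALE_WORDS.get((m.group("scale") or "").lower(), 1.0)
    value *= UNIT_FACTORS[m.group("unit").lower()]
    return value

# '1.2 million tCO2e' and '1,200 ktCO2e' normalize to the same base value,
# so Numeric Accuracy@±2% no longer fails on surface-form differences.
assert math.isclose(normalize_emissions("1.2 million tCO2e"),
                    normalize_emissions("1,200 ktCO2e"))
```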
6. Implications for Transparent ESG QA Research
ESGBench demonstrably lowers the barrier for methodological innovation in ESG QA by providing a reproducible, transparent pipeline (ingestion → indexing → QA generation → evaluation), sketched after the list below. Key implications:
- Rapid iteration: Modular design facilitates benchmarking of retrieval strategies and reader architectures.
- Auditable output: Verbatim evidence spans directly support answer traceability and post-hoc validation.
- Taxonomy alignment: Linking outputs to evolving regulatory standards improves compliance monitoring and system relevance.
- Extensibility: The schema and pipeline support multilingual expansion and coverage of additional ESG taxonomies.
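A minimal skeleton of the four-stage pipeline might be wired as follows; the stage names mirror the pipeline above, but the function signatures are hypothetical, since the paper does not publish an API:

```python
from dataclasses import dataclass

@dataclass
class QARecord:
    """One benchmark item, mirroring the annotation schema's key fields."""
    question: str
    answer: str
    evidence_quote: str
    page_num: int

# Hypothetical stage interfaces: each stage is independently swappable,
# which is what enables rapid benchmarking of retrievers and readers.
def ingest(pdf_path: str) -> list[str]:
    """Parse a report into page-level text chunks (layout handling omitted)."""
    raise NotImplementedError

def build_index(chunks: list[str]) -> object:
    """Build a retrieval index (dense, sparse, or dual text/table indexes)."""
    raise NotImplementedError

def generate_answers(index: object, records: list[QARecord]) -> list[str]:
    """Run the retriever + reader over each benchmark question."""
    raise NotImplementedError

def evaluate(predictions: list[str], records: list[QARecord]) -> dict[str, float]:
    """Score predictions with EM, String F1, Numeric Accuracy, Recall@K."""
    raise NotImplementedError

def run_benchmark(pdf_paths: list[str], records: list[QARecord]) -> dict[str, float]:
    """Compose the stages end to end over a set of reports."""
    chunks = [c for p in pdf_paths for c in ingest(p)]
    return evaluate(generate_answers(build_index(chunks), records), records)
```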
A plausible implication is that incorporating ESGBench’s metrics and evidence-grounding within QA system development can promote higher trust and auditability in automated sustainability analysis, which is critical for both regulatory and stakeholder-facing applications (George et al., 20 Nov 2025).
7. Connection to Neurosymbolic QA and Future Directions
The modular, evidence-centric methodology established in ESGBench is synergistic with emerging neurosymbolic QA frameworks, such as those using logical inference engines (e.g., Prolog-based systems described in ProSLM (Vakharia et al., 17 Sep 2024)). These paradigms combine context gathering, symbolic proof generation, and LLM-based natural language reasoning with explicit fact validation, presenting a credible path toward fully transparent and robust ESG QA pipelines. Further research integrating ESGBench benchmarking with symbolic logic validation and user-personalized explanations is likely to accelerate advances in both ESG transparency and high-assurance QA (George et al., 20 Nov 2025, Vakharia et al., 17 Sep 2024).