EmeraldGraph: ESG Claims Knowledge Graph

Updated 19 December 2025

EmeraldGraph is a specialized ESG knowledge graph that integrates structured corporate disclosures, regulatory guidelines, and curated greenwashing claims.
It employs a multi-step extraction pipeline using text parsing, multimodal vision-language models, and LLM-based triple extraction for precise data representation.
It supports transparent, evidence-backed greenwashing detection through fine-grained retrieval and schema-driven reasoning over sustainability claims.

EmeraldGraph is a domain-specific, labeled property knowledge graph constructed to support automated detection of greenwashing (i.e., misleading corporate sustainability claims) by surfacing verifiable evidence from corporate ESG (environmental, social, governance) disclosures. As the foundational component of the EmeraldMind framework, EmeraldGraph integrates highly structured, company-centered data from regulatory guidelines and ESG reports and explicitly represents the entities, relationships, observations, targets, and claims relevant to sustainability reporting. The graph is designed to address the inadequacies of generic open-domain knowledge bases by enabling fine-grained retrieval, justification-centric reasoning, and transparent abstention in claim assessment (Kaoukis et al., 12 Dec 2025).

1. Data Sources and Extraction

EmeraldGraph is constructed from three principal corpora:

Corporate ESG Reports: Annual PDFs published by each target organization, consisting of narrative sections (e.g., “Strategy,” “Governance,” “Risk Management”), tabular KPI disclosures (e.g., scopes 1–3 greenhouse gas emissions tables), and visual materials such as charts and figures (e.g., emissions-by-facility bar charts).
Regulatory KPI Definitions: Standardized KPIs from sources including the EFFAS–DVFA “Key Performance Indicators for ESG Issues” and the EU CSRD/SFDR indicator list.
Seed Claims for Greenwashing: A hand-curated subset of labeled claims from the public GreenClaims dataset.

Extraction involves a three-stage pipeline:

PDF Parsing (Text Channel): PyMuPDF is used to recover structured textual elements (paragraphs, headers, tables) and normalize table cells for consistent key–value extraction.
Vision-Language Parsing (Multimodal Channel): Each page is rasterized and processed by a vision-language model (e.g., DONUT [Kim et al., 2022]), which identifies axes, series, units, and captions in figures and charts and emits structured descriptions such as “Figure 1 shows CO₂ emissions (tCO₂e) by Region.”
Alignment and Deduplication: Textual and visual extractions are aligned by page and table coordinates, with overlapping or redundant spans deduplicated to yield a unified parsed representation.
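
The alignment and deduplication stage can be illustrated with a minimal sketch. It assumes each extracted span carries a page number and a bounding box, and it treats two spans on the same page as redundant when their boxes overlap strongly. The `Span` type, the intersection-over-union test, the 0.5 threshold, and the preference for the text channel are all illustrative assumptions, not details confirmed by the source.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    page: int
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates
    text: str
    channel: str  # "text" (PDF parser) or "vision" (vision-language model)

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def merge_channels(spans, iou_threshold=0.5):
    """Keep one span per overlapping region, preferring the text channel."""
    # Sort text-channel spans first within each page so they win over vision spans.
    ordered = sorted(spans, key=lambda s: (s.page, s.channel != "text"))
    kept = []
    for s in ordered:
        duplicate = any(
            k.page == s.page and iou(k.bbox, s.bbox) >= iou_threshold for k in kept
        )
        if not duplicate:
            kept.append(s)
    return kept

spans = [
    Span(3, (50, 100, 300, 200), "Scope 1: 1200 tCO2e", "text"),
    Span(3, (52, 102, 298, 198), "Scope 1 1200 tCO2e", "vision"),  # same table cell
    Span(3, (50, 400, 300, 600), "Figure 1 shows CO2 emissions (tCO2e) by Region.", "vision"),
]
unified = merge_channels(spans)
```

Here the vision-channel copy of the table cell is dropped because it overlaps the text-channel span, while the chart description survives because nothing else covers that region.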

This parsed representation then undergoes LLM-based information extraction. For each paragraph or table row, an LLM is prompted to produce candidate triples ⟨subject_mention, predicate_mention, object_mention⟩, with mentions linked to schema types (e.g., Organization, KPIObservation) via LLM-driven classification.

2. Graph Schema and Structure

EmeraldGraph is formalized as a labeled property graph:

$G = (V, E, T, L, \tau, S, \mathbb{B}, p)$

with the following components:

Node set, $V$ : entities such as companies, facilities, KPI observations, goals, initiatives, sustainability claims.
Type set, $T$ : node types, including Organization, Facility, Location, KPIObservation, Goal, Initiative, SustainabilityClaim.
Label set, $L$ : relationship labels (e.g., reportsKPI, setsGoal, locatedIn, takesPartIn, claims, hasValue).
Type assignment, $\tau: V \rightarrow T$ : maps each node to its type.
Schema, $S \subseteq T \times L \times T$ : ontology of allowed edges, as the union of:
- $S_{\text{data}}$ : induced frequent patterns from the corpus,
- $S_{\text{reg}}$ : official KPI-reporting relations from regulatory definitions,
- $S_{\text{claim}}$ : edge patterns abstracted from known greenwashing claims.
Typed attributes, $\mathbb{B}$ (e.g., value, unit, year for KPIObservation).
Properties mapping, $p: (V \cup E) \rightarrow \mathbb{B}$ .

Example triples:

⟨AcmeCorp, reportsKPI, CO2_Obs_2023⟩
⟨CO2_Obs_2023, hasValue, {"value": 1200,"unit":"tCO₂e","year":2023}⟩
⟨AcmeCorp, setsGoal, NetZero2050⟩
⟨NetZero2050, targetYear, {"year":2050}⟩
⟨Facility_F1, locatedIn, Region_NorthAmerica⟩
⟨AcmeCorp, claims, “Reduced Scope 1 emissions by 10%”⟩
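
To make the formal definition concrete, the following sketch encodes a small part of the graph in Python. The node types and relationship labels come from the text, but the specific (type, label, type) patterns placed in `SCHEMA` are an illustrative guess at what $S$ might contain, and storing `hasValue` and `targetYear` payloads as node properties is a modeling assumption made here for brevity.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    node_type: str                             # tau(v): the node's schema type
    props: dict = field(default_factory=dict)  # p(v): typed attributes

# S as a set of (source type, label, target type) patterns.
# Illustrative subset only; the real schema is the union of
# S_data, S_reg, and S_claim described above.
SCHEMA = {
    ("Organization", "reportsKPI", "KPIObservation"),
    ("Organization", "setsGoal", "Goal"),
    ("Organization", "claims", "SustainabilityClaim"),
    ("Facility", "locatedIn", "Location"),
}

def edge_allowed(src, label, dst, schema=SCHEMA):
    """True if the typed edge pattern (tau(src), label, tau(dst)) is in the schema."""
    return (src.node_type, label, dst.node_type) in schema

acme = Node("AcmeCorp", "Organization")
obs = Node("CO2_Obs_2023", "KPIObservation",
           {"value": 1200, "unit": "tCO₂e", "year": 2023})
goal = Node("NetZero2050", "Goal", {"year": 2050})

ok = edge_allowed(acme, "reportsKPI", obs)   # Organization -> KPIObservation
bad = edge_allowed(obs, "setsGoal", goal)    # KPIObservation cannot set a goal
```

Checking membership in a set of type patterns is what lets the graph reject edges that are syntactically well-formed but meaningless for ESG reporting.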

Distinctive features compared to generic KBs include explicit differentiation between “reportsKPI” and simple mentions, anchoring of observations to facilities and years, and encoding of both actuals and targets, which are typically underspecified in resources like Wikidata.

3. Graph Construction Pipeline

Construction is an explicit multi-step process:

Schema-guided Extraction: For each parsed triple $(u_m, \ell_m, v_m)$ , potential schema types $\tau(u_m), \tau(v_m)$ are assigned. Triples are only admitted if $(\tau(u_m), \ell_m, \tau(v_m)) \in S$ .
Entity Disambiguation and Deduplication: Mentions are embedded using a text-embedding model (e.g., 1536-dim OpenAI embeddings). Embedding-blocking [Li et al., 2024] clusters mentions via cosine similarity (>0.95) for canonicalization and provenance recording.
Confidence Scoring: Each triple $t = (u, \ell, v)$ is assigned $c(t) = \text{CosSim}(\text{emb}(u_m), \text{emb}(u)) \cdot \text{CosSim}(\text{emb}(v_m), \text{emb}(v))$ and accepted if $c(t) \geq \theta$ ( $\theta=0.2$ ).
Incremental Updates: New reports are processed and assimilated into the existing graph: each new node is merged into an existing node if $\|\text{emb}_{\text{new}}-\text{emb}_{\text{existing}}\|_{2}<\epsilon$ and appended as a new node otherwise, with edges type-checked and scored as above.
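
The entity deduplication step can be sketched as follows. This is a simplified greedy threshold clustering, not the embedding-blocking method of Li et al. that the text cites, and it uses hand-written two-dimensional toy vectors in place of a real text-embedding model. The `canonicalize` function, the `TOY_EMBEDDINGS` table, and the choice of the first mention as canonical name are all hypothetical.

```python
import math

def cos_sim(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def canonicalize(mentions, embed, threshold=0.95):
    """Greedy single-pass clustering of mentions by cosine similarity.

    Each mention joins the first cluster whose representative vector is more
    similar than `threshold`; otherwise it starts a new cluster. The member
    list of each cluster doubles as a provenance record.
    """
    clusters = []  # (representative vector, canonical name, member mentions)
    for mention in mentions:
        vec = embed(mention)
        for rep_vec, _name, members in clusters:
            if cos_sim(vec, rep_vec) > threshold:
                members.append(mention)
                break
        else:
            clusters.append((vec, mention, [mention]))
    return {name: members for _vec, name, members in clusters}

# Toy stand-in for a text-embedding model (assumption for illustration).
TOY_EMBEDDINGS = {
    "Acme Corp": (1.0, 0.0),
    "ACME Corporation": (0.99, 0.05),
    "Beta Ltd": (0.0, 1.0),
}

canonical = canonicalize(list(TOY_EMBEDDINGS), TOY_EMBEDDINGS.__getitem__)
```

With these toy vectors the two Acme spellings collapse into one canonical entity while "Beta Ltd" stays separate.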

Core insertion pseudocode:

for each parsed triple (u_m, ℓ_m, v_m):
  τ_u ← SchemaType(u_m)
  τ_v ← SchemaType(v_m)
  if (τ_u, ℓ_m, τ_v) ∉ S:
    continue
  u ← FindOrCreateNode(u_m, τ_u)  # embedding-blocking deduplication
  v ← FindOrCreateNode(v_m, τ_v)
  c ← CosSim(emb(u_m),emb(u)) * CosSim(emb(v_m),emb(v))
  if c ≥ θ:
    add_edge(u, ℓ_m, v, properties={confidence: c})
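
A runnable Python rendering of the pseudocode above might look like the sketch below. The schema check, the confidence product, and the threshold $\theta = 0.2$ follow the text; the helper callables (`schema_type`, `find_or_create`, `emb`) and the toy lookup tables that stand in for them are assumptions made so the example is self-contained.

```python
import math

THETA = 0.2  # acceptance threshold stated in the text

def cos_sim(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def insert_triples(parsed, schema, schema_type, find_or_create, emb, theta=THETA):
    """Admit schema-valid triples whose confidence reaches theta."""
    edges = []
    for u_m, label, v_m in parsed:
        t_u, t_v = schema_type(u_m), schema_type(v_m)
        if (t_u, label, t_v) not in schema:
            continue  # edge pattern not allowed by the schema
        u = find_or_create(u_m, t_u)  # canonical node for the subject mention
        v = find_or_create(v_m, t_v)  # canonical node for the object mention
        # Confidence: how closely each mention matches its canonical node.
        c = cos_sim(emb(u_m), emb(u)) * cos_sim(emb(v_m), emb(v))
        if c >= theta:
            edges.append((u, label, v, {"confidence": c}))
    return edges

# Toy stand-ins for the classifier, the deduplicator, and the embedder.
TYPES = {"Acme": "Organization", "CO2_Obs_2023": "KPIObservation", "NetZero2050": "Goal"}
ALIASES = {"Acme": "AcmeCorp"}  # mention -> canonical node name
VECS = {"Acme": (0.9, 0.1), "AcmeCorp": (1.0, 0.0),
        "CO2_Obs_2023": (0.0, 1.0), "NetZero2050": (0.5, 0.5)}
SCHEMA = {("Organization", "reportsKPI", "KPIObservation"),
          ("Organization", "setsGoal", "Goal")}

parsed = [("Acme", "reportsKPI", "CO2_Obs_2023"),
          ("CO2_Obs_2023", "setsGoal", "NetZero2050")]  # second triple violates the schema

edges = insert_triples(parsed, SCHEMA, TYPES.__getitem__,
                       lambda mention, _type: ALIASES.get(mention, mention),
                       VECS.__getitem__)
```

Only the first triple is inserted: the second is rejected by the schema check before any embedding is computed.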

4. Integration with Retrieval-Augmented Generation

EmeraldGraph is integrated with retrieval-augmented generation (RAG) in EmeraldMind via EM-KGRAG (graph RAG) and EM-HYBRID (graph+text RAG):

Graph Embeddings and Indexing: For each node $v$ , a 512-dimensional embedding is computed by averaging its canonical name’s text embedding and the mean embedding of neighbor names. These are stored in a FAISS vector index.
Schema-Based Subgraph Retrieval: Given a grounded claim (e.g., “AcmeCorp reduced CO₂ emissions by 10% in 2023”), the entities it mentions (e.g., company, KPI, year) are anchored to graph nodes. For each anchor, the top- $n$ nodes in the $k$ -hop neighborhood of $v_{company}$ with the highest embedding similarity (above a similarity threshold $\theta$ ) are retrieved, and all nodes and edges along the shortest paths to $v_{company}$ are included in the evidence subgraph $H$ .
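
The node-embedding rule described above (average of the node's own name embedding and the mean of its neighbors' name embeddings) can be sketched in a few lines. The vectors are two-dimensional toys rather than 512-dimensional embeddings, the fallback for isolated nodes is an assumption, and the vector index is omitted.

```python
def mean_vec(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def node_embedding(name_vec, neighbor_vecs):
    """Average the node's name embedding with the mean neighbor-name embedding."""
    if not neighbor_vecs:
        # Assumption: an isolated node falls back to its own name embedding.
        return list(name_vec)
    return mean_vec([name_vec, mean_vec(neighbor_vecs)])

vec = node_embedding((1.0, 0.0), [(0.0, 1.0), (0.0, 0.0)])
```

Mixing in neighbor names gives otherwise similar nodes (for example, two CO₂ observations) embeddings that reflect the company or facility they are attached to.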

Schema-driven context retrieval (Algorithm 1):

For each grounded claim node $v \in V_{claim}$ :
- Retrieve all nodes of type $\tau(v)$ in the $k$ -hop neighborhood.
- Select those with $\text{CosSim}(\text{emb}(u),\text{emb}(v)) \geq \theta$ .
- For the top- $n$ matches, add all nodes/edges on the shortest path to $v_{company}$ into $H$ .

Prompt Assembly: EM-KGRAG constructs prompts of the form:

  Claim: <c>.
  Evidence subgraph (triples + key properties):
  - ⟨AcmeCorp, reportsKPI, CO2_Obs_2023⟩
  - …
  Question: Based on this evidence, is the claim greenwashing?

EM-RAG uses EmeraldDB to retrieve relevant text chunks. EM-HYBRID combines both, then requests the LLM to select the more reliable justification.
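
The retrieval and prompt-assembly steps can be sketched end to end as below. The sketch assumes the graph is held as a plain list of triples, that traversal ignores edge direction, and that each claim anchor is given as a (type, embedding) pair. The function names, the default values of `k`, `n`, and `theta`, and the toy embeddings are all hypothetical.

```python
import math
from collections import deque

def cos_sim(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def bfs_tree(triples, source, k):
    """BFS over the undirected graph, at most k hops from source.

    Returns {node: (parent, connecting_triple)}; following parents from any
    reached node back to the source traces a shortest path.
    """
    adj = {}
    for s, p, o in triples:
        adj.setdefault(s, []).append((o, (s, p, o)))
        adj.setdefault(o, []).append((s, (s, p, o)))
    tree, depth, queue = {source: (None, None)}, {source: 0}, deque([source])
    while queue:
        cur = queue.popleft()
        if depth[cur] == k:
            continue  # do not expand beyond k hops
        for nxt, triple in adj.get(cur, ()):
            if nxt not in tree:
                tree[nxt] = (cur, triple)
                depth[nxt] = depth[cur] + 1
                queue.append(nxt)
    return tree

def retrieve_evidence(triples, node_type, emb, company, anchors, k=2, n=3, theta=0.5):
    """Collect triples on shortest paths from the company to the top-n matches."""
    tree = bfs_tree(triples, company, k)
    evidence = []
    for anchor_type, anchor_vec in anchors:
        scored = [(cos_sim(emb[u], anchor_vec), u) for u in tree
                  if u != company and node_type[u] == anchor_type]
        top = sorted((s for s in scored if s[0] >= theta), reverse=True)[:n]
        for _score, u in top:
            while tree[u][0] is not None:  # walk back to the company node
                parent, triple = tree[u]
                if triple not in evidence:
                    evidence.append(triple)
                u = parent
    return evidence

def build_prompt(claim, evidence):
    """Format the claim and evidence triples in the prompt layout shown above."""
    lines = [f"Claim: {claim}.", "Evidence subgraph (triples + key properties):"]
    lines += [f"- ⟨{s}, {p}, {o}⟩" for s, p, o in evidence]
    lines.append("Question: Based on this evidence, is the claim greenwashing?")
    return "\n".join(lines)

triples = [("AcmeCorp", "reportsKPI", "CO2_Obs_2023"),
           ("AcmeCorp", "reportsKPI", "Water_Obs_2023"),
           ("AcmeCorp", "setsGoal", "NetZero2050")]
node_type = {"AcmeCorp": "Organization", "CO2_Obs_2023": "KPIObservation",
             "Water_Obs_2023": "KPIObservation", "NetZero2050": "Goal"}
emb = {"AcmeCorp": (1.0, 0.0, 0.0), "CO2_Obs_2023": (0.0, 1.0, 0.0),
       "Water_Obs_2023": (0.0, 0.0, 1.0), "NetZero2050": (0.0, 0.7, 0.7)}
anchors = [("KPIObservation", (0.0, 0.9, 0.1))]  # toy embedding of the CO2 KPI anchor

evidence = retrieve_evidence(triples, node_type, emb, "AcmeCorp", anchors)
prompt = build_prompt("AcmeCorp reduced CO₂ emissions by 10% in 2023", evidence)
```

In this toy graph only the CO₂ observation passes the type and similarity filters, so the prompt contains a single evidence triple. The water observation has the right type but falls below the threshold, and the goal node is excluded by type.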

5. Evaluation Metrics and Empirical Performance

Evaluation encompasses both intrinsic and extrinsic metrics:

Intrinsic Graph Statistics (over 37 ESG reports):
- Node count $|V|$ = 53,748, edge count $|E|$ = 59,344
- Sparse graph: average total degree 2.21; “star” topology centered on companies
- Diameter = 23,788; average shortest path $\approx 1727$ (indicative of weak global connectivity and localization of information)
- Key entity counts: KPIObservation (24,809); Initiative (4,060); SustainabilityClaim (3,458); Goal (3,414); Organization (3,020)
- Top relations: reportsKPI (24,832); takesPartIn (4,388); setsGoal (3,475); claims (3,446); locatedIn (3,396)
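
As a quick consistency check on the statistics above, the average total degree of a graph equals $2|E|/|V|$ , since every edge contributes to the degree of two nodes. The snippet below recomputes it from the reported node and edge counts; the variable names are illustrative.

```python
num_nodes = 53_748  # |V| as reported
num_edges = 59_344  # |E| as reported

# Each edge adds one to the degree of both of its endpoints.
avg_total_degree = 2 * num_edges / num_nodes
print(round(avg_total_degree, 2))  # 2.21
```

The result matches the reported average total degree of 2.21.
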
Extrinsic Greenwashing Detection (GreenClaims & EmeraldData):

Variant	Coverage (%)	Accuracy (%)	Overall (%)
Baseline LLM	20–30	94–100	18–31
EM-RAG (text)	55–77	78–88	52–62
EM-KGRAG (graph)	50–76	89–94	46–69
EM-HYBRID	68–75	85–92	59–71

Justification Quality: Judged by the ILORA 5-point rubric and pairwise Borda ranking, the graph-augmented variants (EM-KGRAG, EM-HYBRID) consistently outperform both text-only and baseline (no RAG) LLMs.

These results suggest that EmeraldGraph substantially improves decision coverage (from 20–30% for the baseline LLM to 50–76% for EM-KGRAG) and justification quality while keeping classification accuracy high (89–94% versus 94–100% for the baseline), relative to pure LLM reasoning without retrieval.

6. Distinctive Capabilities and Context

EmeraldGraph is engineered for fine-grained, transparent, and auditable reasoning over ESG data at a level of granularity and provenance not present in generic KBs. By encoding both actual KPI values and announced targets, surfacing company-specific assertions, and supporting rigorous evidence assembly for each sustainability claim, EmeraldGraph enables system abstention on unverifiable claims and delivers transparent, evidence-backed verdicts. A plausible implication is that this tightly coupled extraction–retrieval–generation pipeline defines a new class of domain-specific knowledge graphs optimized for factual consistency and regulatory compliance tasks in automated decision contexts (Kaoukis et al., 12 Dec 2025).

References

Kaoukis et al. (2025). EmeraldMind: A Knowledge Graph-Augmented Framework for Greenwashing Detection.
