LIBERTy: LLM Benchmark for Causal Explainability

Updated 16 January 2026
  • The paper introduces LIBERTy, a benchmark framework and dataset suite based on structural causal models that generates gold-standard counterfactuals for LLM explanation evaluation.
  • The paper employs intervention-based measurements, including the Individual Causal Concept Effect (ICaCE) and Order-Faithfulness, to rigorously assess explanation methods under causal perturbations.
  • The paper demonstrates that fine-tuned models exhibit higher sensitivity to causal interventions compared to proprietary LLMs, underscoring the importance of explicit SCM design in high-stakes domains.

LIBERTy (LLM-based Interventional Benchmark for Explainability with Reference Targets) is a benchmark framework and dataset suite grounded in structural causal models (SCMs), designed to support rigorous evaluation of concept-based explanations for LLMs. It quantifies how high-level concepts (e.g., gender, symptoms, work experience) influence model behavior in text classification by enabling intervention-based measurement of causal effects, providing gold-standard reference targets that previous human-edited counterfactual datasets lack. Three domain-diverse datasets and novel metrics support systematic comparison of explanation methods and model sensitivities under genuinely causal perturbations, particularly in high-stakes or fairness-sensitive decision domains (Toker et al., 15 Jan 2026).

1. Structural Causal Model Foundation

LIBERTy is formalized using a structural causal model (SCM) $M = (U, V, F)$, where:

  • $U = (\epsilon_1, \dots, \epsilon_k)$ are exogenous noise variables.
  • $V = \{C_1, \dots, C_m, X, Y\}$ are endogenous variables: the $C_j$ are high-level “concept” variables, $X$ is the LLM-generated text, and $Y = f(X) \in \Delta^{|labels|}$ is the classifier’s soft-probability output.
  • $F$ is the set of structural functions governing each variable’s generation.

For each concept $C_j$, the generation is:

$C_j = f_j(\mathrm{parents}(C_j), \epsilon_j), \quad \epsilon_j \sim N(\mu_j, \sigma_j^2) \text{ or categorical}.$

Text is generated as:

$X = f_X(C_1, \dots, C_m, \epsilon_\text{template}, \epsilon_\text{persona})$

Model output is:

$Y = f(X)$

Interventions are performed using Pearl’s abduction–action–prediction procedure:

  1. Abduction: Fix all exogenous $\epsilon$ at their observed values.
  2. Action: Apply $do(C_k = c')$.
  3. Prediction: Propagate through $F$ to update all descendants; regenerate $X'$ with the same $\epsilon_\text{template}, \epsilon_\text{persona}$.
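Under stated assumptions, the three-step procedure can be sketched as follows; the concepts, templates, and structural functions here are hypothetical stand-ins for the paper's richer SCMs:

```python
import random

# Toy sketch of abduction-action-prediction. All names and templates
# below are invented for illustration, not taken from LIBERTy itself.

def sample_exogenous(seed):
    rng = random.Random(seed)
    return {"eps_sym": rng.random(), "eps_template": rng.choice(["A", "B"])}

def generate_text(concepts, eps):
    # f_X: deterministic given the concept values and the template noise.
    template = {"A": "Patient (age {age}) reports {symptom}.",
                "B": "{symptom} reported; patient age {age}."}[eps["eps_template"]]
    return template.format(**concepts)

# 1. Abduction: fix the observed exogenous noise.
eps = sample_exogenous(seed=0)
concepts = {"age": "34",
            "symptom": "headache" if eps["eps_sym"] < 0.9 else "fever"}
x = generate_text(concepts, eps)

# 2. Action: apply do(symptom = "fever"), overriding its structural function.
cf_concepts = dict(concepts, symptom="fever")

# 3. Prediction: regenerate X' with the SAME template noise.
x_prime = generate_text(cf_concepts, eps)
print(x)
print(x_prime)
```

Because the template noise is held fixed, the pair differs only through the intervened concept and its downstream effects.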

The result is a pair $(X, X')$ of original and counterfactual texts, allowing direct computation of the Individual Causal Concept Effect (ICaCE):

$\widehat{\mathrm{ICaCE}_f(X,\,c\to c')} = f(X') - f(X) \in \mathbb{R}^{|Y|}.$
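Given a classifier that returns soft probabilities, the ICaCE estimate is a simple vector difference; the toy classifier below is an invented stand-in for the explained model:

```python
import numpy as np

def icace(f, x, x_prime):
    """Estimated ICaCE: difference of output distributions, a vector in R^|Y|."""
    return np.asarray(f(x_prime)) - np.asarray(f(x))

def toy_f(text):
    # Hypothetical three-label classifier, e.g. (No, Verbal, Physical).
    return [0.7, 0.2, 0.1] if "calm" in text else [0.2, 0.3, 0.5]

effect = icace(toy_f, "a calm exchange", "a heated exchange")
print(effect)  # components sum to zero since both outputs are distributions
```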

2. Dataset Construction and Structure

LIBERTy comprises three datasets, each constructed via SCM-guided generation:

  • Sample concept values in topological order through the SCM.
  • Draw ϵtemplate\epsilon_\text{template} and ϵpersona\epsilon_\text{persona} from banks of real-text abstractions to ground prompts.
  • Use GPT-4o with zero-temperature decoding to generate base texts.
  • Apply three randomly selected concept interventions per example, regenerate counterfactuals XX'.
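The first generation step, sampling concept values in topological order, might look like the following minimal sketch (the graph, concept names, and structural functions are hypothetical):

```python
import random

# Hypothetical concept graph: age -> disease -> symptom.
PARENTS = {"age": [], "disease": ["age"], "symptom": ["disease"]}

STRUCTURAL_FNS = {
    "age": lambda pa, rng: rng.choice(["young", "old"]),
    "disease": lambda pa, rng: "migraine" if pa["age"] == "young" else "sinusitis",
    "symptom": lambda pa, rng: {"migraine": "aura",
                                "sinusitis": "congestion"}[pa["disease"]],
}

def topo_order(parents):
    """Depth-first topological sort of the concept DAG."""
    order, seen = [], set()
    def visit(v):
        if v not in seen:
            seen.add(v)
            for p in parents[v]:
                visit(p)
            order.append(v)
    for v in parents:
        visit(v)
    return order

def sample_concepts(rng):
    values = {}
    for c in topo_order(PARENTS):  # parents are always sampled first
        values[c] = STRUCTURAL_FNS[c]({p: values[p] for p in PARENTS[c]}, rng)
    return values

print(sample_concepts(random.Random(1)))
```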

All datasets share a consistent split structure:

| Split | Size per Dataset |
|-------|------------------|
| D_f (explained-model training) | 1,500 examples |
| D_M (explainer training) | 500 examples |
| D_IC (test, counterfactual pairs) | varies |

Dataset details:

| Dataset | Concepts (C) | Labels (Y) | \|D_IC\| (pairs) | Avg. Words |
|---------|--------------|------------|------------------|------------|
| Workplace Violence | G, A, R, T, L, D, S | No, Verbal, Physical | 1,756 | 350.9 |
| Disease Detection | Y, D, L, P, W, F, N, H | Migraine, Sinusitis, Influenza | 1,243 | 310.8 |
| CV Screening | G, R, A, E, S, W, V, C | Not Recommended, Potential, Recommended | 1,332 | 313.0 |

Each dataset provides the necessary counterfactual pairs and concept interventions for faithful evaluation.

3. Metrics: Order-Faithfulness and ICaCE-Error Distance

Two primary metrics evaluate whether an explanation method $M_f$ approximates true causal effects:

  • ICaCE-Error Distance (ED):

$\mathrm{ED}(f, M_f, X, c\to c') = \bigl\|\,\widehat{\mathrm{ICaCE}_f(X, c\to c')} - M_f(X, c\to c')\bigr\|_2$

ED measures the $\ell_2$ error between the reference ICaCE and the method's attribution per concept change; lower is better.

  • Order-Faithfulness (OF): For two concept changes $c_1 \to c_1'$ and $c_2 \to c_2'$ in the same example,

$\mathrm{OF}\bigl(f, M_f, X, c_1\to c_1', c_2\to c_2'\bigr) = \frac{1}{|Y|}\sum_{i=1}^{|Y|} \mathbf{1}\Bigl[ \bigl(\widehat{\mathrm{ICaCE}_f^i(X, c_1\to c_1')} - \widehat{\mathrm{ICaCE}_f^i(X, c_2\to c_2')}\bigr) \cdot \bigl(M_f^i(X, c_1\to c_1') - M_f^i(X, c_2\to c_2')\bigr) > 0 \Bigr]$

OF quantifies signed agreement of effect differences; higher values indicate better ranking fidelity.
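Both metrics reduce to a few lines of array arithmetic; the reference ICaCE vectors and method scores below are illustrative numbers, not results from the paper:

```python
import numpy as np

def error_distance(icace_ref, method_score):
    """ICaCE-Error Distance: l2 gap between reference effect and method score."""
    return float(np.linalg.norm(np.asarray(icace_ref) - np.asarray(method_score)))

def order_faithfulness(icace_1, icace_2, m_1, m_2):
    """Per-label sign agreement between reference and method effect differences."""
    d_ref = np.asarray(icace_1) - np.asarray(icace_2)
    d_method = np.asarray(m_1) - np.asarray(m_2)
    return float(np.mean(d_ref * d_method > 0))

# Invented reference ICaCEs and method scores for two concept changes.
ref1, ref2 = [0.4, -0.1, -0.3], [0.1, 0.0, -0.1]
m1, m2 = [0.3, -0.2, -0.1], [0.0, 0.1, -0.1]
print(error_distance(ref1, m1))                # lower is better
print(order_faithfulness(ref1, ref2, m1, m2))  # higher is better
```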

Reported metrics:

  • Average ED: Lower values indicate more faithful importance estimation.
  • Average OF: Higher values reflect stronger agreement with ground-truth concept effect orderings.

4. Experimental Methods and Comparative Results

Models evaluated:

  1. DeBERTa-v3-base (encoder)
  2. T5-base (encoder–decoder)
  3. Qwen-2.5 (1.5B-instruct)
  4. Llama-3.1 (8B-instruct, zero-shot)
  5. GPT-4o (zero-shot)

Explanation methods:

  • Counterfactual Generation (CF Gen): LLM-generated causal-aware counterfactuals
  • Matching: Semantic variants (ST/PT/FT Match), concept-value (Approx, ConVecs)
  • Concept Erasure (LEACE): Linear projections
  • Concept Attributions: ConceptShap and TCAV methods

Key local results (Table 4):

| Method | Avg. ED | Avg. OF |
|--------|---------|---------|
| FT Match | 0.34 | 0.74 |
| ConVecs/Approx | 0.44 | 0.69 |
| CF Gen | 0.55 | 0.49 |
| LEACE (Disease only) | 0.65 | 0.46 |

Global per-concept results (Table 5):

  • FT Match: OF ≈ 0.85 (best overall)
  • ConceptShap: OF ≈ 0.33–0.44 (weakest across datasets)

5. Analysis of Model Sensitivity to Interventions

Concept sensitivity is measured as $S(f, C) = \mathbb{E}_X\bigl[\|\mathrm{ICaCE}_f(X, C \to \cdot)\|_1\bigr]$, estimating the magnitude of output change under interventions on concept $C$.
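A minimal sketch of this sensitivity estimate, using invented effect vectors rather than measured ones:

```python
import numpy as np

def sensitivity(icace_vectors):
    """S(f, C) estimate: mean l1 magnitude of ICaCE vectors for one concept."""
    return float(np.mean([np.abs(np.asarray(v)).sum() for v in icace_vectors]))

# Hypothetical ICaCE estimates for two interventions on a single concept.
effects_gender = [[-0.30, 0.10, 0.20], [0.05, -0.05, 0.00]]
print(sensitivity(effects_gender))  # (0.60 + 0.10) / 2 = 0.35
```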

Findings:

  • Fine-tuned models (DeBERTa-v3-base, Qwen-2.5) exhibit sensitivity profiles consistent with true SCM causal effects.
  • Llama-3.1 and GPT-4o demonstrate notably reduced sensitivity to demographic concepts (race, gender, age), attributed to post-training mitigation measures.

This reduced sensitivity suggests that debiasing applied during post-training dampens the causal influence of sensitive demographic features on these models' outputs.

6. Implications, Limitations, and Best Practices

LIBERTy provides structural counterfactuals grounded in the known data-generating process, in contrast to previous human-edited datasets such as CEBaB, which only approximate causal effects and are susceptible to LLM “gaming.” Use of SCM-guided interventions yields gold-standard reference targets.

Observed best-case metrics (ED ≈ 0.32, OF ≈ 0.86) indicate that substantial headroom remains for both local and global faithfulness of concept-based explanations.

In high-stakes domains, best practices include explicit SCM design, deterministic grounding of exogenous variables, order-faithfulness evaluation, and concept sensitivity audits, especially for demographic features. This ensures that explanation methods are compared to true causal effects rather than heuristics or text edits. A plausible implication is that further methodological advances are needed to realize fully faithful explanations for practical deployment.

LIBERTy is positioned as a reproducible, principled pipeline for the generation and benchmarking of interventional text classification datasets and methods, offering a new standard for causal explainability evaluation in LLMs (Toker et al., 15 Jan 2026).
