LIBERTy: LLM Benchmark for Causal Explainability
- The paper introduces LIBERTy, a benchmark framework and dataset suite based on structural causal models that generates gold-standard counterfactuals for LLM explanation evaluation.
- The paper employs intervention-based measurements, including the Individual Causal Concept Effect (ICaCE) and Order-Faithfulness, to rigorously assess explanation methods under causal perturbations.
- The paper demonstrates that fine-tuned models exhibit higher sensitivity to causal interventions compared to proprietary LLMs, underscoring the importance of explicit SCM design in high-stakes domains.
LIBERTy (LLM-based Interventional Benchmark for Explainability with Reference Targets) is a benchmark framework and dataset suite grounded in structural causal models (SCMs) and designed to support rigorous evaluation of concept-based explanations for LLMs. LIBERTy quantifies how high-level concepts (e.g., gender, symptoms, work experience) influence model behavior in text classification settings by enabling intervention-based measurement of causal effects, providing gold-standard reference targets unavailable in previous human-edited counterfactual datasets. Three domain-diverse datasets and novel metrics support systematic comparison of explanation methods and model sensitivities under genuinely causal perturbations, particularly for high-stakes or fairness-sensitive decision domains (Toker et al., 15 Jan 2026).
1. Structural Causal Model Foundation
LIBERTy is formalized using a structural causal model (SCM) $\mathcal{M} = (U, V, F)$, where:
- $U$ are exogenous noise variables.
- $V$ are endogenous variables: $C_1, \dots, C_n$ are high-level “concept” variables, $X$ is the LLM-generated text, and $f(X)$ is the soft-probability output of a classifier.
- $F$ is the set of structural functions governing each variable’s generation.
For each concept $C_i$, the generation is:
$C_i := f_{C_i}(\mathrm{Pa}(C_i), U_{C_i})$
Text is generated as:
$X := f_X(C_1, \dots, C_n, U_X)$
Model output is:
$\hat{Y} = f(X)$
Interventions are performed using Pearl’s abduction–action–prediction procedure:
- Abduction: Fix all exogenous variables $U$ to their observed values.
- Action: Apply $\mathrm{do}(C_i = c')$.
- Prediction: Propagate through $F$ to yield new values for the descendants of $C_i$; regenerate $X$ with the same $U_X$.
The result is a pair of original and counterfactual texts $(X, X_{c\to c'})$, allowing direct computation of the Individual Causal Concept Effect (ICaCE):
$\widehat{\mathrm{ICaCE}}_f(X, c\to c') = f(X_{c\to c'}) - f(X)$
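Given a classifier's soft-probability outputs on an original/counterfactual text pair, the ICaCE reduces to the difference of the two probability vectors. A minimal sketch, where `toy_classifier` is a hypothetical stand-in for a fine-tuned model head (not one of the paper's models):

```python
import numpy as np

def icace(f, x_orig: str, x_cf: str) -> np.ndarray:
    """Individual Causal Concept Effect: change in the classifier's
    soft-probability output between a text and its structural counterfactual."""
    return np.asarray(f(x_cf)) - np.asarray(f(x_orig))

def toy_classifier(text: str) -> np.ndarray:
    """Illustrative 3-label classifier producing a valid probability vector."""
    top = len(text) % 3
    probs = np.full(3, 0.1)
    probs[top] = 0.8
    return probs

effect = icace(toy_classifier, "original report text", "counterfactual report")
# Since both outputs are probability vectors, the ICaCE entries sum to zero.
```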
2. Dataset Construction and Structure
LIBERTy comprises three datasets, each constructed via SCM-guided generation:
- Sample concept values in topological order through the SCM.
- Draw exogenous values from banks of real-text abstractions to ground the prompts.
- Use GPT-4o with zero-temperature decoding to generate the base texts $X$.
- Apply three randomly selected concept interventions per example and regenerate the counterfactual texts $X_{c\to c'}$.
All datasets share a consistent split structure:

| Split | Size per Dataset |
|---|---|
| D_f (explained-model training) | 1,500 examples |
| D_M (explainer training) | 500 examples |
| D_IC (test, counterfactual pairs) | varies |
Dataset details:
| Dataset | Concepts (C) | Labels (Y) | \|D_IC\| (pairs) | Avg. Words |
|---|---|---|---|---|
| Workplace Violence | G, A, R, T, L, D, S | No, Verbal, Physical | 1,756 | 350.9 |
| Disease Detection | Y, D, L, P, W, F, N, H | Migraine, Sinusitis, Influenza | 1,243 | 310.8 |
| CV Screening | G, R, A, E, S, W, V, C | Not Recommended, Potential, Recommended | 1,332 | 313.0 |
Each dataset provides the necessary counterfactual pairs and concept interventions for faithful evaluation.
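The SCM-guided generation with abduction–action–prediction can be sketched on a hypothetical two-concept graph (the concept names and structural functions below are illustrative, not the paper's actual datasets):

```python
def sample_concepts(u: dict) -> dict:
    """Sample concept values in topological order; the exogenous noise u
    makes generation deterministic and reusable across interventions."""
    c = {}
    c["experience"] = "senior" if u["experience"] > 0.5 else "junior"
    # 'skill' depends causally on 'experience' plus its own noise term.
    base = 0.7 if c["experience"] == "senior" else 0.3
    c["skill"] = "high" if u["skill"] < base else "low"
    return c

def intervene(u: dict, do: dict) -> dict:
    """Abduction: reuse the observed noise u.  Action: override the
    intervened concept.  Prediction: regenerate its descendants with
    the same noise, so only causally downstream values change."""
    c = sample_concepts(u)
    c.update(do)
    if "experience" in do:  # re-propagate the descendant of 'experience'
        base = 0.7 if c["experience"] == "senior" else 0.3
        c["skill"] = "high" if u["skill"] < base else "low"
    return c

u = {"experience": 0.9, "skill": 0.4}
original = sample_concepts(u)                       # senior, high skill
counterfactual = intervene(u, {"experience": "junior"})  # skill re-derived
```

In the full pipeline, the resulting concept assignments would be rendered into prompts for zero-temperature GPT-4o generation of the paired texts.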
3. Metrics: Order-Faithfulness and ICaCE-Error Distance
Two primary metrics evaluate whether an explanation method approximates true causal effects:
- ICaCE-Error Distance (ED):
$\mathrm{ED}(f,M_f,X,c\to c') = \|\widehat{\mathrm{ICaCE}}_f(X,c\to c') - M_f(X,c\to c')\|_2$
ED measures the per-change error between the reference effect and the method’s estimate; lower is better.
- Order-Faithfulness (OF): For two concept changes $c_1\to c_1'$, $c_2\to c_2'$ in the same example,
$\mathrm{OF}\bigl(f,M_f,X,c_1\to c_1',\,c_2\to c_2'\bigr) = \frac1{|Y|}\sum_{i=1}^{|Y|} \mathbf{1}\left[ \left(\widehat{\mathrm{ICaCE}}_f^i(X,c_1\to c_1') - \widehat{\mathrm{ICaCE}}_f^i(X,c_2\to c_2')\right) \cdot \left(M_f^i(X,c_1\to c_1') - M_f^i(X,c_2\to c_2')\right) > 0 \right]$
OF quantifies signed agreement of effect differences; higher values indicate better ranking fidelity.
Reported metrics:
- Average ED: Lower values indicate more faithful importance estimation.
- Average OF: Higher values reflect stronger agreement with ground-truth concept effect orderings.
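Both metrics reduce to simple vector operations once the reference ICaCE and the method's estimate are available as per-label vectors. A sketch under that assumption:

```python
import numpy as np

def ed(icace_ref, icace_method) -> float:
    """ICaCE-Error Distance: L2 distance between the gold-standard causal
    effect and the explanation method's estimate (lower is better)."""
    return float(np.linalg.norm(np.asarray(icace_ref) - np.asarray(icace_method)))

def order_faithfulness(ref_1, ref_2, m_1, m_2) -> float:
    """Order-Faithfulness: fraction of labels on which the reference effect
    difference and the method's effect difference agree in sign."""
    ref_diff = np.asarray(ref_1) - np.asarray(ref_2)
    m_diff = np.asarray(m_1) - np.asarray(m_2)
    return float(np.mean(ref_diff * m_diff > 0))

err = ed([0.3, 0.4, 0.0], [0.0, 0.0, 0.0])          # 0.5
of_score = order_faithfulness([0.3, -0.2, 0.1], [0.1, 0.0, -0.1],
                              [0.2, -0.1, 0.3], [0.0, 0.1, 0.2])  # 1.0
```

Averaging `ed` over all counterfactual pairs and `order_faithfulness` over all in-example change pairs yields the reported Avg. ED and Avg. OF.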
4. Experimental Methods and Comparative Results
Models evaluated:
- DeBERTa-v3-base (encoder)
- T5-base (encoder–decoder)
- Qwen-2.5 (1.5B-instruct)
- Llama-3.1 (8B-instruct, zero-shot)
- GPT-4o (zero-shot)
Explanation methods:
- Counterfactual Generation (CF Gen): LLM-generated causal-aware counterfactuals
- Matching: semantic-similarity variants (ST/PT/FT Match) and concept-value variants (Approx, ConVecs)
- Concept Erasure (LEACE): Linear projections
- Concept Attributions: ConceptShap and TCAV methods
Key local results (Table 4):
| Method | Avg. ED | Avg. OF |
|---|---|---|
| FT Match | 0.34 | 0.74 |
| ConVecs/Approx | 0.44 | 0.69 |
| CF Gen | 0.55 | 0.49 |
| LEACE (Disease only) | 0.65 | 0.46 |
Global per-concept results (Table 5):
- FT Match: OF ≈ 0.85 (best overall)
- ConceptShap: OF 0.33–0.44 (weakest across datasets)
5. Analysis of Model Sensitivity to Interventions
Concept sensitivity is measured as $\|\widehat{\mathrm{ICaCE}}_f(X, c\to c')\|_2$, estimating the magnitude of output change under concept interventions.
Findings:
- Fine-tuned models (DeBERTa-v3-base, Qwen-2.5) exhibit sensitivity profiles consistent with true SCM causal effects.
- Llama-3.1 and GPT-4o demonstrate notably reduced sensitivity to demographic concepts (race, gender, age), attributed to post-training mitigation measures.
This suggests that proprietary LLMs are less affected by interventions on sensitive demographic features, possibly due to explicit debiasing strategies applied during training.
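A sensitivity audit of this kind can be sketched by averaging ICaCE magnitudes per concept; the effect vectors below are illustrative values, not the paper's measurements:

```python
import numpy as np

def concept_sensitivity(icace_vectors) -> float:
    """Mean magnitude of output change under interventions on one concept;
    near-zero values flag concepts the model has learned to ignore."""
    return float(np.mean([np.linalg.norm(v) for v in icace_vectors]))

# Hypothetical audit contrasting a demographic with a task-relevant concept.
audit = {
    "gender":   concept_sensitivity([[0.01, -0.01, 0.0], [0.02, 0.0, -0.02]]),
    "symptoms": concept_sensitivity([[0.4, -0.3, -0.1], [0.5, -0.2, -0.3]]),
}
# A debiased model would show the 'gender' sensitivity near zero while
# remaining responsive to 'symptoms'.
```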
6. Implications, Limitations, and Best Practices
LIBERTy provides structural counterfactuals grounded in the known data-generating process, in contrast to previous human-edited datasets such as CEBaB, which only approximate causal effects and are susceptible to LLM “gaming.” Use of SCM-guided interventions yields gold-standard reference targets.
Observed best-case metrics (ED ≈ 0.32, OF ≈ 0.86) indicate that substantial room for improvement remains in both local and global faithfulness of concept-based explanations.
In high-stakes domains, best practices include explicit SCM design, deterministic grounding of exogenous variables, order-faithfulness evaluation, and concept sensitivity audits, especially for demographic features. This ensures that explanation methods are compared to true causal effects rather than heuristics or text edits. A plausible implication is that further methodological advances are needed to realize fully faithful explanations for practical deployment.
LIBERTy is positioned as a reproducible, principled pipeline for the generation and benchmarking of interventional text classification datasets and methods, offering a new standard for causal explainability evaluation in LLMs (Toker et al., 15 Jan 2026).