LIBERTy: LLM Benchmark for Causal Explainability
- The paper introduces LIBERTy, a benchmark framework and dataset suite based on structural causal models that generates gold-standard counterfactuals for LLM explanation evaluation.
- The paper employs intervention-based measurements, including the Individual Causal Concept Effect (ICaCE) and Order-Faithfulness, to rigorously assess explanation methods under causal perturbations.
- The paper demonstrates that fine-tuned models exhibit higher sensitivity to causal interventions compared to proprietary LLMs, underscoring the importance of explicit SCM design in high-stakes domains.
LIBERTy (LLM-based Interventional Benchmark for Explainability with Reference Targets) is a benchmark framework and dataset suite grounded in structural causal models (SCMs) and designed to support rigorous evaluation of concept-based explanations for LLMs. LIBERTy quantifies how high-level concepts (e.g., gender, symptoms, work experience) influence model behavior in text classification settings by enabling intervention-based measurement of causal effects, providing gold-standard reference targets unavailable in previous human-edited counterfactual datasets. Three domain-diverse datasets and novel metrics support systematic comparison of explanation methods and model sensitivities under genuinely causal perturbations, particularly for high-stakes or fairness-sensitive decision domains (Toker et al., 15 Jan 2026).
1. Structural Causal Model Foundation
LIBERTy is formalized using a structural causal model (SCM) $\mathcal{M} = (U, V, F)$, where:
- $U$ are exogenous noise variables.
- $V$ are endogenous variables: $C_1, \dots, C_n$ are high-level “concept” variables, $X$ is the LLM-generated text, and $f(X)$ is the soft-probability output of a classifier.
- $F$ is the set of structural functions governing each variable’s generation.
For each concept $C_i$, the generation is:
$C_i := f_{C_i}(\mathrm{Pa}(C_i), U_{C_i})$
Text is generated as:
$X := f_X(C_1, \dots, C_n, U_X)$
Model output is:
$\hat{Y} = f(X)$
Interventions are performed using Pearl’s abduction–action–prediction procedure:
- Abduction: Fix all exogenous variables $U$ to their observed values.
- Action: Apply $\mathrm{do}(C_i = c')$.
- Prediction: Propagate through $F$ to yield new values for the descendants of $C_i$; regenerate $X$ with the same $U_X$.
The result is a pair of original and counterfactual texts $(X, X_{c\to c'})$, allowing direct computation of the Individual Causal Concept Effect (ICaCE):
$\widehat{\mathrm{ICaCE}}_f(X, c\to c') = f(X_{c\to c'}) - f(X)$
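Given a classifier's soft-probability outputs on an original/counterfactual text pair, the ICaCE reduces to the difference of the two probability vectors. A minimal sketch, where `toy_classifier` is a hypothetical stand-in for a fine-tuned model head (not one of the paper's models):

```python
import numpy as np

def icace(f, x_orig: str, x_cf: str) -> np.ndarray:
    """Individual Causal Concept Effect: change in the classifier's
    soft-probability output between a text and its structural counterfactual."""
    return np.asarray(f(x_cf)) - np.asarray(f(x_orig))

def toy_classifier(text: str) -> np.ndarray:
    """Illustrative 3-label classifier producing a valid probability vector."""
    top = len(text) % 3
    probs = np.full(3, 0.1)
    probs[top] = 0.8
    return probs

effect = icace(toy_classifier, "original report text", "counterfactual report")
# Since both outputs are probability vectors, the ICaCE entries sum to zero.
```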
2. Dataset Construction and Structure
LIBERTy comprises three datasets, each constructed via SCM-guided generation:
- Sample concept values in topological order through the SCM.
- Draw exogenous values from banks of real-text abstractions to ground the prompts.
- Use GPT-4o with zero-temperature decoding to generate the base texts $X$.
- Apply three randomly selected concept interventions per example and regenerate the counterfactual texts $X_{c\to c'}$.
All datasets share a consistent split structure:

| Split | Size per Dataset |
|---|---|
| D_f (explained-model training) | 1,500 examples |
| D_M (explainer training) | 500 examples |
| D_IC (test, counterfactual pairs) | varies |
Dataset details:
| Dataset | Concepts (C) | Labels (Y) | \|D_IC\| (pairs) | Avg. Words |
|---|---|---|---|---|
| Workplace Violence | G, A, R, T, L, D, S | No, Verbal, Physical | 1,756 | 350.9 |
| Disease Detection | Y, D, L, P, W, F, N, H | Migraine, Sinusitis, Influenza | 1,243 | 310.8 |
| CV Screening | G, R, A, E, S, W, V, C | Not Recommended, Potential, Recommended | 1,332 | 313.0 |
Each dataset provides the necessary counterfactual pairs and concept interventions for faithful evaluation.
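The SCM-guided generation with abduction–action–prediction can be sketched on a hypothetical two-concept graph (the concept names and structural functions below are illustrative, not the paper's actual datasets):

```python
def sample_concepts(u: dict) -> dict:
    """Sample concept values in topological order; the exogenous noise u
    makes generation deterministic and reusable across interventions."""
    c = {}
    c["experience"] = "senior" if u["experience"] > 0.5 else "junior"
    # 'skill' depends causally on 'experience' plus its own noise term.
    base = 0.7 if c["experience"] == "senior" else 0.3
    c["skill"] = "high" if u["skill"] < base else "low"
    return c

def intervene(u: dict, do: dict) -> dict:
    """Abduction: reuse the observed noise u.  Action: override the
    intervened concept.  Prediction: regenerate its descendants with
    the same noise, so only causally downstream values change."""
    c = sample_concepts(u)
    c.update(do)
    if "experience" in do:  # re-propagate the descendant of 'experience'
        base = 0.7 if c["experience"] == "senior" else 0.3
        c["skill"] = "high" if u["skill"] < base else "low"
    return c

u = {"experience": 0.9, "skill": 0.4}
original = sample_concepts(u)                       # senior, high skill
counterfactual = intervene(u, {"experience": "junior"})  # skill re-derived
```

In the full pipeline, the resulting concept assignments would be rendered into prompts for zero-temperature GPT-4o generation of the paired texts.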
3. Metrics: Order-Faithfulness and ICaCE-Error Distance
Two primary metrics evaluate whether an explanation method approximates true causal effects:
- ICaCE-Error Distance (ED):
$\mathrm{ED}(f,M_f,X,c\to c') = \|\widehat{\mathrm{ICaCE}}_f(X,c\to c') - M_f(X,c\to c')\|_2$
ED measures the per-change error between the reference effect and the method’s estimate; lower is better.
- Order-Faithfulness (OF): For two concept changes $c_1\to c_1'$, $c_2\to c_2'$ in the same example,
$\mathrm{OF}\bigl(f,M_f,X,c_1\to c_1',\,c_2\to c_2'\bigr) = \frac1{|Y|}\sum_{i=1}^{|Y|} \mathbf{1}\left[ \left(\widehat{\mathrm{ICaCE}}_f^i(X,c_1\to c_1') - \widehat{\mathrm{ICaCE}}_f^i(X,c_2\to c_2')\right) \cdot \left(M_f^i(X,c_1\to c_1') - M_f^i(X,c_2\to c_2')\right) > 0 \right]$
OF quantifies signed agreement of effect differences; higher values indicate better ranking fidelity.
Reported metrics:
- Average ED: Lower values indicate more faithful importance estimation.
- Average OF: Higher values reflect stronger agreement with ground-truth concept effect orderings.
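Both metrics reduce to simple vector operations once the reference ICaCE and the method's estimate are available as per-label vectors. A sketch under that assumption:

```python
import numpy as np

def ed(icace_ref, icace_method) -> float:
    """ICaCE-Error Distance: L2 distance between the gold-standard causal
    effect and the explanation method's estimate (lower is better)."""
    return float(np.linalg.norm(np.asarray(icace_ref) - np.asarray(icace_method)))

def order_faithfulness(ref_1, ref_2, m_1, m_2) -> float:
    """Order-Faithfulness: fraction of labels on which the reference effect
    difference and the method's effect difference agree in sign."""
    ref_diff = np.asarray(ref_1) - np.asarray(ref_2)
    m_diff = np.asarray(m_1) - np.asarray(m_2)
    return float(np.mean(ref_diff * m_diff > 0))

err = ed([0.3, 0.4, 0.0], [0.0, 0.0, 0.0])          # 0.5
of_score = order_faithfulness([0.3, -0.2, 0.1], [0.1, 0.0, -0.1],
                              [0.2, -0.1, 0.3], [0.0, 0.1, 0.2])  # 1.0
```

Averaging `ed` over all counterfactual pairs and `order_faithfulness` over all in-example change pairs yields the reported Avg. ED and Avg. OF.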
4. Experimental Methods and Comparative Results
Models evaluated:
- DeBERTa-v3-base (encoder)
- T5-base (encoder–decoder)
- Qwen-2.5 (1.5B-instruct)
- Llama-3.1 (8B-instruct, zero-shot)
- GPT-4o (zero-shot)
Explanation methods:
- Counterfactual Generation (CF Gen): LLM-generated causal-aware counterfactuals
- Matching: semantic-similarity variants (ST/PT/FT Match) and concept-value variants (Approx, ConVecs)
- Concept Erasure (LEACE): Linear projections
- Concept Attributions: ConceptShap and TCAV methods
Key local results (Table 4):
| Method | Avg. ED | Avg. OF |
|---|---|---|
| FT Match | 0.34 | 0.74 |
| ConVecs/Approx | 0.44 | 0.69 |
| CF Gen | 0.55 | 0.49 |
| LEACE (Disease only) | 0.65 | 0.46 |
Global per-concept results (Table 5):
- FT Match: OF ≈ 0.85 (best overall)
- ConceptShap: OF 0.33–0.44 (weakest across datasets)
5. Analysis of Model Sensitivity to Interventions
Concept sensitivity is measured as $\|\widehat{\mathrm{ICaCE}}_f(X, c\to c')\|_2$, estimating the magnitude of output change under concept interventions.
Findings:
- Fine-tuned models (DeBERTa-v3-base, Qwen-2.5) exhibit sensitivity profiles consistent with true SCM causal effects.
- Llama-3.1 and GPT-4o demonstrate notably reduced sensitivity to demographic concepts (race, gender, age), attributed to post-training mitigation measures.
This suggests that proprietary LLMs are less affected by interventions on sensitive demographic features, possibly due to explicit debiasing strategies applied during training.
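A sensitivity audit of this kind can be sketched by averaging ICaCE magnitudes per concept; the effect vectors below are illustrative values, not the paper's measurements:

```python
import numpy as np

def concept_sensitivity(icace_vectors) -> float:
    """Mean magnitude of output change under interventions on one concept;
    near-zero values flag concepts the model has learned to ignore."""
    return float(np.mean([np.linalg.norm(v) for v in icace_vectors]))

# Hypothetical audit contrasting a demographic with a task-relevant concept.
audit = {
    "gender":   concept_sensitivity([[0.01, -0.01, 0.0], [0.02, 0.0, -0.02]]),
    "symptoms": concept_sensitivity([[0.4, -0.3, -0.1], [0.5, -0.2, -0.3]]),
}
# A debiased model would show the 'gender' sensitivity near zero while
# remaining responsive to 'symptoms'.
```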
6. Implications, Limitations, and Best Practices
LIBERTy provides structural counterfactuals grounded in the known data-generating process, in contrast to previous human-edited datasets such as CEBaB, which only approximate causal effects and are susceptible to LLM “gaming.” Use of SCM-guided interventions yields gold-standard reference targets.
Observed best-case metrics (ED ≈ 0.32, OF ≈ 0.86) indicate that substantial room for improvement remains in both local and global faithfulness of concept-based explanations.
In high-stakes domains, best practices include explicit SCM design, deterministic grounding of exogenous variables, order-faithfulness evaluation, and concept sensitivity audits, especially for demographic features. This ensures that explanation methods are compared to true causal effects rather than heuristics or text edits. A plausible implication is that further methodological advances are needed to realize fully faithful explanations for practical deployment.
LIBERTy is positioned as a reproducible, principled pipeline for the generation and benchmarking of interventional text classification datasets and methods, offering a new standard for causal explainability evaluation in LLMs (Toker et al., 15 Jan 2026).