Token-Level Rationale Annotations
- Token-level rationale annotations are binary or scalar masks over text tokens that identify the minimal subset justifying a model’s decision.
- Methodologies such as prompting-based, attribution-based, and MIL approaches extract these rationales, influencing both alignment with human judgments and model faithfulness.
- Empirical findings show that fine-tuning enhances model accuracy and rationale alignment, though computational efficiency and optimal metric evaluation remain challenges.
Token-level rationale annotations are token-wise binary indicators or scalar masks over input sequences that identify the minimal subset of tokens which serve as justifications or evidence for a model’s decision. They arise in supervised learning, interpretability research, and evaluation of explanation methods, and are essential for assessing model faithfulness and human alignment in natural language processing systems. This entry surveys the taxonomy, annotation protocols, extraction methods, metrics, empirical findings, and methodological challenges surrounding token-level rationales.
1. Definition and Formalization
A token-level rationale is formally defined as a mask $r \in [0,1]^n$ over an input sequence $x = (x_1, \dots, x_n)$, with $r_i$ indicating the relative importance (or binary selection) of token $x_i$ with respect to the model’s predicted label $\hat{y}$. Human-provided “gold” rationales are typically $r^h \in \{0,1\}^n$, where $r^h_i = 1$ if $x_i$ is marked salient by annotators, whereas model-extracted rationales may be soft or hard masks, depending on the extraction method (Fayyaz et al., 2024, Kamp et al., 20 Nov 2025).
Token-level rationales serve as granular evidence for classification or sequential prediction tasks. In the most rigorous setting, a rationale is “sufficient” if masking the non-rationale tokens preserves the model’s original output (faithfulness), and “aligned” if the selected tokens match human judgments (plausibility) (Fayyaz et al., 2024, Carton et al., 2021).
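The sufficiency criterion above can be made concrete in a short sketch. This is a minimal illustration, not any cited implementation: the `toy` classifier, the `[MASK]` convention, and the function names are assumptions for demonstration only.

```python
from typing import Callable, List

def is_sufficient(
    tokens: List[str],
    rationale: List[int],          # binary mask, 1 = rationale token
    predict: Callable[[List[str]], str],
    mask_token: str = "[MASK]",
) -> bool:
    """A rationale is 'sufficient' if replacing all non-rationale tokens
    with a mask token preserves the model's original prediction."""
    original = predict(tokens)
    masked = [t if keep else mask_token for t, keep in zip(tokens, rationale)]
    return predict(masked) == original

# Toy classifier (illustrative): "positive" iff "great" survives masking.
toy = lambda toks: "positive" if "great" in toks else "negative"

tokens = ["the", "movie", "was", "great", "overall"]
print(is_sufficient(tokens, [0, 0, 0, 1, 0], toy))  # True: mask covers the evidence
print(is_sufficient(tokens, [1, 0, 0, 0, 0], toy))  # False: evidence is ablated
```

In practice `predict` would wrap a trained classifier and masking conventions vary (token deletion vs. replacement), which is itself a known source of metric disagreement.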
2. Annotation Protocols and Datasets
Annotation protocols for gold rationales vary by task. In e-SNLI, annotators highlight tokens in both premise and hypothesis that explain the entailment label, producing binary bitmasks after subtokenization (e.g., BERT WordPieces) (Thorne et al., 2019, Fayyaz et al., 2024). “BiasLab” collects spans mapped to seven predefined bias indicators, requiring crowdworkers both to select a rationale type and to highlight the corresponding text in news snippets (Solaiman, 21 May 2025). Datasets for language modeling (Lambada) collect role-based rationales: selectors nominate informative tokens and predictors guess the answer, with the rationale defined as the minimal subset required for correct prediction; inter-annotator agreement (IAA) is reported for this protocol (Vafa et al., 2021).
Most annotation schemas require explicit marking of rationales at word, subword, or phrase levels, with protocols for quality control (qualification tests, overlap assignments, or union of annotator responses). Rationale density varies strongly by domain—from 3–6 tokens per explanation in SNLI/e-SNLI to longer or denser highlights in argument mining and toxicity detection settings (Kamp et al., 20 Nov 2025).
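Producing bitmasks "after subtokenization" requires projecting word-level highlights onto subword tokens. A common convention, sketched below, is for every subword piece to inherit its source word's label; `toy_tok` is a hypothetical stand-in for a real WordPiece tokenizer.

```python
from typing import Callable, List, Tuple

def project_rationale(
    words: List[str],
    word_mask: List[int],
    tokenize_word: Callable[[str], List[str]],
) -> Tuple[List[str], List[int]]:
    """Project a word-level binary rationale onto subword tokens:
    each subword inherits the label of the word it came from."""
    subtokens, sub_mask = [], []
    for word, label in zip(words, word_mask):
        pieces = tokenize_word(word)      # e.g. BERT WordPiece pieces
        subtokens.extend(pieces)
        sub_mask.extend([label] * len(pieces))
    return subtokens, sub_mask

# Stand-in tokenizer that splits long words, mimicking WordPiece behaviour.
toy_tok = lambda w: [w] if len(w) <= 4 else [w[:4], "##" + w[4:]]

toks, mask = project_rationale(["unbelievably", "good"], [1, 0], toy_tok)
# toks -> ['unbe', '##lievably', 'good'], mask -> [1, 1, 0]
```

With a real tokenizer (e.g., one exposing a token-to-word index mapping), the same inheritance rule applies; only the `tokenize_word` stand-in changes.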
3. Methods for Rationale Extraction
Extraction approaches fall into prompting-based, attribution-based, and algorithmic optimization frameworks:
- Prompting-Based Self-Explanation: LLMs are prompted with context-specific instructions (e.g., “Which words support label $\hat{y}$?”). Constraints on the number of selected tokens (Top-Var: match the human rationale count; Top-Ratio: a fixed percentage of the input; Unbound: any number) strongly influence alignment (Fayyaz et al., 2024). Output spans are parsed into token sets (typically $r_i = 1$ if token $x_i$ is selected, else $0$).
- Attribution-Based Techniques: Attention weights ($\alpha_i$), gradient-based saliency ($\lVert \nabla_{x_i} \hat{y} \rVert$), and Input×Gradient ($x_i \cdot \nabla_{x_i} \hat{y}$) provide scalar scores for each token (Fayyaz et al., 2024, Bujel et al., 2023). Raw attributions are normalized into masks (e.g., by thresholding to the top-$k$ tokens).
- Combinatorial and MIL Approaches: Greedy rationalization minimizes rationale size subject to preserving the prediction: $\min_{S \subseteq \{1, \dots, n\}} |S|$ s.t. $f(x_S) = f(x)$ (Vafa et al., 2021). Multiple Instance Learning (MIL) regularizes attention so that thresholded attention weights become interpretable rationales, using entropy penalties, “bag-label” constraints, and minimum-weight regularizers (Thorne et al., 2019).
- Unsupervised and Hybrid Models: Compositional Soft Attention (C-SA) applies RoBERTa sentence-wise and globally pools token relevance via a soft-attention layer, enabling efficient rationale extraction in long documents without token-level supervision (Bujel et al., 2023). Losses include extremal-score regularizers, ranked soft-attention, and document label reconstruction.
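The attribution-based route above reduces, at binarization time, to keeping the top-$k$ scored tokens. A minimal sketch of that thresholding step (function name and tie-breaking are illustrative choices, not from the cited work):

```python
from typing import List

def topk_mask(scores: List[float], k: int) -> List[int]:
    """Binarize scalar attribution scores (attention, gradient saliency,
    Input x Gradient) into a hard rationale by keeping the top-k tokens."""
    if k <= 0:
        return [0] * len(scores)
    # Rank token indices by score, highest first; ties break by position.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:k])
    return [1 if i in keep else 0 for i in range(len(scores))]

# Example: attention weights over five tokens, with k = 2
# (e.g., k set to the human rationale length, Top-Var style).
scores = [0.05, 0.40, 0.10, 0.35, 0.10]
print(topk_mask(scores, 2))  # [0, 1, 0, 1, 0]
```

The choice of $k$ mirrors the prompting constraints: Top-Var sets $k$ to the human rationale count, Top-Ratio to a fixed fraction of the input length.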
4. Metrics: Alignment, Faithfulness, and Sufficiency
- Alignment (Plausibility): Agreement between a model rationale $r^m$ and a human rationale $r^h$ is evaluated by token-level Precision, Recall, and $F_1$, or Intersection over Union (IoU). For binary masks, the alignment score is
$$F_1 = \frac{2PR}{P + R}, \qquad P = \frac{|r^m \cap r^h|}{|r^m|}, \qquad R = \frac{|r^m \cap r^h|}{|r^h|}.$$
- Faithfulness (Causal Impact): Faithfulness quantifies the causal effect of rationale tokens on the prediction through ablation—masking the rationale tokens ($x \setminus r$), re-running the model, and recording the flip rate:
$$\text{FlipRate} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\!\left[\, f\!\left(x^{(i)} \setminus r^{(i)}\right) \neq f\!\left(x^{(i)}\right) \right]$$
(Fayyaz et al., 2024). Higher flip rates indicate greater causal influence.
- Sufficiency / Contextual Impact (CI): For gold rationales $r^h$, sufficiency is
$$\text{CI} = p_\theta(\hat{y} \mid x) - p_\theta(\hat{y} \mid x_{r^h})$$
(with $p_\theta(\hat{y} \mid x)$ the model’s probability on the full input, and $p_\theta(\hat{y} \mid x_{r^h})$ its probability on the input masked down to the rationale tokens) (Kamp et al., 20 Nov 2025). $\text{CI} \approx 0$ signifies the rationale alone suffices for the label.
- Token Classification Score: Fractional improvement (over a majority-class baseline) of a token classifier’s $F_1$ at predicting rationale bits,
$$TC = \frac{\text{token-}F_1^{\mathcal{T}}}{\text{token-}F_1^{\mathcal{B}}},$$
where $\mathcal{T}$ denotes the trained token classifier and $\mathcal{B}$ the majority baseline.
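Under the binary-mask assumptions above, the alignment and faithfulness metrics can be sketched in a few lines; the toy `predict` and ablation functions in the test of these helpers are illustrative stand-ins, not the evaluation code of the cited papers.

```python
from typing import Callable, List, Tuple

def token_f1(
    pred_mask: List[int], gold_mask: List[int]
) -> Tuple[float, float, float]:
    """Token-level precision, recall, and F1 between predicted and gold
    binary rationale masks (the alignment/plausibility score)."""
    tp = sum(1 for p, g in zip(pred_mask, gold_mask) if p and g)
    pred_pos, gold_pos = sum(pred_mask), sum(gold_mask)
    precision = tp / pred_pos if pred_pos else 0.0
    recall = tp / gold_pos if gold_pos else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

def flip_rate(
    examples: List[Tuple[List[str], List[int]]],
    predict: Callable[[List[str]], str],
    ablate: Callable[[List[str], List[int]], List[str]],
) -> float:
    """Fraction of examples whose prediction changes after ablating the
    rationale tokens -- the faithfulness (causal impact) metric."""
    flips = sum(
        predict(ablate(tokens, rationale)) != predict(tokens)
        for tokens, rationale in examples
    )
    return flips / len(examples)

p, r, f1 = token_f1([1, 1, 0, 0], [1, 0, 1, 0])
# One true positive out of two predicted and two gold tokens:
# precision = 0.5, recall = 0.5, F1 = 0.5
```

IoU and the TC ratio follow the same counting pattern; only the normalization differs.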
5. Empirical Findings and Comparative Analyses
Recent empirical studies reveal critical distinctions:
- Prompting-based rationales generally achieve higher alignment (up to 59.1% in e-SNLI) than attribution techniques under zero-shot conditions, but are less faithful as measured by ablation flip rates; task-specific constraints (Top-Var) improve alignment consistency by 4–8 points (Fayyaz et al., 2024).
- Fine-tuning improves both task accuracy (e.g., on e-SNLI) and alignment for attribution methods (+8–10 points post-fine-tuning), and more robustly increases faithfulness, with markedly higher post-fine-tuning ablation flip rates (Fayyaz et al., 2024). Zero-shot models often neglect the input and focus on the instructions, evidenced by near-zero flip rates when the input alone is masked.
- Attribution-based rationales can outperform prompting methods in faithfulness after fine-tuning, with attribution flip rates exceeding 60% in some cases—closely matching the effect of masking human rationales (Fayyaz et al., 2024).
- Sufficiency/CI vs. Token Classification: CI and token-classification scores are largely uncorrelated; high rationale informativeness does not guarantee easy extraction by classifiers, and vice versa (Kamp et al., 20 Nov 2025). Context tokens may interfere with rationale informativeness, complicating optimization.
- Supervised Objective Design: Naïve token-level cross-entropy on rationale masks can fail to improve downstream accuracy; class-weighted rationale loss, importance embeddings, and sentence-level aggregation yield stronger results (e.g., FEVER rationale $F_1$ up to $81.2$, alongside accuracy gains) (Carton et al., 2021).
- Computational Considerations: Black-box explainers (LIME) achieve the highest precision and recall but are intractably slow (e.g., $64$ s per instance, far slower than MIL-regularized attention) (Thorne et al., 2019). C-SA improves sentiment rationale extraction while reducing per-epoch runtime by at least $30$% relative to Longformer (Bujel et al., 2023).
6. Practical Recommendations and Open Questions
Best practices include:
- Ensure Adequate Task Performance: Faithfulness metrics are unreliable when the underlying classifier performs poorly. Fine-tuning before evaluation is essential (Fayyaz et al., 2024).
- Reporting Multiple Metrics: Alignment ($F_1$ with human masks) and faithfulness (flip rate under ablation) must be reported together to assess both plausibility and causal fidelity (Fayyaz et al., 2024).
- Length Control for Plausibility: Prompting with rationale-length constraints (Top-Var, Top-Ratio) leads to higher alignment and reproducible rationales (Fayyaz et al., 2024).
- Precision vs. Recall: Penalty asymmetry (favoring recall in rationale extraction) is critical—missing gold rationale tokens is more deleterious than including extra non-rationales (Carton et al., 2021).
- Granularity and Aggregation: Coarser units (sentence-level rather than token-level) for annotation and supervision can reduce noise and mitigate overfitting to idiosyncratic token highlights (Carton et al., 2021, Kamp et al., 20 Nov 2025).
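The length-control recommendation above can be made concrete as a prompt builder. The wording and function below are hypothetical illustrations of the Top-Var/Top-Ratio/Unbound regimes, not the exact prompts used in the cited work.

```python
from typing import Optional

def rationale_prompt(
    text: str,
    label: str,
    mode: str = "top_var",
    human_count: Optional[int] = None,   # Top-Var: match human rationale size
    ratio: float = 0.2,                  # Top-Ratio: fraction of input length
) -> str:
    """Build a rationale-elicitation prompt with an optional length
    constraint (illustrative wording only)."""
    n_tokens = len(text.split())
    if mode == "top_var":
        k = human_count                  # match the human rationale count
    elif mode == "top_ratio":
        k = max(1, round(ratio * n_tokens))
    else:                                # "unbound": no constraint
        k = None
    constraint = f" Select exactly {k} word(s)." if k else ""
    return (f"Text: {text}\nLabel: {label}\n"
            f"Which words support the label?{constraint}")

print(rationale_prompt("the movie was great overall", "positive",
                       mode="top_ratio"))
```

Pinning the rationale length this way is what makes prompted rationales comparable across runs and against fixed-length attribution masks.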
Open questions persist regarding:
- Metric Adequacy: Sufficiency/CI, comprehensiveness (removal-based), and alignment all offer partial and sometimes inconsistent insights; consensus on metrics remains elusive (Kamp et al., 20 Nov 2025).
- Inter-annotator Agreement: Token-level rationale marking remains user- and context-dependent, with limited reporting on span-level IAA in most settings (Solaiman, 21 May 2025, Vafa et al., 2021).
- Hybrid Methods: The potential for hybrid schemes (e.g., using prompting to seed attribution or vice versa) and advanced regularization remains underexplored (Fayyaz et al., 2024).
- Cross-domain and Multilingual Transfer: Rationale supervision can yield cross-domain gains, but generality and annotation-protocol effects require further study (Kamp et al., 20 Nov 2025).
7. Data Releases and Benchmark Resources
Several datasets and toolkits provide standard evaluation and research infrastructure:
| Dataset/Resource | Annotation Type | Domain/Task |
|---|---|---|
| e-SNLI | Human token-level highlights | NLI, 3-label, premise/hypothesis |
| BiasLab | Span + indicator rationale | Political news, bias detection |
| Lambada | Role-based predictor rationales | Language modeling, cloze tests |
| De–En Alignments | Word alignments as rationales | Machine Translation |
| BEA-2019, FCE | Token error annotation | Grammatical Error Detection |
| IMDB-Pos/Neg | Token-level sentiment marks | Review classification |
Full annotation files, rationale masks, model outputs, and explanatory code are often released alongside the benchmarks (Fayyaz et al., 2024, Vafa et al., 2021, Solaiman, 21 May 2025).
The field of token-level rationale annotation intersects annotation theory, interpretability methodology, supervised and unsupervised modeling, and evaluation frameworks. Advances in learning objectives, annotation granularity, and extraction algorithms continue to refine both the plausibility and faithfulness of model-generated rationales. Comprehensive reporting across alignment and faithfulness measures, with robust annotation protocols and cross-domain assessment, remains the gold standard for scientific progress in this area (Fayyaz et al., 2024, Kamp et al., 20 Nov 2025, Carton et al., 2021).