
Text Aphasia Battery (TAB)

Updated 2 December 2025
  • The Text Aphasia Battery (TAB) is a clinically grounded benchmark that assesses aphasia-like language deficits in LLMs using text-only tasks.
  • It comprises four subtests—Connected Text, Word Comprehension, Sentence Comprehension, and Repetition—each targeting a different linguistic domain, such as lexical access or syntactic processing.
  • Automated scoring with instruction-tuned LLM judges and exact-match routines yields objective, reproducible evaluation with reliability comparable to expert human annotation.

The Text Aphasia Battery (TAB) is a clinically grounded, text-only benchmark designed to characterize aphasia-like language deficits in LLMs. Adapted from standard clinical aphasia assessments—specifically, the Quick Aphasia Battery (QAB) and the APROCSA connected-speech framework—TAB isolates core language faculties by excluding non-textual modalities and pragmatic demands not shared with artificial systems. TAB addresses the need for a scalable, objective, and fine-grained protocol for measuring breakdowns in lexical, syntactic, and discourse competence in LLMs, including under "virtual lesion" (ablation) conditions (Roll et al., 25 Nov 2025).

1. Motivation and Clinical Rationale

Conventional aphasia assessments, such as WAB-R, BDAE, CAT, and QAB, rely on multimodal tasks encompassing auditory comprehension, picture-word matching, oral tasks involving prosody and speech-motor control, and measures of pragmatic language use. These components are not directly applicable to text-only LLMs, which lack access to sensory input (visual, auditory), do not enact speech-motor processes, and do not possess communicative intent or self-concept. Standard human rating scales for aphasia are subjective and coarse-grained in this context. TAB preserves clinical validity by adapting only the core constructs likely to transfer across biological and artificial language systems, focusing on the measurable structure of text output (Roll et al., 25 Nov 2025).

The choice to restrict TAB to text responses both constrains the modeled domain—ensuring that only linguistic competence is measured—and facilitates reproducible, automatable scoring of linguistic deficits, such as agrammatism and paraphasias, at scale.

2. Benchmark Structure and Targeted Language Domains

TAB comprises four subtests, each with five items, which together sample principal domains implicated in aphasia:

  1. Connected Text: Elicits multi-sentence narrative or procedural responses to open prompts, scored for 19 APROCSA-adapted error features across lexical, morphosyntactic, fluency, disfluency, perseverative, and coherence categories.
  2. Word Comprehension: Forced-choice selection task requiring verbatim single-token responses from a six-word array, probing lexical-semantic retrieval.
  3. Sentence Comprehension: Yes/No questions targeting passive syntax, negation, and logical reasoning.
  4. Repetition: Verbatim repetition tasks increasing in linguistic complexity, requiring exact output matching with sensitivity to case, punctuation, and whitespace.

The following table summarizes the structure and principal focus of each subtest:

| Subtest | Task Type | Main Linguistic Domains Targeted |
|---|---|---|
| Connected Text | Narrative/procedural | Lexical, syntax, discourse, fluency |
| Word Comprehension | Forced-choice | Lexical-semantic access |
| Sentence Comprehension | Yes/No questions | Syntactic/semantic comprehension |
| Repetition | Verbatim reproduction | Short-term verbal memory, phonology |

Each subtest is engineered for modality-independence, removing dependencies on spoken language, gesture, or context.
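
For orientation, the battery's structure can be captured in a small data model. The following is an illustrative sketch only; the class and field names are assumptions, not part of the published protocol:

```python
# Illustrative data model of TAB's four subtests (4 subtests x 5 items each).
# Class, field, and variable names are assumptions made for this sketch.
from dataclasses import dataclass

@dataclass
class Subtest:
    name: str
    task_type: str
    scoring: str          # "feature_annotation" or "exact_match"
    num_items: int = 5    # each TAB subtest has five items

TAB_SUBTESTS = [
    Subtest("Connected Text", "narrative/procedural prompt", "feature_annotation"),
    Subtest("Word Comprehension", "forced choice from a six-word array", "exact_match"),
    Subtest("Sentence Comprehension", "yes/no question", "exact_match"),
    Subtest("Repetition", "verbatim reproduction", "exact_match"),
]
```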

3. Scoring System and Quantification of Deficits

Scoring in TAB is strictly objective and supports automation:

  • Connected Text: Each response is annotated for 19 binary features. Categories and representative features include:
    • Lexical: anomia, semantic and phonemic paraphasias, neologisms
    • Fluency/productivity: empty speech, simplified utterances
    • Morphosyntactic: omission of bound morphemes, omission of function words, paragrammatism
    • Disfluency: abandoned utterances, false starts, retracing, conduite d’approche
    • Perseverative: perseverations, stereotypies/automatisms
    • Coherence: jargon, unclear meaning, off-topic
    • Overall communication impairment

The sum for each item is:

$$\text{Score}_{CT}(\text{item}) = \sum_{f=1}^{19} \text{flag}_f, \quad \text{where } \text{flag}_f \in \{0, 1\}$$

Total for five prompts (subject-level composite):

$$\text{Score}_{CT}(\text{total}) = \sum_{i=1}^{5} \text{Score}_{CT}(\text{item}_i), \quad \text{max} = 95$$

  • Word Comprehension, Sentence Comprehension, Repetition: Each scored as binary correct/incorrect (1/0). For word comprehension, chance accuracy is 1/6 due to six options. Repetition requires strict verbatim reproduction.
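
The scoring rules above translate directly into code. The following is a minimal sketch, assuming per-item annotations arrive as 19 boolean flags (Connected Text) or as response/target string pairs (the other subtests); function names are illustrative:

```python
# Minimal sketch of TAB scoring. Connected Text items are scored as the count
# of the 19 binary error features flagged present; the other subtests use
# strict exact-match comparison.
from typing import Dict, List

def score_ct_item(flags: Dict[str, bool]) -> int:
    """Connected Text: number of the 19 binary error features marked present."""
    assert len(flags) == 19, "TAB defines 19 APROCSA-adapted features"
    return sum(int(present) for present in flags.values())

def score_ct_total(items: List[Dict[str, bool]]) -> int:
    """Subject-level composite over the five prompts (maximum 5 * 19 = 95)."""
    return sum(score_ct_item(flags) for flags in items)

def exact_match(response: str, target: str) -> int:
    """Word/Sentence Comprehension and Repetition: verbatim comparison,
    sensitive to case, punctuation, and whitespace."""
    return int(response == target)
```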

No composite “Aphasia Quotient” is predefined; results may be aggregated as vectors or normalized percentages.
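
One possible aggregation into a normalized profile vector is sketched below; since TAB defines no composite quotient, this particular normalization is an assumption made for illustration only:

```python
# One possible 0-1 profile vector over the four subtests. The normalization
# scheme is an illustrative choice, not part of the published protocol.
from typing import Dict, List

def tab_profile(ct_total: int,
                word: List[int],
                sentence: List[int],
                repetition: List[int]) -> Dict[str, float]:
    return {
        "connected_text_error_rate": ct_total / 95,              # 95 = max composite
        "word_comprehension": sum(word) / len(word),              # per-item 0/1 scores
        "sentence_comprehension": sum(sentence) / len(sentence),
        "repetition": sum(repetition) / len(repetition),
    }
```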

4. Automated Evaluation Protocol and Reliability

Automation of the Connected Text subtest employs instruction-tuned LLMs as annotators (e.g., Gemini 2.5 Flash), using in-context few-shot prompts with definitions and examples for all 19 features. The model is prompted to output JSON with feature presence/absence per transcript. For Word/Sentence Comprehension and Repetition, exact-match string comparison suffices.
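
A hypothetical wiring of this annotation step is sketched below; the prompt wording, the `call_judge` helper, and the truncated feature list are placeholders, and only the expected output shape (one boolean per feature, returned as JSON) follows the protocol description:

```python
# Hypothetical wiring for the automated Connected Text annotator. The judge
# wrapper, prompt wording, and abbreviated feature list are placeholders; the
# protocol only specifies that an instruction-tuned judge receives feature
# definitions plus few-shot examples and returns JSON presence/absence flags.
import json
from typing import Dict

FEATURES = [
    "anomia", "semantic_paraphasia", "phonemic_paraphasia", "neologism",
    # ... the remaining APROCSA-adapted features (19 in total) would be listed here ...
]

def annotate_transcript(transcript: str, call_judge) -> Dict[str, bool]:
    prompt = (
        "You are an annotator of aphasia features in text.\n"
        "For each feature below, answer true if present, false if absent.\n"
        f"Features: {', '.join(FEATURES)}\n"
        "Respond with a JSON object mapping each feature to true/false.\n\n"
        f"Transcript:\n{transcript}"
    )
    raw = call_judge(prompt)        # caller-supplied wrapper around the judge LLM's API
    flags = json.loads(raw)         # expected shape: {"anomia": false, ...}
    return {f: bool(flags.get(f, False)) for f in FEATURES}
```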

Inter-rater reliability for feature annotation is quantified via prevalence-weighted Cohen’s κ:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

where $p_o$ is the observed agreement and $p_e$ the chance agreement. Prevalence-weighted κ aggregates agreement across features proportionally to their frequency. In validation, prevalence-weighted κ was 0.286 for human–human agreement and 0.255 for model (Gemini)–consensus agreement, both within the "fair" agreement range. This establishes that the automated protocol closely approaches expert human annotation consistency (Roll et al., 25 Nov 2025).
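
The sketch below computes per-feature Cohen's κ and aggregates it, under the assumption that prevalence weighting means a weighted average of per-feature κ values with weights proportional to feature frequency; the paper's exact aggregation may differ:

```python
# Per-feature Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), aggregated with
# prevalence weights. The aggregation scheme is an assumption for illustration.
from typing import Dict, List

def cohens_kappa(rater_a: List[int], rater_b: List[int]) -> float:
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n   # observed agreement
    p1, p2 = sum(rater_a) / n, sum(rater_b) / n               # marginal "present" rates
    p_e = p1 * p2 + (1 - p1) * (1 - p2)                       # chance agreement
    return 0.0 if p_e == 1.0 else (p_o - p_e) / (1 - p_e)

def prevalence_weighted_kappa(ann_a: Dict[str, List[int]],
                              ann_b: Dict[str, List[int]]) -> float:
    """Average per-feature kappa, weighted by each feature's prevalence."""
    weights, kappas = [], []
    for feature, flags_a in ann_a.items():
        flags_b = ann_b[feature]
        prevalence = (sum(flags_a) + sum(flags_b)) / (2 * len(flags_a))
        weights.append(prevalence)
        kappas.append(cohens_kappa(flags_a, flags_b))
    total = sum(weights)
    return sum(w * k for w, k in zip(weights, kappas)) / total if total else 0.0
```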

5. Empirical Validation and Key Findings

TAB was validated on 561 text samples: 306 from AphasiaBank (transcripts produced by people with aphasia) and 255 from lesioned LLMs (GPT-2, Pythia, and Llama with 10–40% simulated ablations). Five expert speech-language pathologists (SLPs) annotated all 19 connected-text features for each sample, with 82 samples annotated by multiple raters for reliability computation.

Key findings:

  • Automated (Gemini-based) feature identification achieved human-level reliability: κ ≈ 0.25–0.29 under prevalence-weighting.
  • Subtests 2–4 (word and sentence comprehension, repetition) are reliably automatable by exact-matching routines.
  • The full 19-feature error profile enables the phenotyping of LLM breakdowns mirroring established clinical aphasic syndromes (Roll et al., 25 Nov 2025).

This suggests the TAB framework enables large-scale, systematic analysis of language deficits in LLMs with precision comparable to clinical expert annotation.

6. Protocol Administration and Best Practices

For practical use, administration follows these guidelines:

  • For Connected Text, present five open-ended prompts; responses are analyzed by a prompted judge LLM (e.g., Gemini), which receives:

    1. System prompt as aphasia-feature annotator
    2. Definitions for all features
    3. Two few-shot labeled examples
    4. Target transcript for annotation
  • For the forced-choice and repetition subtests, require verbatim answers only (a single word for forced choice; the exact target string for repetition), apply strict whitespace/case/punctuation matching, and disallow extra commentary or justification.

  • Score aggregation may sum raw error counts or compute percent-correct for each subtest; normalization (e.g., 0–1) is optional for reporting profile vectors.
  • When reporting human–automated agreement, use prevalence-weighted κ.

Best practice cautions include avoiding meta-commentary in model queries and ensuring robust instruction-following capability in the judge LLM.
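
The four-part judge input described above can be assembled as a chat-style message list; the sketch below uses placeholder strings throughout, since the actual system prompt, feature definitions, and few-shot examples are not reproduced here:

```python
# Illustrative assembly of the four-part judge input as chat messages. Every
# string below is a placeholder standing in for the real TAB material.
from typing import Dict, List, Tuple

def build_judge_messages(feature_definitions: str,
                         few_shot_examples: List[Tuple[str, str]],
                         transcript: str) -> List[Dict[str, str]]:
    messages = [
        # 1. System prompt casting the judge as an aphasia-feature annotator
        {"role": "system", "content": "You annotate aphasia features in text "
                                      "and reply with JSON only."},
        # 2. Definitions for all 19 features
        {"role": "user", "content": feature_definitions},
    ]
    # 3. Two few-shot labeled examples (transcript, JSON annotation)
    for example_transcript, example_json in few_shot_examples:
        messages.append({"role": "user", "content": example_transcript})
        messages.append({"role": "assistant", "content": example_json})
    # 4. Target transcript for annotation
    messages.append({"role": "user", "content": transcript})
    return messages
```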

7. Current Limitations and Future Development

TAB is not a substitute for human clinical diagnosis; it omits the auditory, prosodic, motor, and psychosocial measures integral to clinical assessment. The small item set (five per subtest) may incur ceiling or floor effects, and the battery currently lacks item-response calibration. Automated scoring depends on a single judge LLM per annotation, leaving it potentially susceptible to model-specific bias. The prompt set is currently English-only and culturally specific.

Proposed extensions include:

  • Expanding subtest item pools (with greater syntactic and semantic variety)
  • Applying item-response theory analyses for psychometric refinement
  • Employing multiple annotating models, or hybrid models-plus-expert protocols for critical features
  • Developing equivalents for other languages and cultural contexts
  • Investigating a formal “TAB Aphasia Quotient” or analogous severity metric by normalizing subtest results

These advances will extend TAB’s sensitivity and generalizability as a benchmark for aphasia-like phenomena in both artificial and human language systems (Roll et al., 25 Nov 2025).
