Thematic Apperception Test (TAT)
- The Thematic Apperception Test (TAT) is a projective technique that uses ambiguous images to evoke narratives, aiding in the assessment of underlying motives and social cognition.
- Modern adaptations involve structured prompt engineering in LLMs and quantitative scoring via the SCORS-G rubric to evaluate narrative depth and affect.
- Empirical findings show LLMs can generate complex narratives similar to human responses, though they lack true introspection, prompting considerations for AI transparency and safety.
The Thematic Apperception Test (TAT) is a projective psychological assessment developed to elicit and quantify narrative responses to ambiguous pictorial stimuli. Traditionally applied in clinical, developmental, and personality research to probe underlying motives, social cognition, and internal self-concepts, the TAT has recently been operationalized as a diagnostic instrument for evaluating human-like cognitive patterns in LLMs. This adaptation leverages the TAT’s ability to elicit unconstrained, high-variance stories that probe not just surface-level language generation, but deeper narrative coherence, affect, and theory-of-mind capabilities (Kundu et al., 22 Jun 2025).
1. Canonical Stimulus Design and Adaptation for Machine Psychology
The TAT stimulus set originates from the Murray (1935) corpus, a collection of 31 visually ambiguous images depicting people in a range of interpersonal and intrapersonal scenarios. In modern machine psychology applications, specifically as described by Kundu and Goswami, a subset of 30 images was randomly selected from the canonical set to balance coverage against over-familiarity and minimize the risk of LLMs recalling memorized public responses. Selection criteria centered on maximizing ambiguity (ensuring no correct interpretation exists), capturing a breadth of social scenarios (including conflict, attachment, achievement, and loss), and limiting public availability.
Each stimulus was digitized and anonymized (e.g., “Picture 1” through “Picture 30”), with presentation occurring via embedded graphics or URLs to the tested LLMs (Kundu et al., 22 Jun 2025).
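A minimal sketch of this selection-and-anonymization step is given below; the directory layout, file extension, and random seed are illustrative assumptions, not details reported by Kundu et al. (22 Jun 2025).

```python
# Sketch: sample a subset of digitized TAT cards and assign anonymized labels.
# Paths, extension, and seed are hypothetical.
import random
from pathlib import Path

def select_and_anonymize(corpus_dir: str, k: int = 30, seed: int = 0) -> dict[str, Path]:
    """Randomly sample k cards from the digitized corpus and map them to
    anonymized labels ("Picture 1" ... "Picture k")."""
    rng = random.Random(seed)
    cards = sorted(Path(corpus_dir).glob("*.png"))
    chosen = rng.sample(cards, k)
    return {f"Picture {i + 1}": path for i, path in enumerate(chosen)}

if __name__ == "__main__":
    stimuli = select_and_anonymize("tat_cards/", k=30)
    for label, path in list(stimuli.items())[:3]:
        print(label, "->", path.name)
```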
2. Structured Prompt Engineering and Narrative Elicitation
Implementation in LLMs involves a fixed, meticulously structured prompt template. Each prompt frames the LLM as a clinical psychologist and instructs it to “tell a story about: what led up to the event shown, what is happening at the moment, what the characters are feeling and thinking, and what the outcome of the story was.” This instruction, following image presentation, is kept invariant across trials and models, ensuring comparability of elicited narratives across different architectures and training regimes.
The prompt design operationalizes the core mechanics of TAT: the forced projection of motives, affective states, and anticipated outcomes onto ambiguous visual stimuli. Models such as GPT-4o and QVQ-72B-preview, when exposed to these prompts and images, generate multi-paragraph narratives exhibiting variable degrees of complexity, emotional nuance, and social causality (Kundu et al., 22 Jun 2025).
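The sketch below illustrates one way the invariant instruction could be paired with an anonymized image and sent to a multimodal chat model via the OpenAI Python SDK; the exact wording of the system framing, the decoding parameters, and the delivery mechanism used by the authors are assumptions made for illustration.

```python
# Sketch of narrative elicitation with a fixed TAT-style prompt.
# The clinical-psychologist framing and story instruction follow the paper's
# description; model choice and API usage here are illustrative.
from openai import OpenAI

TAT_PROMPT = (
    "You are a clinical psychologist administering the Thematic Apperception Test. "
    "Look at the picture and tell a story about: what led up to the event shown, "
    "what is happening at the moment, what the characters are feeling and thinking, "
    "and what the outcome of the story was."
)

def elicit_narrative(image_url: str, model: str = "gpt-4o") -> str:
    """Send one anonymized TAT image plus the fixed instruction; return the narrative."""
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": TAT_PROMPT},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content
```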
3. Quantitative Analysis via the SCORS-G Rubric
Interpretation and scoring employ the Social Cognition and Object Relations Scale—Global (SCORS-G), an eight-factor rubric quantifying distinct dimensions of narrative depth and social cognition. Each of the following dimensions is rated on a 1–5 scale (with occasional extension to 1–6):
- COM (Complexity of Representation of People)
- AFF (Affective Quality of Representations)
- EIR (Emotional Investment in Relationships)
- EIM (Emotional Investment in Moral/Value Standards)
- SC (Understanding of Social Causality)
- AGG (Experience and Management of Aggressive Impulses)
- SE (Self-Esteem)
- ICS (Identity and Coherence of Self)
Scoring definitions specify, for example, COM = 1 as "extremely distorted or no internal states," COM = 3 as "step-by-step superficial," and COM = 5 as "multiple perspectives, nuanced." For each dimension $d$, the mean and standard deviation across the $N = 30$ images are computed as

$$\bar{x}_d = \frac{1}{N}\sum_{i=1}^{N} x_{d,i}, \qquad \sigma_d = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_{d,i} - \bar{x}_d\right)^2},$$

where $x_{d,i}$ denotes the SCORS-G rating assigned to dimension $d$ for image $i$. This framework enables rigorous cross-model quantitative comparison and longitudinal tracking of LLM narrative capacities (Kundu et al., 22 Jun 2025).
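A brief sketch of this aggregation step, using Python's statistics module with the population standard deviation as in the formula above; the scores in the example are illustrative placeholders, not values from the study.

```python
# Sketch: per-dimension mean and standard deviation over 30 image ratings.
import statistics

def summarize(ratings: dict[str, list[float]]) -> dict[str, tuple[float, float]]:
    """Return (mean, population standard deviation) for each SCORS-G dimension."""
    return {
        dim: (statistics.mean(scores), statistics.pstdev(scores))
        for dim, scores in ratings.items()
    }

# Illustrative placeholder scores for two dimensions across 30 images.
example = {"COM": [5, 4, 5] * 10, "AFF": [4, 3, 5] * 10}
for dim, (mu, sigma) in summarize(example).items():
    print(f"{dim}: {mu:.2f} ± {sigma:.2f}")
```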
| Dimension | GPT-4o (Mean ± SD) | QVQ-72B-preview (Mean ± SD) |
|---|---|---|
| COM | 4.7 ± 0.6 | 4.1 ± 0.3 |
| AFF | 4.0 ± 1.1 | 4.0 ± 0.4 |
| EIR | 3.8 ± 0.7 | 3.4 ± 0.5 |
| ICS | 4.5 ± 0.9 | 3.7 ± 0.4 |
Mean scores on SC (Social Causality) and EIM (Moral Investment) were also higher for GPT-4o, with AGG (Aggression Management) in the moderate-to-strong range (approximately 3.5–4.0) for both models.
4. Reliability, Validity, and Statistical Controls
Robustness and reliability are substantiated through dual independent annotation: initial scoring via LLaMA 3.1 405B, followed by manual adjudication. Inter-rater reliability, as indexed by Cohen's κ, indicates strong agreement across all SCORS-G dimensions. Internal consistency, measured by Cronbach's α, demonstrates that no single image disproportionately drives variance in scores. This supports the stability of SCORS-G as a scoring framework in both human and machine-generated narratives (Kundu et al., 22 Jun 2025).
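The sketch below shows how such reliability checks are commonly computed, using scikit-learn's cohen_kappa_score and the standard Cronbach's alpha formula; the rating vectors and score matrix are illustrative placeholders rather than the study's data.

```python
# Sketch: inter-rater agreement (Cohen's kappa) and internal consistency
# (Cronbach's alpha) for SCORS-G ratings. All data below are placeholders.
import numpy as np
from sklearn.metrics import cohen_kappa_score

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Cronbach's alpha for an (n_narratives, n_items) score matrix."""
    item_vars = item_scores.var(axis=0, ddof=1)
    total_var = item_scores.sum(axis=1).var(ddof=1)
    k = item_scores.shape[1]
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Agreement between an LLM-based rater and a human adjudicator on one dimension.
llm_ratings = [4, 5, 4, 3, 5, 4, 4, 3, 5, 4]
human_ratings = [4, 5, 4, 3, 4, 4, 4, 3, 5, 4]
print("Cohen's kappa:", cohen_kappa_score(llm_ratings, human_ratings))

# Internal consistency across the 8 SCORS-G dimensions (random placeholder matrix).
rng = np.random.default_rng(0)
scores = rng.integers(1, 6, size=(30, 8)).astype(float)
print("Cronbach's alpha:", cronbach_alpha(scores))
```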
5. Empirical Findings from LLM TAT Profiling
The TAT adaptation demonstrates that state-of-the-art LLMs can produce narratives exhibiting high narrative complexity, regulated affect, and moderate emotional investment in relationships. Key comparative findings include:
- GPT-4o outperforms QVQ-72B-preview on complexity (COM), identity coherence (ICS), and social causality (SC), with greater variability in internal perspective and narrative outcomes.
- AFF (affective tone) is balanced and consistent across models, clustering around a neutral-positive range.
Notably, both models display an ability to moderate between negative and positive affect, yet their narratives lack pronounced long-term planning, a result aligning with contemporary findings that LLMs can assist in planning but do not possess intrinsic strategic foresight (Kundu et al., 22 Jun 2025; Kambhampati et al., 2024).
6. Interpretation and Theoretical Implications
TAT results in LLMs suggest a capacity to simulate social cognition that extends beyond superficial text generation. GPT-4o, for example, displays narrative flexibility consistent with "trying on" multiple internal perspectives, reminiscent of complex human fantasy and reflection. Both models mirror some human-like cognitive tendencies: coherent cause-and-effect attribution, affective nuance, and investment in moral values.
However, significant caveats remain:
- The SCORS-G rubric is inherently anthropocentric, and LLMs may “game” category cues without genuine introspection or consciousness.
- Projective narrative structure in LLMs arises from statistical pattern synthesis, not intrapersonal experience or self-awareness.
- Findings are currently constrained to two models and one scoring rubric, limiting generalizability to the broader class of LLMs.
This suggests that the TAT provides indirect but valuable quantitative metrics for profiling narrative complexity, emotional valence, and social-cognitive motifs in LLMs, while highlighting essential differences from sentient cognition.
7. Implications for AI Alignment, Transparency, and Safety
The ability to elicit and quantitatively score latent narrative and social motifs in LLMs via the TAT introduces new diagnostic opportunities for AI transparency and alignment research. Projective techniques such as TAT profiling can be repurposed for:
- Detecting actionable latent “motifs” or risk profiles before LLM deployment.
- Identifying susceptibilities to bias, framing, or adversarial prompt injection via projective stimulus design.
- Informing the development of interdisciplinary safeguards as LLMs approach human-like behavioral outputs without underlying consciousness or moral agency.
A plausible implication is that as LLMs increasingly approximate surface-level features of human cognition, diagnostic tools such as the TAT will assume a central role in both fundamental research and practical risk assessment for next-generation artificial agents (Kundu et al., 22 Jun 2025).