phi-2: Causal Language Model Analysis
- The phi-2 causal language model is a 2-billion-parameter, 32-layer decoder-only transformer for left-to-right language modeling; recent work uses it, unmodified, as a testbed for systematic semantic anomaly detection.
- The analysis employs layerwise probing with logistic regression to reveal the emergence of anomaly signals, which peak at layers 24–26 with high AUC.
- Manifold geometry analysis shows that semantic inconsistencies expand representation dimensionality before consolidating into a low-dimensional subspace, paralleling the human N400 response.
The phi-2 causal LLM is a 2-billion-parameter instantiation of the PaLM-2 architecture, designed as a 32-layer, decoder-only transformer optimized for left-to-right language modeling. Recent work has systematically analyzed the internal mechanisms by which phi-2 detects semantic anomalies in sentence completions, employing a precise layerwise probing regimen to uncover both the representational and computational basis of semantic violation detection. These results illuminate when and how the model encodes and consolidates information about semantic plausibility, revealing computational patterns with close analogies to psycholinguistic phenomena such as the human N400 response (Zacharopoulos et al., 24 Nov 2025).
1. Model Architecture and Parameterization
The phi-2 model belongs to the PaLM-2 family and implements a transformer-decoder-only structure as follows:
- Parameter count: 2 billion
- Depth: 32 layers
- Hidden dimensionality: 1,280
- Self-attention: 16 heads per layer, with per-head key/query/value dimensionality of 80
- Feed-forward inner dimension: 4,096, activated with SwiGLU (gated SiLU/Swish)
- Positional encoding: Rotary embeddings integrated in each self-attention block
- Normalization & regularization: Standard pre-norm (layer normalization prior to residual computation), no dropout at inference
- Architecture: Decoder-only, left-to-right autoregressive modeling. No modifications or fine-tuning were applied in the analysis, keeping phi-2 "off the shelf."
This configuration enables the model to process input sequences and produce left-to-right token predictions, with each sub-layer and attention head contributing to the emergent semantic representations at greater depth.
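The hyperparameters listed above can be captured in a small configuration sketch with a built-in consistency check. The field names here are illustrative, not taken from any official phi-2 configuration file:

```python
# Illustrative summary of the phi-2 configuration described above.
# Field names are made up for this sketch, not from an official config file.
PHI2_CONFIG = {
    "n_params": 2_000_000_000,  # 2 billion parameters
    "n_layers": 32,             # decoder-only transformer depth
    "d_model": 1280,            # hidden dimensionality
    "n_heads": 16,              # self-attention heads per layer
    "d_head": 80,               # per-head key/query/value dimension
    "d_ff": 4096,               # feed-forward inner dimension (SwiGLU)
    "pos_encoding": "rotary",   # rotary embeddings in each attention block
    "norm": "pre-norm",         # layer normalization before each sub-layer
}

# The per-head dimensions must tile the model dimension exactly.
assert PHI2_CONFIG["n_heads"] * PHI2_CONFIG["d_head"] == PHI2_CONFIG["d_model"]
```

The assertion makes the internal consistency of the stated numbers explicit: 16 heads of dimension 80 exactly span the 1,280-dimensional hidden state.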
2. Experimental Design: Probing Semantic Violation Detection
The evaluation protocol consisted of two major components: a tailored plausibility corpus and systematic extraction of hidden representations for layer-specific analysis.
2.1 Plausibility Corpus Construction
Researchers constructed a balanced cloze-style dataset. Each sentence adhered to the template:

    The X near the Y Z.

where X and Y were drawn from a closed set of animate/inanimate nouns (e.g., "baker," "school") and Z is a verb-phrase completion. For every sentence with a semantically plausible completion ("bakes bread"), a matched semantically anomalous version was generated by altering the final noun ("bakes sunlight"). The resulting corpus consisted of 100 matched pairs (200 sentences), meticulously controlled for animacy and frequency of the final noun.
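The matched-pair construction can be sketched as follows. The noun and completion lists here are invented for illustration; the study's actual closed word sets and frequency controls are not reproduced:

```python
# Sketch of the cloze-style matched-pair construction described above.
# The word lists are invented; the paper's actual closed sets and
# frequency controls are not reproduced here.
TEMPLATE = "The {X} near the {Y} {Z}."

items = [
    # (X, Y, plausible completion Z, anomalous completion Z')
    ("baker", "school", "bakes bread", "bakes sunlight"),
    ("dog",   "fence",  "chases the cat", "chases the cloud"),
]

def build_corpus(items):
    """Return (sentence, label) pairs; label 1 = plausible, 0 = anomalous."""
    corpus = []
    for x, y, z_plaus, z_anom in items:
        corpus.append((TEMPLATE.format(X=x, Y=y, Z=z_plaus), 1))
        corpus.append((TEMPLATE.format(X=x, Y=y, Z=z_anom), 0))
    return corpus

corpus = build_corpus(items)
# Each item yields one matched plausible/anomalous pair.
assert len(corpus) == 2 * len(items)
```

By construction the corpus is balanced: every plausible sentence has an anomalous twin that differs only in the final noun.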
2.2 Extraction of Layerwise Hidden States
Each sentence was tokenized and fed through phi-2, and at every transformer layer ℓ (ℓ = 1 … 32), the hidden state for the last token (the final noun) was extracted. No downstream tuning was involved; analyses relied exclusively on these per-layer activations from the unmodified model.
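Assuming the per-layer hidden states have already been collected (e.g., via `output_hidden_states=True` in the HuggingFace `transformers` API), the extraction reduces to slicing out one token position per layer. This helper is a sketch, not the authors' code:

```python
import numpy as np

def last_token_states(hidden_states, final_token_index):
    """Stack the hidden state of one token across all transformer layers.

    hidden_states: list of per-layer arrays, each of shape (seq_len, d_model),
                   e.g. as obtained with output_hidden_states=True.
    Returns an array of shape (n_layers, d_model).
    """
    return np.stack([h[final_token_index] for h in hidden_states])

# Toy stand-in: 32 layers, a 7-token sentence, d_model = 1280.
rng = np.random.default_rng(0)
fake_states = [rng.standard_normal((7, 1280)) for _ in range(32)]
per_layer = last_token_states(fake_states, final_token_index=-1)
assert per_layer.shape == (32, 1280)
```

Each row of the result is then the input to one layer-specific probe in the analysis that follows.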
3. Linear Probing and Classifiability of Semantic Violations
3.1 Linear Probe Formulation
At each layer ℓ, a two-class logistic regression probe was trained to classify the final-noun activation h_ℓ as either plausible or implausible. The probe parameters at a given layer ℓ are a weight matrix W_ℓ and bias b_ℓ, with output probability vector p_ℓ = softmax(W_ℓ h_ℓ + b_ℓ).
3.2 Probe Optimization and Evaluation
The probe optimization employed standard L₂-regularized logistic regression (sklearn, C=1.0, L2 penalty) with cross-entropy loss. Performance at each layer was quantified using area under the ROC curve (AUC) on a 20% held-out corpus split. Sanity checks included standard classification accuracy.
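The probing setup can be sketched with scikit-learn on synthetic activations. The separable toy data below stands in for the real per-layer activations, which are not available here:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for final-noun activations at one layer: two Gaussian clouds
# whose means differ along a few dimensions (plausible vs. implausible).
n, d = 200, 64
X_plaus = rng.standard_normal((n, d))
X_anom = rng.standard_normal((n, d))
X_anom[:, :5] += 1.5                    # the "anomaly direction"
X = np.vstack([X_plaus, X_anom])
y = np.array([0] * n + [1] * n)         # 1 = implausible

# 20% held-out split, as in the evaluation protocol.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# L2-regularized logistic probe (sklearn: penalty="l2", C=1.0).
probe = LogisticRegression(C=1.0, penalty="l2", max_iter=1000).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1])
print(f"held-out AUC: {auc:.3f}")
```

On real activations, running this probe once per layer yields the AUC-by-layer trajectory discussed next; on this toy data the classes are linearly separable by construction, so the AUC is high.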
3.3 Classifiability Trajectory Across Layers
Layerwise performance traces a distinct profile (see Figure 1A of Zacharopoulos et al., 24 Nov 2025):
- Early layers (ℓ < 10): AUC at chance (~0.5). No significant anomaly signal is present.
- Mid to upper-middle layers (ℓ = 10–26): Steady rise in AUC, peaking at ℓ ≈ 24–26 (AUC ≈ 0.88).
- Top layers (ℓ > 26): Slight plateau or tapering.
The implication is that explicit, linearly accessible information about semantic violations in completions only emerges after approximately 24 layers of contextual and lexical composition.
4. Manifold Geometry: Effective Dimensionality and Representational Dynamics
4.1 Participation Ratio (Effective Dimensionality)
The participation ratio (PR) quantifies the effective dimensionality of the representation subspace occupied by the final-noun activations for plausible and implausible cases. For a set of representations under condition c, the empirical covariance Σ_c has eigenvalues λ₁, …, λ_d, and: PR(Σ_c) = (Σᵢ λᵢ)² / Σᵢ λᵢ². The difference ΔPR = PR(Σ_implausible) − PR(Σ_plausible) captures how anomalous inputs diversify the subspace relative to plausible ones.
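The participation ratio admits a one-line numpy implementation via traces of the covariance, avoiding an explicit eigendecomposition. The sanity checks use isotropic and one-dimensional toy data, both illustrative rather than from the paper:

```python
import numpy as np

def participation_ratio(X):
    """PR(Sigma) = (sum_i lambda_i)^2 / sum_i lambda_i^2 for the empirical
    covariance of X (shape: n_samples x d).

    Equivalent to trace(Sigma)^2 / trace(Sigma @ Sigma), since the trace of
    a matrix power is the corresponding power sum of its eigenvalues.
    """
    cov = np.cov(X, rowvar=False)
    return np.trace(cov) ** 2 / np.trace(cov @ cov)

rng = np.random.default_rng(0)

# Isotropic data occupies all dimensions: PR approaches d.
pr_iso = participation_ratio(rng.standard_normal((5000, 10)))

# Data confined to a single direction: PR equals 1.
line = np.outer(rng.standard_normal(5000), np.ones(10))
pr_line = participation_ratio(line)
print(f"isotropic PR ~ {pr_iso:.1f}, one-dimensional PR ~ {pr_line:.1f}")
```

Computing this quantity separately for plausible and implausible activations at each layer yields the ΔPR trajectory described below.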
4.2 Evolution of the Subspace
- Early layers (ℓ < 10): Both conditions inhabit a high-dimensional subspace (PR ~ 200–300).
- Mid-layers (ℓ ≈ 10–20): Subspace dimensionality begins to diverge; implausible completions exhibit a modest increase in PR over plausible ones.
- Upper-middle layers (ℓ ≈ 24): ΔPR peaks; implausible states "fan out" along more dimensions, coinciding with maximal linear classifiability.
- Final layers (ℓ > 28): Both plausible and implausible PR values decline, indicating "collapse" onto a tighter manifold as the model prepares the hidden state for token prediction.
This dynamic suggests an initial exploration or expansion followed by rapid consolidation of anomaly information into low-dimensional features accessible to the model's output head.
5. Consolidation Dynamics and Psycholinguistic Alignment
Key findings underscore the two-phase process by which semantic anomaly signals are shaped:
- Exploratory Phase (layers 10–20): The model integrates lexical and world-knowledge cues, resulting in increased effective dimensionality for implausible endings.
- Rapid Consolidation (layers 20–26): The anomaly signal becomes crisply linearly separable; both AUC and ΔPR reach their maxima.
- Post-consolidation (layers > 28): Representational dimensionality contracts, and the semantic violation information is efficiently encoded for downstream generation or classification.
A notable contextualization: The temporal profile and locus of semantic anomaly detection in phi-2 mirrors findings from human event-related potential (ERP) research, specifically the N400 component, where sensitivity to semantic implausibility emerges only after syntactic resolution, later in the incremental processing stream (Zacharopoulos et al., 24 Nov 2025).
6. Illustrative Figures and Quantitative Results
| Figure Panel | Analysis Summary | Quantitative Trend |
|---|---|---|
| 1A | Logistic-probe AUC by layer | Peaks at ℓ ≈ 24–26, AUC ≈ 0.88 |
| 1B | PR(Σ) ellipses, ΔPR vs. layer | ΔPR peaks at ℓ ≈ 24 |
Key formulas reproduced:
- Softmax probe prediction: p_ℓ = softmax(W_ℓ h_ℓ + b_ℓ)
- Participation ratio for effective dimensionality: PR(Σ) = (Σᵢ λᵢ)² / Σᵢ λᵢ²
These quantitative and geometric results demonstrate that semantic violation information in phi-2 is not explicitly encoded until relatively deep in the network, highlighting the importance of stacking multiple contextual integration steps for robust anomaly detection.
7. Implications and Connections
The emergence of linearly decodable signals and dynamic changes in subspace occupancy in phi-2 suggest that causal transformer LLMs implement a late, multi-stage process for detecting semantic violations. This trajectory—initial high-dimensional exploration, followed by violation-specific expansion, and ending in manifold contraction—bears strong computational resemblance to human semantic anomaly detection as indexed by the N400 ERP. A plausible implication is that advancing model interpretability may benefit from psycholinguistic theory, and that probing manifold geometry yields actionable markers for the emergence and consolidation of context-sensitive semantic computations in LLMs (Zacharopoulos et al., 24 Nov 2025).