- The paper demonstrates that pre-trained LLMs, without fine-tuning, learn HMM structures in-context with accuracy nearing that of the Viterbi algorithm.
- LLM predictions converge toward the optimum as the context window grows, with the rate of convergence governed by HMM properties such as entropy and mixing rate.
- Real-world tests confirm that LLM in-context learning provides a data-efficient prediction tool, in one case surpassing an expert-developed GLM-HMM baseline.
This paper, "Pre-trained LLMs Learn Hidden Markov Models In-context" (Pre-trained Large Language Models Learn Hidden Markov Models In-context, 8 Jun 2025), investigates the capability of pre-trained LLMs to model and predict sequences generated by Hidden Markov Models (HMMs) using in-context learning (ICL). The core finding is that LLMs can achieve predictive accuracy approaching the theoretical optimum (defined by the Viterbi algorithm with ground-truth parameters) on a diverse set of synthetic HMMs, often outperforming traditional statistical methods like Baum-Welch.
Key Contributions and Findings:
- Near-Optimal HMM Learning with ICL:
- Through systematic experiments on synthetic HMMs, the paper demonstrates that LLMs, without any parameter updates (fine-tuning), can learn HMM structures purely from examples provided in their input context.
- The predictive accuracy of LLMs consistently converges to the performance of the Viterbi algorithm, especially when HMM entropy is low and mixing is fast. This convergence holds not only for predicting the most likely next token but also distributionally, as measured by Hellinger distance (see the evaluation sketch after this list).
- Scaling Trends and HMM Properties:
- LLM performance generally improves monotonically with longer context windows.
- This improvement is significantly influenced by HMM properties:
- Entropy: Higher entropy in transition (A) and emission (B) matrices leads to slower convergence and a larger gap to optimal performance.
- Mixing Rate (λ2): Slower mixing (higher λ2, the second-largest eigenvalue of A) delays LLM convergence.
- State/Observation Space Size (M,L): When normalized entropy is held constant, the number of hidden states or observations does not significantly impact LLM convergence rates, though larger spaces generally mean lower absolute accuracy due to increased task complexity.
- These scaling trends are characterized, offering insights into the learnability of stochastic systems via ICL.
- Practical Guidelines and Real-World Applications:
- ICL as a Diagnostic Tool: Practitioners can use LLM ICL to assess the inherent structure and learnability of their sequential data. Rapid convergence and high accuracy suggest low-entropy, fast-mixing underlying processes. Slow or poor convergence might indicate high entropy or slow mixing, which are fundamentally harder to model.
- Data-Efficient Prediction: LLMs offer a data-efficient method for next-observation prediction without requiring extensive domain expertise or the computational resources needed to train bespoke models such as LSTMs or to fit HMMs via Baum-Welch (which suffers from non-convex optimization).
- Real-World Validation:
- Decision-making Mice Dataset (IBL): LLM ICL, when provided with sufficient context (>1000 trials of stimulus, choice, and reward), achieved an average prediction accuracy of 86.2%, surpassing an expert-developed GLM-HMM (82.2%). The convergence pattern mirrored synthetic experiments with low entropy.
- Reward-learning Rats Dataset: On a more complex task (no explicit stimuli, learning dynamics captured), LLM ICL showed marginal improvement with context length, similar to synthetic HMMs with high entropy and slow mixing. It performed significantly worse than a specialized model developed via extensive evolutionary search. This suggests limitations of off-the-shelf LLM ICL for highly complex, non-stationary dynamics.
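To make the synthetic evaluation concrete, here is a minimal sketch of the kind of setup described above: sample an observation sequence from a known HMM, compute the exact next-observation distribution from the ground-truth parameters via the forward (filtering) recursion, and score any candidate predictor against it by accuracy and Hellinger distance. The HMM dimensions, values, and function names are illustrative, not the paper's configuration; the paper's oracle is described as Viterbi with ground-truth parameters, and the forward recursion here is used only to obtain the reference next-observation distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hmm(A, B, pi, T):
    """Sample a length-T observation sequence from an HMM (A: MxM, B: MxL, pi: M)."""
    M, L = B.shape
    states, obs = np.empty(T, dtype=int), np.empty(T, dtype=int)
    s = rng.choice(M, p=pi)
    for t in range(T):
        states[t] = s
        obs[t] = rng.choice(L, p=B[s])
        s = rng.choice(M, p=A[s])
    return states, obs

def oracle_next_obs_dist(A, B, pi, obs):
    """Exact P(O_{t+1} | O_{1:t}) via the forward (filtering) recursion with true parameters."""
    belief = pi.copy()
    for o in obs:
        belief = belief * B[:, o]   # condition on the current observation
        belief /= belief.sum()
        belief = A.T @ belief       # propagate the state belief one step forward
    return belief @ B               # length-L distribution over the next observation

def hellinger(p, q):
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# Toy 3-state, 4-symbol HMM (dimensions and values are illustrative only).
A = np.array([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]])
B = np.array([[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]])
pi = np.full(3, 1 / 3)

_, obs = sample_hmm(A, B, pi, T=512)
p_true = oracle_next_obs_dist(A, B, pi, obs[:-1])
p_model = np.full(4, 0.25)  # stand-in for an LLM's next-token distribution over the alphabet
print("oracle argmax:", p_true.argmax(), "Hellinger to model:", hellinger(p_true, p_model))
```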
Comparison to Baselines:
- LLM ICL consistently outperformed learning-based baselines (Baum-Welch, LSTMs, and n-gram models) on synthetic HMMs, converging faster and more stably toward the ground-truth distribution.
- Baum-Welch suffered from slow and unreliable convergence due to non-convex optimization.
- LSTMs required significant computational resources and showed unstable accuracy.
- n-gram models (especially bigram) were inherently suboptimal for HMMs where observations are not Markovian.
- A conditional predictor P(O_{t+1} | O_{t−k:t}) using true HMM parameters could approach Viterbi performance, especially in fast-mixing regimes, suggesting that a truncated history can be effective (a minimal sketch follows this list).
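The truncated-history predictor admits a simple sketch: initialize the belief at the stationary distribution and filter only over the last k observations with the true parameters. This reuses A, B, and obs from the earlier sketch; resetting the belief to the stationary distribution is an assumption about how such a predictor would be implemented, not a detail taken from the paper.

```python
import numpy as np

def stationary_dist(A, iters=1000):
    """Stationary distribution of a row-stochastic transition matrix A (power iteration)."""
    pi = np.full(A.shape[0], 1.0 / A.shape[0])
    for _ in range(iters):
        pi = pi @ A
    return pi / pi.sum()

def truncated_predictor(A, B, obs_window):
    """P(O_{t+1} | O_{t-k:t}) with true parameters: start from the stationary belief
    and filter only over the last k+1 observations."""
    belief = stationary_dist(A)
    for o in obs_window:
        belief = belief * B[:, o]
        belief /= belief.sum()
        belief = A.T @ belief
    return belief @ B

# e.g. truncated_predictor(A, B, obs[-8:]) approximates the full-history oracle
# increasingly well as the window grows or as the chain mixes faster (small λ2).
```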
Theoretical Conjectures:
- The paper draws parallels between LLM ICL behavior and spectral learning algorithms for HMMs.
- Theorem 1 (informal) extends results from spectral learning to the single-trajectory setting, suggesting that the observed scaling trends (improvement with sample size t, dependence on the mixing rate through 1/(1−λ2(A)), and the effect of entropy via observability conditions) are consistent with spectral learning theory (see the sketch after this list).
- LLMs seem to handle practical limitations of spectral learning (e.g., rank conditions, sensitivity to conditioning) more gracefully, indicating an area for further statistical understanding.
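For context, the sketch below illustrates the classical spectral (observable-operator) construction for HMMs that the conjecture parallels, with the moment matrices estimated from sliding windows of a single stationary trajectory to match the single-trajectory setting. It is a generic illustration of the spectral recipe, not the paper's analysis; it ignores the conditioning and rank issues noted above, and all names are illustrative.

```python
import numpy as np

def spectral_hmm_predictor(obs, n_symbols, n_states):
    """Observable-operator (spectral) estimate of an HMM from one trajectory.
    Moments P_1, P_{2,1}, P_{3,x,1} are estimated from sliding unigrams/bigrams/trigrams."""
    x = np.asarray(obs)
    T = len(x)
    P1 = np.bincount(x, minlength=n_symbols) / T
    P21 = np.zeros((n_symbols, n_symbols))              # P21[i, j] = Pr[x_{t+1}=i, x_t=j]
    P3x1 = np.zeros((n_symbols, n_symbols, n_symbols))  # P3x1[a][i, j] = Pr[x_{t+2}=i, x_{t+1}=a, x_t=j]
    for t in range(T - 2):
        P21[x[t + 1], x[t]] += 1
        P3x1[x[t + 1], x[t + 2], x[t]] += 1
    P21 /= T - 2
    P3x1 /= T - 2

    U, _, _ = np.linalg.svd(P21)
    U = U[:, :n_states]                                  # top-n_states left singular vectors
    b1 = U.T @ P1
    binf = np.linalg.pinv(P21.T @ U) @ P1
    pinv_UP21 = np.linalg.pinv(U.T @ P21)
    Bx = np.stack([U.T @ P3x1[a] @ pinv_UP21 for a in range(n_symbols)])

    def next_obs_dist(history):
        b = b1
        for o in history:
            b = Bx[o] @ b
            b = b / (binf @ b)                           # keep the internal state normalized
        scores = np.array([binf @ (Bx[a] @ b) for a in range(n_symbols)])
        scores = np.clip(scores, 0, None)                # spectral estimates can go slightly negative
        return scores / scores.sum()

    return next_obs_dist

# predict = spectral_hmm_predictor(obs, n_symbols=4, n_states=3)
# p_hat = predict(obs[:-1])   # compare to the oracle / LLM distribution via Hellinger distance
```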
Implementation Details and Considerations:
- Experimental Setup: Synthetic HMMs were generated by varying states (M), observations (L), mixing rate (λ2), stationary distribution skewness, entropy of A and B, and initial state distribution π. LLMs (Qwen, Llama families) were evaluated for next-token prediction accuracy and Hellinger distance to the true next-token distribution.
- Tokenization: Experiments explored different tokenization strategies (ABC, 123, random special tokens) for representing observations. While all converged similarly, ABC (single letters) was slightly faster for high-entropy A matrices; for low-entropy A, ABC showed lower initial accuracy, hypothesized to stem from the filtering of repetitive n-grams during LLM pretraining (a minimal querying sketch follows this list).
- LLM Size: Performance was generally similar across different model sizes, with slight degradation for the smallest models.
- Context Length: Performance was assessed across context lengths from 4 to 2048 observations.
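As an illustration of how such an evaluation can be run against an off-the-shelf model, the sketch below feeds an "ABC"-tokenized observation sequence to a pre-trained causal LLM via Hugging Face transformers and reads off the next-observation distribution by restricting the next-token logits to the observation alphabet. The model name, prompt format, and single-token assumption for each symbol are assumptions for illustration, not the paper's exact protocol.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model choice is illustrative; the paper evaluates Qwen and Llama family models.
model_name = "Qwen/Qwen2.5-0.5B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# "ABC" tokenization: each observation symbol is rendered as a single letter.
symbols = ["A", "B", "C", "D"]
obs = [0, 1, 1, 3, 2, 0, 1]                   # in-context observation sequence (illustrative)
prompt = " ".join(symbols[o] for o in obs)    # e.g. "A B B D C A B"

# Token id of each symbol when it appears after a space (assumes each maps to one token).
symbol_ids = [tokenizer.encode(" " + s, add_special_tokens=False)[-1] for s in symbols]

with torch.no_grad():
    inputs = tokenizer(prompt, return_tensors="pt")
    logits = model(**inputs).logits[0, -1]    # next-token logits at the last position

# Restrict to the observation alphabet and renormalize to get P(O_{t+1} | O_{1:t}).
probs = torch.softmax(logits[symbol_ids], dim=-1)
print({s: round(p.item(), 3) for s, p in zip(symbols, probs)})
```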
Limitations and Future Directions:
- Discrete Tokenization: A major bottleneck is the reliance on discrete tokens, making it challenging to model continuous, real-valued, or high-dimensional data (e.g., neural recordings) directly with ICL.
- Interpretability: LLMs remain "black-box" models. Extracting interpretable HMM parameters (transition/emission probabilities) from LLM representations is non-trivial.
- Call to Action: The paper calls for multidisciplinary efforts to develop next-generation foundation models designed for scientific data, moving beyond NLP-focused architectures.
In essence, the research demonstrates that pre-trained LLMs possess a remarkable and previously underexplored ability to learn HMMs in-context, offering a powerful, accessible, and data-efficient tool for analyzing sequential data in scientific domains, particularly for initial diagnostics and prediction when domain expertise or data for bespoke models is limited.