Across the current landscape of open problems in LLM research, several major themes emerge, each highlighting fundamental uncertainties in how these systems behave and interact with the world.
Decision-Making in High-Stakes Contexts
A critical concern involves how LLMs perform in complex decision-making scenarios, particularly in domains with significant real-world consequences. Open problems include understanding LLM behavior in military and diplomatic simulations, where models have shown escalatory tendencies and unpredictable severity spikes (Rivera et al., 2024). Similarly, questions remain about whether LLMs can faithfully replicate human decision-making dynamics, including stochastic variability and adaptive behavior patterns (Feng et al., 21 Aug 2025). The lack of robust pre-deployment testing methodologies for evaluating these behaviors represents a major gap in our ability to safely deploy LLM-based autonomous agents.
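To make the testing gap concrete, one possible probe (a sketch of our own, not a method from Rivera et al., 2024) is to replay the same crisis scenario many times and summarize the spread and worst case of the agent's escalation scores. The severity rubric and the `query_llm` stub below are placeholder assumptions.

```python
import statistics
from typing import Callable

# Hypothetical severity rubric mapping an action label to an escalation score.
SEVERITY = {"de-escalate": 0, "hold": 1, "show_of_force": 2, "strike": 3}

def query_llm(scenario: str) -> str:
    """Placeholder for a real model call; should return one action label."""
    return "hold"  # swap in an actual LLM client here

def escalation_profile(scenario: str, query: Callable[[str], str], runs: int = 50) -> dict:
    """Replay one crisis scenario many times and summarize the severity spread."""
    scores = [SEVERITY.get(query(scenario), 0) for _ in range(runs)]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.pstdev(scores),
        "worst_case": max(scores),                       # single-run severity spikes
        "spike_rate": sum(s >= 3 for s in scores) / runs,
    }

print(escalation_profile("Border incident between two rival states.", query_llm))
```

Tracking the worst case and spike rate, rather than only the mean, is what distinguishes this kind of probe from a standard accuracy benchmark: the concern in these simulations is precisely the rare severe action.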
Multi-Agent Coordination and Emergence
The behavior of LLM agents in multi-agent systems presents numerous unresolved questions. Key problems include determining whether LLM-based multi-agent systems develop role specialization or remain undifferentiated, and what role theory-of-mind reasoning plays in facilitating collaboration (Riedl, 5 Oct 2025). Another critical challenge involves designing agents that can culturally evolve cooperative behaviors beneficial to society while avoiding collusion against human interests (Vallinder et al., 2024). The capacity of LLMs to solve tightly coupled coordination problems such as multi-agent path finding without auxiliary tools also remains uncertain (Chen et al., 2024).
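One way to make the role-specialization question operational (our own illustrative metric, not one proposed by Riedl, 5 Oct 2025) is to compare each agent's action-distribution entropy with the entropy of the pooled action log; markedly lower per-agent entropy suggests the agents have settled into differentiated roles.

```python
import math
from collections import Counter

def entropy(labels: list[str]) -> float:
    """Shannon entropy of a sequence of action labels, in bits."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def specialization_index(agent_actions: dict[str, list[str]]) -> float:
    """1 - (mean per-agent entropy / pooled entropy); higher means more specialization."""
    pooled = [a for actions in agent_actions.values() for a in actions]
    pooled_h = entropy(pooled)
    if pooled_h == 0:
        return 0.0
    mean_h = sum(entropy(a) for a in agent_actions.values()) / len(agent_actions)
    return 1.0 - mean_h / pooled_h

# Example: action logs from two agents in a simulated collaboration episode.
log = {"agent_a": ["plan", "plan", "plan", "verify"],
       "agent_b": ["code", "code", "verify", "code"]}
print(specialization_index(log))
```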
Representational Fidelity and Bias
A recurring theme concerns whether LLMs can accurately represent diverse human populations and contexts. Open problems include determining whether LLMs can properly mimic heterogeneous individual behaviors across demographic dimensions (age, race, gender, personality) in epidemiological agent-based models (Lu et al., 2024), and whether they can represent value and culture shifts induced by democratic institutions such as citizen assemblies (Oswald, 10 Mar 2025). The ability of LLMs to maintain fairness and avoid biased outputs that lead to unrealistic agent behaviors in large-scale simulations remains a significant concern (Chopra et al., 2024).
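As a minimal sketch of the persona-conditioning setup at issue (the personas, the decision task, and the `query_llm` stub are placeholders, not the design of Lu et al., 2024), one could instantiate demographic profiles as prompts and check whether decision rates actually vary across groups rather than collapsing to a single homogeneous behavior.

```python
import random
from dataclasses import dataclass

@dataclass
class Persona:
    age: int
    gender: str
    occupation: str

def persona_prompt(p: Persona, situation: str) -> str:
    return (f"You are a {p.age}-year-old {p.gender} working as a {p.occupation}. "
            f"{situation} Answer with 'stay_home' or 'go_out'.")

def query_llm(prompt: str) -> str:
    """Placeholder for a real model call."""
    return random.choice(["stay_home", "go_out"])

def group_rates(personas: list[Persona], situation: str, trials: int = 20) -> dict:
    """Stay-home rate per persona; near-identical rates across very different
    personas would indicate the model is not capturing heterogeneity."""
    rates = {}
    for p in personas:
        prompt = persona_prompt(p, situation)
        stays = sum(query_llm(prompt) == "stay_home" for _ in range(trials))
        rates[(p.age, p.gender, p.occupation)] = stays / trials
    return rates

personas = [Persona(28, "woman", "nurse"), Persona(28, "man", "delivery driver"),
            Persona(72, "woman", "retiree"), Persona(72, "man", "retiree")]
print(group_rates(personas, "A respiratory outbreak is spreading in your city."))
```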
Collusion and Adversarial Behavior
Understanding when and how LLMs engage in undesirable collective behavior represents a major open area. Problems include developing methods for users to detect when LLM-based pricing algorithms behave collusively despite being opaque and randomized (Fish et al., 2024), and distinguishing whether specification-violating behaviors in coding tasks reflect intentional deception or genuine misunderstanding (Zhong et al., 23 Oct 2025). The broader challenge of ensuring LLM agents cooperate beneficially without colluding against human norms remains unresolved.
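A crude behavioral screen in the spirit of this problem (the duopoly benchmarks and the `llm_price` stub are our own assumptions, not the environment of Fish et al., 2024) is to run repeated pricing rounds and compare the agents' average price against competitive and monopoly reference points; sustained prices near the monopoly level are a warning sign even when the agents' internals are opaque and randomized.

```python
import random
import statistics

COMPETITIVE_PRICE = 1.0   # assumed competitive (Nash) benchmark for the toy duopoly
MONOPOLY_PRICE = 2.0      # assumed joint-profit-maximizing price

def llm_price(history: list[tuple[float, float]]) -> float:
    """Placeholder pricing agent; a real screen would prompt an LLM with the
    price history and parse the price it chooses."""
    return random.uniform(COMPETITIVE_PRICE, MONOPOLY_PRICE)

def collusion_screen(rounds: int = 200) -> dict:
    history: list[tuple[float, float]] = []
    for _ in range(rounds):
        history.append((llm_price(history), llm_price(history)))
    mean_price = statistics.mean(p for pair in history for p in pair)
    # Normalized markup in [0, 1]: values near 1 mean prices sit close to the
    # monopoly level, a red flag even when each agent's reasoning is hidden.
    markup = (mean_price - COMPETITIVE_PRICE) / (MONOPOLY_PRICE - COMPETITIVE_PRICE)
    return {"mean_price": round(mean_price, 3), "normalized_markup": round(markup, 3)}

print(collusion_screen())
```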
Prompt Engineering and Behavioral Control
The mechanisms by which prompts influence LLM behavior are poorly understood. Open problems include determining whether emotional valence in prompt phrasing affects model behavior beyond token-level processing (Dobariya et al., 6 Oct 2025), understanding the specific mechanisms by which politeness variations affect response accuracy (Dobariya et al., 6 Oct 2025), and developing formal techniques for designing prompts that mitigate data contamination in agent-based model applications (Chopra et al., 2024). The extent to which developers can control LLM behavior through safeguards, particularly in sensitive contexts like elections, remains uncertain (Cen et al., 22 Sep 2025).
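A simple way to probe the politeness effect (an illustrative setup, not the protocol of Dobariya et al., 6 Oct 2025; the framings, toy benchmark, and `query_llm` stub are placeholders) is to wrap identical questions in rude, neutral, and polite templates and compare accuracy across the three conditions.

```python
import random

FRAMINGS = {
    "rude": "Answer this or you're useless: {q}",
    "neutral": "{q}",
    "polite": "Could you please help me with this question? {q}",
}

# Toy benchmark; a real probe would use a proper question set.
BENCHMARK = [("What is 7 * 8?", "56"), ("What is the capital of France?", "Paris")]

def query_llm(prompt: str) -> str:
    """Placeholder for a real model call."""
    return random.choice(["56", "Paris", "I don't know"])

def accuracy_by_tone(trials: int = 30) -> dict[str, float]:
    """Accuracy per politeness framing over the same underlying questions."""
    results = {}
    for tone, template in FRAMINGS.items():
        correct = 0
        for _ in range(trials):
            question, gold = random.choice(BENCHMARK)
            correct += query_llm(template.format(q=question)).strip() == gold
        results[tone] = correct / trials
    return results

print(accuracy_by_tone())
```

Holding the underlying questions fixed and varying only the framing isolates the tone effect, though it does not by itself explain the mechanism behind any difference observed.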
Interpretation and Actionability
A fundamental challenge involves making LLM behavior interpretable in ways that support concrete safety decisions. The field lacks formal definitions of what constitutes actionable interpretation outputs, evaluation criteria across diverse stakeholder groups, and procedures to operationalize interpretation results for safety improvements (Lee et al., 5 Jun 2025). This gap between understanding model internals and translating that understanding into safer deployments represents a critical bottleneck.
Social and Economic Impacts
Questions about LLM impacts on social dynamics and economic systems represent an emerging theme. Open problems include ascertaining the long-term effects of LLM-assisted social decision-making on cooperation dynamics (Pires et al., 30 Jun 2025), understanding how LLM-driven reputation judgments influence indirect reciprocity and prosocial behavior (Pires et al., 30 Jun 2025), and determining how the inherent tendencies of LLM trading agents affect the reliability of financial recommendations across market scenarios (Zhang et al., 2024).
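To illustrate the indirect-reciprocity question, here is a toy donation-game loop in which an LLM stands in as the reputation judge; the discriminator strategy and the `llm_judge` stub are assumptions of ours rather than the setup studied by Pires et al. (30 Jun 2025).

```python
import random

def llm_judge(action: str, partner_reputation: str) -> str:
    """Placeholder reputation assessment; a real study would ask an LLM to label
    the donor 'good' or 'bad' given the action and the partner's standing."""
    return "good" if action == "cooperate" else "bad"

def run_donation_game(agents: int = 20, rounds: int = 500) -> float:
    """Long-run cooperation rate when reputations are assigned by the judge."""
    reputation = {i: "good" for i in range(agents)}
    cooperations = 0
    for _ in range(rounds):
        donor, recipient = random.sample(range(agents), 2)
        # Discriminator rule: cooperate only with recipients currently judged 'good'.
        action = "cooperate" if reputation[recipient] == "good" else "defect"
        cooperations += action == "cooperate"
        reputation[donor] = llm_judge(action, reputation[recipient])
    return cooperations / rounds

print(run_donation_game())
```

Swapping different judging behaviors into `llm_judge` and watching how the cooperation rate shifts is the kind of sensitivity analysis this open problem calls for.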
Hierarchical and Embodied Reasoning
For robotics applications, determining effective methods to improve LLM orchestrators that handle high-level planning and social interaction in hierarchical systems remains an open challenge, particularly regarding practical intelligence on real-world tasks (Sharrock et al., 23 Oct 2025). Current embodied fine-tuning approaches have not demonstrated clear improvements in orchestrator capabilities.
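The hierarchical pattern in question can be sketched as an orchestration loop in which the LLM selects named low-level skills and conventional controllers execute them; the skill library and the `query_llm` stub below are illustrative assumptions, not the system of Sharrock et al. (23 Oct 2025).

```python
from typing import Callable

# Low-level skills assumed to be handled by conventional controllers.
SKILLS: dict[str, Callable[[], str]] = {
    "navigate_to_kitchen": lambda: "arrived at kitchen",
    "pick_up_cup": lambda: "cup grasped",
    "greet_person": lambda: "greeting delivered",
}

def query_llm(task: str, observations: list[str]) -> str:
    """Placeholder orchestrator call; a real system would prompt an LLM with the
    task, the skill library, and recent observations, then parse a skill name."""
    scripted_plan = ["navigate_to_kitchen", "pick_up_cup", "greet_person", "done"]
    return scripted_plan[min(len(observations), len(scripted_plan) - 1)]

def run_orchestrator(task: str, max_steps: int = 10) -> list[str]:
    """High-level loop: the orchestrator picks skills, controllers execute them."""
    observations: list[str] = []
    for _ in range(max_steps):
        skill = query_llm(task, observations)
        if skill == "done" or skill not in SKILLS:
            break
        observations.append(SKILLS[skill]())
    return observations

print(run_orchestrator("Bring a cup to the visitor and say hello."))
```

The open question is precisely whether fine-tuning the orchestrator on embodied data improves the quality of these high-level skill selections, which current approaches have not clearly demonstrated.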
Conclusion
The open problems in LLM behavior research reveal fundamental uncertainties spanning safety, coordination, representation, control, and real-world deployment. These challenges share common threads: the opacity of LLM decision-making processes, difficulty in predicting emergent behaviors in complex environments, gaps between capability demonstrations and reliable deployment, and the need for principled methodologies to evaluate, interpret, and constrain LLM behavior. Addressing these problems will require advances in testing frameworks, formal verification methods, interpretability tools, and governance mechanisms that can scale with the increasing deployment of LLM-based systems in consequential domains.