- The paper introduces StockSim, an open-source dual-mode simulator that evaluates multi-agent LLMs by modeling realistic market dynamics at the order level.
- It combines order-level and candlestick-level execution to capture microstructural effects such as latency, slippage, and queue dynamics for fair comparisons.
- The platform’s modular, asynchronous architecture supports heterogeneous trading strategies and scales almost linearly up to roughly 150 agents.
StockSim: A Dual-Mode Order-Level Simulator for Evaluating Multi-Agent LLMs in Financial Markets
The paper "StockSim: A Dual-Mode Order-Level Simulator for Evaluating Multi-Agent LLMs in Financial Markets" (arXiv:2507.09255) introduces StockSim, an open-source simulation platform for rigorously evaluating large language models (LLMs) in financial decision-making. The platform addresses the limitations of existing tools by modeling market dynamics at two levels of granularity and by incorporating real-world factors such as latency, slippage, and order-book microstructure. StockSim supports heterogeneous trading strategies and multi-agent coordination, which makes it a useful testbed for NLP research on reasoning under uncertainty and sequential decision-making.
Current evaluation practices often rely on static benchmark datasets, which can lead to data leakage and inflated performance metrics. Existing platforms either oversimplify market interactions or depend on expensive, limited tick-level datasets, hindering reproducibility and fair comparisons. StockSim overcomes these challenges by offering a unified, open-source platform that integrates two complementary simulation modes: order-level execution and candlestick-level execution.
Figure 1: Overview of StockSim's system architecture and input/output scheme.
The order-level execution mode emulates real market behavior by operating directly on the limit order book (LOB), capturing latency, queue position, and other microstructural effects. The candlestick-level execution mode abstracts away these low-level effects in exchange for scalable evaluation. This dual-mode approach allows researchers to focus on NLP-driven agent design and experimentation rather than infrastructure development.
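The difference between the two modes can be illustrated with a toy example. The sketch below is not StockSim's code: `ToyAskBook`, `RestingOrder`, and `candle_fill_limit_buy` are hypothetical names, and the fill rules (price-time priority for the book, "bar low at or below the limit" for the candle) are common simplifications assumed here for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class RestingOrder:
    order_id: str
    price: float
    qty: int


class ToyAskBook:
    """Sell side of a toy limit order book with price-time priority (illustrative only)."""

    def __init__(self):
        self.levels = {}  # price -> deque of RestingOrder, oldest first

    def add(self, order):
        self.levels.setdefault(order.price, deque()).append(order)

    def match_market_buy(self, qty):
        """Walk the book from the best ask upward; return (order_id, price, qty) fills."""
        fills = []
        for price in sorted(self.levels):
            queue = self.levels[price]
            while queue and qty > 0:
                head = queue[0]  # earliest order at this price trades first
                traded = min(qty, head.qty)
                fills.append((head.order_id, price, traded))
                head.qty -= traded
                qty -= traded
                if head.qty == 0:
                    queue.popleft()
            if qty == 0:
                break
        # Drop price levels that have been fully consumed.
        self.levels = {p: q for p, q in self.levels.items() if q}
        return fills


def candle_fill_limit_buy(limit_price, low):
    """Candlestick-level approximation: fill iff the bar traded at or below the limit."""
    return low <= limit_price


def average_price(fills):
    """Volume-weighted average price over a list of fills."""
    total_qty = sum(q for _, _, q in fills)
    return sum(p * q for _, p, q in fills) / total_qty
```

With asks of 50 and 50 shares at 100.0 and 100 shares at 100.5, a market buy for 120 shares consumes both orders at 100.0 in arrival order and then 20 shares at 100.5, so its average price is above the best ask. That gap is the slippage the order-level mode exposes; the candlestick check only says whether a fill was possible within the bar.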
System Architecture and Core Components
StockSim employs a modular, asynchronous architecture with four core components: the Exchange Simulation Engine, Data Sources, Agent, and Evaluator. The Exchange Simulation Engine manages the simulated trading environment, processes agent actions, and disseminates market indicators. Data Sources provide both market data (order-level and candlestick) and external data (news, corporate actions). The Agent component allows researchers to implement and test various trading strategies, including multi-agent setups. The Evaluator component tracks trade executions, computes performance metrics, and generates visual diagnostics.
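The summary does not list which performance metrics the Evaluator computes. As a rough illustration of the kind of calculation such a component performs, the sketch below derives total return and maximum drawdown from an equity curve; both the choice of metrics and the function names are assumptions, not StockSim's API.

```python
def total_return(equity):
    """Fractional gain or loss from the first to the last portfolio value."""
    return equity[-1] / equity[0] - 1.0


def max_drawdown(equity):
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst
```

For the equity curve `[100, 120, 90, 110]`, the running peak is 120 when the value falls to 90, so the maximum drawdown is 30/120 = 0.25 even though the run ends with a gain.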
The Exchange Simulation Engine acts as the central intermediary between data sources and trading agents, routing data dynamically and maintaining internal states related to orders and trades. Agents communicate asynchronously with the Engine via RabbitMQ, ensuring reliable message delivery and scalable communication.
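The asynchronous pattern described here can be sketched without a broker. In the example below, `asyncio.Queue` objects stand in for RabbitMQ queues so the snippet runs on its own; the message fields (`agent`, `order_id`, `status`) and all function names are hypothetical and do not reflect StockSim's actual message schema.

```python
import asyncio


async def engine(orders, reports, n_expected):
    """Consume order messages and send one execution report back per order."""
    for _ in range(n_expected):
        msg = await orders.get()
        report = {"order_id": msg["order_id"], "status": "accepted"}
        await reports[msg["agent"]].put(report)


async def agent(name, orders, inbox, n_orders):
    """Publish all orders first, then collect reports from the agent's own inbox."""
    for i in range(n_orders):
        await orders.put({"agent": name, "order_id": f"{name}-{i}"})
    return [await inbox.get() for _ in range(n_orders)]


async def run_demo(agent_names, n_orders=2):
    orders = asyncio.Queue()  # shared queue: agents -> engine
    reports = {name: asyncio.Queue() for name in agent_names}  # engine -> each agent
    engine_task = asyncio.create_task(
        engine(orders, reports, n_orders * len(agent_names))
    )
    results = await asyncio.gather(
        *(agent(name, orders, reports[name], n_orders) for name in agent_names)
    )
    await engine_task
    return dict(zip(agent_names, results))
```

Calling `asyncio.run(run_demo(["a", "b"]))` returns each agent's reports in the order its orders were sent. A real broker adds durability and cross-process delivery, which an in-memory queue does not provide.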
Agent Capabilities and Multi-Agent Coordination
Agents in StockSim can subscribe to data streams, submit and cancel orders, receive execution outcomes and portfolio updates, and log reasoning. The platform includes a modular LLMTradingAgent that delegates decision-making to a team of specialist LLMs, such as market-technical analysts, news analysts, and fundamental analysts. This design enables researchers to rapidly prototype new agent structures, experiment with different backbones or prompting techniques, and conduct ablation studies.
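A minimal sketch of the delegation pattern follows. It is not the paper's `LLMTradingAgent`: the specialists are plain rule-based functions standing in for LLM calls, and `ToyCoordinator`, `SpecialistReport`, and the sum-of-signals aggregation rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SpecialistReport:
    role: str
    signal: int  # +1 bullish, 0 neutral, -1 bearish
    rationale: str


def technical_analyst(obs):
    """Stand-in for an LLM specialist: compare the last close with the window mean."""
    closes = obs["closes"]
    signal = 1 if closes[-1] > sum(closes) / len(closes) else -1
    return SpecialistReport("technical", signal, "last close vs. window mean")


def news_analyst(obs):
    """Stand-in for an LLM specialist: naive keyword scan of headlines."""
    text = " ".join(obs["headlines"]).lower()
    signal = 1 if "beats" in text else (-1 if "misses" in text else 0)
    return SpecialistReport("news", signal, "keyword scan of headlines")


class ToyCoordinator:
    """Collects specialist reports, logs them, and aggregates them into one action."""

    def __init__(self, specialists):
        self.specialists = specialists  # dict: role name -> callable
        self.log = []

    def decide(self, obs):
        reports = [specialist(obs) for specialist in self.specialists.values()]
        self.log.extend(reports)  # keep rationales for later inspection
        score = sum(r.signal for r in reports)
        return "BUY" if score > 0 else "SELL" if score < 0 else "HOLD"
```

Because the specialists are passed in as a dictionary, an ablation amounts to constructing the coordinator with one entry removed, for example `ToyCoordinator({"technical": technical_analyst})`.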
StockSim maintains a unified interface across simulation engines, allowing researchers to switch between order-level and candlestick-level execution with only a configuration change. Pre-configured wrappers are provided for widely used LLMs, including LLaMA, OpenAI's offerings, and Anthropic's models.
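One common way to get this kind of configuration-only switch is a shared interface plus a registry keyed by a config value. The sketch below assumes that design; the class names, the `mode` key, and the `submit` signature are hypothetical and are not taken from StockSim.

```python
from typing import Protocol


class ExecutionEngine(Protocol):
    def submit(self, side: str, qty: int, limit_price: float) -> str: ...


class OrderLevelEngine:
    def submit(self, side, qty, limit_price):
        return f"order-level: queued {side} {qty} @ {limit_price}"


class CandlestickEngine:
    def submit(self, side, qty, limit_price):
        return f"candlestick: checking {side} {qty} @ {limit_price} against next bar"


# Registry mapping a config value to an engine class.
ENGINES = {"order_level": OrderLevelEngine, "candlestick": CandlestickEngine}


def build_engine(config: dict) -> ExecutionEngine:
    return ENGINES[config["mode"]]()


def run_strategy(engine: ExecutionEngine) -> str:
    # Strategy code depends only on the shared interface, not on the engine type.
    return engine.submit("buy", 10, 100.0)
```

Here `run_strategy` is unchanged between `build_engine({"mode": "order_level"})` and `build_engine({"mode": "candlestick"})`; only the configuration dictionary differs.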
Evaluation of Scalability and Consistency
The paper evaluates the scalability and consistency of StockSim through a series of controlled simulation tests using varying numbers of deterministic agents. The results confirm StockSim's consistency and demonstrate that it scales almost linearly up to approximately 150 agents. Resource demands remain modest, even at maximum load.

Figure 2: System performance metrics (memory/CPU usage) for varying numbers of deterministic agents.
To show how StockSim can surface differences in model behavior, the paper presents a simulation in which OpenAI's o4-mini and o3 trade NVIDIA stock. The two models exhibit distinct trading patterns and strategic behaviors, which illustrates the Evaluator's ability to capture and distinguish such differences.
Conclusion
StockSim advances NLP research infrastructure by providing a platform for studying LLM abilities in realistic, multi-agent, temporal reasoning scenarios. By combining financial simulation with NLP evaluation tools, StockSim bridges the gap between research experiments and real-world deployment requirements. Its open-source release and documentation make it broadly accessible to researchers studying LLM behavior in complex decision-making environments.