StockSim: A Dual-Mode Order-Level Simulator for Evaluating Multi-Agent LLMs in Financial Markets (2507.09255v1)

Published 12 Jul 2025 in cs.CE and cs.MA

Abstract: We present StockSim, an open-source simulation platform for systematic evaluation of LLMs in realistic financial decision-making scenarios. Unlike previous toolkits that offer limited scope, StockSim delivers a comprehensive system that fully models market dynamics and supports diverse simulation modes of varying granularity. It incorporates critical real-world factors, such as latency, slippage, and order-book microstructure, that were previously neglected, enabling more faithful and insightful assessment of LLM-based trading agents. An extensible, role-based agent framework supports heterogeneous trading strategies and multi-agent coordination, making StockSim a uniquely capable testbed for NLP research on reasoning under uncertainty and sequential decision-making. We open-source all our code at https://github.com/harrypapa2002/StockSim.

Summary

The paper introduces StockSim, an open-source dual-mode simulator that evaluates multi-agent LLMs by modeling realistic market dynamics at the order level.
It combines order-level and candlestick-level execution to capture microstructural effects such as latency, slippage, and queue dynamics for fair comparisons.
The platform’s modular, asynchronous architecture supports heterogeneous trading strategies and scales almost linearly up to 150 agents for robust research.

The paper "StockSim: A Dual-Mode Order-Level Simulator for Evaluating Multi-Agent LLMs in Financial Markets" (2507.09255) introduces StockSim, an open-source simulation platform designed for the rigorous evaluation of LLMs in financial decision-making. The platform addresses the limitations of existing tools by providing a comprehensive system that models market dynamics with varying granularity and incorporates real-world factors such as latency, slippage, and order-book microstructure. StockSim supports heterogeneous trading strategies and multi-agent coordination, making it a valuable testbed for NLP research on reasoning under uncertainty and sequential decision-making.

Addressing Limitations of Existing Evaluation Platforms

Current evaluation practices often rely on static benchmark datasets, which can lead to data leakage and inflated performance metrics. Existing platforms either oversimplify market interactions or depend on expensive, limited tick-level datasets, hindering reproducibility and fair comparisons. StockSim overcomes these challenges by offering a unified, open-source platform that integrates two complementary simulation modes: order-level execution and candlestick-level execution.

Figure 1: Overview of StockSim's system architecture and input/output scheme.

The order-level execution mode emulates real market behavior by operating directly on the limit order book (LOB), capturing latency, queue dynamics, and other microstructural effects. The candlestick-level execution mode abstracts away these low-level market effects in exchange for scalable evaluation. With both modes available in one platform, researchers can focus on NLP-driven agent design and experimentation rather than infrastructure development.
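The difference between the two modes can be illustrated with a toy example. The sketch below is not StockSim's implementation: `ToyAskBook`, `candlestick_fill`, and the fill-at-open rule are hypothetical names and assumptions chosen only to show why walking a limit order book produces slippage that a bar-level fill hides.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class RestingOrder:
    order_id: str
    price: float
    qty: int


class ToyAskBook:
    """Hypothetical ask-side book with price-time priority (not StockSim's API)."""

    def __init__(self):
        self.levels = {}  # price -> deque of RestingOrder, FIFO within a level

    def add(self, order):
        self.levels.setdefault(order.price, deque()).append(order)

    def market_buy(self, qty):
        """Walk the book from the best ask upward; return (price, qty) fills."""
        fills = []
        for price in sorted(self.levels):
            queue = self.levels[price]
            while queue and qty > 0:
                head = queue[0]  # oldest order at this price trades first
                traded = min(head.qty, qty)
                fills.append((price, traded))
                head.qty -= traded
                qty -= traded
                if head.qty == 0:
                    queue.popleft()
            if qty == 0:
                break
        # Drop price levels that were fully consumed.
        self.levels = {p: q for p, q in self.levels.items() if q}
        return fills


def average_price(fills):
    total_qty = sum(q for _, q in fills)
    return sum(p * q for p, q in fills) / total_qty


def candlestick_fill(qty, bar_open):
    """Assumed bar-level rule: fill the whole order at the bar's open price."""
    return [(bar_open, qty)]


book = ToyAskBook()
book.add(RestingOrder("a1", 100.0, 50))
book.add(RestingOrder("a2", 100.0, 50))
book.add(RestingOrder("a3", 100.5, 100))

order_level_fills = book.market_buy(150)
bar_level_fills = candlestick_fill(150, bar_open=100.0)
slippage = average_price(order_level_fills) - average_price(bar_level_fills)
```

In this toy case the 150-share buy exhausts the 100 shares resting at 100.0 and takes the rest at 100.5, so the order-level average price is higher than the bar-level fill. That gap is the kind of effect the candlestick mode abstracts away.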

System Architecture and Core Components

StockSim employs a modular, asynchronous architecture with four core components: the Exchange Simulation Engine, Data Sources, Agent, and Evaluator. The Exchange Simulation Engine manages the simulated trading environment, processes agent actions, and disseminates market indicators. Data Sources provide both market data (order-level and candlestick) and external data (news, corporate actions). The Agent component allows researchers to implement and test various trading strategies, including multi-agent setups. The Evaluator component tracks trade executions, computes performance metrics, and generates visual diagnostics.
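As an illustration of the Evaluator's role, the sketch below computes two common performance metrics from an equity curve. This summary does not say which metrics StockSim reports, so total return and maximum drawdown are assumptions, and the function names and sample values are hypothetical.

```python
def total_return(equity):
    """Fractional gain from the first to the last portfolio value."""
    return equity[-1] / equity[0] - 1.0


def max_drawdown(equity):
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


# Made-up portfolio values after each simulated step.
equity_curve = [100_000.0, 104_000.0, 98_800.0, 101_000.0, 110_000.0]

metrics = {
    "total_return": total_return(equity_curve),
    "max_drawdown": max_drawdown(equity_curve),
}
```

For this made-up curve the total return is about 10% and the maximum drawdown is about 5% (from 104,000 down to 98,800).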

The Exchange Simulation Engine acts as the central intermediary between data sources and trading agents, routing data dynamically and maintaining internal states related to orders and trades. Agents communicate asynchronously with the Engine via RabbitMQ, ensuring reliable message delivery and scalable communication.
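The asynchronous pattern described here can be sketched without a broker. The example below uses in-process `asyncio.Queue` objects as a stand-in for RabbitMQ queues, so it shows only the message flow, not StockSim's actual transport code. The message fields and the `engine` and `agent` coroutines are hypothetical.

```python
import asyncio
import json


async def engine(orders, replies, n_expected):
    """Consume order messages and route an acknowledgement to each sender's queue."""
    for _ in range(n_expected):
        message = json.loads(await orders.get())
        ack = {"order_id": message["order_id"], "status": "accepted"}
        await replies[message["agent_id"]].put(json.dumps(ack))


async def agent(agent_id, orders, inbox):
    """Publish one order, then await the reply without blocking other agents."""
    order = {"agent_id": agent_id, "order_id": f"{agent_id}-1", "side": "buy", "qty": 10}
    await orders.put(json.dumps(order))
    return json.loads(await inbox.get())


async def run_demo():
    # One shared queue into the engine, one reply queue per agent.
    orders = asyncio.Queue()
    replies = {name: asyncio.Queue() for name in ("agent_a", "agent_b")}
    results = await asyncio.gather(
        engine(orders, replies, n_expected=len(replies)),
        *(agent(name, orders, replies[name]) for name in replies),
    )
    return results[1:]  # drop the engine's None result, keep the agents' acks


acks = asyncio.run(run_demo())
```

With a real broker, the shared `orders` queue and the per-agent reply queues would be broker-side queues, and the JSON strings would be the message bodies. The agents would still be decoupled from the engine in the same way.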

Agent Capabilities and Multi-Agent Coordination

Agents in StockSim can subscribe to data streams, submit and cancel orders, receive execution outcomes and portfolio updates, and log reasoning. The platform includes a modular LLMTradingAgent that delegates decision-making to a team of specialist LLMs, such as market-technical analysts, news analysts, and fundamental analysts. This design enables researchers to rapidly prototype new agent structures, experiment with different backbones or prompting techniques, and conduct ablation studies.
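A minimal sketch of the delegation idea follows. It replaces the LLM calls with deterministic stub functions and combines their opinions by strict-majority vote. The aggregation rule, the snapshot fields, and the `ToyCoordinator` class are assumptions for illustration and are not taken from StockSim's LLMTradingAgent.

```python
from collections import Counter


# Each stub stands in for a specialist LLM and maps a market snapshot
# to "buy", "sell", or "hold".
def technical_analyst(snapshot):
    return "buy" if snapshot["close"] > snapshot["sma_20"] else "sell"


def news_analyst(snapshot):
    return "buy" if snapshot["news_sentiment"] > 0 else "hold"


def fundamental_analyst(snapshot):
    return "buy" if snapshot["pe_ratio"] < 30 else "hold"


class ToyCoordinator:
    """Hypothetical coordinator: collects specialist opinions, then votes."""

    def __init__(self, analysts):
        self.analysts = analysts  # name -> callable(snapshot) -> action

    def decide(self, snapshot):
        opinions = {name: fn(snapshot) for name, fn in self.analysts.items()}
        action, count = Counter(opinions.values()).most_common(1)[0]
        # Assumed rule: act only on a strict majority, otherwise hold.
        if count * 2 <= len(opinions):
            action = "hold"
        return action, opinions


snapshot = {"close": 120.0, "sma_20": 115.0, "news_sentiment": -0.2, "pe_ratio": 25.0}

full_team = ToyCoordinator({
    "technical": technical_analyst,
    "news": news_analyst,
    "fundamental": fundamental_analyst,
})
full_action, full_opinions = full_team.decide(snapshot)

# Ablation: drop the technical analyst and compare the decision.
ablated_team = ToyCoordinator({
    "news": news_analyst,
    "fundamental": fundamental_analyst,
})
ablated_action, _ = ablated_team.decide(snapshot)
```

Because each specialist is just an entry in a dictionary, an ablation amounts to removing one entry. Here the full team votes to buy, while the team without the technical analyst has no majority and holds.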

StockSim maintains a unified interface across simulation engines, allowing researchers to switch between order-level and candlestick-level execution with only a configuration change. Pre-configured wrappers are provided for widely used LLMs, including LLaMA, OpenAI's offerings, and Anthropic's models.
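One way such a configuration-only switch could be structured is sketched below. The `execution_mode` key, the class names, and the `submit` signature are hypothetical and do not reflect StockSim's actual configuration schema. The point is only that strategy code written against a shared interface need not change when the engine does.

```python
from typing import Protocol


class ExecutionEngine(Protocol):
    """Interface that both (hypothetical) engines satisfy."""

    def submit(self, side: str, qty: int) -> dict: ...


class OrderLevelEngine:
    mode = "order"

    def submit(self, side, qty):
        # A real engine would match against a limit order book here.
        return {"mode": self.mode, "side": side, "qty": qty}


class CandlestickEngine:
    mode = "candlestick"

    def submit(self, side, qty):
        # A real engine would fill against bar (OHLC) data here.
        return {"mode": self.mode, "side": side, "qty": qty}


ENGINES = {"order": OrderLevelEngine, "candlestick": CandlestickEngine}


def build_engine(config: dict) -> ExecutionEngine:
    """Pick the engine class from a config value (key name is an assumption)."""
    return ENGINES[config["execution_mode"]]()


def run_strategy(engine: ExecutionEngine) -> dict:
    """Strategy code depends only on the shared interface."""
    return engine.submit("buy", 10)


order_result = run_strategy(build_engine({"execution_mode": "order"}))
candle_result = run_strategy(build_engine({"execution_mode": "candlestick"}))
```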

Evaluation of Scalability and Consistency

The paper evaluates the scalability and consistency of StockSim through a series of controlled simulation tests using varying numbers of deterministic agents. The results confirm StockSim's consistency and demonstrate that it scales almost linearly up to approximately 150 agents. Resource demands remain modest, even at maximum load.

Figure 2: System performance metrics (memory/CPU usage) for varying numbers of deterministic agents.

To show how StockSim can surface insights about model behavior, the paper presents a simulation in which GPT-o4-mini and GPT-o3 trade NVIDIA stock. The two LLMs exhibit distinct trading patterns and strategic behaviors, which highlights the evaluator's ability to capture and distinguish underlying strategic differences.

Conclusion

StockSim advances NLP research infrastructure by providing a platform for studying LLM abilities in realistic, multi-agent, temporal reasoning scenarios. By combining financial simulation with NLP evaluation tools, it connects research experiments to real-world deployment requirements. The platform's open-source availability and documentation make it broadly accessible for advancing the understanding of LLM behavior in complex decision-making environments.