Sequential Agent Modeling

Updated 13 June 2026

Sequential agent modeling is a set of techniques that decompose multi-agent interactions into sequential stages, enabling efficient learning and action conditioning.
This approach avoids the exponential complexity of joint models by factorizing state transitions and using autoregressive or diffusion-inspired methods.
Applications span MARL, collaborative LLM systems, social dilemmas, and spatiotemporal simulations, where the approach is reported to improve sample efficiency and robustness.

Sequential agent modeling refers to a set of methodologies and architectures for capturing and leveraging the structured, temporally ordered dependencies among multiple agents acting within a sequential decision process, particularly in complex, high-dimensional environments such as multi-agent reinforcement learning (MARL), collaborative LLM systems, game-theoretic domains, and agent-based simulations. Fully joint models operate on the entire multi-agent system state and joint action in one step, and so suffer exponential complexity. Sequential agent modeling instead factorizes the interactions, decomposing global transitions, policy updates, or communication flows into a chain of agent-wise or step-wise stages. This approach has been shown to improve tractability, expressivity, and sample efficiency across a range of applications. The sequential aspect commonly appears in action selection, transition modeling, communication protocols, or the orchestration of agent sub-processes.

1. Sequential Factorization in Multi-Agent World Modeling

Sequential agent modeling is foundational to state-of-the-art world model architectures for MARL. The Diffusion-Inspired Multi-Agent world model (DIMA) introduces a principled sequential decomposition of the global state transition, which reduces modeling complexity from exponential to linear in the number of agents. This is accomplished by expressing the transition as a short reverse-diffusion chain of $N$ "denoising" steps, where each step sequentially absorbs the action of one agent and conditions future predictions accordingly:

$P(s_{t+1}, s_{t+1}^{(1)}, \ldots, s_{t+1}^{(N)} | s_t, a_t^1, \ldots, a_t^N) = p(s_{t+1}^{(N)}) \prod_{k=1}^N p(s_{t+1}^{(k-1)} | s_{t+1}^{(k)}, s_t, a_t^k)$

where $s_{t+1}^{(0)} \equiv s_{t+1}$ is the fully denoised next state, $s_{t+1}^{(N)}$ corresponds to maximal uncertainty, and each conditional progressively reduces uncertainty as actions are revealed in sequence. This structure closely mirrors the reverse process in score-based diffusion models, resulting in both tractable training and stable, permutation-invariant predictions. DIMA employs a diffusion-based dynamics model (1D U-Net), a reward and termination predictor, and a state-to-observation VQ-VAE autoencoder, optimizing a denoising MSE loss over randomly sampled agent orderings and diffusion steps (Zhang et al., 27 May 2025).
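
The control flow implied by this factorization can be sketched as a loop that starts from a noise sample and absorbs one agent's action per step. This is a minimal toy sketch, not DIMA's implementation: `toy_denoiser` is a hypothetical stand-in for the learned denoiser, its fixed blending rule is arbitrary, and giving actions the same dimension as the state is an assumption made only to keep the example short.

```python
import numpy as np


def toy_denoiser(noisy_next, state, action, step):
    """Hypothetical stand-in for a learned p(s^(k-1) | s^(k), s_t, a^k).

    A real model would be a trained network that also uses `step`; this
    fixed rule only makes the control flow runnable.
    """
    return 0.5 * noisy_next + 0.5 * state + 0.1 * action


def sample_next_state(state, actions, denoiser, rng):
    """Run the N-step chain, absorbing exactly one agent's action per step."""
    num_agents = len(actions)
    order = rng.permutation(num_agents)        # random agent ordering
    x = rng.standard_normal(state.shape)       # s^(N): maximal uncertainty
    for step, agent in zip(range(num_agents, 0, -1), order):
        x = denoiser(x, state, actions[agent], step)   # s^(k) -> s^(k-1)
    return x, order


rng = np.random.default_rng(0)
state = np.zeros(4)
actions = [np.full(4, float(i)) for i in range(3)]   # three agents, toy actions
next_state, order = sample_next_state(state, actions, toy_denoiser, rng)
```

The number of denoiser calls equals the number of agents, which is where the linear (rather than exponential) scaling comes from. Resampling `order` on every call mirrors the random agent orderings used during training.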

A key finding is that sequential modeling not only improves sample efficiency (learning curves reach high performance in 2–3× fewer steps than joint models), but also yields more stable optimization and is robust to agent orderings. When compared to joint (fully flattened) action-conditioning, sequential DIMA consistently achieves higher returns and lower variance across challenging continuous-control benchmarks such as MAMuJoCo and Bi-DexHands.

2. Sequence Modeling Architectures for Multi-Agent Decision Making

Several recent frameworks have demonstrated the effectiveness of recasting multi-agent decision-making as a sequence modeling problem. Central examples include the Multi-Agent Transformer (MAT) and SrSv (Sequential rollout with Sequential value estimation):

MAT (Wen et al., 2022) uses an encoder–decoder Transformer, mapping the concatenated sequence of agent observations to an ordered sequence of agent actions. The sequential decoder leverages the multi-agent advantage decomposition theorem, enabling the joint action optimization problem to be split into a chain of per-agent conditional optimizations—each conditioned on the decisions made by predecessors. This yields linear time complexity in the number of agents and guarantees monotonic policy improvement under a PPO objective.
SrSv (Wan et al., 3 Mar 2025) extends autoregressive Transformers to model both sequential rollouts of agent actions and attention-based sequential value estimation. Each agent’s policy $\pi(a_t^i | o_t^{1:N}, a_t^{1:i-1})$ is conditioned on the observations and prior agents’ actions, and values are aggregated through masked attention reflecting the dependency structure inherent in the multi-agent process.
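
The advantage decomposition that MAT's sequential decoder relies on states that the joint advantage equals a sum of per-agent conditional advantages, $A(s, a^{1:N}) = \sum_{m=1}^{N} A^{m}(s, a^{1:m-1}, a^{m})$, where each term compares the expected return with and without agent $m$'s action fixed. The sketch below checks this numerically on a toy single-state problem and then picks actions agent by agent. The random Q-table, the product policy, and all function names are assumptions for illustration, not part of MAT or SrSv.

```python
import numpy as np

rng = np.random.default_rng(0)
num_actions = (2, 3, 4)                      # action counts for three agents
q_joint = rng.standard_normal(num_actions)   # toy Q(s, a1, a2, a3) at one state
policies = [rng.dirichlet(np.ones(n)) for n in num_actions]   # pi_i(. | s)


def q_prefix(q, policies, m):
    """Q^{1:m}(s, a^{1:m}): average out agents m+1..N under their policies."""
    out = q
    for i in range(len(policies) - 1, m - 1, -1):
        out = out @ policies[i]              # contracts the last remaining axis
    return np.asarray(out)


def sequential_advantages(q, policies, joint_action):
    """Per-agent terms A^m(s, a^{1:m-1}, a^m) for one joint action."""
    terms = []
    for m in range(1, len(policies) + 1):
        with_agent = q_prefix(q, policies, m)[joint_action[:m]]
        without_agent = q_prefix(q, policies, m - 1)[joint_action[:m - 1]]
        terms.append(float(with_agent - without_agent))
    return terms


def greedy_sequential_action(q, policies):
    """Each agent maximizes its own term given its predecessors' choices."""
    chosen = ()
    for m in range(1, len(policies) + 1):
        candidates = q_prefix(q, policies, m)[chosen]   # one value per action
        chosen += (int(np.argmax(candidates)),)
    return chosen


joint_action = (1, 2, 0)
terms = sequential_advantages(q_joint, policies, joint_action)
joint_advantage = float(q_joint[joint_action] - q_prefix(q_joint, policies, 0))
greedy_action = greedy_sequential_action(q_joint, policies)
```

Here `sum(terms)` matches `joint_advantage` up to floating-point error because the sum telescopes. Every term is non-negative for `greedy_action`, so a chain of N per-agent searches yields a joint action that is no worse than the current policy, without enumerating the joint action space.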

These architectures enable large-scale applications (e.g., up to 1,024 agents in DubinsCar benchmarks in SrSv), facilitate sample-efficient learning, and generalize across tasks and agent configurations.

Model	Sequential Mechanism	Scalability	Sample Efficiency
DIMA	Reverse-diffusion, agent-by-agent	High	2–3× faster than baselines
MAT	Autoregressive Transformer decoding	High	20–50% fewer steps
SrSv	Transformer, sequential rollout and value estimation	Very high	Markedly faster convergence

3. Sequential Agent Communication and Orchestration

Sequentiality in agent communication is critical for efficient collaboration, particularly as seen in recent LLM-based multi-agent systems. NeXa (Tastan et al., 15 May 2026) frames the orchestration task as a hybrid of parallel and sequential modes: initially, all agents independently produce candidate outputs, which are then embedded into a shared semantic space. A learned, response-conditioned policy predicts a sparse directed acyclic communication graph, along which a single round of sequential message passing and response refinement occurs. This mechanism subsumes both pure-parallel and traditional sequential pipelines (such as chain-of-thought) and is acyclic by construction.
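
One simple way to obtain a graph that is acyclic by construction is to fix an ordering of the agents and allow edges only from earlier to later agents. The sketch below illustrates that idea together with a single sequential refinement pass. It is not NeXa's learned policy: the random edge scores, the threshold, and the names `build_dag`, `refine_once`, and `refine` are hypothetical, and strings stand in for model responses.

```python
import numpy as np


def build_dag(edge_scores, priority, threshold):
    """Keep edge i -> j only if i precedes j in `priority` and its score is high.

    Restricting edges to point forward in a fixed ordering rules out cycles,
    whatever the scores are.
    """
    rank = np.empty(len(priority), dtype=int)
    rank[priority] = np.arange(len(priority))   # rank[i] = position of agent i
    forward = rank[:, None] < rank[None, :]
    return (edge_scores > threshold) & forward


def refine_once(drafts, adjacency, priority, refine):
    """One sequential pass: each agent refines its draft from its parents' outputs."""
    outputs = list(drafts)
    for j in priority:                          # parents are always visited first
        parents = [outputs[i] for i in np.flatnonzero(adjacency[:, j])]
        if parents:
            outputs[j] = refine(drafts[j], parents)
    return outputs


rng = np.random.default_rng(0)
num_agents = 4
drafts = [f"draft-{i}" for i in range(num_agents)]   # parallel first round
priority = rng.permutation(num_agents)
adjacency = build_dag(rng.random((num_agents, num_agents)), priority, threshold=0.5)
refined = refine_once(
    drafts, adjacency, priority,
    refine=lambda own, parents: own + "|" + "+".join(parents),
)
```

An empty `adjacency` reduces to the pure-parallel case (every draft is returned unchanged), while a single path through all agents reduces to a conventional sequential pipeline.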

Empirically, NeXa demonstrates superior accuracy–efficiency trade-offs (e.g., highest average accuracy, 60.90%, and 35% lower token usage than strong sequential baselines), maintains communication sparsity as agent count increases, and exhibits strong transferability across agent team sizes, tasks, and underlying model backbones.

Beyond LLMs, value-aware sequential communication protocols such as SeqComm-DFL (Amoh et al., 10 Apr 2026) explicitly model Stackelberg-style leader–follower conditioning: agents generate messages in a priority order, where each agent conditions its message on all previous messages. This protocol is end-to-end trainable via bilevel optimization and yields significant empirical and information-theoretic gains in coordination under partial observability.

4. Sequential Agent Modeling in Social Dilemmas and Games

Sequential agent modeling is indispensable in the study of temporally extended social dilemmas and complex games. In sequential social dilemmas (SSDs), as formulated by Leibo et al. (2017), cooperation and defection are properties of agent policies over entire trajectories, not of single-step atomic actions. The sequential nature of policy learning, environmental reward structure, and inter-agent state dependencies critically shapes emergent social behavior. This is demonstrated by the divergent cooperation regimes in the Gathering and Wolfpack environments, and by the sensitivity of group outcomes to agent-level parameters such as discount factor and model capacity.

In non-cooperative multi-agent settings such as No-Press Diplomacy (Paquette et al., 2019), sequential modeling is instantiated at multiple levels: policy networks model the Markov game as a sequence of phases, issue action sequences via autoregressive decoders conditioned on unit ordering, and are trained with both supervised learning (expert demonstration) and reinforcement learning (self-play), capturing both stable coalition tactics and dynamic betrayal.

5. Sequential Agent Modeling for Spatiotemporal Prediction and Simulation

Spatiotemporal multi-agent prediction tasks benefit from architectures that are inherently sequential but permutation-invariant over agent and time dimensions. baller2vec (Alcorn et al., 2021) generalizes Transformer attention mechanisms to the multi-entity domain by representing input as a time–entity matrix with causal masks, enabling efficient joint modeling of sports trajectories (e.g., basketball player movement and ball dynamics) without imposed order on agent indices. This results in improved data efficiency and expressivity compared to graph-RNN baselines, while attention heads learn meaningful agent-interaction patterns (e.g., predicting pass recipients).
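
A causal mask over a time–entity layout can be sketched as follows. The sketch assumes the matrix is flattened time-major into a single token sequence and that entities at the same time step may attend to one another. Both are assumptions about the layout, and the function name is hypothetical.

```python
import numpy as np


def time_entity_mask(num_steps, num_entities):
    """Boolean attention mask over a flattened (time, entity) sequence.

    mask[q, k] is True when query token q may attend to key token k, that is,
    when k's time step is not later than q's.
    """
    step_of_token = np.repeat(np.arange(num_steps), num_entities)
    return step_of_token[None, :] <= step_of_token[:, None]


mask = time_entity_mask(num_steps=3, num_entities=2)
```

Because the mask depends only on each token's time step, permuting the entities within a step leaves it unchanged. This is what allows the model to avoid imposing an order on agent indices while still preventing attention to the future.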

A similar paradigm holds in agent-based epidemiological modeling, where agent-indexed configurations are sequentially updated via advanced particle filters (Sequential Monte Carlo, SMC) (Ju et al., 2021). Fully adapted and controlled SMC proposals increase effective sample size and inferential fidelity by considering sequential inter-agent heterogeneity, rather than global count-based summaries.
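
For orientation, the simplest member of the SMC family is the bootstrap particle filter, sketched below on a toy agent-based SIS model. It is not the fully adapted or controlled proposals discussed above. The transition rule, the binomial reporting model, and all parameter values are assumptions chosen for illustration.

```python
import math

import numpy as np


def binomial_pmf(k, n, p):
    """P(K = k) for K ~ Binomial(n, p); zero when k > n."""
    if k > n:
        return 0.0
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def step_agents(infected, infect_rate, recover_rate, rng):
    """Advance every agent in every particle by one step of a toy SIS model."""
    prevalence = infected.mean(axis=1, keepdims=True)   # one value per particle
    p_infected_next = np.where(infected, 1 - recover_rate, infect_rate * prevalence)
    return rng.random(infected.shape) < p_infected_next


def bootstrap_filter(observations, num_agents, num_particles, rng,
                     infect_rate=0.8, recover_rate=0.2, report_prob=0.7):
    """Bootstrap particle filter; returns the effective sample size per step."""
    # Each particle is a full agent-level configuration, not a count summary.
    particles = rng.random((num_particles, num_agents)) < 0.2
    ess_history = []
    for y in observations:
        particles = step_agents(particles, infect_rate, recover_rate, rng)
        counts = particles.sum(axis=1)
        # Observation model (assumed): y ~ Binomial(number infected, report_prob).
        weights = np.array([binomial_pmf(y, int(c), report_prob) for c in counts])
        if weights.sum() == 0:        # pragmatic fallback: no particle explains y
            weights = np.ones(num_particles)
        weights = weights / weights.sum()
        ess_history.append(float(1.0 / np.sum(weights**2)))
        resampled = rng.choice(num_particles, size=num_particles, p=weights)
        particles = particles[resampled]
    return ess_history


rng = np.random.default_rng(0)
ess = bootstrap_filter([2, 3, 1], num_agents=10, num_particles=200, rng=rng)
```

The effective sample size, `1 / sum(w**2)`, lies between 1 and the number of particles. A bootstrap filter proposes from the transition model without looking at the next observation, which is the weakness that better-adapted proposals are meant to address.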

6. Sequential Modeling and Orchestration in LLM-Based and Mathematical Modeling Agents

Sequential agent modeling underpins the design of complex LLM-powered modeling agents, where tasks (e.g., mathematical modeling in MM-Agent (Liu et al., 20 May 2025)) are decomposed into a chain of problem analysis, model formulation, computational realization, and report generation, with each phase feeding state and memory into the next according to an explicit dependency DAG. Coordination among sub-agents is managed via modular memory and iterative refinement. This sequential structure is reported to yield expert-level performance and efficient resource usage in open-ended, real-world problem solving.

Unified language-model-based decision architectures (DLM (Zhang et al., 26 Apr 2026)) further formalize multi-agent sequential decision processes as dialogue-style sequence modeling problems. Agents predict actions autoregressively, conditioned on both their own and others’ histories. Robustness to out-of-distribution actions, zero-shot generalization, and compatibility with heterogeneous observation and action spaces are achieved via staged optimization (supervised fine-tuning and group relative policy optimization).

7. Theoretical and Practical Implications

Sequential agent modeling fundamentally mitigates the curse of dimensionality associated with joint transition and policy spaces, establishes strong theoretical guarantees (e.g., contraction for interactive POMDPs (Doshi et al., 2011), monotonic improvement for sequential MARL), and underlies sample-efficient, scalable, and transferable multi-agent architectures. A persistent theme is the close connection to generative sequential modeling (notably diffusion processes), sequential advantage/value decomposition, and Stackelberg conditioning for communication.

Empirically, the cited studies report that sequential factorization, whether applied to world modeling, communication, orchestration, social learning, or spatiotemporal forecasting, improves sample efficiency, scalability, and robustness. Ablation studies indicate that sequentially structured models outperform joint or parallel alternatives in stability and final task performance, provided care is taken to ensure permutation invariance and efficient agent-order selection (Zhang et al., 27 May 2025, Tastan et al., 15 May 2026, Wan et al., 3 Mar 2025, Wen et al., 2022).