Multi-Agent Classroom Systems

Updated 10 November 2025
  • Multi-agent classroom setups are structured learning environments where autonomous agents interact using formal protocols to adapt instruction and foster cooperative or competitive behaviors.
  • They leverage techniques from reinforcement learning and LLM-based simulations to coordinate agent roles, curriculum sequencing, and system evaluations.
  • Key methodologies include role engineering, modular architectures, and empirical metrics that inform scalable and adaptive instructional designs in AI-driven education.

A multi-agent classroom setup refers to the explicit design, deployment, and orchestration of learning environments in which multiple autonomous agents—software agents, embodied learning agents, or simulated learners—interact to foster learning, curriculum adaptation, and the emergence of cooperative or competitive behaviors. Such setups can be found in domains ranging from cooperative multi-agent reinforcement learning (MARL) and curriculum learning to AI-powered instructional design, virtual classrooms with LLM-based agents, and participatory simulations in educational software. This article provides a systematic overview of the architectural principles, curriculum and role-design methodologies, interaction protocols, evaluation metrics, and empirical results that define the state of the art in multi-agent classroom systems.

1. Formal Architectures of Multi-Agent Classrooms

Multi-agent classroom architectures are characterized by the instantiation of multiple, interacting agents operating under formal protocols and organizational structures. In reinforcement learning contexts, the cooperative multi-agent Markov Decision Process (MMDP) is a canonical formalism, with agents defined by their state spaces, action sets, reward functions, and transition kernels. In Overcooked, for example, there are N = 2 agents, a joint state space S, individual action sets A^1 and A^2, a shared team reward R_{team}, and an episodic return R_{team} = \sum_{t=0}^{T-1} 20 \cdot \mathbf{1}\{\text{soup delivered at } t\} (Bhati et al., 2023).
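As a minimal sketch of this formulation, the episodic team return can be accumulated as below; the helper names (joint_step, select_actions) are illustrative placeholders rather than the actual Overcooked interface.

```python
# Sketch of the episodic team return R_team = sum_t 20 * 1{soup delivered at t}
# for a two-agent cooperative MMDP. Helper names are hypothetical, not the
# published Overcooked API.
from typing import Callable, Tuple

def rollout_team_return(
    joint_step: Callable[[Tuple[int, int]], Tuple[object, bool, bool]],
    select_actions: Callable[[object], Tuple[int, int]],
    initial_state: object,
    horizon: int = 400,
) -> float:
    """Accumulate the joint team reward over one episode."""
    state, team_return = initial_state, 0.0
    for _ in range(horizon):
        actions = select_actions(state)                   # (a^1_t, a^2_t) from both agents
        state, soup_delivered, done = joint_step(actions)
        team_return += 20.0 if soup_delivered else 0.0    # sparse +20 per delivered soup
        if done:
            break
    return team_return
```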

In LLM-based classrooms, agent pools include teacher, assistant, diverse classmate roles, and a hidden Manager Agent that coordinates turn-taking and function calls, as in the SimClass framework (Zhang et al., 27 Jun 2024). Agent behavior is typically governed by finite-state machines or Mealy automata (MAIC: M_a = (S_a, \Sigma, \Delta, \delta_a, \lambda_a)), and orchestrated by session controllers or explicit policy mappings (e.g., \mathcal{L}: S_t \rightarrow (a_t, f_t) in SimClass).
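A schematic reconstruction of such a Mealy-style agent is given below; the state and input symbols are illustrative, not the published SimClass or MAIC code.

```python
# Mealy-style agent M_a = (S_a, Sigma, Delta, delta_a, lambda_a): delta maps
# (state, input) -> next state, lam maps (state, input) -> emitted action.
# States, symbols, and actions are illustrative placeholders.
class MealyAgent:
    def __init__(self, initial_state, delta, lam):
        self.state = initial_state
        self.delta = delta      # dict: (state, symbol) -> next state
        self.lam = lam          # dict: (state, symbol) -> emitted action

    def step(self, symbol):
        action = self.lam[(self.state, symbol)]
        self.state = self.delta[(self.state, symbol)]
        return action

# Example: a teacher agent that lectures until a student question arrives.
delta = {("lecturing", "tick"): "lecturing", ("lecturing", "question"): "answering",
         ("answering", "tick"): "lecturing", ("answering", "question"): "answering"}
lam = {("lecturing", "tick"): "advance_slide", ("lecturing", "question"): "respond",
       ("answering", "tick"): "lecture", ("answering", "question"): "respond"}
teacher = MealyAgent("lecturing", delta, lam)
print(teacher.step("tick"), teacher.step("question"))   # advance_slide respond
```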

Modular architectures such as the von Neumann Multi-Agent System (MAS) decompose each agent into a Control Unit, Logic Unit, Memory Unit (short- and long-term), and I/O devices, each cycling through task deconstruction, self-reflection, memory updates, and tool invocation. Message passing and blackboard systems mediate inter-agent communication in both educational AI and simulation spaces (Jiang et al., 30 Dec 2024, Jiang et al., 1 Sep 2025, Ma et al., 7 Oct 2025).
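A rough sketch of this modular decomposition follows; field and method names are hypothetical, meant only to illustrate the control cycle rather than reproduce the cited implementations.

```python
# Schematic von Neumann MAS agent: logic unit, short-/long-term memory, and tool
# I/O, cycling through task deconstruction, self-reflection, and memory updates.
from dataclasses import dataclass, field

@dataclass
class MemoryUnit:
    short_term: list = field(default_factory=list)   # working context for the current task
    long_term: list = field(default_factory=list)    # persisted results and reflections

@dataclass
class VonNeumannAgent:
    logic: callable          # e.g. an LLM call: (prompt, context) -> text
    tools: dict              # tool name -> callable, invoked via I/O
    memory: MemoryUnit = field(default_factory=MemoryUnit)

    def run_cycle(self, task: str) -> str:
        subtasks = [s.strip() for s in task.split(";") if s.strip()]       # task deconstruction
        outputs = []
        for sub in subtasks:
            if sub in self.tools:
                result = str(self.tools[sub]())                            # tool invocation via I/O
            else:
                result = self.logic(sub, self.memory.short_term)           # logic unit (e.g. LLM call)
            reflection = self.logic(f"Briefly critique: {result}", [])     # self-reflection pass
            self.memory.short_term.append((sub, result, reflection))       # short-term memory update
            outputs.append(result)
        self.memory.long_term.extend(self.memory.short_term)               # consolidate into long-term memory
        return " | ".join(outputs)
```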

2. Curriculum Design and Agent Role Engineering

Curriculum learning for multi-agent classrooms involves not just sequencing environments or tasks, but selecting and ordering teammates or peer policies to shape the learning trajectory. In the cooperative Overcooked MMDP, curriculum strategies studied include pairing the student agent with (A) a fixed pre-trained teammate of specified skill level; (B) a curriculum of partners with increasing skill; (C) a curriculum with decreasing skill (begin with high skill, end with low skill); and (D) fully joint learning (both agents learn from scratch). Skill is operationalized strictly as training episodes accrued via Independent DQN (e.g., t_1 = 2{,}000, t_2 = 5{,}000, t_3 = 10{,}000 episodes), normalized as skill_{norm}(\pi'_{\tau}) = \tau / K \in [0, 1] (Bhati et al., 2023).

Pseudocode for curriculum integration involves partitioning the total training budget K into buckets, one per curriculum phase, fixing the teammate within each bucket, and training the student's Q-network against that teammate; a minimal sketch follows.
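The sketch below assumes hypothetical load_teammate and train_student_dqn helpers and equal-size buckets; it illustrates the loop structure, not the authors' exact training code.

```python
# Curriculum integration sketch: split the training budget K into equal buckets,
# freeze one pre-trained teammate per bucket, and train the student against it.
def run_curriculum(total_budget_k, teammate_skill_order, load_teammate, train_student_dqn):
    """teammate_skill_order e.g. ['high', 'medium', 'low'] for the decreasing-skill variant (C)."""
    bucket = total_budget_k // len(teammate_skill_order)    # episodes per curriculum phase
    for phase, skill in enumerate(teammate_skill_order):
        teammate_policy = load_teammate(skill)              # pre-trained partner, frozen in this bucket
        start = phase * bucket
        train_student_dqn(
            partner=teammate_policy,
            episodes=range(start, start + bucket),          # this phase's slice of the budget
        )

# Skill normalization from the section above: skill_norm(pi_tau) = tau / K in [0, 1].
def skill_norm(training_episodes_tau, total_budget_k):
    return training_episodes_tau / total_budget_k
```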

In agent-centric instructional systems, agent role engineering follows explicit decompositions: e.g., the FACET system deploys a teacher agent (aggregator/design), learner agents (profile/model students), and an evaluator agent (didactical QA) (Gonnermann-Müller et al., 15 Aug 2025). Similar staged modularity appears in KLI-embedded instructional design (KC Agent, Process Agent, Principle Agent, Design Agent, Feedback Agent) (Wang et al., 20 Aug 2025) and in protocol-based automation of course material (Teaching Faculty Agent, Instructional Designer Agent, TA, Coordinator, Program Chair) (Yao et al., 27 Aug 2025).
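Such role decompositions are often expressed as a small configuration of role-specific system prompts and tool grants; the entries below are illustrative placeholders in the spirit of FACET, not its published prompts.

```python
# Illustrative role-engineering configuration: one entry per agent role, with a
# system prompt and the functions that role may call. All text is hypothetical.
AGENT_ROLES = {
    "teacher": {
        "system_prompt": "Aggregate learner feedback and design a differentiated worksheet.",
        "tools": ["draft_worksheet", "revise_worksheet"],
    },
    "learner": {
        "system_prompt": "Simulate a student with the given knowledge profile; flag unclear tasks.",
        "tools": ["annotate_difficulty"],
        "instances": 3,   # several learner personas per run
    },
    "evaluator": {
        "system_prompt": "Check didactical structure, clarity, and suitability against the rubric.",
        "tools": ["score_rubric"],
    },
}
```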

3. Interaction Protocols, Communication, and Control

Interaction mechanisms in multi-agent classrooms are formalized through message-passing, function libraries, and role-based control flows.

  • In SimClass, the session controller and Manager Agent regulate classroom state via the explicit state representation S_t = \{C_t, H_t \mid \text{Roles}\} and the policy mapping \mathcal{L}. Functions are partitioned between teacher-exclusive (e.g., lecture, advance slide) and general interaction (generate, respond, discuss). Turn-taking is managed by explicit manager messages and predefined buffers; agent utterances are annotated with JSON schemas for traceability (a schematic control loop is sketched after this list) (Zhang et al., 27 Jun 2024).
  • In reinforcement learning classrooms, when agents are decentralized, interaction involves distributed epistemic uncertainty measurement (Random Network Distillation) and peer-to-peer action advising, where each agent autonomously transitions between “student” and “teacher” modes based on \mu(o), the RND-derived epistemic signal. Advising rules (early advising or importance-based advising) control the granularity and timing of assistance (İlhan et al., 2019).
  • In agentic instructional design, communication and orchestration are handled by Orchestrator modules, which enforce dependency graphs (e.g., the ADDIE phases analyze, design, develop, implement, evaluate) and manage synchronous or asynchronous agent calls via structured JSON contracts or APIs (Yao et al., 27 Aug 2025).
  • In participatory simulation (NetLogo-based classrooms), coordination is achieved through HubNet message-passing (TCP/IP, port 2050), with synchronous "Collect–Update–Display" cycles and majority/teacher-controlled application of student menu choices (Gkiolmas et al., 2015).
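The controller loop referenced above can be sketched schematically as follows; the JSON field names are illustrative assumptions, not the actual SimClass schema.

```python
# Schematic session-controller loop: observe the classroom state (course slice C_t
# plus role-tagged history H_t), let the manager policy pick the next speaker and
# function (L: S_t -> (a_t, f_t)), and append a JSON-annotated utterance.
import json

def run_session(manager_policy, agents, course_slides, max_turns=20):
    history = []                                        # H_t: role-annotated utterances
    for t in range(max_turns):
        state = {"slide": course_slides[min(t, len(course_slides) - 1)], "history": history}
        role, function = manager_policy(state)          # manager chooses speaker and function
        utterance = agents[role](function, state)       # the chosen agent executes the function
        history.append({"turn": t, "role": role, "function": function, "text": utterance})
    return json.dumps(history, indent=2)                # traceable JSON transcript
```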

4. Metrics and Evaluation Methodologies

Success in a multi-agent classroom is defined along dimensions of joint task performance, individual agent progress, interaction patterns, and pedagogical quality assurance.

MARL settings:

  • Team performance: G_{team}(e) = 20 \cdot \#\{\text{soups delivered}\} (Overcooked); individual contribution: G_{student}(e) = \sum_t r^s_t (Bhati et al., 2023).
  • In advising-based RL, learning performance is quantified by normalization to expert baseline scores on held-out validation levels, area under the learning curve, and final evaluation score; ablations probe the effect of advice budget size, advisor skill, and heterogeneity (İlhan et al., 2019).
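These MARL quantities can be computed directly, as in the sketch below; the normalization shown is a simple ratio to the expert baseline and is an assumption rather than the cited papers' exact convention.

```python
# Sketch of the MARL evaluation quantities listed above.
def team_return(soups_delivered):
    return 20.0 * soups_delivered                        # G_team(e) = 20 * #soups delivered

def student_contribution(student_rewards):
    return sum(student_rewards)                          # G_student(e) = sum_t r^s_t

def normalized_score(agent_score, expert_score):
    return agent_score / expert_score                    # relative to expert baseline on held-out levels

def area_under_learning_curve(eval_scores):
    # Trapezoidal area over periodic evaluation checkpoints (unit spacing assumed).
    return sum((a + b) / 2.0 for a, b in zip(eval_scores, eval_scores[1:]))
```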

Instructional systems:

  • FACET's evaluator agent uses 1–6 rubrics for didactical structure, clarity, creativity, and suitability, with formulas such as score_{dimension} = 1 + 5 \times (\text{matched subcriteria} / \text{total subcriteria}); internal stability is measured as a standard deviation \sigma < 0.5 over ten runs (Gonnermann-Müller et al., 15 Aug 2025).
  • KLI-MAS employs weighted utility functions over knowledge, learning process, and instructional principle tags, e.g.

U_j = \sum_{i=1}^{N} \left( w_K \, K\_score_{i,j} + w_L \, L\_score_{i,j} + w_I \, I\_score_{i,j} \right)

with weights w_K = w_L = w_I = 1/3 (Wang et al., 20 Aug 2025).

  • EduVerse defines network density D = 2E / (N(N-1)) and IRF rates (Initiation–Response–Feedback) R_{IRF} = \#\{(I_t, R_{t+1}, F_{t+2})\} / T to compare peer-to-peer and teacher-directed interactions to real classrooms; session-to-session positive transition rates R^+ describe longitudinal behavioral growth (Ma et al., 7 Oct 2025). Both interaction metrics are sketched after this list.
  • In MAIC, lag sequential analysis identifies engagement chains and computes transition probabilities and z-scores for event pairs. Learning gains (ΔScore), motivational shifts, and technology acceptance are computed with standard psychometric procedures (Cohen's d, Cronbach's α) (Hao et al., 3 Jun 2025).
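A direct computation of the two EduVerse-style interaction metrics defined above, using illustrative inputs:

```python
# Network density D = 2E / (N(N-1)) over the who-talks-to-whom graph, and the IRF
# rate as the fraction of turns opening an Initiation -> Response -> Feedback triple.
def network_density(num_edges, num_agents):
    return 2.0 * num_edges / (num_agents * (num_agents - 1))

def irf_rate(turn_labels):
    """turn_labels: sequence like ['I', 'R', 'F', 'I', ...] over T classroom turns."""
    total_turns = len(turn_labels)
    triples = sum(
        1
        for t in range(total_turns - 2)
        if turn_labels[t] == "I" and turn_labels[t + 1] == "R" and turn_labels[t + 2] == "F"
    )
    return triples / total_turns

print(network_density(num_edges=18, num_agents=10))   # 0.4
print(irf_rate(["I", "R", "F", "I", "R", "F"]))       # ~0.33
```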

5. Empirical Findings and Design Implications

Cooperative MARL

  • Pairing students with low-skill partners maximizes team reward and accelerates learning curves, but leads to student passivity, with G_{student} near zero (“lazy” learners). Medium-skill partners balance student contribution with near-optimal team returns. Dynamic curricula, especially those with decreasing partner skill (high → medium → low), outperform simultaneous learning or increasing-skill regimes (Bhati et al., 2023).
  • In decentralized advising, early advising with a moderate budget (5–10k exchanges per 20k episodes) measurably improves learning in agent teams of heterogeneous or partitioned skill. Excessive advising (unbounded budget) can degrade learning performance due to confusion (İlhan et al., 2019).

Agentic LLM Classrooms

  • LLM-based simulations with explicit role and turn control (SimClass) yield interaction patterns (teacher talk/student talk, indirect/direct ratios, SIR) and user-reported cognitive and social presence scores nearly matching real classrooms (Zhang et al., 27 Jun 2024).
  • Teacher-facing multi-agent systems dedicated to personalized worksheet creation (FACET) yield artifacts highly rated by in-service math teachers on didactical structure and suitability. Suggested improvements include increasing scaffolding for low-knowledge learners (Gonnermann-Müller et al., 15 Aug 2025).
  • Embedding explicit learning designs (KLI framework) into agent prompts and pipelines produces more creative and contextually relevant activities, as confirmed by qualitative teacher feedback, even if quantitative rubric deltas are small (Wang et al., 20 Aug 2025).
  • Multi-agent classrooms support differentiated engagement: students with low prior knowledge benefit more from co-construction (teacher-initiated, peer-interactive) patterns, evidenced by higher learning gains and motivation increases, while high prior-knowledge students engage more in co-regulation with limited improvement (Hao et al., 3 Jun 2025).
  • In multi-session simulations (EduVerse), IRF rates, network density, and cross-session transition matrices closely match empirical classroom benchmarks, demonstrating that realistic cognitive, emotional, and behavioral development—over time and across diverse genres/settings—can be reproduced (Ma et al., 7 Oct 2025).

6. Implementation Considerations and Best Practices

  • Initial partner skill selection (low, medium, high) must be operationalized using reproducible criteria, such as episode-based DQN training history in MARL, or persona vector assignment in simulation/classroom modeling (Bhati et al., 2023, Ma et al., 7 Oct 2025).
  • Curriculum switching can be fixed-episode, performance-based (e.g., triggered when a moving average of G_{team} crosses a threshold; see the sketch after this list), or randomized to avoid overfitting to fixed orders.
  • LLM-based environments require prompt libraries, role-specific system prompts, JSON-RPC or schema-governed communication, and should modularize agent logic to allow for plug-and-play substitution (e.g., swap GPT-4.1 for GPT-4o in FACET) (Gonnermann-Müller et al., 15 Aug 2025).
  • For deployment, agent microservices should be containerized (Docker, K8s), use centralized message brokers (RabbitMQ, Kafka) for bus/blackboard communication, and integrate directly with learning management systems (LMS) for output consumption (Jiang et al., 1 Sep 2025, Yao et al., 27 Aug 2025).
  • Evaluation must always track both individual learning contributions and overall system throughput. Large gaps where students contribute minimally—despite high joint performance—indicate pathological classroom dynamics (Bhati et al., 2023).
  • To support real-time adaptation and prevent overfitting, randomize partner/teammate order and leverage performance/adaptation-based curriculum triggers as opposed to static sequencing.
  • As classroom scale grows, additional skill levels, randomized curriculum orders, and performance-triggered transitions should be added. Peer communication can employ self-attention or permutation-invariant pooling to handle variable group sizes (see SPC) (Wang et al., 2023).
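A minimal sketch of the performance-based curriculum trigger mentioned above; the window size and threshold are illustrative assumptions, not values from the cited papers.

```python
# Performance-based curriculum switch: advance to the next teammate bucket once a
# moving average of G_team crosses a threshold.
from collections import deque

class CurriculumTrigger:
    def __init__(self, threshold, window=100):
        self.threshold = threshold
        self.recent_returns = deque(maxlen=window)       # moving window of G_team values

    def record(self, team_return):
        self.recent_returns.append(team_return)

    def should_switch(self):
        if len(self.recent_returns) < self.recent_returns.maxlen:
            return False                                 # wait until the window is full
        moving_avg = sum(self.recent_returns) / len(self.recent_returns)
        return moving_avg >= self.threshold

# Usage: call record(G_team) after each episode; when should_switch() returns True,
# load the next teammate in the curriculum order and reset the trigger.
```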

7. Limitations, Challenges, and Future Directions

Empirical work establishes that curriculum effects can be counterintuitive—starting with high-skill partners and decreasing skill (contrary to common single-agent curriculum doctrine) can accelerate and stabilize learning (Bhati et al., 2023). Decentralized multi-agent RL classrooms show robust sample-efficiency improvements via peer advising, but only within well-tuned advice budgets and with careful uncertainty calibration (İlhan et al., 2019).

LLM-based multi-agent classrooms face practical challenges in turn-taking latency, inconsistency across agents, memory synchronization, and controlling for drift in large agent pools. Longitudinal simulations (EduVerse) reveal the necessity of consistent personality, emotional, and cognitive state evolution to match human classroom dynamics (Ma et al., 7 Oct 2025). Integration of richer learner profiles, adaptive prompt templating, LMS interoperability, and empirical teacher-in-the-loop validation are prominent next steps.

In summary, multi-agent classroom setups, grounded in hybrid architectures, rigorous skill and curriculum models, and formal agent orchestration, represent a robust paradigm for research on learning dynamics, instructional design, and scalable personalization across both AI and human educational contexts.
