Degenerate equilibria in self-contained co-evolving agent–coach systems

Determine whether a fully self-contained multiagent system in which the agents and a trainable coach co-evolve without any external supervision (such as meta-evaluation from a stronger external model, agreement with outcome-based verification, or human feedback) can avoid converging to degenerate equilibria.
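One way to make the question precise is to view the agents and the coach as a coupled optimization and ask whether its fixed points can be non-degenerate. The notation below (a joint agent policy pi_theta, a coach reward model r_phi, and an external task return R_task) is introduced here purely for illustration and is not taken from the paper.

\theta^{\star} \in \arg\max_{\theta} \; \mathbb{E}_{\tau \sim \pi_{\theta}} \Big[ \textstyle\sum_{t} r_{\phi^{\star}}(s_t, a_t) \Big],
\qquad
\phi^{\star} \in \arg\max_{\phi} \; J_{\mathrm{self}}\big(\phi;\, \pi_{\theta^{\star}}\big),

where J_self is any training signal computed entirely inside the system. A degenerate equilibrium is then a fixed point (theta*, phi*) at which the internal reward is near-maximal yet

\mathbb{E}_{\tau \sim \pi_{\theta^{\star}}}\big[ R_{\mathrm{task}}(\tau) \big] \;\ll\; \max_{\theta} \, \mathbb{E}_{\tau \sim \pi_{\theta}}\big[ R_{\mathrm{task}}(\tau) \big],

i.e. agents and coach are mutually consistent, but the equilibrium carries little information about the actual task.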

Background

The paper introduces MAPPA, a framework that trains multiagent systems using per-action process rewards from an LLM coach to address credit assignment and sample efficiency challenges. In the Future Directions section, the authors propose making the coach itself a trainable agent within the multiagent system, co-evolving with the agents rather than relying on external supervision.
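To make the mechanism concrete, the sketch below shows one way per-action process rewards from a coach can drive credit assignment in a policy-gradient update: every action in a rollout receives its own coach score rather than a single trajectory-level outcome reward. The function name and the REINFORCE-style loss are illustrative assumptions, not MAPPA's actual implementation.

import torch

def process_reward_policy_loss(log_probs, coach_scores, gamma=1.0):
    # log_probs:    shape (T,), log pi(a_t | s_t) for one agent rollout
    # coach_scores: shape (T,), the coach's per-action process rewards
    # (illustrative sketch, not the paper's actual loss)
    T = log_probs.shape[0]
    returns = torch.zeros(T)
    running = 0.0
    # Credit assignment: weight each action by the discounted sum of coach
    # rewards from that step onward, instead of one delayed outcome reward.
    for t in reversed(range(T)):
        running = float(coach_scores[t]) + gamma * running
        returns[t] = running
    # Variance reduction by normalization (an illustrative choice).
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    return -(log_probs * returns).sum()

Because each step carries its own reward, actions can be reinforced or discouraged individually rather than sharing a single trajectory-level signal, which is the credit-assignment benefit referred to above.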

They discuss possible signals for training such a coach (stronger external models, outcome-based verification, or human feedback) and raise a concern about whether a fully self-contained setup—where agents and the coach co-evolve without external oversight—could fall into degenerate equilibria. This uncertainty motivates a formal investigation into the stability and convergence properties of such closed systems.
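The worry can be illustrated with a deliberately tiny closed loop (a toy construction for intuition only, not the paper's setup): if the coach is updated only toward consistency with what the agents already do, and the agents are updated only toward what the coach already rewards, the pair can settle into a mutually consistent fixed point that is unrelated to real task quality.

# Toy closed loop: a one-dimensional "agent" behaviour a and "coach" target c.
# Coach reward:     r_coach(a) = -(a - c)^2        (what the loop optimizes)
# External quality: r_task(a)  = -(a - A_TRUE)^2   (never observed inside the loop)
A_TRUE = 5.0           # ground-truth optimum, hidden from both agent and coach
a, c = 0.0, 0.5        # arbitrary initial agent behaviour and coach target
lr = 0.1

for _ in range(500):
    a += lr * 2 * (c - a)   # agent: gradient ascent on the coach's reward
    c += lr * (a - c)       # coach: trained only toward agreement with the agent

internal = -(a - c) ** 2        # saturates at its maximum (0)
external = -(a - A_TRUE) ** 2   # stays far from optimal
print(f"a={a:.3f}  c={c:.3f}  coach reward={internal:.5f}  task reward={external:.2f}")

Both updates converge and the coach's reward is maximized, yet the resulting equilibrium reflects only the arbitrary starting point rather than the task; this is the kind of degenerate outcome the open question asks whether a self-contained co-evolving system can provably avoid.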

References

Whether a fully self-contained system---where agents and coaches co-evolve without external supervision---can avoid degenerate equilibria remains an open question.

Scaling Multiagent Systems with Process Rewards (2601.23228 - Li et al., 30 Jan 2026), Section "Future Directions", subsection "Trainable Coach"