Potential self‑reinforcing bias from using the same model for orchestration and evaluation

Determine whether using the same language model for both meta‑orchestration and LLM‑as‑a‑judge evaluation in Mimosa introduces self‑reinforcing optimization tendencies, compared with cross‑model configurations.

Background

The current implementation uses the same model (claude‑opus‑4‑5‑20251101) both to generate workflow proposals and to judge their executions. Although these roles operate on different artifacts, a single model may favor its own proposals; testing cross‑model configurations could reveal or mitigate such effects.
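One way to probe this empirically is to run paired configurations that differ only in which model serves as judge, then compare scores across the two setups. Below is a minimal sketch of such a configuration; the RoleConfig structure and the placeholder judge model name are illustrative assumptions, not Mimosa's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleConfig:
    """Hypothetical pairing of models for the two roles (not Mimosa's API)."""
    orchestrator_model: str  # proposes workflow structures
    judge_model: str         # scores the resulting execution traces


# Same-model setup, as in the current implementation described above.
same_model = RoleConfig(
    orchestrator_model="claude-opus-4-5-20251101",
    judge_model="claude-opus-4-5-20251101",
)

# Cross-model setup: a judge from a different model family could expose
# self-reinforcing bias if same-model scores are systematically higher.
cross_model = RoleConfig(
    orchestrator_model="claude-opus-4-5-20251101",
    judge_model="independent-judge-model",  # placeholder identifier
)


def is_cross_model(cfg: RoleConfig) -> bool:
    """True when orchestration and judging use different models."""
    return cfg.orchestrator_model != cfg.judge_model
```

Running the same benchmark under both configurations and comparing judge scores for identical execution traces would separate genuine quality differences from same-model preference.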

References

"While these roles operate on distinct outputs — the meta-orchestrator proposes a workflow structure, while the judge evaluates the resulting agent execution trace — it remains an open question whether using the same model introduces self-reinforcing optimization tendencies."

Mimosa Framework: Toward Evolving Multi-Agent Systems for Scientific Research (2603.28986 - Legrand et al., 30 Mar 2026), Section 7 (Limitations and Future work), bullet 'Cross-model configuration for meta-orchestrator and judge'