Correlation between LLM‑as‑a‑judge scores and ground‑truth benchmark outcomes

Determine the correlation between Mimosa’s LLM‑as‑a‑judge aggregate score and ground‑truth benchmark metrics to assess the reliability and effectiveness of judge‑guided workflow optimization.

Background

Mimosa uses an LLM‑as‑a‑judge to guide iterative refinement of multi‑agent workflows, but this signal is only approximate. Establishing how well judge scores align with ground‑truth task success is necessary to quantify the sample efficiency and validity of the optimization process.

References

This judge provides approximate rather than exact quality signal; its directional accuracy is sufficient to drive convergence, and its correlation with benchmark ground-truth metrics is left to future work.

Mimosa Framework: Toward Evolving Multi-Agent Systems for Scientific Research (2603.28986 - Legrand et al., 30 Mar 2026), Section 3.1.2 (Workflow Discovery Problem)