Correlation between LLM‑as‑a‑judge scores and ground‑truth benchmark outcomes
Determine how strongly Mimosa’s LLM‑as‑a‑judge aggregate score correlates with ground‑truth benchmark metrics, in order to assess whether judge‑guided workflow optimization is reliable and effective.
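One way such a correlation could be measured is sketched below, assuming that each evaluated workflow yields one judge aggregate score and one ground-truth benchmark metric. The function names and the sample values are hypothetical placeholders for illustration, not part of Mimosa and not measured results. Pearson's r captures linear agreement; Spearman's rho captures rank agreement, which is arguably the more relevant quantity when the judge is only used to decide which workflow is better.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


if __name__ == "__main__":
    # Hypothetical placeholder values, one pair per evaluated workflow.
    judge_scores = [0.2, 0.4, 0.5, 0.7, 0.9]
    benchmark_metrics = [0.1, 0.3, 0.6, 0.5, 0.8]
    print("Pearson r:   ", pearson(judge_scores, benchmark_metrics))
    print("Spearman rho:", spearman(judge_scores, benchmark_metrics))
```

In practice the pairs would come from running discovered workflows on tasks that have reference outcomes, and a confidence interval (for example via bootstrap resampling) would be needed before drawing conclusions from a small sample.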
References
This judge provides approximate rather than exact quality signal; its directional accuracy is sufficient to drive convergence, and its correlation with benchmark ground-truth metrics is left to future work.
— Mimosa Framework: Toward Evolving Multi-Agent Systems for Scientific Research
(2603.28986 - Legrand et al., 30 Mar 2026) in Section 3.1.2 (Workflow Discovery Problem)