Does UMBRELA Work on Other LLMs? (2507.09483v1)
Published 13 Jul 2025 in cs.IR
Abstract: We reproduce the UMBRELA LLM Judge evaluation framework across a range of LLMs to assess its generalizability beyond the original study. Our investigation evaluates how LLM choice affects relevance assessment accuracy, focusing on leaderboard rank correlation and per-label agreement metrics. Results demonstrate that UMBRELA with DeepSeek V3 obtains performance very comparable to that of GPT-4o (used in the original work). With LLaMA-3.3-70B we obtain slightly lower performance, which degrades further with smaller LLMs.
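
To make the two metric families named in the abstract concrete, here is a minimal sketch of how leaderboard rank correlation and label agreement are commonly computed. It assumes Kendall's tau for rank correlation and Cohen's kappa for agreement; the abstract does not say which statistics the paper uses, so both choices, the function names, and all input values are hypothetical illustrations rather than the authors' method or data.

```python
from collections import Counter
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists over the same systems.

    Assumption: tied pairs count as neither concordant nor discordant.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least two systems")
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two label lists."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    # Agreement expected by chance, from each assessor's label distribution.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both assessors used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical data: effectiveness of four systems under human and LLM labels.
human_scores = [0.61, 0.55, 0.48, 0.40]
llm_scores = [0.66, 0.52, 0.54, 0.37]
print(kendall_tau(human_scores, llm_scores))

# Hypothetical graded relevance labels for six query-passage pairs.
human_labels = [0, 1, 2, 3, 0, 2]
llm_labels = [0, 1, 3, 3, 1, 2]
print(cohen_kappa(human_labels, llm_labels))
```

A high tau means the LLM-derived labels rank systems in nearly the same order as human labels do, even when individual labels disagree, which is why the two metrics are usually reported side by side.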