Validation of LLM-as-a-judge alignment with human judgment for taxonomy evaluation
Establish whether the LLM-based evaluation used to score the Sparse Autoencoder-derived reasoning taxonomy—specifically the consistency, completeness, and independence metrics—aligns with true human judgment, thereby validating the reliability of LLM-as-a-judge for this task.
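One way to approach this (not described in the paper; a minimal sketch under assumptions) is to collect human ratings for a sample of taxonomy categories on the same rubric the LLM judge uses, then measure agreement between the two sets of scores, e.g., Spearman rank correlation plus a chance-corrected statistic such as quadratically weighted Cohen's kappa. The function name, the 1-5 scale, and the data below are hypothetical placeholders.

```python
# Minimal sketch (assumption, not the paper's pipeline): quantify agreement
# between LLM-judge scores and human ratings for one metric (e.g. consistency)
# over a shared set of taxonomy categories rated on a 1-5 rubric.
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score


def judge_alignment(llm_scores, human_scores):
    """Return rank correlation and chance-corrected agreement for paired ratings."""
    rho, p_value = spearmanr(llm_scores, human_scores)
    # Quadratic weighting treats the 1-5 ratings as ordinal, penalizing
    # large disagreements more than off-by-one ones.
    kappa = cohen_kappa_score(llm_scores, human_scores, weights="quadratic")
    return {"spearman_rho": rho, "p_value": p_value, "weighted_kappa": kappa}


# Hypothetical paired ratings for ten taxonomy categories.
llm_ratings = [5, 4, 4, 3, 5, 2, 4, 3, 5, 4]
human_ratings = [5, 4, 3, 3, 5, 2, 5, 3, 4, 4]
print(judge_alignment(llm_ratings, human_ratings))
```

Repeating this per metric (consistency, completeness, independence) and across multiple human annotators would indicate whether the LLM judge tracks human judgment or systematically diverges.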
References
Note that the consistency, completeness, and independence scores are generated by prompting an LLM. Although LLM-as-a-judge is a common practice in current literature, the alignment between our evaluation pipeline and true human judgment remains to be validated.
— Base Models Know How to Reason, Thinking Models Learn When (arXiv:2510.07364, Venhoff et al., 8 Oct 2025), Section 2.2 (Taxonomy Evaluation)