
Validation of LLM-as-a-judge alignment with human judgment for taxonomy evaluation

Establish whether the LLM-based evaluation used to score the Sparse Autoencoder-derived reasoning taxonomy—specifically the consistency, completeness, and independence metrics—aligns with true human judgment, thereby validating the reliability of LLM-as-a-judge for this task.


Background

To assess unsupervised taxonomies of reasoning mechanisms, the paper uses an LLM-as-a-judge pipeline to compute consistency (F1), completeness (confidence), and independence (semantic orthogonality) scores. While cost-effective and scalable, the authors note that these LLM-generated evaluations should be checked against human judgments to ensure their validity.
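As a rough illustration only (not the paper's actual pipeline), the sketch below shows how per-category judge scores for the three criteria could be elicited from a chat model. The prompt wording, the model name, the JSON schema, and the use of the OpenAI client are all assumptions introduced here for the example.

```python
# Minimal sketch of an LLM-as-a-judge scoring call. The prompt, model, and score
# schema are illustrative assumptions, not the paper's actual evaluation setup.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are evaluating a taxonomy of reasoning mechanisms.
Category description: {category}
Example reasoning traces: {examples}

Rate each criterion from 0 to 1 and reply as JSON with the keys
"consistency", "completeness", and "independence"."""

def judge_scores(category: str, examples: list[str], model: str = "gpt-4o-mini") -> dict:
    """Ask an LLM judge for per-criterion scores on one taxonomy category."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(category=category, examples="\n".join(examples)),
        }],
        response_format={"type": "json_object"},  # request a JSON-formatted reply
    )
    return json.loads(response.choices[0].message.content)
```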

Validating this alignment would strengthen confidence in taxonomy comparisons across models and configurations and inform the broader use of LLM-as-a-judge in interpretability research.
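One hedged way to run such a validation is to collect human ratings on the same items the LLM judge scored and measure agreement between the two sets of scores. The sketch below is illustrative only; the choice of agreement statistics (Spearman, Pearson, mean absolute error) and all names are assumptions made here, not a protocol from the paper.

```python
# Hedged sketch: agreement between LLM-judge scores and human ratings on the same
# items for one metric (e.g., consistency). Statistics chosen here are illustrative.
import numpy as np
from scipy.stats import pearsonr, spearmanr

def judge_human_agreement(llm_scores: list[float], human_scores: list[float]) -> dict:
    """Correlation-based agreement between LLM-judge and human scores."""
    llm = np.asarray(llm_scores, dtype=float)
    human = np.asarray(human_scores, dtype=float)
    rho, rho_p = spearmanr(llm, human)         # rank agreement
    r, r_p = pearsonr(llm, human)              # linear agreement
    mae = float(np.mean(np.abs(llm - human)))  # absolute calibration gap
    return {"spearman": rho, "spearman_p": rho_p,
            "pearson": r, "pearson_p": r_p, "mae": mae}

# Example usage with toy data (not from the paper):
# agreement = judge_human_agreement([0.9, 0.4, 0.7], [0.8, 0.5, 0.6])
```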

References

Note that the consistency, completeness, and independence scores are generated by prompting an LLM. Although LLM-as-a-judge is a common practice in current literature, the alignment between our evaluation pipeline and true human judgment remains to be validated.

Base Models Know How to Reason, Thinking Models Learn When (Venhoff et al., arXiv:2510.07364, 8 Oct 2025), Section 2.2 (Taxonomy Evaluation)