
Beyond Pointwise Scores: Decomposed Criteria-Based Evaluation of LLM Responses (2509.16093v1)

Published 19 Sep 2025 in cs.CL and cs.AI

Abstract: Evaluating long-form answers in high-stakes domains such as law or medicine remains a fundamental challenge. Standard metrics like BLEU and ROUGE fail to capture semantic correctness, and current LLM-based evaluators often reduce nuanced aspects of answer quality into a single undifferentiated score. We introduce DeCE, a decomposed LLM evaluation framework that separates precision (factual accuracy and relevance) and recall (coverage of required concepts), using instance-specific criteria automatically extracted from gold answer requirements. DeCE is model-agnostic and domain-general, requiring no predefined taxonomies or handcrafted rubrics. We instantiate DeCE to evaluate different LLMs on a real-world legal QA task involving multi-jurisdictional reasoning and citation grounding. DeCE achieves substantially stronger correlation with expert judgments ($r=0.78$), compared to traditional metrics ($r=0.12$), pointwise LLM scoring ($r=0.35$), and modern multidimensional evaluators ($r=0.48$). It also reveals interpretable trade-offs: generalist models favor recall, while specialized models favor precision. Importantly, only 11.95% of LLM-generated criteria required expert revision, underscoring DeCE's scalability. DeCE offers an interpretable and actionable LLM evaluation framework in expert domains.

Summary

  • The paper presents the DeCE framework, which decomposes the evaluation of LLM outputs into precision and recall.
  • It shows that generalist models achieve higher recall, while specialized models demonstrate greater precision.
  • The framework's scores correlate with expert judgments at $r = 0.78$, substantially outperforming metrics like BLEU and ROUGE.

"Beyond Pointwise Scores: Decomposed Criteria-Based Evaluation of LLM Responses"

Introduction to DeCE Framework

"Beyond Pointwise Scores: Decomposed Criteria-Based Evaluation of LLM Responses" addresses fundamental challenges in evaluating LLM-generated content within high-stakes domains such as law and medicine. Traditional metrics like BLEU and ROUGE often fail to capture the semantic accuracy required for these tasks. The paper introduces the DeCE framework, which evaluates LLM responses using decomposed criteria for precision and recall, derived from instance-specific gold answer requirements. This approach is model-agnostic and does not require predefined taxonomies.

This framework achieves substantially stronger correlation with expert judgments ($r = 0.78$) than existing evaluation methods, which reach much lower correlations. The paper also demonstrates that while generalist models show higher recall, specialized models exhibit higher precision.

Figure 1: Precision Workflow

Framework Architecture and Implementation

Pipeline Overview

DeCE consists of two primary workflows: precision and recall. Each workflow utilizes model-generated answers and gold-standard answers to provide interpretable evaluations.

  1. Precision Workflow: This process decomposes the model-generated answer into factual elements, each of which is verified for factual accuracy and relevance against the gold answer. Elements not supported by the gold answer are penalized in the final score.
  2. Recall Workflow: This component extracts evaluation criteria (the required information) from the gold answer and checks how completely the model response covers them, explicitly measuring coverage of necessary concepts (a minimal sketch of both workflows follows below).

    Figure 2: Distribution of pointwise scores (0–4) for each model
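
The two workflows can be summarized in a short sketch. This is a minimal illustration, not the paper's implementation: the helper calls (extract_elements, is_supported, extract_criteria, is_covered) are hypothetical stand-ins for the LLM-judge prompts that DeCE uses for decomposition and verification.

```python
from dataclasses import dataclass


@dataclass
class DeCEResult:
    precision: float  # fraction of answer elements supported by the gold answer
    recall: float     # fraction of gold-derived criteria covered by the answer


def evaluate_dece(model_answer: str, gold_answer: str, judge) -> DeCEResult:
    """Decomposed evaluation of one response; `judge` is a hypothetical LLM wrapper."""
    # Precision workflow: decompose the model answer into atomic factual elements,
    # then check each element for accuracy and relevance against the gold answer.
    elements = judge.extract_elements(model_answer)
    supported = [e for e in elements if judge.is_supported(e, gold_answer)]
    precision = len(supported) / len(elements) if elements else 0.0

    # Recall workflow: derive instance-specific criteria from the gold answer,
    # then check whether the model answer covers each criterion.
    criteria = judge.extract_criteria(gold_answer)
    covered = [c for c in criteria if judge.is_covered(c, model_answer)]
    recall = len(covered) / len(criteria) if criteria else 0.0

    return DeCEResult(precision=precision, recall=recall)
```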

DeCE Score Implementation

The decomposition into precision and recall offers an interpretable signal, allowing evaluators to identify specific shortcomings in the model outputs. The precision score is calculated as the proportion of correctly grounded elements, while the recall score measures the completeness of essential concept coverage.
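
In symbols (notation ours, following the description above): let $E$ be the set of factual elements extracted from the model answer, $E_s \subseteq E$ the subset supported by the gold answer, $C$ the set of criteria extracted from the gold answer, and $C_c \subseteq C$ the subset covered by the model answer. Then

$$\text{Precision} = \frac{|E_s|}{|E|}, \qquad \text{Recall} = \frac{|C_c|}{|C|}.$$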

Experimental Evaluation and Results

Evaluation of Model Alignment

In a comprehensive comparative analysis, DeCE showed substantial improvement in alignment with expert evaluations over standard metrics such as ROUGE and BLEU (Figure 3).

Figure 3: DeCE scores (precision and recall) for each evaluated model
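
The reported alignment with experts is a correlation between automatic scores and expert ratings over the same responses. A minimal sketch of such a check is below, assuming a Pearson correlation over per-response scores; the placeholder numbers, the way precision and recall are combined into a single score, and the choice of Pearson rather than another coefficient are illustrative assumptions, not details taken from the paper.

```python
from scipy.stats import pearsonr

# Placeholder values only; not the paper's data.
dece_scores = [0.82, 0.64, 0.91, 0.47, 0.73]  # per-response DeCE scores (some combination of precision and recall)
expert_scores = [4, 3, 4, 2, 3]               # corresponding expert ratings on a 0-4 scale

r, p_value = pearsonr(dece_scores, expert_scores)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```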

The precision and recall decomposition revealed that larger, general-purpose models tend to favor recall with comprehensive answers, while smaller, domain-specific models focus on precision.

Jurisdictional and Query Type Performance

DeCE also supports diagnosing model outputs across jurisdictions and query types, highlighting which legal reasoning problems and which jurisdictions, such as Minnesota and New York, remain challenging (Figures 4 and 5).

Figure 4: Model performance across jurisdictions (precision vs. recall)

Figure 5: Model performance across query types (precision vs. recall)
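
Figures 4 and 5 amount to grouping per-response precision and recall by jurisdiction and by query type. A minimal pandas sketch of that aggregation is shown below; the table schema, model names, and values are illustrative assumptions, not the paper's data.

```python
import pandas as pd

# Illustrative schema and values only; not the paper's data.
results = pd.DataFrame({
    "model":        ["generalist-llm", "generalist-llm", "legal-llm", "legal-llm"],
    "jurisdiction": ["Minnesota", "New York", "Minnesota", "New York"],
    "query_type":   ["statutory", "case law", "statutory", "case law"],
    "precision":    [0.71, 0.65, 0.84, 0.80],
    "recall":       [0.78, 0.74, 0.62, 0.59],
})

# Mean precision/recall per model and jurisdiction (a Figure 4-style view).
by_jurisdiction = results.groupby(["model", "jurisdiction"])[["precision", "recall"]].mean()

# Mean precision/recall per model and query type (a Figure 5-style view).
by_query_type = results.groupby(["model", "query_type"])[["precision", "recall"]].mean()

print(by_jurisdiction)
print(by_query_type)
```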

Implications and Future Work

This research advances the evaluation of LLM outputs beyond holistic scores, particularly in complex domains that require nuanced understanding. Beyond outperforming existing methodologies, DeCE opens further research into domain-specific evaluation criteria and into closer alignment with human expert judgments.

Future directions could involve integrating human-in-the-loop methods for more challenging jurisdictions and query types, as well as adapting DeCE to other high-stakes domains like healthcare and finance.

Conclusion

The DeCE framework provides a robust, scalable, and interpretable approach to evaluating LLM outputs in expert domains. By distinguishing between precision and recall, it highlights the strengths of general and specialized models and provides actionable insights for improving model deployment strategies in high-stakes environments. This paper establishes a foundation for more nuanced AI evaluation frameworks, crucial for the advancement of trustworthy AI systems in expert settings.
