A Survey of Useful LLM Evaluation
Abstract: LLMs have attracted attention across various research domains due to their exceptional performance on a wide range of complex tasks. Refined methods for evaluating the capabilities of LLMs are therefore needed to determine which tasks and responsibilities they should undertake. Our study mainly discusses how LLMs, as useful tools, should be effectively assessed. We propose a two-stage framework, from "core ability" to "agent", that explains how LLMs can be applied based on their specific capabilities, along with the evaluation methods for each stage. Core ability refers to the capabilities LLMs need in order to generate high-quality natural language text. Once LLMs are confirmed to possess these core abilities, they can tackle real-world, complex tasks as agents. In the "core ability" stage, we discuss the reasoning ability, societal impact, and domain knowledge of LLMs. In the "agent" stage, we demonstrate embodied action, planning, and tool learning in LLM agent applications. Finally, we examine the challenges currently confronting LLM evaluation methods, as well as directions for future development.
Knowledge gaps, limitations, and open questions
Below is a concrete list of unresolved issues the paper leaves open for future research to address.
- Lack of a formal, operational definition of “useful” and explicit thresholds for transitioning an LLM from “core ability” to “agent” status (e.g., minimal performance, safety, and reliability criteria).
- No standardized, end-to-end evaluation pipeline instantiating the proposed two-stage framework (data collection, metrics, protocols, reporting) that enables reproducible, cross-domain comparisons.
- Absence of unified, domain-agnostic metrics for reasoning quality beyond task accuracy (e.g., calibration, confidence, robustness, faithfulness, and efficiency), hindering comparability across reasoning tasks.
- Insufficient methods to measure reasoning faithfulness and validity of chain-of-thought explanations at scale (i.e., detecting when rationales are post-hoc or misleading).
- Limited treatment of benchmark contamination and data leakage (e.g., training-set overlap with evaluation tasks) in core ability evaluations, especially for proprietary models.
- Overreliance on multiple-choice and short-answer formats for reasoning; need for richer, process-sensitive diagnostics that stress compositionality, counterfactuals, and causal inference.
- Underdeveloped evaluation of multi-hop reasoning that separates retrieval quality from reasoning quality and quantifies error propagation across steps.
- Structured data reasoning lacks comprehensive metrics beyond exact match (e.g., semantic equivalence of SQL, execution accuracy under schema shift, robustness to partial structures).
- Sparse assessment of multilingual and non-English reasoning and domain knowledge, including culturally diverse commonsense and legal norms.
- Limited analysis of the trade-offs between instruction tuning, alignment, and reasoning capabilities (e.g., which tuning regimes help/hurt specific reasoning types).
- Safety evaluations are scattered across benchmarks with heterogeneous categories, labels, and protocols; need for harmonized taxonomies and standardized scoring schemes.
- Cross-cultural and jurisdictional validity of safety and ethics evaluations is not established (e.g., how “harm” and “acceptability” vary across contexts and legal frameworks).
- Privacy-risk measurement remains ad hoc; there is no standardized methodology to quantify memorization, PII leakage, and re-identification risks in closed-source LLMs.
- Adversarial robustness is evaluated piecemeal; unified stress tests and coverage metrics for prompt injection, tool-integrated attacks, and multi-agent adversaries are lacking.
- Unclear guidance on balancing safety alignment against utility (e.g., how safety interventions affect task performance, coverage, and agent autonomy).
- Hallucination evaluations focus on detection but less on actionable mitigation protocols and measurable reductions across diverse tasks and modalities.
- Bias benchmarks span domains but lack cross-domain comparability and standardized fairness metrics (e.g., consistent sensitive attributes, parity definitions, and utility-fairness trade-off reporting).
- Domain knowledge evaluations (Finance, Law, Psychology, Medicine, Education) rely heavily on static datasets; need for dynamic, real-world, longitudinal studies where knowledge evolves (e.g., markets, laws, clinical practices).
- Finance evaluations lack rigorous, expert-validated protocols for high-stakes tasks (e.g., risk assessment, portfolio decisions) and do not quantify downstream impacts of errors or disinformation.
- Legal evaluations target simplified statutory tasks; missing assessments on complex, multi-jurisdictional reasoning, precedent integration, and systematic hallucination auditing in legal advice generation.
- Psychology evaluations need stronger construct validity checks and harm assessments (e.g., cross-language reliability, bias in emotion/moral inference, effects of generated content on participants).
- Medicine evaluations remain far from clinical-grade: no prospective trials, EHR-integrated testing, safety case frameworks, or regulatory compliance pathways for agentic decision support.
- Education evaluations rarely measure real learning outcomes, transfer, and long-term effects; need classroom-scale RCTs, privacy-preserving data collection, and equity analyses.
- Agent planning evaluations lack standardized metrics for plan optimality, re-planning agility, sample efficiency, resource use (time/cost), and failure recovery in dynamic environments.
- Web-grounded agent evaluations are mostly in synthetic or sandboxed environments; need robust, real-website benchmarks with policy compliance, citation integrity, and browsing trace audits.
- Code generation and tool-use evaluations often rely on unit tests; missing measures for security vulnerabilities, maintainability, runtime performance, and multi-step tool orchestration reliability.
- Database-query agents need evaluations of error handling, schema generalization, semantic equivalence, and data-quality robustness beyond exact-match SQL.
- Robotic navigation/manipulation lacks sim-to-real transfer evaluations with safety metrics, standardized scene/task suites, and measurement of generalization across embodiments and environments.
- Tool creation by agents is under-evaluated: no clear metrics for novelty, correctness, compositionality, maintenance, and security implications of generated tools.
- Benchmarks for tool-use and agentic systems are fragmented; no unified, cross-scenario benchmark that measures integration, sequencing, and failure modes across APIs, web, code, DB, and robotics.
- Little attention to cost, energy, and environmental impact of evaluation pipelines and agent deployments; need metrics and reporting standards for efficiency and sustainability.
- Insufficient meta-evaluation: unclear to what extent benchmark scores predict real-world utility, user trust, and safety outcomes; need external validity studies linking metrics to downstream impacts.
- Reproducibility concerns persist for closed models (versioning, undocumented training data, changing behaviors); need evaluation protocols robust to model drift and API variability.
- Lack of governance and reporting standards for evaluations (e.g., datasets’ provenance, annotator demographics, licensing, risk disclosures), especially for high-stakes domains.
- Open question of how to aggregate multi-metric performance into actionable deployment decisions (e.g., composite scores, thresholds, and risk-adjusted utility functions).
- Limited exploration of uncertainty quantification and confidence calibration in both core abilities and agent actions; need standardized measures and user-facing uncertainty communication.
- No guidance on human-in-the-loop evaluation and oversight frameworks (roles, escalation protocols, audit trails) for agentic systems operating in real-world settings.
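Several of the gaps above (unified reasoning metrics, uncertainty quantification) point to confidence calibration as an under-measured quantity. A minimal sketch of expected calibration error (ECE), the standard binned calibration metric, assuming per-example model confidences and correctness labels are available (the function name and bin count are illustrative):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence and compare each bin's mean
    confidence to its empirical accuracy; return the weighted gap."""
    total = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # half-open bins (lo, hi]; the first bin also includes its left edge
        in_bin = [i for i, c in enumerate(confidences)
                  if (lo < c <= hi) or (b == 0 and c == lo)]
        if not in_bin:
            continue
        avg_conf = sum(confidences[i] for i in in_bin) / len(in_bin)
        accuracy = sum(correct[i] for i in in_bin) / len(in_bin)
        ece += (len(in_bin) / total) * abs(avg_conf - accuracy)
    return ece
```

A perfectly calibrated model scores 0; reporting ECE alongside task accuracy would make reasoning benchmarks comparable on the calibration axis the list calls for.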
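The database-query bullets above note that exact string match undercounts semantically equivalent SQL. A common alternative is execution accuracy: run gold and predicted queries against the same database and compare result sets. A minimal sketch using Python's built-in sqlite3 (the function name and schema are illustrative, not from the paper):

```python
import sqlite3

def execution_match(gold_sql, pred_sql, setup_sql):
    """Compare two SQL queries by executing both against the same
    in-memory database, so queries that differ textually (aliases,
    operand order) but return the same rows still count as correct."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(setup_sql)
    try:
        gold = conn.execute(gold_sql).fetchall()
        pred = conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to execute cannot match
    finally:
        conn.close()
    # compare as multisets so row order does not matter
    return sorted(gold) == sorted(pred)
```

For example, `SELECT a FROM t WHERE a > 1` and `SELECT t.a FROM t WHERE 1 < t.a` match under execution but not under exact string match; evaluating under schema shift (as the list suggests) amounts to varying `setup_sql`.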
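On the open question of aggregating multi-metric performance into deployment decisions, one hedged sketch of a risk-adjusted composite: weighted averaging over utility metrics, with hard safety floors that veto deployment outright (all names, weights, and thresholds here are hypothetical, not proposed by the paper):

```python
def deployment_score(metrics, weights, hard_thresholds):
    """Aggregate multi-metric results into one deployment score.
    Any metric below its hard threshold vetoes deployment (score 0);
    otherwise return the weighted average of the metrics."""
    for name, floor in hard_thresholds.items():
        if metrics[name] < floor:
            return 0.0  # fail closed: safety floors are not tradable
    total_weight = sum(weights.values())
    return sum(weights[k] * metrics[k] for k in weights) / total_weight
```

The design choice worth debating is exactly the one the list raises: a pure weighted average lets high accuracy compensate for low safety, whereas hard thresholds encode non-negotiable floors; any real deployment rule must make that trade-off explicit.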