- The paper introduces SEA, a stochastic optimization process that efficiently uncovers LLM factual errors, discovering up to 40.7 times more errors than baseline methods under the same query budget.
- It leverages semantic similarity and hierarchical retrieval to guide error detection, dramatically reducing computational costs and search space.
- The paper employs a Relation Directed Acyclic Graph to model correlations among errors, enabling systematic identification of knowledge deficiencies across LLMs.
Evaluating the factual recall of LLMs against comprehensive knowledge bases (KBs) presents significant computational challenges, particularly for closed-weight models where internal states are inaccessible and evaluations rely solely on querying the model's external API. The sheer scale of modern KBs makes exhaustive probing infeasible due to prohibitive costs and time constraints. The paper "Discovering Knowledge Deficiencies of LLMs on Massive Knowledge Base" (2503.23361) introduces Stochastic Error Ascent (SEA), a framework designed to efficiently discover knowledge deficiencies (errors) in LLMs under strict query budget limitations.
Stochastic Error Ascent (SEA) Framework
SEA reframes the problem of identifying LLM knowledge gaps from a naive exhaustive search to a stochastic optimization process. The core principle is to iteratively guide the search towards regions of the knowledge space where the LLM is likely to exhibit errors. Instead of randomly sampling knowledge triples or uniformly exploring the KB, SEA leverages the semantic information embedded within the knowledge itself and the patterns observed in previously identified errors.
The iterative process works as follows:
1. Initialization: Start with a small seed set of known LLM errors or randomly sampled knowledge triples.
2. Error Probing: Query the LLM with questions derived from the current set of candidate knowledge triples. Identify factual errors made by the LLM.
3. Candidate Retrieval: Utilize the semantic similarity between the entities and relations involved in the newly discovered errors and the vast pool of unevaluated knowledge triples in the KB. Retrieve new candidate triples that are semantically close to the observed failures. The intuition is that errors related to similar concepts or relations are often correlated.
4. Iteration: Repeat steps 2 and 3, refining the set of candidate triples to focus on areas with a higher probability of containing errors.
This iterative retrieval, guided by semantic similarity to past errors, allows SEA to "ascend" towards regions of high error density within the knowledge space, maximizing the number of errors discovered per query.
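A minimal sketch of this loop in Python is shown below. The `probe` and `retrieve` callables stand in for the LLM-querying and semantic-retrieval components; the names, batching scheme, and deduplication logic are illustrative assumptions, not the paper's implementation.

```python
import random
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

def stochastic_error_ascent(
    seed: List[Triple],
    budget: int,
    probe: Callable[[Triple], bool],                  # True if the LLM answers correctly
    retrieve: Callable[[Triple, int], List[Triple]],  # k unevaluated triples nearest an error
    batch_size: int = 32,
    k: int = 8,
) -> List[Triple]:
    """Steer probing toward error-dense regions of the KB (steps 2-4 above)."""
    frontier, errors, used = list(seed), [], 0
    seen = set(seed)
    while used < budget and frontier:
        batch, frontier = frontier[:batch_size], frontier[batch_size:]
        used += len(batch)
        new_errors = [t for t in batch if not probe(t)]      # step 2: error probing
        errors.extend(new_errors)
        for err in new_errors:                               # step 3: candidate retrieval
            for cand in retrieve(err, k):
                if cand not in seen:
                    seen.add(cand)
                    frontier.append(cand)
        random.shuffle(frontier)  # stochastic choice among retrieved regions
    return errors
```

Because retrieval is seeded by the most recent errors, each iteration biases the frontier toward the neighborhoods where failures were just observed, which is the "ascent" behavior described above.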
Enhancing Search Efficiency and Coverage
To further improve the efficiency and breadth of the error discovery process, SEA incorporates two key enhancements: hierarchical retrieval and a Relation Directed Acyclic Graph (R-DAG).
Hierarchical Retrieval
Operating directly on massive KBs containing millions or billions of triples is inefficient for retrieval. SEA employs a hierarchical retrieval strategy to manage this scale:
- Document-Level Retrieval: The KB is often structured as, or can be mapped to, a document collection (e.g., Wikipedia pages corresponding to entities). Retrieval first identifies the documents most likely to contain information related to the observed errors, based on entity similarity.
- Paragraph-Level Retrieval: Within the retrieved documents, finer-grained retrieval at the paragraph or sentence level is performed. This step uses semantic similarity based on both entities and relations from the previously identified errors to pinpoint specific knowledge triples that are most likely to elicit further errors.
This two-stage process significantly prunes the search space, allowing SEA to focus computational resources on the most promising knowledge candidates while still maintaining broad coverage across the KB.
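The following is a minimal sketch of the two-stage retrieval, assuming documents and paragraphs are plain strings and using a Sentence-BERT-style encoder. The model name, data layout, and cosine scoring are illustrative choices, not the paper's exact setup.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative encoder choice

def top_k(query: np.ndarray, matrix: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k rows of `matrix` most cosine-similar to `query`."""
    sims = (matrix @ query) / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query) + 1e-9)
    return np.argsort(-sims)[:k]

def hierarchical_retrieve(error_text: str,
                          docs: list[str],
                          paragraphs_by_doc: list[list[str]],
                          k_docs: int = 5,
                          k_paras: int = 3) -> list[str]:
    """Stage 1: select candidate documents; stage 2: select paragraphs within them."""
    q = model.encode(error_text)
    doc_vecs = model.encode(docs)        # in practice, precompute and index (e.g., with FAISS)
    hits: list[str] = []
    for d in top_k(q, doc_vecs, k_docs):     # document-level retrieval
        paras = paragraphs_by_doc[d]
        para_vecs = model.encode(paras)      # paragraph-level retrieval
        hits.extend(paras[p] for p in top_k(q, para_vecs, k_paras))
    return hits
```

The key design point is that the expensive fine-grained comparison only runs over paragraphs inside the few documents that survive stage 1, rather than over the whole KB.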
Relation Directed Acyclic Graph (R-DAG)
LLM knowledge errors are often not isolated incidents but part of systematic failure modes. For instance, an LLM might consistently fail on facts involving specific relations (e.g., birth dates) or particular types of entities (e.g., lesser-known historical figures). To model and exploit these correlations, SEA constructs an R-DAG.
The R-DAG captures the dependencies and potential error propagation paths between different relations in the KB. Nodes in the graph represent relations, and directed edges might indicate semantic relatedness or hierarchical structures (e.g., place_of_birth is related to country_of_citizenship). When an error is detected for a triple involving a specific relation, the R-DAG helps prioritize exploring related relations, anticipating that errors might cluster. This graph-based approach allows SEA to identify systematic deficiencies more effectively than treating each knowledge triple independently.
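The snippet below illustrates the prioritization idea with networkx, using a toy edge set; the actual graph construction and edge semantics in the paper may differ from this sketch.

```python
import networkx as nx

# Toy R-DAG: nodes are relations, edges point to semantically related relations.
rdag = nx.DiGraph()
rdag.add_edges_from([
    ("place_of_birth", "country_of_citizenship"),
    ("place_of_birth", "date_of_birth"),
    ("country_of_citizenship", "official_language"),
])
assert nx.is_directed_acyclic_graph(rdag)

def relations_to_prioritize(failed_relation: str, max_hops: int = 2) -> set[str]:
    """Relations reachable within `max_hops` of a failed relation; probe these next."""
    if failed_relation not in rdag:
        return set()
    lengths = nx.single_source_shortest_path_length(rdag, failed_relation, cutoff=max_hops)
    return set(lengths) - {failed_relation}

print(relations_to_prioritize("place_of_birth"))
# {'country_of_citizenship', 'date_of_birth', 'official_language'}
```

After an error on a `place_of_birth` triple, the search would thus queue triples involving the reachable relations, exploiting the tendency of errors to cluster along related relations.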
Implementation and Evaluation
Implementing SEA involves several practical considerations:
- Embedding Models: Choosing appropriate embedding models (e.g., Sentence-BERT variants) is crucial for calculating semantic similarity between entities, relations, and text passages effectively.
- Retrieval System: Efficient vector database and retrieval algorithms (e.g., FAISS, ScaNN) are needed to perform nearest neighbor searches over large embedding spaces quickly.
- Query Generation: A robust mechanism is required to convert knowledge triples (subject, relation, object) into natural language questions suitable for probing the LLM. This often involves using templates tailored to different relation types.
- Error Detection: An automated method is needed to compare the LLM's generated answer against the ground-truth object in the knowledge triple to classify a response as correct or incorrect. This might involve exact matching, fuzzy matching, or even leveraging another powerful LLM as a judge; a combined sketch of query generation and error checking follows this list.
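Here is a hedged sketch of the last two items: template-based question generation plus a simple substring-or-fuzzy answer check. The template table, normalization rule, and threshold are illustrative assumptions, not the paper's exact setup.

```python
from difflib import SequenceMatcher

# Illustrative hand-written templates keyed by relation type.
TEMPLATES = {
    "place_of_birth": "Where was {subject} born?",
    "date_of_birth": "On what date was {subject} born?",
    "country_of_citizenship": "Of which country is {subject} a citizen?",
}

def triple_to_question(subject: str, relation: str) -> str:
    """Render a (subject, relation, object) triple as a natural language probe."""
    return TEMPLATES[relation].format(subject=subject)

def is_error(llm_answer: str, gold_object: str, threshold: float = 0.8) -> bool:
    """Flag an error when the answer neither contains nor fuzzily matches the gold object."""
    a, g = llm_answer.strip().lower(), gold_object.strip().lower()
    if g in a:
        return False
    return SequenceMatcher(None, a, g).ratio() < threshold

q = triple_to_question("Marie Curie", "place_of_birth")
# Send `q` to the model under test, then call: is_error(model_answer, "Warsaw")
```

For relations with ambiguous surface forms (dates, numbers, aliases), a stronger LLM-as-judge comparison would replace `is_error`, at additional query cost.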
The empirical evaluation presented in the paper demonstrates the effectiveness of SEA. When compared against baselines like Automated Capability Discovery (ACD) and AutoBencher on large-scale KBs, SEA achieved significant improvements:
- Error Discovery Rate: SEA uncovered 40.7 times more knowledge errors than ACD and 26.7% more errors than AutoBencher under the same query budget.
- Cost-Efficiency: The cost-per-discovered-error was drastically reduced, being 599 times lower than ACD and 9 times lower than AutoBencher.
Human evaluations confirmed that the questions generated by SEA to probe the LLM were of high quality and accurately reflected the underlying knowledge triples. Ablation studies validated the individual contributions of hierarchical retrieval and the R-DAG, showing that removing either component led to a decrease in performance. Convergence analysis indicated that the stochastic optimization process effectively directs the search towards high-error regions over iterations.
Analysis and Implications
The analysis of errors discovered by SEA revealed important patterns in LLM knowledge deficiencies. Correlated failure modes were observed across different LLM families (e.g., GPT, Llama, Claude), suggesting shared weaknesses potentially stemming from common pre-training data or architectural limitations. Recurring deficits were identified, often related to less popular entities, specific numerical attributes, or complex relational reasoning.
These findings underscore the limitations of current LLMs in comprehensively storing and reliably recalling factual knowledge. The systematic errors uncovered by SEA highlight the need for more targeted data curation and augmentation strategies during pre-training and fine-tuning. The framework itself provides a valuable tool for developers to efficiently identify specific knowledge gaps in their models and prioritize areas for improvement, potentially through retrieval-augmented generation (RAG) or targeted fine-tuning on the identified weak spots.
Conclusion
Stochastic Error Ascent (SEA) offers a scalable and cost-effective approach for discovering knowledge deficiencies in LLMs operating on massive knowledge bases, particularly under the constraints of closed-weight APIs and limited query budgets. By formulating error discovery as a guided stochastic search leveraging semantic similarity, hierarchical retrieval, and relation-dependency modeling via R-DAGs, SEA significantly outperforms existing methods in both the volume of errors found and cost-efficiency. The insights derived from the errors identified by SEA provide valuable guidance for future LLM development, emphasizing the need for better knowledge grounding and targeted interventions to enhance factual reliability.