TaxoBench: Taxonomy-Guided Benchmark
- TaxoBench is a comprehensive benchmark that assesses expert-level survey synthesis by measuring deep retrieval and hierarchical taxonomy organization.
- It defines two evaluation modes, Deep Research and Bottom-Up, with metrics such as Recall@K, ARI, and Tree-Edit Distance for precise diagnostics.
- The framework drives improvements in automated survey generation and text-to-SQL translation, promoting targeted retrieval and hybrid clustering strategies.
TaxoBench refers to several independently developed diagnostic benchmarks and dataset resources for rigorous evaluation and taxonomy-guided assessment of machine learning models in automated survey generation and real-world text-to-SQL translation. These resources address a central limitation of prior evaluations by focusing on the fundamental cognitive steps that define expert-level task performance: accurate retrieval and organization of domain knowledge. The following entry focuses on TaxoBench as instantiated in automated survey synthesis for computer science and taxonomy-guided text-to-SQL benchmarks, detailing their construction, metrics, core findings, and practical usage (Zhang et al., 18 Jan 2026, Wang et al., 17 Nov 2025, Lahiri et al., 20 Oct 2025).
1. Purpose and Scope
TaxoBench is designed as a gold-standard benchmark to evaluate whether autonomous “deep research” agents and LLMs can replicate the two critical expert processes required for survey writing and knowledge-structured generation: (1) retrieval of the defining literature and (2) organization of this literature into coherent, hierarchical taxonomies.
In the domain of literature survey generation, TaxoBench addresses the synthesis gap overlooked by benchmarks emphasizing surface-level attributes such as linguistic fluency or citation accuracy. Instead, it directly evaluates the underlying capabilities that distinguish expert-authored surveys—retrieving seminal works and constructing multi-level conceptual hierarchies reflecting expert knowledge interrelations. For text-to-SQL applications, TaxoBench denotes a taxonomy-driven protocol ensuring comprehensive coverage of user intents, SQL pattern diversity, and database schema interactions. These variants provide both diagnostic depth and extensive resource coverage for cross-domain LLMs and research agents (Zhang et al., 18 Jan 2026, Wang et al., 17 Nov 2025).
2. Corpus Construction and Data Format
2.1 Survey Generation TaxoBench
The primary TaxoBench corpus for survey generation is derived from 72 highly-cited computer science surveys encompassing eight subdomains, including multimodal learning, reinforcement learning, and AI alignment. Manual extraction by Ph.D.-level annotators yields:
- 72 expert-authored taxonomy trees
- 3,815 unique papers, with precise mapping of each citation to taxonomy leaf nodes
- Average taxonomy: 53 papers, 3.2 hierarchy levels, 12.4 leaf-level categories; surveys average 354.5 citations
The taxonomies are represented in a simple directory-tree JSON format, capturing explicit concept–subconcept–paper relationships as curated by field experts after extensive synthesis (Zhang et al., 18 Jan 2026).
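To make the directory-tree JSON format concrete, the following is a minimal sketch of what one taxonomy record could look like; the titles, paper identifiers, and field names (`concept`, `children`, `papers`) are illustrative assumptions, not the released corpus schema:

```python
import json

# Hypothetical directory-tree taxonomy; field names and IDs are illustrative,
# not the official TaxoBench release schema.
taxonomy = {
    "survey": "Multimodal Learning: A Survey",   # illustrative title
    "children": [
        {
            "concept": "Representation Learning",
            "children": [
                # leaf node: citations mapped to this category
                {"concept": "Contrastive Methods",
                 "papers": ["paper_0042", "paper_0137"]},
            ],
        },
        {"concept": "Fusion Architectures", "papers": ["paper_0201"]},
    ],
}

def count_leaf_papers(node):
    """Recursively count papers mapped to leaf-level categories."""
    total = len(node.get("papers", []))
    return total + sum(count_leaf_papers(c) for c in node.get("children", []))

print(count_leaf_papers(taxonomy))  # → 3
```

A recursive walk like `count_leaf_papers` is enough to recover per-taxonomy statistics such as the average paper count and leaf-category count reported above.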
2.2 Taxonomy-Guided Text-to-SQL (SQL-Synth/TaxoBench)
For the text-to-SQL instantiation, TaxoBench leverages a systematic four-dimensional taxonomy (core intent, statement type, syntax structure, key action), used to synthesize SQL-Synth—an extensive dataset:
- 114,029 natural language question (NLQ)–SQL pairs over 1,250 training databases; 8,601 pairs over 500 test databases
- Comprehensive schema coverage with explicit mapping to taxonomy dimensions, ensuring all relevant database operation patterns are represented and balanced by computational complexity bands (simple, medium, hard)
- Dataset is distributed with schema files, prompts, and evaluation code in a standardized format (Wang et al., 17 Nov 2025)
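A single NLQ-SQL pair with its four-dimensional taxonomy coding might be stored as follows; every key name and value here is an assumption for illustration, since the exact release schema is not reproduced in this entry:

```python
from collections import Counter

# Hypothetical record layout for one SQL-Synth NLQ-SQL pair; field names
# are illustrative assumptions, not the official release schema.
record = {
    "nlq": "How many orders were placed in 2024?",
    "sql": "SELECT COUNT(*) FROM orders WHERE strftime('%Y', order_date) = '2024'",
    "database": "db_0831",          # assumed database identifier
    "taxonomy": {                   # the four-dimensional taxonomy coding
        "core_intent": "aggregation",
        "statement_type": "SELECT",
        "syntax_structure": "single-table",
        "key_action": "COUNT",
    },
    "complexity": "simple",         # one of: simple, medium, hard
}

# Coverage analysis then reduces to counting records per taxonomy cell:
records = [record]
coverage = Counter((r["taxonomy"]["core_intent"], r["complexity"]) for r in records)
print(coverage[("aggregation", "simple")])  # → 1
```

Tagging every pair with its taxonomy cell is what makes the complexity-band balancing described above checkable by simple counting.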
2.3 CS-TaxoBench for Scholarly Taxonomy Generation
CS-TaxoBench from the TaxoAlign project consists of:
- 460 gold-standard taxonomies extracted from ACM Computing Surveys and major conference surveys (IJCAI, ACL, NAACL, EMNLP, EACL), ~131 reference papers per taxonomy
- Node statistics: mean 14.8 nodes per taxonomy; mean branching factor 3.1; depth up to 4–5 levels
- Data is provided in JSON schema comprising topic, paper_ids, and explicit parent-child relationships for nodes (Lahiri et al., 20 Oct 2025)
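A minimal sketch of the node schema described above (topic, paper_ids, parent-child links), with assumed key names for illustration:

```python
# Sketch of a CS-TaxoBench-style node list; key names are assumptions
# chosen to match the schema description (topic, paper_ids, parent links).
nodes = [
    {"id": 0, "topic": "Dialogue Systems", "paper_ids": [],       "parent": None},
    {"id": 1, "topic": "Task-Oriented",    "paper_ids": [11, 12], "parent": 0},
    {"id": 2, "topic": "Open-Domain",      "paper_ids": [13],     "parent": 0},
]

# The parent->child edge set is the object that Edge Precision/Recall/F1
# (Section 3.2) operates on:
edges = {(n["parent"], n["id"]) for n in nodes if n["parent"] is not None}
print(sorted(edges))  # → [(0, 1), (0, 2)]
```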
3. Evaluation Protocols and Performance Metrics
Across all TaxoBench instantiations, evaluation targets both leaf-level assignments (paper and example coverage) and global structural alignment.
3.1 Survey Generation Evaluation
Two main evaluation modes are defined:
- Deep Research Mode: Given only a survey topic, models must retrieve relevant literature end-to-end and organize it into a hierarchy, producing a complete predicted taxonomy.
- Bottom-Up Mode: The gold-standard list of expert-selected papers (the leaves of the expert taxonomy) is provided; models must only organize these into a taxonomy, isolating structuring ability.
Metrics include:
- Recall@K in Deep Research: the fraction of expert-selected core papers present among the top-K retrieved papers
- Adjusted Rand Index (ARI) in Bottom-Up: quantifies agreement between model and expert clusters
- Homogeneity, Completeness, V-Measure: assess purity and informativeness of clustering
- Tree-Edit Distance (TED) and Soft-F1 (semantic similarity of node labels): quantify hierarchy structure overlap
- LLM-as-judge protocol for qualitative structural evaluation
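The retrieval side of these metrics is straightforward to compute; the following is a minimal, standard implementation of Recall@K (the benchmark's exact variant may differ in tie-breaking or deduplication details):

```python
def recall_at_k(retrieved, gold, k):
    """Fraction of gold papers found among the top-k retrieved items.
    Standard definition; the benchmark's exact variant may differ."""
    gold = set(gold)
    hits = len(set(retrieved[:k]) & gold)
    return hits / len(gold)

gold_papers = ["p1", "p2", "p3", "p4"]
retrieved = ["p9", "p2", "p1", "p7", "p3"]  # ranked retrieval output
print(recall_at_k(retrieved, gold_papers, k=5))  # → 0.75
```

The clustering metrics (ARI, homogeneity, completeness, V-measure) are available off the shelf in scikit-learn's `metrics` module, applied to leaf-cluster label assignments.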
3.2 CS-TaxoBench Evaluation Metrics
- Average Degree Score: global branching similarity between the predicted and gold trees
- Edge Precision/Recall/F1: overlap of parent→child edges
- Tree Edit Distance (TED): minimal edit operation path between predicted and expert trees
- Level-order Traversal Scores: BLEU-2, ROUGE-L, BERTScore on node label sequences
- Node Soft Recall (NSR): embedding-based node similarity, using SBERT cosine
- Node Entity Recall (NER): lexical overlap of noun phrase spans
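Two of these metrics are simple enough to sketch directly: edge overlap is plain set arithmetic over parent→child pairs, and the level-order traversal that BLEU-2/ROUGE-L/BERTScore are computed over is a breadth-first walk. The tree schema below (`topic`/`children` keys) is an assumption for illustration:

```python
from collections import deque

def edge_prf(pred_edges, gold_edges):
    """Edge Precision/Recall/F1 as standard set overlap of parent->child pairs."""
    pred, gold = set(pred_edges), set(gold_edges)
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def level_order_labels(tree):
    """BFS over a {'topic': str, 'children': [...]} tree, yielding the node-label
    sequence that the traversal-based text metrics are computed on."""
    out, queue = [], deque([tree])
    while queue:
        node = queue.popleft()
        out.append(node["topic"])
        queue.extend(node.get("children", []))
    return out

gold = {"topic": "NLP", "children": [{"topic": "Parsing", "children": []},
                                     {"topic": "Generation", "children": []}]}
pred_edges = [("NLP", "Parsing"), ("NLP", "Summarization")]
gold_edges = [("NLP", "Parsing"), ("NLP", "Generation")]
print(edge_prf(pred_edges, gold_edges))   # → (0.5, 0.5, 0.5)
print(level_order_labels(gold))           # → ['NLP', 'Parsing', 'Generation']
```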
3.3 Text-to-SQL TaxoBench Metrics
- Execution Accuracy (EX): agreement with the gold query, measured on the returned result set for SELECT queries or on the post-execution database state for DML/DDL queries
- Coverage metrics: per-dimension recovery rate across taxonomy types
- Diversity and complexity: type-token ratio (TTR), count of semantic clusters, and uniform coverage across pre-defined complexity bands
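For the SELECT case, execution accuracy reduces to comparing result multisets on the same database. A minimal sketch using an in-memory SQLite database (the benchmark's EX metric additionally compares post-execution database state for DML/DDL, which is omitted here; table and column names are invented for illustration):

```python
import sqlite3

def execution_accuracy(conn, pred_sql, gold_sql):
    """Order-insensitive comparison of two SELECT queries' result sets.
    Minimal sketch: DML/DDL state comparison is omitted."""
    pred = sorted(conn.execute(pred_sql).fetchall())
    gold = sorted(conn.execute(gold_sql).fetchall())
    return pred == gold

# Illustrative toy database, not part of the benchmark.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])

# Two syntactically different queries that are execution-equivalent here:
print(execution_accuracy(conn,
                         "SELECT COUNT(*) FROM orders WHERE amount > 10",
                         "SELECT SUM(amount > 10) FROM orders"))  # → True
```

Execution-based matching is what lets the benchmark credit semantically correct SQL even when its surface form differs from the gold query.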
4. Main Findings and Benchmark Results
Evaluation results reveal pronounced “synthesis gaps” for existing systems:
- In survey Deep Research mode, research agents recall only a fraction of the expert-selected core papers, with even the best agent falling well short of complete recall. This limits downstream taxonomy quality, with persistent failures in reproducing critical branches (TED and Soft-F1 remain poor) (Zhang et al., 18 Jan 2026).
- In Bottom-Up organization, the best ARI across twelve leading LLMs is $0.31$; major structural errors include missing core branches, cluster imbalance, and semantic drift among siblings.
- TaxoAlign on CS-TaxoBench demonstrates that its three-phase LLM pipeline achieves the highest agreement with human taxonomies across all tested metrics, outperforming baselines such as AutoSurvey, STORM, and keyphrase-based methods (Lahiri et al., 20 Oct 2025).
- For SQL-Synth, coverage across all taxonomy dimensions is complete. Baseline models underperform on execution accuracy, while fine-tuned models (e.g., Synth-Coder) improve substantially, especially on rare intents and advanced statistical operations, highlighting the benefit of targeted taxonomy-guided data (Wang et al., 17 Nov 2025).
5. Error Analysis and Diagnostic Insights
Qualitative and quantitative analysis of predicted taxonomies consistently highlights:
- Systematic omission of major conceptual branches central to expert understanding
- Over-fragmentation of clustering, with excessive creation of fine categories and insufficient top-level structure
- Semantic drift and overlap between categories, reflecting a lack of domain-specific conceptual judgment
- Recovery of node and edge structures in automatic taxonomies remains substantially below expert reproductions, despite high surface fluency
This suggests that current autonomous survey and knowledge organization agents lack the depth of implicit domain knowledge and structural reasoning applied by expert practitioners during synthesis (Zhang et al., 18 Jan 2026).
6. Best Practices and Usage Guidance
Researchers aiming to advance or evaluate deep research agents and structured generation systems are advised:
- To use both Deep Research and Bottom-Up evaluation modes, disentangling retrieval limitations from organization limitations
- To leverage TaxoBench's JSON-formatted taxonomies, associated evaluation code, and high-coverage datasets for protocol-standardized assessment and error tracing
- For text-to-SQL, to use SQL-Synth in combination with existing in-domain data, stratified by complexity, for fine-tuning and to validate generalization via the provided 500-database test set
- To use taxonomy dimension coding for principled coverage analysis, isolating blind spots and revealing distributional bias in both training and evaluation resources (Wang et al., 17 Nov 2025)
- For scholarly taxonomy generation, to benchmark taxonomy generation systems against CS-TaxoBench using the prescribed alignment and semantic coherence metrics, facilitating reproducibility and rigorous quantitative comparison (Lahiri et al., 20 Oct 2025)
7. Implications and Research Directions
The persistent synthesis bottlenecks revealed by TaxoBench instantiations underscore the need for advances in two directions:
- Targeted retrieval: integrating citation networks, expert-curated indices, and multi-hop search to capture seminal literature
- Hierarchical organization: curriculum-style knowledge embedding, large-scale expert taxonomy mining, and hybrid symbolic-neural clustering frameworks
A plausible implication is that, while LLMs have achieved high linguistic performance, truly expert-level survey synthesis and structured knowledge generation may require domain-specific inductive biases or hybrid architectures not present in current paradigms. TaxoBench resources and protocols are publicly available and intended to catalyze research seeking to close these gaps (Zhang et al., 18 Jan 2026, Wang et al., 17 Nov 2025, Lahiri et al., 20 Oct 2025).