TaxoBench: Taxonomy-Guided Benchmark
- TaxoBench is a comprehensive benchmark that assesses expert-level survey synthesis by measuring deep retrieval and hierarchical taxonomy organization.
- It defines two evaluation modes, Deep Research and Bottom-Up, with metrics such as Recall@K, ARI, and Tree-Edit Distance for precise diagnostics.
- The framework drives improvements in automated survey generation and text-to-SQL translation, promoting targeted retrieval and hybrid clustering strategies.
TaxoBench refers to several independently developed diagnostic benchmarks and dataset resources for rigorous evaluation and taxonomy-guided assessment of machine learning models in automated survey generation and real-world text-to-SQL translation. These resources address a central limitation of prior evaluations by focusing on the fundamental cognitive steps that define expert-level task performance: accurate retrieval and organization of domain knowledge. The following entry focuses on TaxoBench as instantiated in automated survey synthesis for computer science and taxonomy-guided text-to-SQL benchmarks, detailing their construction, metrics, core findings, and practical usage (Zhang et al., 18 Jan 2026, Wang et al., 17 Nov 2025, Lahiri et al., 20 Oct 2025).
1. Purpose and Scope
TaxoBench is designed as a gold-standard benchmark to evaluate whether autonomous “deep research” agents and LLMs can replicate the two critical expert processes required for survey writing and knowledge-structured generation: (1) retrieval of the defining literature and (2) organization of this literature into coherent, hierarchical taxonomies.
In the domain of literature survey generation, TaxoBench addresses the synthesis gap overlooked by benchmarks emphasizing surface-level attributes such as linguistic fluency or citation accuracy. Instead, it directly evaluates the underlying capabilities that distinguish expert-authored surveys—retrieving seminal works and constructing multi-level conceptual hierarchies reflecting expert knowledge interrelations. For text-to-SQL applications, TaxoBench denotes a taxonomy-driven protocol ensuring comprehensive coverage of user intents, SQL pattern diversity, and database schema interactions. These variants provide both diagnostic depth and extensive resource coverage for cross-domain LLMs and research agents (Zhang et al., 18 Jan 2026, Wang et al., 17 Nov 2025).
2. Corpus Construction and Data Format
2.1 Survey Generation TaxoBench
The primary TaxoBench corpus for survey generation is derived from 72 highly-cited computer science surveys encompassing eight subdomains, including multimodal learning, reinforcement learning, and AI alignment. Manual extraction by Ph.D.-level annotators yields:
- 72 expert-authored taxonomy trees
- 3,815 unique papers, with precise mapping of each citation to taxonomy leaf nodes
- Average taxonomy: 53 papers, 3.2 hierarchy levels, 12.4 leaf-level categories; surveys average 354.5 citations
The taxonomies are represented in a simple directory-tree JSON format, capturing explicit concept–subconcept–paper relationships as curated by field experts after extensive synthesis (Zhang et al., 18 Jan 2026).
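To make the directory-tree JSON format concrete, the following is a minimal sketch of what one taxonomy record could look like; the titles, paper identifiers, and field names (`concept`, `children`, `papers`) are illustrative assumptions, not the released corpus schema:

```python
import json

# Hypothetical directory-tree taxonomy; field names and IDs are illustrative,
# not the official TaxoBench release schema.
taxonomy = {
    "survey": "Multimodal Learning: A Survey",   # illustrative title
    "children": [
        {
            "concept": "Representation Learning",
            "children": [
                # leaf node: citations mapped to this category
                {"concept": "Contrastive Methods",
                 "papers": ["paper_0042", "paper_0137"]},
            ],
        },
        {"concept": "Fusion Architectures", "papers": ["paper_0201"]},
    ],
}

def count_leaf_papers(node):
    """Recursively count papers mapped to leaf-level categories."""
    total = len(node.get("papers", []))
    return total + sum(count_leaf_papers(c) for c in node.get("children", []))

print(count_leaf_papers(taxonomy))  # → 3
```

A recursive walk like `count_leaf_papers` is enough to recover per-taxonomy statistics such as the average paper count and leaf-category count reported above.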
2.2 Taxonomy-Guided Text-to-SQL (SQL-Synth/TaxoBench)
For the text-to-SQL instantiation, TaxoBench leverages a systematic four-dimensional taxonomy (core intent, statement type, syntax structure, key action), used to synthesize SQL-Synth—an extensive dataset:
- 114,029 natural language question (NLQ)–SQL pairs over 1,250 training databases; 8,601 pairs over 500 test databases
- Comprehensive schema coverage with explicit mapping to taxonomy dimensions, ensuring all relevant database operation patterns are represented and balanced by computational complexity bands (simple, medium, hard)
- Dataset is distributed with schema files, prompts, and evaluation code in a standardized format (Wang et al., 17 Nov 2025)
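A single NLQ-SQL pair with its four-dimensional taxonomy coding might be stored as follows; every key name and value here is an assumption for illustration, since the exact release schema is not reproduced in this entry:

```python
from collections import Counter

# Hypothetical record layout for one SQL-Synth NLQ-SQL pair; field names
# are illustrative assumptions, not the official release schema.
record = {
    "nlq": "How many orders were placed in 2024?",
    "sql": "SELECT COUNT(*) FROM orders WHERE strftime('%Y', order_date) = '2024'",
    "database": "db_0831",          # assumed database identifier
    "taxonomy": {                   # the four-dimensional taxonomy coding
        "core_intent": "aggregation",
        "statement_type": "SELECT",
        "syntax_structure": "single-table",
        "key_action": "COUNT",
    },
    "complexity": "simple",         # one of: simple, medium, hard
}

# Coverage analysis then reduces to counting records per taxonomy cell:
records = [record]
coverage = Counter((r["taxonomy"]["core_intent"], r["complexity"]) for r in records)
print(coverage[("aggregation", "simple")])  # → 1
```

Tagging every pair with its taxonomy cell is what makes the complexity-band balancing described above checkable by simple counting.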
2.3 CS-TaxoBench for Scholarly Taxonomy Generation
CS-TaxoBench from the TaxoAlign project consists of:
- 460 gold-standard taxonomies extracted from ACM Computing Surveys and major conference surveys (IJCAI, ACL, NAACL, EMNLP, EACL), ~131 reference papers per taxonomy
- Node statistics: mean 14.8 nodes per taxonomy; mean branching factor 3.1; depth up to 4–5 levels
- Data is provided in JSON schema comprising topic, paper_ids, and explicit parent-child relationships for nodes (Lahiri et al., 20 Oct 2025)
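A minimal sketch of the node schema described above (topic, paper_ids, parent-child links), with assumed key names for illustration:

```python
# Sketch of a CS-TaxoBench-style node list; key names are assumptions
# chosen to match the schema description (topic, paper_ids, parent links).
nodes = [
    {"id": 0, "topic": "Dialogue Systems", "paper_ids": [],       "parent": None},
    {"id": 1, "topic": "Task-Oriented",    "paper_ids": [11, 12], "parent": 0},
    {"id": 2, "topic": "Open-Domain",      "paper_ids": [13],     "parent": 0},
]

# The parent->child edge set is the object that Edge Precision/Recall/F1
# (Section 3.2) operates on:
edges = {(n["parent"], n["id"]) for n in nodes if n["parent"] is not None}
print(sorted(edges))  # → [(0, 1), (0, 2)]
```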
3. Evaluation Protocols and Performance Metrics
Across all TaxoBench instantiations, evaluation targets both leaf-level assignments (paper and example coverage) and global structural alignment.
3.1 Survey Generation Evaluation
Two main evaluation modes are defined:
- Deep Research Mode: Given only a survey topic, models must retrieve relevant literature end-to-end and organize it into a hierarchy, producing a complete predicted taxonomy.
- Bottom-Up Mode: The gold-standard list of expert-selected papers (the leaves of the expert taxonomy) is provided; models must only organize these into a taxonomy, isolating structuring ability.
Metrics include:
- Recall@K in Deep Research: the fraction of expert-selected core papers present among the top-K retrieved papers
- Adjusted Rand Index (ARI) in Bottom-Up: quantifies agreement between model and expert clusters
- Homogeneity, Completeness, V-Measure: assess purity and informativeness of clustering
- Tree-Edit Distance (TED) and Soft-F1 (semantic similarity of node labels): quantify hierarchy structure overlap
- LLM-as-judge protocol for qualitative structural evaluation
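The retrieval side of these metrics is straightforward to compute; the following is a minimal, standard implementation of Recall@K (the benchmark's exact variant may differ in tie-breaking or deduplication details):

```python
def recall_at_k(retrieved, gold, k):
    """Fraction of gold papers found among the top-k retrieved items.
    Standard definition; the benchmark's exact variant may differ."""
    gold = set(gold)
    hits = len(set(retrieved[:k]) & gold)
    return hits / len(gold)

gold_papers = ["p1", "p2", "p3", "p4"]
retrieved = ["p9", "p2", "p1", "p7", "p3"]  # ranked retrieval output
print(recall_at_k(retrieved, gold_papers, k=5))  # → 0.75
```

The clustering metrics (ARI, homogeneity, completeness, V-measure) are available off the shelf in scikit-learn's `metrics` module, applied to leaf-cluster label assignments.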
3.2 CS-TaxoBench Evaluation Metrics
- Average Degree Score: global branching similarity between the predicted and gold trees
- Edge Precision/Recall/F1: overlap of parent→child edges
- Tree Edit Distance (TED): minimal edit operation path between predicted and expert trees
- Level-order Traversal Scores: BLEU-2, ROUGE-L, BERTScore on node label sequences
- Node Soft Recall (NSR): embedding-based node similarity, using SBERT cosine
- Node Entity Recall (NER): lexical overlap of noun phrase spans
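Two of these metrics are simple enough to sketch directly: edge overlap is plain set arithmetic over parent→child pairs, and the level-order traversal that BLEU-2/ROUGE-L/BERTScore are computed over is a breadth-first walk. The tree schema below (`topic`/`children` keys) is an assumption for illustration:

```python
from collections import deque

def edge_prf(pred_edges, gold_edges):
    """Edge Precision/Recall/F1 as standard set overlap of parent->child pairs."""
    pred, gold = set(pred_edges), set(gold_edges)
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def level_order_labels(tree):
    """BFS over a {'topic': str, 'children': [...]} tree, yielding the node-label
    sequence that the traversal-based text metrics are computed on."""
    out, queue = [], deque([tree])
    while queue:
        node = queue.popleft()
        out.append(node["topic"])
        queue.extend(node.get("children", []))
    return out

gold = {"topic": "NLP", "children": [{"topic": "Parsing", "children": []},
                                     {"topic": "Generation", "children": []}]}
pred_edges = [("NLP", "Parsing"), ("NLP", "Summarization")]
gold_edges = [("NLP", "Parsing"), ("NLP", "Generation")]
print(edge_prf(pred_edges, gold_edges))   # → (0.5, 0.5, 0.5)
print(level_order_labels(gold))           # → ['NLP', 'Parsing', 'Generation']
```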
3.3 Text-to-SQL TaxoBench Metrics
- Execution Accuracy (EX): agreement with the gold query, measured on the returned result set for SELECT queries or on the post-execution database state for DML/DDL queries
- Coverage metrics: per-dimension recovery rate across taxonomy types
- Diversity and complexity: type-token ratio (TTR), count of semantic clusters, and uniform coverage across pre-defined complexity bands
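For the SELECT case, execution accuracy reduces to comparing result multisets on the same database. A minimal sketch using an in-memory SQLite database (the benchmark's EX metric additionally compares post-execution database state for DML/DDL, which is omitted here; table and column names are invented for illustration):

```python
import sqlite3

def execution_accuracy(conn, pred_sql, gold_sql):
    """Order-insensitive comparison of two SELECT queries' result sets.
    Minimal sketch: DML/DDL state comparison is omitted."""
    pred = sorted(conn.execute(pred_sql).fetchall())
    gold = sorted(conn.execute(gold_sql).fetchall())
    return pred == gold

# Illustrative toy database, not part of the benchmark.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])

# Two syntactically different queries that are execution-equivalent here:
print(execution_accuracy(conn,
                         "SELECT COUNT(*) FROM orders WHERE amount > 10",
                         "SELECT SUM(amount > 10) FROM orders"))  # → True
```

Execution-based matching is what lets the benchmark credit semantically correct SQL even when its surface form differs from the gold query.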
4. Main Findings and Benchmark Results
Evaluation results reveal pronounced “synthesis gaps” for existing systems:
- In survey Deep Research mode, research agents recall only a fraction of the expert-selected core papers, with even the best agent falling well short of complete recall. This limits downstream taxonomy quality, with persistent failures in reproducing critical branches (TED and Soft-F1 remain poor) (Zhang et al., 18 Jan 2026).
- In Bottom-Up organization, the best ARI across twelve leading LLMs is $0.31$; major structural errors include missing core branches, cluster imbalance, and semantic drift among siblings.
- TaxoAlign on CS-TaxoBench demonstrates that its three-phase LLM pipeline achieves the highest agreement with human taxonomies across all tested metrics, outperforming baselines such as AutoSurvey, STORM, and keyphrase-based methods (Lahiri et al., 20 Oct 2025).
- For SQL-Synth, coverage across all taxonomy dimensions is complete. Baseline models underperform on execution accuracy, while fine-tuned models (e.g., Synth-Coder) improve substantially, especially on rare intents and advanced statistical operations, highlighting the benefit of targeted taxonomy-guided data (Wang et al., 17 Nov 2025).
5. Error Analysis and Diagnostic Insights
Qualitative and quantitative analysis of predicted taxonomies consistently highlights:
- Systematic omission of major conceptual branches central to expert understanding
- Over-fragmentation of clustering, with excessive creation of fine categories and insufficient top-level structure
- Semantic drift and overlap between categories, reflecting a lack of domain-specific conceptual judgment
- Recovery of node and edge structures in automatic taxonomies remains substantially below expert reproductions, despite high surface fluency
This suggests that current autonomous survey and knowledge organization agents lack the depth of implicit domain knowledge and structural reasoning applied by expert practitioners during synthesis (Zhang et al., 18 Jan 2026).
6. Best Practices and Usage Guidance
Researchers aiming to advance or evaluate deep research agents and structured generation systems are advised:
- To use both Deep Research and Bottom-Up evaluation modes, disentangling retrieval limitations from organization limitations
- To leverage TaxoBench's JSON-formatted taxonomies, associated evaluation code, and high-coverage datasets for protocol-standardized assessment and error tracing
- For text-to-SQL, to use SQL-Synth in combination with existing in-domain data, stratified by complexity, for fine-tuning and to validate generalization via the provided 500-database test set
- To use taxonomy dimension coding for principled coverage analysis, isolating blind spots and revealing distributional bias in both training and evaluation resources (Wang et al., 17 Nov 2025)
- For scholarly taxonomy generation, to benchmark taxonomy generation systems against CS-TaxoBench using the prescribed alignment and semantic coherence metrics, facilitating reproducibility and rigorous quantitative comparison (Lahiri et al., 20 Oct 2025)
7. Implications and Research Directions
The persistent synthesis bottlenecks revealed by TaxoBench instantiations underscore the need for advances in two directions:
- Targeted retrieval: integrating citation networks, expert-curated indices, and multi-hop search to capture seminal literature
- Hierarchical organization: curriculum-style knowledge embedding, large-scale expert taxonomy mining, and hybrid symbolic-neural clustering frameworks
A plausible implication is that, while LLMs have achieved high linguistic performance, truly expert-level survey synthesis and structured knowledge generation may require domain-specific inductive biases or hybrid architectures not present in current paradigms. TaxoBench resources and protocols are publicly available and intended to catalyze research seeking to close these gaps (Zhang et al., 18 Jan 2026, Wang et al., 17 Nov 2025, Lahiri et al., 20 Oct 2025).