Multi-Table Data Discovery
- Multi-table data discovery is a set of computational techniques that identify, retrieve, and integrate related tables from diverse data repositories.
- Key methods include embedding-based retrieval, graph-based approaches, index filtering, and LLM integration to detect joinable, unionable, and composite relationships.
- Recent advances show improvements in accuracy and efficiency with robust benchmarks, though challenges in scalability and semantic integration persist.
Multi-table data discovery is the family of computational techniques, models, and systems designed to identify, retrieve, relate, and assemble multiple tables (and/or views) from large-scale heterogeneous data repositories or data lakes. This includes the detection of joinable, unionable, or subset/superset tables, as well as more complex composite relationships needed for integration, enrichment, or multi-hop analytics. The goal is to support automated or semi-automated exploration and synthesis of relevant tables for analytics, business intelligence, data integration, entity resolution, and open-domain question answering.
1. Formal Problem Definitions and Taxonomy
Multi-table data discovery encompasses a set of interrelated tasks characterized by different structural and semantic requirements:
- Unionable Table Discovery: Given a query table Q, retrieve tables T such that the union Q ∪ T is well-defined (schemas are alignable, compatible columns exist) (Otto, 4 Nov 2025). This is often posed as top-k table union search (TUS), with unionability defined via strict or semantic column compatibility.
- Joinable Table Discovery: Identify tables (or columns) that can be joined with a query table or column based on key compatibility, which can be exact (equi-join) or semantic (fuzzy, type-aware, or LLM-augmented) (Dong et al., 2022, Cong et al., 2022, Koutras et al., 2024). For n-ary keys or composite join scenarios, efficient enumeration and pruning become critical (Esmailoghli et al., 2021).
- Subset/Superset Table Discovery: Detect tables where the query table is a subset or superset in either rows or columns, often under constraints of semantic similarity (Otto, 4 Nov 2025, Khatiwada et al., 2024).
- Composite/Multi-hop Discovery: Retrieve sets of tables that, when joined along a path or hypergraph of relationships, provide maximal coverage for a complex query (Boutaleb et al., 17 Nov 2025, Cong et al., 2022, Liu et al., 3 Jan 2026). The focus expands from pairwise compatibility to set-level structural coherence and redundancy minimization.
- Natural Language Conditional Table Discovery: Combine structured table queries with NL-specified constraints, as in nlcTD, requiring models to integrate symbolic and free-form semantic guidance (Cui et al., 11 Jul 2025, Cui et al., 22 Apr 2025).
The general schema for formalizing these problems is to seek candidate tables T that maximize a composite objective function f(Q, T), where Q encodes the query (table, columns, NL condition) and f captures relevance, joinability, unionability, and/or semantic satisfaction.
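As an illustration only (the decomposition and weights below are not taken from any single cited system), such a composite objective is often written as a weighted combination of task-specific compatibility scores:

```latex
T^{*} = \arg\max_{T \in \mathcal{L}} f(Q, T), \qquad
f(Q, T) = \alpha\,\mathrm{rel}(Q, T) + \beta\,\mathrm{join}(Q, T)
        + \gamma\,\mathrm{union}(Q, T) + \delta\,\mathrm{sem}(Q, T)
```

where \(\mathcal{L}\) is the set of candidate tables in the lake and the weights trade off relevance, joinability, unionability, and semantic satisfaction of any NL condition.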
2. Algorithmic Frameworks and Model Architectures
The last five years have seen significant progress in both scalable index-based methods and neural representation learning for multi-table data discovery. The dominant paradigms include:
a. Embedding-based Retrieval
- Column/Table Embeddings: Columns or tables are encoded into fixed-dimensional vector spaces using pretrained or fine-tuned PLMs (e.g., BERT, Sentence-BERT, DistilBERT, TABBIE, TabSketchFM). Joinability, unionability, and semantic relatedness are quantified by similarity (usually cosine) in this space (Dong et al., 2022, Khatiwada et al., 2024, Cong et al., 2022).
- Contrastive/Triplet Learning: Models such as DeepJoin train specifically to minimize distance between joinable pairs and maximize it between non-joinable pairs, using triplet or ranking losses (Dong et al., 2022, Cong et al., 2022, Khatiwada et al., 2024).
- Tabular Sketches: TabSketchFM introduces sketch-based representations per column (MinHash, numerical, snapshot), which are combined in a transformer architecture to efficiently capture both set-level statistics and content-based signals (Khatiwada et al., 2024).
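The retrieval step shared by these embedding-based methods reduces to nearest-neighbor search in the embedding space. A minimal sketch of cosine-similarity top-k ranking is below; the toy vectors stand in for PLM outputs (e.g., Sentence-BERT column embeddings), and `topk_by_cosine` is an illustrative name, not an API from any cited system.

```python
import numpy as np

def topk_by_cosine(query_vec, corpus_vecs, k=2):
    """Rank candidate columns by cosine similarity to the query column.

    query_vec: (d,) embedding of the query column.
    corpus_vecs: (n, d) embeddings of candidate columns.
    Returns the indices of the k most similar columns, best first.
    """
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    sims = c @ q  # cosine similarities of every candidate vs. the query
    return np.argsort(-sims)[:k]

# Toy embeddings standing in for PLM outputs.
query = np.array([1.0, 0.0, 0.0])
candidates = np.array([
    [0.9, 0.1, 0.0],   # near-duplicate column
    [0.0, 1.0, 0.0],   # orthogonal column
    [0.7, 0.0, 0.7],   # partially related column
])
print(topk_by_cosine(query, candidates, k=2))
```

In production, the exhaustive similarity scan above is replaced by an approximate nearest-neighbor index so that ranking stays sublinear in the number of columns in the lake.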
b. Graph-based and Hypergraph Approaches
- Hypergraphs and Message Passing: HyperJoin models entire data lakes as hypergraphs with intra-table and LLM-augmented inter-table hyperedges. It uses hierarchical interaction networks (HINs) to propagate information bidirectionally, capturing both intra- and inter-table patterns. Coherence-aware reranking via maximum spanning tree pruning is employed to enforce result set consistency (Liu et al., 3 Jan 2026).
- Column Similarity Graphs: OmniMatch constructs a multi-relational graph where edges encode diverse similarity signals (Jaccard, embedding similarity, distributional, set containment) between column pairs, and uses a relational GNN to aggregate and propagate join signals without hand-tuned thresholds (Koutras et al., 2024).
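The core idea of propagating join evidence over a column-similarity graph can be sketched without any GNN machinery. The stdlib-only snippet below averages each edge's multi-signal score with the scores of edges sharing an endpoint; it is a crude stand-in for relational-GNN message passing, not OmniMatch's actual model, and the column names and signal values are hypothetical.

```python
from collections import defaultdict

def aggregate_join_scores(edges, rounds=1):
    """Smooth per-edge similarity scores over a multi-relational column graph.

    edges: dict mapping a column pair (a, b) to a dict of similarity
    signals, e.g. {"jaccard": 0.8, "embedding": 0.6}. Signals are first
    averaged per edge, then each edge's score is mixed with the mean
    score of edges incident to either endpoint.
    """
    score = {pair: sum(sig.values()) / len(sig) for pair, sig in edges.items()}
    incident = defaultdict(list)
    for a, b in edges:
        incident[a].append((a, b))
        incident[b].append((a, b))
    for _ in range(rounds):
        nxt = {}
        for pair in edges:
            neigh = [score[e] for n in pair for e in incident[n] if e != pair]
            mean = sum(neigh) / len(neigh) if neigh else score[pair]
            nxt[pair] = 0.5 * score[pair] + 0.5 * mean
        score = nxt
    return score

# Hypothetical column pairs with two similarity signals each.
edges = {
    ("A.id", "B.id"):  {"jaccard": 0.9, "embedding": 0.8},
    ("A.id", "C.key"): {"jaccard": 0.2, "embedding": 0.3},
    ("B.id", "C.key"): {"jaccard": 0.1, "embedding": 0.1},
}
scores = aggregate_join_scores(edges)
print(max(scores, key=scores.get))  # strongest join candidate pair
```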
c. Index-based Filtering and Sketching
- Super Key and Syntactic Hashing: MATE compresses n-ary join keys into "super keys" using XASH, which encodes syntactic features. Early pruning is achieved by bitwise operations, discarding up to 1,000× more false positives compared to unary-only indexes at scale (Esmailoghli et al., 2021).
- MinHash and LSH-based Pruning: LSH Ensemble, JOSIE, and similar techniques use minwise hashing and locality-sensitive hashing for scalable set relatedness discovery, especially for union and join compatibility (Khatiwada et al., 2023).
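The MinHash primitive behind these pruning techniques fits in a few lines. The sketch below uses seeded salts with Python's built-in `hash` rather than the engineered hash families of JOSIE or LSH Ensemble, so it is illustrative only; the key property it demonstrates is that two signatures agree at a position with probability approximately equal to the Jaccard similarity of the underlying value sets.

```python
import random

def minhash_signature(values, num_hashes=128, seed=42):
    """MinHash signature of a column's value set: the minimum salted
    hash over the whole set, once per simulated permutation."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [min(hash((salt, v)) for v in values) for salt in salts]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two columns agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Two overlapping "columns" of integer keys; true Jaccard = 50/150 = 1/3.
sig_a = minhash_signature(set(range(100)))
sig_b = minhash_signature(set(range(50, 150)))
print(round(estimated_jaccard(sig_a, sig_b), 2))  # MinHash estimate of the 1/3 overlap
```

LSH then bands these signatures so that only columns whose bands collide are compared at all, which is what makes set-relatedness search scale to millions of columns.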
d. LLM Integration
- LLM-powered Conditional Retrieval: TableCopilot (CROFUMA) and nlcTD models aggregate cross-modal scores for structured (query table) and unstructured (NL condition) inputs, supporting union/join/fuzzy search via linear or neural fusion (Cui et al., 11 Jul 2025, Cui et al., 22 Apr 2025).
- Hierarchical Catalogs and Semantic Summaries: LEDD constructs and indexes LLM-summarized metadata facets, then clusters embeddings in a multi-level hierarchy for navigation and search (An et al., 21 Feb 2025).
e. Lightweight Entity-Aware Systems
- Query Parsing and Entity Matching: Octopus avoids heavy content indexing by parsing queries for fine-grained column and value mentions, using compact embedding indices on column headers and direct grepping for value hits, supporting both independent and join-based multi-table retrieval with minimal resource overhead (Li et al., 5 Jan 2026).
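The header-matching-plus-value-grepping idea can be illustrated with a stdlib sketch. This is not Octopus's implementation: the scoring weights and the `discover` helper are assumptions for illustration, and a real system would use compact embedding indices over headers rather than exact token overlap.

```python
import re

def discover(query, tables):
    """Score tables by header-token overlap plus direct hits of quoted
    query values in cell contents; return table names, best first.

    tables: dict name -> {"headers": [...], "rows": [[...], ...]}.
    """
    q_tokens = set(re.findall(r"\w+", query.lower()))
    q_values = set(re.findall(r'"([^"]+)"', query))  # quoted literals
    scores = {}
    for name, t in tables.items():
        header_hits = sum(h.lower() in q_tokens for h in t["headers"])
        value_hits = sum(
            any(v in str(cell) for row in t["rows"] for cell in row)
            for v in q_values
        )
        scores[name] = header_hits + 2 * value_hits  # value evidence weighs more
    return sorted(scores, key=scores.get, reverse=True)

tables = {
    "cities": {"headers": ["city", "country"], "rows": [["Paris", "France"]]},
    "orders": {"headers": ["order_id", "total"], "rows": [["17", "9.99"]]},
}
print(discover('which country is "Paris" in', tables))
```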
3. Evaluation Methodologies, Benchmarks, and Metrics
Robust evaluation for multi-table data discovery has been driven by the development of comprehensive benchmarks and standardized metrics:
- Benchmarks: Datasets such as LakeBench, TUSBench, nlcTables, and OpenData/NYC OpenData benchmarks are used to measure performance on union, join, subset, and NL-conditional tasks, incorporating thousands to millions of tables (Cui et al., 22 Apr 2025, Otto, 4 Nov 2025, Srinivas et al., 2023, Koutras et al., 2024).
- Metrics: Major metrics include Precision@k, Recall@k, F1, MAP@k, NDCG@k, R² (for regression tasks, e.g., Jaccard similarity), as well as runtime, end-to-end system latency, memory footprint, and reduction factor (in candidate views) (Otto, 4 Nov 2025, Dong et al., 2022, Li et al., 5 Jan 2026, Fernandez et al., 2019).
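The ranking metrics above have standard definitions; a minimal binary-relevance implementation is shown below for concreteness (the table names are hypothetical).

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved tables that are relevant."""
    return sum(r in relevant for r in ranked[:k]) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant tables found in the top k."""
    return sum(r in relevant for r in ranked[:k]) / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: discounted gain over the ideal ranking."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, r in enumerate(ranked[:k]) if r in relevant)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(k, len(relevant))))
    return dcg / ideal

ranked = ["t1", "t4", "t2"]   # system output, best first
relevant = {"t1", "t2"}       # ground-truth relevant tables
print(precision_at_k(ranked, relevant, 3))  # 2 of 3 hits
print(ndcg_at_k(ranked, relevant, 3))
```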
- Labeling: Binary (unionable/joinable/not), multi-label (join columns), and regression-based (containment/Jaccard scores) labels are standard, often with ground truth derived from table splits, inclusion dependencies, or manual/LLM-verified semantic augmentation (Cui et al., 22 Apr 2025, Srinivas et al., 2023).
- Ablations: Recent works systematically evaluate the roles of different sketch types, negative example generation, and reranking strategies (Khatiwada et al., 2024, Liu et al., 3 Jan 2026, Koutras et al., 2024).
4. System Architectures and End-to-end Pipelines
Practical multi-table data discovery systems integrate discovery, integration, and analysis in modular workflows:
- DIALITE: Implements a pluggable pipeline (discovery, alignment/integration, analysis) that can leverage either SANTOS or LSH Ensemble for discovery; supports full disjunction-based integration (ALITE) and downstream analytics (aggregation, ER) (Khatiwada et al., 2023).
- Ver: Emphasizes robust project-join view discovery without PK-FK metadata through join-path search, 4C (compatible/contained/complementary/contradictory) view classification, and interactive user feedback via a multi-armed bandit (Gong et al., 2021).
- Octopus: Offers a training-free, entity-centric workflow for both independent and join-based table retrieval and cell-level value extraction, with tight integration of retrieval and downstream NL2SQL execution (Li et al., 5 Jan 2026).
- TableCopilot: Exemplifies a comprehensive LLM-powered assistant, able to switch between NL-only, union, and join discovery, with dynamic UI elements and downstream workspace management (Cui et al., 11 Jul 2025).
5. Empirical Results and Performance Trends
Recent systems have documented significant advances in both discovery accuracy and computational efficiency:
| Method/System | Setting | Main Metric | Best-Achieved Value / Improvement |
|---|---|---|---|
| TabSketchFM | Join search | F1@10 (Wiki) | 89.1%, +5.4pp over SBERT |
| HyperJoin | Join@15 (Webtable, USA) | Precision@15, Recall@15 | +21.4pp, +17.2pp over best baseline |
| EasyTUS (ET-O) | TUSBench (Union) | MAP (average over lakes) | +34.3% over D3L, 79.2x faster prep |
| CROFUMA (TableCopilot) | nlcTables (Union/Join) | NDCG@5 | +12–14pp over best single-input method |
| OmniMatch | Any-join (CityGov) | F1/PR-AUC | 0.857/0.920, +14pp F1 over Starmie |
| Octopus | Join discovery (Spider) | Hit@1 | 56.7%, +29.7pp over Pneuma |
| Ver/Dataset-on-Demand | Candidate view pruning | Reduction factor | Up to 10x; 4C-distillation in ~100s |
This table is representative; systems also report strong robustness under missing metadata, few-shot adaptation to new data lakes, and orders-of-magnitude improvements in offline/online cost over earlier baselines (Otto, 4 Nov 2025, Khatiwada et al., 2024, Dong et al., 2022, Koutras et al., 2024).
6. Limitations, Open Challenges, and Future Directions
Despite progress, several major challenges persist:
- Semantic Constraint Integration: Models integrating symbolic (table structure) and open-ended NL constraints (including numeric/date reasoning) remain underdeveloped. Existing fusion approaches are limited to relatively simple conditions (Cui et al., 22 Apr 2025, Cui et al., 11 Jul 2025).
- Set-level Coherence: Naive top-k ranking yields incoherent results; recent hypergraph/message-passing and coherence-aware reranking methods (e.g., HyperJoin) achieve substantial gains, but multi-hop and set-level reasoning remain computationally demanding (Liu et al., 3 Jan 2026, Boutaleb et al., 17 Nov 2025).
- Scalability: While embedding-based and sketch-based methods achieve strong speed/quality tradeoffs, challenges persist with web-scale and highly dynamic data lakes, or in settings with minimal metadata (Otto, 4 Nov 2025, Cong et al., 2022).
- Label Scarcity/Robust Evaluation: Data and benchmarks for composite/multi-table scenarios are still limited, and new tasks such as NL-conditioned fuzzy retrieval create annotation bottlenecks (Cui et al., 22 Apr 2025).
- Generalization and Adaptivity: Effective transfer across domains and efficient adaptation to new table distributions and patterns are ongoing research topics (Khatiwada et al., 2024, Dong et al., 2022).
Future directions indicated in recent literature include: end-to-end NL-conditional table retrieval architectures, learnable/semantic index structures, reinforcement learning from user feedback, dynamic integration with knowledge graphs, and joint ranking+SQL generation for complex information needs (Cui et al., 11 Jul 2025, Cui et al., 22 Apr 2025, Boutaleb et al., 17 Nov 2025, Cong et al., 2022).
7. Practical Guidelines and Deployment Considerations
To deploy high-performance multi-table data discovery in practice, key recommendations include:
- Precompute column/table embeddings and enable fast approximate ANN search (e.g., HNSW) for all major tasks (Dong et al., 2022, Otto, 4 Nov 2025, Khatiwada et al., 2024).
- Combine or ensemble diverse signals (contextual, structural, sketch-based, statistical) for robustness across tasks; ablation studies show strong specialization by sketch type (Khatiwada et al., 2024, Koutras et al., 2024).
- Where available, leverage knowledge graphs and schema metadata, but do not depend on them; design for missing or imperfect metadata (An et al., 21 Feb 2025, Koutras et al., 2024).
- Consider hybrid systems: use fast, lightweight entity-aware filtering for initial narrowing (e.g., header/entity-based), followed by deeper LLM-augmented ranking or GNN reranking (Li et al., 5 Jan 2026, Liu et al., 3 Jan 2026).
- System pipelines should be modular to permit swapping discovery, alignment, integration, and query-answering components for domain or use-case customization (Khatiwada et al., 2023, An et al., 21 Feb 2025).
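The hybrid filter-then-rerank recommendation above can be sketched as a two-stage pipeline; the `hybrid_discover` helper, the lake contents, and the header-overlap signal are all illustrative assumptions, with any expensive scorer (LLM- or GNN-based in practice) pluggable as the `rerank` callable.

```python
def hybrid_discover(query_headers, tables, cheap_top=3, final_top=1, rerank=None):
    """Stage 1: a cheap header-overlap filter narrows the lake to
    cheap_top candidates. Stage 2: an expensive scorer (any callable
    on a table name) reranks only that shortlist; rerank=None falls
    back to the cheap signal."""
    wanted = {h.lower() for h in query_headers}

    def overlap(name):
        return len(wanted & {h.lower() for h in tables[name]})

    shortlist = sorted(tables, key=overlap, reverse=True)[:cheap_top]
    return sorted(shortlist, key=rerank or overlap, reverse=True)[:final_top]

lake = {
    "airports":  ["city", "iata"],
    "countries": ["city", "country"],
    "invoices":  ["order_id", "total"],
}
print(hybrid_discover(["city", "country"], lake))
```

Keeping the expensive scorer confined to the shortlist is what lets LLM or GNN reranking stay affordable even over large lakes.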
In sum, multi-table data discovery is an essential and rapidly evolving research domain that brings together scalable indexing, neural representation learning, graph structures, and LLM-powered semantics for comprehensive, composable analysis over heterogeneous data lakes and open repositories.