
Multi-Table Data Discovery

Updated 12 January 2026
  • Multi-table data discovery is a set of computational techniques that identify, retrieve, and integrate related tables from diverse data repositories.
  • Key methods include embedding-based retrieval, graph-based approaches, index filtering, and LLM integration to detect joinable, unionable, and composite relationships.
  • Recent advances show improvements in accuracy and efficiency with robust benchmarks, though challenges in scalability and semantic integration persist.

Multi-table data discovery is the family of computational techniques, models, and systems designed to identify, retrieve, relate, and assemble multiple tables (and/or views) from large-scale heterogeneous data repositories or data lakes. This includes the detection of joinable, unionable, or subset/superset tables, as well as more complex composite relationships needed for integration, enrichment, or multi-hop analytics. The goal is to support automated or semi-automated exploration and synthesis of relevant tables for analytics, business intelligence, data integration, entity resolution, and open-domain question answering.

1. Formal Problem Definitions and Taxonomy

Multi-table data discovery encompasses a set of interrelated tasks characterized by different structural and semantic requirements:

  1. Unionable Table Discovery: Given a query table T_q, retrieve tables T_i such that the union T_q ∪ T_i is well-defined (schemas are alignable, compatible columns exist) (Otto, 4 Nov 2025). This is often posed as top-k table union search (TUS), with unionability defined via strict or semantic column compatibility.
  2. Joinable Table Discovery: Identify tables (or columns) that can be joined with a query table or column based on key compatibility, which can be exact (equi-join) or semantic (fuzzy, type-aware, or LLM-augmented) (Dong et al., 2022, Cong et al., 2022, Koutras et al., 2024). For n-ary keys or composite join scenarios, efficient enumeration and pruning become critical (Esmailoghli et al., 2021).
  3. Subset/Superset Table Discovery: Detect tables where the query table is a subset or superset in either rows or columns, often under constraints of semantic similarity (Otto, 4 Nov 2025, Khatiwada et al., 2024).
  4. Composite/Multi-hop Discovery: Retrieve sets of tables that, when joined along a path or hypergraph of relationships, provide maximal coverage for a complex query (Boutaleb et al., 17 Nov 2025, Cong et al., 2022, Liu et al., 3 Jan 2026). The focus expands from pairwise compatibility to set-level structural coherence and redundancy minimization.
  5. Natural Language Conditional Table Discovery: Marry structured table queries with NL-specified constraints as in nlcTD, requiring models to integrate symbolic and free-form semantic guidance (Cui et al., 11 Jul 2025, Cui et al., 22 Apr 2025).

The general schema for formalizing these problems is to seek a subset T' ⊆ T maximizing a composite objective function ρ(Q, T'), where T is the set of tables in the repository, Q encodes the query (table, columns, NL condition), and ρ captures relevance, joinability, unionability, and/or semantic satisfaction.
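The objective above can be made concrete with a toy scorer. The sketch below is illustrative only: the `rho` weights, the Jaccard-based join signal, and the sample data lake are invented for exposition, not taken from any system discussed here.

```python
# Toy sketch of top-k multi-table discovery as maximizing a composite
# objective rho(Q, T'): here a weighted sum of a join-compatibility signal
# (best Jaccard overlap with any column) and a small table-size prior.

def jaccard(a, b):
    """Set overlap between two columns of values."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def rho(query_col, table, w_join=0.7, w_size=0.3):
    """Composite score: join compatibility plus a small size prior."""
    best_join = max(jaccard(query_col, col) for col in table["columns"])
    return w_join * best_join + w_size * min(len(table["columns"]) / 10, 1.0)

def top_k(query_col, lake, k=2):
    """Rank candidate tables by rho and keep the k best."""
    return sorted(lake, key=lambda t: rho(query_col, t), reverse=True)[:k]

lake = [
    {"name": "cities",  "columns": [["paris", "rome", "oslo"]]},
    {"name": "gdp",     "columns": [["paris", "rome"], [2.9, 2.1]]},
    {"name": "animals", "columns": [["cat", "dog"]]},
]
result = top_k(["paris", "rome", "berlin"], lake)
```

Real systems replace the hand-set weights and Jaccard signal with learned similarity functions, but the shape of the problem (score each candidate against Q, rank, truncate) is the same.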

2. Algorithmic Frameworks and Model Architectures

The last five years have seen significant progress in both scalable index-based methods and neural representation learning for multi-table data discovery. The dominant paradigms include:

a. Embedding-based Retrieval

  • Column/Table Embeddings: Columns or tables are encoded into fixed-dimensional vector spaces using pretrained or fine-tuned PLMs (e.g., BERT, Sentence-BERT, DistilBERT, TABBIE, TabSketchFM). Joinability, unionability, and semantic relatedness are quantified by similarity (usually cosine) in this space (Dong et al., 2022, Khatiwada et al., 2024, Cong et al., 2022).
  • Contrastive/Triplet Learning: Models such as DeepJoin train specifically to minimize distance between joinable pairs and maximize it between non-joinable pairs, using triplet or ranking losses (Dong et al., 2022, Cong et al., 2022, Khatiwada et al., 2024).
  • Tabular Sketches: TabSketchFM introduces sketch-based representations per column (MinHash, numerical, snapshot), which are combined in a transformer architecture to efficiently capture both set-level statistics and content-based signals (Khatiwada et al., 2024).
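The retrieval step shared by these embedding methods can be sketched in a few lines. The vectors below are hand-made stand-ins; in practice they would come from a fine-tuned PLM (e.g., Sentence-BERT), and the index would be an approximate nearest-neighbor structure rather than a dict.

```python
import numpy as np

# Minimal sketch of embedding-based column retrieval: columns are mapped to
# fixed-dimensional vectors, and joinability/unionability candidates are
# ranked by cosine similarity to the query column's embedding.

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def rank_columns(query_vec, index):
    """index: {column_name: embedding}. Returns columns by descending similarity."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in index.items()]
    return sorted(scored, key=lambda x: x[1], reverse=True)

index = {
    "customers.city":  np.array([0.9, 0.1, 0.0]),
    "orders.order_id": np.array([0.0, 0.2, 0.9]),
    "stores.location": np.array([0.6, 0.4, 0.2]),
}
query = np.array([0.85, 0.15, 0.05])   # embedding of a query column about cities
ranking = rank_columns(query, index)   # semantically closest columns first
```

Contrastive training (as in DeepJoin) changes how the vectors are produced, not this ranking step: joinable pairs are pulled together in the space so that cosine ranking surfaces them.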

b. Graph-based and Hypergraph Approaches

  • Hypergraphs and Message Passing: HyperJoin models entire data lakes as hypergraphs with intra-table and LLM-augmented inter-table hyperedges. It uses hierarchical interaction networks (HINs) to propagate information bidirectionally, capturing both intra- and inter-table patterns. Coherence-aware reranking via maximum spanning tree pruning is employed to enforce result set consistency (Liu et al., 3 Jan 2026).
  • Column Similarity Graphs: OmniMatch constructs a multi-relational graph where edges encode diverse similarity signals (Jaccard, embedding similarity, distributional, set containment) between column pairs, and uses a relational GNN to aggregate and propagate join signals without hand-tuned thresholds (Koutras et al., 2024).
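The intuition behind multi-relational graph approaches can be shown with a toy propagation round. Everything here is illustrative: the two signals, the averaging fusion, and the smoothing rule stand in for the learned relational GNN aggregation used by systems like OmniMatch.

```python
from collections import defaultdict

# Toy multi-relational column-similarity graph: nodes are columns, edges
# carry per-signal similarity scores, and one propagation round blends each
# pair's own fused score with the scores of edges sharing a node, so that
# strong neighborhoods reinforce individual join candidates.

edges = {  # (col_a, col_b) -> {signal_name: score}
    ("A.city", "B.city"):     {"jaccard": 0.8, "embedding": 0.9},
    ("A.city", "C.location"): {"jaccard": 0.1, "embedding": 0.7},
    ("B.city", "C.location"): {"jaccard": 0.2, "embedding": 0.8},
}

def fuse(signals):
    return sum(signals.values()) / len(signals)   # naive signal aggregation

def propagate(edges, alpha=0.7):
    """Blend each edge's fused score with the mean over edges sharing a node."""
    base = {pair: fuse(s) for pair, s in edges.items()}
    incident = defaultdict(list)
    for (u, v), s in base.items():
        incident[u].append(s)
        incident[v].append(s)
    smoothed = {}
    for (u, v), s in base.items():
        neigh = incident[u] + incident[v]
        smoothed[(u, v)] = alpha * s + (1 - alpha) * (sum(neigh) / len(neigh))
    return smoothed

scores = propagate(edges)
```

A GNN replaces the fixed `fuse` and `propagate` functions with learned ones, which is what removes the need for hand-tuned per-signal thresholds.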

c. Index-based Filtering and Sketching

  • Super Key and Syntactic Hashing: MATE compresses n-ary join keys into "super keys" using XASH, which encodes syntactic features. Early pruning is achieved by bitwise operations, discarding up to 1,000× more false positives compared to unary-only indexes at scale (Esmailoghli et al., 2021).
  • MinHash and LSH-based Pruning: LSH Ensemble, JOSIE, and similar techniques use minwise hashing and locality-sensitive hashing for scalable set relatedness discovery, especially for union and join compatibility (Khatiwada et al., 2023).
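The MinHash idea underlying these pruning techniques fits in a short, self-contained sketch. Salted SHA-1 hashes below stand in for the random permutations of production MinHash/LSH implementations; the number of permutations controls the estimate's variance.

```python
import hashlib

# Illustrative MinHash: estimate the Jaccard similarity of two column value
# sets from compact signatures, without comparing the sets directly. This is
# the primitive behind LSH-based candidate pruning for union/join discovery.

def minhash(values, num_perm=64):
    """One signature entry per 'permutation': the minimum salted hash."""
    sig = []
    for i in range(num_perm):
        sig.append(min(
            int(hashlib.sha1(f"{i}:{v}".encode()).hexdigest(), 16)
            for v in values
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions approximates Jaccard."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

a = {"paris", "rome", "oslo", "berlin"}
b = {"paris", "rome", "madrid", "berlin"}
est = estimated_jaccard(minhash(a), minhash(b))   # true Jaccard is 3/5
```

Because signatures are fixed-size, a data lake's columns can be sketched once offline and compared in constant time per pair, which is what makes web-scale candidate pruning feasible.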

d. LLM Integration

  • LLM-powered Conditional Retrieval: TableCopilot (CROFUMA) and nlcTD models aggregate cross-modal scores for structured (query table) and unstructured (NL condition) inputs, supporting union/join/fuzzy search via linear or neural fusion (Cui et al., 11 Jul 2025, Cui et al., 22 Apr 2025).
  • Hierarchical Catalogs and Semantic Summaries: LEDD constructs and indexes LLM-summarized metadata facets, then clusters embeddings in a multi-level hierarchy for navigation and search (An et al., 21 Feb 2025).

e. Lightweight Entity-Aware Systems

  • Query Parsing and Entity Matching: Octopus avoids heavy content indexing by parsing queries for fine-grained column and value mentions, using compact embedding indices on column headers and direct grepping for value hits, supporting both independent and join-based multi-table retrieval with minimal resource overhead (Li et al., 5 Jan 2026).
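A drastically simplified version of this entity-centric pattern is sketched below. The parsing rule (quoted literals as value mentions), the scoring weights, and the sample tables are all invented; Octopus itself uses compact embedding indices over headers rather than exact keyword matching.

```python
import re

# Toy entity-aware retrieval in the spirit of lightweight systems: parse the
# query for value mentions and keywords, match column headers by keyword
# overlap, and "grep" table contents for value hits, weighting direct value
# evidence more heavily than header matches.

def parse_query(q):
    values = re.findall(r"'([^']+)'", q)              # quoted literals → value mentions
    keywords = set(re.findall(r"\w+", q.lower()))     # remaining tokens → header hints
    return keywords, values

def score_table(table, keywords, values):
    header_hits = sum(h.lower() in keywords for h in table["headers"])
    value_hits = sum(any(v in row for row in table["rows"]) for v in values)
    return header_hits + 2 * value_hits               # value evidence weighted higher

tables = [
    {"name": "employees", "headers": ["name", "city"], "rows": [["alice", "paris"]]},
    {"name": "products",  "headers": ["sku", "price"], "rows": [["p1", "9.99"]]},
]
keywords, values = parse_query("employees with city 'paris'")
best = max(tables, key=lambda t: score_table(t, keywords, values))
```

The appeal of this style of system is that nothing here requires training or heavy content indexing, which keeps resource overhead minimal at the cost of semantic recall.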

3. Evaluation Methodologies, Benchmarks, and Metrics

Robust evaluation for multi-table data discovery has been driven by the development of comprehensive benchmarks and standardized metrics.

4. System Architectures and End-to-end Pipelines

Practical multi-table data discovery systems integrate discovery, integration, and analysis in modular workflows:

  • DIALITE: Implements a pluggable pipeline (discovery, alignment/integration, analysis) that can leverage either SANTOS or LSH Ensemble for discovery; supports full disjunction-based integration (ALITE) and downstream analytics (aggregation, ER) (Khatiwada et al., 2023).
  • Ver: Emphasizes robust project-join view discovery without PK-FK metadata through join-path search, 4C (compatible/contained/complementary/contradictory) view classification, and interactive user feedback via a multi-armed bandit (Gong et al., 2021).
  • Octopus: Offers a training-free, entity-centric workflow for both independent and join-based table retrieval and cell-level value extraction, with tight integration of retrieval and downstream NL2SQL execution (Li et al., 5 Jan 2026).
  • TableCopilot: Exemplifies a comprehensive LLM-powered assistant, able to switch between NL-only, union, and join discovery, with dynamic UI elements and downstream workspace management (Cui et al., 11 Jul 2025).

5. Empirical Results

Recent systems have documented significant advances in both discovery accuracy and computational efficiency:

| Method/System | Setting | Main Metric | Best-Achieved Value / Improvement |
|---|---|---|---|
| TabSketchFM | Join search | F1@10 (Wiki) | 89.1%, +5.4pp over SBERT |
| HyperJoin | Join@15 (Webtable, USA) | Precision@15, Recall@15 | +21.4pp, +17.2pp over best baseline |
| EasyTUS (ET-O) | TUSBench (Union) | MAP (average over lakes) | +34.3% over D3L, 79.2x faster prep |
| CROFUMA (TableCopilot) | nlcTables (Union/Join) | NDCG@5 | +12–14pp over best single-input method |
| OmniMatch | Any-join (CityGov) | F1/PR-AUC | 0.857/0.920, +14pp F1 over Starmie |
| Octopus | Join discovery (Spider) | Hit@1 | 56.7%, +29.7pp over Pneuma |
| Ver/Dataset-on-Demand | Candidate view pruning | Reduction factor | Up to 10x; 4C distillation in ~100s |

This table is representative; systems also report strong robustness under missing metadata, few-shot adaptation to new data lakes, and orders-of-magnitude improvements in offline/online cost over earlier baselines (Otto, 4 Nov 2025, Khatiwada et al., 2024, Dong et al., 2022, Koutras et al., 2024).

6. Limitations, Open Challenges, and Future Directions

Despite progress, several major challenges persist:

  • Semantic Constraint Integration: Models integrating symbolic (table structure) and open-ended NL constraints (including numeric/date reasoning) remain underdeveloped. Existing fusion approaches are limited to relatively simple conditions (Cui et al., 22 Apr 2025, Cui et al., 11 Jul 2025).
  • Set-level Coherence: Naive top-k ranking yields incoherent results; recent hypergraph/message-passing and coherence-aware reranking methods (e.g., HyperJoin) achieve substantial gains, but multi-hop and set-level reasoning remain computationally demanding (Liu et al., 3 Jan 2026, Boutaleb et al., 17 Nov 2025).
  • Scalability: While embedding-based and sketch-based methods achieve strong speed/quality tradeoffs, challenges persist with web-scale and highly dynamic data lakes, or in settings with minimal metadata (Otto, 4 Nov 2025, Cong et al., 2022).
  • Label Scarcity/Robust Evaluation: Data and benchmarks for composite/multi-table scenarios are still limited, and new tasks such as NL-conditioned fuzzy retrieval create annotation bottlenecks (Cui et al., 22 Apr 2025).
  • Generalization and Adaptivity: Effective transfer across domains and efficient adaptation to new table distributions and patterns are ongoing research topics (Khatiwada et al., 2024, Dong et al., 2022).

Future directions indicated in recent literature include: end-to-end NL-conditional table retrieval architectures, learnable/semantic index structures, reinforcement learning from user feedback, dynamic integration with knowledge graphs, and joint ranking+SQL generation for complex information needs (Cui et al., 11 Jul 2025, Cui et al., 22 Apr 2025, Boutaleb et al., 17 Nov 2025, Cong et al., 2022).

7. Practical Guidelines and Deployment Considerations

Deploying high-performance multi-table data discovery in practice comes down to matching method to workload: index- and sketch-based filtering for scalable candidate pruning, embedding-based retrieval for semantic matching, graph-based reranking where set-level coherence matters, and LLM-powered components where natural-language conditions or metadata summaries are involved.

In sum, multi-table data discovery is an essential and rapidly evolving research domain that brings together scalable indexing, neural representation learning, graph structures, and LLM-powered semantics for comprehensive, composable analysis over heterogeneous data lakes and open repositories.
