HyperJoin: LLM-Augmented Hypergraph Discovery
- HyperJoin is an LLM-augmented hypergraph framework designed for joinable table discovery in large-scale data lakes.
- It employs a Hierarchical Interaction Network with table-ID and Laplacian encodings to integrate column and hyperedge features.
- Empirical results reveal significant improvements in precision and recall over baseline methods, though challenges in scalability and prompt reliance remain.
HyperJoin is an LLM-augmented hypergraph link prediction framework designed for joinable table discovery in large-scale data lakes. By modeling columns and their structural relationships through a combinatorial hypergraph architecture and leveraging a Hierarchical Interaction Network (HIN) for representation learning, HyperJoin improves both the accuracy and the coherence of discovered joinable columns. Its design addresses limitations found in earlier LLM-based and pairwise similarity-based methods.
1. Formal Problem Statement and Hypergraph Modeling
Let G = (V, E) denote the hypergraph constructed to represent a data lake, where V is the set of column nodes and E is the set of hyperedges, partitioned into intra-table and inter-table types. An intra-table hyperedge models all columns within a single table T. Inter-table hyperedges encode sets of joinable columns, potentially augmented via LLM-driven schema variant detection. An incidence matrix over V and E formalizes these relationships.
Joinable table discovery is framed as a link prediction problem: for a query column q, represented by its node in V, the objective is to identify new inter-table hyperedges that should connect q to other columns. This formulation promotes richer context-aware reasoning relative to prior approaches based only on isolated or pairwise modeling of columns (Liu et al., 3 Jan 2026).
2. Hypergraph Construction and LLM-Augmented Schema Augmentation
The hypergraph is constructed in a multi-stage process:
- Node Feature Engineering: Each node is initialized with a vector synthesizing table name, column name, and cell-value statistics through a learned weighted sum and projection layer.
- Intra-table Hyperedges: For every table T, construct a hyperedge containing all column nodes from T.
- LLM-Augmented Inter-table Hyperedges: Key columns are detected based on keyness heuristics. For each candidate, an LLM is prompted to enumerate semantically equivalent name variants (e.g., expansions, abbreviations, synonyms). These variants, together with exact matches, are assembled into a join graph whose nodes represent key columns and whose edges indicate equivalence. Each connected component of this join graph yields an inter-table hyperedge.
This LLM augmentation enables HyperJoin to overcome brittle reliance on exact heuristics and latent pairwise signals, supporting robust discovery across diverse naming conventions and schema variations (Liu et al., 3 Jan 2026).
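The inter-table hyperedge construction described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the LLM call is replaced by a stub `variant_sets` mapping, and the union-find grouping, column tuples, and function name are all illustrative assumptions.

```python
from collections import defaultdict

def build_join_hyperedges(key_columns, variant_sets):
    """Group key columns into inter-table hyperedges via name equivalence.

    key_columns: list of (table, column_name) pairs flagged as keys.
    variant_sets: mapping column_name -> set of equivalent names
                  (stubbed here; in HyperJoin an LLM proposes them).
    Two columns are linked when their name sets intersect; each
    connected component of the resulting join graph becomes a hyperedge.
    """
    parent = {c: c for c in key_columns}  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def names(col):
        _, name = col
        return {name} | variant_sets.get(name, set())

    cols = list(key_columns)
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if names(cols[i]) & names(cols[j]):
                parent[find(cols[i])] = find(cols[j])  # union

    components = defaultdict(set)
    for c in key_columns:
        components[find(c)].add(c)
    # Only components spanning more than one column yield hyperedges.
    return [comp for comp in components.values() if len(comp) > 1]

cols = [("orders", "cust_id"), ("customers", "customer_id"), ("products", "sku")]
variants = {"cust_id": {"customer_id"}}
edges = build_join_hyperedges(cols, variants)
# One hyperedge linking orders.cust_id with customers.customer_id;
# "sku" stays isolated and produces no hyperedge.
```

A production system would also apply value-overlap checks before accepting an equivalence edge, since name similarity alone can produce false joins.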
3. Hierarchical Interaction Network Architecture
The centerpiece of HyperJoin’s representation learning is the Hierarchical Interaction Network (HIN), which propagates and integrates information at both column (node) and hyperedge levels in the hypergraph.
Positional Encodings
Node features are enhanced via two positional encodings:
- Table-ID Encoding: Each node belonging to table T receives a learned embedding of T's identifier, shared by all columns of that table.
- Laplacian Encoding: From the adjacency matrix A of the pairwise join graph, compute the normalized Laplacian L = I - D^{-1/2} A D^{-1/2}, extract its leading eigenvectors, and map them to each column node via an MLP. The unified node embedding combines the base features with both positional encodings.
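The Laplacian encoding step can be sketched with NumPy. This is a hedged simplification: the MLP projection is omitted, and the convention of skipping the trivial constant eigenvector before taking k eigenvectors is an assumption common to Laplacian positional encodings, not necessarily the paper's exact choice.

```python
import numpy as np

def laplacian_encoding(A, k):
    """Compute k Laplacian-eigenvector positional features per node.

    A: symmetric (n, n) adjacency matrix of the pairwise join graph.
    Returns an (n, k) matrix of per-column positional features.
    """
    n = A.shape[0]
    deg = A.sum(axis=1)
    # D^{-1/2}, guarding against isolated nodes with zero degree.
    d_inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1e-12)), 0.0)
    # Normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}.
    L = np.eye(n) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order for symmetric input.
    eigvals, eigvecs = np.linalg.eigh(L)
    # Skip the trivial eigenvector (eigenvalue 0); take the next k.
    return eigvecs[:, 1:k + 1]

# Example: a 4-node path graph.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
pe = laplacian_encoding(A, 2)   # (4, 2) positional features
```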
Node and Hyperedge Updates
- Node-Transform Layers: Perform linear transformation and normalization on node states.
- Hyperedge Pooling and Specialized Updates: Aggregate node embeddings by mean within each hyperedge (separately for intra- and inter-table types) and apply learnable projections.
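The pooling-and-projection update above can be illustrated as a few lines of NumPy. The function name, the `(nodes, kind)` hyperedge representation, and the use of one projection matrix per hyperedge type are assumptions for illustration; in the real model the projections are trained parameters.

```python
import numpy as np

def pool_hyperedges(X, hyperedges, W_intra, W_inter):
    """Mean-pool node embeddings into hyperedge embeddings.

    X: (n, d) node embedding matrix.
    hyperedges: list of (node_index_list, kind), kind in {"intra", "inter"}.
    W_intra / W_inter: (d, d) type-specific learnable projections.
    Returns an (m, d) matrix of hyperedge embeddings.
    """
    out = []
    for nodes, kind in hyperedges:
        pooled = X[nodes].mean(axis=0)          # mean over member columns
        W = W_intra if kind == "intra" else W_inter
        out.append(pooled @ W)                  # type-specific projection
    return np.stack(out)

X = np.arange(12.0).reshape(4, 3)               # 4 columns, dim 3
hyperedges = [([0, 1], "intra"), ([1, 2, 3], "inter")]
W = np.eye(3)                                   # identity stands in for learned weights
E = pool_hyperedges(X, hyperedges, W, W)        # (2, 3) hyperedge embeddings
```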
Global Hyperedge Mixing
Construct a hyperedge adjacency matrix over E. Stack multiple structure-aware attention mixer layers, which integrate this structural signal into the attention computation and yield refined hyperedge embeddings.
Backpropagation to Columns
Update each column node’s state by aggregating information from its connected hyperedges.
Ablation demonstrates this hierarchical message propagation is critical: omitting it results in drops of up to 20 points in P@15 and similar degradation in recall (Liu et al., 3 Jan 2026).
4. Coherence-Aware Candidate Selection and Reranking
The online phase introduces a coherence-aware top-k column selection framework:
- Initial Retrieval: Retrieve a candidate set via approximate nearest neighbor (ANN) search on final node embeddings.
- Objective Function: Maximize an aggregate score over candidate sets of cardinality k, combining each candidate's relevance to the query with a set-level coherence term. Coherence is computed as the weight of the maximum spanning tree (MST) over the induced candidate graph.
- NP-Hardness: Theorem 4.5 establishes that maximizing this coherence-aware objective is NP-hard for MST-based coherence metrics.
- Greedy MST-based Reranking: A greedy algorithm inspired by Prim’s MST method reranks candidates, promoting both relevance to the query and internal set coherence. At each iteration, candidates are selected maximizing a surrogate marginal gain; Theorem 4.8 ensures each selection lower-bounds the true gain in MST weight.
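The greedy reranking loop can be sketched as follows. The surrogate gain used here, relevance plus a weighted best-similarity to the already-selected set, is a plausible Prim-style instantiation, not the paper's exact scoring; the trade-off knob `lam` and the function name are assumptions.

```python
import numpy as np

def greedy_coherent_topk(relevance, sim, k, lam=0.5):
    """Greedily select k candidates balancing relevance and coherence.

    relevance: (n,) relevance score of each candidate to the query.
    sim: (n, n) pairwise candidate similarity matrix.
    lam: assumed trade-off weight between relevance and coherence.

    Seeds with the most relevant candidate; each step adds the candidate
    maximising relevance + lam * (max similarity to the selected set),
    mirroring how Prim's algorithm grows an MST one edge at a time.
    """
    n = len(relevance)
    selected = [int(np.argmax(relevance))]
    while len(selected) < min(k, n):
        best, best_gain = None, -np.inf
        for c in range(n):
            if c in selected:
                continue
            gain = relevance[c] + lam * max(sim[c][s] for s in selected)
            if gain > best_gain:
                best, best_gain = c, gain
        selected.append(best)
    return selected

relevance = np.array([0.9, 0.2, 0.8, 0.1])
sim = np.array([[0.0, 0.9, 0.1, 0.0],
                [0.9, 0.0, 0.0, 0.0],
                [0.1, 0.0, 0.0, 0.0],
                [0.0, 0.0, 0.0, 0.0]])
picked = greedy_coherent_topk(relevance, sim, k=2, lam=1.0)
# Candidate 1 beats the more relevant candidate 2 because it coheres
# strongly with the seed candidate 0.
```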
| Module | Degradation if omitted | Impact |
|---|---|---|
| Coherent reranking ("w/o CR") | 3–6 pts (P@15) | Decreased set coherence |
| Hierarchical interaction (HIN) | ≈20 pts (WebTable, P@15) | Catastrophic on web tables |
| Hypergraph structure ("w/o HG") | 18–25 pts (structured corpora) | Major drop on structured data |
5. Empirical Evaluation
Extensive experiments validate HyperJoin on multiple benchmarks (USA, CAN, UK_SG government portals; WebTable) against baselines including JOSIE, LSH Ensemble, DeepJoin, Snoopy, and Omnimatch. Precision@K and Recall@K serve as the evaluation metrics.
At K = 15, HyperJoin achieves mean gains of 21.4 percentage points in Precision@15 and 17.2 in Recall@15 over the best-performing baseline. On the UK_SG benchmark, its Precision@15 is considerably higher than that of the closest prior method. Ablation further confirms the indispensability of each system component (Liu et al., 3 Jan 2026).
6. Limitations and Prospective Directions
Two principal limitations are noted:
- The reliance on LLM prompting for inter-table hyperedge generation may limit scalability and introduces a dependency on prompt design and model quality.
- Hypergraph size and complexity increase with the data lake's scale, which can become a bottleneck.
Future work includes learning hyperedge weights directly from data, supporting numeric and temporal join types, introducing efficient hypergraph sampling for very large data lakes, and generalizing the hypergraph-plus-reranking paradigm to tasks such as union-based search and semantic schema matching. This suggests that the HyperJoin framework is extensible beyond join discovery and could underpin general-purpose multi-table reasoning in data lakes (Liu et al., 3 Jan 2026).