
HyperJoin: LLM-Augmented Hypergraph Discovery

Updated 10 January 2026
  • HyperJoin is an LLM-augmented hypergraph framework designed for joinable table discovery in large-scale data lakes.
  • It employs a Hierarchical Interaction Network with table-ID and Laplacian encodings to integrate column and hyperedge features.
  • Empirical results reveal significant improvements in precision and recall over baseline methods, though challenges in scalability and prompt reliance remain.

HyperJoin is an LLM-augmented hypergraph link prediction framework designed for joinable table discovery in large-scale data lakes. By modeling columns and their structural relationships through a combinatorial hypergraph architecture and leveraging a Hierarchical Interaction Network (HIN) for representation learning, HyperJoin significantly improves both the accuracy and coherence of discovered joinable columns. Its design directly addresses limitations of earlier LLM-based and pairwise similarity-based methods.

1. Formal Problem Statement and Hypergraph Modeling

Let $\mathcal{H} = (\mathcal{V},\,\mathcal{E},\,\mathbf{X}^v,\,\boldsymbol{\Pi})$ denote the hypergraph constructed to represent a data lake, where $\mathcal{V} = \{v_i\}$ is the set of column nodes and $\mathcal{E} = \{e_j\}$ is the set of hyperedges, partitioned into intra-table and inter-table types. An intra-table hyperedge $e_T^{\rm intra} = \{v_i \mid v_i \text{ is a column of } T\}$ models all columns within table $T$. Inter-table hyperedges $e_k^{\rm inter} \subseteq \mathcal{V}$ encode sets of joinable columns, potentially augmented via LLM-driven schema-variant detection. The incidence matrix $\boldsymbol{\Pi} \in \{0,1\}^{|\mathcal{V}| \times |\mathcal{E}|}$ formalizes these relationships.

Joinable table discovery is framed as a link prediction problem: for a query column $C_q$ (node $v_q$), the objective is to identify new inter-table hyperedges that should connect $v_q$ to other columns. This formulation promotes richer context-aware reasoning relative to prior approaches based only on isolated or pairwise modeling of columns (Liu et al., 3 Jan 2026).
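The incidence matrix $\boldsymbol{\Pi}$ above can be made concrete with a minimal sketch; the table and column names below are invented for illustration and are not from the paper:

```python
# Sketch: building the hypergraph incidence matrix Pi for a toy data lake.
# Table and column names here are illustrative, not from the paper.
tables = {
    "orders":    ["order_id", "customer_id", "total"],
    "customers": ["customer_id", "name"],
}

# Column nodes: one per (table, column) pair.
nodes = [(t, c) for t, cols in tables.items() for c in cols]
node_idx = {n: i for i, n in enumerate(nodes)}

# Intra-table hyperedges: all columns of each table.
intra = [[node_idx[(t, c)] for c in cols] for t, cols in tables.items()]
# One inter-table hyperedge: the joinable customer_id columns.
inter = [[node_idx[("orders", "customer_id")],
          node_idx[("customers", "customer_id")]]]

edges = intra + inter
# Incidence matrix Pi in {0,1}^{|V| x |E|}: Pi[v][e] = 1 iff node v is in hyperedge e.
Pi = [[1 if v in e else 0 for e in edges] for v in range(len(nodes))]
```

With two tables and one inter-table join, $\boldsymbol{\Pi}$ is $5 \times 3$; the inter-table column of $\boldsymbol{\Pi}$ has exactly the two joinable `customer_id` nodes set.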

2. Hypergraph Construction and LLM-Augmented Schema Augmentation

The hypergraph is constructed in a multi-stage process:

  • Node Feature Engineering: Each node $v$ is initialized with a feature vector $\mathbf{x}_v^v \in \mathbb{R}^d$ synthesizing the table name, column name, and cell-value statistics through a learned weighted sum and a projection layer.
  • Intra-table Hyperedges: For every table $T$, construct $e_T^{\rm intra}$ containing all column nodes of $T$.
  • LLM-Augmented Inter-table Hyperedges: Key columns are detected via keyness heuristics. For each candidate, an LLM is prompted to enumerate semantically equivalent name variants (e.g., expansions, abbreviations, synonyms). These variants, together with exact matches, are assembled into a join graph $\mathcal{G}_{\rm join}$ whose nodes represent key columns and whose edges indicate equivalence. Each connected component $\mathcal{C}_k$ of $\mathcal{G}_{\rm join}$ yields an inter-table hyperedge $e_k^{\rm inter} = \mathcal{C}_k$.

This LLM augmentation enables HyperJoin to overcome brittle reliance on exact heuristics and latent pairwise signals, supporting robust discovery across diverse naming conventions and schema variations (Liu et al., 3 Jan 2026).
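The component-to-hyperedge step can be sketched with a small union-find; the key columns and the LLM-judged equivalences below are mocked for illustration:

```python
# Sketch: forming inter-table hyperedges as connected components of the
# join graph G_join. The LLM-proposed equivalences are mocked here.
from collections import defaultdict

# Hypothetical key columns as (table, column) pairs.
key_cols = [("orders", "cust_id"), ("customers", "customer_id"),
            ("invoices", "customer_id"), ("products", "sku")]
# Edges of G_join: pairs a (mocked) LLM judged semantically equivalent.
equiv_edges = [(("orders", "cust_id"), ("customers", "customer_id")),
               (("customers", "customer_id"), ("invoices", "customer_id"))]

# Union-find over key columns, with path halving.
parent = {c: c for c in key_cols}

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

for a, b in equiv_edges:
    union(a, b)

# Each connected component C_k with more than one member becomes
# one inter-table hyperedge e_k.
components = defaultdict(set)
for c in key_cols:
    components[find(c)].add(c)
inter_hyperedges = [comp for comp in components.values() if len(comp) > 1]
```

Here the three `customer_id`-like columns collapse into a single three-node hyperedge, while the unmatched `sku` column forms no inter-table hyperedge.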

3. Hierarchical Interaction Network Architecture

The centerpiece of HyperJoin’s representation learning is the Hierarchical Interaction Network (HIN), which propagates and integrates information at both column (node) and hyperedge levels in the hypergraph.

Positional Encodings

Node features are enhanced via two positional encodings:

  • Table-ID Encoding: For node $v$ in table $t(v)$, $\mathbf{h}_v^{\rm tblPE} = \mathbf{x}_v^v + \alpha\,\mathbf{E}_{\rm tbl}[t(v)]$.
  • Laplacian Encoding: Given the adjacency matrix $A$ of the pairwise join graph, compute the normalized Laplacian $L = I - D^{-1/2} A D^{-1/2}$, extract its dominant eigenvectors, and map them to each column node via an MLP. The unified embedding is $\mathbf{h}_v^{(0)} = \mathbf{x}_v^v + \alpha\,\mathbf{E}_{\rm tbl}[t(v)] + \beta\,\mathbf{PE}_{\rm col}(v)$.
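The Laplacian encoding can be sketched in a few lines of numpy; the toy adjacency below is hypothetical, and the learned MLP projection is omitted:

```python
# Sketch: normalized-Laplacian positional encoding on a toy pairwise join
# graph. Only the eigenvector extraction is shown; the MLP is omitted.
import numpy as np

# Adjacency of a hypothetical 4-column join graph (node 3 is isolated).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 0, 0]], dtype=float)

deg = A.sum(axis=1)
with np.errstate(divide="ignore"):
    d_inv_sqrt = np.where(deg > 0, deg ** -0.5, 0.0)
D_inv_sqrt = np.diag(d_inv_sqrt)

# L = I - D^{-1/2} A D^{-1/2}
L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt

# Eigendecomposition of the symmetric Laplacian; keep k eigenvectors as
# the per-column positional signal fed to the MLP.
eigvals, eigvecs = np.linalg.eigh(L)
k = 2
pe_col = eigvecs[:, :k]  # one k-dimensional positional vector per column node
```

Guarding the $D^{-1/2}$ computation matters in practice: columns with no pairwise join neighbors have zero degree, and the guard leaves their Laplacian row as the identity row.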

Node and Hyperedge Updates

  • Node-Transform Layers: Perform linear transformation and normalization on node states.
  • Hyperedge Pooling and Specialized Updates: Aggregate node embeddings by mean within each hyperedge (separately for intra- and inter-table types) and apply learnable projections.

Global Hyperedge Mixing

Construct a hyperedge adjacency matrix $A_{\rm he} = \mathrm{RowNormalize}(\boldsymbol{\Pi}^\top \boldsymbol{\Pi} - \mathrm{diag}(\boldsymbol{\Pi}^\top \boldsymbol{\Pi}))$. Stack multiple structure-aware attention mixer layers, which integrate the structural signal of $A_{\rm he}$ into the attention computation and produce refined hyperedge embeddings $z_{e_j}^{\rm out}$.
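The hyperedge adjacency construction is mechanical; a minimal sketch on a toy incidence matrix (invented for illustration):

```python
# Sketch: A_he = RowNormalize(Pi^T Pi - diag(Pi^T Pi)) on a toy hypergraph
# with 4 column nodes and 3 hyperedges.
import numpy as np

Pi = np.array([[1, 0, 1],
               [1, 0, 0],
               [0, 1, 1],
               [0, 1, 0]], dtype=float)

overlap = Pi.T @ Pi                   # entry (i, j) = #nodes shared by hyperedges i, j
overlap -= np.diag(np.diag(overlap))  # zero out self-overlap
row_sums = overlap.sum(axis=1, keepdims=True)
A_he = np.divide(overlap, row_sums, out=np.zeros_like(overlap),
                 where=row_sums > 0)  # row-normalize, guarding empty rows
```

Each row of $A_{\rm he}$ then sums to 1 (or 0 for a hyperedge overlapping nothing), giving the structural weights the attention mixer layers consume.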

Backpropagation to Columns

Update each column node’s state with aggregated information from connected hyperedges:

$$\mathbf{h}_v^{\rm hyp} = \frac{1}{|\mathcal{N}(v)|}\sum_{j\in\mathcal{N}(v)} W_{h2c}\, z_{e_j}^{\rm out}, \qquad \mathbf{h}_v^{\rm final} = \mathrm{L2Norm}\!\left(\mathbf{h}_v^{(0)} + \mathbf{h}_v^{\rm hyp}\right)$$
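As a numerical sketch of this update (random toy tensors in place of learned weights and real embeddings):

```python
# Sketch of the hyperedge-to-column update: average the projected
# embeddings z_e of the incident hyperedges, add the residual h^(0),
# and L2-normalize. Dimensions and values are toy stand-ins.
import numpy as np

d = 4
rng = np.random.default_rng(0)
W_h2c = rng.normal(size=(d, d))        # learnable projection (random here)
h0 = rng.normal(size=d)                # column's initial embedding h_v^(0)
z_out = rng.normal(size=(3, d))        # refined embeddings of 3 incident hyperedges

h_hyp = (z_out @ W_h2c.T).mean(axis=0) # mean over N(v) of W_h2c z_e
h_final = h0 + h_hyp                   # residual connection to h_v^(0)
h_final /= np.linalg.norm(h_final)     # L2Norm
```

The residual connection keeps the initial feature signal alive, and the L2 normalization makes the final embeddings directly usable for cosine-similarity ANN retrieval in the online phase.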

Ablation demonstrates this hierarchical message propagation is critical: omitting it results in drops of up to 20 points in P@15 and similar degradation in recall (Liu et al., 3 Jan 2026).

4. Coherence-Aware Candidate Selection and Reranking

The online phase introduces a coherence-aware top-$K$ column selection framework:

  • Initial Retrieval: Retrieve a candidate set $\mathcal{C}_B$ via approximate nearest-neighbor (ANN) search on the final node embeddings.
  • Objective Function: Maximize an aggregate score over sets $R \subseteq \mathcal{C}_B$ of cardinality $K$:

$$\max_{R \subseteq \mathcal{C}_B,\, |R|=K} G(R), \qquad G(R) = \sum_{C \in R} w(C_q, C) + \lambda \cdot \mathrm{Coherence}(R)$$

with $w(i,j) = \mathrm{sim}(h_i, h_j)$. $\mathrm{Coherence}(R)$ is computed as the weight of the maximum spanning tree (MST) over the induced candidate graph.

  • NP-Hardness: Theorem 4.5 establishes that maximizing this coherence-aware objective is NP-hard for MST-based coherence metrics.
  • Greedy MST-based Reranking: A greedy algorithm inspired by Prim’s MST method reranks candidates, promoting both relevance to the query and internal set coherence. At each iteration, candidates are selected maximizing a surrogate marginal gain; Theorem 4.8 ensures each selection lower-bounds the true gain in MST weight.
| Module | Degradation if omitted (P@15 / R@15) | Impact |
| --- | --- | --- |
| Coherent reranking ("w/o CR") | 3–6 pts (P@15) | Decreased set coherence |
| Hierarchical interaction (HIN) | ≈20 pts (WebTable, P@15) | Catastrophic on web tables |
| Hypergraph structure ("w/o HG") | 18–25 pts (structured corpora) | Major drop on structured data |

5. Empirical Evaluation

Extensive experiments validate HyperJoin on multiple benchmarks (USA, CAN, and UK_SG government portals; WebTable) against baselines including JOSIE, LSH Ensemble, DeepJoin, Snoopy, and Omnimatch. Precision@K and Recall@K for $K \in \{5, 15, 25\}$ serve as the evaluation metrics.

At $K=15$, HyperJoin achieves mean gains of 21.4 percentage points in Precision@15 and 17.2 in Recall@15 over the best-performing baseline. On the UK_SG benchmark, HyperJoin attains $\approx 89\%$ Precision@15, considerably higher than the $\approx 60\%$ of the closest prior method. Ablation further confirms the indispensability of each system component (Liu et al., 3 Jan 2026).

6. Limitations and Prospective Directions

Two principal limitations are noted:

  • The reliance on LLM prompts for inter-table hyperedge generation may restrict scalability and introduces a dependence on prompt design and LLM quality.
  • Hypergraph size and complexity increase with the data lake's scale, which can become a bottleneck.

Future work includes learning hyperedge weights directly from data, supporting numeric and temporal join types, introducing efficient hypergraph sampling for very large data lakes, and generalizing the hypergraph plus reranking paradigm to tasks such as union-based search and semantic schema matching. This suggests that the HyperJoin framework is extensible beyond join discovery and could underpin general-purpose multi-table reasoning in data lakes (Liu et al., 3 Jan 2026).
