
HyperJoin: LLM-Augmented Hypergraph Discovery

Updated 10 January 2026
  • HyperJoin is an LLM-augmented hypergraph framework designed for joinable table discovery in large-scale data lakes.
  • It employs a Hierarchical Interaction Network with table-ID and Laplacian encodings to integrate column and hyperedge features.
  • Empirical results reveal significant improvements in precision and recall over baseline methods, though challenges in scalability and prompt reliance remain.

HyperJoin is an LLM-augmented hypergraph link prediction framework designed for joinable table discovery in large-scale data lakes. By modeling columns and their structural relationships through a combinatorial hypergraph architecture and leveraging a Hierarchical Interaction Network (HIN) for representation learning, HyperJoin significantly improves both the accuracy and coherence of discovered joinable columns. Its design directly addresses limitations of earlier LLM-based and pairwise similarity-based methods.

1. Formal Problem Statement and Hypergraph Modeling

Let $\mathcal{H} = (\mathcal{V},\,\mathcal{E},\,\mathbf{X}^v,\,\boldsymbol{\Pi})$ denote the hypergraph constructed to represent a data lake, where $\mathcal{V} = \{v_i\}$ is the set of column nodes and $\mathcal{E} = \{e_j\}$ is the set of hyperedges, partitioned into intra-table and inter-table types. An intra-table hyperedge $e_T^{\rm intra} = \{v_i \mid v_i \text{ is a column of } T\}$ models all columns within table $T$. Inter-table hyperedges $e_k^{\rm inter} \subseteq \mathcal{V}$ encode sets of joinable columns, potentially augmented via LLM-driven schema-variant detection. The incidence matrix $\boldsymbol{\Pi} \in \{0,1\}^{|\mathcal{V}| \times |\mathcal{E}|}$ formalizes these relationships.

Joinable table discovery is framed as a link prediction problem: for a query column $C_q$ (node $v_q$), the objective is to identify new inter-table hyperedges that should connect $v_q$ to other columns. This formulation promotes richer context-aware reasoning relative to prior approaches based only on isolated or pairwise modeling of columns (Liu et al., 3 Jan 2026).
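The incidence matrix $\boldsymbol{\Pi}$ above can be made concrete with a minimal sketch; the table and column names below are invented for illustration and are not from the paper:

```python
# Sketch: building the hypergraph incidence matrix Pi for a toy data lake.
# Table and column names here are illustrative, not from the paper.
tables = {
    "orders":    ["order_id", "customer_id", "total"],
    "customers": ["customer_id", "name"],
}

# Column nodes: one per (table, column) pair.
nodes = [(t, c) for t, cols in tables.items() for c in cols]
node_idx = {n: i for i, n in enumerate(nodes)}

# Intra-table hyperedges: all columns of each table.
intra = [[node_idx[(t, c)] for c in cols] for t, cols in tables.items()]
# One inter-table hyperedge: the joinable customer_id columns.
inter = [[node_idx[("orders", "customer_id")],
          node_idx[("customers", "customer_id")]]]

edges = intra + inter
# Incidence matrix Pi in {0,1}^{|V| x |E|}: Pi[v][e] = 1 iff node v is in hyperedge e.
Pi = [[1 if v in e else 0 for e in edges] for v in range(len(nodes))]
```

With two tables and one inter-table join, $\boldsymbol{\Pi}$ is $5 \times 3$; the inter-table column of $\boldsymbol{\Pi}$ has exactly the two joinable `customer_id` nodes set.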

2. Hypergraph Construction and LLM-Augmented Schema Augmentation

The hypergraph is constructed in a multi-stage process:

  • Node Feature Engineering: Each node $v$ is initialized with a feature vector $\mathbf{x}_v^v \in \mathbb{R}^d$ synthesizing the table name, column name, and cell-value statistics through a learned weighted sum and a projection layer.
  • Intra-table Hyperedges: For every table $T$, construct $e_T^{\rm intra}$ containing all column nodes of $T$.
  • LLM-Augmented Inter-table Hyperedges: Key columns are detected via keyness heuristics. For each candidate, an LLM is prompted to enumerate semantically equivalent name variants (e.g., expansions, abbreviations, synonyms). These variants, together with exact matches, are assembled into a join graph $\mathcal{G}_{\rm join}$ whose nodes represent key columns and whose edges indicate equivalence. Each connected component $\mathcal{C}_k$ of $\mathcal{G}_{\rm join}$ yields an inter-table hyperedge $e_k^{\rm inter} = \mathcal{C}_k$.

This LLM augmentation enables HyperJoin to overcome brittle reliance on exact heuristics and latent pairwise signals, supporting robust discovery across diverse naming conventions and schema variations (Liu et al., 3 Jan 2026).
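The component-to-hyperedge step can be sketched with a small union-find; the key columns and the LLM-judged equivalences below are mocked for illustration:

```python
# Sketch: forming inter-table hyperedges as connected components of the
# join graph G_join. The LLM-proposed equivalences are mocked here.
from collections import defaultdict

# Hypothetical key columns as (table, column) pairs.
key_cols = [("orders", "cust_id"), ("customers", "customer_id"),
            ("invoices", "customer_id"), ("products", "sku")]
# Edges of G_join: pairs a (mocked) LLM judged semantically equivalent.
equiv_edges = [(("orders", "cust_id"), ("customers", "customer_id")),
               (("customers", "customer_id"), ("invoices", "customer_id"))]

# Union-find over key columns, with path halving.
parent = {c: c for c in key_cols}

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

for a, b in equiv_edges:
    union(a, b)

# Each connected component C_k with more than one member becomes
# one inter-table hyperedge e_k.
components = defaultdict(set)
for c in key_cols:
    components[find(c)].add(c)
inter_hyperedges = [comp for comp in components.values() if len(comp) > 1]
```

Here the three `customer_id`-like columns collapse into a single three-node hyperedge, while the unmatched `sku` column forms no inter-table hyperedge.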

3. Hierarchical Interaction Network Architecture

The centerpiece of HyperJoin’s representation learning is the Hierarchical Interaction Network (HIN), which propagates and integrates information at both column (node) and hyperedge levels in the hypergraph.

Positional Encodings

Node features are enhanced via two positional encodings:

  • Table-ID Encoding: For node $v$ in table $t(v)$, $\mathbf{h}_v^{\rm tblPE} = \mathbf{x}_v^v + \alpha\,\mathbf{E}_{\rm tbl}[t(v)]$.
  • Laplacian Encoding: Given the adjacency matrix $A$ of the pairwise join graph, compute the normalized Laplacian $L = I - D^{-1/2} A D^{-1/2}$, extract its dominant eigenvectors, and map them to each column node via an MLP. The unified embedding is $\mathbf{h}_v^{(0)} = \mathbf{x}_v^v + \alpha\,\mathbf{E}_{\rm tbl}[t(v)] + \beta\,\mathbf{PE}_{\rm col}(v)$.
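The Laplacian encoding can be sketched in a few lines of numpy; the toy adjacency below is hypothetical, and the learned MLP projection is omitted:

```python
# Sketch: normalized-Laplacian positional encoding on a toy pairwise join
# graph. Only the eigenvector extraction is shown; the MLP is omitted.
import numpy as np

# Adjacency of a hypothetical 4-column join graph (node 3 is isolated).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 0, 0]], dtype=float)

deg = A.sum(axis=1)
with np.errstate(divide="ignore"):
    d_inv_sqrt = np.where(deg > 0, deg ** -0.5, 0.0)
D_inv_sqrt = np.diag(d_inv_sqrt)

# L = I - D^{-1/2} A D^{-1/2}
L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt

# Eigendecomposition of the symmetric Laplacian; keep k eigenvectors as
# the per-column positional signal fed to the MLP.
eigvals, eigvecs = np.linalg.eigh(L)
k = 2
pe_col = eigvecs[:, :k]  # one k-dimensional positional vector per column node
```

Guarding the $D^{-1/2}$ computation matters in practice: columns with no pairwise join neighbors have zero degree, and the guard leaves their Laplacian row as the identity row.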

Node and Hyperedge Updates

  • Node-Transform Layers: Perform linear transformation and normalization on node states.
  • Hyperedge Pooling and Specialized Updates: Aggregate node embeddings by mean within each hyperedge (separately for intra- and inter-table types) and apply learnable projections.

Global Hyperedge Mixing

Construct a hyperedge adjacency matrix $A_{\rm he} = \mathrm{RowNormalize}(\boldsymbol{\Pi}^\top \boldsymbol{\Pi} - \mathrm{diag}(\boldsymbol{\Pi}^\top \boldsymbol{\Pi}))$. Stack multiple structure-aware attention mixer layers, which integrate the structural signal of $A_{\rm he}$ into the attention computation and produce refined hyperedge embeddings $z_{e_j}^{\rm out}$.
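The hyperedge adjacency construction is mechanical; a minimal sketch on a toy incidence matrix (invented for illustration):

```python
# Sketch: A_he = RowNormalize(Pi^T Pi - diag(Pi^T Pi)) on a toy hypergraph
# with 4 column nodes and 3 hyperedges.
import numpy as np

Pi = np.array([[1, 0, 1],
               [1, 0, 0],
               [0, 1, 1],
               [0, 1, 0]], dtype=float)

overlap = Pi.T @ Pi                   # entry (i, j) = #nodes shared by hyperedges i, j
overlap -= np.diag(np.diag(overlap))  # zero out self-overlap
row_sums = overlap.sum(axis=1, keepdims=True)
A_he = np.divide(overlap, row_sums, out=np.zeros_like(overlap),
                 where=row_sums > 0)  # row-normalize, guarding empty rows
```

Each row of $A_{\rm he}$ then sums to 1 (or 0 for a hyperedge overlapping nothing), giving the structural weights the attention mixer layers consume.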

Backpropagation to Columns

Update each column node’s state with aggregated information from connected hyperedges:

$$\mathbf{h}_v^{\rm hyp} = \frac{1}{|\mathcal{N}(v)|}\sum_{j\in\mathcal{N}(v)} W_{h2c}\, z_{e_j}^{\rm out}, \qquad \mathbf{h}_v^{\rm final} = \mathrm{L2Norm}\!\left(\mathbf{h}_v^{(0)} + \mathbf{h}_v^{\rm hyp}\right)$$
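As a numerical sketch of this update (random toy tensors in place of learned weights and real embeddings):

```python
# Sketch of the hyperedge-to-column update: average the projected
# embeddings z_e of the incident hyperedges, add the residual h^(0),
# and L2-normalize. Dimensions and values are toy stand-ins.
import numpy as np

d = 4
rng = np.random.default_rng(0)
W_h2c = rng.normal(size=(d, d))        # learnable projection (random here)
h0 = rng.normal(size=d)                # column's initial embedding h_v^(0)
z_out = rng.normal(size=(3, d))        # refined embeddings of 3 incident hyperedges

h_hyp = (z_out @ W_h2c.T).mean(axis=0) # mean over N(v) of W_h2c z_e
h_final = h0 + h_hyp                   # residual connection to h_v^(0)
h_final /= np.linalg.norm(h_final)     # L2Norm
```

The residual connection keeps the initial feature signal alive, and the L2 normalization makes the final embeddings directly usable for cosine-similarity ANN retrieval in the online phase.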

Ablation demonstrates this hierarchical message propagation is critical: omitting it results in drops of up to 20 points in P@15 and similar degradation in recall (Liu et al., 3 Jan 2026).

4. Coherence-Aware Candidate Selection and Reranking

The online phase introduces a coherence-aware top-$K$ column selection framework:

  • Initial Retrieval: Retrieve a candidate set $\mathcal{C}_B$ via approximate nearest-neighbor (ANN) search on the final node embeddings.
  • Objective Function: Maximize an aggregate score over sets $R \subseteq \mathcal{C}_B$ of cardinality $K$:

$$\max_{R \subseteq \mathcal{C}_B,\, |R|=K} G(R), \qquad G(R) = \sum_{C \in R} w(C_q, C) + \lambda \cdot \mathrm{Coherence}(R)$$

with $w(i,j) = \mathrm{sim}(h_i, h_j)$. $\mathrm{Coherence}(R)$ is computed as the weight of the maximum spanning tree (MST) over the induced candidate graph.

  • NP-Hardness: Theorem 4.5 establishes that maximizing this coherence-aware objective is NP-hard for MST-based coherence metrics.
  • Greedy MST-based Reranking: A greedy algorithm inspired by Prim’s MST method reranks candidates, promoting both relevance to the query and internal set coherence. At each iteration, candidates are selected maximizing a surrogate marginal gain; Theorem 4.8 ensures each selection lower-bounds the true gain in MST weight.
| Module | Degradation if omitted (P@15 / R@15) | Impact |
| --- | --- | --- |
| Coherent reranking ("w/o CR") | 3–6 pts (P@15) | Decreased set coherence |
| Hierarchical interaction (HIN) | ≈20 pts (WebTable, P@15) | Catastrophic on web tables |
| Hypergraph structure ("w/o HG") | 18–25 pts (structured corpora) | Major drop on structured data |

5. Empirical Evaluation

Extensive experiments validate HyperJoin on multiple benchmarks (USA, CAN, and UK_SG government portals; WebTable) against baselines including JOSIE, LSH Ensemble, DeepJoin, Snoopy, and Omnimatch. Precision@K and Recall@K for $K \in \{5, 15, 25\}$ serve as the evaluation metrics.

At $K=15$, HyperJoin achieves mean gains of 21.4 percentage points in Precision@15 and 17.2 in Recall@15 over the best-performing baseline. On the UK_SG benchmark, HyperJoin attains $\approx 89\%$ Precision@15, considerably higher than the $\approx 60\%$ of the closest prior method. Ablation further confirms the indispensability of each system component (Liu et al., 3 Jan 2026).

6. Limitations and Prospective Directions

Two principal limitations are noted:

  • The reliance on LLM prompts for inter-table hyperedge generation may restrict scalability and introduces a dependence on prompt design and LLM quality.
  • Hypergraph size and complexity increase with the data lake's scale, which can become a bottleneck.

Future work includes learning hyperedge weights directly from data, supporting numeric and temporal join types, introducing efficient hypergraph sampling for very large data lakes, and generalizing the hypergraph plus reranking paradigm to tasks such as union-based search and semantic schema matching. This suggests that the HyperJoin framework is extensible beyond join discovery and could underpin general-purpose multi-table reasoning in data lakes (Liu et al., 3 Jan 2026).
