Proximity Graph Table Union Search (PGTUS)
- The paper introduces a multi-stage framework that approximates costly bipartite matching using centroid clustering and proximity graph indexing to score table unionability.
- PGTUS leverages an HNSW proximity graph for efficient nearest neighbor retrieval and employs tiered candidate pruning to drastically reduce the number of expensive matchings.
- Empirical evaluations show that PGTUS maintains >95% recall while achieving up to 35.3× speedup and significant memory savings over existing table union search methods.
Proximity Graph-based Table Union Search (PGTUS) is a highly efficient multi-stage framework for finding semantically compatible tables that can be merged (“unioned”) with a given query table. The approach is motivated by the prohibitive cost of maximum-weight bipartite matching required for computing the unionability score of multi-vector table representations. PGTUS accelerates table union search while preserving retrieval quality by integrating centroid-based clustering, proximity graph indexing, and tiered pruning strategies based on both classical and novel algorithmic insights.
1. Formalization of the Table Union Search Problem
Given a collection of tables, each table $T_i$ is represented by a set of column embeddings $X_i$. Let $\mathcal{T} = \{T_1, \dots, T_n\}$ denote the repository of tables, and $\mathcal{X} = \{X_1, \dots, X_n\}$ be the corresponding repository of column vector-sets, where $X_i$ comprises one embedding per column of $T_i$.
A key operation in table union search is establishing a t-matching: for query and candidate vector-sets $Q$ and $X$ and a similarity threshold $t$, a $t$-matching is a one-to-one mapping $M \subseteq Q \times X$, with $|M| \le \min(|Q|, |X|)$ and each vector appearing in at most one pair, such that every match $(q, x) \in M$ satisfies $\mathrm{sim}(q, x) \ge t$. The unionability score is defined as the maximum similarity sum over all maximum $t$-matchings: $U(Q, X) = \max_{M} \sum_{(q, x) \in M} \mathrm{sim}(q, x)$. The objective is: given a query table with vector-set $Q$, threshold $t$, and integer $k$, return the top-$k$ tables in $\mathcal{X}$ maximizing $U(Q, X)$.
Multi-vector models (column-wise embeddings) offer high recall in retrieval, but exact computation of $U(Q, X)$ via the Hungarian algorithm is cubic in the number of columns, motivating efficient approximate search strategies.
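The exact score can be illustrated with a small stdlib-only sketch that enumerates $t$-matchings by brute force; the `cosine` helper and the enumeration strategy are illustrative stand-ins for the paper's cubic-time Hungarian solver, feasible only for tiny tables:

```python
from itertools import combinations, permutations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

def unionability(Q, X, t):
    """Exact unionability score U(Q, X): the maximum similarity sum over
    all maximum t-matchings between query columns Q and candidate columns X.
    Enumerates injective partial mappings by brute force; the paper instead
    uses the O(m^3) Hungarian algorithm."""
    m, n = len(Q), len(X)
    for r in range(min(m, n), 0, -1):            # try largest matchings first
        best = None
        for qs in combinations(range(m), r):     # which query columns match
            for xs in permutations(range(n), r): # which candidate columns they take
                sims = [cosine(Q[q], X[x]) for q, x in zip(qs, xs)]
                if all(s >= t for s in sims):    # every pair must clear t
                    score = sum(sims)
                    best = score if best is None else max(best, score)
        if best is not None:
            return best                          # maximum cardinality reached
    return 0.0                                   # no pair clears the threshold
```

For example, with `Q = [[1, 0], [0, 1]]` and `X = [[1, 0], [0, 1], [1, 1]]` at `t = 0.9`, only the two identical columns clear the threshold, so the score is their similarity sum.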
2. Proximity Graph and Centroid-Based Indexing
PGTUS exploits centroid-based clustering and proximity graph indexing to support efficient candidate retrieval in the high-dimensional multi-vector search space.
- Clustering and Centroids: All column vectors are clustered using $k$-means into centroids $C = \{c_1, \dots, c_K\}$. An inverted index maps each centroid to its assigned vectors, with $\phi(x)$ denoting the centroid nearest to vector $x$.
- Table-wise Centroid Partition: For each table $T_i$, a partition vector $P_i$ records the count of vectors in $X_i$ assigned to each centroid $c_j$.
- Proximity Graph (HNSW): Centroids are indexed via an HNSW (Hierarchical Navigable Small World) graph, in which each centroid is connected to its nearest neighbors under cosine similarity. At query time, the HNSW structure is traversed to retrieve the top-ranked centroids for each query vector.
This proximity graph architecture underpins efficient nearest neighbor searches for the subsequent candidate selection stage.
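The preparation steps above can be sketched in plain Python; this is a hedged illustration, with a textbook Lloyd-style k-means substituting for a tuned clustering library and no HNSW layer (exact centroid scans would substitute for graph traversal at this scale):

```python
import random

def squared_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(vectors, k, iters=20, seed=0):
    """Plain Lloyd-style k-means over column embeddings; a stand-in for
    the clustering step of PGTUS (any library implementation would do)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assign = [0] * len(vectors)
    for _ in range(iters):
        # assignment step: nearest centroid per vector
        assign = [min(range(k), key=lambda j: squared_dist(v, centroids[j]))
                  for v in vectors]
        # update step: recompute each centroid as its cluster mean
        for j in range(k):
            members = [v for v, a in zip(vectors, assign) if a == j]
            if members:  # keep the old centroid if its cluster emptied
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids, assign

def inverted_index(assign):
    """Map each centroid id to the ids of the vectors assigned to it."""
    inv = {}
    for vec_id, cid in enumerate(assign):
        inv.setdefault(cid, []).append(vec_id)
    return inv
```

The per-table partition counts follow directly by restricting `assign` to one table's column ids and counting occurrences of each centroid.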
3. Multi-Stage Candidate Selection and Pruning
The core of PGTUS consists of a three-stage pipeline designed to minimize reliance on full bipartite matching:
3.1 Refinement via Many-to-One Matching
PGTUS first identifies a restricted candidate set via an approximate many-to-one $t$-matching. Every candidate's vector set is partitioned into centroid-groups (from the clustering phase). The algorithm seeks a mapping that assigns query vectors to centroid-groups so that no group receives more query vectors than it has columns (its capacity), maximizing the summed centroid similarities.
Algorithm 1 (Refinement):
- Build an HNSW index on centroids.
- For each query vector, retrieve its top-ranked centroids via the HNSW index and store the similarity scores in a max-heap.
- Track, for each candidate table, the cumulative score and centroid usage via hash tables.
- While the heap is nonempty, process candidates, updating scores only if group capacity allows.
- Return the tables with the highest approximate scores.
Time complexity is $O(|Q| \cdot n_{\text{probe}} \cdot \bar{\ell})$, where $n_{\text{probe}}$ is the number of centroids probed per query vector and $\bar{\ell}$ is the average inverted list size.
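A one-shot sketch of the refinement stage, under stated simplifications: exact centroid scoring replaces HNSW retrieval, `table_partition[tid]` is assumed to map centroid ids to that table's column counts, and the names `n_probe` and `c_size` are illustrative parameters, not the paper's:

```python
import heapq

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sum(a * a for a in u) ** 0.5 * sum(b * b for b in v) ** 0.5)

def refine(query_vecs, centroids, table_partition, n_probe, c_size):
    """Approximate many-to-one t-matching: score each candidate table by
    greedily assigning query vectors to its centroid groups, respecting
    each group's capacity; return the c_size best-scoring tables."""
    heap = []  # max-heap of (negated similarity, query index, centroid id)
    for qi, q in enumerate(query_vecs):
        sims = sorted(((cosine(q, c), ci) for ci, c in enumerate(centroids)),
                      reverse=True)
        for s, ci in sims[:n_probe]:
            heapq.heappush(heap, (-s, qi, ci))
    score   = {tid: 0.0   for tid in table_partition}  # cumulative scores
    used    = {tid: {}    for tid in table_partition}  # centroid usage per table
    matched = {tid: set() for tid in table_partition}  # query vectors already matched
    while heap:  # process (query vector, centroid) pairs best-first
        neg_s, qi, ci = heapq.heappop(heap)
        for tid, part in table_partition.items():
            cap = part.get(ci, 0)
            if qi in matched[tid] or used[tid].get(ci, 0) >= cap:
                continue  # group full or query vector already assigned
            used[tid][ci] = used[tid].get(ci, 0) + 1
            matched[tid].add(qi)
            score[tid] += -neg_s
    return sorted(score, key=score.get, reverse=True)[:c_size]
```

Because scores accumulate per table independently, a single popped pair can contribute to every candidate whose partition still has capacity for that centroid.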
3.2 Filtering Using Matching Bounds
For the refined candidate set (now a small fraction of the repository), PGTUS applies aggressive pruning by bounding the possible maximum matching score.
Algorithm 2 (BoundsForMWMTO):
- For each group in a candidate's partition, record the group capacity (the number of the candidate's columns assigned to that centroid).
- For each query vector in $Q$, compute its similarity to each group's centroid.
- The lower bound is computed by a greedy capacity-respecting assignment; the upper bound is the sum of each query vector's top similarity value, ignoring capacities.
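A minimal sketch of the bound computation, assuming `sims[qi][g]` holds the similarity of query vector `qi` to centroid group `g` and `capacity[g]` the candidate's column count in that group (the function name and argument layout are illustrative):

```python
def bounds_for_mwmto(sims, capacity):
    """Lower/upper bounds on the maximum-weight many-to-one matching score.
    Upper bound: each query vector takes its best group, capacities ignored.
    Lower bound: greedy best-first assignment that honors group capacities."""
    # upper bound: best similarity per query vector, no capacity constraint
    upper = sum(max(row) for row in sims)
    # lower bound: process (sim, query, group) triples best-first
    triples = sorted(((s, qi, g)
                      for qi, row in enumerate(sims)
                      for g, s in enumerate(row)), reverse=True)
    used, taken, lower = {}, set(), 0.0
    for s, qi, g in triples:
        if qi in taken or used.get(g, 0) >= capacity[g]:
            continue  # query vector already matched, or group is full
        taken.add(qi)
        used[g] = used.get(g, 0) + 1
        lower += s
    return lower, upper
```

Any feasible matching score lies between the two bounds, which is what the subsequent pruning stages exploit.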
Algorithm 3 (BasePrune):
- Use a min-heap to maintain the current top- candidates.
- For each candidate, compute bounds:
- If the lower bound exceeds the minimum on the heap, the candidate is guaranteed to qualify: retrieve the exact unionability score (via the Hungarian algorithm) and replace the minimum if the exact score is higher.
- Otherwise, if the upper bound still meets or exceeds the minimum, the candidate might qualify: the exact score is computed and possibly replaces the minimum.
- Otherwise, the candidate is discarded without any exact matching.
Only a small subset (final candidates) undergoes the full cubic-time matching.
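The pruning loop can be sketched as follows; `bounds_fn` and `exact_fn` are hypothetical callables standing in for the bound computation and the Hungarian solver, and the guaranteed-qualify and might-qualify branches are folded together since both end in an exact computation:

```python
import heapq

def base_prune(candidates, k, bounds_fn, exact_fn):
    """Keep a min-heap of the current top-k exact scores; pay for the
    expensive exact (Hungarian) score only when a candidate's upper bound
    says it could still enter the top-k."""
    heap = []  # min-heap of (exact_score, candidate_id); weakest on top
    for cid in candidates:
        if len(heap) < k:                  # fill the heap unconditionally
            heapq.heappush(heap, (exact_fn(cid), cid))
            continue
        lower, upper = bounds_fn(cid)
        threshold = heap[0][0]             # current k-th best exact score
        if upper < threshold:
            continue                       # provably outside top-k: no exact matching
        score = exact_fn(cid)              # promising: pay the cubic cost
        if score > threshold:
            heapq.heapreplace(heap, (score, cid))
    return sorted(heap, reverse=True)
```

In the common case most candidates fall into the `upper < threshold` branch, so the cubic-time matching runs only on a handful of survivors.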
3.3 Enhanced Pruning with Double-Ended Priority Queues
PGTUS introduces two Double-Ended Priority Queues (DEPQs) to further reduce unnecessary bound computations by closely tracking bounds across candidates.
Algorithm 4 (EnhancedPrune):
- Maintain DEPQs for the top lower and upper bounds.
- Candidates whose upper bounds fall strictly below the $k$-th largest lower bound are pruned without error.
- Candidates are updated only if they might affect the final result.
Efficiency is further increased, and this mechanism provably does not affect the correctness of the returned top-$k$.
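The pruning criterion behind the DEPQs can be shown with a one-shot sketch (the paper maintains both ends incrementally; here all bounds are computed up front, and `bounds_fn`/`exact_fn` are hypothetical callables as before):

```python
def enhanced_prune(candidates, k, bounds_fn, exact_fn):
    """The k-th largest lower bound is a safe pruning threshold: at least
    k candidates score at or above it, so anything whose upper bound is
    strictly below it cannot appear in the top-k and is discarded before
    any exact matching is attempted."""
    bounds = {cid: bounds_fn(cid) for cid in candidates}
    kth_lower = sorted((lo for lo, _ in bounds.values()), reverse=True)[k - 1]
    survivors = [cid for cid, (lo, up) in bounds.items() if up >= kth_lower]
    scored = sorted(((exact_fn(cid), cid) for cid in survivors), reverse=True)
    return scored[:k]
```

Tracking the threshold with double-ended priority queues, as PGTUS does, lets it rise as tighter lower bounds arrive, so later candidates are pruned ever more cheaply.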
4. Computational Complexity and Storage
The three-stage design ensures efficiency both in computation and storage:
- Centroid Clustering and Indexing: $k$-means clustering costs $O(NKd)$ per iteration over $N$ vectors of dimension $d$, while the inverted index requires $O(N)$ storage.
- Refinement Stage: $O(|Q| \cdot n_{\text{probe}} \cdot \bar{\ell})$ per query, with $n_{\text{probe}}$ probed centroids per query vector and average inverted list size $\bar{\ell}$; auxiliary hash tables require space linear in the number of candidates touched.
- Filtering Stage: Bounds per candidate are computed in time near-linear in the candidate's column count; full matching (rarely needed) costs $O(m^3)$, where $m$ is the number of columns per table.
- Enhanced Pruning Stage: DEPQ operations per candidate take logarithmic time. Early discards minimize bound computations.
The bulk of the computational cost is offloaded from expensive global bipartite matchings to efficient centroid- and bound-based filtering.
5. Empirical Evaluation and Quantitative Results
PGTUS has been evaluated extensively on six benchmark datasets, spanning corpora from small (hundreds of tables) to large web data (millions of tables).
| Dataset | #Tables | #Columns | #Queries | Avg Cols |
|---|---|---|---|---|
| Santos Small | 550 | 6,322 | 50 | 12.3 |
| Santos Large | 11,086 | 121,796 | 80 | 12.7 |
| TUS Small | 1,530 | 14,810 | 10 | 34.5 |
| TUS Large | 5,043 | 55,000 | 4,293 | 11.9 |
| Open Data | 13,000 | 218,000 | 1,000 | 30.5 |
| LakeBench WebTable | 2.8M | 14.8M | 1,000 | 14.6 |
- Recall@k: PGTUS preserves retrieval performance, consistently achieving >95% recall.
- Latency: PGTUS delivers 2–4× speedup over Starmie at equivalent recall. The enhanced-pruning variant attains higher gains, particularly on large, high-dispersion datasets (e.g., Open Data, WebTable), reaching speedups of up to 35.3× over brute-force baselines.
- Storage and Build Time: PGTUS index for Open Data uses 414 MB (building in 35 s), substantially improving over Starmie (1,285 MB, 5 min). On the largest benchmark, WebTable, PGTUS index occupies 30.6 GB, compared to Starmie’s 98 GB.
| Method | Open Data Storage | Open Data Build Time | WebTable Storage | WebTable Build Time |
|---|---|---|---|---|
| Starmie | 1,285 MB | 5 m | 98 GB | 2.5 h |
| Starmie (VQ) | 361 MB | 12 s | 21.8 GB | 1 h |
| PGTUS | 414 MB | 35 s | 30.6 GB | 2 h |
PGTUS(VQ), leveraging vector quantization, achieves a 3.5× reduction in memory over Starmie with a minimal drop in recall.
Parameter sensitivity analysis indicates:
- Increasing the refinement candidate size improves recall up to a plateau at about $3k$.
- Increasing the filtering candidate size offers diminishing returns beyond this threshold.
- Centroid probe sizes as small as 1–4 suffice for large web-scale data, with higher values required for low-dispersion data.
6. Relationship to Prior Work
PGTUS advances the state of the art over single-vector retrieval methods (e.g., BM25 on concatenated columns) and previous multi-vector designs like Starmie:
- Single-vector methods: Lack the granularity necessary to model column–column correspondences, resulting in reduced recall.
- Starmie: Utilizes multi-vector matchings but scales poorly for large datasets due to exhaustive bipartite matching in retrieval.
- PGTUS: By combining centroid-based preparation, efficient proximity search, and multi-stage pruning (including a novel bound-tightening DEPQ approach), achieves a 3.6–6.0× speedup with negligible recall degradation and reduces the number of expensive matchings by over 90%.
Additionally, the use of vector quantization in index storage furthers space savings with little performance penalty.
7. Limitations and Prospects
PGTUS presupposes a fixed similarity threshold across all query-candidate pairs; adaptive or context-aware thresholds remain an open avenue. The quality of centroid clustering (e.g., via K-means) directly impacts pruning and candidate grouping; datasets with highly uneven distributions of vector embeddings may necessitate more sophisticated centroid schemes, such as “cascade centroids.”
DEPQ pruning efficiency depends on achieving tight bounds. In edge cases—particularly, queries or candidates with very small or very large numbers of columns—bound tightness may deteriorate and degrade pruning efficiency.
Areas identified for future work include dynamic online updates of centroids, on-the-fly query adaptation, and acceleration of matching stages via GPU-based computation.
In conclusion, PGTUS establishes a scalable, robust, and storage-efficient methodology for high-recall table union search. By combining proximity graph construction, centroid indexing, and priority-queue-based pruning, it delivers significant improvements over previous methods in both speed and memory utilization, within the constraints of current neural multi-vector embedding paradigms.