Proximity Graph Table Union Search (PGTUS)

Updated 15 November 2025
  • The paper introduces a multi-stage framework that approximates costly bipartite matching using centroid clustering and proximity graph indexing to score table unionability.
  • PGTUS leverages an HNSW proximity graph for efficient nearest neighbor retrieval and employs tiered candidate pruning to drastically reduce the number of expensive matchings.
  • Empirical evaluations show that PGTUS maintains >95% recall while achieving up to 35.3× speedup and significant memory savings over existing table union search methods.

Proximity Graph-based Table Union Search (PGTUS) is a highly efficient multi-stage framework for finding semantically compatible tables that can be merged (“unioned”) with a given query table. The approach is motivated by the prohibitive cost of maximum-weight bipartite matching required for computing the unionability score of multi-vector table representations. PGTUS accelerates table union search while preserving retrieval quality by integrating centroid-based clustering, proximity graph indexing, and tiered pruning strategies based on both classical and novel algorithmic insights.

1. Formalization of the Table Union Search Problem

Given a collection of tables, each table $T_i$ is represented by a set of column embeddings $V_i$. Let $D_T = \{T_1, \dots, T_n\}$ denote the repository of tables, and $D_E = \{V_1, \dots, V_n\}$ be the corresponding repository of column vector-sets, where $V_i$ comprises one embedding per column of $T_i$.

A key operation in table union search is establishing a t-matching: for query and candidate vector-sets $V_i, V_j$ and a similarity threshold $\tau$, a t-matching is a one-to-one mapping $h_a: V_i' \to V_j'$ with $|V_i'| = t$, $V_i' \subseteq V_i$, $V_j' \subseteq V_j$, such that every match satisfies $\langle v_p, h_a(v_p) \rangle \ge \tau$. The unionability score $U(V_i, V_j, \tau)$ is defined as the maximum similarity sum over all maximum t-matchings:

$$U(V_i, V_j, \tau) = \max_{h_a \in M_a(V_i, V_j, \tau)} \sum_{v_p \in V_i'} \langle v_p, h_a(v_p) \rangle.$$

The objective is: given a query table $T_Q$ (with vectors $V_Q$), a threshold $\tau$, and an integer $k$, return the top-$k$ tables in $D_T$ maximizing $U(V_Q, V_i, \tau)$.

Multi-vector models (column-wise embeddings) offer high recall in retrieval, but exact computation of $U(V_Q, V_i, \tau)$ via the Hungarian algorithm is cubic in the number of columns, motivating efficient approximate search strategies.
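
The following minimal sketch (not from the paper) illustrates this exact unionability computation with SciPy's Hungarian solver; it assumes unit-normalized column embeddings and approximates the threshold constraint by zeroing out below-threshold edges before matching.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def unionability_score(V_q: np.ndarray, V_i: np.ndarray, tau: float) -> float:
    """V_q: (m_q, d) query column embeddings; V_i: (m_i, d) candidate column embeddings."""
    sim = V_q @ V_i.T                                # pairwise cosine similarities (unit vectors assumed)
    weights = np.where(sim >= tau, sim, 0.0)         # edges below the threshold contribute nothing
    rows, cols = linear_sum_assignment(weights, maximize=True)  # Hungarian algorithm, cubic in columns
    matched = sim[rows, cols]
    return float(matched[matched >= tau].sum())      # sum only above-threshold matches
```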

2. Proximity Graph and Centroid-Based Indexing

PGTUS exploits centroid-based clustering and proximity graph indexing to support efficient candidate retrieval in the high-dimensional multi-vector search space.

  • Clustering and Centroids: All vectors $E = \bigcup_i V_i$ are clustered using $K$-means into $n_c$ centroids $C_E = \{c_1, \dots, c_{n_c}\}$. An inverted index $I_v[c_x] = \{v \in E : \mu(v) = c_x\}$ maps each centroid to its assigned vectors, with $\mu(v)$ denoting the nearest centroid.
  • Table-wise Centroid Partition: For each table $T_i$, $I_w[i][c_x]$ records the count of vectors in $V_i$ assigned to centroid $c_x$.
  • Proximity Graph (HNSW): Centroids $C_E$ are indexed via an HNSW (Hierarchical Navigable Small World) graph. Each centroid $c_x$ is connected to its $L$ nearest neighbors under cosine similarity. At query time, the HNSW structure is traversed to retrieve the top $\phi_c$ centroids for each query vector.

$$w(c_x, c_y) = \langle c_x, c_y \rangle \quad\text{and}\quad N_L(c_x) = \{c_y : c_y \text{ is among the top-}L \text{ neighbors of } c_x\}$$

This proximity graph architecture underpins efficient nearest neighbor searches for the subsequent candidate selection stage.
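
As a rough illustration of this indexing stage, the sketch below assumes scikit-learn for $K$-means and the hnswlib library for the HNSW graph; names such as `build_centroid_index` are illustrative rather than taken from the paper.

```python
import numpy as np
import hnswlib
from collections import defaultdict
from sklearn.cluster import KMeans

def build_centroid_index(E: np.ndarray, vector_to_table: list, n_c: int, L: int = 16):
    """E: (N, d) array of all column embeddings; vector_to_table[i] = table id owning E[i]."""
    km = KMeans(n_clusters=n_c, n_init=10).fit(E)
    centroids = km.cluster_centers_.astype(np.float32)

    # Inverted index I_v: centroid id -> vector ids; I_w: table id -> {centroid id: count}.
    I_v = defaultdict(list)
    I_w = defaultdict(lambda: defaultdict(int))
    for vid, cx in enumerate(km.labels_):
        I_v[cx].append(vid)
        I_w[vector_to_table[vid]][cx] += 1

    # HNSW proximity graph over the centroids (cosine space, roughly L links per node).
    graph = hnswlib.Index(space="cosine", dim=E.shape[1])
    graph.init_index(max_elements=n_c, ef_construction=200, M=L)
    graph.add_items(centroids, np.arange(n_c))
    return centroids, I_v, I_w, graph

# Query time: labels, dists = graph.knn_query(query_vectors, k=phi_c)
```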

3. Multi-Stage Candidate Selection and Pruning

The core of PGTUS consists of a three-stage pipeline designed to minimize reliance on full bipartite matching:

3.1 Refinement via Many-to-One Matching

PGTUS first identifies a restricted candidate set via an approximate many-to-one t-matching. Every candidate's vector set is partitioned into centroid groups (from the clustering phase). The algorithm seeks a mapping $h_b: V_Q' \to P_i'$ (with $|V_Q'| = t$) assigning query vectors to groups such that each group's capacity is respected, while maximizing the summed centroid similarities.

Algorithm 1 (Refinement):

  1. Build an HNSW index on centroids.
  2. For each query vector $v_q$, retrieve the top $\phi_c$ centroids and store similarity scores in a max-heap.
  3. Track, for each candidate table, the cumulative score and centroid usage via hash tables.
  4. While the heap is nonempty, process candidates, updating scores only if group capacity allows.
  5. Return the $\phi_{ref}$ tables with the highest approximate scores.

Time complexity is $O(|V_Q|\phi_c\log|C_E| + |V_Q|\phi_c\alpha)$, where $\alpha$ is the average inverted list size.
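
A hedged Python sketch of this refinement stage follows, reusing the illustrative `graph` and `I_w` structures from the previous sketch; a real implementation would consult inverted lists per centroid rather than scanning all tables.

```python
import heapq
import numpy as np

def refine(query_vectors, graph, I_w, phi_c: int, phi_ref: int):
    """Return the phi_ref candidate table ids with the highest approximate scores."""
    qv = np.asarray(query_vectors, dtype=np.float32)
    labels, dists = graph.knn_query(qv, k=phi_c)       # top centroids per query vector
    sims = 1.0 - dists                                  # hnswlib cosine distance -> similarity

    # Max-heap of (similarity, query vector id, centroid id), highest similarity first.
    heap = [(-float(sims[q, j]), q, int(labels[q, j]))
            for q in range(qv.shape[0]) for j in range(phi_c)]
    heapq.heapify(heap)

    scores   = {tid: 0.0 for tid in I_w}    # cumulative approximate score per candidate table
    used     = {tid: {} for tid in I_w}     # how many query vectors each centroid group absorbed
    assigned = {tid: set() for tid in I_w}  # query vectors already matched for this table

    while heap:
        neg_sim, q, cx = heapq.heappop(heap)
        for tid, groups in I_w.items():     # sketch only: the paper's index restricts this scan
            cap = groups.get(cx, 0)
            if cap and q not in assigned[tid] and used[tid].get(cx, 0) < cap:
                used[tid][cx] = used[tid].get(cx, 0) + 1
                assigned[tid].add(q)
                scores[tid] += -neg_sim

    return sorted(scores, key=scores.get, reverse=True)[:phi_ref]
```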

3.2 Filtering Using Matching Bounds

For the refined candidate set (of size $\phi_{ref}$), PGTUS applies aggressive pruning by bounding the possible maximum matching score.

Algorithm 2 (BoundsForMWMTO):

  • For each group in a candidate's partition, record group capacity.
  • For each $v_q$ in $V_Q$, compute $sim_{q,k} = \max_{c \in G_k} \langle v_q, c \rangle$.
  • The lower bound is computed by greedy assignment; the upper bound is the sum of the top $\min(|V_Q|, |V_i|)$ similarity values.
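
A possible realization of these bounds is sketched below (illustrative only; `group_centroids` and `group_caps` are assumed inputs derived from the candidate's centroid partition).

```python
import numpy as np

def matching_bounds(query_vectors: np.ndarray, group_centroids: np.ndarray,
                    group_caps: list, tau: float):
    """group_centroids: (g, d) centroids of the candidate's groups; group_caps[k] = group capacity."""
    sim = query_vectors @ group_centroids.T          # sim[q, k] = <v_q, c_k>
    n_cols = sum(group_caps)

    # Upper bound: best per-query-vector similarities, keeping the top min(|V_Q|, |V_i|).
    best = np.sort(sim.max(axis=1))[::-1][: min(len(query_vectors), n_cols)]
    upper = float(best[best >= tau].sum())

    # Lower bound: greedy assignment of query vectors to groups with remaining capacity.
    caps, lower = list(group_caps), 0.0
    for q in np.argsort(-sim.max(axis=1)):           # most promising query vectors first
        for k in np.argsort(-sim[q]):                # best remaining group for this query vector
            if caps[k] > 0 and sim[q, k] >= tau:
                caps[k] -= 1
                lower += float(sim[q, k])
                break
    return lower, upper
```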

Algorithm 3 (BasePrune):

  • Use a min-heap to maintain the current top-$k$ candidates.
  • For each candidate, compute bounds:
    • If the lower bound exceeds the minimum on the heap, retrieve the exact unionability score (using the Hungarian algorithm) and replace the minimum if improved.
    • If the upper bound is at least the heap minimum, the exact score is computed and may replace the minimum.
    • Otherwise, the candidate is discarded.

Only a small subset (final candidates) undergoes the full cubic-time matching.
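
A compact sketch of this filtering loop is shown below; `exact_score` and `bounds` stand in for the Hungarian scoring and the bound computation above, and the structure, not the exact code, follows the paper.

```python
import heapq

def base_prune(candidates, k: int, exact_score, bounds):
    """candidates: iterable of table ids; exact_score(tid) / bounds(tid) are assumed callbacks."""
    topk = []                                        # min-heap of (score, table id)
    for tid in candidates:
        lower, upper = bounds(tid)
        threshold = topk[0][0] if len(topk) == k else float("-inf")
        if len(topk) < k or lower > threshold or upper >= threshold:
            score = exact_score(tid)                 # expensive cubic-time matching, rarely reached
            if len(topk) < k:
                heapq.heappush(topk, (score, tid))
            elif score > topk[0][0]:
                heapq.heapreplace(topk, (score, tid))
        # otherwise: the candidate cannot beat the current top-k and is discarded
    return sorted(topk, reverse=True)
```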

3.3 Enhanced Pruning with Double-Ended Priority Queues

PGTUS$^+$ introduces two double-ended priority queues (DEPQs) to further reduce unnecessary bound computations by closely tracking bounds across candidates.

Algorithm 4 (EnhancedPrune):

  • Maintain DEPQs for the top $\phi_r$ lower and upper bounds.
  • Candidates whose upper bound falls strictly below the $\phi_r$-th largest lower bound are pruned without introducing error.
  • Candidates are updated only if they might affect the final result.

Efficiency is further increased, and this mechanism provably does not affect the correctness of the returned top-$k$ results.
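
A minimal illustration of this DEPQ-based screening follows, using sorted lists from the standard library in place of a dedicated DEPQ implementation; `bounds` is an assumed callback returning the (lower, upper) bound pair for a candidate.

```python
import bisect

def enhanced_prune(candidates, phi_r: int, bounds):
    """Return candidate ids that survive the DEPQ-based bound screening."""
    lower_depq, upper_depq = [], []          # kept sorted ascending, capped at phi_r entries
    survivors = []
    for tid in candidates:
        lo, up = bounds(tid)
        # phi_r-th largest lower bound seen so far (smallest element of the capped list).
        cutoff = lower_depq[0] if len(lower_depq) == phi_r else float("-inf")
        if up < cutoff:
            continue                         # provably cannot reach the top results; skip all further work
        survivors.append(tid)
        for depq, val in ((lower_depq, lo), (upper_depq, up)):
            bisect.insort(depq, val)
            if len(depq) > phi_r:
                depq.pop(0)                  # evict the smallest, keep the phi_r largest
    return survivors
```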

4. Computational Complexity and Storage

The three-stage design ensures efficiency both in computation and storage:

  • Centroid Clustering and Indexing: $K$-means clustering is $O(|E|\,n_c\,t_{\mathrm{iter}})$, while the inverted list requires $O(|E|)$ storage.
  • Refinement Stage: $O(|V_Q|(\phi_c\log|C_E| + \phi_c\alpha))$ per query; auxiliary hash tables require $O(|V_Q|\phi_c + n)$ space.
  • Filtering Stage: Bounds cost $O(|V_Q|\cdot|P_i|)$ per candidate; full matching (rarely needed) is $O(m^3)$, where $m$ is the number of columns per table.
  • Enhanced Pruning Stage: DEPQ operations cost $O(\log \phi_r)$ per candidate. Early discards minimize bound computations.

The bulk of the computational cost is offloaded from expensive global bipartite matchings to efficient centroid- and bound-based filtering.

5. Empirical Evaluation and Quantitative Results

PGTUS has been evaluated extensively on six benchmark datasets, spanning corpora from small (hundreds of tables) to large web data (millions of tables).

| Dataset | #Tables | #Columns | #Queries | Avg. Cols |
| --- | --- | --- | --- | --- |
| Santos Small | 550 | 6,322 | 50 | 12.3 |
| Santos Large | 11,086 | 121,796 | 80 | 12.7 |
| TUS Small | 1,530 | 14,810 | 10 | 34.5 |
| TUS Large | 5,043 | 55,000 | 4,293 | 11.9 |
| Open Data | 13,000 | 218,000 | 1,000 | 30.5 |
| LakeBench WebTable | 2.8M | 14.8M | 1,000 | 14.6 |
  • Recall@k: PGTUS preserves retrieval performance, consistently achieving >95% recall.
  • Latency: PGTUS delivers 2–4× speedup over Starmie at equivalent recall. The enhanced pruning variant PGTUS$^+$ attains higher gains, particularly on large, high-dispersion datasets (e.g., Open Data, WebTable), reaching speedups up to 35.3× over brute-force baselines.
  • Storage and Build Time: PGTUS index for Open Data uses 414 MB (building in 35 s), substantially improving over Starmie (1,285 MB, 5 min). On the largest benchmark, WebTable, PGTUS index occupies 30.6 GB, compared to Starmie’s 98 GB.
| Method | Open Data Storage | Open Data Build Time | WebTable Storage | WebTable Build Time |
| --- | --- | --- | --- | --- |
| Starmie | 1,285 MB | 5 min | 98 GB | 2.5 h |
| Starmie (VQ) | 361 MB | 12 s | 21.8 GB | 1 h |
| PGTUS | 414 MB | 35 s | 30.6 GB | 2 h |

PGTUS(VQ), leveraging vector quantization, achieves a 3.5× reduction in memory over Starmie with a minimal drop in recall.

Parameter sensitivity analysis indicates:

  • Increasing the refinement candidate size $\phi_{ref}$ relative to $k$ improves recall up to a plateau at about $3k$.
  • A filtering candidate size of $\phi_r \approx 2k$ suffices; larger values yield diminishing returns.
  • Centroid probe sizes $\phi_c$ as small as 1–4 suffice for large web-scale data, with higher values required for low-dispersion data.

6. Relationship to Prior Work

PGTUS advances the state of the art over single-vector retrieval methods (e.g., BM25 on concatenated columns) and previous multi-vector designs like Starmie:

  • Single-vector methods: Lack the granularity necessary to model column–column correspondences, resulting in reduced recall.
  • Starmie: Utilizes multi-vector matchings but scales poorly for large datasets due to exhaustive bipartite matching in retrieval.
  • PGTUS: Combines centroid-based preparation, efficient proximity search, and multi-stage pruning (including a novel bound-tightening DEPQ approach), achieving a 3.6–6.0× speedup with negligible recall degradation and reducing the number of expensive matchings by over 90%.

Additionally, the use of vector quantization in index storage furthers space savings with little performance penalty.

7. Limitations and Prospects

PGTUS presupposes a fixed similarity threshold $\tau$ across all query-candidate pairs; adaptive or context-aware thresholds remain an open avenue. The quality of centroid clustering (e.g., via K-means) directly impacts pruning and candidate grouping; datasets with highly uneven distributions of vector embeddings may necessitate more sophisticated centroid schemes, such as “cascade centroids.”

DEPQ pruning efficiency depends on achieving tight bounds. In edge cases—particularly, queries or candidates with very small or very large numbers of columns—bound tightness may deteriorate and degrade pruning efficiency.

Areas identified for future work include dynamic online updates of centroids, on-the-fly query adaptation, and acceleration of matching stages via GPU-based computation.

In conclusion, PGTUS establishes a scalable, robust, and storage-efficient methodology for high-recall table union search. By combining proximity graph construction, centroid indexing, and priority-queue-based pruning, it delivers significant improvements over previous methods in both speed and memory use, within the constraints of current neural multi-vector embedding paradigms.
