Proximity Graph Table Union Search (PGTUS)
- The paper introduces a multi-stage framework that approximates costly bipartite matching using centroid clustering and proximity graph indexing to score table unionability.
- PGTUS leverages an HNSW proximity graph for efficient nearest neighbor retrieval and employs tiered candidate pruning to drastically reduce the number of expensive matchings.
- Empirical evaluations show that PGTUS maintains >95% recall while achieving up to 35.3× speedup and significant memory savings over existing table union search methods.
Proximity Graph-based Table Union Search (PGTUS) is a highly efficient multi-stage framework for finding semantically compatible tables that can be merged (“unioned”) with a given query table. The approach is motivated by the prohibitive cost of maximum-weight bipartite matching required for computing the unionability score of multi-vector table representations. PGTUS accelerates table union search while preserving retrieval quality by integrating centroid-based clustering, proximity graph indexing, and tiered pruning strategies based on both classical and novel algorithmic insights.
1. Formalization of the Table Union Search Problem
Given a collection of tables, each table $T_i$ is represented by a set of column embeddings $X_i$. Let $\mathcal{T} = \{T_1, \dots, T_n\}$ denote the repository of tables, and $\mathcal{X} = \{X_1, \dots, X_n\}$ be the corresponding repository of column vector-sets, where $X_i$ comprises one embedding per column of $T_i$.
A key operation in table union search is establishing a t-matching: for query and candidate vector-sets $Q$ and $X$ and a similarity threshold $t$, a $t$-matching is a one-to-one mapping $M \subseteq Q \times X$, with $|M| \le \min(|Q|, |X|)$ and each vector appearing in at most one pair, such that every match $(q, x) \in M$ satisfies $\mathrm{sim}(q, x) \ge t$. The unionability score is defined as the maximum similarity sum over all maximum $t$-matchings: $U(Q, X) = \max_{M} \sum_{(q, x) \in M} \mathrm{sim}(q, x)$. The objective is: given a query table with vector-set $Q$, threshold $t$, and integer $k$, return the top-$k$ tables in $\mathcal{X}$ maximizing $U(Q, X)$.
Multi-vector models (column-wise embeddings) offer high recall in retrieval, but exact computation of $U(Q, X)$ via the Hungarian algorithm is cubic in the number of columns, motivating efficient approximate search strategies.
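The exact score can be illustrated with a small stdlib-only sketch that enumerates $t$-matchings by brute force; the `cosine` helper and the enumeration strategy are illustrative stand-ins for the paper's cubic-time Hungarian solver, feasible only for tiny tables:

```python
from itertools import combinations, permutations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

def unionability(Q, X, t):
    """Exact unionability score U(Q, X): the maximum similarity sum over
    all maximum t-matchings between query columns Q and candidate columns X.
    Enumerates injective partial mappings by brute force; the paper instead
    uses the O(m^3) Hungarian algorithm."""
    m, n = len(Q), len(X)
    for r in range(min(m, n), 0, -1):            # try largest matchings first
        best = None
        for qs in combinations(range(m), r):     # which query columns match
            for xs in permutations(range(n), r): # which candidate columns they take
                sims = [cosine(Q[q], X[x]) for q, x in zip(qs, xs)]
                if all(s >= t for s in sims):    # every pair must clear t
                    score = sum(sims)
                    best = score if best is None else max(best, score)
        if best is not None:
            return best                          # maximum cardinality reached
    return 0.0                                   # no pair clears the threshold
```

For example, with `Q = [[1, 0], [0, 1]]` and `X = [[1, 0], [0, 1], [1, 1]]` at `t = 0.9`, only the two identical columns clear the threshold, so the score is their similarity sum.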
2. Proximity Graph and Centroid-Based Indexing
PGTUS exploits centroid-based clustering and proximity graph indexing to support efficient candidate retrieval in the high-dimensional multi-vector search space.
- Clustering and Centroids: All column vectors are clustered using $k$-means into centroids $C = \{c_1, \dots, c_K\}$. An inverted index maps each centroid to its assigned vectors, with $\phi(x)$ denoting the centroid nearest to vector $x$.
- Table-wise Centroid Partition: For each table $T_i$, a partition vector $P_i$ records the count of vectors in $X_i$ assigned to each centroid $c_j$.
- Proximity Graph (HNSW): Centroids are indexed via an HNSW (Hierarchical Navigable Small World) graph, in which each centroid is connected to its nearest neighbors under cosine similarity. At query time, the HNSW structure is traversed to retrieve the top-ranked centroids for each query vector.
This proximity graph architecture underpins efficient nearest neighbor searches for the subsequent candidate selection stage.
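The preparation steps above can be sketched in plain Python; this is a hedged illustration, with a textbook Lloyd-style k-means substituting for a tuned clustering library and no HNSW layer (exact centroid scans would substitute for graph traversal at this scale):

```python
import random

def squared_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(vectors, k, iters=20, seed=0):
    """Plain Lloyd-style k-means over column embeddings; a stand-in for
    the clustering step of PGTUS (any library implementation would do)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assign = [0] * len(vectors)
    for _ in range(iters):
        # assignment step: nearest centroid per vector
        assign = [min(range(k), key=lambda j: squared_dist(v, centroids[j]))
                  for v in vectors]
        # update step: recompute each centroid as its cluster mean
        for j in range(k):
            members = [v for v, a in zip(vectors, assign) if a == j]
            if members:  # keep the old centroid if its cluster emptied
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids, assign

def inverted_index(assign):
    """Map each centroid id to the ids of the vectors assigned to it."""
    inv = {}
    for vec_id, cid in enumerate(assign):
        inv.setdefault(cid, []).append(vec_id)
    return inv
```

The per-table partition counts follow directly by restricting `assign` to one table's column ids and counting occurrences of each centroid.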
3. Multi-Stage Candidate Selection and Pruning
The core of PGTUS consists of a three-stage pipeline designed to minimize reliance on full bipartite matching:
3.1 Refinement via Many-to-One Matching
PGTUS first identifies a restricted candidate set via an approximate many-to-one $t$-matching. Every candidate's vector set is partitioned into centroid-groups (from the clustering phase). The algorithm seeks a mapping that assigns query vectors to centroid-groups so that no group receives more query vectors than it has columns (its capacity), maximizing the summed centroid similarities.
Algorithm 1 (Refinement):
- Build an HNSW index on centroids.
- For each query vector, retrieve its top-ranked centroids via the HNSW index and store the similarity scores in a max-heap.
- Track, for each candidate table, the cumulative score and centroid usage via hash tables.
- While the heap is nonempty, process candidates, updating scores only if group capacity allows.
- Return the tables with the highest approximate scores.
Time complexity is $O(|Q| \cdot n_{\text{probe}} \cdot \bar{\ell})$, where $n_{\text{probe}}$ is the number of centroids probed per query vector and $\bar{\ell}$ is the average inverted list size.
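A one-shot sketch of the refinement stage, under stated simplifications: exact centroid scoring replaces HNSW retrieval, `table_partition[tid]` is assumed to map centroid ids to that table's column counts, and the names `n_probe` and `c_size` are illustrative parameters, not the paper's:

```python
import heapq

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sum(a * a for a in u) ** 0.5 * sum(b * b for b in v) ** 0.5)

def refine(query_vecs, centroids, table_partition, n_probe, c_size):
    """Approximate many-to-one t-matching: score each candidate table by
    greedily assigning query vectors to its centroid groups, respecting
    each group's capacity; return the c_size best-scoring tables."""
    heap = []  # max-heap of (negated similarity, query index, centroid id)
    for qi, q in enumerate(query_vecs):
        sims = sorted(((cosine(q, c), ci) for ci, c in enumerate(centroids)),
                      reverse=True)
        for s, ci in sims[:n_probe]:
            heapq.heappush(heap, (-s, qi, ci))
    score   = {tid: 0.0   for tid in table_partition}  # cumulative scores
    used    = {tid: {}    for tid in table_partition}  # centroid usage per table
    matched = {tid: set() for tid in table_partition}  # query vectors already matched
    while heap:  # process (query vector, centroid) pairs best-first
        neg_s, qi, ci = heapq.heappop(heap)
        for tid, part in table_partition.items():
            cap = part.get(ci, 0)
            if qi in matched[tid] or used[tid].get(ci, 0) >= cap:
                continue  # group full or query vector already assigned
            used[tid][ci] = used[tid].get(ci, 0) + 1
            matched[tid].add(qi)
            score[tid] += -neg_s
    return sorted(score, key=score.get, reverse=True)[:c_size]
```

Because scores accumulate per table independently, a single popped pair can contribute to every candidate whose partition still has capacity for that centroid.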
3.2 Filtering Using Matching Bounds
For the refined candidate set (now a small fraction of the repository), PGTUS applies aggressive pruning by bounding the possible maximum matching score.
Algorithm 2 (BoundsForMWMTO):
- For each group in a candidate's partition, record the group capacity (the number of the candidate's columns assigned to that centroid).
- For each query vector in $Q$, compute its similarity to each group's centroid.
- The lower bound is computed by a greedy capacity-respecting assignment; the upper bound is the sum of each query vector's top similarity value, ignoring capacities.
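A minimal sketch of the bound computation, assuming `sims[qi][g]` holds the similarity of query vector `qi` to centroid group `g` and `capacity[g]` the candidate's column count in that group (the function name and argument layout are illustrative):

```python
def bounds_for_mwmto(sims, capacity):
    """Lower/upper bounds on the maximum-weight many-to-one matching score.
    Upper bound: each query vector takes its best group, capacities ignored.
    Lower bound: greedy best-first assignment that honors group capacities."""
    # upper bound: best similarity per query vector, no capacity constraint
    upper = sum(max(row) for row in sims)
    # lower bound: process (sim, query, group) triples best-first
    triples = sorted(((s, qi, g)
                      for qi, row in enumerate(sims)
                      for g, s in enumerate(row)), reverse=True)
    used, taken, lower = {}, set(), 0.0
    for s, qi, g in triples:
        if qi in taken or used.get(g, 0) >= capacity[g]:
            continue  # query vector already matched, or group is full
        taken.add(qi)
        used[g] = used.get(g, 0) + 1
        lower += s
    return lower, upper
```

Any feasible matching score lies between the two bounds, which is what the subsequent pruning stages exploit.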
Algorithm 3 (BasePrune):
- Use a min-heap to maintain the current top- candidates.
- For each candidate, compute bounds:
- If the lower bound exceeds the minimum on the heap, the candidate is guaranteed to qualify: retrieve the exact unionability score (via the Hungarian algorithm) and replace the minimum if the exact score is higher.
- Otherwise, if the upper bound still meets or exceeds the minimum, the candidate might qualify: the exact score is computed and possibly replaces the minimum.
- Otherwise, the candidate is discarded without any exact matching.
Only a small subset (final candidates) undergoes the full cubic-time matching.
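The pruning loop can be sketched as follows; `bounds_fn` and `exact_fn` are hypothetical callables standing in for the bound computation and the Hungarian solver, and the guaranteed-qualify and might-qualify branches are folded together since both end in an exact computation:

```python
import heapq

def base_prune(candidates, k, bounds_fn, exact_fn):
    """Keep a min-heap of the current top-k exact scores; pay for the
    expensive exact (Hungarian) score only when a candidate's upper bound
    says it could still enter the top-k."""
    heap = []  # min-heap of (exact_score, candidate_id); weakest on top
    for cid in candidates:
        if len(heap) < k:                  # fill the heap unconditionally
            heapq.heappush(heap, (exact_fn(cid), cid))
            continue
        lower, upper = bounds_fn(cid)
        threshold = heap[0][0]             # current k-th best exact score
        if upper < threshold:
            continue                       # provably outside top-k: no exact matching
        score = exact_fn(cid)              # promising: pay the cubic cost
        if score > threshold:
            heapq.heapreplace(heap, (score, cid))
    return sorted(heap, reverse=True)
```

In the common case most candidates fall into the `upper < threshold` branch, so the cubic-time matching runs only on a handful of survivors.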
3.3 Enhanced Pruning with Double-Ended Priority Queues
PGTUS introduces two Double-Ended Priority Queues (DEPQs) to further reduce unnecessary bound computations by closely tracking bounds across candidates.
Algorithm 4 (EnhancedPrune):
- Maintain DEPQs for the top lower and upper bounds.
- Candidates whose upper bounds fall strictly below the $k$-th largest lower bound are pruned without error.
- Candidates are updated only if they might affect the final result.
Efficiency is further increased, and this mechanism provably does not affect the correctness of the returned top-$k$.
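The pruning criterion behind the DEPQs can be shown with a one-shot sketch (the paper maintains both ends incrementally; here all bounds are computed up front, and `bounds_fn`/`exact_fn` are hypothetical callables as before):

```python
def enhanced_prune(candidates, k, bounds_fn, exact_fn):
    """The k-th largest lower bound is a safe pruning threshold: at least
    k candidates score at or above it, so anything whose upper bound is
    strictly below it cannot appear in the top-k and is discarded before
    any exact matching is attempted."""
    bounds = {cid: bounds_fn(cid) for cid in candidates}
    kth_lower = sorted((lo for lo, _ in bounds.values()), reverse=True)[k - 1]
    survivors = [cid for cid, (lo, up) in bounds.items() if up >= kth_lower]
    scored = sorted(((exact_fn(cid), cid) for cid in survivors), reverse=True)
    return scored[:k]
```

Tracking the threshold with double-ended priority queues, as PGTUS does, lets it rise as tighter lower bounds arrive, so later candidates are pruned ever more cheaply.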
4. Computational Complexity and Storage
The three-stage design ensures efficiency both in computation and storage:
- Centroid Clustering and Indexing: $k$-means clustering costs $O(NKd)$ per iteration over $N$ vectors of dimension $d$, while the inverted index requires $O(N)$ storage.
- Refinement Stage: $O(|Q| \cdot n_{\text{probe}} \cdot \bar{\ell})$ per query, with $n_{\text{probe}}$ probed centroids per query vector and average inverted list size $\bar{\ell}$; auxiliary hash tables require space linear in the number of candidates touched.
- Filtering Stage: Bounds per candidate are computed in time near-linear in the candidate's column count; full matching (rarely needed) costs $O(m^3)$, where $m$ is the number of columns per table.
- Enhanced Pruning Stage: DEPQ operations per candidate take logarithmic time. Early discards minimize bound computations.
The bulk of the computational cost is offloaded from expensive global bipartite matchings to efficient centroid- and bound-based filtering.
5. Empirical Evaluation and Quantitative Results
PGTUS has been evaluated extensively on six benchmark datasets, spanning corpora from small (hundreds of tables) to large web data (millions of tables).
| Dataset | #Tables | #Columns | #Queries | Avg Cols |
|---|---|---|---|---|
| Santos Small | 550 | 6,322 | 50 | 12.3 |
| Santos Large | 11,086 | 121,796 | 80 | 12.7 |
| TUS Small | 1,530 | 14,810 | 10 | 34.5 |
| TUS Large | 5,043 | 55,000 | 4,293 | 11.9 |
| Open Data | 13,000 | 218,000 | 1,000 | 30.5 |
| LakeBench WebTable | 2.8M | 14.8M | 1,000 | 14.6 |
- Recall@k: PGTUS preserves retrieval performance, consistently achieving >95% recall.
- Latency: PGTUS delivers 2–4× speedup over Starmie at equivalent recall. The enhanced-pruning variant attains higher gains, particularly on large, high-dispersion datasets (e.g., Open Data, WebTable), reaching speedups of up to 35.3× over brute-force baselines.
- Storage and Build Time: PGTUS index for Open Data uses 414 MB (building in 35 s), substantially improving over Starmie (1,285 MB, 5 min). On the largest benchmark, WebTable, PGTUS index occupies 30.6 GB, compared to Starmie’s 98 GB.
| Method | Open Data Storage | Open Data Build Time | WebTable Storage | WebTable Build Time |
|---|---|---|---|---|
| Starmie | 1,285 MB | 5 m | 98 GB | 2.5 h |
| Starmie (VQ) | 361 MB | 12 s | 21.8 GB | 1 h |
| PGTUS | 414 MB | 35 s | 30.6 GB | 2 h |
PGTUS(VQ), leveraging vector quantization, achieves a 3.5× reduction in memory over Starmie with a minimal drop in recall.
Parameter sensitivity analysis indicates:
- Increasing the refinement candidate size improves recall up to a plateau at about $3k$.
- Increasing the filtering candidate size offers diminishing returns beyond this threshold.
- Centroid probe sizes as small as 1–4 suffice for large web-scale data, with higher values required for low-dispersion data.
6. Relationship to Prior Work
PGTUS advances the state of the art over single-vector retrieval methods (e.g., BM25 on concatenated columns) and previous multi-vector designs like Starmie:
- Single-vector methods: Lack the granularity necessary to model column–column correspondences, resulting in reduced recall.
- Starmie: Utilizes multi-vector matchings but scales poorly for large datasets due to exhaustive bipartite matching in retrieval.
- PGTUS: By combining centroid-based preparation, efficient proximity search, and multi-stage pruning (including a novel bound-tightening DEPQ approach), achieves a 3.6–6.0× speedup with negligible recall degradation and reduces the number of expensive matchings by over 90%.
Additionally, the use of vector quantization in index storage furthers space savings with little performance penalty.
7. Limitations and Prospects
PGTUS presupposes a fixed similarity threshold across all query-candidate pairs; adaptive or context-aware thresholds remain an open avenue. The quality of centroid clustering (e.g., via K-means) directly impacts pruning and candidate grouping; datasets with highly uneven distributions of vector embeddings may necessitate more sophisticated centroid schemes, such as “cascade centroids.”
DEPQ pruning efficiency depends on achieving tight bounds. In edge cases—particularly, queries or candidates with very small or very large numbers of columns—bound tightness may deteriorate and degrade pruning efficiency.
Areas identified for future work include dynamic online updates of centroids, on-the-fly query adaptation, and acceleration of matching stages via GPU-based computation.
In conclusion, PGTUS establishes a scalable, robust, and storage-efficient methodology for high-recall table union search. By combining proximity graph construction, centroid indexing, and priority-queue-based pruning, it delivers significant improvements over previous methods in both speed and memory utilization, within the constraints of current neural multi-vector embedding paradigms.