Starmie: End-to-End Dataset Discovery
- Starmie is an end-to-end framework for semantics-aware dataset discovery, leveraging self-supervised contrastive learning and multi-column contextualization to represent table columns accurately.
- It employs a two-phase workflow where an offline phase pretrains column embeddings, and an online phase retrieves unionable tables via cosine similarity and weighted bipartite matching.
- Integrating HNSW indexing enables rapid approximate nearest neighbor search; together, these components deliver a 6.8% MAP improvement over prior methods and up to a 3,000× query speedup over a linear-scan baseline.
Starmie is an end-to-end framework for semantics-aware dataset discovery from large-scale data lakes, with its principal application in table union search. The design of Starmie integrates self-supervised contrastive learning for column representation, multi-column contextualization based on pre-trained LLMs, efficient unionability scoring, and advanced approximate nearest neighbor search using the Hierarchical Navigable Small World (HNSW) index. The framework achieves state-of-the-art effectiveness in identifying unionable tables, operating efficiently even in scenarios with inconsistent or incomplete metadata.
1. System Architecture and Workflow
Starmie is structured into two discrete operational stages: an offline phase and an online phase. In the offline phase, a column encoder is pretrained on the input data lake’s tables, transforming each column into a high-dimensional embedding that encodes both its intrinsic and contextual semantic properties. These embeddings are indexed by advanced high-dimensional vector indices—specifically, HNSW—for rapid nearest neighbor retrieval.
During the online phase, given a query table, Starmie retrieves candidate unionable tables by searching for columns in the data lake with high cosine similarity to the query’s column embeddings. Candidate tables undergo aggregation of column-level scores, computed using techniques such as weighted bipartite matching, yielding an overall unionability score for each candidate. This architecture is robust to inconsistent or missing metadata due to its reliance on neural semantic representations.
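The online phase described above can be sketched as follows. This is a minimal illustration, not Starmie's implementation: `top_k_cosine` and `retrieve_candidates` are hypothetical helper names, and the per-table aggregation here is a simple greedy stand-in for the weighted bipartite matching the system actually uses.

```python
import numpy as np

def top_k_cosine(query_vec, index_vecs, k=3):
    """Return (indices, scores) of the k most cosine-similar index vectors."""
    q = query_vec / np.linalg.norm(query_vec)
    X = index_vecs / np.linalg.norm(index_vecs, axis=1, keepdims=True)
    sims = X @ q
    top = np.argsort(-sims)[:k]
    return top, sims[top]

def retrieve_candidates(query_cols, lake_cols, col_to_table, k=3):
    """Aggregate per-column hits into table-level scores (greedy stand-in
    for the weighted bipartite matching Starmie performs downstream)."""
    scores = {}
    for qv in query_cols:
        idx, sims = top_k_cosine(qv, lake_cols, k)
        for i, s in zip(idx, sims):
            t = col_to_table[i]
            scores[t] = scores.get(t, 0.0) + float(s)
    # Rank candidate tables by aggregated column similarity
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

In a real deployment, the linear scan inside `top_k_cosine` is the step replaced by the HNSW index.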
2. Contrastive Learning for Column Representation
Central to Starmie is a self-supervised contrastive learning strategy for column encoder training. For each column, two semantic-preserving “views” or augmentations are generated, for example by randomly sampling cells or dropping tokens. Both views are encoded via a neural model initialized from a pre-trained LLM (e.g., RoBERTa, BERT), and further fine-tuned in an unsupervised manner.
The contrastive loss is modeled after InfoNCE:

$$\mathcal{L}_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k \neq i} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$$

where $z_i$ and $z_j$ are embeddings of paired views of the same column, $\tau$ is a temperature hyper-parameter (empirically set to 0.07), and $\mathrm{sim}(\cdot,\cdot)$ is cosine similarity. The optimization objective is to “pull” unionable/semantically similar columns together and “push” dissimilar columns apart in representation space. This fully unsupervised setup avoids reliance on annotated training data.
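A minimal NumPy sketch of this loss for a single anchor column follows; the function names are illustrative, and a real training loop would compute the loss over full batches of paired views rather than one anchor at a time.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def info_nce(anchor, positive, negatives, tau=0.07):
    """InfoNCE loss for one anchor: negative log-softmax of the positive's
    similarity against all candidates, scaled by temperature tau."""
    sims = np.array([cosine(anchor, positive)] +
                    [cosine(anchor, n) for n in negatives]) / tau
    sims -= sims.max()                      # numerical stability
    probs = np.exp(sims) / np.exp(sims).sum()
    return -np.log(probs[0])                # positive sits at index 0
```

A well-aligned positive view drives the loss toward zero, while a misaligned one inflates it, which is exactly the pull/push behavior described above.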
3. Multi-column Contextualization and Table Encoding
The column encoder in Starmie captures not only the syntactic and semantic signals from the column’s own values, but also its position and context in the table schema. The multi-column table encoder serializes the entire table into a sequence with a designated separator token (e.g., “<s>”) prepending each column. The entire sequence is processed by the Transformer layers of the pre-trained LLM, with the representation of each separator token serving as a contextualized embedding for its corresponding column.
This multi-column pre-training strategy ensures that each column’s embedding incorporates signals from its neighboring columns, enabling the system to disambiguate columns with otherwise similar content but different contextual significance (e.g., distinguishing “Destination” as a travel city vs. bird sighting location).
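A sketch of the serialization step is shown below, assuming column headers are included alongside a sample of cell values; the exact serialization format (header inclusion, cell sampling) is an assumption here, not a specification of Starmie's tokenizer input.

```python
def serialize_table(columns, sep="<s>", max_cells=5):
    """Flatten a table into one token sequence in which each column is
    preceded by a separator token; the LM's output embedding at each
    separator position then serves as that column's contextualized vector."""
    parts = []
    for name, cells in columns:
        parts.append(sep)                        # one separator per column
        parts.append(name)
        parts.extend(str(c) for c in cells[:max_cells])
    return " ".join(parts)
```

Because every column's separator attends over the whole sequence in the Transformer, a "Destination" column serialized next to "Airline" receives a different embedding than one serialized next to "Species".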
4. Unionability Scoring, Filtering, and Verification
To assess whether two tables are unionable, Starmie computes column embedding similarities using cosine similarity. For a pair of tables $S$ (query) and $T$ (candidate), the column mapping is formalized as a maximum weighted bipartite matching problem: one set of nodes for $S$'s columns, another for $T$'s. Edges are established only between column pairs whose cosine similarity exceeds a threshold $\theta$, and edge weights are set to those similarities. The aggregate of edge weights from the optimal matching yields the overall table unionability score $U(S, T)$.
For computational efficiency, a filter-and-verification framework is used:
- Filtering: Candidate tables are shortlisted via fast approximate vector joins.
- Verification: Lower and upper bounds on the matching score are estimated to prune unlikely candidates before the expensive full matching algorithm is applied.
This approach offers a principled balance between recall and computational cost, essential for large-scale data lakes.
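The matching step above can be illustrated with a small brute-force sketch over a column-similarity matrix. This is a didactic stand-in: practical implementations solve the assignment with the Hungarian algorithm rather than enumeration, and `theta` here plays the role of the similarity threshold.

```python
from itertools import permutations

import numpy as np

def unionability(sim, theta=0.5):
    """Table unionability as maximum weighted bipartite matching over
    column pairs whose cosine similarity exceeds theta. Brute force is
    fine for the handful of columns typical tables have."""
    m, n = sim.shape
    if m > n:                       # match the smaller side onto the larger
        return unionability(sim.T, theta)
    best = 0.0
    for perm in permutations(range(n), m):
        # Only edges above the threshold contribute to the matching score
        score = sum(sim[i, j] for i, j in enumerate(perm) if sim[i, j] > theta)
        best = max(best, score)
    return best
```

The verification bounds mentioned above would bracket this exact score cheaply (for instance, summing each query column's best above-threshold similarity gives an upper bound) so that most candidates are pruned without running the full matching.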
5. High-dimensional Indexing with HNSW
Starmie is the first system in data lake table search to leverage the Hierarchical Navigable Small World (HNSW) index for efficient nearest neighbor retrieval in high-dimensional embedding spaces. HNSW constructs a multi-layer proximity graph, supporting rapid approximate similarity queries. Empirical results demonstrate that HNSW yields a 3,000× speedup over the linear scan baseline and a 400× improvement over Locality Sensitive Hashing (LSH), the prior state-of-the-art, with only a minor reduction in MAP and recall. This substantial acceleration is critical to supporting large-scale deployment, enabling efficient search over millions of candidate tables.
6. Empirical Performance and Evaluation
Starmie is empirically validated on multiple real-world table benchmarks for the union search task. It demonstrates a 6.8% improvement in Mean Average Precision (MAP), with corresponding gains in recall, relative to the preceding state-of-the-art. Reported MAP values approach 99% in some settings, attributable to the use of contrastive learning and contextualized column embeddings, and effectiveness remains high despite heterogeneous or incomplete metadata. These results establish that unionable tables can be discovered both rapidly and accurately at scale.
7. Technical Innovations and Broader Implications
Starmie introduces multiple new elements to dataset discovery:
- A fully unsupervised, contrastive learning pipeline for semantically rich, context-aware column embedding without annotated data.
- Integration of a multi-column Transformer-based encoder for leveraging table context.
- Table unionability formalized as a weighted bipartite matching problem, combined with a filter-and-verification mechanism using upper and lower bound estimators.
- The novel use of HNSW for rapid indexing and retrieval in noisy, large-scale, high-dimensional column embedding spaces.
Collectively, these techniques comprise an extensible and robust methodology for scalable dataset discovery, with further applications anticipated in joinable table search and column clustering. A plausible implication is that, by decoupling representation learning from metadata dependence and scaling efficient candidate search, Starmie’s architecture can be extended to address additional table understanding and search problems in evolving data lake ecosystems.