NextiaJD: Scalable Join Discovery
- NextiaJD is a distributed, learning-based system for join discovery that profiles attributes and predicts join quality using supervised models.
- It leverages succinct unary and binary profiles—capturing metrics like containment and cardinality proportion—to efficiently rank joinable candidate pairs.
- The system achieves high precision and linear scalability by integrating Spark-based profiling with efficient candidate filtering and machine learning inference.
NextiaJD is a distributed, learning-based system for scalable join discovery across large, heterogeneous data repositories. It addresses the problem of efficiently and accurately finding joinable attribute pairs among datasets—an operation that underpins data integration, discovery, and analytics at web and enterprise scale. NextiaJD relies on succinct attribute-level profiles and introduces a join quality metric that jointly considers value containment and cardinality proportion, leveraging supervised machine learning for predictive ranking. Its architecture and algorithms are designed for linear scalability and high precision while avoiding the prohibitive costs of traditional value-indexing or sketching techniques (Nadal et al., 2023, Flores et al., 2020).
1. Problem Domain and Limitations of Prior Work
Join discovery at scale requires identifying pairs of attributes from independent datasets that can be meaningfully joined on their values. Traditional solutions based on value-level inverted indices, exact computation of containment or Jaccard similarity, and hash-based sketches (including MinHash, LSH Ensemble, and related methods) become impractical when faced with thousands of tables, billions of rows, and high heterogeneity:
- Value-level indexing demands storing and querying potentially millions of distinct values per attribute, leading to excessive memory and I/O requirements.
- Hash-based or sketch-based approaches still require enumerating every value to generate signatures, suffer high false-positive rates due to hash collisions, and face scalability bottlenecks in updating and comparing large sets of sketches.
- Both classes of methods remain fundamentally limited in precision, especially as scale and heterogeneity increase.
NextiaJD overcomes these barriers by operating entirely at the level of profiled summaries, minimizing the need for value-level interaction and enabling efficient, parallelized discovery (Nadal et al., 2023, Flores et al., 2020).
2. Profile Representations
NextiaJD utilizes two primary classes of profiles: unary (per-attribute) and binary (per-attribute-pair) summaries, which collectively encapsulate information required for accurate joinability assessment.
The unary profile for an attribute $A$ comprises meta-features, which include:
- Cardinality statistics: number of distinct values $|A|$, uniqueness ($|A| / n$, where $n$ is the number of rows), incompleteness (fraction of null values), and empirical entropy $H(A)$.
- Value-distribution statistics: distributional moments (mean, min, max, std) of value counts, octiles, constancy (frequency of the most frequent value divided by $n$), top frequent values, and top 10 Soundex codes.
- Syntactic and type patterns: detected datatype (numeric, alphabetic, datetime, etc.), specific type via regex (email, URL, IP, phone, username, phrase), and string/word length statistics (min, max, mean, std).
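A few of the cardinality and distribution meta-features listed above can be sketched in plain Python. This is a minimal illustration, not the system's Spark implementation: the function name `unary_profile` is hypothetical, and normalizing uniqueness, incompleteness, and constancy by the total row count is an assumption.

```python
import math
from collections import Counter


def unary_profile(values):
    """Compute a handful of unary meta-features for one column.

    `values` is a list of cell values; None marks a missing value.
    Assumption: ratios are taken over the total number of rows.
    """
    n = len(values)
    present = [v for v in values if v is not None]
    counts = Counter(present)
    total = len(present)

    # Empirical Shannon entropy (in bits) of the non-null value distribution.
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)

    most_frequent = counts.most_common(1)[0][1] if counts else 0
    return {
        "cardinality": len(counts),
        "uniqueness": len(counts) / n if n else 0.0,
        "incompleteness": (n - total) / n if n else 0.0,
        "entropy": entropy,
        "constancy": most_frequent / n if n else 0.0,
    }


profile = unary_profile(["ES", "FR", "ES", None, "DE", "ES", "FR", None])
print(profile)
```

Each statistic needs only one pass over the column (plus a frequency table), which is what makes this kind of profiling straightforward to parallelize per attribute.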
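The binary profile for an attribute pair $(A, B)$ includes:
- Normalized Levenshtein distance between attribute names.
- Containment upper bound: $\min(|A|, |B|) / |A|$, the largest containment achievable given only the two cardinalities.
- Cardinality proportion: $K(A, B) = \min(|A|, |B|) / \max(|A|, |B|)$.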
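The binary meta-features can be derived from attribute names and cardinalities alone, without touching the underlying values. The sketch below assumes the containment upper bound is $\min(|A|, |B|) / |A|$ and the cardinality proportion is the min/max ratio of distinct counts; the function names and the lower-casing of attribute names are illustrative choices.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def binary_profile(name_a, card_a, name_b, card_b):
    """Binary meta-features from names and distinct-value counts only.

    Assumes card_a > 0 and card_b > 0.
    """
    longest = max(len(name_a), len(name_b))
    distance = levenshtein(name_a.lower(), name_b.lower())
    return {
        "name_distance": distance / longest if longest else 0.0,
        "containment_upper_bound": min(card_a, card_b) / card_a,
        "cardinality_proportion": min(card_a, card_b) / max(card_a, card_b),
    }


pair = binary_profile("country_code", 200, "countrycode", 50)
print(pair)
```

Because $|A \cap B| \le \min(|A|, |B|)$, the second feature bounds the true containment from above, which is why it can be computed from unary profiles without a value-level comparison.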
All profiles are compact, dense vectors (typically 50–80 real-valued entries per attribute), yielding per-dataset profile footprints of a few hundred KB even for hundreds of attributes (Nadal et al., 2023, Flores et al., 2020).
| Profile Type | Meta-features | Typical Size |
|---|---|---|
| Unary | Cardinality, distribution, syntax/type, length | 50–80 reals |
| Binary | Name distance, containment, cardinality ratio | 2–3 reals |
Profiling is implemented as SparkSQL operators, allowing extraction to be embarrassingly parallel across attributes and datasets, with near-linear scaling in the number of Spark workers. Profiles are persisted as Parquet or JSON side files.
3. Join-Quality Metrics
NextiaJD introduces a join-quality metric sensitive to both containment and attribute cardinality proportion. This is in contrast to pure value-overlap metrics, which fail to penalize spurious joins between attributes of widely divergent sizes.
Let $A$ and $B$ be attributes with distinct value sets.
- Containment: $C(A, B) = |A \cap B| / |A|$
- Cardinality proportion: $K(A, B) = \min(|A|, |B|) / \max(|A|, |B|)$
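A small worked example shows why containment alone is not enough. The sketch below computes both quantities exactly from value sets, assuming the min/max form of the cardinality proportion; the sample data is invented for illustration.

```python
import string


def containment(a_values, b_values):
    """Fraction of A's distinct values that also appear in B."""
    a, b = set(a_values), set(b_values)
    return len(a & b) / len(a) if a else 0.0


def cardinality_proportion(a_values, b_values):
    """Ratio of the smaller distinct count to the larger one (assumed form)."""
    a, b = set(a_values), set(b_values)
    if not a or not b:
        return 0.0
    return min(len(a), len(b)) / max(len(a), len(b))


# Four country codes versus every two-letter uppercase string (676 values).
countries = ["ES", "FR", "DE", "IT"]
all_codes = [x + y for x in string.ascii_uppercase for y in string.ascii_uppercase]

c = containment(countries, all_codes)
k = cardinality_proportion(countries, all_codes)
print(c, k)
```

Here every country code is contained in the larger attribute, yet the cardinality proportion is below 1%, which flags the pair as a likely spurious join despite perfect containment.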
This yields both a discrete, multi-level metric and a continuous, tunable metric:
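- Discrete: For $L$ quality levels, a pair is assigned the highest level whose containment and cardinality-proportion thresholds it satisfies.
With the minimum number of levels, this reduces to binary classification; larger $L$ yields more granular join-quality levels.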
- Continuous: Fit a truncated bivariate normal CDF to the empirical $(C, K)$ pairs of labeled data; a strictness hyperparameter $s$ controls how demanding the resulting score is, and the distribution parameters are estimated from ground truth (Nadal et al., 2023, Flores et al., 2020).
This approach simultaneously rewards high value-overlap and penalizes granular mismatches, resulting in improved ranking of truly joinable candidate pairs.
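The discrete variant can be sketched as a threshold lookup over $(C, K)$. The threshold values below are illustrative assumptions and are not taken from the text; only the overall shape (ordered levels, each requiring both a minimum containment and a minimum cardinality proportion) follows the description above.

```python
# Assumed, illustrative thresholds: (min containment, min cardinality proportion),
# ordered from the best quality level to the worst.
ASSUMED_THRESHOLDS = [
    (0.75, 0.25),
    (0.50, 0.125),
    (0.25, 1 / 12),
    (0.10, 0.0),
]


def discrete_quality(c, k, thresholds=ASSUMED_THRESHOLDS):
    """Return a quality score in {0, 1/L, ..., 1} for containment c and proportion k."""
    levels = len(thresholds)
    for rank, (min_c, min_k) in enumerate(thresholds):
        if c >= min_c and k >= min_k:
            # The first satisfied entry is the highest level the pair qualifies for.
            return (levels - rank) / levels
    return 0.0


print(discrete_quality(0.8, 0.5))   # high containment, comparable sizes
print(discrete_quality(0.8, 0.01))  # high containment, very different sizes
```

With these assumed thresholds, the second pair is demoted to the lowest non-zero level even though its containment is high, which is the penalizing behavior the metric is designed to provide.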
4. Learning-Based Prediction Pipeline
NextiaJD employs a learning-based pipeline for join prediction and ranking:
- Feature Vector Construction: For each candidate pair $(A, B)$, compute the vector of Z-score normalized absolute differences between the unary meta-features of $A$ and $B$, concatenated with their binary profile features.
- Regression Model: Feed the constructed feature vector into a trained regressor, either a multilayer perceptron (MLP, with one 100-unit ReLU hidden layer and L2 regularization) or, in prior iterations, a chain of one-vs-rest Random Forest classifiers. These models predict the estimated join quality $\hat{Q}$ (Nadal et al., 2023, Flores et al., 2020).
- Prediction and Ranking: Output joinability predictions for all candidate pairs, ranked by decreasing $\hat{Q}$.
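The three steps above can be sketched end to end as follows. This is a simplified illustration under stated assumptions: each unary meta-feature is Z-scored across the whole corpus before taking absolute differences, and `toy_score` is a hypothetical stand-in for the trained regressor, not the MLP or Random Forest models described. All attribute names and numbers are invented.

```python
from statistics import mean, pstdev


def zscore_profiles(profiles):
    """Z-score every unary meta-feature across all attributes (assumed normalization)."""
    names = list(profiles)
    dims = len(profiles[names[0]])
    stats = []
    for d in range(dims):
        column = [profiles[n][d] for n in names]
        stats.append((mean(column), pstdev(column) or 1.0))  # guard zero variance
    return {n: [(profiles[n][d] - stats[d][0]) / stats[d][1] for d in range(dims)]
            for n in names}


def feature_vector(norm_a, norm_b, binary_features):
    """Absolute differences of normalized unary features, then binary features."""
    return [abs(x - y) for x, y in zip(norm_a, norm_b)] + list(binary_features)


def rank_candidates(query, candidates, normalized, binary, score):
    """Score every candidate for `query` and sort by decreasing predicted quality."""
    scored = [(score(feature_vector(normalized[query], normalized[c], binary[(query, c)])), c)
              for c in candidates]
    return [c for _, c in sorted(scored, key=lambda item: item[0], reverse=True)]


def toy_score(features):
    """Hypothetical stand-in model: reward cardinality proportion, penalize distance."""
    unary_diffs, cardinality_proportion = features[:-3], features[-1]
    return cardinality_proportion - sum(unary_diffs) / len(unary_diffs)


# Unary profiles reduced to two features: [uniqueness, entropy].
unary = {"orders.country": [0.10, 2.0], "geo.code": [0.12, 2.1], "users.id": [1.0, 9.0]}
# Binary profiles: [name distance, containment upper bound, cardinality proportion].
binary = {("orders.country", "geo.code"): [0.6, 1.0, 0.9],
          ("orders.country", "users.id"): [0.8, 1.0, 0.01]}

ranking = rank_candidates("orders.country", ["users.id", "geo.code"],
                          zscore_profiles(unary), binary, toy_score)
print(ranking)
```

Swapping `toy_score` for a trained model's prediction function leaves the rest of the pipeline unchanged, since the model only ever sees fixed-length feature vectors.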
The regression model is trained on labeled pairs, with all features globally normalized, and the MLP is evaluated on held-out data. For classification, the five-level model is assessed by precision, recall, and $F_1$ for the “high” class, and by the same metrics after remapping the five levels to a binary decision (Flores et al., 2020). A plausible implication is that the system is robust to dataset diversity and noise.
5. System Architecture and Scalability
NextiaJD comprises three main layers:
- Offline Profiler: Spark-based, parallel extraction of unary profiles, outputting sidecar files per column.
- Online Index and Candidate Selector: Maintains in-memory indices of attribute names and coarse profile statistics. Candidate filtering leverages datatype, name similarity, and simple heuristics to limit downstream scoring calls.
- Predictor and Ranker: Loads required profiles on demand, computes feature vectors for candidates, invokes the trained model, and returns the top-$k$ join candidates (Nadal et al., 2023).
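The candidate-selection layer can be illustrated with a simple pre-filter over an in-memory index. The sketch below is an assumption-laden example: the index layout, the `select_candidates` name, the 0.5 similarity cutoff, and the use of `difflib.SequenceMatcher` as the name-similarity measure are all illustrative choices rather than the system's actual heuristics.

```python
from difflib import SequenceMatcher


def select_candidates(query, index, min_name_similarity=0.5):
    """Keep attributes with the query's datatype and a sufficiently similar name."""
    selected = []
    for attr in index:
        if (attr["table"], attr["name"]) == (query["table"], query["name"]):
            continue  # never pair an attribute with itself
        if attr["datatype"] != query["datatype"]:
            continue  # cheap datatype check first
        similarity = SequenceMatcher(None, query["name"].lower(),
                                     attr["name"].lower()).ratio()
        if similarity >= min_name_similarity:
            selected.append(attr)
    return selected


index = [
    {"table": "geo", "name": "country_code", "datatype": "alphabetic"},
    {"table": "geo", "name": "population", "datatype": "numeric"},
    {"table": "sales", "name": "countrycode", "datatype": "alphabetic"},
    {"table": "users", "name": "email", "datatype": "alphabetic"},
]
query = {"table": "orders", "name": "country_code", "datatype": "alphabetic"}

candidates = select_candidates(query, index)
print([(a["table"], a["name"]) for a in candidates])
```

Only the surviving candidates are passed to the predictor, so the cutoff trades scoring cost against the risk of discarding joinable attributes with dissimilar names.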
Scalability optimizations include:
- Linear scaling of profiling throughput with workers and data size.
- Parquet input yields 4–8× speedups over CSV due to efficient column statistics.
- Candidate filtering reduces regression cost from scoring all $N$ attributes in the repository to scoring only $m$ filtered candidates per query attribute, with $m \ll N$.
- Prediction throughput is 1M attribute-pair evaluations/second/core.
6. Empirical Evaluation
NextiaJD has been benchmarked on several settings:
- GitTables: 1M tables (M rows); MSE = 0.04, MAE = 0.13, Precision@Q0.8 ≈ 0.75, Recall ≈ 0.87.
- Valentine Suite: 15 real-world dataset pairs; NextiaJD’s recall at ground-truth size outperforms all seven schema-matching baselines (instance-only and hybrid), often by 10–20 percentage points.
- Custom Testbeds: XS–L (0–1MB to 1GB per table); Recall@GT reaches up to $0.90$ (vs. $0.50$–$0.70$ for Aurum/D3L/WarpGate baselines).
| System | Precision (%) | Recall (%) | $F_1$ (%) |
|---|---|---|---|
| LSH Ensemble | 53 | 95 | 68 |
| FlexMatcher | 1.3 | 47 | – |
| NextiaJD | 88 | 85 | 86 |
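The last column of the table is consistent with the $F_1$ score, the harmonic mean of precision and recall. As a quick arithmetic check, the snippet below recomputes it for the two rows that report all three values.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported third column %) from the table above.
reported = {
    "LSH Ensemble": (53, 95, 68),
    "NextiaJD": (88, 85, 86),
}

for system, (p, r, expected) in reported.items():
    print(system, round(f1(p, r), 2), expected)
```

The comparison makes the trade-off explicit: LSH Ensemble's higher recall is offset by much lower precision, so its harmonic mean ends up well below NextiaJD's.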
On the largest testbeds, profiling throughput is linear up to 60GB; average profile sizes range from 132KB to 398KB per column depending on dataset size. Candidate scoring rates remain high due to lightweight inference (Nadal et al., 2023, Flores et al., 2020).
7. Complexity and Design Trade-offs
Profile extraction requires a single full scan per column, taking time linear in the number of rows, and is fully parallelizable across columns. Pairwise comparison takes constant time per candidate pair because profiles have a fixed, small size, and is also trivially parallel. Model inference is negligible compared to I/O and is dominated by a few hundred tree or neural-layer operations.
Trade-offs include:
- Profiles compress raw data to 100KB/column but lose exact join statistics; model-based estimation compensates for this loss, providing accurate joinability scores.
- The strictness hyperparameter $s$ tunes the precision–recall balance (higher $s$ implies higher precision and lower recall).
- NextiaJD accepts a minor drop in recall in exchange for a significant gain in precision, with notably fewer false positives than sketching and embedding alternatives (Nadal et al., 2023, Flores et al., 2020).
A plausible implication is that future profile-driven methods could further reduce screening costs via profile-embedding learning or hardware acceleration, though empirical validation at greater scales remains an open area.
References:
- "Measuring and Predicting the Quality of a Join for Data Discovery" (Nadal et al., 2023)
- "Scalable Data Discovery Using Profiles" (Flores et al., 2020)