NextiaJD: Scalable Join Discovery

Updated 17 March 2026

NextiaJD is a distributed, learning-based system for join discovery that profiles attributes and predicts join quality using supervised models.
It leverages succinct unary and binary profiles—capturing metrics like containment and cardinality proportion—to efficiently rank joinable candidate pairs.
The system achieves high precision and linear scalability by integrating Spark-based profiling with efficient candidate filtering and machine learning inference.

NextiaJD is a distributed, learning-based system for scalable join discovery across large, heterogeneous data repositories. It addresses the problem of efficiently and accurately finding joinable attribute pairs among datasets—an operation that underpins data integration, discovery, and analytics at web and enterprise scale. NextiaJD relies on succinct attribute-level profiles and introduces a join quality metric that jointly considers value containment and cardinality proportion, leveraging supervised machine learning for predictive ranking. Its architecture and algorithms are designed for linear scalability and high precision while avoiding the prohibitive costs of traditional value-indexing or sketching techniques (Nadal et al., 2023, Flores et al., 2020).

1. Problem Domain and Limitations of Prior Work

Join discovery at scale requires identifying pairs of attributes from independent datasets that can be meaningfully joined on their values. Traditional solutions based on value-level inverted indices, exact computation of containment or Jaccard similarity, and hash-based sketches (including MinHash, LSH Ensemble, and related methods) become impractical when faced with thousands of tables, billions of rows, and high heterogeneity:

Value-level indexing demands storing and querying potentially millions of distinct values per attribute, leading to excessive memory and I/O requirements.
Hash-based or sketch-based approaches still require enumerating every value to generate signatures, suffer high false-positive rates due to hash collisions, and face scalability bottlenecks in updating and comparing large sets of sketches.
Both classes of methods remain fundamentally limited in precision, especially as scale and heterogeneity increase.

NextiaJD overcomes these barriers by operating entirely at the level of profiled summaries, minimizing the need for value-level interaction and enabling efficient, parallelized discovery (Nadal et al., 2023, Flores et al., 2020).

2. Profile Representations

NextiaJD utilizes two primary classes of profiles: unary (per-attribute) and binary (per-attribute-pair) summaries, which collectively encapsulate information required for accurate joinability assessment.

Unary profile $P_u(A)$ for attribute $A$ comprises meta-features, which include:

Cardinality statistics: number of distinct values $|A|$, uniqueness ($|A|/\text{row\_count}(A)$), incompleteness (fraction of null values), and empirical entropy $-\sum_x p_x\log p_x$.
Value-distribution statistics: distributional moments (mean, min, max, std) of value counts, octiles, constancy ($\text{freq}_\text{max}/\text{row\_count}$), top frequent values, and top 10 Soundex codes.
Syntactic and type patterns: detected datatype (numeric, alphabetic, datetime, etc.), specific type via regex (email, URL, IP, phone, username, phrase), string/word length statistics (min, max, mean, std).
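
A few of the cardinality and distribution meta-features listed above can be sketched in plain Python. This is only an illustration under stated assumptions, not the Spark-based profiler itself: the function name `unary_profile`, the use of `None` for nulls, and the natural logarithm for entropy are choices made here, since the text does not fix a log base.

```python
import math
from collections import Counter

def unary_profile(values):
    """Compute a handful of unary meta-features for one column.

    Hypothetical helper: None marks a null; entropy uses the natural log.
    """
    row_count = len(values)
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    total = len(non_null)
    distinct = len(counts)
    # Empirical entropy over the frequency distribution of non-null values.
    entropy = 0.0
    for c in counts.values():
        p = c / total
        entropy -= p * math.log(p)
    return {
        "cardinality": distinct,
        "uniqueness": distinct / row_count if row_count else 0.0,
        "incompleteness": (row_count - total) / row_count if row_count else 0.0,
        "entropy": entropy,
        # Constancy: frequency of the most common value over the row count.
        "constancy": max(counts.values()) / row_count if counts else 0.0,
    }

profile = unary_profile(["ES", "FR", "ES", None, "DE", "ES"])
```

For this six-row column, three distinct values over six rows give a uniqueness of 0.5, and the single null gives an incompleteness of 1/6.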

Binary profile $P_b(A,B)$ for attribute pair $(A,B)$ includes:

Normalized Levenshtein distance between attribute names.
Containment upper bound: $\min(|A|,|B|)/|A|$.
Cardinality proportion: $\min(|A|,|B|)/\max(|A|,|B|)$.
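
The three binary features above need only the attribute names and the distinct-value counts already stored in the unary profiles. The sketch below illustrates this. Dividing the edit distance by the longer name length is an assumption, since the text does not state how the distance is normalized, and `binary_profile` is a hypothetical name.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def binary_profile(name_a, card_a, name_b, card_b):
    """Binary features for (A, B); card_* are distinct counts, assumed > 0."""
    longest = max(len(name_a), len(name_b))
    return {
        # Assumption: normalize by the longer of the two names.
        "name_distance": levenshtein(name_a, name_b) / longest if longest else 0.0,
        "containment_upper_bound": min(card_a, card_b) / card_a,
        "cardinality_proportion": min(card_a, card_b) / max(card_a, card_b),
    }

bp = binary_profile("country", 200, "country_code", 50)
```

Here $A$ has 200 distinct values and $B$ has 50, so at most 50 of $A$'s values can appear in $B$: the containment upper bound is 0.25 without touching any values.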

All profiles are compact, dense vectors (typically 50–80 real-valued entries per attribute), yielding per-dataset profile footprints of a few hundred KB even for hundreds of attributes (Nadal et al., 2023, Flores et al., 2020).

Profile Type	Meta-features	Typical Size
Unary, $P_u(A)$	Cardinality, distribution, syntax/type, length	50–80 reals
Binary, $P_b(A,B)$	Name distance, containment, cardinality ratio	2–3 reals

Profiling is implemented as SparkSQL operators, allowing extraction to be embarrassingly parallel across attributes and datasets, with near-linear scaling in the number of Spark workers. Profiles are persisted as Parquet or JSON side files.

3. Join-Quality Metrics

NextiaJD introduces a join-quality metric sensitive to both containment and attribute cardinality proportion. This is in contrast to pure value-overlap metrics, which fail to penalize spurious joins between attributes of widely divergent sizes.

Let $A,B$ be attributes with distinct value sets.

Containment: $C(A,B) = \frac{|A\cap B|}{|A|}$
Cardinality proportion: $K(A,B) = \frac{\min(|A|,|B|)}{\max(|A|,|B|)}$

This yields both a discrete, multi-level metric and a continuous, tunable metric:

Discrete: For $L$ levels,

$Q_L(A,B) = \max\left\{i \in\{0,\ldots,L\}: C(A,B)\geq 1-\frac{i}{L} \text{ and } K(A,B)\geq \frac{1}{2^i}\right\}$

For $L=2$, this equates to binary classification; larger $L$ gives more granular join-quality levels.

Continuous: Fit a truncated bivariate normal CDF to the empirical $(C,K)$ of labeled data.

$Q(A,B,s) = \Phi_\text{trunc}\left(\frac{C(A,B)-\mu_C-s}{\sigma_C}\right) \cdot \Phi_\text{trunc}\left(\frac{K(A,B)-\mu_K}{\sigma_K}\right)$

where $s$ is a strictness hyperparameter, and $\mu_\cdot, \sigma_\cdot$ are estimated from ground truth (Nadal et al., 2023, Flores et al., 2020).

This approach simultaneously rewards high value overlap and penalizes cardinality mismatches, resulting in improved ranking of truly joinable candidate pairs.
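
The continuous metric can be sketched with the standard library alone. The code follows the product form of the formula above and reads the term $(C - \mu_C - s)/\sigma_C$ as shifting the mean of the containment component by $s$. Truncating both components to $[0,1]$ is an assumption made here, and the $\mu$ and $\sigma$ values in the example are placeholders rather than parameters fitted in the cited papers.

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def truncated_cdf(x, mu, sigma, lo=0.0, hi=1.0):
    """CDF of a normal(mu, sigma) truncated to [lo, hi] (assumed interval)."""
    x = min(max(x, lo), hi)
    f_lo = phi((lo - mu) / sigma)
    f_hi = phi((hi - mu) / sigma)
    return (phi((x - mu) / sigma) - f_lo) / (f_hi - f_lo)

def containment(a, b):
    """C(A,B) = |A ∩ B| / |A| over sets of distinct values."""
    return len(a & b) / len(a)

def cardinality_proportion(a, b):
    """K(A,B) = min(|A|,|B|) / max(|A|,|B|)."""
    return min(len(a), len(b)) / max(len(a), len(b))

def continuous_quality(a, b, s, mu_c, sigma_c, mu_k, sigma_k):
    """Product of the two truncated CDFs; s shifts the containment mean."""
    c = containment(a, b)
    k = cardinality_proportion(a, b)
    return truncated_cdf(c, mu_c + s, sigma_c) * truncated_cdf(k, mu_k, sigma_k)

a = {1, 2, 3, 4}
b = {3, 4, 5, 6, 7, 8, 9, 10}
params = dict(mu_c=0.5, sigma_c=0.25, mu_k=0.5, sigma_k=0.25)  # placeholders
lenient = continuous_quality(a, b, s=0.0, **params)
strict = continuous_quality(a, b, s=0.2, **params)
```

With these placeholder parameters the same pair scores lower under `s=0.2` than under `s=0.0`, which is the sense in which $s$ acts as a strictness control.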

4. Learning-Based Prediction Pipeline

NextiaJD employs a learning-based pipeline for join prediction and ranking:

Feature Vector Construction: For each $(A,B)$, compute the vector of Z-score normalized absolute differences between unary meta-features of $A$ and $B$, concatenated with their binary profile features.
Regression Model: Feed the constructed feature vector into a trained regressor, either a multilayer perceptron (MLP, with one 100-unit ReLU hidden layer and L2 regularization $\alpha = 10^{-4}$) or, in prior iterations, a chain of one-vs-rest Random Forest classifiers. These models predict $\hat{Q}(A,B)$, the estimated join quality (Nadal et al., 2023, Flores et al., 2020).
Prediction and Ranking: Output joinability predictions for all candidate pairs, ranking by decreasing $\hat{Q}$ .
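
The feature construction and ranking steps above can be sketched as follows. The sketch assumes the absolute differences are taken first and then Z-score normalized with per-feature means and standard deviations estimated on the training pairs, which is one reading of the description. The names `feature_vector` and `rank_candidates` are hypothetical, and the predicted scores in the example are made up, standing in for the trained regressor.

```python
def feature_vector(unary_a, unary_b, binary_ab, dist_mean, dist_std):
    """Z-score normalized absolute differences, followed by binary features.

    dist_mean / dist_std: per-feature statistics of the absolute differences,
    assumed to come from the training pairs.
    """
    features = []
    for i, (xa, xb) in enumerate(zip(unary_a, unary_b)):
        diff = abs(xa - xb)
        std = dist_std[i] if dist_std[i] > 0 else 1.0  # guard constant features
        features.append((diff - dist_mean[i]) / std)
    return features + list(binary_ab)

def rank_candidates(scored_pairs):
    """Sort (pair, predicted_quality) tuples by decreasing predicted quality."""
    return sorted(scored_pairs, key=lambda item: item[1], reverse=True)

x = feature_vector(
    unary_a=[200, 0.5, 0.9],
    unary_b=[50, 0.4, 0.7],
    binary_ab=[0.42, 0.25, 0.25],
    dist_mean=[100, 0.2, 0.3],
    dist_std=[50, 0.1, 0.2],
)
# Made-up scores in place of model output.
ranking = rank_candidates([(("A", "B"), 0.30), (("A", "C"), 0.90)])
```

The resulting vector has one entry per unary meta-feature plus the binary features, so its length grows with the profile size rather than with the number of rows or distinct values.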

The regression model is trained on labeled pairs, with all features globally normalized. On held-out data, the MLP achieves $R^2 \approx 0.88$. In the five-level classification setting, the “high” class reaches precision $\approx 98\%$ with recall $\approx 51\%$ ($F_1 \approx 67\%$); remapped to binary, precision is $\approx 88\%$ and recall $\approx 85\%$ ($F_1 \approx 86\%$) (Flores et al., 2020). A plausible implication is that the system is robust to dataset diversity and noise.
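
As a quick arithmetic check, the reported $F_1$ values follow from the reported precision and recall via the harmonic mean. The snippet below only recomputes them from the figures quoted above; it does not reproduce any experiment.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

f1_high_class = f1_score(0.98, 0.51)  # five-level setting, "high" class
f1_binary = f1_score(0.88, 0.85)      # binary remapping
```

Both results round to the percentages quoted in the text (67% and 86%).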

5. System Architecture and Scalability

NextiaJD comprises three main layers:

Offline Profiler: Spark-based, parallel extraction of unary profiles, outputting sidecar files per column.
Online Index and Candidate Selector: Maintains in-memory indices of attribute names and coarse profile statistics. Candidate filtering leverages datatype, name similarity, and simple heuristics to limit downstream scoring calls.
Predictor and Ranker: Loads required profiles on demand, computes feature vectors for candidates, invokes the trained model, and returns top-$K$ join candidates (Nadal et al., 2023).
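
The interplay between the candidate selector and the ranker can be illustrated with a small in-memory sketch. Everything here is hypothetical: the index layout, the function names, the 0.5 threshold, and the use of `difflib.SequenceMatcher` as a stand-in for name similarity are assumptions for illustration, and the `predict` callable with made-up scores stands in for the trained model.

```python
from difflib import SequenceMatcher

def select_candidates(query, index, min_name_similarity=0.5):
    """Keep attributes from other datasets with the same datatype and a similar name."""
    selected = []
    for attr in index:
        if attr["dataset"] == query["dataset"]:
            continue  # only consider joins across datasets
        if attr["datatype"] != query["datatype"]:
            continue  # cheap datatype filter
        sim = SequenceMatcher(None, query["name"].lower(), attr["name"].lower()).ratio()
        if sim >= min_name_similarity:
            selected.append(attr)
    return selected

def top_k(query, candidates, predict, k):
    """Score the surviving candidates and return the k best (attribute, score) pairs."""
    scored = [(c, predict(query, c)) for c in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]

query = {"dataset": "sales", "name": "country_code", "datatype": "string"}
index = [
    {"dataset": "geo", "name": "country", "datatype": "string"},
    {"dataset": "geo", "name": "population", "datatype": "numeric"},
    {"dataset": "hr", "name": "employee_id", "datatype": "string"},
    {"dataset": "ref", "name": "countrycode", "datatype": "string"},
]
candidates = select_candidates(query, index)

fake_scores = {"country": 0.62, "countrycode": 0.91}  # made-up model output
best = top_k(query, candidates, lambda q, c: fake_scores[c["name"]], k=1)
```

Only the attributes that pass the cheap filters reach the model, which is what keeps the number of scoring calls per query attribute small.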

Scalability optimizations include:

Linear scaling of profiling throughput with workers and data size.
Parquet input yields 4–8× speedups over CSV due to efficient column statistics.
Candidate filtering reduces regression cost from $O(N)$ to $O(N')$ per query attribute, with $N' \ll N$.
Prediction throughput is $\sim$1M attribute-pair evaluations/second/core.

6. Empirical Evaluation

NextiaJD has been benchmarked on several settings:

GitTables: 1M tables ($\leq 200$M rows); MSE = 0.04, MAE = 0.13, precision for $Q > 0.8$ ≈ 0.75, recall ≈ 0.87.
Valentine Suite: 15 real-world dataset pairs; NextiaJD’s recall at ground-truth size exceeds that of all seven schema-matching baselines (instance-only and hybrid), often by 10–20 percentage points.
Custom Testbeds: XS–L (0–1MB to $>$1GB per table); for the top $K=5$ results, Recall@GT $\approx 0.80$–$0.90$ (vs. $0.50$–$0.70$ for Aurum/D3L/WarpGate baselines).

System	Precision (%)	Recall (%)	$F_1$ (%)
LSH Ensemble	53	95	68
FlexMatcher	1.3	47	–
NextiaJD	88	85	86

On the largest testbeds, profiling throughput is linear up to 60GB; average profile sizes range from 132KB to 398KB per column depending on dataset size. Candidate scoring rates remain high due to lightweight inference (Nadal et al., 2023, Flores et al., 2020).

7. Complexity and Design Trade-offs

Profile extraction requires a single full scan per column, with $O(\mathrm{rows} + \mathrm{distinct})$ time per column, and is fully parallelizable. Pairwise comparison is $O(d)$ per candidate pair ($d \sim 100$) and is also trivially parallel. Model inference cost is negligible compared to I/O: it amounts to a few hundred tree or neural-layer operations.

Trade-offs include:

Profiles compress raw data to $\sim$100KB/column but lose exact join statistics; model-based estimation compensates for this loss, providing accurate joinability scores.
Strictness hyperparameter $s$ tunes the precision–recall balance (higher $s$ implies higher precision, lower recall).
NextiaJD accepts a minor drop in recall in exchange for a significant gain in precision, notably producing fewer false positives than sketching and embedding alternatives (Nadal et al., 2023, Flores et al., 2020).

A plausible implication is that future profile-driven methods could further reduce screening costs via profile-embedding learning or hardware acceleration, though empirical validation at greater scales remains an open area.

References:

"Measuring and Predicting the Quality of a Join for Data Discovery" (Nadal et al., 2023)
"Scalable Data Discovery Using Profiles" (Flores et al., 2020)
