Semantic Versioning in Databases Benchmark
- Semantic Versioning in Databases is a systematic approach to identifying dataset versions that differ only by semantics-preserving transformations.
- The SDVB benchmark standardizes evaluation across datasets with metrics such as validity, separation score, and generalizability.
- Methods such as SAVeD and Explain-Da-V demonstrate enhanced accuracy in both version discovery and programmatic transformation explanation.
Semantic versioning in databases describes the systematic identification, detection, and explanation of when two structured datasets are distinct “versions” of the same underlying content—differing by semantics-preserving transformations rather than in their real-world subject matter. The field addresses persistent challenges in data management and analysis, where analysts and systems must distinguish between truly new datasets and derivations produced by processes like column reordering, encoding changes, or row subsampling. To systematically benchmark solutions for semantic version discovery and explanation, recent work has introduced standardized evaluation datasets, metrics, and tasks focused on semantic content equivalence and the interpretability of data transformations (Frenk et al., 21 Nov 2025, Shraga et al., 2023).
1. Formal Problem Definition
A structured dataset is modeled as a flat relational table $D = \langle A, T \rangle$, where $A$ is a set of attributes and $T$ is a set of tuples. A new version $D'$ may result from semantics-preserving transformations applied to $D$. Attribute and tuple correspondences between $D$ and $D'$ can be formalized via matching relations and identically-valued primary keys. Semantic versioning involves not simply recording these changes, but explaining them: identifying programmatic transformations that map subsets of $D$ (“origins”) to deltas in $D'$ (“goals”). More formally, given unmatched attribute or tuple sets, an explanation is a transformation that precisely describes how the relation of an origin in $D$ maps to that of a goal datum in $D'$ (Shraga et al., 2023).
A positive version pair $(D, D')$ exists if $D' = f(D)$ for some composition $f$ of allowed semantics-preserving transformations. Non-version pairs include cross-dataset pairs and pairs below a semantic-similarity threshold, even if derived from the same seed table (Frenk et al., 21 Nov 2025).
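As a concrete illustration, the following minimal pandas sketch derives a positive version pair by composing three such transformations; the toy table, the chosen operations, and all names are illustrative rather than drawn from the benchmark's actual generation scripts.

```python
import pandas as pd

# Original table D (toy data, not from the benchmark).
D = pd.DataFrame({
    "species": ["setosa", "versicolor", "setosa", "virginica"],
    "petal_len": [1.4, 4.7, 1.3, 6.0],
})

# f = column rename, then dummy encoding, then row subsampling:
# D' = f(D) keeps the same real-world content, so (D, D') is a positive version pair.
D_prime = (
    D.rename(columns={"petal_len": "petal_length_cm"})   # vertical (attribute) change
     .pipe(pd.get_dummies, columns=["species"])          # categorical -> indicator columns
     .sample(frac=0.75, random_state=0)                  # horizontal (tuple) change
)
print(D_prime)
```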
2. The Semantic Versioning in Databases Benchmark (SDVB)
The Semantic Versioning in Databases Benchmark (“SDVB”; Editor’s term) is a rigorously curated set of evaluation suites designed to test both version-discovery and transformation-explanation algorithms. The SDVB comprises five canonical datasets—IMDB, IRIS, NBA, TITANIC, and WINE_small—each provided with 20–50 artificially derived versions using preprocessing steps like dummy encoding, missing value injection, and column renaming under documented version lineages (Frenk et al., 21 Nov 2025). For each possible pair of tables, SDVB specifies whether the pair constitutes a semantic version or not, based on the ability to derive one via allowed internal transformations.
Variants include:
| Domain | Rows ($\lvert T \rvert$) | Attrs ($\lvert A \rvert$) | #Versions | #Version-pairs |
|-----------|-----------|----------|-----------|-----------------|
| IMDB | 1,000 | 6 | 72 | 29 |
| NBA | 11,700 | 9 | 68 | 27 |
| WINE | 129,971 | 6 | 72 | 29 |
| IRIS | 150 | 5 | 58 | 22 |
| TITANIC | 891 | 6 | 72 | 29 |
Each version set includes both vertical (attribute) and horizontal (tuple) changes, simulating real-world data wrangling pipelines (Shraga et al., 2023). The benchmark also facilitates hold-out evaluation by applying the same modifications to a reserved subset of each table to test generalizability.
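The hold-out construction can be pictured with the short sketch below; the split fraction, the example transformation, and the toy table are hypothetical and only meant to show the "same modification on both splits" idea.

```python
import pandas as pd

def split_and_transform(table: pd.DataFrame, transform, holdout_frac: float = 0.2, seed: int = 0):
    """Reserve a hold-out subset, then apply the identical modification to both
    splits so that generalizability of learned explanations can be tested later."""
    holdout = table.sample(frac=holdout_frac, random_state=seed)
    main = table.drop(holdout.index)
    return transform(main), transform(holdout)

def example_transform(df: pd.DataFrame) -> pd.DataFrame:
    # Illustrative edit: rename one column and add a derived attribute.
    out = df.rename(columns={"fare": "fare_usd"})
    out["fare_rounded"] = out["fare_usd"].round(0)
    return out

titanic_like = pd.DataFrame({"fare": [7.25, 71.28, 8.05, 53.10, 8.46],
                             "age": [22, 38, 26, 35, 27]})
main_version, holdout_version = split_and_transform(titanic_like, example_transform)
print(main_version.shape, holdout_version.shape)
```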
3. Methods for Semantic Version Discovery
3.1 Fully Unsupervised Embedding-Based Discovery (SAVeD)
SAVeD (Semantically Aware Version Detection) employs self-supervised contrastive learning to learn semantic equivalence between table versions without the need for metadata or explicit labels (Frenk et al., 21 Nov 2025). The core architecture extends SimCLR’s contrastive paradigm to tabular data:
- Table-specific augmentations: Eight operations mimic data-science edits—e.g., random column dropout, various categorical encodings, missing-value injection, column/row shuffling, row dropping, and Gaussian jitter.
- Linearization and tokenization: Each augmented table view is serialized as a token sequence using byte-pair encoding (BPE) with a 12,000-token vocabulary.
- Transformer encoder: The tokenized view is embedded by a multi-layer, multi-head transformer encoder, followed by mean-pooling and a projection head into the final embedding space.
- Contrastive loss: The NT-Xent loss minimizes distance between augmentations of the same table and maximizes it between unrelated tables.
This pipeline enables the separation of versioned pairs in the learned latent space, as measured by either threshold-based classification or the margin between means of versioned and non-versioned similarities.
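A minimal PyTorch sketch of the SimCLR-style NT-Xent objective over table embeddings is given below; the encoder is stubbed out with random vectors, and the temperature, batch size, and embedding dimension are illustrative placeholders rather than SAVeD's actual settings.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """NT-Xent loss: z1[i] and z2[i] embed two augmented views of the same table;
    every other row in the batch serves as a negative."""
    batch = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)        # (2B, d), unit norm
    sim = z @ z.t() / temperature                             # scaled cosine similarities
    mask = torch.eye(2 * batch, dtype=torch.bool)
    sim = sim.masked_fill(mask, float("-inf"))                # exclude self-pairs
    # The positive for index i is i + B (and vice versa).
    targets = torch.cat([torch.arange(batch) + batch, torch.arange(batch)])
    return F.cross_entropy(sim, targets)

# Stand-ins for transformer outputs of two augmented views of 8 tables.
z_view_a = torch.randn(8, 128)
z_view_b = torch.randn(8, 128)
print(nt_xent(z_view_a, z_view_b).item())
```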
3.2 Programmatic Transformation Explanation (Explain-Da-V)
Explain-Da-V discovers, for each table pair , explicit transformation programs or models mapping unmatched portions of to (Shraga et al., 2023). The method decomposes the explanation problem by data type (numeric, categorical, textual), using subroutines such as ridge/lasso regression, decision trees, or program synthesis (e.g., via Foofah+) to construct candidate explanations. The best candidate is selected by maximizing validity—fraction of exactly recovered values—and, if tied, explainability (a function of model conciseness and “concentration” over conceptual chunks).
This approach provides not only version detection but also human-interpretable provenance for changes, supporting explainability and facilitating transformation audits.
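To make the selection criterion concrete, the following hedged sketch applies the "fit candidate explanations, keep the most valid" idea to a numeric goal attribute, with two scikit-learn regressors standing in for Explain-Da-V's much richer candidate set; the data, tolerance, and candidate names are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
origin = rng.uniform(0, 10, size=(200, 2))      # matched origin attributes
goal = 1.8 * origin[:, 0] + 32.0                # unmatched goal attribute (e.g., Celsius -> Fahrenheit)

def validity(pred: np.ndarray, target: np.ndarray, tol: float = 1e-4) -> float:
    """Fraction of goal values recovered exactly (up to a numeric tolerance)."""
    return float(np.mean(np.isclose(pred, target, atol=tol)))

# Two numeric subroutines stand in for the larger candidate set
# (ridge/lasso regression, decision trees, program synthesis).
candidates = {
    "ridge_regression": Ridge(alpha=1e-3),
    "shallow_tree": DecisionTreeRegressor(max_depth=3, random_state=0),
}
scores = {name: validity(model.fit(origin, goal).predict(origin), goal)
          for name, model in candidates.items()}

best = max(scores, key=scores.get)   # ties would be broken by an explainability score
print(best, scores)
```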
4. Benchmark Evaluation Metrics
Evaluation in SDVB is standardized along several axes:
- Validation Accuracy (TPR): The fraction of correctly classified version/non-version pairs on a hold-out set, using a cosine-similarity threshold in embedding space.
- Separation Score: Defined as the difference $\bar{s}_{\mathrm{ver}} - \bar{s}_{\mathrm{non}}$, where $\bar{s}_{\mathrm{ver}}$ and $\bar{s}_{\mathrm{non}}$ are the average similarities for versioned and non-versioned table pairs, respectively (both discovery metrics are illustrated in the sketch after this list).
- Validity: For transformation explanations, the exact match rate between the predicted and goal attribute/tuple values.
- Generalizability: The validity of the learned transformation when applied to a held-out subset of the data, modified using the same procedure as in training.
- Explainability: A weighted combination of conciseness (inverse number of model components) and concentration (inverse number of conceptual “chunks” used by the transformation).
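As referenced above, the two discovery-side metrics can be computed with a few lines of NumPy, assuming cosine similarities and ground-truth labels for a set of hold-out pairs are already available; the threshold and the toy values are illustrative.

```python
import numpy as np

def validation_accuracy(sims, labels, threshold: float = 0.8) -> float:
    """Fraction of pairs classified correctly: predict 'version' when the
    embedding cosine similarity exceeds the threshold."""
    preds = np.asarray(sims) >= threshold
    return float(np.mean(preds == np.asarray(labels, dtype=bool)))

def separation_score(sims, labels) -> float:
    """Difference between the mean similarity of versioned pairs and that of
    non-versioned pairs."""
    sims = np.asarray(sims)
    labels = np.asarray(labels, dtype=bool)
    return float(sims[labels].mean() - sims[~labels].mean())

sims = [0.95, 0.91, 0.40, 0.25]   # cosine similarities of four table pairs
labels = [1, 1, 0, 0]             # 1 = ground-truth version pair
print(validation_accuracy(sims, labels), separation_score(sims, labels))
```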
A summary of performance of leading methods on SDVB is as follows (Frenk et al., 21 Nov 2025, Shraga et al., 2023):
| Method | Accuracy or Validity / Gen. (IRIS) | Separation (IRIS) | #Params |
|---|---|---|---|
| Starmie | 17.1 / — | 0.0030 | 124M |
| Discover-GPT (LLaMA2) | 94.4 / — | 0.4871 | ∼7B |
| SAVeD | 90.6 / — | 0.8010 | 14M |
| Explain-Da-V | 0.93 / 0.83 | — | — |
| Foofah (baseline) | 0.23 / 0.23 | — | — |
SAVeD achieves competitive accuracy and substantially stronger separation while using two orders of magnitude fewer parameters than LLM pipelines. Explain-Da-V demonstrates gains of 140–250% in validity and 110–220% in generalizability over prior baselines, and achieves per-attribute explanation times of roughly 2.4 seconds, scaling to hundreds of attributes and tens of thousands of rows (Shraga et al., 2023).
5. Principal Experimental Findings and Analysis
SAVeD significantly improves over existing version-discovery embeddings—with over 70 percentage point accuracy gains relative to “Starmie” baselines, and better or comparable accuracy to 7B-parameter LLMs at a fraction of the computational cost. Separation margins as high as 0.8 indicate that semantically-equivalent datasets are well-clustered in the learned latent space (Frenk et al., 21 Nov 2025).
Explain-Da-V outperforms prior program-synthesis and pipeline-explanation methods (unmodified Foofah, Foofah+, Auto-pipeline*) in both coverage and validity, particularly for numeric, categorical, and textual transformation goals. Ablation studies confirm the importance of feature-engineering and origin-detection: omitting feature-extensions or functional-dependency pruning sharply degrades performance (Shraga et al., 2023).
6. Limitations and Future Directions
Current benchmarks and methods have several explicit limitations:
- SAVeD is restricted to single-table transformations and does not cover cross-table joins or external enrichments.
- Neither SDVB nor the Explain-Da-V benchmark natively supports pivot/unpivot operations, more complex reshaping, or synthesis over combined datasets.
- Extremely large or long tables may exceed model capacity (e.g., >1,000 token sequences in SAVeD).
- Some classes of explanations (e.g., “count commas + 1” for genre-count attributes) remain out of scope for Explain-Da-V.
Future research may incorporate hierarchical or graph-based models for multi-table lineage, new augmentations simulating table joins, and hybrid strategies that combine unsupervised version discovery with explicit version-control metadata. A plausible implication is that zero-shot clustering of dataset versions may require dynamic similarity thresholds, and that explainability-guided discovery remains an important but open research axis (Frenk et al., 21 Nov 2025, Shraga et al., 2023).
7. Significance and Impact
Semantic versioning in databases, operationalized through robust benchmarks like SDVB and methods such as SAVeD and Explain-Da-V, addresses the longstanding challenge of tracking and interpreting reproducibility and provenance in modern data analytics. By anchoring the semantics of “version” in formalized transformations and rigorous discriminative tasks, these efforts provide both a principled evaluation standard and high-utility tools for data lake management, collaborative analysis, and automated discovery. The empirical advances suggest that purely content-based, model-driven approaches can match or exceed the reliability of metadata-centric or rule-based pipelines, representing a substantial evolution in the management of dataset version histories (Frenk et al., 21 Nov 2025, Shraga et al., 2023).