Survive the Schema Changes: Integration of Unmanaged Data Using Deep Learning (2010.07586v1)

Published 15 Oct 2020 in cs.DB and cs.LG

Abstract: Data is the king in the age of AI. However data integration is often a laborious task that is hard to automate. Schema change is one significant obstacle to the automation of the end-to-end data integration process. Although there exist mechanisms such as query discovery and schema modification language to handle the problem, these approaches can only work with the assumption that the schema is maintained by a database. However, we observe diversified schema changes in heterogeneous data and open data, most of which has no schema defined. In this work, we propose to use deep learning to automatically deal with schema changes through a super cell representation and automatic injection of perturbations to the training data to make the model robust to schema changes. Our experimental results demonstrate that our proposed approach is effective for two real-world data integration scenarios: coronavirus data integration, and machine log integration.

Citations (3)

Summary

  • The paper introduces a deep learning pipeline that automates integrating unmanaged data by predicting target mappings through a novel super cell representation and adversarial perturbations.
  • It leverages sequence models and transformer architectures, achieving accuracies up to 99.8% in experiments on diverse datasets including COVID-19 and machine logs.
  • Automated training data creation and controlled perturbations significantly reduce manual effort and enhance robustness against frequent, unpredictable schema changes.

The paper "Survive the Schema Changes: Integration of Unmanaged Data Using Deep Learning" (Survive the Schema Changes: Integration of Unmanaged Data Using Deep Learning, 2020) addresses the significant challenge of data integration for unmanaged, fast-evolving data sources like publicly available CSV, JSON, HTML files, or real-time sensor data. Unlike data managed in traditional databases with defined schemas and versioning mechanisms, unmanaged data often lacks explicit schema definitions, and its format can change frequently and unpredictably. These schema changes (e.g., attribute name changes, additions/removals, type changes, key changes) break traditional hardcoded data integration scripts (like Python/Pandas code), requiring substantial manual effort from data scientists.

The authors propose a novel deep learning-based pipeline to automate this process and make it robust to schema changes without requiring human intervention during schema evolution events. The core ideas involve formulating the data integration task as a prediction problem, developing a flexible data representation, and employing adversarial training techniques to handle schema variations.

Key Concepts and Implementation Details:

  1. Super Cell Representation:
    • To represent heterogeneous and evolving data, the paper introduces the "super cell" concept. A super cell is a group of related data items (cells, i.e., attribute values) within a source object (like a row or JSON object) that are typically processed together in the integration task.
    • This representation balances the granularity between the fine-grained cell-level (which would require too many predictions) and the coarse-grained object-level (which might not be expressive enough for attribute-specific transformations).
    • A source super cell is represented as a triplet $(\vec{key}_{ij}, \vec{attribute}_{ij}, \vec{value}_{ij})$: vectors for the shared keys, attribute names, and values within the super cell.
    • The target position for a super cell is represented as a list of triples $\{(\vec{key^T}_{ij}, \vec{attribute^T}_{ij}, agg\_mode)\}$, indicating where the super cell's values should be mapped in the target tabular dataset, the corresponding target attribute names, and an aggregation mode (e.g., sum, avg, replace, discard) for values mapped to the same position.
    • This formulation allows treating the integration task as a prediction problem: given a source super cell, predict its target position(s) and aggregation mode (a minimal representation sketch follows this list).
  2. Handling Schema Changes via Perturbation:
    • The paper identifies common schema changes in unmanaged data (domain pivoting, key expansion, attribute name/ordering change, value type/format change, attribute addition/removal).
    • These changes are viewed as injecting noise or obfuscations into the data, similar to adversarial attacks in machine learning.
    • To make the deep learning model robust to these changes, the authors propose adding specially designed perturbations to the training data (see the perturbation sketch after this list).
    • For changes like attribute renaming and value format variations, perturbations are added by replacing words (attribute names, value tokens) with randomly changed words or synonyms (from sources like Google Knowledge Graph or a self-coded dictionary).
    • A character-based embedding (like FastText) is used instead of word-based embeddings. This helps the model recognize similarity between words with minor variations or out-of-vocabulary terms resulting from schema changes.
    • Key expansion (where a source tuple splits into multiple target tuples) is handled by the 'aggregation mode' label, allowing the model to predict how values from expanded keys should be combined in the target.
  3. Automated Training Data Preparation:
    • Recognizing that manual data annotation for training is a bottleneck, the paper proposes automating training data creation from an initial version of the user's data integration code (e.g., a Python script for the first version of the data).
    • This code is translated into an Intermediate Representation (IR), like Lachesis IR (Lachesis: Automatic Partitioning for UDF-Centric Analytics, 2020). By analyzing the IR (a DAG of operators like map, join, aggregate), the system can understand the data flow, identify super cells, and generate $\langle feature, label \rangle$ pairs corresponding to the super cell mappings, which form the base training data.
    • Perturbations are then automatically injected into this base training data.
    • For unstructured data where initial parsing code is opaque, the paper discusses crowdsourcing (found challenging for identifying keys and domain knowledge) and model reuse (using LSH-based similarity to find and adapt models trained for similar tasks in a ModelHub) as potential ways to reduce human effort.
  4. Model Architectures and Training:
    • The predictive task (mapping source super cells to target positions/aggregation modes) is implemented using deep learning models.
    • The paper evaluates two types:
      • Sequence Models (Bi-LSTM): A simpler, compact model with a local character-based embedding layer, Bi-LSTM layer, and a fully-connected layer.
      • Transformer Models (GPT-2, BERT): More complex, larger pre-trained models (GPT-2 small, BERT base) used as backends, connected to a CNN + fully-connected frontend classifier. Parameters of the pre-trained models are frozen during training; only the frontend parameters are updated (a rough sketch of this frozen-backbone setup follows this list).
    • Training involves minimizing the loss for predicting target keys, attributes, and aggregation modes based on the source super cell features.
    • Transformer models converge in fewer epochs but are significantly larger and require more resources per epoch and per inference than the Bi-LSTM, which needs more epochs to converge.
  5. Inference and Assembling:
    • During inference, the trained model takes a source super cell and predicts its target position(s) and aggregation mode(s).
    • A data assembler then takes these predictions and constructs the target tabular dataset by placing the super cell values in the predicted positions, applying the specified aggregation mode if multiple values map to the same target cell (a toy assembling sketch follows this list).
    • The assembler can work locally or dispatch results to distributed workers. Assembling latency is affected by the number of super cells and target table size.
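
The first sketch below illustrates the super cell formulation from item 1 as plain Python structures; the field names and the toy COVID-style record are assumptions for illustration, not the paper's actual data structures.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SuperCell:
    """A group of related cells from one source object (e.g., one CSV row)."""
    key: List[str]        # shared key values, e.g. ["US", "2020-10-15"]
    attribute: List[str]  # source attribute names, e.g. ["Confirmed", "Deaths"]
    value: List[str]      # the cell values themselves, e.g. ["7894", "216"]

@dataclass
class TargetPosition:
    """Predicted placement of a super cell in the target tabular dataset."""
    target_key: List[str]        # key of the target tuple
    target_attribute: List[str]  # target column names
    agg_mode: str                # "sum" | "avg" | "replace" | "discard"

# The integration task becomes a prediction problem:
# given a SuperCell, predict one or more TargetPositions.
cell = SuperCell(key=["US", "2020-10-15"],
                 attribute=["Confirmed", "Deaths"],
                 value=["7894", "216"])
```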
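
Next, a minimal sketch of the perturbation idea from item 2, assuming a small hand-built dictionary of synonyms and format variants (the entries below are made up): attribute names and value tokens in training examples are randomly rewritten so the model sees schema-change-like noise during training.

```python
import random

# Hypothetical self-coded dictionary of synonyms / format variants.
VARIANTS = {
    "Country/Region": ["Country_Region", "country", "region_name"],
    "Confirmed": ["confirmed_cases", "total_confirmed", "cases"],
    "2020-10-15": ["10/15/2020", "15 Oct 2020", "20201015"],
}

def perturb(tokens, prob=0.3, rng=random.Random(42)):
    """Randomly replace tokens with a known variant to simulate schema changes."""
    out = []
    for tok in tokens:
        variants = VARIANTS.get(tok)
        out.append(rng.choice(variants) if variants and rng.random() < prob else tok)
    return out

# Each base training example can be expanded into several perturbed copies.
augmented = [perturb(["Country/Region", "Confirmed", "2020-10-15"]) for _ in range(5)]
```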
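
For the transformer variant in item 4, a rough PyTorch sketch of the frozen-backbone setup is given below; it assumes the Hugging Face transformers GPT-2 implementation and made-up label-space sizes, and does not reproduce the paper's exact frontend architecture or hyperparameters.

```python
import torch
import torch.nn as nn
from transformers import GPT2Model  # assumed GPT-2 backbone implementation

class FrozenBackboneClassifier(nn.Module):
    def __init__(self, n_positions: int, n_agg_modes: int = 4):  # sum/avg/replace/discard
        super().__init__()
        self.backbone = GPT2Model.from_pretrained("gpt2")
        for p in self.backbone.parameters():
            p.requires_grad = False                       # backbone stays frozen
        hidden = self.backbone.config.n_embd
        self.conv = nn.Conv1d(hidden, 128, kernel_size=3, padding=1)
        self.position_head = nn.Linear(128, n_positions)  # target-position logits
        self.agg_head = nn.Linear(128, n_agg_modes)       # aggregation-mode logits

    def forward(self, input_ids, attention_mask):
        h = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        h = torch.relu(self.conv(h.transpose(1, 2))).mean(dim=-1)  # CNN + pooling frontend
        return self.position_head(h), self.agg_head(h)
```

Freezing the backbone keeps training cheap: only the small CNN and linear heads receive gradient updates, which matches the summary's point that transformer variants converge in few epochs.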
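
Finally, a toy version of the assembling step from item 5: predicted (target key, target attribute, value, aggregation mode) entries are placed into a dictionary-backed table, with collisions resolved by the predicted aggregation mode. The function name and in-memory layout are assumptions for illustration, and values are assumed numeric for sum/avg.

```python
from collections import defaultdict

def assemble(predictions):
    """predictions: iterable of (target_key, target_attribute, value, agg_mode)."""
    table = defaultdict(dict)   # target_key -> {attribute: value}
    counts = defaultdict(int)   # per-cell counts, used for running averages
    for key, attr, value, agg_mode in predictions:
        cell = (key, attr)
        if agg_mode == "discard":
            continue
        elif agg_mode == "replace" or attr not in table[key]:
            table[key][attr] = value
            counts[cell] = 1
        elif agg_mode == "sum":
            table[key][attr] += value
        elif agg_mode == "avg":
            n = counts[cell]
            table[key][attr] = (table[key][attr] * n + value) / (n + 1)
            counts[cell] = n + 1
    return table

rows = assemble([(("US", "2020-10-15"), "Confirmed", 7894.0, "sum"),
                 (("US", "2020-10-15"), "Confirmed", 120.0, "sum")])
```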

Experimental Results and Practical Implications:

  • Experiments on COVID-19 data integration (tabular/semi-structured CSV) and machine log integration (unstructured text) demonstrate the effectiveness of the approach.
  • Accuracy: Transformer models (GPT-2, BERT) generally achieve higher accuracy (up to 99.8%) compared to Bi-LSTM (up to 96.6% for COVID-19, 91.2% for machine logs), especially for the challenging unstructured machine log data (transformers up to 99.7% vs. Bi-LSTM 82.1% for smallest granularity).
  • Performance vs. Granularity: Increasing super cell granularity (fewer super cells per target tuple) significantly reduces training time, inference time, and assembling time, but can slightly decrease accuracy, especially for transformer models. This highlights a trade-off depending on resource constraints and required accuracy.
  • Perturbations Effectiveness: Ablation studies show that perturbations, particularly using a customized dictionary for synonyms/variations (like date formats, region abbreviations), significantly improve robustness against schema changes compared to no perturbations or using general synonyms. Character-based embedding is superior to word-based embedding for handling variations.
  • Human Productivity: The deep learning pipeline can seamlessly handle schema changes that would break traditional code, eliminating system downtime and manual debugging efforts, which saves significant human time and makes the integration process predictable.
  • Comparison to Manual: While a manual Python script might be faster for integrating a single, stable data version on simple hardware, the deep learning approach is robust to schema changes and can leverage GPUs to potentially outperform manual pipelines involving complex pre-processing (like filtering large files) when dealing with frequently changing, high-volume data.

In summary, the paper provides a practical deep learning framework for handling the persistent problem of schema evolution in unmanaged data integration. By using a flexible super cell representation, injecting schema-change-inspired perturbations during training, and automating training data generation from initial code, the approach allows for uninterrupted data integration, significantly improving productivity compared to traditional manual methods, especially when dealing with diverse and fast-evolving open data sources. The choice of model architecture depends on the trade-off between required accuracy and available computational resources.