Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
125 tokens/sec
GPT-4o
10 tokens/sec
Gemini 2.5 Pro Pro
44 tokens/sec
o3 Pro
5 tokens/sec
GPT-4.1 Pro
3 tokens/sec
DeepSeek R1 via Azure Pro
51 tokens/sec
2000 character limit reached

Towards Next Generation Data Engineering Pipelines (2507.13892v1)

Published 18 Jul 2025 in cs.DB

Abstract: Data engineering pipelines are a widespread way to provide high-quality data for all kinds of data science applications. However, numerous challenges still remain in the composition and operation of such pipelines. Data engineering pipelines do not always deliver high-quality data. By default, they are also not reactive to changes. When new data is coming in which deviates from prior data, the pipeline could crash or output undesired results. We therefore envision three levels of next generation data engineering pipelines: optimized data pipelines, self-aware data pipelines, and self-adapting data pipelines. Pipeline optimization addresses the composition of operators and their parametrization in order to achieve the highest possible data quality. Self-aware data engineering pipelines enable a continuous monitoring of its current state, notifying data engineers on significant changes. Self-adapting data engineering pipelines are then even able to automatically react to those changes. We propose approaches to achieve each of these levels.

Summary

We haven't generated a summary for this paper yet.

Dice Question Streamline Icon: https://streamlinehq.com

Follow-up Questions

We haven't generated follow-up questions for this paper yet.