
Cost-Effective Big Data Orchestration Using Dagster: A Multi-Platform Approach (2408.11635v1)

Published 21 Aug 2024 in cs.DC and cs.SE

Abstract: The rapid advancement of big data technologies has underscored the need for robust and efficient data processing solutions. Traditional Spark-based Platform-as-a-Service (PaaS) solutions, such as Databricks (DBR) and Amazon Web Services Elastic MapReduce (EMR), provide powerful analytics capabilities but often result in high operational costs and vendor lock-in issues. These platforms, while user-friendly, can lead to significant inefficiencies due to their cost structures and lack of transparent pricing. This paper introduces a cost-effective and flexible orchestration framework using Dagster. Our solution aims to reduce dependency on any single PaaS provider by integrating various Spark execution environments. We demonstrate how Dagster's orchestration capabilities can enhance data processing efficiency, enforce best coding practices, and significantly reduce operational costs. In our implementation, we achieved a 12% performance improvement over EMR and a 40% cost reduction compared to DBR, translating to over 300 euros saved per pipeline run. Our goal is to provide a flexible, developer-controlled computing environment that maintains or improves performance and scalability while mitigating the risks associated with vendor lock-in. The proposed framework supports rapid prototyping and testing, which is essential for continuous development and operational efficiency, contributing to a more sustainable model of large-scale data processing.
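The abstract's cost figures can be sanity-checked with a little arithmetic. The sketch below assumes that the 40% reduction and the "over 300 euros" saving refer to the same pipeline run, which the abstract implies but does not state outright. The function name `implied_costs` is hypothetical and not taken from the paper. Because the saving is given as a lower bound, both results are lower bounds too.

```python
def implied_costs(saving_eur: float, reduction: float) -> tuple[float, float]:
    """Return (baseline_cost, reduced_cost) implied by an absolute saving
    and a fractional cost reduction.

    Assumption: saving_eur == reduction * baseline_cost for a single run.
    """
    if not 0 < reduction < 1:
        raise ValueError("reduction must be a fraction strictly between 0 and 1")
    baseline = saving_eur / reduction   # cost on the more expensive platform
    reduced = baseline - saving_eur     # cost after the reduction
    return baseline, reduced


# Figures from the abstract: at least 300 EUR saved, 40% reduction versus DBR.
baseline, reduced = implied_costs(300.0, 0.40)
print(f"Implied DBR cost per run: at least {baseline:.0f} EUR")
print(f"Implied cost after reduction: at least {reduced:.0f} EUR")
```

Under that assumption, a pipeline run on DBR would cost at least 750 euros, and the same run under the proposed framework at least 450 euros.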
