
Efficiently Processing Workflow Provenance Queries on SPARK (1808.08424v2)

Published 25 Aug 2018 in cs.DC

Abstract: In this paper, we investigate how we can leverage the Spark platform for efficiently processing provenance queries on large volumes of workflow provenance data. We focus on processing provenance queries at the attribute-value level, which is the finest granularity available. We propose a novel weakly connected component based framework which is carefully engineered to quickly determine a minimal volume of data containing the entire lineage of the queried attribute-value. This minimal volume of data is then processed to determine the provenance of the queried attribute-value. The proposed framework computes weakly connected components on the workflow provenance graph and, exploiting the workflow dependency graph, further partitions the large components into a collection of weakly connected sets. We study the effectiveness of the proposed framework through experiments on a provenance trace obtained from a real-life unstructured text curation workflow. On provenance graphs containing up to 500M nodes and edges, we show that the proposed framework answers provenance queries in real time and easily outperforms the naive approaches.
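
The core idea in the abstract, restricting a lineage query to the weakly connected component that contains the queried value, can be illustrated with a minimal single-machine sketch. This is not the authors' Spark implementation and it omits the further partitioning of large components into weakly connected sets; the function names, the `(source, derived)` edge convention, and the toy graph are assumptions made for illustration only, and the graph is assumed to be acyclic.

```python
from collections import defaultdict, deque


def weakly_connected_components(nodes, edges):
    """Map each node to a component representative, ignoring edge direction."""
    parent = {n: n for n in nodes}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for src, dst in edges:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            parent[root_src] = root_dst
    return {n: find(n) for n in nodes}


def lineage(query, nodes, edges):
    """Return every ancestor of `query`, scanning only its own component.

    Assumption: an edge (src, dst) means `dst` was derived from `src`.
    """
    component = weakly_connected_components(nodes, edges)
    target = component[query]

    # Keep only the edges that fall inside the queried node's component.
    parents = defaultdict(list)
    for src, dst in edges:
        if component[src] == target:
            parents[dst].append(src)

    # Backward breadth-first traversal from the queried node.
    seen = {query}
    queue = deque([query])
    while queue:
        node = queue.popleft()
        for p in parents[node]:
            if p not in seen:
                seen.add(p)
                queue.append(p)
    return seen - {query}


# Hypothetical toy provenance graph with two unrelated components.
nodes = ["a", "b", "c", "d", "x", "y"]
edges = [("a", "b"), ("b", "c"), ("d", "c"), ("x", "y")]
ancestors_of_c = lineage("c", nodes, edges)
```

In this toy graph, the query for `c` never touches the edge between `x` and `y`, because that edge lies in a different component. In a distributed setting the component labels would presumably be precomputed once, so that each query filters the data by label before any traversal starts.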

Authors (6)
  1. Rajmohan C (4 papers)
  2. Pranay Lohia (9 papers)
  3. Himanshu Gupta (54 papers)
  4. Siddhartha Brahma (20 papers)
  5. Mauricio Hernandez (1 paper)
  6. Sameep Mehta (27 papers)
Citations (5)
