To Store or Not to Store: a graph theoretical approach for Dataset Versioning (2402.11741v1)
Abstract: In this work, we study the cost-efficient data versioning problem, where the goal is to optimize the storage and reconstruction (retrieval) costs of data versions, given a graph of datasets as nodes and edges capturing edit/delta information. One central variant we study is MinSum Retrieval (MSR), where the goal is to minimize the total retrieval costs while keeping the storage costs bounded. This problem (along with its variants) was introduced by Bhattacherjee et al. [VLDB'15]. While such problems are frequently encountered in collaborative tools (e.g., version control systems and data analysis pipelines), to the best of our knowledge, no existing research studies the theoretical aspects of these problems. We establish that the currently best-known heuristic, LMG, can perform arbitrarily badly in a simple worst case. Moreover, we show that it is hard to obtain an $o(n)$-approximation for MSR on general graphs even if we relax the storage constraints by an $O(\log n)$ factor. Similar hardness results are shown for other variants. Meanwhile, we propose polynomial-time approximation schemes for tree-like graphs, motivated by the fact that the graphs arising in practice from typical edit operations are often not arbitrary. As version graphs typically have low treewidth, we further develop new algorithms for bounded treewidth graphs. Furthermore, we propose two new heuristics and evaluate them empirically. First, we extend LMG by considering more potential "moves", to propose a new heuristic, LMG-All. LMG-All consistently outperforms LMG while having comparable run time on a wide variety of datasets, i.e., version graphs. Second, we apply our tree algorithms on the minimum-storage arborescence of an instance, yielding algorithms that are qualitatively better than all previous heuristics for MSR, as well as for another variant, BoundedMin Retrieval (BMR).
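
To make the MSR objective in the abstract concrete, the following is a minimal sketch, not one of the paper's algorithms (it is neither LMG, LMG-All, nor the tree-based methods). It evaluates a storage plan on a toy version graph and finds the best plan by exhaustive enumeration, which is only feasible for a handful of versions. The function names, the toy costs, and the modeling choice that a materialized version has retrieval cost 0 are all assumptions made for illustration.

```python
import heapq
from itertools import combinations


def evaluate_plan(node_cost, edges, materialized, stored_deltas):
    """Return (total_storage, total_retrieval) for one storage plan.

    node_cost:     {version: cost of storing that version in full}
    edges:         {(u, v): (delta_storage_cost, delta_retrieval_cost)}
    materialized:  set of versions stored in full
    stored_deltas: set of (u, v) edges whose deltas are stored
    Assumption: a materialized version is retrieved at cost 0.
    Total retrieval is infinite if some version cannot be reconstructed.
    """
    storage = sum(node_cost[v] for v in materialized)
    storage += sum(edges[e][0] for e in stored_deltas)

    # Adjacency list restricted to the stored deltas.
    adj = {}
    for (u, v) in stored_deltas:
        adj.setdefault(u, []).append((v, edges[(u, v)][1]))

    # Multi-source Dijkstra: cheapest chain of deltas from any full copy.
    dist = {v: float("inf") for v in node_cost}
    heap = []
    for v in materialized:
        dist[v] = 0
        heap.append((0, v))
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, r in adj.get(u, []):
            if d + r < dist[v]:
                dist[v] = d + r
                heapq.heappush(heap, (dist[v], v))
    return storage, sum(dist.values())


def brute_force_msr(node_cost, edges, budget):
    """Exhaustively search all plans; exponential, toy instances only."""
    best = (float("inf"), None)
    nodes, edge_list = list(node_cost), list(edges)
    for i in range(1, len(nodes) + 1):
        for mat in combinations(nodes, i):
            for j in range(len(edge_list) + 1):
                for deltas in combinations(edge_list, j):
                    s, r = evaluate_plan(node_cost, edges, set(mat), set(deltas))
                    if s <= budget and r < best[0]:
                        best = (r, (set(mat), set(deltas)))
    return best


# Hypothetical toy instance: three versions, three possible deltas.
NODE_COST = {"a": 10, "b": 10, "c": 10}
EDGES = {("a", "b"): (2, 3), ("b", "c"): (2, 3), ("a", "c"): (3, 5)}

for budget in (14, 15, 30):
    print(budget, brute_force_msr(NODE_COST, EDGES, budget)[0])
```

On this toy instance, raising the storage budget lets the plan replace a chain of deltas with a more direct delta, and eventually with full copies, which is the storage/retrieval trade-off the abstract refers to.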
- Git. https://github.com/git/git, 2005. last accessed: 13-Oct-22.
- Pachyderm. https://github.com/pachyderm/pachyderm, 2016. last accessed: 13-Oct-22.
- DVC. https://github.com/iterative/dvc, 2017. last accessed: 13-Oct-22.
- Dolt. https://github.com/dolthub/dolt, 2019. last accessed: 13-Oct-22.
- TerminusDB. https://github.com/terminusdb/terminusdb, 2019. last accessed: 13-Oct-22.
- LakeFS. https://github.com/treeverse/lakeFS, 2020. last accessed: 13-Oct-22.
- Taming the cloud object storage with mos. In Proceedings of the 10th Parallel Data Storage Workshop, pages 7–12, 2015.
- Mos: Workload-aware elasticity for cloud object stores. In Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, pages 177–188, 2016.
- Aaron Archer. Inapproximability of the asymmetric facility location and k-median problems. 2000.
- Aaron Archer. Two O(log* k)-approximation algorithms for the asymmetric k-center problem. In Karen Aardal and Bert Gerards, editors, Integer Programming and Combinatorial Optimization, pages 1–14, Berlin, Heidelberg, 2001. Springer Berlin Heidelberg.
- Complexity of finding embeddings in a k-tree. SIAM Journal on Algebraic Discrete Methods, 8(2):277–284, 1987. doi:10.1137/0608024.
- Finding all leftmost separators of size $\leq k$. In Combinatorial Optimization and Applications: 15th International Conference, COCOA 2021, Tianjin, China, December 17–19, 2021, Proceedings, pages 273–287, Berlin, Heidelberg, 2021. Springer-Verlag. doi:10.1007/978-3-030-92681-6_23.
- On non-serial dynamic programming. Journal of Combinatorial Theory, Series A, 14(2):137–148, 1973. doi:10.1016/0097-3165(73)90016-2.
- Datahub: Collaborative data science & dataset version management at scale. In Seventh Biennial Conference on Innovative Data Systems Research, CIDR 2015, Asilomar, CA, USA, January 4-7, 2015, Online Proceedings. www.cidrdb.org, 2015. URL: http://cidrdb.org/cidr2015/Papers/CIDR15_Paper18.pdf.
- Principles of dataset versioning: Exploring the recreation/storage tradeoff. Proc. VLDB Endow., 8(12):1346–1357, 2015. URL: http://www.vldb.org/pvldb/vol8/p1346-bhattacherjee.pdf, doi:10.14778/2824032.2824035.
- Hans L. Bodlaender. A linear time algorithm for finding tree-decompositions of small treewidth. In Proceedings of the Twenty-Fifth Annual ACM Symposium on Theory of Computing, STOC ’93, pages 226–234, New York, NY, USA, 1993. Association for Computing Machinery. doi:10.1145/167088.167161.
- Hans L. Bodlaender. A partial k-arboretum of graphs with bounded treewidth. Theoretical Computer Science, 209(1):1–45, 1998. doi:10.1016/S0304-3975(97)00228-4.
- Dataset Discovery in Data Lakes. 2020 IEEE 36th International Conference on Data Engineering (ICDE), pages 709–720, 2020. arXiv:2011.10427, doi:10.1109/icde48307.2020.00067.
- Google Dataset Search: Building a search engine for datasets in an open Web ecosystem. The World Wide Web Conference, pages 1365–1375, 2019. doi:10.1145/3308558.3313685.
- DSDB: An Open-Source System for Database Versioning & Curation. 2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL), pages 299–307, 2021. doi:10.1109/jcdl52503.2021.00044.
- Data Provenance: Some Basic Issues. Lecture Notes in Computer Science, pages 87–93, 2000. doi:10.1007/3-540-44450-5_6.
- Randal C. Burns and Darrell D. E. Long. In-place reconstruction of delta compressed files. Proceedings of the seventeenth annual ACM symposium on Principles of distributed computing - PODC ’98, pages 267–275, 1998. doi:10.1145/277697.277747.
- DEX: Query Execution in a Delta-based Storage System. Proceedings of the 2017 ACM International Conference on Management of Data, pages 171–186, 2017. doi:10.1145/3035918.3064056.
- Cast: Tiering storage for data analytics in the cloud. In Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, pages 45–56, 2015.
- Network Design Problems with Bounded Distances via Shallow-Light Steiner Trees. In Ernst W. Mayr and Nicolas Ollinger, editors, 32nd International Symposium on Theoretical Aspects of Computer Science (STACS 2015), volume 30 of Leibniz International Proceedings in Informatics (LIPIcs), pages 238–248, Dagstuhl, Germany, 2015. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik. URL: http://drops.dagstuhl.de/opus/volltexte/2015/4917, doi:10.4230/LIPIcs.STACS.2015.238.
- Asymmetric k-center is log* n-hard to approximate. J. ACM, 52(4):538–551, July 2005. doi:10.1145/1082036.1082038.
- P. Crescenzi. A short guide to approximation preserving reductions. In Proceedings of Computational Complexity. Twelfth Annual IEEE Conference, pages 262–273, 1997. doi:10.1109/CCC.1997.612321.
- Materialization and reuse optimizations for production data science pipelines. SIGMOD ’22, pages 1962–1976, New York, NY, USA, 2022. Association for Computing Machinery. doi:10.1145/3514221.3526186.
- Hcompress: Hierarchical data compression for multi-tiered storage environments. In 2020 IEEE IPDPS, pages 557–566. IEEE, 2020.
- Hfetch: Hierarchical data prefetching for scientific workflows in multi-tiered storage environments. In 2020 IEEE IPDPS, pages 62–72. IEEE, 2020.
- Analytical approach to parallel repetition. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, STOC ’14, pages 624–633, New York, NY, USA, 2014. Association for Computing Machinery. doi:10.1145/2591796.2591884.
- Online cost optimization algorithms for tiered cloud storage services. Journal of Systems and Software, 160:110457, 2020. doi:10.1016/j.jss.2019.110457.
- Uriel Feige. A threshold of ln n for approximating set cover. J. ACM, 45(4):634–652, July 1998. doi:10.1145/285055.285059.
- Improved approximation algorithms for minimum-weight vertex separators. In Proceedings of the Thirty-Seventh Annual ACM Symposium on Theory of Computing, STOC ’05, pages 563–572, New York, NY, USA, 2005. Association for Computing Machinery. doi:10.1145/1060590.1060674.
- Aurum: A Data Discovery System. 2018 IEEE 34th International Conference on Data Engineering (ICDE), pages 1001–1012, 2018. doi:10.1109/icde.2018.00094.
- Fully polynomial-time parameterized computations for graphs and matrices of low treewidth. ACM Trans. Algorithms, 14(3), June 2018. doi:10.1145/3186898.
- Large induced subgraphs via triangulations and CMSO. SIAM Journal on Computing, 44(1):54–87, 2015. doi:10.1137/140964801.
- Yong Gao. Treewidth of Erdős–Rényi random graphs, random intersection graphs, and scale-free random graphs. Discrete Applied Mathematics, 160(4-5):566–578, 2012.
- GB Gens and YV Levner. Approximate algorithms for certain universal problems in scheduling theory. Engineering Cybernetics, 16(6):31–36, 1978.
- A fast approximation algorithm for the subset-sum problem. INFOR: Information Systems and Operational Research, 32(3):143–148, 1994.
- Quasi-polynomial algorithms for submodular tree orienteering and directed network design problems. Mathematics of Operations Research, 47(2):1612–1630, 2022.
- Teofilo F Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical computer science, 38:293–306, 1985.
- Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2022. URL: https://www.gurobi.com.
- Tree embeddings for hop-constrained network design. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2021, pages 356–369, New York, NY, USA, 2021. Association for Computing Machinery. doi:10.1145/3406325.3451053.
- Approximating buy-at-bulk and shallow-light k-steiner trees. Algorithmica, 53(1):89–103, 2009.
- ORPHEUSDB: bolt-on versioning for relational databases (extended version). The VLDB Journal, 29(1):509–538, 2020. doi:10.1007/s00778-019-00594-5.
- Delta algorithms: an empirical analysis. ACM Transactions on Software Engineering and Methodology (TOSEM), 7(2):192–214, 1998. doi:10.1145/279310.279321.
- Fast approximation algorithms for the knapsack and sum of subset problems. J. ACM, 22(4):463–468, October 1975. doi:10.1145/321906.321909.
- A new greedy approach for facility location problems. In Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing, STOC ’02, pages 731–740, New York, NY, USA, 2002. Association for Computing Machinery. doi:10.1145/509907.510012.
- DFS: A Dataset File System for Data Discovering Users. 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL), pages 355–356, 2019. arXiv:1905.13363, doi:10.1109/jcdl.2019.00068.
- Directed tree-width. Journal of Combinatorial Theory, Series B, 82(1):138–154, May 2001. doi:10.1006/jctb.2000.2031.
- Richard M Karp. The fast approximate solution of hard combinatorial problems. In Proc. 6th South-Eastern Conf. Combinatorics, Graph Theory and Computing (Florida Atlantic U. 1975), pages 15–31, 1975.
- An efficient fully polynomial approximation scheme for the subset-sum problem. Journal of Computer and System Sciences, 66(2):349–370, 2003.
- Improved approximations for buy-at-bulk and shallow-light k-steiner trees and (k,2)-subgraph. J. Comb. Optim., 31(2):669–685, February 2016. doi:10.1007/s10878-014-9774-5.
- Balancing minimum spanning trees and shortest-path trees. Algorithmica, 14(4):305–321, 1995. doi:10.1007/BF01294129.
- Efficient Snapshot Retrieval over Historical Graph Data. arXiv, 2012. arXiv:1207.5777, doi:10.48550/arxiv.1207.5777.
- Cost-performance evaluation of heterogeneous tierless storage management in a public cloud. In 2021 Ninth International Symposium on Computing and Networking (CANDAR), pages 121–126. IEEE, 2021.
- Tuukka Korhonen. A single-exponential time 2-approximation algorithm for treewidth. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pages 184–192, 2022. doi:10.1109/FOCS52979.2021.00026.
- Approximating shallow-light trees. In Proceedings of the eighth annual ACM-SIAM symposium on Discrete algorithms, pages 103–110, 1997.
- Hermes: a heterogeneous-aware multi-tiered distributed i/o buffering system. In Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing, pages 219–230, 2018.
- Cost modelling for optimal data placement in heterogeneous main memory. Proceedings of the VLDB Endowment, 15(11):2867–2880, 2022.
- Workload-driven placement of column-store data structures on dram and nvm. In Proceedings of the 17th International Workshop on Data Management on New Hardware (DaMoN 2021), pages 1–8, 2021.
- To transfer or not: An online cost optimization algorithm for using two-tier storage-as-a-service clouds. IEEE Access, 7:94263–94275, 2019.
- Keep hot or go cold: A randomized online migration algorithm for cost optimization in staas clouds. IEEE Transactions on Network and Service Management, 18(4):4563–4575, 2021.
- Josh MacDonald. File system support for delta compression. Master's thesis, Department of Electrical Engineering and Computer Science, University of California at Berkeley, 2000.
- Decibel: The Relational Dataset Branching System. Proceedings of the VLDB Endowment. International Conference on Very Large Data Bases, 9(9):624–635, 2016. doi:10.14778/2947618.2947619.
- CHEX: multiversion replay with ordered checkpoints. Proceedings of the VLDB Endowment, 15(6):1297–1310, 2022. doi:10.14778/3514061.3514075.
- Bicriteria network design problems. Journal of algorithms, 28(1):142–171, 1998.
- Towards optimizing storage costs on the cloud. IEEE 39th International Conference on Data Engineering (ICDE) (To Appear), 2023.
- William Nagel. Subversion: not just for code anymore. Linux Journal, 2006(143):10, 2006.
- Data lake management: Challenges and opportunities. Proc. VLDB Endow., 12(12):1986–1989, August 2019. doi:10.14778/3352063.3352116.
- R. Ravi. Rapid rumor ramification: approximating the minimum broadcast time. In Proceedings 35th Annual Symposium on Foundations of Computer Science, pages 202–213, 1994. doi:10.1109/SFCS.1994.365693.
- Announcing the availability of data lineage with unity catalog. https://www.databricks.com/blog/2022/06/08/announcing-the-availability-of-data-lineage-with-unity-catalog.html, 2022. last accessed: 13-Oct-22.
- Versioning in Main-Memory Database Systems: From MusaeusDB to TardisDB. Proceedings of the 31st International Conference on Scientific and Statistical Database Management, pages 169–180, 2019. doi:10.1145/3335783.3335792.
- Efficient Versioning for Scientific Array Databases. 2012 IEEE 28th International Conference on Data Engineering, 1:1013–1024, 2012. doi:10.1109/icde.2012.102.
- A cost-driven online auto-scaling algorithm for web applications in cloud environments. Knowledge-Based Systems, 244:108523, 2022.
- A survey of data provenance techniques.
- Roberto Solis-Oba. Approximation Algorithms for the k-Median Problem, pages 292–320. Springer Berlin Heidelberg, Berlin, Heidelberg, 2006. doi:10.1007/11671541_10.
- Lock-free parallel dynamic programming. Journal of Parallel and Distributed Computing, 70(8):839–848, 2010.
- Dimitre Trendafilov, Nasir Memon, and Torsten Suel. zdelta: An efficient delta compression tool. 2002.
- Mosaic: a budget-conscious storage engine for relational database systems. Proceedings of the VLDB Endowment, 13(12):2662–2675, 2020.
- Forkbase: An efficient storage engine for blockchain and forkable applications. Proc. VLDB Endow., 11(10):1137–1150, June 2018. doi:10.14778/3231751.3231762.
- Ddelta: A deduplication-inspired fast delta compression approach. Performance Evaluation, 79:258–272, 2014. doi:10.1016/j.peva.2014.07.016.
- Pensieve: Skewness-aware version switching for efficient graph processing. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, SIGMOD ’20, pages 699–713, New York, NY, USA, 2020. Association for Computing Machinery. doi:10.1145/3318464.3380590.
- Storage and recreation trade-off for multi-version data management. In Yi Cai, Yoshiharu Ishikawa, and Jianliang Xu, editors, Web and Big Data - Second International Joint Conference, APWeb-WAIM 2018, Macau, China, July 23-25, 2018, Proceedings, Part II, volume 10988 of Lecture Notes in Computer Science, pages 394–409. Springer, 2018. doi:10.1007/978-3-319-96893-3_30.
- Anxin Guo
- Jingwei Li
- Pattara Sukprasert
- Samir Khuller
- Amol Deshpande
- Koyel Mukherjee