
Gen-T: Table Reclamation in Data Lakes (2403.14128v2)

Published 21 Mar 2024 in cs.DB

Abstract: We introduce the problem of Table Reclamation. Given a Source Table and a large table repository, reclamation finds a set of tables that, when integrated, reproduce the source table as closely as possible. Unlike query discovery problems like Query-by-Example or by-Target, Table Reclamation focuses on reclaiming the data in the Source Table as fully as possible using real tables that may be incomplete or inconsistent. To do this, we define a new measure of table similarity, called error-aware instance similarity, to measure how close a reclaimed table is to a Source Table, a measure grounded in instance similarity used in data exchange. Our search covers not only SELECT-PROJECT-JOIN queries, but integration queries with unions, outerjoins, and the unary operators subsumption and complementation that have been shown to be important in data integration and fusion. Using reclamation, a data scientist can understand if any tables in a repository can be used to exactly reclaim a tuple in the Source. If not, one can understand if this is due to differences in values or to incompleteness in the data. Our solution, Gen-T, performs table discovery to retrieve a set of candidate tables from the table repository, filters these down to a set of originating tables, then integrates these tables to reclaim the Source as closely as possible. We show that our solution, while approximate, is accurate, efficient and scalable in the size of the table repository with experiments on real data lakes containing up to 15K tables, where the average number of tuples varies from small (web tables) to extremely large (open data tables) up to 1M tuples.
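
The two unary operators named in the abstract, subsumption and complementation, can be illustrated with a small sketch. This is not Gen-T's implementation. It assumes the definitions commonly used in the data fusion literature: a tuple is subsumed when another tuple agrees with all of its non-null values and has strictly more of them, and two tuples complement each other when they share at least one non-null value, never conflict, and each fills a null in the other. Rows are plain Python tuples with None as null, and the function names (subsumes, complement, integrate) are hypothetical.

```python
from typing import List, Optional, Tuple

Row = Tuple  # one tuple of a table; None stands for a null value


def subsumes(a: Row, b: Row) -> bool:
    """True if a agrees with every non-null value of b and has more non-nulls."""
    agrees = all(y is None or x == y for x, y in zip(a, b))
    more = sum(x is not None for x in a) > sum(y is not None for y in b)
    return agrees and more


def complement(a: Row, b: Row) -> Optional[Row]:
    """Merge a and b if they complement each other, otherwise return None."""
    pairs = list(zip(a, b))
    shared = any(x is not None and x == y for x, y in pairs)
    conflict = any(x is not None and y is not None and x != y for x, y in pairs)
    a_fills = any(x is not None and y is None for x, y in pairs)
    b_fills = any(x is None and y is not None for x, y in pairs)
    if shared and not conflict and a_fills and b_fills:
        return tuple(x if x is not None else y for x, y in pairs)
    return None


def apply_complementation(rows: List[Row]) -> List[Row]:
    """Simplified pairwise version: replace a complementing pair by its merge."""
    rows = list(dict.fromkeys(rows))  # drop exact duplicates, keep order
    changed = True
    while changed:
        changed = False
        for i in range(len(rows)):
            for j in range(i + 1, len(rows)):
                merged = complement(rows[i], rows[j])
                if merged is not None:
                    rest = [r for k, r in enumerate(rows) if k not in (i, j)]
                    rows = list(dict.fromkeys(rest + [merged]))
                    changed = True
                    break
            if changed:
                break
    return rows


def remove_subsumed(rows: List[Row]) -> List[Row]:
    """Drop every tuple that some other tuple subsumes."""
    unique = list(dict.fromkeys(rows))
    return [r for r in unique if not any(subsumes(o, r) for o in unique)]


def integrate(rows: List[Row]) -> List[Row]:
    """Apply complementation, then subsumption, to a unioned set of tuples."""
    return remove_subsumed(apply_complementation(rows))
```

For example, given the unioned rows ("Ann", "Toronto", None) and ("Ann", None, 34), complementation would merge them into one fuller tuple, and a partial row such as ("Bob", None, None) would be dropped by subsumption if ("Bob", "Boston", None) is also present. Replacing a complementing pair by its merge is a simplification chosen for brevity; an integration system may keep the originals so that they can combine with other tuples.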

Authors (3)
  1. Grace Fan (3 papers)
  2. Roee Shraga (20 papers)
  3. Renée J. Miller (15 papers)