Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
144 tokens/sec
GPT-4o
7 tokens/sec
Gemini 2.5 Pro Pro
45 tokens/sec
o3 Pro
4 tokens/sec
GPT-4.1 Pro
38 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

What if an SQL Statement Returned a Database? (2312.00638v1)

Published 1 Dec 2023 in cs.DB

Abstract: Every SQL statement is limited to return a single, possibly denormalized, table. This design decision has far reaching consequences. (1.) for databases users in terms of slow query performance, long query result transfer times, usability-issues of SQL in web applications and object-relational mappers. In addition, (2.) for database architects it has consequences when designing query optimizers leading to logical (algebraic) join enumeration effort, memory consumption for intermediate result materialization, and physical operator selection effort. So basically, the entire query optimization stack is shaped by that design decision. In this paper, we argue that the single-table limitation should be dropped. We extend the SELECT-clause of SQL by a keyword 'RESULTDB' to support returning a result database. Our approach has clear semantics, i.e. our extended SQL returns subsets of all tables with only those tuples that would be part of the traditional (single-table) query result set, however without performing any denormalization through joins. Our SQL-extension is downward compatible. Moreover, we discuss the surprisingly long list of benefits of our approach. First, for database users: far simpler and more readable application code, better query performance, smaller query results, better query result transfer times. Second, for database architects, we present how to leverage existing closed source systems as well as change open source database systems to support our feature. We propose a couple of algorithms to integrate our feature into both closed-source as well as open source database systems. We present an initial experimental study with promising results.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (18)
  1. 2014. Sql Flaws. https://wiki.c2.com/?SqlFlaws. Accessed: 2023-11-28.
  2. AsterixDB: A Scalable, Open Source BDMS. Proc. VLDB Endow. 7, 14 (2014), 1905–1916. https://doi.org/10.14778/2733085.2733096
  3. Philip A. Bernstein and Dah-Ming W. Chiu. 1981. Using Semi-Joins to Solve Relational Queries. J. ACM 28, 1 (1981), 25–40. https://doi.org/10.1145/322234.322238
  4. The Skyline Operator. In Proceedings of the 17th International Conference on Data Engineering, April 2-6, 2001, Heidelberg, Germany, Dimitrios Georgakopoulos and Alexander Buchmann (Eds.). IEEE Computer Society, 421–430. https://doi.org/10.1109/ICDE.2001.914855
  5. E. F. Codd. 1970. A Relational Model of Data for Large Shared Data Banks. Commun. ACM 13, 6 (1970), 377–387. https://doi.org/10.1145/362384.362685
  6. C. J. Date. 1984. A Critique of the SQL Database Language. SIGMOD Rec. 14, 3 (1984), 8–54. https://doi.org/10.1145/984549.984551
  7. Bridging the Gap between OLAP and SQL. In Proceedings of the 31st International Conference on Very Large Data Bases, Trondheim, Norway, August 30 - September 2, 2005, Klemens Böhm, Christian S. Jensen, Laura M. Haas, Martin L. Kersten, Per-Åke Larson, and Beng Chin Ooi (Eds.). ACM, 1031–1042. http://www.vldb.org/archives/website/2005/program/paper/tue/p1031-dittrich.pdf
  8. John Grant. 2008. Null values in SQL. SIGMOD Rec. 37, 3 (2008), 23–25. https://doi.org/10.1145/1462571.1462575
  9. Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Total. In Proceedings of the Twelfth International Conference on Data Engineering, February 26 - March 1, 1996, New Orleans, Louisiana, USA, Stanley Y. W. Su (Ed.). IEEE Computer Society, 152–159. https://doi.org/10.1109/ICDE.1996.492099
  10. The Vertica Analytic Database: C-Store 7 Years Later. Proc. VLDB Endow. 5, 12 (2012), 1790–1801. https://doi.org/10.14778/2367502.2367518
  11. Thomas Neumann and Gerhard Weikum. 2009. Scalable join processing on very large RDF graphs. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2009, Providence, Rhode Island, USA, June 29 - July 2, 2009, Ugur Çetintemel, Stanley B. Zdonik, Donald Kossmann, and Nesime Tatbul (Eds.). ACM, 627–640. https://doi.org/10.1145/1559845.1559911
  12. Dan Olteanu and Maximilian Schleich. 2016. Factorized Databases. SIGMOD Rec. 45, 2 (2016), 5–16. https://doi.org/10.1145/3003665.3003667
  13. The SQL++ Semi-structured Data Model and Query Language: A Capabilities Survey of SQL-on-Hadoop, NoSQL and NewSQL Databases. CoRR abs/1405.3631 (2014). arXiv:1405.3631 http://arxiv.org/abs/1405.3631
  14. Elvis Pranskevichus. 2019. We Can Do Better Than SQL. https://www.edgedb.com/blog/we-can-do-better-than-sql. Accessed: 2023-11-28.
  15. Materialization strategies in the Vertica analytic database: Lessons learned. In 29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia, April 8-12, 2013, Christian S. Jensen, Christopher M. Jermaine, and Xiaofang Zhou (Eds.). IEEE Computer Society, 1196–1207. https://doi.org/10.1109/ICDE.2013.6544909
  16. Errors and Complications in SQL Query Formulation. ACM Trans. Comput. Educ. 18, 3 (2018), 15:1–15:29. https://doi.org/10.1145/3231712
  17. Mihalis Yannakakis. 1981. Algorithms for Acyclic Database Schemes. In Very Large Data Bases, 7th International Conference, September 9-11, 1981, Cannes, France, Proceedings. IEEE Computer Society, 82–94.
  18. Looking Ahead Makes Query Plans Robust. Proc. VLDB Endow. 10, 8 (2017), 889–900. https://doi.org/10.14778/3090163.3090167

Summary

  • The paper introduces the RESULTDB extension that enables SQL to return a subdatabase and preserve relational context instead of a denormalized table.
  • It evaluates various methods including dynamic and materialized SELECT DISTINCT queries alongside Yannakakis-based algorithms for acyclic join optimization.
  • The approach reduces data redundancy and transmission costs while simplifying application development by maintaining key relational information.

Analyzing the Constraints and Enhancement of SQL: Introduction of RESULTDB

The paper "What if an SQL Statement Returned a Database?" presents a compelling discourse on the limitations imposed by SQL's design choice of returning query results as single tables. SQL traditionally returns the result of a query as a single, possibly denormalized table, which has wide-ranging implications for both database users and architects.

Problems with Single Table Returns

The restriction to single-table query results in SQL imposes several technical challenges:

  1. Redundant Data: The result tables often contain redundant data, particularly when joins are involved. This redundancy arises because the current SQL architecture denormalizes the results by repeating values, leading to inefficiencies in data storage and transmission.
  2. Information Loss: Key and relational information is often lost when denormalizing data during joins. This makes it difficult to maintain the original context and relationships of the data in the output.
  3. Data Transmission Costs: Larger result tables with redundant data inflate the cost of data transmission from the server to the client, posing a significant performance bottleneck, especially for large datasets.
  4. Memory Consumption: The materialization of large intermediate results due to join operations in memory can lead to high memory usage and inefficiencies during query processing.
  5. Complexity in Application Code: Developers often have to write additional code to transform these flat, denormalized results back into a form that their applications can use effectively. This additional complexity increases the potential for errors and inefficiency.

Proposed Solution: RESULTDB

In response to these challenges, the paper proposes an extension to SQL that adds a RESULTDB keyword to the SELECT clause. The SELECT RESULTDB syntax allows a query to return a subdatabase instead of a single table. This subdatabase includes the necessary subsets of the original tables without denormalizing the data through joins.

Benefits of the RESULTDB Approach

The RESULTDB extension offers several advantages:

  • Simplified Application Development: By receiving results as a subdatabase, application developers can more easily manipulate the data without writing complex transformation logic.
  • Improved Query Performance: Avoiding the creation of denormalized join results can lead to smaller intermediate tables, reducing memory consumption and potentially enhancing query performance.
  • Reduced Data Redundancy and Transmission Costs: As the subdatabase contains only the necessary subsets of tables, data redundancy is minimized, and the cost of transmitting the query results is reduced.
  • Preservation of Relational Information: Returning a subdatabase preserves key and relational information, maintaining the integrity and relationships inherent in the original schema.

Implementation Methods

The paper discusses several methods for integrating the RESULTDB functionality into existing SQL systems, categorized into SQL-based rewrites and direct integration into DBMS query optimizers:

  1. Dynamic SELECT DISTINCT: This method transforms the original query into multiple SELECT DISTINCT queries, each reducing the dataset to the tuples that participate in the join. While this approach is straightforward, it risks repeatedly computing expensive joins.
  2. Materialized SELECT DISTINCT: Here, a materialized view (MV) of the query result is created and used to issue distinct queries against the MV, potentially reducing repeated computation at the cost of materialization overhead.
  3. Dynamic Subquery: Subqueries are used to enforce semi-join semantics dynamically, guiding the optimizer to use semi-joins instead of full inner joins, though this approach depends heavily on the optimizer's capabilities.
  4. Materialized Subquery: This method prematerializes parts of the join in separate materialized views, optimizing for semi-join strategies to balance computation and storage costs.

Algorithmic Approach

For more efficient processing, the paper introduces an algorithm leveraging Yannakakis' algorithm to handle acyclic join graphs. To address cyclic joins, a folding method is proposed to transform the cyclic join graphs into acyclic ones, enabling the application of Yannakakis' efficient reduction techniques. This approach retains the relational context and minimizes computational waste.

Implications and Future Work

The implications of this research are significant for both theoretical and practical aspects of database management. Theoretically, it reintroduces semi-join reduction in a novel context, ensuring minimal data redundancy. Practically, it provides a more efficient method of interacting with databases, which could lead to considerable performance improvements in real-world applications.

Future developments could focus on optimizing the root selection and join order in semi-join reductions to further maximize the efficiency of the proposed methods. Additionally, exploring database transformation techniques could complement the data retrieval enhancements discussed in this paper.

Conclusion

The proposition and initial experiments detailed in the paper mark an important step towards refining SQL for improved performance and usability. By addressing the critical issues stemming from the single-table result paradigm, the RESULTDB extension offers a cleaner, more efficient way to handle complex queries, promising significant advancements in the field of database management systems.

HackerNews

Reddit Logo Streamline Icon: https://streamlinehq.com