Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
139 tokens/sec
GPT-4o
47 tokens/sec
Gemini 2.5 Pro Pro
43 tokens/sec
o3 Pro
4 tokens/sec
GPT-4.1 Pro
47 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

On the Prevalence, Impact, and Evolution of SQL Code Smells in Data-Intensive Systems (2201.02215v1)

Published 6 Jan 2022 in cs.SE

Abstract: Code smells indicate software design problems that harm software quality. Data-intensive systems that frequently access databases often suffer from SQL code smells besides the traditional smells. While there have been extensive studies on traditional code smells, recently, there has been a growing interest in SQL code smells. In this paper, we conduct an empirical study to investigate the prevalence and evolution of SQL code smells in open-source, data-intensive systems. We collected 150 projects and examined both traditional and SQL code smells in these projects. Our investigation delivers several important findings. First, SQL code smells are indeed prevalent in data-intensive software systems. Second, SQL code smells have a weak co-occurrence with traditional code smells. Third, SQL code smells have a weaker association with bugs than that of traditional code smells. Fourth, SQL code smells are more likely to be introduced at the beginning of the project lifetime and likely to be left in the code without a fix, compared to traditional code smells. Overall, our results show that SQL code smells are indeed prevalent and persistent in the studied data-intensive software systems. Developers should be aware of these smells and consider detecting and refactoring SQL code smells and traditional code smells separately, using dedicated tools.

Citations (35)

Summary

  • The paper finds SQL code smells, especially Implicit Columns, are prevalent across all domains with a median occurrence of 1.67%.
  • The study employs SQLInspect, DECOR, and PyDriller to detect smells and uses statistical tests to analyze their evolution and association with traditional smells and bugs.
  • The paper reveals that SQL smells persist longer than traditional smells, with up to 80.5% of files retaining them, highlighting the need for targeted refactoring.

This paper investigates the prevalence, impact, evolution, and co-occurrence of SQL code smells in data-intensive systems, comparing them with traditional code smells (On the Prevalence, Impact, and Evolution of SQL Code Smells in Data-Intensive Systems, 2022). SQL code smells are poor practices in SQL query writing (e.g., using SELECT * instead of specifying columns) which can negatively impact system performance and maintainability, particularly in applications heavily reliant on databases.

Methodology:

  1. Project Selection: The paper analyzed 150 open-source Java projects from GitHub that use common database access APIs (Android Database API, JDBC, JPA, Hibernate). Projects were categorized into four domains: Business, Library, Multimedia, and Utility.
  2. Smell Detection:
    • SQL Smells: Detected using the SQLInspect tool, focusing on four types based on Karwin's catalogue: Implicit Columns, Fear of the Unknown, Ambiguous Groups, and Random Selection.
    • Traditional Smells: Detected using the DECOR tool, covering 19 types (e.g., LongMethod, ComplexClass).
    • Analysis was performed on project snapshots taken every 500 commits.
  3. Bug Analysis:
    • PyDriller was used to identify bug-fixing commits based on keywords (fix, bug, error, etc.) in commit messages.
    • The SZZ algorithm (implemented in PyDriller) was used to identify the corresponding bug-inducing commits.
  4. Data Analysis: Smells, file evolution (tracked using git diff), and bug information were stored in a relational database. Statistical methods like the Apriori algorithm, Cramer's V test, Chi-squared test, RandomForest classification, and Kaplan-Meier survival analysis were employed.

Key Findings:

  • RQ1: Prevalence of SQL Smells:
    • SQL code smells are prevalent across all studied application domains.
    • Implicit Columns (e.g., SELECT *) was the most frequent smell (median prevalence 1.67%), followed by Fear of the Unknown (improper NULL handling, median 0.8%).
    • Ambiguous Groups and Random Selection were rarely found in the latest versions.
    • Business and Library domains showed slightly higher prevalence for Implicit Columns. The high prevalence in libraries is concerning as they are reused.
    • Manual inspection of Implicit Columns smells revealed many unused columns being retrieved, potentially causing performance and maintenance issues.
  • RQ2: Co-occurrence with Traditional Smells:
    • Association rule mining (Apriori) and statistical tests (Chi-squared, Cramer's V) were used.
    • Some statistically significant co-occurrences were found (e.g., Implicit Columns with LongMethod and ComplexClass).
    • However, the strength of this association was consistently weak (low Lift, Leverage, and Cramer's V values). This suggests SQL smells and traditional smells often occur independently.
  • RQ3: Co-occurrence with Bugs:
    • The paper linked smells present in file versions to bug-inducing commits affecting those files.
    • Chi-squared tests showed no statistically significant association between the studied SQL smells (Implicit Columns, Fear of the Unknown) and the file versions being bug-inducing.
    • In contrast, several traditional smells (e.g., ComplexClass, SpaghettiCode) showed a statistically significant association with bugs.
    • A RandomForest model trained to predict bug-inducing file versions based on smells showed that traditional smells (especially ComplexClass) had much higher feature importance than SQL smells.
  • RQ4: Survival of SQL Smells:
    • Kaplan-Meier survival analysis revealed that SQL code smells persist significantly longer than traditional code smells (LongMethod, LongParameterList).
    • A large fraction of SQL smells were introduced early in the project lifecycle (e.g., 89.5% of files with Implicit Columns had the smell in the first tracked snapshot).
    • Many smells remained unfixed throughout the observation period (e.g., 80.5% of files with Implicit Columns contained it in all snapshots). This indicates SQL smells receive little refactoring attention.

Practical Implications:

  • Developers should be aware that SQL smells, particularly Implicit Columns and Fear of the Unknown, are common in data-intensive applications.
  • Detection and refactoring of SQL smells require specific attention and potentially dedicated tools (like SQLInspect), as they don't strongly co-occur with traditional smells and may be missed by traditional smell detectors.
  • While this paper didn't find a strong link between SQL smells and functional bugs, their persistence and nature (e.g., retrieving unnecessary data) suggest potential impacts on performance and maintainability that warrant attention.
  • The longevity of SQL smells suggests they are often overlooked during maintenance and refactoring. Teams should consider incorporating SQL smell checks into their code review and quality assurance processes.

In conclusion, the paper provides empirical evidence that SQL code smells are a distinct and persistent problem in data-intensive systems, often independent of traditional code smells and less directly linked to bug introduction than some traditional smells, but potentially harmful nonetheless and largely ignored by developers (On the Prevalence, Impact, and Evolution of SQL Code Smells in Data-Intensive Systems, 2022).