Feature/spaghetti query (#10)
* General setup for Spaghetti Query detector

* Setup Halstead Metric and function for complexity

* Add function to get all columns from ORDER BY

* Add tests for the get_all_columns method

* Linting

* Add description

* Add sql metadata package

* Add method to format query

* Add function to determine whether query has subqueries

* Add distinction between subquery and UNION

* Add function that checks if the query has a UNION

* Change has_subqueries to look at (SELECT instead of just SELECT

* Add method that gets all queries in a union sql query

* Add function to get SQL operators

* Add type hints to halstead metric function

* Implement halstead complexity

* - Add spaghetti query detector to detector
- Fix error when query is not defined

* Add spaghetti query anti-pattern detector

* Relax thresholds for spaghetti query

* Relax thresholds for spaghetti query again

* Change printer to only output location when provided

* Add check to not print descriptions if there are no descriptions

* Update test cases

* Linting and MyPy
leonardomathon authored Mar 29, 2022
1 parent 1ea7233 commit bb4f3a5
Showing 11 changed files with 597 additions and 6 deletions.
1 change: 1 addition & 0 deletions requirements.txt
@@ -1,2 +1,3 @@
sqlparse==0.4.2
sql-metadata==2.4.0
rich==12.0.0
1 change: 1 addition & 0 deletions setup.cfg
@@ -33,6 +33,7 @@ packages =
    sqleyes.utils
install_requires =
    sqlparse>=0.4.2
    sql-metadata>=2.4.0
    rich>=12.0.0
python_requires = >=3.6
package_dir =
22 changes: 22 additions & 0 deletions sqleyes/definitions/antipatterns/spaghetti_query.md
@@ -0,0 +1,22 @@
### Spaghetti Query

Queries vary in difficulty. Some are significantly more sophisticated than others, such as a complex join between two databases or a recursive subquery. Sometimes, while developing a query for a complex task, the query becomes so complex that the programmer gets stuck. This is most likely because programmers are fixated on solving the task both elegantly and efficiently, so they try to complete it with a single query. However, the complexity of such a single query can grow exponentially, making both maintainability and correctness harder to achieve.

#### Example code

```SQL
SELECT COUNT(p.pID) AS numberOfDistinctProducts,
SUM(i.quantity) AS numberOfProducts,
AVG(i.unit_price) AS averagePrice,
city
FROM products p
JOIN inventories i ON (p.pID = i.pID)
JOIN stores s ON (i.sID = s.sID)
GROUP BY s.city
```

The query above can be considered overly complex for what it does, but it demonstrates the type of problem that can occur when a programmer tries to solve a complicated problem in one query. SQL is a sophisticated language that allows you to do a great deal with a single query or statement. However, this does not mean that it is essential to try to solve every problem with a single query or line of code.

#### Fix

Sometimes it is better to write separate queries for a task than to try to accomplish it with one query. A simple way to tackle complex queries is the **divide and conquer** method: divide the problem into multiple parts and solve them independently. In other words, if you break up a long, complex query into several simpler queries, you can focus on each part individually and do a better job on each of them, since they are less complex. While it is not always possible to split a query this way, it is a good general strategy, and often it is all that is necessary.
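
As a rough illustration of the divide and conquer idea, the sketch below uses Python's built-in `sqlite3` module. Only the table and column names come from the example query; the schema details and the sample rows are assumptions invented for this illustration. Each statistic gets its own small query, which makes it easier to check that each one answers the intended question.

```python
import sqlite3

# Hypothetical schema and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (pID INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE stores (sID INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE inventories (pID INTEGER, sID INTEGER,
                              quantity INTEGER, unit_price REAL);
    INSERT INTO products VALUES (1, 'pen'), (2, 'ink');
    INSERT INTO stores VALUES (1, 'Springfield'), (2, 'Springfield');
    INSERT INTO inventories VALUES (1, 1, 10, 2.0), (1, 2, 5, 2.0),
                                   (2, 1, 3, 6.0);
""")

# Query 1: number of distinct products stocked per city.
distinct_products = dict(conn.execute("""
    SELECT s.city, COUNT(DISTINCT i.pID)
    FROM inventories i JOIN stores s ON i.sID = s.sID
    GROUP BY s.city
""").fetchall())

# Query 2: total quantity in stock per city.
total_quantity = dict(conn.execute("""
    SELECT s.city, SUM(i.quantity)
    FROM inventories i JOIN stores s ON i.sID = s.sID
    GROUP BY s.city
""").fetchall())

# For comparison: a plain COUNT over the three-way join counts joined
# inventory rows, not distinct products, on this sample data.
naive_count = conn.execute("""
    SELECT COUNT(p.pID)
    FROM products p
    JOIN inventories i ON p.pID = i.pID
    JOIN stores s ON i.sID = s.sID
    GROUP BY s.city
""").fetchone()[0]
```

On this sample data the two focused queries report 2 distinct products and a total quantity of 18, while the plain `COUNT` over the join reports 3.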
5 changes: 5 additions & 0 deletions sqleyes/definitions/definitions.py
@@ -24,6 +24,11 @@
            "filename": "random_selection.md",
            "title": "Avoid ORDER BY RAND() usage",
            "type": "Random Selection"
        },
        "spaghetti_query": {
            "filename": "spaghetti_query.md",
            "title": "Avoid complex queries",
            "type": "Spaghetti Query"
        }
    }
}
42 changes: 42 additions & 0 deletions sqleyes/detector/antipatterns/spaghetti_query.py
@@ -0,0 +1,42 @@
"""Spaghetti Query anti-pattern detector class"""
from sqleyes.definitions.definitions import DEFINITIONS
from sqleyes.detector.antipatterns.abstract_base_class import AbstractDetector
from sqleyes.detector.detector_output import DetectorOutput
from sqleyes.utils.query_functions import get_query_complexity


class SpaghettiQueryDetector(AbstractDetector):

    filename = DEFINITIONS["anti_patterns"]["spaghetti_query"]["filename"]
    type = DEFINITIONS["anti_patterns"]["spaghetti_query"]["type"]
    title = DEFINITIONS["anti_patterns"]["spaghetti_query"]["title"]

    def __init__(self, query):
        super().__init__(query)

    def check(self):
        # Complexity thresholds for each certainty level
        LOW_THRESHOLD = 60
        MEDIUM_THRESHOLD = 75
        HIGH_THRESHOLD = 90

        query_complexity = get_query_complexity(self.query)

        if query_complexity < LOW_THRESHOLD:
            return None

        # Lower bounds are inclusive, so a complexity exactly equal to a
        # threshold still gets a certainty assigned
        if query_complexity < MEDIUM_THRESHOLD:
            certainty = "low"
        elif query_complexity < HIGH_THRESHOLD:
            certainty = "medium"
        else:
            certainty = "high"

        return DetectorOutput(query=self.query,
                              certainty=certainty,
                              description=super().get_description(),
                              detector_type=self.detector_type,
                              locations=[],
                              title=self.title,
                              type=self.type)
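
The mapping from complexity score to certainty level can be sketched as a standalone function, which makes the boundary cases easy to check. `certainty_for` is a hypothetical name that does not appear in the commit; the threshold values are copied from the detector. This sketch treats each lower bound as inclusive, so a score exactly equal to 60, 75 or 90 still maps to a label.

```python
from typing import Optional

# Threshold values copied from SpaghettiQueryDetector.check
LOW_THRESHOLD = 60
MEDIUM_THRESHOLD = 75
HIGH_THRESHOLD = 90


def certainty_for(complexity: float) -> Optional[str]:
    """Hypothetical helper: map a complexity score to a certainty label."""
    if complexity < LOW_THRESHOLD:
        return None
    if complexity < MEDIUM_THRESHOLD:
        return "low"
    if complexity < HIGH_THRESHOLD:
        return "medium"
    return "high"


labels = [certainty_for(score) for score in (59.9, 60, 75, 90)]
```

Writing the comparisons as a chain of upper bounds means every score at or above the low threshold falls into exactly one bucket.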
7 changes: 7 additions & 0 deletions sqleyes/detector/detector.py
@@ -5,6 +5,7 @@
from sqleyes.detector.antipatterns.implicit_columns import ImplicitColumnsDetector
from sqleyes.detector.antipatterns.poor_mans_search_engine import PoorMansSearchEngineDetector
from sqleyes.detector.antipatterns.random_selection import RandomSelectionDetector
from sqleyes.detector.antipatterns.spaghetti_query import SpaghettiQueryDetector
from sqleyes.detector.detector_output import DetectorOutput


@@ -29,6 +30,9 @@ def run(self) -> List[DetectorOutput]:
            List[DetectorOutput]: A list of Detector outputs of various
                detectors.
        """
        if self.query == "":
            return []

        ap_ambiguous_groups = AmbiguousGroupsDetector(query=self.query) \
            .check()
        self.anti_pattern_list.append(ap_ambiguous_groups)
@@ -47,4 +51,7 @@ def run(self) -> List[DetectorOutput]:
        ap_random_selection = RandomSelectionDetector(query=self.query).check()
        self.anti_pattern_list.append(ap_random_selection)

        ap_spaghetti_query = SpaghettiQueryDetector(query=self.query).check()
        self.anti_pattern_list.append(ap_spaghetti_query)

        return [ap for ap in self.anti_pattern_list if ap is not None]
6 changes: 4 additions & 2 deletions sqleyes/printer/printer.py
@@ -85,7 +85,9 @@ def print_descriptions(self):
self.console.print()
self.console.print(f"[bold red]type[/bold red]: {type}")
self.console.print(f"[bold red]title[/bold red]: {title}")
-self.console.print(f"[bold red]location(s)[/bold red]: {locations}")

+if locations:
+    self.console.print(f"[bold red]location(s)[/bold red]: {locations}")

for snippet in location_snippets:
    self.console.print(Padding(snippet, (2, 2, 1, 2)))
@@ -96,6 +98,6 @@ def print_descriptions(self):

def print(self, descriptions=False):
    self.print_summary()
-    if descriptions:
+    if descriptions and len(self.detector_output) != 0:
        self.console.print()
        self.print_descriptions()
37 changes: 37 additions & 0 deletions sqleyes/utils/code_complexity_metrics.py
@@ -0,0 +1,37 @@
"""Utility functions w.r.t code complexity metrics"""
import math


def halstead_metrics(n1: int, n2: int, N1: int, N2: int):
    """
    Compute Halstead metrics for a given program.
    Parameters:
        n1 (int): Number of unique operators in the query.
        n2 (int): Number of unique operands in the query.
        N1 (int): Number of operators in the query.
        N2 (int): Number of operands in the query.
    Returns:
        N (int): Program length.
        n (int): Program vocabulary.
        V (float): Program volume.
        D (float): Program difficulty.
        E (float): Program effort.
    """
    # Program length
    N = N1 + N2

    # Program vocabulary
    n = n1 + n2

    # Volume (taken as 0 for an empty vocabulary, where log2 is undefined)
    V = N * math.log2(n) if n > 0 else 0.0

    # Difficulty (taken as 0 when there are no operands, which avoids a
    # division by zero)
    D = (n1 / 2) * (N2 / n2) if n2 > 0 else 0.0

    # Effort
    E = D * V

    return (N, n, V, D, E)
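
A short worked example may help to see how the five values relate. The counts below are assumed for illustration and are not the output of the tool for any particular query; the formulas are the ones used in `halstead_metrics`.

```python
import math

# Assumed counts, chosen only for illustration:
# 3 unique operators used 3 times, 2 unique operands used 2 times.
n1, n2 = 3, 2   # unique operators, unique operands
N1, N2 = 3, 2   # total operators, total operands

N = N1 + N2               # program length: 5
n = n1 + n2               # program vocabulary: 5
V = N * math.log2(n)      # volume: 5 * log2(5), roughly 11.61
D = (n1 / 2) * (N2 / n2)  # difficulty: 1.5 * 1.0 = 1.5
E = D * V                 # effort: roughly 17.41
```

With these assumed counts the effort is well below the detector's lowest threshold of 60, so such a query would not be reported.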
192 changes: 191 additions & 1 deletion sqleyes/utils/query_functions.py
@@ -4,9 +4,88 @@

import sqlparse

from sqleyes.utils.code_complexity_metrics import halstead_metrics
from sqleyes.utils.query_keywords import SQL_FUNCTIONS


OPERATORS = ["+", "-", "*", "**", "/", "%", "&", "|", "||", "^", "=", ">", "<",
             ">=", "<=", "!<", "!>", "<>", "+=", "-=", "*=", "/=", "%=", "&=",
             "^=", "|=", "ALL", "AND", "&&", "ANY", "BETWEEN", "EXISTS",
             "IN", "LIKE", "NOT", "OR", "SOME", "IS NULL", "IS NOT NULL",
             "UNIQUE"]

EXPRESSIONS = ["CASE", "DECODE", "IF", "NULLIF", "COALESCE", "GREATEST",
"GREATER", "LEAST", "LESSER", "CAST"]


def format_query(query: str) -> str:
    """
    This function takes a query string as input and returns a formatted query.
    Parameters:
        query (str): The query string.
    Returns:
        str: The query with all SQL keywords in upper case.
    """
    return str(sqlparse.format(query, keyword_case='upper'))


def has_subqueries(query: str) -> bool:
    """
    This function takes a query string as input and returns True if that query
    contains subqueries.
    Parameters:
        query (str): The query string.
    Returns:
        bool: True if query contains subqueries, False otherwise
    """
    query = format_query(query)

    # A subquery is recognised by a SELECT directly after an opening bracket
    select_count = re.findall(r'\(\s*SELECT', query, flags=re.DOTALL |
                              re.IGNORECASE)

    return len(select_count) > 0


def has_union(query: str) -> bool:
    """
    This function takes a query string as input and returns True if that query
    contains a UNION.
    Parameters:
        query (str): The query string.
    Returns:
        bool: True if query contains a UNION, False otherwise
    """
    query = format_query(query)

    # Word boundaries prevent matches inside longer identifiers
    union_count = re.findall(r'\bUNION\b', query, flags=re.DOTALL |
                             re.IGNORECASE)

    return len(union_count) > 0


def get_unions(query: str) -> List[str]:
    """
    This function takes a query string as input and returns a list of the
    queries that are combined by UNION.
    Parameters:
        query (str): The query string.
    Returns:
        List[str]: A list of the queries in the union
    """
    if not has_union(query):
        return [query]

    return re.split(r"\s*\bUNION\b\s*", query,
                    flags=re.DOTALL | re.IGNORECASE)
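
Splitting on a bare `UNION` leaves a leading `ALL` on the following part when the query uses `UNION ALL`. One possible variant is sketched below; `split_unions` and `UNION_PATTERN` are hypothetical names that are not part of the commit, and the sketch does not attempt to skip `UNION` keywords inside string literals or subqueries.

```python
import re
from typing import List

# Word boundaries stop identifiers such as "reunion" from matching;
# the optional non-capturing group swallows the ALL of UNION ALL.
UNION_PATTERN = re.compile(r"\s*\bUNION\b(?:\s+ALL\b)?\s*",
                           flags=re.IGNORECASE)


def split_unions(query: str) -> List[str]:
    """Hypothetical variant of get_unions that also handles UNION ALL."""
    return UNION_PATTERN.split(query)


parts = split_unions("SELECT a FROM t1 UNION ALL SELECT b FROM t2 "
                     "union SELECT c FROM reunion")
```

Because the group is non-capturing, `re.split` returns only the individual queries and not the matched separators.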


def get_columns_from_select_statement(query: str) -> List[str]:
"""
This function takes a query string as input and returns a list of columns
@@ -16,8 +95,10 @@ def get_columns_from_select_statement(query: str) -> List[str]:
        query (str): The query string.
    Returns:
-        columns (list): A list of columns selected in the SELECT statement.
+        List[str]: A list of columns selected in the SELECT statement.
    """
    query = format_query(query)

    columns = re.findall(r'SELECT (.*?) FROM', query,
                         flags=re.DOTALL | re.IGNORECASE)

@@ -40,6 +121,8 @@ def get_columns_from_group_by_statement(query: str) -> List[str]:
    Returns:
        List[str]: A list of column names in the GROUP BY statement.
    """
    query = format_query(query)

    tokens = sqlparse.parse(query)[0].tokens

    # Find index of group by keyword in tokens
@@ -68,6 +151,113 @@ def get_columns_from_group_by_statement(query: str) -> List[str]:
    return group_columns


def get_columns_from_order_by_statement(query: str) -> List[str]:
    """
    This function takes a query string as input and returns a list of columns
    in the ORDER BY statement.
    Parameters:
        query (str): The query string.
    Returns:
        List[str]: A list of column names in the ORDER BY statement.
    """
    query = format_query(query)

    tokens = sqlparse.parse(query)[0].tokens

    # Find index of ORDER BY keyword in tokens
    for i in range(0, len(tokens)):
        if tokens[i].value.upper() == "ORDER BY":
            break

    # Query has no ORDER BY statement
    if i == len(tokens) - 1:
        return []

    # Find possible index of next keyword
    for j in range(i + 1, len(tokens)):
        if tokens[j].ttype is sqlparse.tokens.Keyword:
            break

    # Get column names
    order_columns = []
    for item in tokens[i:j + 1]:
        if isinstance(item, sqlparse.sql.IdentifierList):
            for identifier in item.get_identifiers():
                order_columns.append(identifier.get_name())
        elif isinstance(item, sqlparse.sql.Identifier):
            order_columns.append(item.get_name())

    return order_columns


def get_all_columns(query: str) -> List[str]:
    """
    Returns the columns of the SELECT, GROUP BY and ORDER BY statements of a
    query as one list. Columns that occur more than once are kept.
    """
    select_columns = get_columns_from_select_statement(query)
    group_by_columns = get_columns_from_group_by_statement(query)
    order_by_columns = get_columns_from_order_by_statement(query)

    return select_columns + group_by_columns + order_by_columns


def get_query_ops_and_expr(query: str) -> List[str]:
    """
    Finds all the operators and expressions used inside a query. Returns a list
    of all operators and expressions
    Parameters:
        query (str): A SQL query string.
    Returns:
        List[str]: A list of all operators and expressions from the input query
    """
    result = []

    # Get all operators used in the query
    # Note: str.count matches substrings, so "IN" is also counted inside
    # "JOIN" and ">" is also counted inside ">="
    for operator in OPERATORS:
        count = query.count(operator)
        if count != 0:
            result.extend([operator] * count)

    # Get all expressions used in the query
    for expression in EXPRESSIONS:
        count = query.count(expression)
        if count != 0:
            result.extend([expression] * count)

    return result
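
Counting with `str.count` matches substrings, so keyword operators are also found inside longer words and short symbols inside longer ones. A token-aware alternative could look like the sketch below. `count_operators` and the two shortened operator lists are hypothetical and only illustrate the idea; the sketch assumes keywords are already in upper case, as they are after `format_query`.

```python
import re
from typing import List

# Small illustrative subsets, not the full lists from the module
KEYWORD_OPERATORS = ["AND", "OR", "IN", "NOT", "LIKE"]
SYMBOL_OPERATORS = [">=", "<=", "<>", "=", ">", "<"]


def count_operators(query: str) -> List[str]:
    """Hypothetical token-aware alternative to substring counting."""
    found = []
    # Keywords must match as whole words, so "IN" is not found in "JOIN"
    for keyword in KEYWORD_OPERATORS:
        found.extend(re.findall(rf"\b{keyword}\b", query))
    # Longest symbols first, so ">=" is consumed before ">" or "="
    symbols = sorted(SYMBOL_OPERATORS, key=len, reverse=True)
    pattern = "|".join(re.escape(symbol) for symbol in symbols)
    found.extend(re.findall(pattern, query))
    return found


ops = count_operators("SELECT a FROM t JOIN u ON t.id = u.id "
                      "WHERE a >= 1 OR a IN (2, 3) ORDER BY a")
```

For this query the sketch finds `OR`, `IN`, `=` and `>=` once each, whereas substring counting would also count `IN` in `JOIN`, `OR` in `ORDER`, and `>` and `=` in `>=`. Switching to such a counter would lower the resulting scores, so the detector thresholds would likely need to be revisited.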


def get_query_complexity(query: str) -> float:
    """
    Calculates the complexity of a query based on the Halstead effort metric
    Parameters:
        query (str): A SQL query string.
    Returns:
        float: The complexity of the query
    """
    query = format_query(query)

    # From paper 'Measuring Query Complexity in SQLShare Workload'
    # Number of operators and expressions as Halstead operators
    operators = get_query_ops_and_expr(query)

    # Number of columns referenced in query as Halstead operands
    # If query has a UNION, get all columns for each query in the union
    operands = []
    if has_union(query):
        for q in get_unions(query):
            operands.extend(get_all_columns(q))
    else:
        operands.extend(get_all_columns(query))

    N1, N2 = len(operators), len(operands)
    n1, n2 = len(set(operators)), len(set(operands))

    # Index 4 of the returned tuple is the Halstead effort E
    return float(halstead_metrics(n1, n2, N1, N2)[4])


def check_single_value_rule(columns: List[str]) -> bool:
"""
This function checks if the columns in the list break the single-value