Introduction
As a data engineer or database administrator, you should treat database cost as a priority. Writing efficient queries is crucial, especially in environments that process terabytes of data daily: inefficient queries consume more compute, and those costs add up significantly over time.
But how do you write efficient SQL queries? There are many practices to draw on, but this discussion focuses on Common Table Expressions (CTEs), views, subqueries, temp tables, and materialized views. We will explore their strengths and weaknesses and learn when to use each one, rather than relying on rules of thumb found on the internet.
We can split these constructs into two groups based on what they store. Unmaterialized constructs hold logic that can be referenced elsewhere in a query or pipeline; materialized constructs hold the data a query produced, which can then be queried itself.
Unmaterialized Constructs
Unmaterialized constructs do not store data physically in the database. Instead, they organize and simplify queries by defining result sets that are computed on the fly each time they are accessed. They improve readability and reuse. CTEs and subqueries exist only for the duration of the query, while a view's definition persists until it is dropped. These include:
- Common Table Expressions (CTEs)
- Views
- Subqueries
Common Table Expression (CTE)
CTEs begin with the WITH keyword and exist only within the query that defines them. They improve readability and organization and are good for breaking complex queries into simpler parts.
Let's take a quick look at two SQL queries that perform the same function, one written with a subquery and the other with a CTE. In terms of readability, the second code segment divides the query into two individual parts: the first calculates the average salary per department, while the second part references the calculated average to filter employees.
-- Using Subquery
SELECT e.emp_id, e.emp_name, e.salary, d.dept_name
FROM employees e
JOIN departments d ON e.dept_id = d.dept_id
WHERE e.salary > (
    SELECT AVG(salary)
    FROM employees
    WHERE dept_id IN (
        SELECT dept_id
        FROM employees
        WHERE emp_id = e.emp_id
    )
);
-- Using CTE
WITH AvgSalaryPerDept AS (
    SELECT dept_id, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY dept_id
)
SELECT e.emp_id, e.emp_name, e.salary, d.dept_name
FROM employees e
JOIN departments d ON e.dept_id = d.dept_id
JOIN AvgSalaryPerDept a ON e.dept_id = a.dept_id
WHERE e.salary > a.avg_salary;
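To check that the two forms really return the same rows, the sketch below runs both against a small in-memory SQLite database from Python. The sample rows are invented for illustration, SQLite is used only because it ships with Python, and the correlated subquery is written in a shorter equivalent form that compares dept_id directly.

```python
import sqlite3

# Hypothetical sample data; table and column names follow the queries above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE departments (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
CREATE TABLE employees (
    emp_id INTEGER PRIMARY KEY, emp_name TEXT, salary REAL, dept_id INTEGER
);
INSERT INTO departments VALUES (1, 'Engineering'), (2, 'Sales');
INSERT INTO employees VALUES
    (1, 'Ada', 120.0, 1), (2, 'Ben', 80.0, 1),
    (3, 'Cy', 50.0, 2), (4, 'Di', 70.0, 2), (5, 'Ed', 60.0, 2);
""")

# Correlated subquery: the average is looked up per outer row.
subquery_sql = """
SELECT e.emp_id, e.emp_name, e.salary, d.dept_name
FROM employees e
JOIN departments d ON e.dept_id = d.dept_id
WHERE e.salary > (
    SELECT AVG(salary) FROM employees WHERE dept_id = e.dept_id
)
ORDER BY e.emp_id
"""

# CTE: the per-department averages are named once, then joined.
cte_sql = """
WITH AvgSalaryPerDept AS (
    SELECT dept_id, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY dept_id
)
SELECT e.emp_id, e.emp_name, e.salary, d.dept_name
FROM employees e
JOIN departments d ON e.dept_id = d.dept_id
JOIN AvgSalaryPerDept a ON e.dept_id = a.dept_id
WHERE e.salary > a.avg_salary
ORDER BY e.emp_id
"""

subquery_rows = conn.execute(subquery_sql).fetchall()
cte_rows = conn.execute(cte_sql).fetchall()

print("Same result:", subquery_rows == cte_rows)
for row in cte_rows:
    print(row)
```

An ORDER BY is added to both queries only so the two result lists can be compared directly; it is not part of the original queries.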
Views
Views are named queries stored in the database that can be referenced in other queries. Unlike a CTE, which is scoped to a single statement, a view lets you share transformation logic across sessions and transformation pipelines. They are useful for simplifying complex queries and promoting reusability.
-- Creating a view
CREATE VIEW AvgSalaryView AS
SELECT dept_id, AVG(salary) AS avg_dept_salary
FROM employees
GROUP BY dept_id;
-- Querying the view
SELECT e.emp_id, e.emp_name, e.salary
FROM employees e
JOIN AvgSalaryView a ON e.dept_id = a.dept_id
WHERE e.salary > a.avg_dept_salary;
The view "AvgSalaryView" is stored in the database, so it can be reused by other queries and other users. For a data engineer, this means you can define the logic once, expose it as a view, and let analysts query the view without having to deal with the underlying logic.
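Because a view stores only the query and not its result, it is re-evaluated on every access. A minimal sketch of that behavior, assuming an in-memory SQLite database and invented sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (
    emp_id INTEGER PRIMARY KEY, emp_name TEXT, salary REAL, dept_id INTEGER
);
INSERT INTO employees VALUES (1, 'Ada', 120.0, 1), (2, 'Ben', 80.0, 1);

CREATE VIEW AvgSalaryView AS
SELECT dept_id, AVG(salary) AS avg_dept_salary
FROM employees
GROUP BY dept_id;
""")

query = "SELECT avg_dept_salary FROM AvgSalaryView WHERE dept_id = 1"

# First read: the view's query runs against the current rows.
before = conn.execute(query).fetchone()[0]

# Change the base table; nothing is done to the view itself.
conn.execute("INSERT INTO employees VALUES (3, 'Cy', 160.0, 1)")

# Second read: the view's query runs again and sees the new row.
after = conn.execute(query).fetchone()[0]

print(before, after)
```

The second read picks up the new employee without any refresh step, which is the trade-off of an unmaterialized construct: always current, but recomputed each time.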
However, in some RDBMSs (PostgreSQL, for example), dropping a table that a view depends on returns an error unless you drop the view first or explicitly cascade the drop.
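Engines differ on this point, so it is worth testing on the one you use. The sketch below shows the opposite behavior in SQLite, which is chosen here only because it ships with Python: the DROP TABLE is accepted, and the view is left behind pointing at a table that no longer exists. The sample data is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (
    emp_id INTEGER PRIMARY KEY, emp_name TEXT, salary REAL, dept_id INTEGER
);
INSERT INTO employees VALUES (1, 'Ada', 120.0, 1), (2, 'Ben', 80.0, 1);

CREATE VIEW AvgSalaryView AS
SELECT dept_id, AVG(salary) AS avg_dept_salary
FROM employees
GROUP BY dept_id;
""")

# SQLite does not check for dependent views when a table is dropped.
conn.execute("DROP TABLE employees")

# The view definition still exists, but querying it now fails.
try:
    conn.execute("SELECT * FROM AvgSalaryView").fetchall()
    view_error = None
except sqlite3.OperationalError as exc:
    view_error = str(exc)

print("View query failed with:", view_error)
```

Either way, the practical advice is the same: check which views depend on a table before dropping it.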
Subqueries
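Subqueries are similar to CTEs: both define an intermediate result inline within a single statement. They are useful when working with older database versions that do not support CTEs. Performance depends on the engine, and many optimizers produce the same plan for an equivalent CTE and subquery, so CTEs are preferable mainly for readability, particularly when queries are deeply nested or the same logic is referenced more than once.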
Usage Bias
- Use CTEs by default
- Use views when sharing logic with other people
- Use subqueries on older databases that lack CTE support
Remember, these constructs do not materialize data. They only store transformation logic but not the data itself, which in some cases might not be desired. This is where materialized options, which store the data itself, come into play. These include:
- Temp Tables
- Materialized Views
Materialized Constructs
Knowing when to use a materialized construct is just as important. These constructs store the data itself, so the result is computed once and can be reused in multiple places. They are useful when the logic is expensive to recompute or when intermediate results need to be stored.
Temp Tables
Temporary tables are similar to regular tables but live in temporary storage (tempdb in SQL Server, for example) and are typically dropped when the session that created them ends. They are good for storing intermediate results or handling complex transformations.
-- Creating a Temp Table in MySQL
CREATE TEMPORARY TABLE AvgSalaryTemp AS
SELECT dept_id, AVG(salary) AS avg_salary
FROM employees
GROUP BY dept_id;
-- Querying the temp table
SELECT e.emp_id, e.emp_name, e.salary
FROM employees e
JOIN AvgSalaryTemp s ON e.dept_id = s.dept_id
WHERE e.salary > s.avg_salary;
-- Dropping it explicitly instead of waiting for the session to end
DROP TEMPORARY TABLE AvgSalaryTemp;
-- Creating a Temp Table in SQL Server
CREATE TABLE #AvgSalaryTemp (
    dept_id INT,
    avg_salary DECIMAL(10, 2)
);
-- Populating it with the per-department averages
INSERT INTO #AvgSalaryTemp
SELECT dept_id, AVG(salary) AS avg_salary
FROM employees
GROUP BY dept_id;
-- Querying the temp table
SELECT e.emp_id, e.emp_name, e.salary
FROM employees e
JOIN #AvgSalaryTemp s ON e.dept_id = s.dept_id
WHERE e.salary > s.avg_salary;
-- Dropping it explicitly instead of waiting for the session to end
DROP TABLE #AvgSalaryTemp;
Temp tables exist for the duration of the query session. They get deleted when you close the session, which can be convenient or inconvenient depending on your use case. Alternatively, you can create a "staging table" or a materialized view meant to last for a set number of days.
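The session-scoped lifetime can be seen in a small experiment. The sketch below is an illustration under assumptions, not a recipe for any particular engine: it uses a file-backed SQLite database in a temporary directory, treats each connection as one session, and relies on a hypothetical helper, table_names, to list the tables a connection can see.

```python
import os
import sqlite3
import tempfile

# A file-backed database so that regular tables outlive a single connection.
db_path = os.path.join(tempfile.mkdtemp(), "demo.db")


def table_names(conn):
    """Return the regular and temporary table names visible to this connection."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' "
        "UNION ALL "
        "SELECT name FROM sqlite_temp_master WHERE type = 'table'"
    ).fetchall()
    return sorted(name for (name,) in rows)


# Session 1: create a regular table and a temp table derived from it.
session1 = sqlite3.connect(db_path)
session1.executescript("""
CREATE TABLE employees (
    emp_id INTEGER PRIMARY KEY, emp_name TEXT, salary REAL, dept_id INTEGER
);
INSERT INTO employees VALUES (1, 'Ada', 120.0, 1), (2, 'Ben', 80.0, 1);

CREATE TEMP TABLE AvgSalaryTemp AS
SELECT dept_id, AVG(salary) AS avg_salary
FROM employees
GROUP BY dept_id;
""")
in_session1 = table_names(session1)
session1.close()  # ending the session discards the temp table

# Session 2: only the regular table is still there.
session2 = sqlite3.connect(db_path)
in_session2 = table_names(session2)
session2.close()

print("Session 1 sees:", in_session1)
print("Session 2 sees:", in_session2)
```

If the intermediate result has to survive the session, that is the signal to use a staging table or a materialized view instead.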
Materialized View
Materialized views are precomputed snapshots of a query's result, stored for faster access. They are especially useful for complex queries that do not require real-time data and would be too slow to compute on the fly. They improve performance by replacing repeated computation with a read of the stored result.
-- Creating a Materialized View (PostgreSQL syntax)
CREATE MATERIALIZED VIEW AvgSalaryMatView AS
SELECT dept_id, AVG(salary) AS avg_salary
FROM employees
GROUP BY dept_id;
-- Querying the materialized view
SELECT e.emp_id, e.emp_name, e.salary
FROM employees e
JOIN AvgSalaryMatView s ON e.dept_id = s.dept_id
WHERE e.salary > s.avg_salary;
However, materialized views must be refreshed periodically to keep the data up to date. A refresh can also block other people who are querying the view while it runs, unless the engine supports a concurrent refresh (PostgreSQL's REFRESH MATERIALIZED VIEW CONCURRENTLY, for example).
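The staleness that makes a refresh necessary can be illustrated with a rough emulation. SQLite has no materialized views, so the sketch below stands in a plain snapshot table (AvgSalarySnapshot) and a hypothetical helper, refresh_snapshot, that rebuilds it; a real engine would use its own refresh command instead. The sample data is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (
    emp_id INTEGER PRIMARY KEY, emp_name TEXT, salary REAL, dept_id INTEGER
);
INSERT INTO employees VALUES (1, 'Ada', 120.0, 1), (2, 'Ben', 80.0, 1);
""")


def refresh_snapshot(conn):
    """Rebuild the snapshot table from the current rows (emulated refresh)."""
    conn.executescript("""
        DROP TABLE IF EXISTS AvgSalarySnapshot;
        CREATE TABLE AvgSalarySnapshot AS
        SELECT dept_id, AVG(salary) AS avg_salary
        FROM employees
        GROUP BY dept_id;
    """)


query = "SELECT avg_salary FROM AvgSalarySnapshot WHERE dept_id = 1"

# Initial build: the snapshot matches the base table.
refresh_snapshot(conn)
before = conn.execute(query).fetchone()[0]

# The base table changes, but the stored snapshot does not.
conn.execute("INSERT INTO employees VALUES (3, 'Cy', 160.0, 1)")
stale = conn.execute(query).fetchone()[0]

# Only after a refresh does the snapshot reflect the new row.
refresh_snapshot(conn)
fresh = conn.execute(query).fetchone()[0]

print(before, stale, fresh)
```

Reads between the insert and the refresh return the old average, so how often to refresh comes down to how much staleness the consumers of the data can tolerate.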
Conclusion
Efficient database management and query optimization are crucial for maintaining performance and cost-effectiveness in environments with large datasets. Understanding the differences between unmaterialized constructs (CTEs, views, subqueries) and materialized constructs (temp tables, materialized views) is essential. Use CTEs for readability and breaking down complex queries, views for reusability and sharing logic, and materialized options for storing intermediate results and handling complex transformations. By leveraging these constructs appropriately, you can significantly improve the efficiency of your SQL queries and overall database performance.