Crafting an effective SQL query is like mastering a culinary art – it’s about blending efficiency, data manipulation, and key principles for top-notch database performance, sustainability, and crystal-clear clarity.
Don’t worry if this sounds a bit complex at first. Think of it as a familiar scenario: you fire off a query and end up waiting so long that you can practically brew, sip, and savor a cup of coffee. Or worse, your computer suddenly sounds like it’s running a marathon due to the query gobbling up every ounce of CPU and memory. And then there’s the nightmare scenario: a poorly written query not only throws off your database’s performance, but can even crash the entire server – yikes!
To dodge these SQL pitfalls and ensure a smoother ride through the world of database management, let’s dive into some practical tips and tried-and-true practices that I’ve gathered from my own experience. Your SQL journey is about to get a whole lot smoother.
1.Use the preview option
If you just want a quick overview of your data, you should use the preview option to check the data and its schema. It returns the result instantly and costs nothing.
2.Avoid SELECT *
When writing queries, it would be better to select only the columns that you need, rather than SELECT *
. SELECT *
retrieves unnecessary data besides that it increases the network traffic used for your queries.
- In best practice, you can apply a
LIMIT
clause, aWHERE
clause or aHAVING
clause to aSELECT *
query does not affect the amount of data read. - You also can use
SELECT * EXCEPT
statement to retrieve data. This method enables you to eliminate unwanted columns and significantly reduce the volume of data processed by your query.
Example 1
SELECT gender, age, name
FROM `Employees`
WHERE Salary > 5000;
Example 2
SELECT * EXCEPT (col1, col4, col6, col8)
3.Use Partitioned tables
If your database holds or will hold a lot of historical data, think about using partitioned tables instead of sharded tables (also called date-named tables) when designing it.
A good tip is to not create too many sharding tables. This is because queries that need to gather data from multiple sharding tables can get complicated and slow. This happens because data has to be collected from different places and maybe even put together. Handling a large number of sharding tables can also become tricky, leading to more work to keep everything running smoothly, potential issues, and extra work to keep things going.
When it comes to performance, partitioned tables tend to work better than date-named tables.
- For ingestion-time partitioned tables, use the pseudo-column
_PARTITIONTIME
- For filtering time partitioned tables, use
_PARTITIONTIME
to specify a date or range of dates
Example 3
SELECT col1 , col2, col3 FROM `purchase`
WHERE _PARTITIONTIME
BETWEEN TIMESTAMP("20230101") AND TIMESTAMP("20230601")
4. Reduce data before JOIN
The JOIN
is very resource-intensive, especially when dealing with large datasets. Joining large tables without reducing data first can lead to high memory and CPU usage, potentially causing performance bottlenecks and slowing down the entire database server. Therefore, by reduce data before joining, you can decrease the number of rows that need to be processed and shorten query execution times.
In best practice, you can use filter, aggregation, subquery and CTE to significantly reduce data before using JOIN
- Filtering: Apply appropriate filtering conditions (using
WHERE
clauses) to limit the number of rows that need to be joined. - Aggregation: Perform aggregation operations like
SUM, COUNT, AVG
, etc., before joining, so you’re working with summarized data. - Subqueries: Use subqueries to pre-select relevant data before joining, which can result in more efficient processing.
- Temporary Tables or Common Table Expressions (CTEs): Create temporary tables to hold reduced data subsets that you’ll then join in your main query.
Example 4
# create a temporary table named comments and reduce the data precisely
WITH
comments AS (
SELECT
user_id,
COUNT(text) AS comments_count
FROM
table1
WHERE
user_id IS NOT NULL
GROUP BY user_id
ORDER BY comments_count DESC
LIMIT 10
)
SELECT
user_id,
display_name,
comments_count
FROM comments
JOIN
table2 AS u
ON
user_id = u.id
ORDER BY comments_count DESC;
5. Place ORDER BY at the outermost
You should put the complex operations to the end of the query, otherwise they can easily cause very fast break off your table. The best partner of ORDER BY
clause is LIMIT
clause, especially when you are ordering a very large number of values but don’t need to have all of them returned.
Example 5
SELECT sales_by_category
FROM `myproject.mydatabase.sales`
ORDER BY sales_sum DESC
LIMIT 10;
Conclusion
In the real world, there’s a bunch of ways to make your queries run smoother and faster. Out of these, the five methods we’ve talked about are the ones you’ll likely use the most. They’re practical and hands-on, and they can really give your SQL performance a boost!