Nowadays, more and more decision-makers are relying on elaborated and accurate analysis. However, duplicate records reduce the credibility of the data and weaken the analysis report.
It’s essential to do the data health check before starting any analysis. Dealing with duplicate records is one of the important tasks.
Let’s start creating a temporary table to illustrate the case.
WITH tmp AS
(SELECT 'Adams' as LastName, 15 as Age,'aaa' as SchoolID, UNION ALL
SELECT 'Buchanan',16,'bbb' UNION ALL
SELECT 'Coolidge',15, 'ccc' UNION ALL
SELECT 'Davis',17,'ddd' UNION ALL
SELECT 'Eisenhower',16 ,'eee' UNION ALL
SELECT 'Buchanan',16,'bbb' )
SELECT * FROM tmp
After the query, it will look like this:
Row | LastName | Age | SchoolID |
---|---|---|---|
1 | Adams | 15 | aaa |
2 | Buchanan | 16 | bbb |
3 | Coolidge | 15 | ccc |
4 | Davis | 17 | ddd |
5 | Eisenhower | 16 | eee |
6 | Buchanan | 16 | bbb |
In this table, we can take a second to find out where the duplicate is: row 2 and row 6 are identical. But what can we do when the records are millions or hundreds of millions? The eyes are not the appropriate tool to do such a job.
Step 1: Identify whether your dataset contains duplicates
SELECT
(SELECT COUNT(1) FROM (SELECT DISTINCT * FROM tmp)) AS distinct_rows,
(SELECT COUNT(1) FROM tmp) AS total_rows
The result will be as follows. We can see there are five distinct rows and six rows in total. It means we have one row is duplicated.
Row | distinct_rows | total_rows |
---|---|---|
1 | 5 | 6 |
Step 2: Display the duplicate records
SELECT
COUNT(*) AS num_duplicate_rows, # count the duplicate records
* #also select all the columns
FROM tmp
GROUP BY LastName, Age,SchoolID # group by all columns
HAVING
num_duplicate_rows > 1; #find out where the row records more than 1
The result will be as follows.
Row | num_duplicate_rows | LastName | Age | SchoolID |
---|---|---|---|---|
1 | 2 | Buchanan | 16 | bbb |
Step 3: Remove the duplicates and save the result to a new table
We don’t remove the duplicate record directly, instead, we just need to select the records without duplicates. Therefore, we can do li like this.
#create a new table names tmp_dedup, don't overwrite the current one directly
CREATE OR REPLACE TABLE sandbox.tmp_dedup
AS
SELECT
DISTINCT * # select the records without duplicates
FROM tmp
Voila! The result will be neat and clean as we expect.
Row | LastName | Age | SchoolID |
---|---|---|---|
1 | Adams | 15 | aaa |
2 | Buchanan | 16 | bbb |
3 | Coolidge | 15 | ccc |
4 | Davis | 17 | ddd |
5 | Eisenhower | 16 | eee |