How to deal with duplicate records in BigQuery?

Nowadays, more and more decision-makers are relying on elaborated and accurate analysis. However, duplicate records reduce the credibility of the data and weaken the analysis report.

It’s essential to do the data health check before starting any analysis. Dealing with duplicate records is one of the important tasks.

Let’s start creating a temporary table to illustrate the case.

WITH tmp AS
 (SELECT 'Adams' as LastName, 15 as Age,'aaa' as SchoolID, UNION ALL
  SELECT 'Buchanan',16,'bbb' UNION ALL
  SELECT 'Coolidge',15, 'ccc' UNION ALL
  SELECT 'Davis',17,'ddd' UNION ALL
  SELECT 'Eisenhower',16 ,'eee' UNION ALL
  SELECT 'Buchanan',16,'bbb' )
SELECT * FROM tmp

After the query, it will look like this:

Row	LastName	Age	SchoolID
1	Adams	15	aaa
2	Buchanan	16	bbb
3	Coolidge	15	ccc
4	Davis	17	ddd
5	Eisenhower	16	eee
6	Buchanan	16	bbb

In this table, we can take a second to find out where the duplicate is: row 2 and row 6 are identical. But what can we do when the records are millions or hundreds of millions? The eyes are not the appropriate tool to do such a job.

Step 1: Identify whether your dataset contains duplicates

SELECT
  (SELECT COUNT(1) FROM (SELECT DISTINCT * FROM tmp)) AS distinct_rows,
  (SELECT COUNT(1) FROM tmp) AS total_rows

The result will be as follows. We can see there are five distinct rows and six rows in total. It means we have one row is duplicated.

Row	distinct_rows	total_rows
1	5	6

Step 2: Display the duplicate records

SELECT
  COUNT(*) AS num_duplicate_rows, # count the duplicate records
  *                               #also select all the columns
FROM tmp
GROUP BY LastName, Age,SchoolID   # group by all columns
HAVING
  num_duplicate_rows > 1;        #find out where the row records more than 1

The result will be as follows.

Row	num_duplicate_rows	LastName	Age	SchoolID
1	2	Buchanan	16	bbb

Step 3: Remove the duplicates and save the result to a new table

We don’t remove the duplicate record directly, instead, we just need to select the records without duplicates. Therefore, we can do li like this.

#create a new table names tmp_dedup, don't overwrite the current one directly
CREATE OR REPLACE TABLE sandbox.tmp_dedup      
AS
SELECT
DISTINCT *  # select the records without duplicates
FROM tmp

Voila! The result will be neat and clean as we expect.

Row	LastName	Age	SchoolID
1	Adams	15	aaa
2	Buchanan	16	bbb
3	Coolidge	15	ccc
4	Davis	17	ddd
5	Eisenhower	16	eee

Share on Social Media