Skip to content

How to deal with duplicate records in BigQuery?

Nowadays, more and more decision-makers are relying on elaborated and accurate analysis. However, duplicate records reduce the credibility of the data and weaken the analysis report.

It’s essential to do the data health check before starting any analysis. Dealing with duplicate records is one of the important tasks.

Let’s start creating a temporary table to illustrate the case.

WITH tmp AS
 (SELECT 'Adams' as LastName, 15 as Age,'aaa' as SchoolID, UNION ALL
  SELECT 'Buchanan',16,'bbb' UNION ALL
  SELECT 'Coolidge',15, 'ccc' UNION ALL
  SELECT 'Davis',17,'ddd' UNION ALL
  SELECT 'Eisenhower',16 ,'eee' UNION ALL
  SELECT 'Buchanan',16,'bbb' )
SELECT * FROM tmp

After the query, it will look like this:

RowLastNameAgeSchoolID
1Adams15aaa
2Buchanan16bbb
3Coolidge15ccc
4Davis17ddd
5Eisenhower16eee
6Buchanan16bbb

In this table, we can take a second to find out where the duplicate is: row 2 and row 6 are identical. But what can we do when the records are millions or hundreds of millions? The eyes are not the appropriate tool to do such a job.

Step 1: Identify whether your dataset contains duplicates

SELECT
  (SELECT COUNT(1) FROM (SELECT DISTINCT * FROM tmp)) AS distinct_rows,
  (SELECT COUNT(1) FROM tmp) AS total_rows

The result will be as follows. We can see there are five distinct rows and six rows in total. It means we have one row is duplicated.

Rowdistinct_rowstotal_rows
156

Step 2: Display the duplicate records

SELECT
  COUNT(*) AS num_duplicate_rows, # count the duplicate records
  *                               #also select all the columns
FROM tmp
GROUP BY LastName, Age,SchoolID   # group by all columns
HAVING
  num_duplicate_rows > 1;        #find out where the row records more than 1

The result will be as follows.

Rownum_duplicate_rowsLastNameAgeSchoolID
12Buchanan16bbb

Step 3: Remove the duplicates and save the result to a new table

We don’t remove the duplicate record directly, instead, we just need to select the records without duplicates. Therefore, we can do li like this.

#create a new table names tmp_dedup, don't overwrite the current one directly
CREATE OR REPLACE TABLE sandbox.tmp_dedup      
AS
SELECT
DISTINCT *  # select the records without duplicates
FROM tmp

Voila! The result will be neat and clean as we expect.

RowLastNameAgeSchoolID
1Adams15aaa
2Buchanan16bbb
3Coolidge15ccc
4Davis17ddd
5Eisenhower16eee