How to handle duplicates with Python?

The presence of duplicate values within datasets remains one of the most common challenges affecting data quality. When performing data analysis or constructing machine learning models, duplicates can introduce bias and lead to inaccuracies. Consequently, identifying and effectively managing these duplications within datasets becomes paramount.

What are Duplicate Values?

Duplicate values manifest as data points sharing identical characteristics, either entirely or in part, within a dataset. These duplicates often arise due to issues with data input, data collection processes, or other contextual factors that contribute to their emergence.

As an example, we can create a small dataset that contains duplicate records. It has two entries for “Tim,” both aged 30, so these two records are duplicates of each other.

import pandas as pd

# Build a small DataFrame in which the 'Tim' record appears twice
df = pd.DataFrame({
    'name': ['Jack', 'Jane', 'Tim', 'Mike', 'Tim'],
    'age': [31, 25, 30, 29, 30]})

# Display the df
print(df)
   name  age
0  Jack   31
1  Jane   25
2   Tim   30
3  Mike   29
4   Tim   30

Detect Duplicate Values

The initial step in mitigating duplicate value issues involves their detection within the dataset.

The pandas library provides functions for identifying duplicate entries. The duplicated() function, for instance, returns a Boolean Series that is True for each row repeating an earlier row and False otherwise.

Once duplicates have been identified, we can use the drop_duplicates() function to remove these redundant rows from the dataset.

# Identify duplicate records (the result is a Boolean Series)
print(df.duplicated())

0    False
1    False
2    False
3    False
4     True
dtype: bool
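As a complement to the detection step, the following minimal sketch shows three common variations on duplicated(): counting the flagged rows, listing every copy with keep=False, and comparing only some columns with subset. It rebuilds the same example DataFrame so it can run on its own; the variable names (n_duplicates, all_copies, same_name) are illustrative and not part of the original walkthrough.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Jack', 'Jane', 'Tim', 'Mike', 'Tim'],
    'age': [31, 25, 30, 29, 30],
})

# Count the rows flagged as duplicates of an earlier row
n_duplicates = int(df.duplicated().sum())

# keep=False marks every member of a duplicate group, not only the later copies
all_copies = df[df.duplicated(keep=False)]

# subset restricts the comparison to the listed columns (partial duplicates)
same_name = df.duplicated(subset=['name'])

print(n_duplicates)
print(all_copies)
print(same_name.tolist())
```

Checking the count and inspecting all copies side by side before deleting anything can help confirm that the flagged rows really are redundant.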

Remove Duplicate Values

# Remove duplicate records and store the result in a new dataset called df1
df1 = df.drop_duplicates()

# Display the new dataset df1
print(df1)

   name  age
0  Jack   31
1  Jane   25
2   Tim   30    
3  Mike   29

By default, the drop_duplicates() function keeps the first occurrence of each record and deletes the later duplicates. You can choose explicitly whether to keep the first or the last occurrence with the keep option.

df.drop_duplicates(keep='first')   # keep the first one
df.drop_duplicates(keep='last')    # keep the last one
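Because the duplicated rows in this example are identical, the visible difference between keep='first' and keep='last' is which index label survives. The sketch below illustrates that, along with three other drop_duplicates() options: keep=False, subset, and ignore_index (the last assumes pandas 1.0 or later). It is an illustration under those assumptions, not part of the original walkthrough, and the variable names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Jack', 'Jane', 'Tim', 'Mike', 'Tim'],
    'age': [31, 25, 30, 29, 30],
})

# keep='first' retains row 2 for Tim; keep='last' retains row 4 instead
first_kept = df.drop_duplicates(keep='first')
last_kept = df.drop_duplicates(keep='last')

# keep=False drops every row that has a duplicate, leaving only unique rows
unique_only = df.drop_duplicates(keep=False)

# subset compares only the 'name' column; ignore_index renumbers rows from 0
by_name = df.drop_duplicates(subset=['name'], keep='last', ignore_index=True)

print(first_kept.index.tolist())
print(last_kept.index.tolist())
print(unique_only)
print(by_name)
```

Note that drop_duplicates() returns a new DataFrame in each case, so df itself is left unchanged.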