Duplicate values remain one of the most common data quality problems. In data analysis or machine learning, duplicates can introduce bias, for example by giving repeated records extra weight, and lead to inaccurate results. Identifying and handling duplicates within a dataset is therefore essential.
What are Duplicate Values?
Duplicate values are data points that share identical characteristics, either entirely or in part, within a dataset. They often arise from data entry errors, from the data collection process, or from other context-specific factors.
For instance, we can create a dataset that contains duplicate records. It has two entries for “Tim,” both aged 30, so these two records are duplicates of each other.
import pandas as pd
df = pd.DataFrame({
    'name': ['Jack', 'Jane', 'Tim', 'Mike', 'Tim'],
    'age': [31, 25, 30, 29, 30]})
# display the df
print(df)
   name  age
0  Jack   31
1  Jane   25
2   Tim   30
3  Mike   29
4   Tim   30
Detect Duplicate Values
The first step in handling duplicates is to detect them in the dataset.
The pandas library provides methods for identifying duplicate entries. The duplicated() method, for instance, returns a Boolean Series that marks a row as True if it repeats an earlier row. By default, the first occurrence is marked False and only the later copies are marked True.
Once duplicates have been identified, we can use the drop_duplicates() method to remove these redundant rows from the dataset.
# Identify duplicate records
df.duplicated()
0 False
1 False
2 False
3 False
4 True
dtype: bool # the result is boolean
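To inspect the duplicates rather than only flag them, the Boolean Series can be summed or used as a row mask. The following is a minimal sketch that rebuilds the example DataFrame. The `keep` and `subset` arguments are standard parameters of `duplicated()`; the variable names (`n_duplicates`, `all_copies`, `same_name`) are chosen here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Jack', 'Jane', 'Tim', 'Mike', 'Tim'],
    'age': [31, 25, 30, 29, 30],
})

# True counts as 1, so the sum is the number of rows flagged as duplicates
n_duplicates = int(df.duplicated().sum())

# keep=False flags every copy, including the first occurrence
all_copies = df[df.duplicated(keep=False)]

# subset restricts the comparison to the listed columns (partial duplicates)
same_name = df.duplicated(subset=['name'])

print(n_duplicates)
print(all_copies)
```

Checking `subset=['name']` is one way to catch records that match only in part, which the default full-row comparison would miss if another column differed.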
Remove Duplicate Values
# Remove duplicate records and store the result in a new dataset called df1
df1 = df.drop_duplicates()
# display the new dataset df1
print(df1)
   name  age
0  Jack   31
1  Jane   25
2   Tim   30
3  Mike   29
By default, the drop_duplicates() method keeps the first record and deletes the later duplicates. You can choose explicitly whether to keep the first or the last occurrence with the keep parameter.
df.drop_duplicates(keep='first')  # keep the first one
df.drop_duplicates(keep='last')   # keep the last one
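The effect of each setting is easiest to see from the row labels that survive. The sketch below compares them on the same example data. Besides 'first' and 'last', pandas also accepts `keep=False`, which drops every copy of a duplicated row; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Jack', 'Jane', 'Tim', 'Mike', 'Tim'],
    'age': [31, 25, 30, 29, 30],
})

first = df.drop_duplicates(keep='first')   # keeps row 2, drops row 4
last = df.drop_duplicates(keep='last')     # keeps row 4, drops row 2
neither = df.drop_duplicates(keep=False)   # drops both row 2 and row 4

print(first.index.tolist())    # [0, 1, 2, 3]
print(last.index.tolist())     # [0, 1, 3, 4]
print(neither.index.tolist())  # [0, 1, 3]
```

Note that the original row labels are preserved in each result, so the index may have gaps after rows are dropped.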