
How to check for missing values with Python?

Keep in mind that there’s a big difference between the messy data we find in the real world and the clean, processed data we use. Real-world datasets often have missing info, mistakes, repeated entries, and conflicting data. These issues can crop up due to various reasons like how data is handled, gathered, and managed.

In this article, we’ll dive into a practical aspect: how to spot missing data in your dataset.

Let’s start with creating a dataframe that contains missing values.

import numpy as np    # provides np.nan for marking missing values
import pandas as pd   # provides the DataFrame

# create the data with dictionary
data = {
    'A': [1, 2, np.nan, 4, 5, np.nan, 7, np.nan, 9, 10],
    'B': [11, np.nan, 13, 14, 15, 16, np.nan, 18, 19, 20],
    'C': [21, 22, 23, np.nan, 25, 26, 27, 28, 29, 30],
    'D': [31, 32, 33, 34, np.nan, 36, 37, 38, 39, 40],
    'E': [np.nan, 42, 43, 44, 45, 46, 47, 48, 49, 50]
}

df = pd.DataFrame(data)   # create the dataframe
print(df)                 # display the dataframe

      A     B     C     D     E
0   1.0  11.0  21.0  31.0   NaN
1   2.0   NaN  22.0  32.0  42.0
2   NaN  13.0  23.0  33.0  43.0
3   4.0  14.0   NaN  34.0  44.0
4   5.0  15.0  25.0   NaN  45.0
5   NaN  16.0  26.0  36.0  46.0
6   7.0   NaN  27.0  37.0  47.0
7   NaN  18.0  28.0  38.0  48.0
8   9.0  19.0  29.0  39.0  49.0
9  10.0  20.0  30.0  40.0  50.0

When the dataset is small, we can easily spot any missing values just by looking at it. But what if the dataset is big, like 100MB, 1GB, or even 100GB? In those cases, we can’t rely on our eyes alone. We need smarter methods and tools to handle the missing data efficiently.

To check for missing values in a Pandas DataFrame, we use the functions isnull(), notnull(), isna(), and notna(). These functions tell us, for every entry, whether a value is present or missing. Because their output is boolean, we can combine them with aggregation functions, which makes the results far easier to understand, even for people less familiar with the technical details.
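Note that isna() is simply an alias of isnull(), and notna() an alias of notnull(), so each pair is interchangeable; notnull()/notna() return the boolean inverse. A quick sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# isnull() and isna() flag the missing entry identically
print(s.isnull().tolist())   # [False, True, False]
print(s.isna().tolist())     # [False, True, False]

# notnull()/notna() are the boolean inverse
print(s.notnull().tolist())  # [True, False, True]
print(s.notna().tolist())    # [True, False, True]
```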

1. You can check the missing-value status of the full dataset.
# check the full dataset missing value status
df.isnull()
       A      B      C      D      E
0  False  False  False  False   True
1  False   True  False  False  False
2   True  False  False  False  False
3  False  False   True  False  False
4  False  False  False   True  False
5   True  False  False  False  False
6  False   True  False  False  False
7   True  False  False  False  False
8  False  False  False  False  False
9  False  False  False  False  False
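When even this table of True/False values is too big to scan, isnull().any() reduces it to one flag per column, and .values.any() to a single yes/no answer. A minimal sketch on a small frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan],
    'B': [4, 5, 6],
})

# one flag per column: does the column contain any missing value?
print(df.isnull().any())         # A True, B False

# one flag for the whole frame
print(df.isnull().values.any())  # True
```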

2. You can check each column’s missing-value status.

# check the missing value of column A
df['A'].isnull()  
0    False
1    False
2     True
3    False
4    False
5     True
6    False
7     True
8    False
9    False  
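The boolean output also works as a mask: passing it back to the dataframe selects exactly the rows where the value is missing. A short sketch reusing columns A and B from the data above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5, np.nan, 7, np.nan, 9, 10],
    'B': [11, np.nan, 13, 14, 15, 16, np.nan, 18, 19, 20],
})

# the mask keeps only the rows where column A is NaN
rows_missing_a = df[df['A'].isnull()]
print(rows_missing_a.index.tolist())   # [2, 5, 7]
```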

3. Count the missing values in a column and in the whole dataset.
If you find True and False a bit confusing, don’t worry. Just remember that we can apply aggregate functions to the boolean output, which gives us numbers that are much easier to read.

# calculate the missing value of column B
df['B'].isnull().sum()   
2  # as shown in the dataset above, column B contains 2 missing values

# calculate the missing value percentage of column B
print(df['B'].isnull().sum() / len(df) * 100, '%')
20.0 %    # the percentage is 20%
# check and calculate the missing values in each column
df.isna().sum()
A    3            #column A has 3 missing values
B    2            # column B has 2
C    1            # and so on
D    1
E    1
dtype: int64
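If we only need one overall number, chaining a second sum() collapses the per-column counts into a grand total. A quick sketch with the same data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5, np.nan, 7, np.nan, 9, 10],
    'B': [11, np.nan, 13, 14, 15, 16, np.nan, 18, 19, 20],
    'C': [21, 22, 23, np.nan, 25, 26, 27, 28, 29, 30],
    'D': [31, 32, 33, 34, np.nan, 36, 37, 38, 39, 40],
    'E': [np.nan, 42, 43, 44, 45, 46, 47, 48, 49, 50],
})

# first sum() counts per column, second sum() adds those counts up
total_missing = df.isna().sum().sum()
print(total_missing)   # 8
```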

# combine the sum and percentage and put into a new dataframe
missing_count = df.isna().sum()
missing_pct = df.isna().sum()/len(df)

missing_sum = pd.DataFrame(data = {
    'missing_count': missing_count,
    'missing_percentage': missing_pct
})

# show the new dataframe
print(missing_sum)
   missing_count  missing_percentage
A              3                 0.3
B              2                 0.2
C              1                 0.1
D              1                 0.1
E              1                 0.1
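Once the summary is in its own dataframe, one small extra step that is often handy (an addition beyond the article's steps, not part of the original walkthrough) is sorting it so the columns with the most gaps appear first. A minimal sketch reusing the same data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5, np.nan, 7, np.nan, 9, 10],
    'B': [11, np.nan, 13, 14, 15, 16, np.nan, 18, 19, 20],
    'C': [21, 22, 23, np.nan, 25, 26, 27, 28, 29, 30],
    'D': [31, 32, 33, 34, np.nan, 36, 37, 38, 39, 40],
    'E': [np.nan, 42, 43, 44, 45, 46, 47, 48, 49, 50],
})

missing_sum = pd.DataFrame({
    'missing_count': df.isna().sum(),
    'missing_percentage': df.isna().sum() / len(df),
})

# sort so the columns with the most missing values come first
ranked = missing_sum.sort_values('missing_count', ascending=False)
print(ranked)
```

With larger datasets this ranking makes it obvious which columns need cleaning attention first.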