Dealing with null values is an essential task in data cleansing and data preprocessing. No matter your career goal is to be a data analyst, business analyst or data engineer, or data scientist, it is a must-done challenge.
In Python, we know the null values can be represented in NaN and None.
What is the difference between NaN and None?
NaN is a numerical value, it is used as a placeholder for missing data. The data consistency is good, and you still can use it for vectorized operations. This is an example.
# create a list that contains NaN value
ls = pd.Series([1, 2,np.nan,4])
# print the list
print(ls)
0 1.0
1 2.0
2 NaN
3 4.0
dtype: float64
We can see the list contains a NaN value and the data type is numerical, specifically, it is float64.
The list has no problem doing the vectorized operation.
ls * 2 # simply multiply by 2
0 2.0
1 4.0
2 NaN
3 8.0
dtype: float64
How does it work with None?
Look at the two sentences below, we create two lists, one is normal, and the other declares the data type as “Object”. Without the declaration, pandas will force the None to NaN.
We can see, that list 1′ s data type is float64 and the None value has been converted to NaN.
The data type of list 2 is still an object.
By default, None belongs to a data type names NoneType.
ls1 = pd.Series([1, None])
ls2 = pd.Series([1, None], dtype=object)
print(ls1)
0 1.0
1 NaN
dtype: float64
print(ls2)
0 1
1 None
dtype: object
How to interpret the NULL value?
One day in the real world, you find a blank cell in a data frame or in a table, u tried to use the calculation function or filtering function to do some analysis, but the computer looks crazy and returns ridiculous feedback. This would be a very frustrating moment, but unfortunately, it is not unusual.
There could be several situations causing this result.
- It is a data missing and it is NaN
- It is a data missing and it is None
- There is no data missing but the data is an empty string.
- There is no data missing but a string containing only a space.
Let me use an example to illustrate this case. In List 4, element No. 4 is an empty string, and element No.5 is a string containing only a space.
Although it looks like a missing data case, it is not. Especially when u tried to use any functions related to Null, it will not work.
ls4 = pd.Series([1,2,3,'',' '])
print(ls4)
0 1
1 2
2 3
3
4
dtype: object
Be careful, the Null data can be very tricky. 🙂
Don’t let it destroy our good mood for a whole day.