The Ultimate Guide to Data Cleansing

What is data cleansing?

Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.

When combining multiple data sources, there are many opportunities for data to be duplicated or mislabeled. If data is incorrect, outcomes and algorithms are unreliable, even though they may look correct. There is no single, absolute set of steps for data cleaning, because the process varies from dataset to dataset. But it is crucial to establish a template for your data cleaning process so that you know you are doing it the right way every time.

Benefits of Data Cleansing

Rule number one: garbage in, garbage out.

Better data beats fancier algorithms.

  • Removal of errors when multiple sources of data are at play.
  • Fewer errors make for happier clients and less-frustrated employees.
  • Ability to map the different functions of your data and clarify what it is intended to do.
  • Monitoring errors and better reporting to see where errors are coming from, making it easier to fix incorrect or corrupt data for future applications.
  • Using tools for data cleaning will make for more efficient business practices and quicker decision-making.

4 Steps of Data Cleansing

Step 1: Remove Unwanted Observations

The first step to data cleaning is removing unwanted observations from your dataset. This includes duplicate or irrelevant observations.

Duplicate observations most frequently arise during data collection, such as when you:

  • Combine datasets from multiple places
  • Scrape data
  • Receive data from clients/other departments

Irrelevant observations are those that don’t actually fit the specific problem that you’re trying to solve.

  • For example, if you were building a model for Single-Family homes only, you wouldn’t want observations for Apartments in there.
  • This is also a great time to review your charts from Exploratory Analysis. You can look at the distribution charts for categorical features to see if there are any classes that shouldn’t be there.
  • Checking for irrelevant observations before engineering features can save you many headaches down the road; a minimal pandas sketch follows this list.
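
As a rough illustration (not the only way to do it), the pandas snippet below removes exact duplicates and filters out an irrelevant class; the DataFrame, the property_type column, and all values are hypothetical:

```python
import pandas as pd

# Hypothetical dataset: a 'property_type' column plus one numeric feature.
df = pd.DataFrame({
    "property_type": ["Single-Family", "Apartment", "Single-Family", "Single-Family"],
    "price": [250000, 120000, 250000, 310000],
})

# Drop exact duplicate rows, e.g. ones introduced by combining sources
# or by scraping the same records twice.
df = df.drop_duplicates()

# Drop irrelevant observations: a Single-Family model has no use for Apartments.
df = df[df["property_type"] == "Single-Family"]

# Reviewing class counts, as you would during exploratory analysis,
# helps you spot classes that shouldn't be there.
print(df["property_type"].value_counts())
```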

Step 2: Fix Structural Errors

Structural errors arise when you measure or transfer data: strange naming conventions, typos, or inconsistent capitalization. These inconsistencies can cause mislabeled categories or classes. For example, you may find that “N/A” and “Not Applicable” both appear, but they should be analyzed as the same category.
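
A minimal pandas sketch of normalizing such labels (the status values here are hypothetical) might look like this:

```python
import pandas as pd

# Hypothetical column with inconsistent labels for the same category.
status = pd.Series(["N/A", "Not Applicable", "n/a", "Composition", "composition"])

# Normalize whitespace and capitalization first...
status = status.str.strip().str.lower()

# ...then collapse known synonyms into a single label.
status = status.replace({"n/a": "not applicable"})

print(status.value_counts())
# not applicable    3
# composition       2
```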

Step 3: Filter Unwanted Outliers

Outliers can cause problems with certain types of models. For example, linear regression models are less robust to outliers than decision tree models.

In general, if you have a legitimate reason to remove an outlier, it will help your model’s performance.

However, outliers are innocent until proven guilty. You should never remove an outlier just because it’s a “big number”; that big number could be very informative for your model. Remove an outlier only when you have a good reason, such as suspicious measurements that are unlikely to be real data.
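
One common heuristic for finding candidates, sketched below with hypothetical values, is the 1.5 × IQR rule; note that it only flags values for inspection, it does not prove them guilty:

```python
import pandas as pd

# Hypothetical numeric feature with one suspicious value.
prices = pd.Series([210, 230, 240, 250, 260, 9_999])

# Flag values outside 1.5 * IQR of the quartile range.
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
is_inlier = prices.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Inspect the flagged values before deciding anything.
print(prices[~is_inlier])

# Drop them only once you have a legitimate reason, e.g. a bad measurement.
prices = prices[is_inlier]
```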

Step 4: Handle Missing Data

Missing data is a deceptively tricky issue in applied machine learning.

First, don’t simply ignore missing values in your dataset. You must handle them in some way for the very practical reason that most algorithms do not accept missing values.

Dropping missing values is sub-optimal because when you drop observations, you drop information. The fact that the value was missing may be informative in itself.

Imputing missing values is sub-optimal because the value was originally missing and you filled it in, which always leads to a loss of information, no matter how sophisticated your imputation method is.

Tricky Missing Data

Missing data is like missing a puzzle piece. If you drop it, that’s like pretending the puzzle slot isn’t there. If you impute it, that’s like trying to squeeze in a piece from somewhere else in the puzzle.

Missing categorical data

The best way to handle missing data for categorical features is to simply label it as “Missing”!

  • You’re essentially adding a new class for the feature.
  • This tells the algorithm that the value was missing.
  • This also gets around the technical requirement for no missing values; a short pandas sketch follows this list.
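
A minimal sketch, using a hypothetical roof_type feature:

```python
import pandas as pd

# Hypothetical categorical feature with missing values.
roof_type = pd.Series(["Shingle", None, "Metal", None, "Shingle"])

# Label missingness as its own class instead of dropping or guessing.
roof_type = roof_type.fillna("Missing")

print(roof_type.value_counts())
# Shingle    2
# Missing    2
# Metal      1
```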

Missing numeric data

For missing numeric data, you should flag and fill the values.

  • Flag the observation with an indicator variable of missingness.
  • Then, fill the original missing value with 0 just to meet the technical requirement of no missing values.

By using this technique of flagging and filling, you are essentially allowing the algorithm to estimate the optimal constant for missingness, instead of just filling it in with the mean.
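
A minimal sketch of flag-and-fill, using a hypothetical lot_size column:

```python
import pandas as pd

# Hypothetical numeric feature with missing values.
df = pd.DataFrame({"lot_size": [5000.0, None, 7200.0, None]})

# Flag missingness with an indicator variable...
df["lot_size_missing"] = df["lot_size"].isna().astype(int)

# ...then fill the original column with 0, purely to satisfy algorithms
# that reject missing values. With the indicator present, the model can
# learn its own constant for the flagged rows rather than inheriting a
# mean-imputed guess.
df["lot_size"] = df["lot_size"].fillna(0)

print(df)
```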
