Missing Values – How to treat them in data.

Missing Values

A missing value is the absence of an observation in a variable. A table consists of rows and columns, and each column represents a distinct variable. Missing values mean that one or more observations in a column are absent or misrepresented.

For example, suppose your dataset has 5 columns and 100 rows. In some of the columns, you will see that some observations are absent.

Let us elaborate on what missing values are and how they are represented in a column:

They can be represented by a simple dot (“.”)
They can be represented by “NaN” or “Na”.
They can be represented by zero (“0”).

Sometimes they are encoded in some other way. A dataset needs to be studied thoroughly to figure out whether there are any missing values and, if there are, how they are represented and how we can treat them.
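Here is a minimal sketch of how such encodings might be normalised with pandas. The tiny CSV, the column names, and the decision to treat a BMI of zero as missing are all assumptions made up for illustration; in a real dataset you would confirm each encoding before converting it.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV in which missing values are encoded as ".", "NA" and 0.
raw = io.StringIO(
    "age,bmi,glucose\n"
    "25,22.5,90\n"
    ".,0,110\n"
    "40,NA,.\n"
)

# na_values lists extra strings to read as NaN ("NA" is already a pandas default).
df = pd.read_csv(raw, na_values=["."])

# Assumption: a BMI of 0 is impossible, so here it is treated as a placeholder.
df["bmi"] = df["bmi"].replace(0, np.nan)

# Count the missing values per column.
missing_counts = df.isna().sum()
print(missing_counts)
```

Once every encoding has been mapped to NaN, the rest of the pandas tooling (isna, fillna, dropna) sees all of the missing values in the same way.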

Missing values are a common problem, and it is important to treat them because otherwise they can have a large impact on your overall result. Also, many estimators in Python’s machine learning libraries (I work with Python), scikit-learn for example, do not accept missing values in their input, so it becomes a necessity to treat them.

They can occur due to technical errors, human errors (data entry, data collection), etc.

If they occur completely at random (that is, their occurrence has no relation to the observations of any other variable) and they are few, you can omit the observations that are missing.
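As a rough sketch of that first case, the snippet below checks what share of each column is missing and then drops the incomplete rows. The DataFrame is invented for illustration, and how many missing values count as "few" is a judgment call that the code does not make for you.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "bmi": [22.5, 27.1, np.nan, 24.0],
})

# Share of missing values per column (the mean of a boolean mask).
missing_share = df.isna().mean()

# Keep only the rows in which every column is observed.
complete_rows = df.dropna()

print(missing_share)
print(len(df), "rows before,", len(complete_rows), "rows after")
```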

If there is some relationship between the missing values and, say, a distinct category of another variable, you would want to identify it closely and treat it depending on the count.

If the missing values have a distinct relationship with the values of another variable, you need to identify the pattern and act accordingly, treating them in a way that is specific to that pattern.

Now let’s see some imputation techniques and also touch upon some of their merits and demerits.

Imputation Techniques

As we already know, there are mainly two types of variables in a dataset: numerical and categorical. There are subcategories too, but let us not get into that.

run-of-the-mill kind

The most common technique is mean imputation (if the variable is roughly normally distributed) or median imputation (the median is much less affected by outliers). This is only used if the variable is numerical.
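A small sketch of both options with pandas follows. The income column and its values are made up, with one deliberate outlier so that the difference between the mean and the median is visible.

```python
import numpy as np
import pandas as pd

# Hypothetical numerical column with one missing value and one outlier (500).
df = pd.DataFrame({"income": [30.0, 32.0, 35.0, np.nan, 500.0]})

# Mean imputation: the outlier pulls the fill value upwards.
mean_filled = df["income"].fillna(df["income"].mean())

# Median imputation: the fill value stays close to the typical observations.
median_filled = df["income"].fillna(df["income"].median())

print(mean_filled.iloc[3], median_filled.iloc[3])
```

With these numbers the mean fill is 149.25 while the median fill is 33.5, which is why the median is usually preferred for skewed variables.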

oh the smart move

Suppose “y” (the prediction variable) is categorical. For example, y has two classes, diabetic and non-diabetic, and values are missing in a variable called “BMI” (body mass index). If we see that almost all the values missing in the “BMI” column are associated with the diabetic class, then instead of taking the mean of the whole column, which includes both classes, we can take the mean of only the values associated with that class and impute with it. It would make much more sense.
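The idea can be sketched with a pandas groupby. The column names (diabetic, bmi) and the values are hypothetical; the point is only that each missing BMI is filled with the mean of its own class rather than the mean of the whole column.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 1 = diabetic, 0 = non-diabetic; one BMI is missing in class 1.
df = pd.DataFrame({
    "diabetic": [1, 1, 1, 0, 0, 0],
    "bmi": [32.0, 34.0, np.nan, 22.0, 24.0, 23.0],
})

# Mean BMI of each class, aligned row by row with the original frame.
class_mean = df.groupby("diabetic")["bmi"].transform("mean")

# Fill each missing BMI with the mean of the class that row belongs to.
df["bmi_imputed"] = df["bmi"].fillna(class_mean)

print(df)
```

Here the missing value becomes 33.0 (the diabetic-class mean) instead of 27.0 (the overall mean). One thing to keep in mind is that the class label is not available for new rows at prediction time, so this works most naturally on the training data.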

We call it “jugad” here

There is also a very interesting technique where we use machine learning algorithms to impute these values. Let’s say we have 4 variables and 1000 rows in our dataset, and one of the variables has some values missing, say 50-100. The rows where all 4 variables are observed can be treated as training data, with the variable that has the missing values as “Y”. Having trained our model for “Y”, we can treat the rows where it is missing as test data and pass them into the model. The values predicted by the model are the values we use to impute. Various algorithms can be used for this.
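Below is one possible sketch of this approach, using scikit-learn’s LinearRegression as the model. The synthetic data, the column names x1 to x4, and the choice of a linear model are all assumptions for illustration; any regressor (or a classifier, for a categorical variable) could take its place.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic dataset: 4 variables, 1000 rows (all names and values are made up).
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "x3": rng.normal(size=n),
})
df["x4"] = 2.0 * df["x1"] - df["x2"] + rng.normal(scale=0.1, size=n)

# Knock out 80 values of x4 to simulate missingness.
missing_idx = rng.choice(n, size=80, replace=False)
df.loc[missing_idx, "x4"] = np.nan

features = ["x1", "x2", "x3"]
is_missing = df["x4"].isna()

# Train on the complete rows, with x4 playing the role of "Y".
train = df[~is_missing]
model = LinearRegression().fit(train[features], train["x4"])

# Predict x4 for the rows where it is missing and use the predictions to impute.
df.loc[is_missing, "x4"] = model.predict(df.loc[is_missing, features])

print(df["x4"].isna().sum())
```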

And all the jazz

There are other methods like:

Arbitrary value imputation:

It consists of replacing all occurrences of missing values within a variable with an arbitrary value.
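In pandas this is a single fillna call. The value -999 below is only an assumed example; the usual idea is to pick something that cannot be confused with a real observation of the variable.

```python
import numpy as np
import pandas as pd

# Hypothetical age column with two missing values.
df = pd.DataFrame({"age": [25.0, np.nan, 40.0, np.nan]})

# Assumed arbitrary value, chosen to lie well outside the range of real ages.
ARBITRARY_VALUE = -999

df["age_imputed"] = df["age"].fillna(ARBITRARY_VALUE)

print(df)
```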

End of tail imputation:

It is similar to arbitrary value imputation, but the value is chosen automatically from the tail end of the variable’s distribution (there are variations of this method).
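One common variation, assumed here, takes the mean plus three standard deviations as the tail value. Other variations use a rule based on the interquartile range or a multiple of the maximum; the sketch shows only the first.

```python
import numpy as np
import pandas as pd

# Hypothetical age column with one missing value.
df = pd.DataFrame({"age": [20.0, 30.0, 40.0, np.nan]})

# Assumed rule: mean + 3 * standard deviation (pandas uses the sample std, ddof=1).
tail_value = df["age"].mean() + 3 * df["age"].std()

df["age_imputed"] = df["age"].fillna(tail_value)

print(tail_value)
```

For this toy column the mean is 30 and the sample standard deviation is 10, so the missing value is filled with 60.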

Random sample imputation:

It consists of taking random observations from the pool of available observations of the variable and using them to fill the missing values.
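A minimal sketch with pandas, on invented data: one value is drawn from the observed pool for each missing entry. The fixed random_state is only there to make the illustration reproducible.

```python
import numpy as np
import pandas as pd

# Hypothetical age column with two missing values.
df = pd.DataFrame({"age": [25.0, np.nan, 40.0, 31.0, np.nan]})

is_missing = df["age"].isna()
observed = df["age"].dropna()

# Draw one observed value (with replacement) for each missing entry.
sampled = observed.sample(n=int(is_missing.sum()), replace=True, random_state=0)

# to_numpy() drops the sampled index so values are assigned by position.
df["age_imputed"] = df["age"].copy()
df.loc[is_missing, "age_imputed"] = sampled.to_numpy()

print(df)
```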

Creating a missing indicator and creating new variables:

A missing indicator is usually seen with categorical variables, where missing categories are replaced by some sort of label that says the value at this instance is missing.

Creating new variables means creating dummy variables (extra variables) that indicate, with say “1”, that the value was missing at that observation and, with “0”, that the value was already there.
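Both ideas are sketched below on a made-up DataFrame. The label "Missing", the column name bmi_was_missing, and the choice to fill the numerical column with its median afterwards are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in a numerical and a categorical column.
df = pd.DataFrame({
    "bmi": [22.5, np.nan, 27.1, np.nan],
    "city": ["Pune", None, "Delhi", "Pune"],
})

# Missing indicator for a categorical variable: an explicit "Missing" label.
df["city"] = df["city"].fillna("Missing")

# New dummy variable: 1 where bmi was missing, 0 where it was observed.
# It must be created before the original column is filled.
df["bmi_was_missing"] = df["bmi"].isna().astype(int)

# The original column can then be imputed by any other method (median here).
df["bmi"] = df["bmi"].fillna(df["bmi"].median())

print(df)
```

Keeping the dummy variable means a model can still see where the missingness was even after the column itself has been filled.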

The most frequent observation:

This technique is used mainly to impute values in a categorical variable: we replace the missing observations with the most frequent observation.
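A short sketch with pandas, using an invented city column. Series.mode() can return several values when there is a tie, so taking the first one is an assumption made here for simplicity.

```python
import pandas as pd

# Hypothetical categorical column with two missing values.
df = pd.DataFrame({"city": ["Pune", "Delhi", "Pune", None, "Pune", None]})

# mode() ignores missing values; [0] picks the first mode if there is a tie.
most_frequent = df["city"].mode()[0]

df["city_imputed"] = df["city"].fillna(most_frequent)

print(df["city_imputed"].value_counts())
```

Notice that "Pune" goes from 3 of the 4 observed values to 5 of 6 after imputation, which shows how this method can inflate the most common class.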

Other techniques, such as the machine-learning-based technique and the random sample technique, can also be used to impute missing values in a categorical variable.

The Demon Strikes

All the techniques above come with their advantages and disadvantages. Most of them are easy to implement and fast, and some are able to capture missingness, but they can distort the variance of a variable and its covariance and correlation with other variables. At times (as in the case of the most frequent observation) they can also lead to over-representation of one particular class.

There are certain points to note: it is not always necessary to impute. It depends upon the percentage of data missing and the domain the data belongs to. The treatment differs with these two important factors, and of course it also depends upon the client. But it is standard industry practice to preserve as much information as we can.

This article is aimed at giving you a refresher on how to treat missing values. I wanted to include full worked examples and pictures, but that would have made this article too long.