Missing Value imputation – advanced way

Imputing missing or junk values with the mean, median, or mode is the most basic part of data cleaning, and it only improves accuracy up to a point. These simple statistics also assume the data is in a conventional, well-behaved format, which in most practical scenarios it is not. If the data is categorical, for example a dataset with a missing value in a “Type” column, replacing it with the mode may not even be appropriate.
In this scenario, we need a predictive model that predicts the variable with missing values from the other variables, and we replace the missing entries with its predictions. We can build our own model using any classification technique (for ordinal or categorical data) or a regression model (for continuous data).
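To make the "build our own model" idea concrete, here is a minimal sketch in base R for a continuous column. The data (the built-in iris dataset with values removed artificially) and the choice of a plain linear regression are assumptions made only for illustration; for a categorical column the same pattern would apply with a classifier in place of lm().

```r
# Illustrative data: knock out 15 values of a continuous column
set.seed(42)
df <- iris
df$Sepal.Length[sample(nrow(df), 15)] <- NA

# Rows where the target variable is missing
missing_rows <- is.na(df$Sepal.Length)

# Fit a regression on the rows where the target is observed
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species,
          data = df[!missing_rows, ])

# Replace the missing entries with the model's predictions
df$Sepal.Length[missing_rows] <- predict(fit, newdata = df[missing_rows, ])

anyNA(df$Sepal.Length)  # check whether any gaps remain
```

This only works as written when the predictor columns themselves are complete, which is one reason the packages below are more convenient.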

R gives us some relief from this tedious task in the form of the “mice” and “missForest” packages.

missForest

How does it work?

“Random Forest” is one of the most popular names in the classification world: it can handle mixed numeric and categorical data and gives reasonably high accuracy in most cases. Internally, “missForest” uses the Random Forest technique. We pass it the full data frame, including the “NA” values, and for each column that contains NAs it fits a random forest that predicts that column from the other variables (the number of variables tried at each split is controlled by the mtry parameter), repeating this over several iterations.
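The idea can be sketched in a few lines using the randomForest package directly. This is a simplified illustration, not the package's actual code: the helper name simple_rf_impute and its arguments are hypothetical, it runs a fixed number of passes instead of using a stopping criterion, and it assumes every column is numeric or a factor.

```r
library(randomForest)

# Hypothetical helper: a simplified missForest-style imputation
simple_rf_impute <- function(xmis, n_iter = 3, ntree = 100) {
  na_mask <- is.na(xmis)  # remember where the gaps were
  ximp <- xmis

  # Initial guess: mean for numeric columns, most frequent level for factors
  for (j in seq_along(ximp)) {
    miss <- na_mask[, j]
    if (!any(miss)) next
    if (is.numeric(ximp[[j]])) {
      ximp[[j]][miss] <- mean(ximp[[j]], na.rm = TRUE)
    } else {
      ximp[[j]][miss] <- names(which.max(table(ximp[[j]])))
    }
  }

  # Repeatedly re-predict each incomplete column from all the other columns
  for (iter in seq_len(n_iter)) {
    for (j in which(colSums(na_mask) > 0)) {
      miss <- na_mask[, j]
      fit <- randomForest(x = ximp[!miss, -j, drop = FALSE],
                          y = ximp[[j]][!miss], ntree = ntree)
      ximp[[j]][miss] <- predict(fit, ximp[miss, -j, drop = FALSE])
    }
  }
  ximp
}

# Illustrative usage: iris with some values removed artificially
set.seed(1)
iris_mis <- iris
iris_mis$Sepal.Length[sample(150, 15)] <- NA
iris_mis$Species[sample(150, 15)] <- NA
iris_imp <- simple_rf_impute(iris_mis)
```

A numeric column gets a regression forest and a factor column gets a classification forest, which is why the same loop can handle both kinds of variable.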

How to use “missForest”?
install.packages("missForest")  # install the missForest package (only needed once)
library(missForest)
# Data_with_NA: a data frame that contains NA values
MF_out <- missForest(Data_with_NA, ntree = 500, verbose = TRUE)  # grow 500 trees per forest
Data_wo_NA <- MF_out$ximp  # get the imputed (cleaned) data frame

Here, “verbose” is a logical flag. If ‘TRUE’, the user is given additional output between iterations: the estimated imputation error, the runtime and, if the complete data matrix is supplied (see ‘xtrue’), the true imputation error.

To tune the model we can pass the “mtry” parameter, which is the number of variables randomly sampled at each split. This argument is passed directly to the ‘randomForest’ function. Note that the default value is sqrt(p) for both categorical and continuous variables, where p is the number of variables in ‘xmis’ (the data frame being imputed).

There are several other parameters we can pass based on our need.
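As a small sketch of how these arguments fit together, the example below removes values from the built-in iris dataset with the package's prodNA() helper, then imputes them with an explicit mtry and with xtrue supplied. The dataset, the 10% missingness and the particular values of ntree and mtry are arbitrary choices for illustration, not recommendations.

```r
library(missForest)

set.seed(81)
iris_mis <- prodNA(iris, noNA = 0.1)  # remove 10% of the values at random

# Supplying the complete data as xtrue lets missForest report the true error
imp <- missForest(iris_mis, xtrue = iris, ntree = 100, mtry = 2, verbose = FALSE)

imp$OOBerror    # out-of-bag estimate of the imputation error
imp$error       # true imputation error, available because xtrue was given
head(imp$ximp)  # the imputed data frame
```

In real use there is no xtrue, so the out-of-bag estimate in OOBerror is the number to compare when trying different mtry values.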

To know more about the other parameters in “missForest” please refer:

https://cran.r-project.org/web/packages/missForest/missForest.pdf

MICE: MICE stands for Multivariate Imputation by Chained Equations.

How does it work?

As the name says, it fits a chain of equations (models), one per incomplete variable, and uses them to replace the NA values. Instead of only random forest, with MICE we can apply logistic regression, polytomous regression and many other algorithms, chosen according to the data type of each target column (the columns with NA values). So we can apply a different algorithm to each type of column.

Each variable has its own imputation model. Built-in imputation models are provided for continuous data (predictive mean matching, normal), binary data (logistic regression), unordered categorical data (polytomous logistic regression) and ordered categorical data (proportional odds). MICE can also impute continuous two-level data (normal model, pan, second-level variables). Passive imputation can be used to maintain consistency between variables. Various diagnostic plots are available to inspect the quality of the imputations.
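The per-column choice of model can be sketched as follows, using the nhanes2 example dataset that ships with the package. The dry run with maxit = 0 is a common way to obtain the default method for each column; switching the bmi column to "norm" is an arbitrary change made only to show how one column's method can be overridden.

```r
library(mice)

# Dry run (no iterations) to see which method mice picks for each column
ini <- mice(nhanes2, maxit = 0, printFlag = FALSE)
meth <- ini$method
print(meth)  # one method per column; "" means the column needs no imputation

# Override the method for a single column, keep the defaults for the rest
meth["bmi"] <- "norm"

imp <- mice(nhanes2, method = meth, m = 5, seed = 123, printFlag = FALSE)
completed <- complete(imp, 1)  # the first of the m = 5 imputed datasets
```

Passing a single string such as method = "rf" applies that one method to every incomplete column, while a named vector like meth gives each column its own model.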

The “mice” package also contains several other functions related to data imputation.

How to use “mice”?
install.packages("mice")  # install the mice package (only needed once)
library("mice")
abc <- mice(NAdata, method = "rf")  # impute using random forest; other methods such as "polr" can be used instead
nonNA <- complete(abc)  # final dataset without NA (the first of the imputed datasets)

To know more about “mice” package in R please refer:

https://cran.r-project.org/web/packages/mice/mice.pdf

Note: For both “mice” and “missForest”, if the missing or junk data points are not stored as NA but in some other form, such as “NULL” or any other junk value, we need to convert them to “NA” before running the functions.
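A minimal base R sketch of that conversion is shown below. The toy data frame and the list of junk markers are made up for illustration; the markers that actually occur depend on the dataset.

```r
# Toy data where missing values are encoded as junk strings
raw <- data.frame(
  Type  = c("A", "NULL", "B", "?", ""),
  Score = c("10", "N/A", "7.5", "NULL", "3"),
  stringsAsFactors = FALSE
)

# Assumed list of markers that should count as missing
junk <- c("NULL", "N/A", "?", "")

# Replace every junk marker with NA, column by column
raw[] <- lapply(raw, function(col) {
  col[col %in% junk] <- NA
  col
})

# Restore sensible column types before imputing
raw$Type  <- factor(raw$Type)
raw$Score <- as.numeric(raw$Score)

colSums(is.na(raw))  # number of NAs per column
```

When the data comes from a file, the same effect can often be had at load time by passing the markers to the na.strings argument of read.csv().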
