Feature selection is an important step when working on Machine Learning/Predictive Modelling problems. In practical model-building we usually get a dataset with too many variables, not all of which are relevant to the problem, and we don’t know in advance which ones are.
Also, there are some disadvantages of using all given variables while building models:
- The model will not achieve the optimal/robust accuracy that it would with only the important features
- Using too many variables slows down the algorithms, which is very inconvenient for an analyst
So it’s better to do feature selection beforehand to get the most accurate model.
While working on a problem, I came across the “Boruta” algorithm for feature selection. It is a wrapper approach built around Random Forest, in which the classifier is used as a black box that returns a feature ranking. It is named “Boruta” after a god of the forest in Slavic mythology, which fits, since it uses the Random Forest classification technique.
How does it work?
From the result of a Random Forest run, we can compute the average and standard deviation of the accuracy loss caused by permuting each variable. The Z score, computed by dividing the average loss by its standard deviation, can then be used as the importance measure for all the variables. However, the Z score alone cannot tell us whether a variable’s importance is statistically significant, since its distribution is not Normal. So we need some external reference to decide whether the importance of any given attribute is significant. To provide one, the algorithm adds random attributes to the dataset: shuffled copies (at least 5) of the existing attributes, called ‘shadow’ attributes.
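As a quick numeric illustration of this importance measure, here is a small Python sketch. The accuracy losses below are made-up numbers for illustration, not output from any real forest:

```python
import numpy as np

# Hypothetical per-tree accuracy losses for one feature after
# permuting its values (illustrative numbers only)
losses = np.array([0.020, 0.030, 0.025, 0.018, 0.027])

# Z score = average accuracy loss divided by its standard deviation
z_score = losses.mean() / losses.std(ddof=1)
print(round(z_score, 3))
```

A large Z score suggests the feature matters, but as noted above, the Z score’s distribution is not Normal, so it cannot be turned into a p-value directly; that is why Boruta compares it against shadow attributes instead.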
The step-by-step procedure is:
- Introduce shadow attributes: copies of the existing attributes.
- Shuffle the shadow attributes to remove their correlation with the response and avoid bias.
- Run Random Forest on the extended system and note the Z scores.
- At every iteration, compute the Maximum Z Score among the Shadow Attributes (MZSA) and check whether each real feature has higher or lower importance than the MZSA.
- Mark the attributes which have importance significantly lower than MZSA as ‘unimportant’ and permanently remove them from the information system.
- Mark the attributes which have importance significantly higher than MZSA as ‘important’.
- Repeat the procedure until all the features are marked, or the algorithm has reached the previously set limit of the random forest runs.
- At the end, remove all shadow attributes.
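The steps above can be sketched in Python. This is a single, simplified iteration only: the real Boruta runs many iterations and applies a statistical test, which is also what produces the ‘tentative’ category. The correlation-based importance function here is a stand-in for Random Forest importance:

```python
import numpy as np

rng = np.random.default_rng(0)

def boruta_iteration(X, y, importance):
    """One simplified Boruta-style iteration."""
    # Step 1: extend the dataset with shadow attributes --
    # shuffled copies of every real column. Shuffling destroys
    # each copy's association with the response.
    shadows = np.apply_along_axis(rng.permutation, 0, X)
    extended = np.hstack([X, shadows])

    # Step 2: score every attribute, real and shadow alike
    imps = importance(extended, y)
    real, shadow = imps[: X.shape[1]], imps[X.shape[1]:]

    # Step 3: compare each real attribute against the maximum
    # importance among the shadow attributes (MZSA)
    mzsa = shadow.max()
    return ["important" if r > mzsa else "unimportant" for r in real]

# Stand-in importance measure: absolute correlation with the response
def abs_corr(E, y):
    return np.array([abs(np.corrcoef(E[:, j], y)[0, 1])
                     for j in range(E.shape[1])])

y = rng.normal(size=200)
X = np.column_stack([y + 0.1 * rng.normal(size=200),  # informative column
                     rng.normal(size=200)])           # pure noise column
decisions = boruta_iteration(X, y, abs_corr)
print(decisions)
```

The informative column scores far above any shuffled copy and is flagged important, which is exactly the comparison against MZSA described above.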
Boruta Package in R
We can install the package with: install.packages("Boruta")
Full documentation of the package is available on CRAN.
I am using a Loan Credit dataset to show the usage of the Boruta package:
> library(Boruta)
> boruta.train <- Boruta(RESPONSE ~ . - OBS., data = trainStrata, doTrace = 2)
Boruta performed 99 iterations in 55.90475 secs.
12 attributes confirmed important: AMOUNT, CHK_ACCT, DURATION, EMPLOYMENT, GUARANTOR and 7 more.
12 attributes confirmed unimportant: CO.APPLICANT, EDUCATION, FURNITURE, MALE_DIV, MALE_MAR_or_WID and 7 more.
6 tentative attributes left: AGE, FOREIGN, JOB, OWN_RES, PRESENT_RESIDENT and 1 more.
Boruta has confirmed that 12 attributes are important and 12 are unimportant. The tentative attributes have importance so close to their best shadow attributes that Boruta cannot make a confident decision. We can include these attributes in our model and later remove the ones we find unimportant based on p-values.
The tentative attributes can be resolved quickly with TentativeRoughFix():
> final.boruta <- TentativeRoughFix(boruta.train)
> print(final.boruta)
Boruta performed 99 iterations in 55.90475 secs.
Tentatives roughfixed over the last 99 iterations.
17 attributes confirmed important: AGE, AMOUNT, CHK_ACCT, DURATION, EMPLOYMENT and 12 more.
13 attributes confirmed unimportant: CO.APPLICANT, EDUCATION, FURNITURE, MALE_DIV, MALE_MAR_or_WID and 8 more.
TentativeRoughFix performs a simplified, weaker test for judging such attributes; that is why it leaves no Tentative attributes behind.
> plot(boruta.train, xlab = "", xaxt = "n")
> lz <- lapply(1:ncol(boruta.train$ImpHistory), function(i)
+   boruta.train$ImpHistory[is.finite(boruta.train$ImpHistory[, i]), i])
> names(lz) <- colnames(boruta.train$ImpHistory)
> Labels <- sort(sapply(lz, median))
> axis(side = 1, las = 2, labels = names(Labels),
+   at = 1:ncol(boruta.train$ImpHistory), cex.axis = 0.7)
In the resulting plot, features marked red are unimportant, yellow ones are tentative, and green ones are important.
> getSelectedAttributes(boruta.train, withTentative = FALSE)
 “CHK_ACCT” “DURATION” “HISTORY” “NEW_CAR” “AMOUNT”
 “SAV_ACCT” “EMPLOYMENT” “INSTALL_RATE” “GUARANTOR” “REAL_ESTATE”
 “PROP_UNKN_NONE” “OTHER_INSTALL”
> getSelectedAttributes(boruta.train, withTentative = TRUE)
 “CHK_ACCT” “DURATION” “HISTORY” “NEW_CAR” “USED_CAR”
 “AMOUNT” “SAV_ACCT” “EMPLOYMENT” “INSTALL_RATE” “GUARANTOR”
 “PRESENT_RESIDENT” “REAL_ESTATE” “PROP_UNKN_NONE” “AGE” “OTHER_INSTALL”
 “OWN_RES” “JOB” “FOREIGN”
The detailed importance statistics behind these decisions can be obtained with attStats():
> attStats(boruta.train)
meanImp medianImp minImp maxImp normHits decision
CHK_ACCT 27.07403154 27.3874305 22.48988172 29.8983190 1.00000000 Confirmed
DURATION 11.26659814 11.2746993 8.53024265 13.8364649 1.00000000 Confirmed
HISTORY 8.01582402 7.9723847 5.34204849 10.3233216 1.00000000 Confirmed
NEW_CAR 3.20879187 3.2836036 0.25101124 5.9992022 0.68686869 Confirmed
USED_CAR 2.44850872 2.4970894 0.13552355 5.0760985 0.46464646 Tentative
FURNITURE 0.66763142 0.6285931 -0.98286825 2.2102518 0.00000000 Rejected