Training Data: The data is here: train_u6lujuX_CVtuZ9i
Step 1 – Exploratory Data Analysis:
a. All records with blank fields are weeded out.
b. We then determine which features are categorical and which are continuous. The following features are categorical:
Gender
Married
Dependents
Education
Self_Employed
Loan_Amount_Term
Credit_History
Property_Area
The following features are continuous:
ApplicantIncome
CoapplicantIncome
LoanAmount
The Loan_ID field is the identifier, whereas Loan_Status is the label field.
c. We next identify outlier values in the above three continuous features. Using scatter and box charts, we set the following outlier thresholds:
Loan Amount: anything above 500 is considered an outlier
Coapplicant Income: anything above 11300 is considered an outlier
Applicant Income: anything above 23803 is considered an outlier
Both scatter and box charts were used, but I found the outliers easier to identify with scatter charts.
d. Next we carve out a dataset consisting only of the outlier records for the above three features, and keep it aside for testing later.
e. Finally we create a dataset free of records with null-valued features and of outliers, and split it into two subsets: the training data and the validation data.
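The preprocessing steps above can be sketched in pandas. The frame below is a tiny invented stand-in for the loan data (column names follow the dataset; the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical rows mirroring the loan dataset's continuous columns.
df = pd.DataFrame({
    "Loan_ID": [f"LP{i:03d}" for i in range(10)],
    "ApplicantIncome": [3000, 5000, 30000, 4000, 2500, 6000, 24000, 3500, 4200, 5100],
    "CoapplicantIncome": [0, 1500, 0, 12000, 1000, 0, 500, 2000, 0, 800],
    "LoanAmount": [120, 150, 600, 110, np.nan, 130, 140, 90, 100, 160],
})

# a. Weed out records with blank fields.
clean = df.dropna()

# c/d. Flag outliers using the thresholds read off the scatter charts.
outlier_mask = (
    (clean["LoanAmount"] > 500)
    | (clean["CoapplicantIncome"] > 11300)
    | (clean["ApplicantIncome"] > 23803)
)
outliers = clean[outlier_mask]   # kept aside for testing later
inliers = clean[~outlier_mask]

# e. Split the remaining records into training and validation subsets.
train = inliers.sample(frac=0.8, random_state=42)
valid = inliers.drop(train.index)
```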
Step 2 – Model creation and training:
The problem at hand is to determine whether a loan will be approved or not, so it is a classification problem. We therefore try multiple classification algorithms to find the best one. Using Microsoft Azure Machine Learning Studio, I explored the following five algorithms:
a. Support Vector Machine
b. Logistic Regression
c. Decision Jungle
d. Averaged Perceptron
e. Bayes Point Machine
After creating a model for each of the above, I evaluated the metrics for each. Here is how it looks:
| Model | AUC | Accuracy | Precision |
| --- | --- | --- | --- |
| Support Vector Machine | 0.821 | 0.822 | 0.798 |
| Logistic Regression | 0.831 | 0.822 | 0.786 |
| Decision Jungle | 0.812 | 0.822 | 0.792 |
| Averaged Perceptron | 0.799 | 0.829 | 0.797 |
| Bayes Point Machine | 0.804 | 0.818 | 0.788 |
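The same kind of comparison can be sketched outside Azure with scikit-learn. Note this is only an analogue: Azure's Averaged Perceptron, Decision Jungle and Bayes Point Machine have no exact scikit-learn equivalents, so `Perceptron` stands in as a rough substitute, and the data here is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.metrics import accuracy_score, precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the cleaned loan data.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "Support Vector Machine": SVC(probability=True, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Perceptron (stand-in)": Perceptron(random_state=0),
}

for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    # Use class probabilities for AUC where available, else the margin.
    score = (model.predict_proba(X_te)[:, 1]
             if hasattr(model, "predict_proba")
             else model.decision_function(X_te))
    print(f"{name}: AUC={roc_auc_score(y_te, score):.3f} "
          f"Acc={accuracy_score(y_te, pred):.3f} "
          f"Prec={precision_score(y_te, pred):.3f}")
```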
I next tested these models on the validation dataset and the outlier dataset; the number of errors is indicated below:
| Model | Errors in Validation Data | Errors in Outlier Data |
| --- | --- | --- |
| Support Vector Machine | 5 | 4 |
| Logistic Regression | 5 | 5 |
| Decision Jungle | 5 | 5 |
| Averaged Perceptron | 4 | 5 |
| Bayes Point Machine | 4 | 7 |
Based on the above two sets of metrics, the Averaged Perceptron algorithm came out best, followed by the Support Vector Machine and then Logistic Regression.
While doing the checks on the validation and outlier datasets, I realized that the training data was heavily biased toward successful loan approvals. I then brought in SMOTE to see if I could get a better model and better results (check my earlier blog post on SMOTE). However, while the AUC metrics improved quite a bit, the Accuracy and Precision metrics deteriorated, and the actual validations on the validation and outlier datasets painted a dreary picture:
| Model | AUC | Accuracy | Precision |
| --- | --- | --- | --- |
| Support Vector Machine | 0.855 | 0.765 | 0.652 |
| Logistic Regression | 0.863 | 0.812 | 0.722 |
| Averaged Perceptron | 0.844 | 0.796 | 0.723 |
| Model | Errors in Validation + Outlier Data |
| --- | --- |
| Support Vector Machine | 18 |
| Logistic Regression | 11 |
| Averaged Perceptron | 13 |
I reverted to the model without SMOTE, as that one performed better.
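For readers unfamiliar with SMOTE: it synthesizes new minority-class samples by interpolating between a minority sample and one of its nearest minority neighbours. A minimal hand-rolled sketch (the real experiment used Azure's SMOTE module; this is only an illustration on invented 2-D data):

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, seed=None):
    """Minimal SMOTE: create n_new points on the segments between each
    chosen minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        dists = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbours)
        gap = rng.random()                        # position along the segment
        out.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(out)

rng = np.random.default_rng(0)
majority = rng.normal(0, 1, size=(90, 2))   # e.g. approved loans
minority = rng.normal(3, 1, size=(10, 2))   # e.g. rejected loans

synthetic = smote_oversample(minority, n_new=80, seed=0)
balanced_minority = np.vstack([minority, synthetic])
print(len(majority), len(balanced_minority))   # classes now balanced 90/90
```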
Step 3 – Trained Models
Having done all that, here are the final models that were created, which were run on the test data (test_loan):
Averaged Perceptron (Predictive Exp.)09132017 172603
TwoClass Logistic Regression (Predictive Exp.)09132017 174646
Two Class Support Vector Model (Predictive Exp.)09132017 175226
Human Resources Analytics
Objective: Find out why the best and most experienced employees are leaving prematurely, and predict which valuable employees will leave next. Here is the data: HR_comma_sep2
Exploratory Data Analysis
The first step was to check if there were any records with blank fields. There were none.
The label field is “Left”, indicating those who left. Let us start by exploring the correlation between the label field and the feature fields.
| Feature | Correlation |
| --- | --- |
| satisfaction_level | 0.388374983 |
| last_evaluation | 0.00656712 |
| number_project | 0.023787185 |
| average_montly_hours | 0.071287179 |
| time_spend_company | 0.144822175 |
| Work_accident | 0.154621634 |
| promotion_last_5years | 0.061788107 |
| salary | 0.157897791 |
There is a strong negative correlation between whether a person leaves and his/her satisfaction level, salary, whether the person met with a work accident, and whether there has been a promotion in the last 5 years.
The department with the highest attrition is Sales at 28%, followed by the Technical and Support departments at 20% and 16% respectively.
Also, beyond a certain tenure, the longer a person stays with the company the less likely he or she is to leave. A person with 3 years in the organization is the most vulnerable to leaving, followed by those with 5 years and then 4 years. People with more than 6 years of experience have not been found to leave the organization.
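The exploratory figures above come down to two pandas one-liners: feature-to-label correlation and attrition rate per group. A sketch on a tiny invented frame (column names follow the HR dataset; the values are made up):

```python
import pandas as pd

# Illustrative stand-in rows for the HR dataset.
hr = pd.DataFrame({
    "satisfaction_level": [0.2, 0.9, 0.4, 0.8, 0.3, 0.7],
    "time_spend_company": [3, 6, 3, 5, 4, 7],
    "department": ["sales", "technical", "sales", "support", "sales", "technical"],
    "left": [1, 0, 1, 0, 1, 0],
})

# Correlation of each numeric feature with the label.
corr = hr.corr(numeric_only=True)["left"].drop("left")
print(corr.sort_values())

# Attrition rate per department.
print(hr.groupby("department")["left"].mean())
```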
Prediction Approach
The objective is to predict the list of people who are most likely to leave the organization. To train the model we split the data into the training set and the testing set.
The training set contains everybody who has left plus an equal number of those who have not. To select the stayers, we sort those still within the organization by lowest satisfaction_level, salary, Work_accident and promotion_last_5years (in that order), and from that subset choose an equal number of records from the reverse end of the order. Choosing the reverse order keeps the training dataset unbiased: the reversed-order subset contains those with the highest satisfaction levels and salaries, no work accidents, and promotions in the last 5 years.
Another set of training data is also created with a combination of data of everybody who has left and a set of biased data from those who have not left (those who are quite likely to leave based on high correlation).
Both of the above training datasets are used to create models, to see which produces the more accurate one.
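The balanced-set construction described above (all leavers plus an equal number of opposite-profile stayers) can be sketched as follows. The frame and the numeric `salary_rank` encoding are illustrative assumptions, not the actual dataset:

```python
import pandas as pd

# Invented rows; column names follow the HR dataset.
hr = pd.DataFrame({
    "satisfaction_level": [0.1, 0.2, 0.9, 0.8, 0.3, 0.95, 0.15, 0.85],
    "salary_rank": [0, 1, 2, 2, 0, 2, 0, 1],   # hypothetical encoding: low=0..high=2
    "Work_accident": [0, 0, 1, 1, 0, 1, 0, 0],
    "promotion_last_5years": [0, 0, 1, 0, 0, 1, 0, 0],
    "left": [1, 1, 0, 0, 1, 0, 0, 0],
})

leavers = hr[hr["left"] == 1]
stayers = hr[hr["left"] == 0].sort_values(
    ["satisfaction_level", "salary_rank", "Work_accident", "promotion_last_5years"]
)

# Unbiased set: pair leavers with stayers from the reverse end of the
# sorted order (highest satisfaction, salary, etc.).
unbiased = pd.concat([leavers, stayers.tail(len(leavers))])

# Biased set: pair leavers with the stayers who look most likely to leave.
biased = pd.concat([leavers, stayers.head(len(leavers))])
```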
We use Microsoft Azure Machine Learning Studio to experiment with various algorithms and pick the best one. Five algorithms were chosen; the performance metrics for each are given below:
| Name of Algorithm | AUC | Accuracy | Precision |
| --- | --- | --- | --- |
| TwoClass Averaged Perceptron | 0.749 | 0.682 | 0.683 |
| TwoClass Support Vector Machine | 0.673 | 0.599 | 0.589 |
| TwoClass Logistic Regression | 0.745 | 0.677 | 0.676 |
| TwoClass Bayes Point Machine | 0.744 | 0.677 | 0.679 |
| TwoClass Decision Jungle | 0.991 | 0.969 | 0.990 |

Next we change the input to the biased dataset and gauge how the above five perform:
| Name of Algorithm | AUC | Accuracy | Precision |
| --- | --- | --- | --- |
| TwoClass Averaged Perceptron | 0.853 | 0.800 | 0.745 |
| TwoClass Support Vector Machine | 0.835 | 0.778 | 0.725 |
| TwoClass Logistic Regression | 0.853 | 0.801 | 0.744 |
| TwoClass Bayes Point Machine | 0.851 | 0.792 | 0.731 |
| TwoClass Decision Jungle | 0.992 | 0.974 | 0.991 |

The biased dataset therefore appears to produce the better model. Of the above five, we choose the top three in the following order:
First: TwoClass Decision Jungle
Second: TwoClass Logistic Regression
Third: TwoClass Averaged Perceptron
The Decision Jungle model was chosen as the best. Upon analysis of the predicted outcome, the department-wise pattern was very similar to the modelling data: 27% from Sales, 18% from Technical and 15% from Support.
(The model is Decision Jungle Model Biased, along with the solution's API key: CprmDLspC+PeGrwL3269PD+VnsaBR6lsS1Bgi9Dh+xt0LGvU1NRnYVoUDSj+GgwyBG7ezl8ermhf9gmyn0pDXg==.) The model predictions were quite accurate when I ran them across the validation dataset.
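A published Azure ML Studio (classic) model is typically consumed by POSTing JSON to its scoring endpoint with the API key in a Bearer header. The sketch below only builds that payload; the URL placeholders and the exact column order are assumptions, not the actual service details, so substitute your own before uncommenting the request:

```python
import json
# import urllib.request  # uncomment to actually call the service

# Placeholders -- replace with your workspace/service details and key.
url = ("https://<region>.services.azureml.net/workspaces/<ws>"
       "/services/<svc>/execute?api-version=2.0")
api_key = "<your-api-key>"

# Classic Azure ML Studio request-response payload shape; the column
# names follow the HR dataset described above.
payload = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["satisfaction_level", "last_evaluation",
                            "number_project", "average_montly_hours",
                            "time_spend_company", "Work_accident",
                            "promotion_last_5years", "salary"],
            "Values": [["0.38", "0.53", "2", "157", "3", "0", "0", "low"]],
        }
    },
    "GlobalParameters": {},
}

body = json.dumps(payload).encode("utf-8")
headers = {"Content-Type": "application/json",
           "Authorization": "Bearer " + api_key}
# req = urllib.request.Request(url, body, headers)
# result = json.loads(urllib.request.urlopen(req).read())
```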
From my end, I’ve been learning Tableau and Power BI, although I’ve been fairly impressed by D3.js and how light it can be on memory/resources while still rendering beautifully.
I started working on Tableau and Power BI in parallel. Although Power BI (a Microsoft product) entered my life earlier than Tableau (and even before D3.js), it is Tableau that has impressed me the most. I cannot really comment on QlikView.
Here are some of my projects on Tableau:
All of my other projects can be found here:
https://public.tableau.com/profile/rajiv.ramanjani#!/
So it was obvious that a regression model needed to be built, not a classification model. The most straightforward choice was a Linear Regression model. However, that didn’t help: the predictions that came out of it were very poor, around 8% accuracy, nowhere near what we needed.
That is where I got stuck initially.
Later I chanced upon Microsoft Azure Machine Learning. I tried Linear Regression there as well, and the accuracy was equally poor.
Azure gave me the choice of trying out different regression models. This was real fun, as I could build models quickly and compare their accuracy easily. Besides Linear Regression, I tried Decision Forest Regression and Bayesian Linear Regression on Azure. Of the three, Decision Forest Regression was the best, with accuracy around 95%.
Later I tried the same in Python, with several regression models: the Decision Tree Regressor, Gradient Boosting Regressor, Random Forest Regressor, Linear Regression and Bayesian Linear Regression. Of these, the Gradient Boosting Regressor was the most difficult to work with, as it takes a really long time to execute. In Python, the Random Forest Regressor was the most accurate at 96%, while the Decision Tree Regressor scored 92%.
So the final decision went with Random Forest Regressor.
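The Python model bake-off described above can be sketched with scikit-learn. Note this runs on synthetic data, so the relative scores will differ from the Walmart results quoted above (on purely linear synthetic data, Linear Regression will in fact do well):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import BayesianRidge, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in; the real project used the Walmart sales data.
X, y = make_regression(n_samples=400, n_features=10, noise=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "Linear Regression": LinearRegression(),
    "Bayesian Linear Regression": BayesianRidge(),
    "Decision Tree Regressor": DecisionTreeRegressor(random_state=0),
    "Random Forest Regressor": RandomForestRegressor(random_state=0),
    "Gradient Boosting Regressor": GradientBoostingRegressor(random_state=0),
}

for name, model in models.items():
    model.fit(X_tr, y_tr)
    # R^2 on held-out data serves as the "accuracy" figure.
    print(f"{name}: R^2 = {model.score(X_te, y_te):.3f}")
```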
The entire project sits here:
Rajiv's GitHub repo for the Walmart project
The fun part of this project was working with Azure; it helped me a lot. I was also able to create an API plugin and use it in Excel to make predictions.
The best part was the good feedback from my academy staff and student peers, who rated the project one of the best. It just makes me feel good and makes me want to do more.