The input data was sourced from here. Of course, you need to be a Kaggle member to be able to download the data.
Step 1: Data Description
The training data has the following columns, described below.
crim – per capita crime rate by town.
zn – proportion of residential land zoned for lots over 25,000 sq. ft.
indus – proportion of non-retail business acres per town.
chas – Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox – nitrogen oxides concentration (parts per 10 million).
rm – average number of rooms per dwelling.
age – proportion of owner-occupied units built prior to 1940.
dis – weighted mean of distances to five Boston employment centres.
rad – index of accessibility to radial highways.
tax – full-value property-tax rate per \$10,000.
ptratio – pupil-teacher ratio by town.
black – 1000(Bk – 0.63)^2, where Bk is the proportion of blacks by town.
lstat – lower status of the population (percent).
medv – median value of owner-occupied homes in \$1000s.
The last column, medv, is the target variable to be predicted.
Step 2: Data Wrangling
This data is quite clean: all variables are in integer or float format, and there is no missing data. So there is very little to do in terms of data wrangling. The correlation of each field with the medv target is given below (computed using the CORREL formula in MS Excel).
| Field | Correlation with medv |
| --- | --- |
| ID | 0.22169 |
| crim | 0.40745 |
| zn | 0.344842 |
| indus | 0.47393 |
| chas | 0.20439 |
| nox | 0.41305 |
| rm | 0.689598 |
| age | 0.35889 |
| dis | 0.249422 |
| rad | 0.35225 |
| tax | 0.44808 |
| ptratio | 0.48138 |
| black | 0.33666 |
| lstat | 0.7386 |
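The same correlations can be computed in Python with pandas instead of Excel. Below is a minimal sketch: in the real project you would load the Kaggle train.csv, but here a small synthetic frame (with only rm, lstat and medv, and made-up coefficients) stands in for the actual file, so the printed numbers will not match the table above.

```python
import numpy as np
import pandas as pd

# In the real project: df = pd.read_csv("train.csv")
# Synthetic stand-in with two features and the medv target, for illustration only
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "rm": rng.normal(6.3, 0.7, 332),       # average rooms per dwelling
    "lstat": rng.uniform(2.0, 35.0, 332),  # % lower status of the population
})
df["medv"] = 5.0 * df["rm"] - 0.5 * df["lstat"] + rng.normal(0.0, 2.0, 332)

# Correlation of every field with medv -- the pandas equivalent of Excel's CORREL
correlations = df.corr()["medv"].drop("medv").sort_values()
print(correlations)
```

One advantage over CORREL is that `df.corr()` gives the full correlation matrix in a single call, including signs, so positively and negatively correlated fields are immediately distinguishable.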
As with earlier experiments, I did the project both on Microsoft Azure and in Python.
Step 3: Training the model
The Azure project is available here. The models that were trained, along with their metrics, are given below:
| Model Name | Coefficient of Determination |
| --- | --- |
| Decision Forest Regression | 0.729938 |
| Boosted Decision Tree Regression | 0.832524 |
| Neural Network Regression | 0.822492 |
| Linear Regression | 0.635432 |
| Bayesian Linear Regression | 0.58574 |
| Poisson Regression | 0.600579 |
The same project was done using Python and the scores were as below:
| Model Name | Score |
| --- | --- |
| Decision Tree Regressor | |
| Random Forest Regressor | |
The Python project is here on my GitHub account.
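The two scikit-learn models can be trained and scored in a few lines. This is a sketch rather than the actual GitHub code: the 13 Kaggle features and the medv target are replaced by a synthetic 332 × 13 matrix, and the score reported is the cross-validated R² (the same "coefficient of determination" metric that Azure reports).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the 13 Kaggle features and the medv target
rng = np.random.default_rng(0)
X = rng.normal(size=(332, 13))
y = X @ rng.normal(size=13) + rng.normal(scale=0.5, size=332)

scores = {}
for name, model in [
    ("Decision Tree Regressor", DecisionTreeRegressor(random_state=0)),
    ("Random Forest Regressor", RandomForestRegressor(n_estimators=100, random_state=0)),
]:
    # Mean R^2 (coefficient of determination) over 5 cross-validation folds
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: {scores[name]:.3f}")
```

Cross-validation gives a more stable score than a single train/validation split, which matters on a dataset of only 332 records.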
Step 4: Making actual predictions
The final winners are the "Boosted Decision Tree Regression" algorithm on Azure and the "Random Forest Regressor" algorithm in Python. I split the training data of 332 records into 300 records for training and 32 for validation. The predicted outcomes were very close to the actual values on the validation subset.
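The 300/32 split described above can be sketched as follows. Again a synthetic stand-in replaces the Kaggle data, so the output will not reproduce the table below; the point is the shape of the workflow, not the numbers.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the 332-record Kaggle training data
rng = np.random.default_rng(0)
X = rng.normal(size=(332, 13))
y = X @ rng.normal(size=13) + rng.normal(scale=0.5, size=332)

# First 300 records train the model, the remaining 32 validate it
X_train, X_val = X[:300], X[300:]
y_train, y_val = y[:300], y[300:]

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Side-by-side comparison of actual vs predicted medv on the validation subset
comparison = pd.DataFrame({
    "Actual medv": y_val,
    "Predicted medv": model.predict(X_val),
})
print(comparison.head())
```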
Random Forest Regressor
| Actual medv | Predicted medv |
| --- | --- |
| 17.7 | 21.06 |
| 19.5 | 20.11 |
| 20.2 | 24.01 |
| 21.4 | 18.2 |
| 19.9 | 20.74 |
| 19 | 14.61 |
| 19.1 | 15.32 |
| 19.1 | 14.06 |
| 20.1 | 18.16 |
| 19.6 | 20.65 |
| 23.2 | 21 |
| 13.8 | 13.42 |
| 16.7 | 15.77 |
| 12 | 13.75 |
| 14.6 | 13.84 |
| 21.4 | 19.24 |
| 23 | 23.04 |
| 23.7 | 35.28 |
| 21.8 | 22.99 |
| 20.6 | 21.27 |
| 19.1 | 21 |
| 20.6 | 23.4 |
| 15.2 | 17.89 |
| 8.1 | 15.41 |
| 13.6 | 16.82 |
| 20.1 | 20.11 |
| 21.8 | 18.12 |
| 18.3 | 19.55 |
| 17.5 | 19.1 |
| 22.4 | 31.44 |
| 20.6 | 28.74 |
| 23.9 | 27.95 |
| 11.9 | 33.46 |
Boosted Decision Tree Regression
As you can see, the predictions are fairly close to the actuals.