ML Regression Project: Boston Housing


The input data was sourced from here. Note that you need to be a Kaggle member to download the data.

Step 1: Data Description

The training data has the following columns, described below.

crim: per capita crime rate by town.
zn: proportion of residential land zoned for lots over 25,000 sq. ft.
indus: proportion of non-retail business acres per town.
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox: nitrogen oxides concentration (parts per 10 million).
rm: average number of rooms per dwelling.
age: proportion of owner-occupied units built prior to 1940.
dis: weighted mean of distances to five Boston employment centres.
rad: index of accessibility to radial highways.
tax: full-value property-tax rate per $10,000.
ptratio: pupil-teacher ratio by town.
black: 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town.
lstat: lower status of the population (percent).
medv: median value of owner-occupied homes in $1000s.

The last column, medv, is the target variable to be predicted.

Step 2: Data Wrangling

This data is quite clean: all variables are in integer or float format, with no missing values, so there is very little to do in terms of data wrangling. The correlation of each field with the medv target is shown below (computed using the CORREL formula in MS Excel).

Field     Correlation with medv
ID       -0.22169
crim     -0.40745
zn        0.344842
indus    -0.47393
chas      0.20439
nox      -0.41305
rm        0.689598
age      -0.35889
dis       0.249422
rad      -0.35225
tax      -0.44808
ptratio  -0.48138
black     0.33666
lstat    -0.7386
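The same Pearson correlations can be computed in pandas instead of Excel. A minimal sketch: a small synthetic frame stands in here for the actual Kaggle file, which in practice would be loaded with pd.read_csv("train.csv"):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training data; with the real file this
# would simply be: df = pd.read_csv("train.csv")
rng = np.random.default_rng(0)
rm = rng.normal(6.3, 0.7, 100)       # rooms per dwelling
lstat = rng.normal(12.7, 7.0, 100)   # % lower status
medv = 5.0 * rm - 0.6 * lstat + rng.normal(0, 2, 100)
df = pd.DataFrame({"rm": rm, "lstat": lstat, "medv": medv})

# Pearson correlation of every column with the target -
# the pandas equivalent of Excel's CORREL, one call for all columns.
corr_with_medv = df.corr()["medv"].drop("medv").sort_values()
print(corr_with_medv)
```

As in the Excel table, rm comes out strongly positive and lstat strongly negative.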

As with earlier experiments, I did the project both on Microsoft Azure and in Python.

Step 3: Training the model


The Azure project is available here. The models trained there, along with their metrics, are given below:

Model Name                        Coefficient of Determination
Decision Forest Regression        0.729938
Boosted Decision Tree Regression  0.832524
Neural Network Regression         0.822492
Linear Regression                 0.635432
Bayesian Linear Regression        0.58574
Poisson Regression                0.600579
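The "Coefficient of Determination" in the table is R², the standard regression metric in both Azure ML and scikit-learn: 1 minus the ratio of residual to total sum of squares, where 1.0 is a perfect fit and 0.0 is no better than always predicting the mean. A quick illustration with made-up values:

```python
from sklearn.metrics import r2_score

# Hypothetical actual and predicted medv values, for illustration only.
y_true = [20.0, 25.0, 15.0, 30.0]
y_pred = [21.0, 24.0, 16.0, 28.0]

# R^2 = 1 - SS_residual / SS_total = 1 - 7 / 125 = 0.944 here.
r2 = r2_score(y_true, y_pred)
print(r2)
```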

The same project was done using Python and the scores were as below:

Model Name               Score
Decision Tree Regressor  0.77152717185183195
Random Forest Regressor  0.82706521966378554
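The scikit-learn side can be sketched as below. Since the Kaggle file is not bundled here, make_regression stands in for the 332-record training set; with the real data, X would be the train.csv features and y the medv column:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the 332-record Kaggle training set.
X, y = make_regression(n_samples=332, n_features=13,
                       noise=10.0, random_state=42)

# Same split as in the write-up: hold out 32 records for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=32, random_state=42)

scores = {}
for model in (DecisionTreeRegressor(random_state=42),
              RandomForestRegressor(n_estimators=100, random_state=42)):
    model.fit(X_train, y_train)
    # .score() returns R^2 on the held-out validation records.
    scores[type(model).__name__] = model.score(X_val, y_val)

print(scores)
```

On the synthetic data the exact scores will differ from the table above, but the ensemble (random forest) typically edges out the single decision tree, matching the pattern reported.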

The Python project is here on my GitHub account.

Step 4: Making actual predictions

The final winners are the "Boosted Decision Tree Regression" algorithm on Azure and the "Random Forest Regressor" algorithm in Python. I split the training data of 332 records into 300 records for training and 32 for validation. The predicted outcomes were very close to the actual values on the validation subset.

Random Forest Regressor

Actual medv Predicted medv
17.7 21.06
19.5 20.11
20.2 24.01
21.4 18.2
19.9 20.74
19 14.61
19.1 15.32
19.1 14.06
20.1 18.16
19.6 20.65
23.2 21
13.8 13.42
16.7 15.77
12 13.75
14.6 13.84
21.4 19.24
23 23.04
23.7 35.28
21.8 22.99
20.6 21.27
19.1 21
20.6 23.4
15.2 17.89
8.1 15.41
13.6 16.82
20.1 20.11
21.8 18.12
18.3 19.55
17.5 19.1
22.4 31.44
20.6 28.74
23.9 27.95
11.9 33.46
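The closeness claimed above can be quantified directly from the table. A short script computing the mean absolute error over the validation rows listed (the numbers are copied verbatim from the table):

```python
# Actual vs. predicted medv values, copied from the validation table above.
actual = [17.7, 19.5, 20.2, 21.4, 19.9, 19.0, 19.1, 19.1, 20.1, 19.6, 23.2,
          13.8, 16.7, 12.0, 14.6, 21.4, 23.0, 23.7, 21.8, 20.6, 19.1, 20.6,
          15.2, 8.1, 13.6, 20.1, 21.8, 18.3, 17.5, 22.4, 20.6, 23.9, 11.9]
predicted = [21.06, 20.11, 24.01, 18.2, 20.74, 14.61, 15.32, 14.06, 18.16,
             20.65, 21.0, 13.42, 15.77, 13.75, 13.84, 19.24, 23.04, 35.28,
             22.99, 21.27, 21.0, 23.4, 17.89, 15.41, 16.82, 20.11, 18.12,
             19.55, 19.1, 31.44, 28.74, 27.95, 33.46]

# Mean absolute error, in the same $1000s units as medv.
mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
print(f"rows: {len(actual)}, MAE: {mae:.2f}")
```

Most rows are within a few thousand dollars of the actual value; a handful of outliers (e.g. the last row, 11.9 vs. 33.46) pull the average error up.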

Boosted Decision Tree Regression

(Screenshot of the Azure predicted vs. actual medv values.)
As you can see, the predictions are fairly close to the actuals.
