The input data was sourced from here. Of course, you need to be a Kaggle member to be able to download the data.
Step 1 : Data Description
The training data has the following columns, which are described below.

crim – per capita crime rate by town.
zn – proportion of residential land zoned for lots over 25,000 sq.ft.
indus – proportion of non-retail business acres per town.
chas – Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox – nitrogen oxides concentration (parts per 10 million).
rm – average number of rooms per dwelling.
age – proportion of owner-occupied units built prior to 1940.
dis – weighted mean of distances to five Boston employment centres.
rad – index of accessibility to radial highways.
tax – full-value property-tax rate per \$10,000.
ptratio – pupil-teacher ratio by town.
black – 1000(Bk – 0.63)^2 where Bk is the proportion of blacks by town.
lstat – lower status of the population (percent).
medv – median value of owner-occupied homes in \$1000s.
The last column, i.e. medv, is the variable to be predicted.
Step 2 : Data Wrangling
This data is quite clean – all variables are in integer or float format and there is no missing data – so there is very little to do in terms of data wrangling. The correlation of each field with the medv target is given below (computed using the CORREL formula in MS Excel).
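The same missing-value check and correlation calculation can be reproduced with pandas. The sketch below assumes the Kaggle training file is named train.csv; pandas was not part of the original Excel workflow.

```python
import pandas as pd

# Load the training data (file name assumed from the Kaggle download)
train = pd.read_csv("train.csv")

# Confirm there are no missing values and that all columns are numeric
print(train.isnull().sum())
print(train.dtypes)

# Correlation of every field with the medv target (equivalent to Excel's CORREL)
print(train.corr()["medv"].sort_values(ascending=False))
```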
As with earlier experiments, I did the project both on Microsoft Azure and in Python.
Step 3: Training the model
The Azure project is available here. The models that were trained, along with their metrics, are given below:
| Model Name | Coefficient of Determination |
| --- | --- |
| Decision Forest Regression | 0.729938 |
| Boosted Decision Tree Regression | 0.832524 |
| Neural Network Regression | 0.822492 |
| Bayesian Linear Regression | 0.58574 |
The same project was done using Python and the scores were as below:
| Model Name | Score |
| --- | --- |
| Decision Tree Regressor | |
| Random Forest Regressor | |
The Python project is here on my GitHub account.
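For reference, a minimal scikit-learn sketch of how the two Python models could be trained and scored is below; the score() method returns the coefficient of determination (R²), which makes the numbers comparable to the Azure metrics. The file name, the hold-out split, and the hyperparameters are assumptions, not the exact settings used in the project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Assumed file name from the Kaggle download
train = pd.read_csv("train.csv")

# Separate features and target (an ID column, if present, should also be dropped)
X = train.drop(columns=["medv"])
y = train["medv"]

# Hold out part of the data so the R^2 scores are measured on unseen records
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

for model in (DecisionTreeRegressor(random_state=42),
              RandomForestRegressor(n_estimators=100, random_state=42)):
    model.fit(X_train, y_train)
    # score() returns the coefficient of determination (R^2) on the validation set
    print(type(model).__name__, model.score(X_val, y_val))
```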
Step 4 : Making actual predictions
The final winners are the “Boosted Decision Tree Regression” algorithm on Azure and the “Random Forest Regressor” algorithm in Python. I split the training data of 332 records into 300 records to train and 32 to validate. The predicted outcome was very close to the actual records on the validation subset.
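A minimal sketch of that validation step with the Random Forest Regressor is below, using the 300/32 split described above; the file name and hyperparameters are assumptions.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Assumed file name from the Kaggle download
train = pd.read_csv("train.csv")

# First 300 records to train, the remaining records to validate
train_part, val_part = train.iloc[:300], train.iloc[300:]

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(train_part.drop(columns=["medv"]), train_part["medv"])

# Compare predicted medv against actual medv on the held-out records
comparison = pd.DataFrame({
    "Actual medv": val_part["medv"].values,
    "Predicted medv": model.predict(val_part.drop(columns=["medv"])),
})
print(comparison)
```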
Random Forest Regressor
| Actual medv | Predicted medv |
| --- | --- |
Boosted Decision Tree Regression
As you can see, the predictions are fairly close to the actuals.