To buy or not to buy: a question asked by many prospective house buyers. In this section we explore how machine learning techniques can help in making that decision. We specifically study the Ames housing dataset, available on Kaggle. The dataset consists of 80 features describing various characteristics of a house, which machine learning techniques can exploit to predict the house selling price.
Feature types
The table below displays the various types of features available in the dataset.
Data analysis
We explore the top features as identified by SelectKBest, i.e. those that appear to contribute the most to predicting the sale price. SelectKBest supports two scoring functions for regression: ‘f_regression’ and ‘mutual_info_regression’. We use both functions to obtain the 10 most important features, shown below:
Interestingly, both functions select the same 9 out of 10 features, though their rankings differ slightly. ‘Neighborhood’ is the highest-scoring feature under ‘mutual_info_regression’ but does not appear in the list produced by ‘f_regression’.
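The comparison above can be sketched with scikit-learn's SelectKBest. A small synthetic DataFrame stands in for the real Ames features here (the column names are borrowed from the dataset, but the values are generated), so the selected features below are illustrative rather than the actual competition results:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression

# Synthetic stand-in for the numeric Ames features (the real set has 80 columns).
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.normal(size=(200, 5)),
    columns=["OverallQual", "GrLivArea", "GarageCars", "YearBuilt", "LotArea"],
)
# Fake sale price driven mostly by the first two columns.
y = 3 * X["OverallQual"] + 2 * X["GrLivArea"] + rng.normal(scale=0.1, size=200)

def top_features(score_func, k=3):
    """Return the k highest-scoring feature names for the given scoring function."""
    selector = SelectKBest(score_func=score_func, k=k).fit(X, y)
    return list(X.columns[selector.get_support()])

print(top_features(f_regression))
print(top_features(mutual_info_regression))
```

Running both scoring functions on the same data, as in the study, makes it easy to compare their rankings side by side.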
Data cleaning
Both the training and test datasets contained features with a relatively large number of NA’s, but on closer inspection, for most of the categorical variables these turned out to be valid values (for example, NA in a garage-related feature simply indicates the house has no garage). In such cases the NA’s were replaced by a ‘None’ value. For the other variables, NA’s were replaced by either the mean or the mode, depending on the feature.
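The imputation rule can be sketched in pandas. The two columns and their values below are a toy illustration, not the dataset's actual contents:

```python
import pandas as pd

# Toy frame: NA in the categorical column is a valid category ("no garage"),
# while the numeric NA is genuinely missing data.
df = pd.DataFrame({
    "GarageType": ["Attchd", None, "Detchd", None],
    "LotFrontage": [65.0, None, 80.0, 75.0],
})

# Categorical NAs become the explicit category 'None'.
df["GarageType"] = df["GarageType"].fillna("None")
# Numeric NAs are imputed with the column mean (the mode is used for some features).
df["LotFrontage"] = df["LotFrontage"].fillna(df["LotFrontage"].mean())
```

The same fill rules would be applied consistently to both the training and test frames so the two stay comparable.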
Outliers
The dataset contains at least 19 features with values lying more than 3 standard deviations from the mean. Due to the relatively small size of the dataset, removing records containing outliers could lead to a serious loss of data and, in turn, a weaker machine learning model. Hence, no outliers were removed in this study.
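A count like the "at least 19 features" above can be obtained with a simple z-score scan. The helper and the demo frame below are illustrative (the text does not give the exact procedure used):

```python
import numpy as np
import pandas as pd

def outlier_feature_count(df, threshold=3.0):
    """Count numeric columns containing at least one value more than
    `threshold` standard deviations from that column's mean."""
    count = 0
    for col in df.select_dtypes(include=np.number):
        std = df[col].std()
        if std == 0:  # constant column: no outliers by this definition
            continue
        z = (df[col] - df[col].mean()) / std
        if (z.abs() > threshold).any():
            count += 1
    return count

# Demo: one column with an extreme value, one constant column.
demo = pd.DataFrame({
    "GrLivArea": [1500] * 50 + [1480] * 49 + [9000],
    "OverallQual": [5] * 100,
})
print(outlier_feature_count(demo))
```

Note that the z-score criterion assumes roughly symmetric distributions; for heavily skewed features, an IQR-based rule would flag different records.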
Data normalization
For data normalization I used ‘PowerTransformer’, the preferred choice because of its ability to handle the skew and outliers present in the data. I kept the default transformation method, ‘Yeo-Johnson’.
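A minimal sketch of this step, using a synthetic right-skewed feature in place of the real data. By default scikit-learn's PowerTransformer applies Yeo-Johnson and then standardizes, so the transformed feature comes out with zero mean and unit variance:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed values, resembling features such as lot area.
rng = np.random.default_rng(1)
X = rng.lognormal(mean=9, sigma=0.5, size=(500, 1))

# method="yeo-johnson" is the default; shown explicitly for clarity.
pt = PowerTransformer(method="yeo-johnson")
X_t = pt.fit_transform(X)

# After fitting, the training data is mapped to zero mean and unit variance.
print(X_t.mean(), X_t.std())
```

Unlike Box-Cox, Yeo-Johnson also accepts zero and negative inputs, which matters for features like basement area that are often zero.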
Feature selection
Feature selection in this exercise was performed by trying various combinations of SelectKBest and PCA with an ElasticNet model.
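One way to try such combinations is a scikit-learn Pipeline searched with GridSearchCV. The data below is synthetic and the grid values are illustrative, since the text does not specify the exact combinations tried:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Synthetic regression data: 20 features, signal in the first two.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 20))
y = 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=300)

pipe = Pipeline([
    ("select", SelectKBest(score_func=f_regression)),
    ("pca", PCA()),
    ("model", ElasticNet(alpha=0.01)),
])

# Cross-validated search over SelectKBest/PCA combinations (grid values assumed).
grid = GridSearchCV(
    pipe,
    {"select__k": [5, 10], "pca__n_components": [3, 5]},
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

Chaining the steps in one pipeline ensures that feature selection and PCA are refit inside each cross-validation fold, avoiding leakage from the held-out data.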
Machine learning models
The following machine learning models were used for training:
· ElasticNet
· XGBoost
· Neural network
· SVR
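The models above can be compared on a common footing with cross-validation. The sketch below uses synthetic data and the scikit-learn implementations of three of the four models; XGBoost is omitted only because it lives in the third-party `xgboost` package (its `XGBRegressor` would slot into the same dictionary), and the hyperparameters shown are placeholders, not the ones tuned in the study:

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the preprocessed Ames features.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 8))
y = 3 * X[:, 0] + rng.normal(scale=0.3, size=200)

models = {
    "ElasticNet": ElasticNet(alpha=0.01),
    "SVR": SVR(),
    "Neural network": MLPRegressor(hidden_layer_sizes=(32,),
                                   max_iter=2000, random_state=0),
    # XGBoost: xgboost.XGBRegressor() would be added here.
}

# Mean R^2 over 3 folds for each model.
scores = {name: cross_val_score(m, X, y, cv=3).mean()
          for name, m in models.items()}
print(scores)
```

Scoring every candidate with the same cross-validation split keeps the comparison fair before committing to a final model.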