Predicting House Prices with Random Forest and XGBoost


The objective of this project was to test different ensemble approaches (models composed of weak learners that, when combined, form a strong predictor) and find the one that best fits a house price prediction task. Algorithms based on Bagging (bootstrap aggregation) and Boosting were tested and then compared against each other. To learn more about these mechanisms, see this article by Joseph Rocca. Let's begin by explaining in detail what was done:

1. Loading the data

First, we import the libraries that will be used:
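As a reference, a typical import block for this workflow might look like the following; the exact set of libraries is an assumption based on what is used later in the project:

```python
# Core data-handling and visualization libraries (assumed set)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# scikit-learn utilities used throughout the project
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
```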


Next, load the data. Bear in mind that the data is stored in a local directory, so the path used to load it may need to change if the file is moved.
Also, before reading the code, consider some important notes:
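As a reference, loading the CSVs might look like this; the file names and local path below are placeholders and must match wherever the files actually live:

```python
import pandas as pd

# Placeholder paths: adjust to the local directory where the CSVs are stored
TRAIN_PATH = "data/train.csv"
TEST_PATH = "data/test.csv"

train = pd.read_csv(TRAIN_PATH)
test = pd.read_csv(TEST_PATH)
```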

2. Explanatory Data Analysis (EDA)

We start by describing the data, searching for helpful insights before we begin building the model:

2.1 Train & test data statistics

First, let's look at the shape of the train/test feature DataFrames. Next, we'll look at the statistics of the numerical columns to find relevant insights before building the model. For that, the describe() method works perfectly.
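A minimal sketch of this inspection, assuming the DataFrames are named train and test:

```python
# Shape of the feature DataFrames
print("Train shape:", train.shape)
print("Test shape:", test.shape)

# Summary statistics of the numerical columns
print(train.describe())
```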

2.2 Response Variable Distribution

Now, take a look at the sale price (SalePrice) in the train data.
By plotting the smoothed density, we see that most prices are between 100 and 200 k USD:
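One possible way to plot that smoothed density with seaborn, assuming the target column is 'SalePrice':

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Kernel density estimate of the sale price distribution
sns.kdeplot(train["SalePrice"], fill=True)
plt.xlabel("SalePrice (USD)")
plt.title("Smoothed density of sale prices")
plt.show()
```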

2.3 Missing Values

Check whether the data has NaN values. Next, look into the categorical variables to check their cardinality and whether the training set contains all categories, or whether we might find some new ones at validation.
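One way to run these checks, sketched here under the assumption that train and test hold the raw features:

```python
# Count missing values per column, keeping only columns that have any
missing = train.isnull().sum()
print(missing[missing > 0].sort_values(ascending=False))

# Cardinality of each categorical column in the training set
cat_cols = train.select_dtypes(include="object").columns
print(train[cat_cols].nunique().sort_values(ascending=False))

# Categories that appear in the test set but never in training
for col in cat_cols:
    unseen = set(test[col].dropna()) - set(train[col].dropna())
    if unseen:
        print(col, "-> unseen in training:", unseen)
```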

2.3.1 Filling empty spaces

Looking at the data described above, it is time to treat that information. The treatment steps are listed below:

Note: If the program prints a message here, it is only a warning telling us that values are being replaced in the original DataFrame.
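The exact treatment depends on the columns listed above; as a hedged illustration only, a common pattern is to fill categorical gaps with the mode and numerical gaps with the median:

```python
# Illustrative imputation only; the real columns/strategies follow the list above
for col in train.columns:
    if train[col].dtype == "object":
        # Categorical: fill with the most frequent value
        train[col] = train[col].fillna(train[col].mode()[0])
    else:
        # Numerical: fill with the median
        train[col] = train[col].fillna(train[col].median())
```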

2.4 Reducing High Cardinality


Next, let's check which variables have high cardinality. This will be important when we start the encoding process.

(Note: only variables with a cardinality above 6 were analyzed; below that, the influence on training speed is small.)
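A sketch of that check, using the cardinality-over-6 cutoff mentioned above:

```python
CARDINALITY_THRESHOLD = 6

# Categorical columns whose number of distinct values exceeds the threshold
high_card = [
    col for col in train.select_dtypes(include="object").columns
    if train[col].nunique() > CARDINALITY_THRESHOLD
]
print("High-cardinality columns:", high_card)
```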



Points taken:

2.5 Numerical Features Selection

Finally, let's look at which features truly have a considerable correlation with the output variable. For that, we will define a new threshold:

Then, we will create a heatmap to assess collinearity between features.
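A minimal sketch of both steps; the threshold value below is a placeholder standing in for the one defined in the project:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Placeholder threshold; the value actually used is defined above
CORR_THRESHOLD = 0.5

# Numerical features whose absolute correlation with SalePrice exceeds the threshold
corr = train.select_dtypes("number").corr()["SalePrice"].drop("SalePrice")
selected = corr[corr.abs() > CORR_THRESHOLD].index.tolist()
print("Selected numerical features:", selected)

# Heatmap of pairwise correlations among the selected features
plt.figure(figsize=(10, 8))
sns.heatmap(train[selected].corr(), annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation between selected numerical features")
plt.show()
```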

Despite some correlation among themselves, it was decided to keep variables such as 'OverallQual' and 'GarageCars', as they are very important for the model.

3. Preprocessing

It is always important to preprocess the data before applying the ML model. We saw that a few columns require imputing values for completeness, and others require encoding for better results.
The preprocessing applied here follows these steps:

  1. First, split the train/test data, then select the numerical and categorical columns (the latter will be split between ordinal and one-hot encoding);
  2. Second, check whether there are categories in the test set that do not appear in training. If so, update the handle_unknown parameter of the encoder;
  3. Next, create pipelines for the categorical variables: one applied to features represented by an ordinal scale, and another applied to features to be one-hot encoded;
  4. Transform the columns and prepare for modeling (a sketch of the whole pipeline follows the note below);
Note: Because at least one column has a category never seen in training, we have to pass 'use_encoded_value' as the handle_unknown argument of the ordinal encoder.
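Putting those steps together, a hedged sketch of the preprocessing pipeline could look like this; the column lists (num_cols, ordinal_cols, onehot_cols) are assumptions standing in for the ones chosen during the EDA:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Placeholder column groups; in the project these come from the EDA above
num_cols = ["OverallQual", "GrLivArea", "GarageCars"]
ordinal_cols = ["ExterQual", "KitchenQual"]
onehot_cols = ["Neighborhood", "MSZoning"]

numeric_pipe = Pipeline([("impute", SimpleImputer(strategy="median"))])

# Ordinal pipeline: unseen categories are mapped to a sentinel value
ordinal_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
])

# One-hot pipeline: unseen categories are simply ignored
onehot_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer([
    ("num", numeric_pipe, num_cols),
    ("ord", ordinal_pipe, ordinal_cols),
    ("ohe", onehot_pipe, onehot_cols),
])
```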


4. Random Forest Regressor


Now we implement the Random Forest regression algorithm. We begin by importing the RandomForestRegressor() class from scikit-learn, then create the final pipeline that preprocesses the data and builds the model. Next, we use scikit-learn to predict and evaluate the model.

Note: 850 decision trees were assigned to the RF model based on experience; this number will be tuned later.
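Under the assumptions above (the preprocessor sketch and a train/validation split of the training data), the model pipeline and its evaluation could be sketched as:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Assumed split of the training data into train/validation folds
X = train.drop(columns=["SalePrice"])
y = train["SalePrice"]
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# Final pipeline: preprocessing followed by a Random Forest with 850 trees
rf_model = Pipeline([
    ("preprocess", preprocessor),
    ("model", RandomForestRegressor(n_estimators=850, random_state=0)),
])

rf_model.fit(X_train, y_train)
preds = rf_model.predict(X_valid)
print("MAE:", mean_absolute_error(y_valid, preds))
```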

This mean absolute error is the expected gap, in US dollars, between our prediction and the real value, which is a reasonable margin considering the range of prices in our data. It now serves as a reference for improving the model's accuracy.

For instance, we can check visually with a regression line:
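One way to draw that visual check, assuming preds and y_valid from the previous step:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Scatter of predicted vs. actual prices with a fitted regression line
sns.regplot(x=y_valid, y=preds, line_kws={"color": "red"})
plt.xlabel("Actual SalePrice")
plt.ylabel("Predicted SalePrice")
plt.title("Random Forest: predicted vs. actual")
plt.show()
```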

4.1 Improve Performance

Considerations:

- The first assumption is true. One way to correct it is to apply a log transformation to the target variable. If we do this, the result improves a lot (a sketch follows below):
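A hedged sketch of that correction using scikit-learn's TransformedTargetRegressor, an equivalent alternative to manually applying np.log1p before fitting and np.expm1 after predicting:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline

# Train on log1p(SalePrice) and automatically invert the transform when predicting
rf_log = TransformedTargetRegressor(
    regressor=Pipeline([
        ("preprocess", preprocessor),
        ("model", RandomForestRegressor(n_estimators=850, random_state=0)),
    ]),
    func=np.log1p,
    inverse_func=np.expm1,
)

rf_log.fit(X_train, y_train)
log_preds = rf_log.predict(X_valid)
```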


Great! We got the best model at 450 trees. Let's check the results of our RF model:
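For reference, re-evaluating the tuned configuration might look like this; n_estimators=450 comes from the search above, and everything else reuses the earlier sketch:

```python
from sklearn.metrics import mean_absolute_error

# Refit the log-target pipeline with the tuned number of trees
rf_log.set_params(regressor__model__n_estimators=450)
rf_log.fit(X_train, y_train)
print("Tuned RF MAE:", mean_absolute_error(y_valid, rf_log.predict(X_valid)))
```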

5. Extreme Gradient Boosting Regressor
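As a hedged sketch of this step, an XGBoost regressor can be dropped into the same pipeline used above; the hyperparameters shown are illustrative, not the ones tuned in the project:

```python
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import Pipeline
from xgboost import XGBRegressor

# Same preprocessing, but with a boosting-based learner instead of bagging
xgb_model = Pipeline([
    ("preprocess", preprocessor),
    ("model", XGBRegressor(n_estimators=1000, learning_rate=0.05, random_state=0)),
])

xgb_model.fit(X_train, y_train)
print("XGBoost MAE:", mean_absolute_error(y_valid, xgb_model.predict(X_valid)))
```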


6. Models combined
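A simple way to combine the two models, sketched here as a plain average of their predictions (the actual blending used in the project may differ):

```python
from sklearn.metrics import mean_absolute_error

# Blend the two regressors by averaging their predictions
rf_preds = rf_log.predict(X_valid)
xgb_preds = xgb_model.predict(X_valid)
blend_preds = (rf_preds + xgb_preds) / 2

print("Blended MAE:", mean_absolute_error(y_valid, blend_preds))
```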


7. Conclusion


To summarize, we got an accurate model for predicting house prices in Ames, Iowa, by using the following: