How I used Web Scraping and Machine Learning to solve my own rental problem

A detailed look at how we can extract data from rent listings, predict a fair price, and get suggestions of apartments based on preferences.

Bruno Caraffa
9 min read · Aug 17, 2022

At the end of this year I’ll probably need to move out of my current rental, and with that comes all the trouble of searching for a new apartment. I was in the same position two years ago and remember how tedious it was and how much time it took to browse the listing pages looking for apartments that fit my preferences (and, unfortunately, my budget). And even when I found some promising candidates, one doubt remained: was that rent fair compared to the market price?

Photo by Jon Tyson on Unsplash

So I decided to scrape the HTML of the rental listings in my city, Brasília, extract the available data, and build a machine learning model to help me find a fair price for the conditions I wanted. Later on, I thought a recommendation system that suggests listings based on the selected conditions would also be a useful feature, since it could save me time in finding the apartments that fit me, and I would be able to compare the prices of those recommended apartments with the fair price the model gives me. Finally, I realized that plenty of friends had gone through the same problem, so I decided to publish the prototype solution online to help the people close to me the next time they look for an apartment.

1) Web Scraping

First of all, I needed the data, and none of the main rental listing sites provides a CSV file with the listing features. But that’s fine; it would be too easy, and nothing in life is. So I decided to scrape the site with the most listings, extract the data for each apartment, store it in a dataframe, and build a model later on.

The main problem I ran into was that the listings had a dynamic data structure. The main features (name, price, and area) were mandatory for all of them, but the other features were probably optional when the listing was registered. Those will be missing in the dataset, and we’ll try to fill them in where possible. There’s real life again.

Here is how the full web scraping code works (the complete script is on GitHub, and a minimal sketch follows the walkthrough below).

The code is divided into five blocks.

  1. Create the lists in which we’re going to store the data, “productlinks” and “aps”. Define the base URL for the site; here we consider the URL with the city of Brasília already filtered and only apartments, so houses and commercial listings are excluded from the start.
  2. Loop through the first 100 pages (usually there are far fewer on this site with the chosen filters) using f-strings. On each page, using BeautifulSoup, we find the specific URLs of all the listings on that page and append them to the base URL to build the complete URL, which we then append to the “productlinks” list. The result of this block is a list with the URLs of all the rental listings on the site.
  3. Now we define a function to capture the dynamic part of the data, which is stored in the “mb-0 text-normal” class of the h6 headers in the HTML, and return it as the “temp_data” dictionary.
  4. Loop through the links of all the listings stored in the “productlinks” list. For the static part of the data, we access the respective classes, extract the text, apply some minor text manipulations, and store the results in the “data” dictionary. Then we apply the “dynamic_portion” function to extract the dynamic data and merge the “temp_data” dictionary into “data”, giving us all the available data for each listing. Once the “data” dictionary is finalized, we append it to the “aps” list, where we store the data from all the apartments.
  5. Create a dataframe from the “aps” list and save it to a CSV file to use as the raw data for building the prediction model.
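As a rough illustration of those five blocks, here is a minimal sketch of the scraping flow. The base URL, the pagination parameter, and the listing-card/price/area classes are placeholders rather than the real site’s selectors; only the h6 “mb-0 text-normal” part comes from the walkthrough above.

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

BASE_URL = "https://www.example-rentals.com"  # placeholder, not the real site

# Block 1: lists that will hold the listing URLs and the apartment data
productlinks, aps = [], []

# Block 2: collect the URL of every listing across the first 100 result pages
for page in range(1, 101):
    html = requests.get(f"{BASE_URL}/aluguel/apartamento/brasilia?pagina={page}").text
    soup = BeautifulSoup(html, "html.parser")
    for card in soup.find_all("a", class_="listing-card"):  # placeholder class
        productlinks.append(BASE_URL + card["href"])

# Block 3: the dynamic features live in h6 headers of class "mb-0 text-normal"
def dynamic_portion(soup):
    temp_data = {}
    for h6 in soup.find_all("h6", class_="mb-0 text-normal"):
        key, _, value = h6.get_text(strip=True).partition(":")
        temp_data[key.strip()] = value.strip()
    return temp_data

# Block 4: visit each listing, grab the static fields, then merge in the dynamic ones
for link in productlinks:
    soup = BeautifulSoup(requests.get(link).text, "html.parser")
    data = {
        "name": soup.find("h1").get_text(strip=True),
        "price": soup.find(class_="price").get_text(strip=True),  # placeholder class
        "area": soup.find(class_="area").get_text(strip=True),    # placeholder class
    }
    data.update(dynamic_portion(soup))
    aps.append(data)

# Block 5: store everything as the raw dataset for the model
pd.DataFrame(aps).to_csv("listings_raw.csv", index=False)
```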
The extracted dataframe. NaNs are the features missing from the dynamic data of the listings.

2) Training the model

With the data from the apartments in hand, it was time to prepare the dataset for training. First, I did some minor manipulations on the data to create some features and adjust others. I won’t go into the specifics of the preprocessing here, but the full code of the application is available on GitHub for those curious about it. Then I needed to fill in the missing and zero values, and a few strategies were used for that, such as filling missing values with the median or average of the apartment’s neighborhood/number-of-rooms combination.
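Here is a minimal sketch of that fill strategy, assuming the column names “area”, “quartos”, and “setor” used later in the article; the exact imputation in the real code may differ.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("listings_raw.csv")

# Treat zeros in "area" as missing before imputing
df["area"] = df["area"].replace(0, np.nan)

# Fill missing areas with the median of the same neighborhood / number-of-rooms group,
# falling back to the overall median when a group has no valid values
group_median = df.groupby(["setor", "quartos"])["area"].transform("median")
df["area"] = df["area"].fillna(group_median).fillna(df["area"].median())
```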

The complete training code is available on GitHub; we’ll go through it in detail below.

Because of the many missing values in some of the features, I decided to keep only the more robust ones in order to obtain a solid prediction model. Some features had more missing values than filled ones and could have been valuable to the model, like the number of suites, parking spaces, property tax (IPTU here in Brazil), sun exposure, and others. But imputing that many values could skew the model, and that’s something we don’t want.

Thanks to Brasília’s address structure, it was pretty easy to obtain the neighborhood from the name feature. The listing name almost always contains an address, such as ‘SQS 210’ or ‘CA 2’, and the initials before the number refer to a specific neighborhood of the city (SQS or CA in these cases), which makes for a precious feature in our model. So, in order to preserve explainability and avoid bias in the model, I decided to go with the three most robust features: area, number of rooms, and neighborhood.
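As an illustration of that idea, the prefix before the number can be pulled out of the listing name with a small regex; the function name and the exact pattern are my own, not necessarily what the original code does.

```python
import re

def extract_neighborhood(name):
    """Return the letter prefix before the first number, e.g. 'SQS 210 Bloco A' -> 'SQS'."""
    match = re.search(r"\b([A-Z]{2,4})\s*\d+", name.upper())
    return match.group(1) if match else None

print(extract_neighborhood("Apartamento SQS 210"))  # SQS
print(extract_neighborhood("Apartamento CA 2"))     # CA
```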

So we end up with two numerical features, one continuous and one discrete (area and number of rooms, respectively), and one categorical feature (neighborhood), to which we apply a OneHotEncoder to prepare it for training. After fitting the OneHotEncoder on the original data, we store it in a pickle object to use on future input data (we fit the OHE on the original dataset and only transform the user’s input with it), and then transform our dataset so it’s ready for training. The last step was to add the condo fee to the rent price to obtain the total living cost and set that as our target variable.
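A minimal sketch of that encoding step follows. The column names (“area”, “quartos”, “setor”, “preco”, “condominio”) and file paths are assumptions for illustration:

```python
import pickle
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("listings_preprocessed.csv")  # hypothetical path to the preprocessed data

# Fit the encoder on the neighborhood column of the original data
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(df[["setor"]])

# Persist the fitted encoder so future user input is transformed with the same categories
with open("ohe.pkl", "wb") as f:
    pickle.dump(ohe, f)

setor_dummies = pd.DataFrame(
    ohe.transform(df[["setor"]]).toarray(),
    columns=ohe.get_feature_names_out(["setor"]),
    index=df.index,
)
X = pd.concat([df[["area", "quartos"]], setor_dummies], axis=1)

# Target: total living cost = rent + condo fee
y = df["preco"] + df["condominio"]
```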

With our data ready, it was time to split the dataset into train and test sets and decide which model to use for the predictions. First I went with multiple linear regression, usually a good choice for price prediction, but it tends to work better with more features than the three we’re using. Even so, I got a decent result: a mean absolute error of around R$ 1,000 per prediction (about $200 in August 2022). Then I tried a random forest regressor, which usually outperforms multiple regression on a small number of features. Said and done: using the RandomForestRegressor with some minor hyperparameter adjustments, I was able to drop the mean absolute error to around R$ 700 (about $140), roughly 30% better than the baseline model. I say “around” because every time we retrain the model there are different listings and therefore a different error in the predictions.
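Sketched out, the comparison looks roughly like this, reusing the X and y from the encoding sketch above; the hyperparameters are illustrative, not the exact ones used here:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline: multiple linear regression
baseline = LinearRegression().fit(X_train, y_train)
print("Linear regression MAE:", mean_absolute_error(y_test, baseline.predict(X_test)))

# Random forest with light hyperparameter tuning
rf = RandomForestRegressor(n_estimators=300, min_samples_leaf=2, random_state=42)
rf.fit(X_train, y_train)
print("Random forest MAE:", mean_absolute_error(y_test, rf.predict(X_test)))
```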

Here is a plot of how the predictions stack up against the original prices of the apartments:

Predictions vs. actual rent prices using the RandomForestRegressor
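For reference, a plot like this can be drawn in a few lines of matplotlib, assuming the rf model and test split from the previous sketch:

```python
import matplotlib.pyplot as plt

preds = rf.predict(X_test)

plt.figure(figsize=(10, 5))
plt.plot(y_test.values, "o", color="blue", label="Actual rent")
plt.plot(preds, "o", color="red", label="Predicted rent")
plt.xlabel("Apartment (test set)")
plt.ylabel("Total living cost (R$)")
plt.legend()
plt.show()
```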

It’s possible to see two huge outliers in blue, which are probably very fancy apartments; our model could not predict them, since we don’t have any features measuring the quality dimension of an apartment, only its size, number of rooms, and location.

All in all, it is still a good predictor, as the MAE (mean absolute error) is 30% smaller than with multiple regression. Most of the apartments are priced below R$ 10,000, and for those the red dots sit closer to the blue dots than for the ones above R$ 10,000, which suggests the model performs better at regular prices than at higher ones. That’s fine; for the time being, it’s better to miss on the expensive apartments than on the regular ones, because the vast majority of people will be looking for rents far below R$ 10,000. In data science it is extremely important to think about prediction errors and use logic to make the best decisions, fully understanding their impact. As we said above, if we want to be accurate on the expensive rents, we need to add new dimensions to the training data, such as the quality and condition of the apartment, or whether it is a penthouse.

3) Building a recommendation system

We’ve already scraped the data of apartment listings and created a model able to predict the rent price of an apartment based on its area, number of rooms, and neighborhood. So why not create a recommendation system to suggest apartments to the user based on the choices made for the prediction?

The full code is on GitHub; a minimal sketch follows the block-by-block walkthrough below.

We’ll go through the details, dividing the code into blocks.

  1. Load the preprocessed data, dfpreprocessed (basically the raw extracted dataset), into “df_original”. Load the dataset used for training, dffinal, into “df”.
  2. Create a dataframe with the user’s input. The columns are “area”, “quartos” (rooms), and “setor” (neighborhood).
  3. Load the OneHotEncoder stored with pickle to “ohe”.
  4. Create the dummy variables of the user’s input categorical feature, which is “setor”, and convert it to the “dummy” dataframe.
  5. Drop the “setor” column from the user’s input dataframe and merge it with the dummy variables of “setor” we just stored in the “dummy” dataframe in step 4.
  6. Then we add “userInput” to “df”, the dataset that is ready for training. With them together, we can run a k-nearest neighbors algorithm and find the apartments in “df” whose data is closest to the user’s input. Since a person is searching for an apartment with those parameters, we can assume it’s interesting to suggest other apartments with similar features from our dataset.
  7. Drop the price from “df” to create the “features” dataframe.
  8. We set the nearest neighbors’ k to 11, since we need the 10 best suggestions based on the user’s input and the closest neighbor will always be the user’s input itself. We also use the ball-tree algorithm to compute the nearest neighbors; I won’t go into details, but this article explains the differences between the algorithms used by KNN. Finally, we fit the “features” dataframe to our KNN model and get two arrays as outputs: one with the indices of the nearest neighbors (for every data point, or in our case, apartment) and the other with the distance to each of them.
  9. Keep in “df_original” only the features we want to show to the user in our output.
  10. Create the “mais_proximos” list to store the indices of the 10 nearest neighbors of the user’s input, which is the first data point of the combined data, since we merged it with the training dataset extracted from the apartment listings.
  11. Finally, we create the “top10” dataset with the 10 nearest neighbors to our user’s input and apply some small transformations to improve the display of the dataframe.
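Putting those blocks together, here is a minimal sketch of the recommendation step. The file names, the target column name (“preco_total”), and the example input values are assumptions; the flow follows the walkthrough above:

```python
import pickle
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Blocks 1-2: load the data and build the user's input
df_original = pd.read_csv("dfpreprocessed.csv")  # raw extracted data, used for display
df = pd.read_csv("dffinal.csv")                  # training-ready data
user_input = pd.DataFrame([{"area": 70, "quartos": 2, "setor": "SQS"}])

# Blocks 3-5: encode the user's neighborhood with the stored OneHotEncoder
with open("ohe.pkl", "rb") as f:
    ohe = pickle.load(f)
dummy = pd.DataFrame(
    ohe.transform(user_input[["setor"]]).toarray(),
    columns=ohe.get_feature_names_out(["setor"]),
)
user_input = pd.concat([user_input.drop(columns="setor"), dummy], axis=1)

# Blocks 6-7: stack the user's input on top of the training data and drop the price
features = pd.concat([user_input, df.drop(columns="preco_total")], ignore_index=True)

# Block 8: 11 neighbors, because the closest one is always the user's input itself
knn = NearestNeighbors(n_neighbors=11, algorithm="ball_tree").fit(features)
distances, indices = knn.kneighbors(features)

# Blocks 9-11: the user's input is row 0, so its remaining neighbors are the suggestions
mais_proximos = indices[0][1:]
top10 = df_original.iloc[mais_proximos - 1]  # shift by 1, assuming df_original aligns row-by-row with df
print(top10)
```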

4) Deploying the application

With all the work done, it was time to deploy the prototype application so that other people could use it. I decided to do it using Streamlit, Docker, and Heroku. I won’t go into specifics or explain that code here because the article would become too dense, but the Streamlit and Docker files are available on GitHub.

The app deployed on Heroku with the prediction and suggestions working

You can check out the prototype deployed on Heroku here!

If you’ve read this far, thanks for your attention, and I hope you enjoyed it. I’m happy to hear any questions or suggestions. =)



Bruno Caraffa — The best Data Scientist in my house. Data Intelligence Coordinator @Wiz Co. @Brasília/Brazil. https://www.linkedin.com/in/brunocaraffa/