
A program to predict a home's sell price.

hbf is a new Python package that (currently) scrapes listing data from Zillow.com, given a starting address and a search box. Because Zillow caps the number of results returned per HTTP request, hbf is limited to 800 results per call.

Eventually the developers would also like to add home sale price prediction capabilities.

This analysis is a first step in that direction using the R programming language to:

  1. Access current hbf functionality, which was recently made available via Python’s package installer, pip.
  2. Determine the potential of both statistical and machine learning models as sale price predictors.
  3. Predict the sale price of a home currently on the market in New Smyrna Beach, Florida.

Example hbf input file

[Address Control]
Street Number:      1160
Street Name:        Fairvilla Dr, New Smyrna Beach
Apartment Number:
City:               New Smyrna Beach
State:              Florida
Zip Code:           32168

[Search Control]
Search Box Half-Width (miles): 0.1

[Zillow Control]
accept: */*
accept-encoding: gzip, deflate, br
accept-language: en-US,en;q=0.9
upgrade-insecure-requests: 1
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)
            AppleWebKit/537.36 (KHTML, like Gecko)
            Chrome/85.0.4183.102
            Safari/537.36

Example hbf call

from hbf.getCleanedDF import getDF

smyrnaSmall = getDF("res.inp")
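One way to make this Python object visible to R (a sketch; the exact setup is not shown here) is to run the scraping code in reticulate's embedded Python session, which exposes Python variables in R under the `py` object:

```r
library(reticulate)

# Run the scraping code in the embedded Python session. Any variables it
# creates (here, smyrnaSmall) become available in R as py$<name>.
py_run_string('
from hbf.getCleanedDF import getDF
smyrnaSmall = getDF("res.inp")
')
```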

Example output as R df

The Python object is now available as an R data frame via the reticulate package!

smyrnaSmall <- janitor::clean_names(py$smyrnaSmall)
tibble::tibble(smyrnaSmall)
## # A tibble: 13 x 11
##    address sell_price  beds baths home_size home_type year_built heating cooling
##    <chr>        <dbl> <dbl> <dbl>     <dbl> <chr>          <dbl> <chr>   <list> 
##  1 915 Fa…     125000     2   2         810 Single F…       1986 Forced… <dbl […
##  2 238 Go…        NaN     3   2        2263 Single F…        NaN Forced… <chr […
##  3 905 Fa…     100000     2   2         810 Single F…       1986 Forced… <dbl […
##  4 1150 F…     115000     2   1         810 Single F…       1985 Forced… <dbl […
##  5 1130 F…     112000     2   2         842 Single F…       1985 Forced… <dbl […
##  6 907 Fa…     100000     2   2         810 Single F…       1986 Forced… <chr […
##  7 1128 F…     114000     2   1         810 Single F…       1985 Forced… <dbl […
##  8 241 Go…     234000     2   3        1757 Single F…       1993 Forced… <chr […
##  9 1140 F…     100000     2   1        1000 Single F…       1985 Forced… <chr […
## 10 254 Go…     302000     3   2        2941 Single F…       1996 Forced… <chr […
## 11 911 Fa…     108931     2   2         994 Townhouse       1986 Forced… <chr […
## 12 9 Fore…     135000     3   2.5      1850 Single F…       1978 Forced… <dbl […
## 13 11 For…        NaN     2   2        1242 Single F…       1978 Forced… <dbl […
## # … with 2 more variables: parking <list>, lot_size <dbl>

Working with a larger dataset

Searching from the same starting point in New Smyrna Beach, FL, but with a 3-mile search box, returns the maximum of 800 results. We will use this larger dataset to fit a linear model.
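Assuming the wider search was run through the same hbf call (with the input file's half-width changed to 3 miles), the result can be pulled into R the same way; the name `smyrnaScrape` is used for it throughout the rest of the analysis:

```r
# Assumes the embedded Python session holds the 800-row scrape as smyrnaScrape
smyrnaScrape <- janitor::clean_names(py$smyrnaScrape)
tibble::tibble(smyrnaScrape)
```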

## # A tibble: 800 x 11
##    address sell_price  beds baths home_size home_type year_built heating cooling
##    <chr>        <dbl> <dbl> <dbl>     <dbl> <chr>          <dbl> <list>  <list> 
##  1 679 Mi…     285000     3     2      1904 Single F…       1991 <chr [… <chr […
##  2 24 For…     180000     2     2      1124 Single F…       1978 <chr [… <chr […
##  3 620 Ba…     300000     2     2      1530 Single F…       1948 <chr [… <chr […
##  4 91 Hea…     260000     2     2       888 Condo           1986 <chr [… <chr […
##  5 17 Fai…     307000     3     2      2645 Single F…       1974 <chr [… <chr […
##  6 563 Ch…     102500     3     1      1214 Single F…       1940 <chr [… <chr […
##  7 706 Fo…     270000     3     2      1838 Single F…       1987 <chr [… <dbl […
##  8 101 N …     235000     2     2      1000 Single F…       1973 <chr [… <chr […
##  9 332 Pa…     180000     3     2      1258 Single F…       2020 <chr [… <chr […
## 10 695 Mi…     204900     2     2      1355 Single F…       1992 <chr [… <chr […
## # … with 790 more rows, and 2 more variables: parking <list>, lot_size <dbl>

Build and fit a linear regression prediction model using tidymodels

From example in tidymodels, “Get Started”:

With tidymodels, we start by specifying the functional form of the model that we want using the parsnip package. Since there is a numeric outcome and the model should be linear with slopes and intercepts, the model type is “linear regression”.

Since this model will have more than one predictor, it will be type “multiple linear regression”.

# 1. Define the ordinary least squares linear model engine.
lm_mod <- 
  linear_reg() %>% # from `parsnip`
  set_engine("lm")

# 2. Estimate or train using the `fit()` function.
lm_fit <- 
  lm_mod %>% 
  fit(sell_price ~ beds + baths + home_size + year_built + lot_size,
      data = smyrnaScrape)

tidy(lm_fit)
## # A tibble: 6 x 5
##   term           estimate   std.error statistic  p.value
##   <chr>             <dbl>       <dbl>     <dbl>    <dbl>
## 1 (Intercept) 730646.     502772.        1.45   1.47e- 1
## 2 beds           825.      11085.        0.0744 9.41e- 1
## 3 baths        82947.      12611.        6.58   1.24e-10
## 4 home_size       62.1         9.73      6.39   3.95e-10
## 5 year_built    -398.        256.       -1.55   1.21e- 1
## 6 lot_size         0.0125      0.0254    0.493  6.22e- 1

Create example data for new predictions

New points are built by varying home square footage.

new_points <- expand.grid(beds = 2,
                          baths = 1,
                          home_size = c(800, 1000, 1200, 1500),
                          year_built = 1985,
                          lot_size = 1500
                          )
new_points
##   beds baths home_size year_built lot_size
## 1    2     1       800       1985     1500
## 2    2     1      1000       1985     1500
## 3    2     1      1200       1985     1500
## 4    2     1      1500       1985     1500

Use the predict() function to find the mean predicted sale price for each of the new points.

mean_pred <- predict(lm_fit, new_data = new_points)
mean_pred
## # A tibble: 4 x 1
##     .pred
##     <dbl>
## 1  75613.
## 2  88036.
## 3 100460.
## 4 119095.

With tidymodels, the types of predicted values are standardized so that we can use the same syntax to get these values across model types using predict().

When making predictions, the tidymodels convention is to always produce a tibble of results with standardized column names. This makes it easy to combine the original data and the predictions in a usable format:

conf_int_pred <- predict(lm_fit, 
                         new_data = new_points, 
                         type = "conf_int")
conf_int_pred
## # A tibble: 4 x 2
##   .pred_lower .pred_upper
##         <dbl>       <dbl>
## 1      51323.      99902.
## 2      63916.     112156.
## 3      75908.     125011.
## 4      92846.     145345.

It is now straightforward to make a plot.

# Now combine:
plot_data <- 
  new_points %>% 
  bind_cols(mean_pred) %>% 
  bind_cols(conf_int_pred)

# and plot:
ggplot(plot_data, aes(x = home_size)) + 
  geom_point(aes(y = .pred)) + 
  geom_errorbar(aes(ymin = .pred_lower, 
                    ymax = .pred_upper),
                width = .2) + 
  labs(x = "home size (square-ft)",
       y = "sale price ($)")

Fitting a Bayesian model

Re-purposing the previous fit() call is straightforward. Notice that it is unchanged; we only need to define a new engine for it to operate on.

First, set the prior distribution to be the default as discussed at Prior Distributions for rstanarm Models.

# set the prior distribution
prior_dist <- 
  rstanarm::stan_glm(sell_price ~ beds + baths + home_size + year_built + lot_size,
                     data = smyrnaScrape, chains = 1) # using the default priors
## 
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 0.000125 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 1.25 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 1.3903 seconds (Warm-up)
## Chain 1:                0.110121 seconds (Sampling)
## Chain 1:                1.50042 seconds (Total)
## Chain 1:
set.seed(123) # for reproducibility

# make the parsnip model
bayes_mod <-   
  linear_reg() %>% 
  set_engine("stan", 
             prior_intercept = prior_dist$prior.info$prior_intercept, 
             prior =  prior_dist$prior.info$prior)
# train the model
bayes_fit <- 
  bayes_mod %>% 
  fit(sell_price ~ beds + baths + home_size + year_built + lot_size, data = smyrnaScrape)

print(bayes_fit, digits = 5)
## parsnip model object
## 
## Fit time:  1m 0.4s 
## stan_glm
##  family:       gaussian [identity]
##  formula:      sell_price ~ beds + baths + home_size + year_built + lot_size
##  observations: 493
##  predictors:   6
## ------
##             Median       MAD_SD      
## (Intercept) 215619.60310   6485.64949
## beds             0.08791      2.39662
## baths            0.01896      2.52481
## home_size        8.09597      2.51432
## year_built       0.09409      2.49585
## lot_size         0.01536      0.03170
## 
## Auxiliary parameter(s):
##       Median       MAD_SD      
## sigma 175173.11326   5667.42774
## 
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg
tidy(bayes_fit, conf.int = TRUE)
## # A tibble: 6 x 5
##   term           estimate std.error    conf.low   conf.high
##   <chr>             <dbl>     <dbl>       <dbl>       <dbl>
## 1 (Intercept) 215620.     6486.     205193.     226538.    
## 2 beds             0.0879    2.40       -4.11        4.12  
## 3 baths            0.0190    2.52       -4.07        4.11  
## 4 home_size        8.10      2.51        3.91       12.2   
## 5 year_built       0.0941    2.50       -3.96        3.97  
## 6 lot_size         0.0154    0.0317     -0.0350      0.0687

Using the tidymodels modeling workflow

Start by loading the data and filtering out very inexpensive and very expensive recently sold homes that may skew predictions, then do some pre-processing, including removal of rows with missing values, before applying a machine learning method.

smyrnaScrape <-
  smyrnaScrape %>%
  # Remove heating/cooling vars until can figure out how to standardize
  select(-heating, -cooling, -parking) %>%
  # Filter out irrelevant "homes"
  filter(sell_price > 100000, sell_price < 750000) %>%
  # Exclude missing data
  na.omit() 

# Convert home type to factor
smyrnaScrape$home_type <- factor(smyrnaScrape$home_type)

We can see how the average sale price varies across price quintiles.
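The code that produced this summary is not shown; one way to build it (a sketch, binning sell_price into five roughly equal-count groups with ggplot2::cut_number()) is:

```r
library(dplyr)

smyrnaScrape %>%
  # Bin sale prices into five groups of approximately equal size
  mutate(group = ggplot2::cut_number(sell_price, n = 5)) %>%
  group_by(group) %>%
  summarise(n = n(), mean_sale = mean(sell_price))
```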

## # A tibble: 5 x 3
##   group               n mean_sale
##   <fct>           <int>     <dbl>
## 1 [101000,150600)    92   130280.
## 2 [150600,186000)    89   168883.
## 3 [186000,220000)    81   203232.
## 4 [220000,284000)    87   248642 
## 5 [284000,735000]    87   393459.

Data splitting

To get started, let’s split this single dataset into two: a training set and a testing set. We’ll keep most of the rows in the original dataset (subset chosen randomly) in the training set. The training data will be used to fit the model, and the testing set will be used to measure model performance.

To do this, we can use the rsample package to create an object that contains the information on how to split the data, and then two more rsample functions to create data frames for the training and testing sets:

# Fix the random numbers by setting the seed 
# This enables the analysis to be reproducible when random numbers are used 
set.seed(555)
# Put 3/4 of the data into the training set 
data_split <- initial_split(smyrnaScrape, prop = 3/4)

# Create data frames for the two sets:
train_data <- training(data_split)
test_data  <- testing(data_split)

Create recipe and roles and finalize features

To get started, let’s create a recipe for a simple regression model. Before training the model, we can use a recipe to create a few new predictors and conduct some pre-processing required by the model.

Let’s initiate a new recipe and add a role to this recipe. We can use the update_role() function to let recipes know that address and home_type are variables with a custom role that we will call “ID” (a role can have any character value). Whereas our formula included all variables in the training set other than sell_price as predictors, this tells the recipe to keep address and home_type but not use them as either outcomes or predictors.

smyrna_rec <- 
  # '.' means use all other vars as predictors
  recipe(sell_price ~ ., data = train_data) %>% 
  update_role(address, home_type, new_role = "ID") %>%
  # Remove columns from the data when the training set data have a single value
  # I.e., if it is a "zero-variance predictor" that has no information within the column
  step_zv(all_predictors())

This step of adding roles to a recipe is optional; the purpose of using it here is that address and home_type can be retained in the data but not included in the model. This can be convenient when, after the model is fit, we want to investigate some poorly predicted value. ID columns will be available and can be used to try to understand what went wrong.

To get the current set of variables and roles, use the summary() function.

summary(smyrna_rec)
## # A tibble: 8 x 4
##   variable   type    role      source  
##   <chr>      <chr>   <chr>     <chr>   
## 1 address    nominal ID        original
## 2 beds       numeric predictor original
## 3 baths      numeric predictor original
## 4 home_size  numeric predictor original
## 5 home_type  nominal ID        original
## 6 year_built numeric predictor original
## 7 lot_size   numeric predictor original
## 8 sell_price numeric outcome   original

Now we’ve created a specification of what should be done with the data. How do we use the recipe we made?

Fit a model with a recipe

Let’s first use linear regression to model the home sale data, similar to what was done initially.

lr_mod <- 
  linear_reg() %>% 
  set_engine("lm")

smyrna_wflow <- 
  workflow() %>% 
  add_model(lr_mod) %>% 
  add_recipe(smyrna_rec)

smyrna_wflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
## 
## ● step_zv()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm

Now, there is a single function that can be used to prepare the recipe and train the model from the resulting predictors.

smyrna_fit <- 
  smyrna_wflow %>% 
  fit(data = train_data)

This object has the finalized recipe and fitted model objects inside. You may want to extract the model or recipe objects from the workflow. To do this, you can use the helper functions pull_workflow_fit() and pull_workflow_prepped_recipe(). For example, here we pull the fitted model object then use the broom::tidy() function to get a tidy tibble of model coefficients.

smyrna_fit %>% 
  pull_workflow_fit() %>% 
  tidy()
## # A tibble: 6 x 5
##   term           estimate   std.error statistic  p.value
##   <chr>             <dbl>       <dbl>     <dbl>    <dbl>
## 1 (Intercept) -18064.     329766.       -0.0548 9.56e- 1
## 2 beds         11580.       7271.        1.59   1.12e- 1
## 3 baths        35877.       9458.        3.79   1.78e- 4
## 4 home_size       42.3         6.51      6.49   3.23e-10
## 5 year_built      29.9       168.        0.178  8.59e- 1
## 6 lot_size         0.0170      0.0132    1.28   2.00e- 1

Use a trained workflow to predict

Our goal was to predict the sale price of a home in New Smyrna Beach, FL. We have just:

  1. Built the model (lr_mod),

  2. Created a pre-processing recipe (smyrna_rec),

  3. Bundled the model and recipe (smyrna_wflow), and

  4. Trained our workflow using a single call to fit().

The next step is to use the trained workflow (smyrna_fit) to predict with the unseen test data, which we will do with a single call to predict(). The predict() method applies the recipe to the new data, then passes them to the fitted model.

smyrna_pred <- 
  predict(smyrna_fit, test_data) %>%
  bind_cols(test_data %>% select(sell_price, beds, baths, home_size))

smyrna_pred
## # A tibble: 109 x 5
##      .pred sell_price  beds baths home_size
##      <dbl>      <dbl> <dbl> <dbl>     <dbl>
##  1 183573.     180000     2     2      1124
##  2 225640.     270000     3     2      1838
##  3 202140.     180000     3     2      1258
##  4 330313.     715000     4     4      2380
##  5 170477.     125000     2     2       810
##  6 190588.     218500     2     2      1290
##  7 198515.     227000     2     2      1468
##  8 139182.     169900     2     1       963
##  9 132655.     125900     2     1       804
## 10 299311.     535000     5     3      2245
## # … with 99 more rows

Now that we have a tibble with our predictions, how will we evaluate the performance of our workflow? We would like to calculate a metric that tells how well our model predicted sale prices, compared to the true values of our outcome variable, sell_price.

smyrna_pred %>% 
  rmse(sell_price, .pred)
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard     105780.

Moving to machine learning

Next, we will build on what we’ve learned and create a random forest model by simply defining a new workflow!

rf_mod <- 
  rand_forest(trees = 1000) %>% 
  set_engine("ranger") %>% 
  set_mode("regression")

smyrna_rf_wflow <- 
  workflow() %>% 
  add_model(rf_mod) %>% 
  add_recipe(smyrna_rec)
set.seed(234)
smyrna_rf_fit <-
  smyrna_rf_wflow  %>% 
  fit(data = train_data)

smyrna_rf_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
## 
## ● step_zv()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Ranger result
## 
## Call:
##  ranger::ranger(formula = ..y ~ ., data = data, num.trees = ~1000,      num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1)) 
## 
## Type:                             Regression 
## Number of trees:                  1000 
## Sample size:                      327 
## Number of independent variables:  5 
## Mtry:                             2 
## Target node size:                 5 
## Variable importance mode:         none 
## Splitrule:                        variance 
## OOB prediction error (MSE):       4817952011 
## R squared (OOB):                  0.472358
smyrna_rf_pred <- 
  predict(smyrna_rf_fit, test_data) %>%
  bind_cols(test_data %>% select(sell_price, beds, baths, home_size))

smyrna_rf_pred
## # A tibble: 109 x 5
##      .pred sell_price  beds baths home_size
##      <dbl>      <dbl> <dbl> <dbl>     <dbl>
##  1 152816.     180000     2     2      1124
##  2 214435.     270000     3     2      1838
##  3 209350.     180000     3     2      1258
##  4 348550.     715000     4     4      2380
##  5 145537.     125000     2     2       810
##  6 189050.     218500     2     2      1290
##  7 185739.     227000     2     2      1468
##  8 180856.     169900     2     1       963
##  9 181623.     125900     2     1       804
## 10 348442.     535000     5     3      2245
## # … with 99 more rows

We can now see that the random forest picks up some of the structure in the data that the linear regression model, which can only represent a linear fit, misses.
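To put a number on the comparison, the random forest predictions can be scored with the same yardstick RMSE used for the linear model above (the value was not reported in the original run, so none is shown here):

```r
# Same metric as computed for the linear model predictions earlier
smyrna_rf_pred %>%
  yardstick::rmse(sell_price, .pred)
```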

Let’s try to predict the sale price of 1160 Fairvilla Dr, New Smyrna Beach, FL 32168 using both models.

new_data <- data.frame(address = "1160 Fairvilla Dr, New Smyrna Beach, FL 32168",
                       beds = 2,
                       baths = 1,
                       home_size = 816,
                       home_type = "Single Family",
                       year_built = 1985,
                       lot_size = 816)

bind_cols(predict(smyrna_fit, new_data) %>% rename(lm = .pred),
          predict(smyrna_rf_fit, new_data)%>% rename(rf = .pred)
)
## # A tibble: 1 x 2
##        lm      rf
##     <dbl>   <dbl>
## 1 134824. 126690.

The Zestimate for this property (as of this writing) is $115,170 with a range of $105,000 - $124,000, and the list price is $125,000. Both of our fitted models fall outside the Zestimate range, though the random forest estimate is only just outside it. This is not surprising, since we did not go out of our way to add any sophistication to our feature set and we are working with only 436 data points, but it is still good to be this close with so little effort!

Of course the real validation point will be the actual sale price 😉.

Estimating performance