This walkthrough accompanies your reading “Building predictive models”. The goal in this activity is to build a linear model to predict the sale price of a house in Saratoga County. This is a little different than how we’ve used this dataset in the past; we’re not primarily interested in using this model to understand partial relationships, but rather to build a prediction machine.

We’ll learn how to: 1. Estimate the error in our prediction for a new observation (a new house), measured by root mean square prediction error 2. Build a prediction model semi-automatically using stepwise selection

Estimating out-of-sample prediction error

When building a predictive model, our goal is usually to make predictions for data where we don’t have the response variable in hand. In the context of this example, think about predicting the sale price of a house in the future using historical data. We can use simulation to do this – we hold back a fraction of the data from our model to represent our “future” house sales (we call this the testing dataset, since we use it to evaluate our models), and treat the remaining observations as the “historical” sales (we call this the “training” dataset, as it’s used to train our models to make predictions).

Creating training and testing datasets

It’s important to do the train/test split randomly. The easiest way to do this is to shuffle the entire dataset, randomly ordering the rows, and then use the first \(k\) observations as the training dataset and the last \(n-k\) observations as the testing dataset. Here we’ll use 80% of the data to fit models, and 20% to evaluate models:

library(mosaic)
data("SaratogaHouses")

# Remember, setting the seed gives you the same (pseudo)random numbers
# every time
set.seed(1022)
# Randomly re-order the observations:
saratoga_shuffle = shuffle(SaratogaHouses)
# Remove the original id, we don't need it
saratoga_shuffle = subset(saratoga_shuffle, select=(-orig.id))

# Create the 80/20 percent split:
n = nrow(SaratogaHouses)
n_train = round(0.8*n)  # round to nearest integer

# Make training and testing datasets
saratoga_train = saratoga_shuffle[1:n_train,]
saratoga_test  = saratoga_shuffle[(n_train+1):n,]

Fitting and evaluating models

Let’s fit the small, medium, and big models from your reading (p174) using the training dataset:

# Small model
lm1 = lm(price ~ lotSize + bedrooms + bathrooms, data=saratoga_train)

# Medium model. Sometimes it's easier to name the variables we want 
# to leave out of the model. In the command below,
# the dot (.) means "all variables not named"
# the minus (-) means "exclude this variable"
lm2 = lm(price ~ . - sewer - waterfront - landValue - newConstruction, data=saratoga_train)
summary(lm2)
## 
## Call:
## lm(formula = price ~ . - sewer - waterfront - landValue - newConstruction, 
##     data = saratoga_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -239676  -40616   -8448   27901  525394 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             22006.955  13885.434   1.585 0.113221    
## lotSize                  9088.666   2624.067   3.464 0.000549 ***
## age                        88.533     74.047   1.196 0.232051    
## livingArea                 93.586      5.647  16.572  < 2e-16 ***
## pctCollege                423.046    186.631   2.267 0.023561 *  
## bedrooms               -16317.717   3284.268  -4.968 7.60e-07 ***
## fireplaces                189.884   3789.896   0.050 0.960048    
## bathrooms               20941.562   4283.777   4.889 1.14e-06 ***
## rooms                    3349.825   1250.418   2.679 0.007474 ** 
## heatinghot water/steam -10563.867   5346.186  -1.976 0.048360 *  
## heatingelectric         -6789.490  15646.664  -0.434 0.664411    
## fuelelectric            -5829.100  15356.121  -0.380 0.704305    
## fueloil                 -9814.651   6212.463  -1.580 0.114376    
## centralAirNo           -17098.688   4509.281  -3.792 0.000156 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 67310 on 1368 degrees of freedom
## Multiple R-squared:  0.5465, Adjusted R-squared:  0.5422 
## F-statistic: 126.8 on 13 and 1368 DF,  p-value: < 2.2e-16
# The model below includes all the same variables plus their interactions;
# The (var1+var2+var3)^2 tells R to include all pairwise interactions and
# main effects
lm3 = lm(price ~ (. - sewer - waterfront - landValue - newConstruction)^2, data=saratoga_train)
# Check out the summary table on your own:
# summary(lm3) 

Once we fit the models, we get predictions for the test set and compute the root mean squared prediction error (see “Measuring generalization error”, p 164)

yhat_test1 = predict(lm1, newdata = saratoga_test)
yhat_test2 = predict(lm2, newdata = saratoga_test)
yhat_test3 = predict(lm3, newdata = saratoga_test)
## Warning in predict.lm(lm3, newdata = saratoga_test): prediction from a rank-
## deficient fit may be misleading
sqrt( mean( (saratoga_test$price - yhat_test1)^2 ) )
## [1] 70705.41
sqrt( mean( (saratoga_test$price - yhat_test2)^2 ) )
## [1] 62280.08
sqrt( mean( (saratoga_test$price - yhat_test3)^2 ) )
## [1] 61191.87

We see that the smallest model has RMSPE of about $76,047, and the largest model has a similiar RMSPE of $74,876. However, the medium model has a substantially lower RMPSE than either – $67,135. What’s going on? The small model is too simple to make good predictions, and the big model is so complicated that it can “learn” random patterns in the training set that don’t occur in the testing dataset – we say it is overfitting the data.

In statistical parlance, we say that the small model predictions have high bias – the best prediction we can make using a linear model of those four variables isn’t very good. On the other hand, the large model predictions have high variance – the large model will tend to capture chance associations that are particular to one random sample, rather than learning associations that we are likely to see in future data. As a result the predictions will vary widely between models fit to different training samples.

By focusing on prediction error out of sample (on new data) we hope to find the Goldilocks model – one that’s complex enough to make good predictions, but not so complex that it’s unstable and prone to overfitting. Note that just looking at prediction error in sample (using the same data used to fit the model) we can’t diagnose overfitting:

sqrt( mean( (saratoga_train$price - fitted(lm1))^2 ) )
## [1] 78622.7
sqrt( mean( (saratoga_train$price - fitted(lm2))^2 ) )
## [1] 66970.18
sqrt( mean( (saratoga_train$price - fitted(lm3))^2 ) )
## [1] 61934.72

Looking at RMSPE in-sample for the large model, we get the dramatically over-optimistic prediction error estimate of \(60,526\) (versus $74,876 using new data).

Finally, there is some randomness inherent in all these estimates of RMSPE; ideally we would average over multiple train/test splits using a do-loop:

pred_error_est = do(10) * {
  saratoga_shuffle = shuffle(SaratogaHouses)
  # Remove the original id, we don't need it
  saratoga_shuffle = subset(saratoga_shuffle, select=(-orig.id))
  
  # Create the 80/20 percent split:
  n = nrow(SaratogaHouses)
  n_train = round(0.8*n)  # round to nearest integer
  
  saratoga_train = saratoga_shuffle[1:n_train,]
  saratoga_test  = saratoga_shuffle[(n_train+1):n,]
  
  lm1 = lm(price ~ lotSize + bedrooms + bathrooms, data=saratoga_train)
  lm2 = lm(price ~ . - sewer - waterfront - landValue - newConstruction, data=saratoga_train)
  lm3 = lm(price ~ (. - sewer - waterfront - landValue - newConstruction)^2, data=saratoga_train)
  yhat_test1 = predict(lm1, newdata = saratoga_test)
  yhat_test2 = predict(lm2, newdata = saratoga_test)
  yhat_test3 = predict(lm3, newdata = saratoga_test)
  
  c(
    rmspe_small = sqrt( mean( (saratoga_test$price - yhat_test1)^2 ) ),
    rmspe_med = sqrt( mean( (saratoga_test$price - yhat_test2)^2 ) ),
    rmspe_large = sqrt( mean( (saratoga_test$price - yhat_test3)^2 ) )
  )

}

print(pred_error_est)
##    rmspe_small rmspe_med rmspe_large
## 1     80919.02  70825.05    75545.65
## 2     69387.81  59839.74    67370.83
## 3     76735.38  67406.03    69471.90
## 4     75503.07  62293.50    65911.76
## 5     67887.87  57715.45    57438.05
## 6     73934.81  64733.59    68078.61
## 7     79142.84  66964.82    68730.15
## 8     69053.48  61511.90    61699.24
## 9     79234.51  70994.44   126529.88
## 10    71047.57  59018.64    62154.65
print(summary(pred_error_est))
##   rmspe_small      rmspe_med      rmspe_large    
##  Min.   :67888   Min.   :57715   Min.   : 57438  
##  1st Qu.:69803   1st Qu.:60258   1st Qu.: 63094  
##  Median :74719   Median :63514   Median : 67725  
##  Mean   :74285   Mean   :64130   Mean   : 72293  
##  3rd Qu.:78541   3rd Qu.:67296   3rd Qu.: 69286  
##  Max.   :80919   Max.   :70994   Max.   :126530

From the output above we see that 1) looking at the estimated RMSPE in each of the ten splits, depending on the random split we choose we can actually get an estimate of RMSPE for the large model that’s smaller than the medium model – but 2) in the summary, when averaging across all the splits the medium model still has lower RMSPE (although the gap wasn’t as large as we saw in the first split).

Stepwise selection

For a complete description of stepwise selection, see p. 179 of your reading. The basic idea is to start from some linear model – large or small – and add or remove terms one at a time to try to improve estimated out of sample prediction error. In R this is straightforward. We can either start small and build up, or start large and pare back. Stepwise search is a heuristic, so the two approaches can give different answers.

# Option 1: Starting from a large model and deleting variables
# Here the scope is all the variables in the original model.
# If you want to see each step, delete the trace=0 argument below.
# This will generate *a lot* of output!
lm_big = lm(price ~ (. - sewer - waterfront - landValue - newConstruction)^2, data=saratoga_train) 
lm_step = step(lm_big, trace=0)

# Option 2: Starting from a smaller model,
# explicitly listing variables in the scope for selection,
# and adding/deleting variables from the scope
lm_start = lm(price ~ lotSize + age + livingArea, data=SaratogaHouses)
# If you go this route, you can't use the ~ . notation above in the scope
# argument
lm_step2 = step(lm_start, trace=0, scope = price ~ (lotSize + age + livingArea + pctCollege + bedrooms + fireplaces + bathrooms + rooms + heating + fuel + centralAir)^2, data=saratoga_train)


# These different starting points lead to different selected models here:
summary(lm_step)  
## 
## Call:
## lm(formula = price ~ lotSize + age + livingArea + pctCollege + 
##     bedrooms + fireplaces + bathrooms + rooms + heating + fuel + 
##     centralAir + lotSize:age + lotSize:bedrooms + lotSize:fireplaces + 
##     age:pctCollege + age:fireplaces + age:centralAir + livingArea:fireplaces + 
##     livingArea:fuel + livingArea:centralAir + pctCollege:fireplaces + 
##     pctCollege:bathrooms + pctCollege:fuel + bedrooms:heating + 
##     fireplaces:fuel + bathrooms:heating + rooms:heating + rooms:fuel, 
##     data = saratoga_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -237548  -35331   -6280   27174  487209 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       48921.091  40003.243   1.223 0.221571    
## lotSize                            1898.545  10099.869   0.188 0.850923    
## age                               -2322.047    429.225  -5.410 7.46e-08 ***
## livingArea                           96.682      8.624  11.211  < 2e-16 ***
## pctCollege                         -367.965    666.572  -0.552 0.581023    
## bedrooms                         -19243.308   4340.931  -4.433 1.01e-05 ***
## fireplaces                        41413.468  20730.977   1.998 0.045955 *  
## bathrooms                         -4064.217  17719.080  -0.229 0.818617    
## rooms                              4323.103   1516.033   2.852 0.004417 ** 
## heatinghot water/steam            42588.324  20207.519   2.108 0.035255 *  
## heatingelectric                   84936.836  51130.315   1.661 0.096910 .  
## fuelelectric                     -11384.061  51248.773  -0.222 0.824244    
## fueloil                           74488.663  30687.418   2.427 0.015341 *  
## centralAirNo                      14453.332  14753.756   0.980 0.327441    
## lotSize:age                        -211.633     91.233  -2.320 0.020507 *  
## lotSize:bedrooms                   5529.358   3564.390   1.551 0.121071    
## lotSize:fireplaces                -8052.417   4274.708  -1.884 0.059817 .  
## age:pctCollege                       33.737      6.792   4.967 7.67e-07 ***
## age:fireplaces                     -170.943    121.382  -1.408 0.159274    
## age:centralAirNo                    817.940    226.052   3.618 0.000307 ***
## livingArea:fireplaces                13.311      4.003   3.325 0.000907 ***
## livingArea:fuelelectric              -4.155     14.924  -0.278 0.780754    
## livingArea:fueloil                  -43.643     14.866  -2.936 0.003383 ** 
## livingArea:centralAirNo             -23.164      6.986  -3.316 0.000938 ***
## pctCollege:fireplaces             -1173.571    344.633  -3.405 0.000680 ***
## pctCollege:bathrooms                601.741    306.850   1.961 0.050082 .  
## pctCollege:fuelelectric           -1576.334    464.251  -3.395 0.000705 ***
## pctCollege:fueloil                -1011.196    455.778  -2.219 0.026679 *  
## bedrooms:heatinghot water/steam    7231.651   7839.561   0.922 0.356456    
## bedrooms:heatingelectric          23830.036   8588.602   2.775 0.005603 ** 
## fireplaces:fuelelectric           24649.069   9646.207   2.555 0.010719 *  
## fireplaces:fueloil                21289.740  11115.521   1.915 0.055664 .  
## bathrooms:heatinghot water/steam -14281.815   8961.736  -1.594 0.111251    
## bathrooms:heatingelectric        -25315.770  11266.181  -2.247 0.024798 *  
## rooms:heatinghot water/steam      -6411.118   2992.144  -2.143 0.032320 *  
## rooms:heatingelectric            -18186.339   7990.073  -2.276 0.022996 *  
## rooms:fuelelectric                12853.811   7913.915   1.624 0.104567    
## rooms:fueloil                      4490.504   3597.157   1.248 0.212121    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 63250 on 1344 degrees of freedom
## Multiple R-squared:  0.5986, Adjusted R-squared:  0.5875 
## F-statistic: 54.17 on 37 and 1344 DF,  p-value: < 2.2e-16
summary(lm_step2)
## 
## Call:
## lm(formula = price ~ lotSize + age + livingArea + centralAir + 
##     bathrooms + bedrooms + fuel + rooms + pctCollege + livingArea:centralAir + 
##     livingArea:fuel + age:fuel + age:pctCollege + age:centralAir + 
##     fuel:pctCollege + centralAir:fuel + livingArea:bathrooms + 
##     bathrooms:bedrooms, data = SaratogaHouses)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -291799  -36274   -7851   26620  508994 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -17758.007  26623.169  -0.667 0.504854    
## lotSize                     8096.028   2340.130   3.460 0.000554 ***
## age                        -1834.077    363.130  -5.051 4.87e-07 ***
## livingArea                    75.650     12.919   5.856 5.69e-09 ***
## centralAirNo               25501.718  13411.892   1.901 0.057415 .  
## bathrooms                  38081.733  10035.263   3.795 0.000153 ***
## bedrooms                   14380.751   7343.780   1.958 0.050367 .  
## fuelelectric               23082.259  31304.021   0.737 0.461006    
## fueloil                   133857.602  28684.610   4.667 3.30e-06 ***
## rooms                       2838.173   1052.961   2.695 0.007099 ** 
## pctCollege                    24.899    253.886   0.098 0.921885    
## livingArea:centralAirNo      -31.053      6.074  -5.112 3.54e-07 ***
## livingArea:fuelelectric        3.154      9.298   0.339 0.734477    
## livingArea:fueloil           -36.508      8.164  -4.472 8.26e-06 ***
## age:fuelelectric             436.546    364.698   1.197 0.231471    
## age:fueloil                 -348.616    133.154  -2.618 0.008919 ** 
## age:pctCollege                25.512      5.435   4.694 2.89e-06 ***
## age:centralAirNo             646.485    203.735   3.173 0.001535 ** 
## fuelelectric:pctCollege    -1020.763    426.065  -2.396 0.016692 *  
## fueloil:pctCollege          -776.743    402.289  -1.931 0.053672 .  
## centralAirNo:fuelelectric   7517.787  10695.955   0.703 0.482237    
## centralAirNo:fueloil      -32182.249  13194.958  -2.439 0.014830 *  
## livingArea:bathrooms          17.671      4.522   3.908 9.69e-05 ***
## bathrooms:bedrooms        -14957.385   3603.121  -4.151 3.47e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 63640 on 1704 degrees of freedom
## Multiple R-squared:  0.5876, Adjusted R-squared:  0.582 
## F-statistic: 105.6 on 23 and 1704 DF,  p-value: < 2.2e-16

Let’s evaluate the two stepwise models via RMSPE against our three original linear models, averaging over 10 training/testing splits:

pred_error_est = do(10) * {
  saratoga_shuffle = shuffle(SaratogaHouses)
  # Remove the original id, we don't need it
  saratoga_shuffle = subset(saratoga_shuffle, select=(-orig.id))
  
  # Create the 80/20 percent split:
  n = nrow(SaratogaHouses)
  n_train = round(0.8*n)  # round to nearest integer
  
  saratoga_train = saratoga_shuffle[1:n_train,]
  saratoga_test  = saratoga_shuffle[(n_train+1):n,]
  
  lm1 = lm(price ~ lotSize + bedrooms + bathrooms, data=saratoga_train)
  lm2 = lm(price ~ . - sewer - waterfront - landValue - newConstruction, data=saratoga_train)
  lm3 = lm(price ~ (. - sewer - waterfront - landValue - newConstruction)^2, data=saratoga_train)
  
  # Stepwise models
  lm_step = step(lm3, trace=0)
  
  lm_start = lm(price ~ lotSize + age + livingArea, data=saratoga_train)
  lm_step2 = step(lm_start, scope = price ~ (lotSize + age + livingArea + pctCollege + bedrooms + fireplaces + bathrooms + rooms + heating + fuel + centralAir)^2, data=saratoga_train, trace=0)
  
  yhat_test1 = predict(lm1, newdata = saratoga_test)
  yhat_test2 = predict(lm2, newdata = saratoga_test)
  yhat_test3 = predict(lm3, newdata = saratoga_test)
  yhat_test_step = predict(lm_step, newdata = saratoga_test)
  yhat_test_step2 = predict(lm_step2, newdata = saratoga_test)
  
  c(
    rmspe_small = sqrt( mean( (saratoga_test$price - yhat_test1)^2 ) ),
    rmspe_med = sqrt( mean( (saratoga_test$price - yhat_test2)^2 ) ),
    rmspe_large = sqrt( mean( (saratoga_test$price - yhat_test3)^2 ) ),
    rmspe_step = sqrt( mean( (saratoga_test$price - yhat_test_step)^2 ) ),
    rmspe_step2 = sqrt( mean( (saratoga_test$price - yhat_test_step2)^2 ) )
  )

}

print(pred_error_est)
##    rmspe_small rmspe_med rmspe_large rmspe_step rmspe_step2
## 1     81708.61  72584.97    73490.22   72558.24    72504.37
## 2     81593.28  70931.59    70448.49   70450.05    69874.41
## 3     69492.07  57854.49    64647.91   63462.84    62776.48
## 4     64041.18  58996.14    60052.22   59485.00    60478.86
## 5     85536.22  74134.83    72576.21   74322.24    72690.42
## 6     82782.95  71576.61    72667.12   71759.76    71843.64
## 7     77661.81  63445.92    62997.90   63008.93    62481.75
## 8     75480.72  64434.26    63542.00   62681.70    62237.73
## 9     83188.49  68861.48    68147.69   67436.74    68093.51
## 10    70989.60  65224.84    67259.55   63689.14    65094.11
print(summary(pred_error_est))
##   rmspe_small      rmspe_med      rmspe_large      rmspe_step   
##  Min.   :64041   Min.   :57854   Min.   :60052   Min.   :59485  
##  1st Qu.:72112   1st Qu.:63693   1st Qu.:63818   1st Qu.:63122  
##  Median :79628   Median :67043   Median :67704   Median :65563  
##  Mean   :77247   Mean   :66805   Mean   :67583   Mean   :66885  
##  3rd Qu.:82514   3rd Qu.:71415   3rd Qu.:72044   3rd Qu.:71432  
##  Max.   :85536   Max.   :74135   Max.   :73490   Max.   :74322  
##   rmspe_step2   
##  Min.   :60479  
##  1st Qu.:62555  
##  Median :66594  
##  Mean   :66808  
##  3rd Qu.:71351  
##  Max.   :72690

Looking at the summary output, we see that the two stepwise models and the medium model all have very similar estimated RMSPE – despite the two stepwise models being more complex. In datasets with more variables and larger sample sizes, stepwise selection will often find a better predictive model. In this case we could comfortably use the medium model, knowing that stepwise selection failed to generate substaintial improvement.

Stepwise Selection: Summary

Stepwise selection is a useful tool for finding good predictive models, but it has some drawbacks; first, different starting points may give different answers (althgouh they will usually perform similarly for prediction). Second, the stepwise search procedure uses an approximation to out of sample prediction error as its guide. When the sample size isn’t large relative to the number of regression coefficients in the model, the assumptions behind these approximations can easily break down. Third, you lose out on the ability to check the model using residual plots, checking for outliers, etc. It’s always a good idea to check the final model, even though we skipped that step above.

Finally, an important caveat your textbook doesn’t mention: After running stepwise selection, confidence intervals and hypothesis tests using the same data that was used to select the model are invalid – all of our confidence intervals and hypothesis tests assume that the model was chosen without using the data. One can get around this by using part of the data for selecting a model and another part of the data to get confidence intervals or conduct hypothesis tests. But in general stepwise selection is best reserved for cases where we are just trying to get a good predictive model.