This walkthrough accompanies your reading “Building predictive models”. The goal in this activity is to build a linear model to predict the sale price of a house in Saratoga County. This is a little different than how we’ve used this dataset in the past; we’re not primarily interested in using this model to understand partial relationships, but rather to build a prediction machine.
We’ll learn how to: 1. Estimate the error in our prediction for a new observation (a new house), measured by root mean square prediction error 2. Build a prediction model semi-automatically using stepwise selection
When building a predictive model, our goal is usually to make predictions for data where we don’t have the response variable in hand. In the context of this example, think about predicting the sale price of a house in the future using historical data. We can use simulation to do this – we hold back a fraction of the data from our model to represent our “future” house sales (we call this the testing dataset, since we use it to evaluate our models), and treat the remaining observations as the “historical” sales (we call this the “training” dataset, as it’s used to train our models to make predictions).
It’s important to do the train/test split randomly. The easiest way to do this is to shuffle the entire dataset, randomly ordering the rows, and then use the first \(k\) observations as the training dataset and the last \(n-k\) observations as the testing dataset. Here we’ll use 80% of the data to fit models, and 20% to evaluate models:
library(mosaic)
data("SaratogaHouses")
# Remember, setting the seed gives you the same (pseudo)random numbers
# every time
set.seed(1022)
# Randomly re-order the observations:
saratoga_shuffle = shuffle(SaratogaHouses)
# Remove the original id, we don't need it
saratoga_shuffle = subset(saratoga_shuffle, select=(-orig.id))
# Create the 80/20 percent split:
n = nrow(SaratogaHouses)
n_train = round(0.8*n) # round to nearest integer
# Make training and testing datasets
saratoga_train = saratoga_shuffle[1:n_train,]
saratoga_test = saratoga_shuffle[(n_train+1):n,]
Let’s fit the small, medium, and big models from your reading (p174) using the training dataset:
# Small model
lm1 = lm(price ~ lotSize + bedrooms + bathrooms, data=saratoga_train)
# Medium model. Sometimes it's easier to name the variables we want
# to leave out of the model. In the command below,
# the dot (.) means "all variables not named"
# the minus (-) means "exclude this variable"
lm2 = lm(price ~ . - sewer - waterfront - landValue - newConstruction, data=saratoga_train)
summary(lm2)
##
## Call:
## lm(formula = price ~ . - sewer - waterfront - landValue - newConstruction,
## data = saratoga_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -239676 -40616 -8448 27901 525394
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22006.955 13885.434 1.585 0.113221
## lotSize 9088.666 2624.067 3.464 0.000549 ***
## age 88.533 74.047 1.196 0.232051
## livingArea 93.586 5.647 16.572 < 2e-16 ***
## pctCollege 423.046 186.631 2.267 0.023561 *
## bedrooms -16317.717 3284.268 -4.968 7.60e-07 ***
## fireplaces 189.884 3789.896 0.050 0.960048
## bathrooms 20941.562 4283.777 4.889 1.14e-06 ***
## rooms 3349.825 1250.418 2.679 0.007474 **
## heatinghot water/steam -10563.867 5346.186 -1.976 0.048360 *
## heatingelectric -6789.490 15646.664 -0.434 0.664411
## fuelelectric -5829.100 15356.121 -0.380 0.704305
## fueloil -9814.651 6212.463 -1.580 0.114376
## centralAirNo -17098.688 4509.281 -3.792 0.000156 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 67310 on 1368 degrees of freedom
## Multiple R-squared: 0.5465, Adjusted R-squared: 0.5422
## F-statistic: 126.8 on 13 and 1368 DF, p-value: < 2.2e-16
# The model below includes all the same variables plus their interactions;
# The (var1+var2+var3)^2 tells R to include all pairwise interactions and
# main effects
lm3 = lm(price ~ (. - sewer - waterfront - landValue - newConstruction)^2, data=saratoga_train)
# Check out the summary table on your own:
# summary(lm3)
Once we fit the models, we get predictions for the test set and compute the root mean squared prediction error (see “Measuring generalization error”, p 164)
yhat_test1 = predict(lm1, newdata = saratoga_test)
yhat_test2 = predict(lm2, newdata = saratoga_test)
yhat_test3 = predict(lm3, newdata = saratoga_test)
## Warning in predict.lm(lm3, newdata = saratoga_test): prediction from a rank-
## deficient fit may be misleading
sqrt( mean( (saratoga_test$price - yhat_test1)^2 ) )
## [1] 70705.41
sqrt( mean( (saratoga_test$price - yhat_test2)^2 ) )
## [1] 62280.08
sqrt( mean( (saratoga_test$price - yhat_test3)^2 ) )
## [1] 61191.87
We see that the smallest model has RMSPE of about $76,047, and the largest model has a similiar RMSPE of $74,876. However, the medium model has a substantially lower RMPSE than either – $67,135. What’s going on? The small model is too simple to make good predictions, and the big model is so complicated that it can “learn” random patterns in the training set that don’t occur in the testing dataset – we say it is overfitting the data.
In statistical parlance, we say that the small model predictions have high bias – the best prediction we can make using a linear model of those four variables isn’t very good. On the other hand, the large model predictions have high variance – the large model will tend to capture chance associations that are particular to one random sample, rather than learning associations that we are likely to see in future data. As a result the predictions will vary widely between models fit to different training samples.
By focusing on prediction error out of sample (on new data) we hope to find the Goldilocks model – one that’s complex enough to make good predictions, but not so complex that it’s unstable and prone to overfitting. Note that just looking at prediction error in sample (using the same data used to fit the model) we can’t diagnose overfitting:
sqrt( mean( (saratoga_train$price - fitted(lm1))^2 ) )
## [1] 78622.7
sqrt( mean( (saratoga_train$price - fitted(lm2))^2 ) )
## [1] 66970.18
sqrt( mean( (saratoga_train$price - fitted(lm3))^2 ) )
## [1] 61934.72
Looking at RMSPE in-sample for the large model, we get the dramatically over-optimistic prediction error estimate of \(60,526\) (versus $74,876 using new data).
Finally, there is some randomness inherent in all these estimates of RMSPE; ideally we would average over multiple train/test splits using a do-loop:
pred_error_est = do(10) * {
saratoga_shuffle = shuffle(SaratogaHouses)
# Remove the original id, we don't need it
saratoga_shuffle = subset(saratoga_shuffle, select=(-orig.id))
# Create the 80/20 percent split:
n = nrow(SaratogaHouses)
n_train = round(0.8*n) # round to nearest integer
saratoga_train = saratoga_shuffle[1:n_train,]
saratoga_test = saratoga_shuffle[(n_train+1):n,]
lm1 = lm(price ~ lotSize + bedrooms + bathrooms, data=saratoga_train)
lm2 = lm(price ~ . - sewer - waterfront - landValue - newConstruction, data=saratoga_train)
lm3 = lm(price ~ (. - sewer - waterfront - landValue - newConstruction)^2, data=saratoga_train)
yhat_test1 = predict(lm1, newdata = saratoga_test)
yhat_test2 = predict(lm2, newdata = saratoga_test)
yhat_test3 = predict(lm3, newdata = saratoga_test)
c(
rmspe_small = sqrt( mean( (saratoga_test$price - yhat_test1)^2 ) ),
rmspe_med = sqrt( mean( (saratoga_test$price - yhat_test2)^2 ) ),
rmspe_large = sqrt( mean( (saratoga_test$price - yhat_test3)^2 ) )
)
}
print(pred_error_est)
## rmspe_small rmspe_med rmspe_large
## 1 80919.02 70825.05 75545.65
## 2 69387.81 59839.74 67370.83
## 3 76735.38 67406.03 69471.90
## 4 75503.07 62293.50 65911.76
## 5 67887.87 57715.45 57438.05
## 6 73934.81 64733.59 68078.61
## 7 79142.84 66964.82 68730.15
## 8 69053.48 61511.90 61699.24
## 9 79234.51 70994.44 126529.88
## 10 71047.57 59018.64 62154.65
print(summary(pred_error_est))
## rmspe_small rmspe_med rmspe_large
## Min. :67888 Min. :57715 Min. : 57438
## 1st Qu.:69803 1st Qu.:60258 1st Qu.: 63094
## Median :74719 Median :63514 Median : 67725
## Mean :74285 Mean :64130 Mean : 72293
## 3rd Qu.:78541 3rd Qu.:67296 3rd Qu.: 69286
## Max. :80919 Max. :70994 Max. :126530
From the output above we see that 1) looking at the estimated RMSPE in each of the ten splits, depending on the random split we choose we can actually get an estimate of RMSPE for the large model that’s smaller than the medium model – but 2) in the summary, when averaging across all the splits the medium model still has lower RMSPE (although the gap wasn’t as large as we saw in the first split).
For a complete description of stepwise selection, see p. 179 of your reading. The basic idea is to start from some linear model – large or small – and add or remove terms one at a time to try to improve estimated out of sample prediction error. In R this is straightforward. We can either start small and build up, or start large and pare back. Stepwise search is a heuristic, so the two approaches can give different answers.
# Option 1: Starting from a large model and deleting variables
# Here the scope is all the variables in the original model.
# If you want to see each step, delete the trace=0 argument below.
# This will generate *a lot* of output!
lm_big = lm(price ~ (. - sewer - waterfront - landValue - newConstruction)^2, data=saratoga_train)
lm_step = step(lm_big, trace=0)
# Option 2: Starting from a smaller model,
# explicitly listing variables in the scope for selection,
# and adding/deleting variables from the scope
lm_start = lm(price ~ lotSize + age + livingArea, data=SaratogaHouses)
# If you go this route, you can't use the ~ . notation above in the scope
# argument
lm_step2 = step(lm_start, trace=0, scope = price ~ (lotSize + age + livingArea + pctCollege + bedrooms + fireplaces + bathrooms + rooms + heating + fuel + centralAir)^2, data=saratoga_train)
# These different starting points lead to different selected models here:
summary(lm_step)
##
## Call:
## lm(formula = price ~ lotSize + age + livingArea + pctCollege +
## bedrooms + fireplaces + bathrooms + rooms + heating + fuel +
## centralAir + lotSize:age + lotSize:bedrooms + lotSize:fireplaces +
## age:pctCollege + age:fireplaces + age:centralAir + livingArea:fireplaces +
## livingArea:fuel + livingArea:centralAir + pctCollege:fireplaces +
## pctCollege:bathrooms + pctCollege:fuel + bedrooms:heating +
## fireplaces:fuel + bathrooms:heating + rooms:heating + rooms:fuel,
## data = saratoga_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -237548 -35331 -6280 27174 487209
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48921.091 40003.243 1.223 0.221571
## lotSize 1898.545 10099.869 0.188 0.850923
## age -2322.047 429.225 -5.410 7.46e-08 ***
## livingArea 96.682 8.624 11.211 < 2e-16 ***
## pctCollege -367.965 666.572 -0.552 0.581023
## bedrooms -19243.308 4340.931 -4.433 1.01e-05 ***
## fireplaces 41413.468 20730.977 1.998 0.045955 *
## bathrooms -4064.217 17719.080 -0.229 0.818617
## rooms 4323.103 1516.033 2.852 0.004417 **
## heatinghot water/steam 42588.324 20207.519 2.108 0.035255 *
## heatingelectric 84936.836 51130.315 1.661 0.096910 .
## fuelelectric -11384.061 51248.773 -0.222 0.824244
## fueloil 74488.663 30687.418 2.427 0.015341 *
## centralAirNo 14453.332 14753.756 0.980 0.327441
## lotSize:age -211.633 91.233 -2.320 0.020507 *
## lotSize:bedrooms 5529.358 3564.390 1.551 0.121071
## lotSize:fireplaces -8052.417 4274.708 -1.884 0.059817 .
## age:pctCollege 33.737 6.792 4.967 7.67e-07 ***
## age:fireplaces -170.943 121.382 -1.408 0.159274
## age:centralAirNo 817.940 226.052 3.618 0.000307 ***
## livingArea:fireplaces 13.311 4.003 3.325 0.000907 ***
## livingArea:fuelelectric -4.155 14.924 -0.278 0.780754
## livingArea:fueloil -43.643 14.866 -2.936 0.003383 **
## livingArea:centralAirNo -23.164 6.986 -3.316 0.000938 ***
## pctCollege:fireplaces -1173.571 344.633 -3.405 0.000680 ***
## pctCollege:bathrooms 601.741 306.850 1.961 0.050082 .
## pctCollege:fuelelectric -1576.334 464.251 -3.395 0.000705 ***
## pctCollege:fueloil -1011.196 455.778 -2.219 0.026679 *
## bedrooms:heatinghot water/steam 7231.651 7839.561 0.922 0.356456
## bedrooms:heatingelectric 23830.036 8588.602 2.775 0.005603 **
## fireplaces:fuelelectric 24649.069 9646.207 2.555 0.010719 *
## fireplaces:fueloil 21289.740 11115.521 1.915 0.055664 .
## bathrooms:heatinghot water/steam -14281.815 8961.736 -1.594 0.111251
## bathrooms:heatingelectric -25315.770 11266.181 -2.247 0.024798 *
## rooms:heatinghot water/steam -6411.118 2992.144 -2.143 0.032320 *
## rooms:heatingelectric -18186.339 7990.073 -2.276 0.022996 *
## rooms:fuelelectric 12853.811 7913.915 1.624 0.104567
## rooms:fueloil 4490.504 3597.157 1.248 0.212121
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 63250 on 1344 degrees of freedom
## Multiple R-squared: 0.5986, Adjusted R-squared: 0.5875
## F-statistic: 54.17 on 37 and 1344 DF, p-value: < 2.2e-16
summary(lm_step2)
##
## Call:
## lm(formula = price ~ lotSize + age + livingArea + centralAir +
## bathrooms + bedrooms + fuel + rooms + pctCollege + livingArea:centralAir +
## livingArea:fuel + age:fuel + age:pctCollege + age:centralAir +
## fuel:pctCollege + centralAir:fuel + livingArea:bathrooms +
## bathrooms:bedrooms, data = SaratogaHouses)
##
## Residuals:
## Min 1Q Median 3Q Max
## -291799 -36274 -7851 26620 508994
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17758.007 26623.169 -0.667 0.504854
## lotSize 8096.028 2340.130 3.460 0.000554 ***
## age -1834.077 363.130 -5.051 4.87e-07 ***
## livingArea 75.650 12.919 5.856 5.69e-09 ***
## centralAirNo 25501.718 13411.892 1.901 0.057415 .
## bathrooms 38081.733 10035.263 3.795 0.000153 ***
## bedrooms 14380.751 7343.780 1.958 0.050367 .
## fuelelectric 23082.259 31304.021 0.737 0.461006
## fueloil 133857.602 28684.610 4.667 3.30e-06 ***
## rooms 2838.173 1052.961 2.695 0.007099 **
## pctCollege 24.899 253.886 0.098 0.921885
## livingArea:centralAirNo -31.053 6.074 -5.112 3.54e-07 ***
## livingArea:fuelelectric 3.154 9.298 0.339 0.734477
## livingArea:fueloil -36.508 8.164 -4.472 8.26e-06 ***
## age:fuelelectric 436.546 364.698 1.197 0.231471
## age:fueloil -348.616 133.154 -2.618 0.008919 **
## age:pctCollege 25.512 5.435 4.694 2.89e-06 ***
## age:centralAirNo 646.485 203.735 3.173 0.001535 **
## fuelelectric:pctCollege -1020.763 426.065 -2.396 0.016692 *
## fueloil:pctCollege -776.743 402.289 -1.931 0.053672 .
## centralAirNo:fuelelectric 7517.787 10695.955 0.703 0.482237
## centralAirNo:fueloil -32182.249 13194.958 -2.439 0.014830 *
## livingArea:bathrooms 17.671 4.522 3.908 9.69e-05 ***
## bathrooms:bedrooms -14957.385 3603.121 -4.151 3.47e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 63640 on 1704 degrees of freedom
## Multiple R-squared: 0.5876, Adjusted R-squared: 0.582
## F-statistic: 105.6 on 23 and 1704 DF, p-value: < 2.2e-16
Let’s evaluate the two stepwise models via RMSPE against our three original linear models, averaging over 10 training/testing splits:
pred_error_est = do(10) * {
saratoga_shuffle = shuffle(SaratogaHouses)
# Remove the original id, we don't need it
saratoga_shuffle = subset(saratoga_shuffle, select=(-orig.id))
# Create the 80/20 percent split:
n = nrow(SaratogaHouses)
n_train = round(0.8*n) # round to nearest integer
saratoga_train = saratoga_shuffle[1:n_train,]
saratoga_test = saratoga_shuffle[(n_train+1):n,]
lm1 = lm(price ~ lotSize + bedrooms + bathrooms, data=saratoga_train)
lm2 = lm(price ~ . - sewer - waterfront - landValue - newConstruction, data=saratoga_train)
lm3 = lm(price ~ (. - sewer - waterfront - landValue - newConstruction)^2, data=saratoga_train)
# Stepwise models
lm_step = step(lm3, trace=0)
lm_start = lm(price ~ lotSize + age + livingArea, data=saratoga_train)
lm_step2 = step(lm_start, scope = price ~ (lotSize + age + livingArea + pctCollege + bedrooms + fireplaces + bathrooms + rooms + heating + fuel + centralAir)^2, data=saratoga_train, trace=0)
yhat_test1 = predict(lm1, newdata = saratoga_test)
yhat_test2 = predict(lm2, newdata = saratoga_test)
yhat_test3 = predict(lm3, newdata = saratoga_test)
yhat_test_step = predict(lm_step, newdata = saratoga_test)
yhat_test_step2 = predict(lm_step2, newdata = saratoga_test)
c(
rmspe_small = sqrt( mean( (saratoga_test$price - yhat_test1)^2 ) ),
rmspe_med = sqrt( mean( (saratoga_test$price - yhat_test2)^2 ) ),
rmspe_large = sqrt( mean( (saratoga_test$price - yhat_test3)^2 ) ),
rmspe_step = sqrt( mean( (saratoga_test$price - yhat_test_step)^2 ) ),
rmspe_step2 = sqrt( mean( (saratoga_test$price - yhat_test_step2)^2 ) )
)
}
print(pred_error_est)
## rmspe_small rmspe_med rmspe_large rmspe_step rmspe_step2
## 1 81708.61 72584.97 73490.22 72558.24 72504.37
## 2 81593.28 70931.59 70448.49 70450.05 69874.41
## 3 69492.07 57854.49 64647.91 63462.84 62776.48
## 4 64041.18 58996.14 60052.22 59485.00 60478.86
## 5 85536.22 74134.83 72576.21 74322.24 72690.42
## 6 82782.95 71576.61 72667.12 71759.76 71843.64
## 7 77661.81 63445.92 62997.90 63008.93 62481.75
## 8 75480.72 64434.26 63542.00 62681.70 62237.73
## 9 83188.49 68861.48 68147.69 67436.74 68093.51
## 10 70989.60 65224.84 67259.55 63689.14 65094.11
print(summary(pred_error_est))
## rmspe_small rmspe_med rmspe_large rmspe_step
## Min. :64041 Min. :57854 Min. :60052 Min. :59485
## 1st Qu.:72112 1st Qu.:63693 1st Qu.:63818 1st Qu.:63122
## Median :79628 Median :67043 Median :67704 Median :65563
## Mean :77247 Mean :66805 Mean :67583 Mean :66885
## 3rd Qu.:82514 3rd Qu.:71415 3rd Qu.:72044 3rd Qu.:71432
## Max. :85536 Max. :74135 Max. :73490 Max. :74322
## rmspe_step2
## Min. :60479
## 1st Qu.:62555
## Median :66594
## Mean :66808
## 3rd Qu.:71351
## Max. :72690
Looking at the summary output, we see that the two stepwise models and the medium model all have very similar estimated RMSPE – despite the two stepwise models being more complex. In datasets with more variables and larger sample sizes, stepwise selection will often find a better predictive model. In this case we could comfortably use the medium model, knowing that stepwise selection failed to generate substaintial improvement.
Stepwise selection is a useful tool for finding good predictive models, but it has some drawbacks; first, different starting points may give different answers (althgouh they will usually perform similarly for prediction). Second, the stepwise search procedure uses an approximation to out of sample prediction error as its guide. When the sample size isn’t large relative to the number of regression coefficients in the model, the assumptions behind these approximations can easily break down. Third, you lose out on the ability to check the model using residual plots, checking for outliers, etc. It’s always a good idea to check the final model, even though we skipped that step above.
Finally, an important caveat your textbook doesn’t mention: After running stepwise selection, confidence intervals and hypothesis tests using the same data that was used to select the model are invalid – all of our confidence intervals and hypothesis tests assume that the model was chosen without using the data. One can get around this by using part of the data for selecting a model and another part of the data to get confidence intervals or conduct hypothesis tests. But in general stepwise selection is best reserved for cases where we are just trying to get a good predictive model.