This walkthrough shows the R code that goes along with Ch 7’s section titled “Hypothesis testing in regression” (p. 159), and provides more detail about the hypothesis tests in the regression summary table.

Using \(R^2\) to test for partial (linear) relationships

Using \(R^2\) to compare models or evaluate predictors is a reasonable idea; remember that \(R^2\) tells us the proportion of variability in the response variable that is predictable using our covariates. If we add a “good” predictor – one that brings something new to the table, beyond the variables in the model already – we should expect \(R^2\) to go up.

But it turns out that \(R^2\) will always increase when we add a variable to the regression model, whether it’s actually useful or not, due to tiny chance partial correlations with the response variable. Let’s make up a completely useless variable, and add it to a regression model using the Saratoga housing dataset:

library(mosaic)
data("SaratogaHouses")

lm0 = lm(price ~ livingArea + lotSize + fireplaces, data=SaratogaHouses)
rsquared(lm0)
## [1] 0.5112573
# adding a variable that is just noise
useless = rnorm(nrow(SaratogaHouses), 0, 1)  # n = 1728 draws from N(0, 1)
lm0useless = lm(price ~ livingArea + lotSize + fireplaces + useless, data=SaratogaHouses)
rsquared(lm0useless)
## [1] 0.512179

If you run the code above a few times to regenerate the useless variable, you’ll see that \(R^2\) for the model including the useless variable is always larger! We could look at the adjusted \(R^2\) in the summary table instead, or we could use our tools from hypothesis testing to understand when an increase in \(R^2\) from adding a variable or variables is unlikely to be due to chance.
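As a quick aside, here’s one way to pull those adjusted \(R^2\) values for a comparison (a sketch – the value for the model with the useless variable changes from run to run, but adjusted \(R^2\) penalizes extra terms, so it typically stays flat or drops):

# adjusted R-squared penalizes additional terms, unlike multiple R-squared
summary(lm0)$adj.r.squared
summary(lm0useless)$adj.r.squared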

The hypothesis testing approach is especially useful for categorical predictors like the fuel system type, where a null hypothesis of no partial relationship corresponds to multiple regression coefficients being zero simultaneously (one for each of the two dummy variables). Let’s add fuel into the model:

lm1 = lm(price ~ livingArea + lotSize + fireplaces + fuel, data=SaratogaHouses)
summary(lm1)
## 
## Call:
## lm(formula = price ~ livingArea + lotSize + fireplaces + fuel, 
##     data = SaratogaHouses)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -252945  -39745   -7802   28073  547557 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   23776.058   5541.771   4.290 1.88e-05 ***
## livingArea      105.161      3.142  33.465  < 2e-16 ***
## lotSize        8639.936   2487.912   3.473 0.000528 ***
## fireplaces     7493.785   3382.214   2.216 0.026846 *  
## fuelelectric -15494.074   4536.950  -3.415 0.000652 ***
## fueloil      -18978.411   5322.031  -3.566 0.000372 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 68530 on 1722 degrees of freedom
## Multiple R-squared:  0.5168, Adjusted R-squared:  0.5154 
## F-statistic: 368.4 on 5 and 1722 DF,  p-value: < 2.2e-16
rsquared(lm1)
## [1] 0.5168484

\(R^2\) increases after adding these two dummy variables, but is it a meaningful increase? We can assess that using a permutation test – what increase in \(R^2\) would we get if we purposefully made fuel a useless variable by shuffling its values, breaking any relationship with the sale price?

# refit the model 10,000 times, shuffling the fuel column each time
perm1 = do(10000)*{
  lm(price ~ livingArea + lotSize + fireplaces + shuffle(fuel), data=SaratogaHouses)
}

At every iteration of the do-loop above, the fuel column is shuffled before we fit the model, so any increase in \(R^2\) is due to chance alone. The output from the permutation test above tells us: if the fuel variable is actually useless, what kind of increase would we expect to see?
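If you’re curious what do() records at each iteration, perm1 is a data frame; the column we use below is r.squared, the \(R^2\) from each refit:

head(perm1$r.squared)  # first few R-squared values from the shuffled fits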

hist(perm1$r.squared)
abline(v=rsquared(lm1), col='red')

Recall the observed value was 0.5168, shown in red above. So the p-value associated with the test of no partial linear relationship between fuel and sale price – that is, the null hypothesis that the coefficients on both fuel dummy variables are zero – is estimated as

sum(perm1$r.squared >= rsquared(lm1))/10000
## [1] 1e-04

Note that a conservative way to report these p-values would be to “round up” and say that \(p<0.001\) – if you want precise estimates of very small p-values you need to run an awful lot of simulations, and there usually isn’t any point (conventionally, for better or worse, \(p<0.05\) is often taken as sufficiently “strong” evidence against the null).
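If a p-value of exactly zero bothers you, one common convention (an optional refinement, not something used elsewhere in this walkthrough) is to count the observed statistic as one of the permutations, so the estimate can never be exactly zero:

# "add one" Monte Carlo p-value
(sum(perm1$r.squared >= rsquared(lm1)) + 1) / (10000 + 1)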

To get a little more practice let’s test for a non-zero partial effect of fireplaces in the above model:

# same idea, now shuffling fireplaces instead of fuel
perm2 = do(10000)*{
  lm(price ~ livingArea + lotSize + shuffle(fireplaces) + fuel, data=SaratogaHouses)
}

hist(perm2$r.squared)
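abline(v=rsquared(lm1), col='red')  # mark the observed R^2, as before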

# p-value
sum(perm2$r.squared >= rsquared(lm1))/10000
## [1] 0.0285

What about the test statistics and p-values in the R summary table?

The summary table includes a host of test statistics and p-values:

summary(lm1)
## 
## Call:
## lm(formula = price ~ livingArea + lotSize + fireplaces + fuel, 
##     data = SaratogaHouses)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -252945  -39745   -7802   28073  547557 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   23776.058   5541.771   4.290 1.88e-05 ***
## livingArea      105.161      3.142  33.465  < 2e-16 ***
## lotSize        8639.936   2487.912   3.473 0.000528 ***
## fireplaces     7493.785   3382.214   2.216 0.026846 *  
## fuelelectric -15494.074   4536.950  -3.415 0.000652 ***
## fueloil      -18978.411   5322.031  -3.566 0.000372 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 68530 on 1722 degrees of freedom
## Multiple R-squared:  0.5168, Adjusted R-squared:  0.5154 
## F-statistic: 368.4 on 5 and 1722 DF,  p-value: < 2.2e-16

The short story is that these are very similar to the hypothesis tests we’ve been doing, but they use different test statistics, plus extra assumptions and approximation tools, instead of permutation. The situation here is analogous to computing standard errors and confidence intervals via direct estimation versus using the bootstrap. The pros and cons are similar too – the regression table is fast to compute, but the p-values are based on some extra modeling assumptions that we didn’t have to invoke for permutation tests.

The slightly longer story is this:

The p-values attached to each coefficient (in the fourth named column, Pr(>|t|) ) are from testing the null hypothesis of no partial linear relationship – \(\beta_j=0\) – using the test statistic

\[ t_j = \frac{\hat\beta_j - 0}{s_{\hat\beta_j}} \]

(reported in the third named column, t value) where \(s_{\hat\beta_j}\) is the standard error of \(\hat\beta_j\). Remember that in general, we have

\[ \hat\beta_j\sim N(\beta_j,s_{\hat\beta_j}^2) \]

approximately if the residuals are independent, the residual standard deviation is constant and the sample size is large (or the residuals are approximately normally distributed around zero). Then if the null hypothesis is true, \(\beta_j=0\) so

\[ t_j\sim N(0,1). \]

So seeing a large value of \(|t_j|\) would constitute evidence against the null hypothesis. The column Pr(>|t|) simply reports the probability of seeing an absolute t-statistic at least as extreme as what we observed if the null hypothesis is true – computed using the tail areas of the standard normal distribution. (Actually, R uses a \(t\) reference distribution, not a standard normal distribution, but with this sample size the difference is immaterial.)
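As a quick sanity check (a sketch using base R’s coefficient matrix; ctable is just a name chosen here), we can rebuild the t value column ourselves:

# rebuild the "t value" column: estimates divided by their standard errors
ctable = coef(summary(lm1))
ctable[, "Estimate"] / ctable[, "Std. Error"]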

Let’s look at an example: from the regression summary above, the p-value for testing whether the number of fireplaces has any partial linear relationship to the sale price of the house, holding constant the living area, lot size, and fuel type, is about 0.027. Let’s check our understanding by computing the t-statistic and p-value by hand:

tstat = 7493.785/3382.214
2*pnorm(-tstat)
## [1] 0.02671584

The second line above computes the p-value: twice the lower tail area of the standard normal distribution below \(-t\).

(The small difference between the p-value computed by hand and the regression table is from using the normal rather than the t distribution.) Compare that p-value to the last permutation test we did above; the two tests have the same null hypothesis, but use different methods to approximate the distribution of the test statistic when the null hypothesis is true. However, in this case we’d reach similar conclusions either way.
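If you want to reproduce the table’s p-value exactly, swap the normal for a \(t\) distribution with the model’s residual degrees of freedom (1722, from the summary output):

# t reference distribution with the residual degrees of freedom
2*pt(-tstat, df=1722)  # should match the 0.026846 in the summary table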

The overall F-test in the very last row of the summary table is testing the hypothesis that NONE of the variables have linear relationships with \(y\), that is, that all the regression coefficients are zero (except the intercept \(\beta_0\)). This test is rarely worth doing, but if you wanted to, you could perform the equivalent permutation test by shuffling all the X’s in the model (or, more easily, by shuffling the response variable).
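For the curious, here is a sketch of that permutation version, shuffling the response; permF and the 2,000-shuffle count are just illustrative choices to keep it quick:

permF = do(2000)*{
  lm(shuffle(price) ~ livingArea + lotSize + fireplaces + fuel, data=SaratogaHouses)
}
sum(permF$r.squared >= rsquared(lm1))/2000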

OPTIONAL (for the curious): The overall F-statistic is just \[ F = \frac{R^2 / p}{(1-R^2)/(n-p-1)} \] where \(n\) is the sample size and \(p\) is the number of terms in the model. Since \(n\) and \(p\) are constants, the F-statistic above is large when \(R^2\) is large and small when \(R^2\) is small. The F-statistic was historically useful as a test statistic because it has a known distribution under the null hypothesis – called the \(F\)-distribution – provided the residuals are independent, have constant variance, and either (1) are normally distributed or (2) the sample size is very large.
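We can check the formula against the summary table above, where \(n = 1728\) and \(p = 5\) (three numeric terms plus the two fuel dummies):

# recompute the overall F-statistic by hand from R-squared
r2 = rsquared(lm1)
(r2/5) / ((1 - r2)/(1728 - 5 - 1))  # should match the 368.4 on 5 and 1722 DF
pf(368.4, df1=5, df2=1722, lower.tail=FALSE)  # the corresponding (tiny) p-value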

There is also a partial version of the F-test for testing the null hypothesis that a batch of regression coefficients is zero (as in our fuel type example above). It is large when adding the batch of variables increases \(R^2\) by a lot and small when it doesn’t, and like the overall \(F\)-statistic it is mostly useful insofar as it has a known distribution under the null hypothesis, under the same conditions as above.
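If you want to try the partial F-test in R, one standard route is anova() on nested models – for the fuel example, compare lm0 (which leaves fuel out) to lm1 (which includes it); this works here because both models are fit to the same rows:

# partial F-test of the null that both fuel coefficients are zero
anova(lm0, lm1)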