Chapter 8 Fitting models to data

Using the lm software is largely a matter of familiarity with the model design language described in Chapter @ref(“chap:language”). Computing the fitted model values and the residuals is done with the fitted and resid. These operators take a model as an input. To illustrate:

Swim <- SwimRecords  # from mosaicData
mod1 <- lm( time ~ year + sex, data = Swim)
coef(mod1)
## (Intercept)        year        sexM 
## 555.7167834  -0.2514637  -9.7979615

Once you have constucted the model, you can use fitted and resid:

modvals <- fitted(mod1)
head(modvals)
##        1        2        3        4        5        6 
## 66.88051 66.12612 65.62319 65.12026 63.61148 63.10855

8.1 Sums of Squares

Computations can be performed on the fitted model values and the residuals, just like any other quantity:

mean(fitted(mod1))
## [1] 59.92419
var(resid(mod1))
## [1] 15.34147
sd(resid(mod1))
## [1] 3.916818
summary(resid(mod1))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -4.7030 -2.7030 -0.5968  0.0000  1.2800 19.0800

Sums of squares are very important in statistics. Here’s how to calculate them for the response values, the fitted model values, and the residuals:

sum(Swim$time^2)
## [1] 228635
sum(fitted(mod1)^2)
## [1] 227699.2
sum(resid(mod1)^2)
## [1] 935.8294

The partitioning of variation by models is seen by the way the sum of squares of the fitted and the residuals add up to the sum of squares of the response:

227699 + 935.8
## [1] 228634.8

Don’t forget the squaring stage of the operation! The sum of the residuals (without squaring) is very different from the sum of squares of the residuals:

sum(resid(mod1))
## [1] 1.848521e-14
sum(resid(mod1)^2)
## [1] 935.8294

Take care in reading numbers formatted like 1.849e-14. The notation stands for \(1.849 \times 10^{-14}\). That number, \(0.00000000000001849\), is effectively zero compared to the residuals themselves!

8.2 Redundancy

The lm operator will automatically detect redundancy and deal with it by leaving the redundant terms out of the model.

To see how redundancy is handled, here is an example with a constructed redundant variable in the swimming dataset.
The following statement adds a new variable to the dataframe counting how many years after the end of World War II each record was established:

Swim$afterwar <- Swim$year - 1945

Here is a model that doesn’t involve redundancy

mod1 <- lm( time ~ year + sex, data = Swim)
coef(mod1)
## (Intercept)        year        sexM 
## 555.7167834  -0.2514637  -9.7979615

When the redundant variable is added in, lm successfully detects the redundancy and handles it. This is indicated by a coefficient of NA on the redundant variable.

mod2 <- lm( time ~ year + sex + afterwar, data = Swim)
coef(mod2)
## (Intercept)        year        sexM    afterwar 
## 555.7167834  -0.2514637  -9.7979615          NA

In the absence of redundancy, the model coefficients don’t depend on the order in which the model terms are specified. But this is not the case when there is redundancy, since any redundancy is blamed on the later variables. For instance, here afterwar has been put first in the explanatory terms, so lm identifies year as the redundant variable:

mod3 <- lm( time ~ afterwar + year + sex, data = Swim)
coef(mod3)
## (Intercept)    afterwar        year        sexM 
##  66.6199228  -0.2514637          NA  -9.7979615

Even though the coefficients are different, the fitted model values and the residuals are exactly the same (to within computer round-off) regardless of the order of the model terms.

head(fitted(mod2))
##        1        2        3        4        5        6 
## 66.88051 66.12612 65.62319 65.12026 63.61148 63.10855
head(fitted(mod3))
##        1        2        3        4        5        6 
## 66.88051 66.12612 65.62319 65.12026 63.61148 63.10855

Note that whenever you use a categorical variable and an intercept term in a model, there is a redundancy. This is not shown explicitly. For example, here is a model with no intercept term, and both levels of the categorical variable sex show up with coefficients:

mod <- lm( time ~ sex - 1, data = Swim)
coef(mod)
##     sexF     sexM 
## 65.19226 54.65613

If the intercept term is included (as it is by default unless -1 is used in the model formula), one of the levels is simply dropped in the report:

mod <- lm( time ~ sex, data = Swim)
coef(mod)
## (Intercept)        sexM 
##    65.19226   -10.53613

Remember that this coefficient report implicitly involves a redundancy.