# Chapter 8 Fitting models to data

Using the `lm`

software is largely a matter of familiarity with the model design language described in Chapter @ref(“chap:language”). Computing the fitted model values and the residuals is done with the `fitted`

and `resid`

. These operators take a model as an input. To illustrate:

```
Swim <- SwimRecords # from mosaicData
mod1 <- lm( time ~ year + sex, data = Swim)
coef(mod1)
```

```
## (Intercept) year sexM
## 555.7167834 -0.2514637 -9.7979615
```

Once you have constucted the model, you can use `fitted`

and `resid`

:

```
modvals <- fitted(mod1)
head(modvals)
```

```
## 1 2 3 4 5 6
## 66.88051 66.12612 65.62319 65.12026 63.61148 63.10855
```

## 8.1 Sums of Squares

Computations can be performed on the fitted model values and the residuals, just like any other quantity:

`mean(fitted(mod1))`

`## [1] 59.92419`

`var(resid(mod1))`

`## [1] 15.34147`

`sd(resid(mod1))`

`## [1] 3.916818`

`summary(resid(mod1))`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.7030 -2.7030 -0.5968 0.0000 1.2800 19.0800
```

Sums of squares are very important in statistics. Here’s how to calculate them for the response values, the fitted model values, and the residuals:

`sum(Swim$time^2)`

`## [1] 228635`

`sum(fitted(mod1)^2)`

`## [1] 227699.2`

`sum(resid(mod1)^2)`

`## [1] 935.8294`

The partitioning of variation by models is seen by the way the sum of squares of the fitted and the residuals add up to the sum of squares of the response:

`227699 + 935.8`

`## [1] 228634.8`

Don’t forget the squaring stage of the operation! The sum of the residuals (without squaring) is very different from the sum of squares of the residuals:

`sum(resid(mod1))`

`## [1] 1.848521e-14`

`sum(resid(mod1)^2)`

`## [1] 935.8294`

Take care in reading numbers formatted like `1.849e-14`

. The notation stands for \(1.849 \times 10^{-14}\). That number, \(0.00000000000001849\), is effectively zero compared to the residuals themselves!

## 8.2 Redundancy

The `lm`

operator will automatically detect redundancy and deal with it by leaving the redundant terms out of the model.

To see how redundancy is handled, here is an example with a constructed redundant variable in the swimming dataset.

The following statement adds a new variable to the dataframe counting how many years after the end of World War II each record was established:

`Swim$afterwar <- Swim$year - 1945`

Here is a model that doesn’t involve redundancy

```
mod1 <- lm( time ~ year + sex, data = Swim)
coef(mod1)
```

```
## (Intercept) year sexM
## 555.7167834 -0.2514637 -9.7979615
```

When the redundant variable is added in, `lm`

successfully detects the redundancy and handles it. This is indicated by a coefficient of NA on the redundant variable.

```
mod2 <- lm( time ~ year + sex + afterwar, data = Swim)
coef(mod2)
```

```
## (Intercept) year sexM afterwar
## 555.7167834 -0.2514637 -9.7979615 NA
```

In the absence of redundancy, the model coefficients don’t depend on the order in which the model terms are specified. But this is not the case when there is redundancy, since any redundancy is blamed on the later variables. For instance, here `afterwar`

has been put first in the explanatory terms, so `lm`

identifies `year`

as the redundant variable:

```
mod3 <- lm( time ~ afterwar + year + sex, data = Swim)
coef(mod3)
```

```
## (Intercept) afterwar year sexM
## 66.6199228 -0.2514637 NA -9.7979615
```

Even though the coefficients are different, the fitted model values and the residuals are exactly the same (to within computer round-off) regardless of the order of the model terms.

`head(fitted(mod2))`

```
## 1 2 3 4 5 6
## 66.88051 66.12612 65.62319 65.12026 63.61148 63.10855
```

`head(fitted(mod3))`

```
## 1 2 3 4 5 6
## 66.88051 66.12612 65.62319 65.12026 63.61148 63.10855
```

Note that whenever you use a categorical variable and an intercept term in a model, there is a redundancy. This is not shown explicitly. For example, here is a model with no intercept term, and both levels of the categorical variable `sex`

show up with coefficients:

```
mod <- lm( time ~ sex - 1, data = Swim)
coef(mod)
```

```
## sexF sexM
## 65.19226 54.65613
```

If the intercept term is included (as it is by default unless `-1`

is used in the model formula), one of the levels is simply dropped in the report:

```
mod <- lm( time ~ sex, data = Swim)
coef(mod)
```

```
## (Intercept) sexM
## 65.19226 -10.53613
```

Remember that this coefficient report implicitly involves a redundancy.