# Chapter 12 Confidence in models

Regression reports are generated using software you have already encountered: `lm`

to fit a model and `summary`

to construct the report from the fitted model. To illustrate with the `SwimRecords`

data from the ** mosaicData** package:

```
mod <- lm(time ~ year + sex, data = SwimRecords)
summary(mod)
```

```
##
## Call:
## lm(formula = time ~ year + sex, data = SwimRecords)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7027 -2.7027 -0.5968 1.2796 19.0759
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 555.71678 33.79991 16.441 < 2e-16 ***
## year -0.25146 0.01732 -14.516 < 2e-16 ***
## sexM -9.79796 1.01287 -9.673 8.79e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.983 on 59 degrees of freedom
## Multiple R-squared: 0.844, Adjusted R-squared: 0.8387
## F-statistic: 159.6 on 2 and 59 DF, p-value: < 2.2e-16
```

## 12.1 Confidence Intervals from Standard Errors

Given the coefficient estimate and the standard error from the regression report, the confidence interval is easily generated.

For a 95% confidence interval, you just multiply the standard error by 2 to get the margin of error. For example, in the above, the margin of error on `sex M`

is \(2 \times 1.013 = 2.03\), or, in computer notation:

`2 * 1.0129`

`## [1] 2.0258`

If you want the two endpoints of the confidence interval, rather than just the margin of error, do these simple calculations: (1) subtract the margin of error from the estimate; (2) add the margin of error to the estimate. So,

`-9.798 - 2*1.0129`

`## [1] -11.8238`

`-9.798 + 2*1.0129`

`## [1] -7.7722`

The key thing is to remember the multiplier that is applied to the standard error. A multiplier of approximately 2 is for a 95% confidence interval.

The `confint()`

function provides a convenient way to calculate confidence intervals directly.

It calculates the exact multiplier (which depends somewhat on the sample size) and applies it to the standard error to produce the confidence intervals.

```
mod <- lm( time ~ year + sex, data = SwimRecords)
confint(mod)
```

```
## 2.5 % 97.5 %
## (Intercept) 488.0833106 623.3502562
## year -0.2861277 -0.2167997
## sexM -11.8247135 -7.7712094
```

It would be convenient if the regression report included confidence intervals rather than the standard error. Part of the reason it doesn’t is historical: the desire to connect to the traditional by-hand calculations.

## 12.2 Bootstrapping Confidence Intervals

Confidence intervals on model coefficients can be computed using the same bootstrapping technique introduced in Chapter @ref(“chap:statistical-inference”).

Start with your fitted model. To illustrate, here is a model of world-record swimming time over the years, taking into account sex:

`lm( time ~ year + sex, data = SwimRecords)`

```
##
## Call:
## lm(formula = time ~ year + sex, data = SwimRecords)
##
## Coefficients:
## (Intercept) year sexM
## 555.7168 -0.2515 -9.7980
```

These coefficients reflect one hypothetical draw from the population-based sampling distribution. It’s impossible to get another draw from the “population” here: the actual records are all you’ve got.

But to approximate sampling variation, you can treat the sample as your population and re-sample:

`lm( time ~ year + sex, data = resample(SwimRecords))`

```
##
## Call:
## lm(formula = time ~ year + sex, data = resample(SwimRecords))
##
## Coefficients:
## (Intercept) year sexM
## 533.8004 -0.2402 -9.9231
```

Constructing many such re-sampling trials and collect the results

```
s = do(500) * lm( time ~ year + sex, data = resample(SwimRecords))
head(s)
```

```
## Intercept year sexM sigma r.squared F numdf dendf .row .index
## 1 551.9951 -0.2495943 -9.495649 2.841716 0.9078873 290.7598 2 59 1 1
## 2 555.8467 -0.2513771 -10.600461 4.007877 0.8340924 148.3099 2 59 1 2
## 3 693.8656 -0.3217095 -9.636199 5.493706 0.8373505 151.8716 2 59 1 3
## 4 633.5587 -0.2911209 -9.212951 4.760334 0.8182637 132.8231 2 59 1 4
## 5 588.4652 -0.2683256 -9.557587 3.912234 0.8497676 166.8624 2 59 1 5
## 6 544.5205 -0.2460383 -9.379971 3.127280 0.8878750 233.5992 2 59 1 6
```

To find the standard error of the coefficients, just take the standard deviation across the re-sampling trials. For the indicator variable `sex M`

:

`sd(s$sexM)`

`## [1] 0.9242203`

Multiplying the standard error by 2 gives the approximate 95% margin of error. Alternatively, you can use the `confint()`

function to calculate this for you:

`confint(s$sexM, method = "stderr")`

`## Confidence Interval from Bootstrap Distribution (500 replicates)`

```
## V1 V2
## stderr -11.5434 -7.911714
```

## 12.3 Prediction Confidence Intervals

When a model is used to make a prediction, it’s helpful to be able to describe how precise the prediction is. For instance, suppose you want to use the `KidsFeet`

data set (from ** mosaicData**) to make a prediction of the foot width of a girl whose foot length is 25 cm.

First, fit your model:

`names(KidsFeet)`

```
## [1] "name" "birthmonth" "birthyear" "length" "width" "sex" "biggerfoot"
## [8] "domhand"
```

`levels(KidsFeet$sex)`

`## [1] "B" "G"`

`mod <- lm( width ~ length + sex, data = KidsFeet)`

Now apply the model to the new data for which you want to make a prediction. Take care to use the right coding for categorical variables.

`predict(mod, newdata = data.frame( length=25, sex="G" ))`

```
## 1
## 8.934275
```

In order to generate a confidence interval, the `predict`

operator needs to be told what type of interval is wanted. There are two types of prediction confidence intervals:

- Interval on the model value which reflects the sampling distributions of the coefficients themselves. To calculate this, use the
`interval = "confidence"`

named argument:

```
predict(mod, newdata = data.frame( length = 25, sex = "G" ),
interval = "confidence")
```

```
## fit lwr upr
## 1 8.934275 8.742565 9.125985
```

The components named `lwr`

and `upr`

are the lower and upper limits of the confidence interval, respectively.

- Interval on the prediction which includes the variation due to the uncertainty in the coefficients as well as the size of a typical residual. To find this interval, use the
`interval = "prediction"`

named argument:

```
predict(mod, newdata = data.frame( length = 25, sex = "G" ),
interval = "prediction")
```

```
## fit lwr upr
## 1 8.934275 8.130484 9.738066
```

The prediction interval is larger than the model-value confidence interval because the residual always gives additional uncertainty around the model value. Predicting an individual’s foot width, even if we know her sex and foot length, involves a lot more uncertainty than predicting the mean foot width of all individuals with these particular characteristics.