# Chapter 15 Testing parts of models

`require(mosaic) # mosaic operators and data used in this section`

The basic software for hypothesis testing on parts of models involves the familiar `lm()`

and `summary()`

operators for generating the regression report and the `anova()`

operator for generating an **ANOVA** report on a model.

## 15.1 ANOVA reports

The `anova`

operator takes a model as an argument and produces the term-by term `ANOVA`

report. To illustrate, consider this model of wages from the Current Population Survey data.

```
Cps <- CPS85 # from mosaicData
mod1 <- lm( wage ~ married + age + educ, data = Cps)
anova(mod1)
```

```
## Analysis of Variance Table
##
## Response: wage
## Df Sum Sq Mean Sq F value Pr(>F)
## married 1 142.4 142.40 6.7404 0.009687 **
## age 1 338.5 338.48 16.0215 7.156e-05 ***
## educ 1 2398.7 2398.72 113.5405 < 2.2e-16 ***
## Residuals 530 11197.1 21.13
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Note the small p-value on the `married`

term: `0.0097`

.

To change the order of the terms in the report, you can create a new model with the explanatory terms listed in a different order. For example, here’s the `ANOVA`

on the same model, but with `married`

last instead of first:

```
mod2 <- lm( wage ~ age + educ + married, data = Cps)
anova(mod2)
```

```
## Analysis of Variance Table
##
## Response: wage
## Df Sum Sq Mean Sq F value Pr(>F)
## age 1 440.8 440.84 20.8668 6.13e-06 ***
## educ 1 2402.7 2402.75 113.7310 < 2.2e-16 ***
## married 1 36.0 36.01 1.7046 0.1923
## Residuals 530 11197.1 21.13
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Now the p-value on `married`

is large. This suggests that much of the variation in `wage`

that is associated with `married`

can also be accounted for by `age`

and `educ`

instead.

## 15.2 Non-Parametric Statistics

Consider the model of world-record swimming times plotted on page 116.

It shows pretty clearly the interaction between `year`

and `sex`

.

It’s easy to confirm that this interaction term is statistically significant:

```
Swim <- SwimRecords # in mosaicData
anova( lm( time ~ year * sex, data = Swim) )
```

```
## Analysis of Variance Table
##
## Response: time
## Df Sum Sq Mean Sq F value Pr(>F)
## year 1 3578.6 3578.6 324.738 < 2.2e-16 ***
## sex 1 1484.2 1484.2 134.688 < 2.2e-16 ***
## year:sex 1 296.7 296.7 26.922 2.826e-06 ***
## Residuals 58 639.2 11.0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

The p-value on the interaction term is very small: \(2.8 \times10^{-6}\).

To check whether this result might be influenced by the shape of the distribution of the `time`

or `year`

data, you can conduct a non-parametric test. Simply take the rank of each quantitative variable:

```
mod <- lm( rank(time) ~ rank(year) * sex, data = Swim)
anova(mod)
```

```
## Analysis of Variance Table
##
## Response: rank(time)
## Df Sum Sq Mean Sq F value Pr(>F)
## rank(year) 1 14320.5 14320.5 3755.7711 <2e-16 ***
## sex 1 5313.0 5313.0 1393.4135 <2e-16 ***
## rank(year):sex 1 0.9 0.9 0.2298 0.6335
## Residuals 58 221.1 3.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

With the rank-transformed data, the p-value on the interaction term is much larger: no evidence for an interaction between `year`

and `sex`

. You can see this directly in a plot of the data after rank-transforming `time`

:

`xyplot( rank(time) ~ year, groups = sex, data = Swim)`

The rank-transformed data suggest that women’s records are improving in about the same way as men’s. That is, new records are set by women at a rate similar to the rate at which men set them.