# Chapter 4 Groupwise models

Calculating means and other simple statistics is a matter of using the right function in R. The ** mosaic** package — which you should load in to R as shown in Section @ref(“sec:loading-mosaic”) — makes it straightforward to calculate either a “grand” statistic or a “group-wise” statistic. To illustrate:

Load the ** mosaic** package, needed only once per session.

Read in data you are interested in analyzing, for instance the Cherry-Blossom 2008 data described earlier:^{3}

```
Runners = read.csv("http://tiny.cc/mosaic/Cherry-Blossom-2008.csv")
names( Runners )
```

```
## [1] "position" "division" "total" "name" "age" "place" "net" "gun"
## [9] "sex"
```

Calculate a grand mean on the “gun” time — the time between the start of the race, signalled by a gun, and when each runner crossed the finish line. Here, to emphasize that this is a group–wise model, we use the function `gwm()`

from the ** mosaic** package. It works the same as

`mean()`

, but its output is more formally the output of a model.`gwm(gun ~ 1, data = Runners)`

```
##
## Groupwise Model Call:
## gwm(formula = gun ~ 1, data = Runners)
##
## model_value
## 1 93.73
```

The model formula `gun ~ 1`

is an “all cases the same” model. You will refer to the `1`

as the “Intercept term” later in the text, but for now, think of it as “one value.” The one value here is the grand mean. Grand mean and other “grand” statistics.

`mean( gun ~ 1, data = Runners)`

```
## 1
## 93.72504
```

`median( gun ~ 1, data = Runners )`

```
## 1
## 93.68333
```

`sd( gun ~ 1, data = Runners )`

```
## 1
## 14.96899
```

To tell R that you want to break the statistics down by groups, replace the `1`

with the variable which defines the groups. You will be using this notation frequently in building models. Here, as before, the `~`

means, “model by” or “broken down by” or “versus.” To find the means for the gun time broken down by sex, enter

`gwm( gun ~ sex, data = Runners )`

```
##
## Groupwise Model Call:
## gwm(formula = gun ~ sex, data = Runners)
##
## sex model_value
## 1 F 98.77
## 2 M 88.26
```

Other statistics work the same way, for instance,

`sd( gun ~ sex, data = Runners )`

```
## F M
## 13.33713 14.71851
```

Another example … wage broken down by sector of the economy, using data `cps.csv`

:^{4}

```
CPS = read.csv("http://tiny.cc/mosaic/cps.csv")
gwm( wage ~ sector, data = CPS )
```

```
##
## Groupwise Model Call:
## gwm(formula = wage ~ sector, data = CPS)
##
## sector model_value
## 1 clerical 7.423
## 2 const 9.502
## 3 manag 12.704
## 4 manuf 8.036
## 5 other 8.501
## 6 prof 11.947
## 7 sales 7.593
## 8 service 6.537
```

In the Whickham smoking data example^{5}, the `outcome`

for each person was not a number but a category: Alive or Dead at the time of the follow-up.

```
W <- read.csv("http://tiny.cc/mosaic/whickham.csv")
names(W)
```

`## [1] "outcome" "smoker" "age"`

`levels(W$outcome)`

`## [1] "Alive" "Dead"`

The `gwm()`

function recognizes that `outcome`

is a categorical variable and outputs proportions. In the case of “all cases the same,” this is porportions of the data in the `outcome`

levels:

`gwm( outcome ~ 1, data = W )`

```
##
## Groupwise Model Call:
## gwm(formula = outcome ~ 1, data = W)
##
## outcome model_value
## 1 Alive 0.7192
## 2 Dead 0.2808
```

Here’s the breakdown of the levels `Alive`

and `Dead`

according to smoking status:

`gwm( outcome ~ smoker, data = W )`

```
##
## Groupwise Model Call:
## gwm(formula = outcome ~ smoker, data = W)
##
## outcome smoker model_value
## 1 Alive No 0.3820
## 2 Dead No 0.1750
## 3 Alive Yes 0.3371
## 4 Dead Yes 0.1058
```

You may be surbrised by this result. But don’t be misled. A more meaningful question is whether smokers are different from non-smokers when holding other variables constant, such as age. To address this question, you need to add age into the model.

It might be natural to consider each age — 35, 36, 37, and so on — as a separate group, but you won’t get very many members of each group. And, likely, the data for 35 year-olds has quite a lot to say about 36 year-olds, so it doesn’t make sense to treat them as completely separate groups.

You can use the `cut()`

function to divide up a quantitative variable into groups. You get to specify the breaks between groups.

```
W$ageGroups <- cut(W$age, breaks=c(0,30,40,53,64,75,100) )
gwm( outcome ~ ageGroups, data = W )
```

```
##
## Groupwise Model Call:
## gwm(formula = outcome ~ ageGroups, data = W)
##
## outcome ageGroups model_value
## 1 Alive (0,30] 0.214612
## 2 Dead (0,30] 0.004566
## 3 Alive (30,40] 0.181887
## 4 Dead (30,40] 0.009893
## 5 Alive (40,53] 0.177321
## 6 Dead (40,53] 0.035769
## 7 Alive (53,64] 0.119482
## 8 Dead (53,64] 0.071537
## 9 Alive (64,75] 0.025875
## 10 Dead (64,75] 0.102740
## 11 Alive (75,100] 0.000000
## 12 Dead (75,100] 0.056317
```

`gwm( outcome ~ smoker + ageGroups, data = W )`

```
##
## Groupwise Model Call:
## gwm(formula = outcome ~ smoker + ageGroups, data = W)
##
## outcome smoker ageGroups model_value
## 1 Alive No (0,30] 0.123288
## 2 Dead No (0,30] 0.002283
## 3 Alive Yes (0,30] 0.091324
## 4 Dead Yes (0,30] 0.002283
## 5 Alive No (30,40] 0.097412
## 6 Dead No (30,40] 0.004566
## 7 Alive Yes (30,40] 0.084475
## 8 Dead Yes (30,40] 0.005327
## 9 Alive No (40,53] 0.075342
## 10 Dead No (40,53] 0.010654
## 11 Alive Yes (40,53] 0.101979
## 12 Dead Yes (40,53] 0.025114
## 13 Alive No (53,64] 0.064688
## 14 Dead No (53,64] 0.031963
## 15 Alive Yes (53,64] 0.054795
## 16 Dead Yes (53,64] 0.039574
## 17 Alive No (64,75] 0.021309
## 18 Dead No (64,75] 0.078387
## 19 Alive Yes (64,75] 0.004566
## 20 Dead Yes (64,75] 0.024353
## 21 Alive No (75,100] 0.000000
## 22 Dead No (75,100] 0.047184
## 23 Alive Yes (75,100] 0.000000
## 24 Dead Yes (75,100] 0.009132
```

With modeling techniques, to be introduced in later chapters, you can use quantitative variables without the need to divide them into groups.

## 4.1 Model Values and Residuals

A group-wise model tells you a model value for each group, but often you will need these in a case-by-case format: the model value for each case in a data set. The `fitted()`

function carries out this simple calculation, taking each case in turn, figuring out which group it belongs to, and then returning the set of model values for all the cases. It requires two arguments: the model (here provided by `gwm()`

) the data on which the model was based. For example:

```
Kids <- KidsFeet # just get this from mosaicData instead of going out on the web with read.csv
mod <- gwm( width ~ sex, data = Kids )
fitted( mod, Kids )
```

```
## [1] 9.190000 9.190000 9.190000 9.190000 9.190000 9.190000 9.190000 8.784211 8.784211 9.190000
## [11] 9.190000 9.190000 9.190000 9.190000 8.784211 8.784211 8.784211 8.784211 8.784211 8.784211
## [21] 9.190000 9.190000 8.784211 8.784211 8.784211 9.190000 8.784211 9.190000 9.190000 9.190000
## [31] 8.784211 8.784211 8.784211 9.190000 9.190000 8.784211 8.784211 8.784211 8.784211
```

The residuals are found by subtracting the case-by-case model value from the actual values for each case.

```
res <- Kids$width - fitted(mod, Kids)
head(res) # display the first few residuals
```

`## [1] -0.79 -0.39 0.51 0.61 -0.29 0.51`

```
res <- resid(mod, Kids) # function resid does the calculation automatically
head(res)
```

`## [1] -0.79 -0.39 0.51 0.61 -0.29 0.51`

Take care to use the same quantitative variable (`width`

in this case) from the data as was used in constructing the model.

The `var()`

function will calculate the variance:

`var( Kids$width ) # overall variation`

`## [1] 0.2596761`

`var( fitted(mod, Kids) ) # variation in model values`

`## [1] 0.04222182`

`var( resid(mod, Kids) ) # residual variation`

`## [1] 0.2174543`

Notice the “model triangle” relationship between these three numbers.