Chapter 4 Groupwise models

Calculating means and other simple statistics is a matter of using the right function in R. The mosaic package — which you should load in to R as shown in Section @ref(“sec:loading-mosaic”) — makes it straightforward to calculate either a “grand” statistic or a “group-wise” statistic. To illustrate:

Load the mosaic package, needed only once per session.

Read in data you are interested in analyzing, for instance the Cherry-Blossom 2008 data described earlier:3

Runners = read.csv("http://tiny.cc/mosaic/Cherry-Blossom-2008.csv")
names( Runners )
## [1] "position" "division" "total"    "name"     "age"      "place"    "net"      "gun"     
## [9] "sex"

Calculate a grand mean on the “gun” time — the time between the start of the race, signalled by a gun, and when each runner crossed the finish line. Here, to emphasize that this is a group–wise model, we use the function gwm() from the mosaic package. It works the same as mean(), but its output is more formally the output of a model.

gwm(gun ~ 1, data = Runners)
## 
## Groupwise Model Call:
## gwm(formula = gun ~ 1, data = Runners)
## 
##   model_value
## 1       93.73

The model formula gun ~ 1 is an “all cases the same” model. You will refer to the 1 as the “Intercept term” later in the text, but for now, think of it as “one value.” The one value here is the grand mean. Grand mean and other “grand” statistics.

mean( gun ~ 1, data = Runners)
##        1 
## 93.72504
median( gun ~ 1, data = Runners )
##        1 
## 93.68333
sd( gun ~ 1, data = Runners )
##        1 
## 14.96899

To tell R that you want to break the statistics down by groups, replace the 1 with the variable which defines the groups. You will be using this notation frequently in building models. Here, as before, the ~ means, “model by” or “broken down by” or “versus.” To find the means for the gun time broken down by sex, enter

gwm( gun ~ sex, data = Runners )
## 
## Groupwise Model Call:
## gwm(formula = gun ~ sex, data = Runners)
## 
##   sex model_value
## 1   F       98.77
## 2   M       88.26

Other statistics work the same way, for instance,

sd( gun ~ sex, data = Runners )
##        F        M 
## 13.33713 14.71851

Another example … wage broken down by sector of the economy, using data cps.csv:4

CPS = read.csv("http://tiny.cc/mosaic/cps.csv")
gwm( wage ~ sector, data = CPS )
## 
## Groupwise Model Call:
## gwm(formula = wage ~ sector, data = CPS)
## 
##     sector model_value
## 1 clerical       7.423
## 2    const       9.502
## 3    manag      12.704
## 4    manuf       8.036
## 5    other       8.501
## 6     prof      11.947
## 7    sales       7.593
## 8  service       6.537

In the Whickham smoking data example5, the outcome for each person was not a number but a category: Alive or Dead at the time of the follow-up.

W <- read.csv("http://tiny.cc/mosaic/whickham.csv")
names(W)
## [1] "outcome" "smoker"  "age"
levels(W$outcome)
## [1] "Alive" "Dead"

The gwm() function recognizes that outcome is a categorical variable and outputs proportions. In the case of “all cases the same,” this is porportions of the data in the outcome levels:

gwm( outcome ~ 1, data = W )
## 
## Groupwise Model Call:
## gwm(formula = outcome ~ 1, data = W)
## 
##   outcome model_value
## 1   Alive      0.7192
## 2    Dead      0.2808

Here’s the breakdown of the levels Alive and Dead according to smoking status:

gwm( outcome ~ smoker, data = W )
## 
## Groupwise Model Call:
## gwm(formula = outcome ~ smoker, data = W)
## 
##   outcome smoker model_value
## 1   Alive     No      0.3820
## 2    Dead     No      0.1750
## 3   Alive    Yes      0.3371
## 4    Dead    Yes      0.1058

You may be surbrised by this result. But don’t be misled. A more meaningful question is whether smokers are different from non-smokers when holding other variables constant, such as age. To address this question, you need to add age into the model.

It might be natural to consider each age — 35, 36, 37, and so on — as a separate group, but you won’t get very many members of each group. And, likely, the data for 35 year-olds has quite a lot to say about 36 year-olds, so it doesn’t make sense to treat them as completely separate groups.

You can use the cut() function to divide up a quantitative variable into groups. You get to specify the breaks between groups.

W$ageGroups <- cut(W$age, breaks=c(0,30,40,53,64,75,100) )
gwm( outcome ~ ageGroups, data = W )
## 
## Groupwise Model Call:
## gwm(formula = outcome ~ ageGroups, data = W)
## 
##    outcome ageGroups model_value
## 1    Alive    (0,30]    0.214612
## 2     Dead    (0,30]    0.004566
## 3    Alive   (30,40]    0.181887
## 4     Dead   (30,40]    0.009893
## 5    Alive   (40,53]    0.177321
## 6     Dead   (40,53]    0.035769
## 7    Alive   (53,64]    0.119482
## 8     Dead   (53,64]    0.071537
## 9    Alive   (64,75]    0.025875
## 10    Dead   (64,75]    0.102740
## 11   Alive  (75,100]    0.000000
## 12    Dead  (75,100]    0.056317
gwm( outcome ~ smoker + ageGroups, data = W )
## 
## Groupwise Model Call:
## gwm(formula = outcome ~ smoker + ageGroups, data = W)
## 
##    outcome smoker ageGroups model_value
## 1    Alive     No    (0,30]    0.123288
## 2     Dead     No    (0,30]    0.002283
## 3    Alive    Yes    (0,30]    0.091324
## 4     Dead    Yes    (0,30]    0.002283
## 5    Alive     No   (30,40]    0.097412
## 6     Dead     No   (30,40]    0.004566
## 7    Alive    Yes   (30,40]    0.084475
## 8     Dead    Yes   (30,40]    0.005327
## 9    Alive     No   (40,53]    0.075342
## 10    Dead     No   (40,53]    0.010654
## 11   Alive    Yes   (40,53]    0.101979
## 12    Dead    Yes   (40,53]    0.025114
## 13   Alive     No   (53,64]    0.064688
## 14    Dead     No   (53,64]    0.031963
## 15   Alive    Yes   (53,64]    0.054795
## 16    Dead    Yes   (53,64]    0.039574
## 17   Alive     No   (64,75]    0.021309
## 18    Dead     No   (64,75]    0.078387
## 19   Alive    Yes   (64,75]    0.004566
## 20    Dead    Yes   (64,75]    0.024353
## 21   Alive     No  (75,100]    0.000000
## 22    Dead     No  (75,100]    0.047184
## 23   Alive    Yes  (75,100]    0.000000
## 24    Dead    Yes  (75,100]    0.009132

With modeling techniques, to be introduced in later chapters, you can use quantitative variables without the need to divide them into groups.

4.1 Model Values and Residuals

A group-wise model tells you a model value for each group, but often you will need these in a case-by-case format: the model value for each case in a data set. The fitted() function carries out this simple calculation, taking each case in turn, figuring out which group it belongs to, and then returning the set of model values for all the cases. It requires two arguments: the model (here provided by gwm()) the data on which the model was based. For example:

Kids <- KidsFeet   # just get this from mosaicData instead of going out on the web with read.csv
mod <- gwm( width ~ sex, data = Kids )
fitted( mod, Kids )
##  [1] 9.190000 9.190000 9.190000 9.190000 9.190000 9.190000 9.190000 8.784211 8.784211 9.190000
## [11] 9.190000 9.190000 9.190000 9.190000 8.784211 8.784211 8.784211 8.784211 8.784211 8.784211
## [21] 9.190000 9.190000 8.784211 8.784211 8.784211 9.190000 8.784211 9.190000 9.190000 9.190000
## [31] 8.784211 8.784211 8.784211 9.190000 9.190000 8.784211 8.784211 8.784211 8.784211

The residuals are found by subtracting the case-by-case model value from the actual values for each case.

res <- Kids$width - fitted(mod, Kids)
head(res) # display the first few residuals
## [1] -0.79 -0.39  0.51  0.61 -0.29  0.51
res <- resid(mod, Kids) # function resid does the calculation automatically
head(res)
## [1] -0.79 -0.39  0.51  0.61 -0.29  0.51

Take care to use the same quantitative variable (width in this case) from the data as was used in constructing the model.

The var() function will calculate the variance:

var( Kids$width )  # overall variation
## [1] 0.2596761
var( fitted(mod, Kids) ) # variation in model values
## [1] 0.04222182
var( resid(mod, Kids) ) # residual variation
## [1] 0.2174543

Notice the “model triangle” relationship between these three numbers.


  1. These are also made available in the mosaic package as TenMileRace.

  2. This data set is made available by mosaic as CPS85.

  3. This data set is made available by mosaic as Whickham.