Chapter 7 Model formulas and coefficients

Make sure mosaic is loaded.

7.1 Examining model coefficients

The lm() operator finds model coefficients. To illustrate, here’s a pair of statements that read in a data frame and fit a model to it:

mod <- lm( time ~ year + sex, data = SwimRecords) 

The first argument to lm is a model design, the second is the data frame.

The object created by lm — here given the name mod — contains a variety of information about the model. To access the coefficients themselves, use the coef() operator applied to the model:

coef(mod)
## (Intercept)        year        sexM 
## 555.7167834  -0.2514637  -9.7979615

As shorthand to display the coefficients, just type the name of the object that is storing the model:

mod
## 
## Call:
## lm(formula = time ~ year + sex, data = SwimRecords)
## 
## Coefficients:
## (Intercept)         year         sexM  
##    555.7168      -0.2515      -9.7980

A more detailed report can be gotten with the summary operator. This gives additional statistical information that will be used in later chapters:

summary(mod)
## 
## Call:
## lm(formula = time ~ year + sex, data = SwimRecords)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7027 -2.7027 -0.5968  1.2796 19.0759 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 555.71678   33.79991  16.441  < 2e-16 ***
## year         -0.25146    0.01732 -14.516  < 2e-16 ***
## sexM         -9.79796    1.01287  -9.673 8.79e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.983 on 59 degrees of freedom
## Multiple R-squared:  0.844,  Adjusted R-squared:  0.8387 
## F-statistic: 159.6 on 2 and 59 DF,  p-value: < 2.2e-16

From time to time in the exercises, you will be asked to calculate model values “by hand.” This is accomplished by multiplying the coefficients by the appropriate values and adding them up. For example, the model value for a male swimmer in 2010 would be:

555.7 - 0.2515*2010 - 9.798
## [1] 40.387

Notice that the “value” used to multiply the intercept is always 1, and the “value” used for a categorical level is either 0 or 1 depending on whether there is a match with the level. In this example, since the swimmer in question was male, the value of sex M is 1. If the swimmer had been female, the value for sex M would have been 0.

When a model includes interaction terms, the interaction coefficients need to be multiplied by all the values involved in the interaction. For example, here is a model with an interaction between year and sex:

mod2 <- lm( time ~ year * sex, data = SwimRecords)
coef(mod2)
##  (Intercept)         year         sexM    year:sexM 
##  697.3012156   -0.3240459 -302.4638388    0.1499166
697.3 - 0.3240*2010 - 302.5 + 0.1499*2010
## [1] 44.859

The year:sexM coefficient is being multiplied by the year (2010) and the value of sex M, which is 1 for this male swimmer.

7.2 Other Useful Operators

  1. cross() will combine two categorical variables into a single variable. For example, in the Current Population Survey data, the variable sex has levels F and M, while the variable race has levels W and NW. Crossing the two variables combines them; the new variable has four levels: F.NW, M.NW, F.W, M.W:
CPS <- CPS85  # from mosaicData
RaceSex <- cross(CPS$sex, CPS$race)
summary(RaceSex)
## F:NW  F:W M:NW  M:W 
##   28  217   39  250

The summary tells us that there are 28 non-white females, 270 white females, etc, in the new categorical variable called RaceSex.

  1. as.factor() will convert a quantitative variable to a categorical variable. This is useful when a quantity like month has been coded as a number, say 1 for January and 2 for February, etc. but you do not want models to treat it as such.

To illustrate, consider two different models of the usage temperature versus month:

utils <- read.csv("http://tiny.cc/mosaic/utilities.csv")
mod1 <- lm( temp ~ month, data = utils)
mod2 <- lm( temp ~ as.factor(month), data = utils)

Here are the graphs of those models:

xyplot(temp + fitted(mod1) ~ month, data = utils)

xyplot( temp + fitted(mod2) ~ month, data = utils )

In the first model, month is treated quantitatively, so the model term month produces a straight-line relationship that does not correspond well to the data.

In the second model, month is treated categorically, allowing a more complicated model relationship. In fact, this is a groupwise model: the model values represent the mean temperature for each month.