# Chapter 16 Models of Yes/No Variables

Fitting logistic models uses many of the same ideas as in linear models.

`library(mosaic)`

## 16.1 Fitting Logistic Models

The `glm` operator fits logistic models. (It also fits other kinds of models, but that’s another story.) `glm` takes model design and data arguments that are identical to their counterparts in `lm`. Here’s an example using the smoking/mortality data in `Whickham`:

```
mod = glm(outcome ~ age + smoker, data = Whickham,
          family = "binomial")
```

The last argument, `family="binomial"`, simply specifies to `glm()` that the logistic transformation should be used. `glm()` is short for Generalized Linear Modeling, a broad label that covers logistic regression as well as other types of models involving links and transformations.

The regression report is produced with the `summary` operator, which recognizes that the model was fit logistically and does the right thing:

`summary(mod)`

```
##
## Call:
## glm(formula = outcome ~ age + smoker, family = "binomial", data = Whickham)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9581 -0.5458 -0.2228 0.4381 3.2795
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.599221 0.441231 -17.223 <2e-16 ***
## age 0.123683 0.007177 17.233 <2e-16 ***
## smokerYes 0.204699 0.168422 1.215 0.224
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1560.32 on 1313 degrees of freedom
## Residual deviance: 945.02 on 1311 degrees of freedom
## AIC: 951.02
##
## Number of Fisher Scoring iterations: 6
```

Keep in mind that the coefficients refer to the intermediate model values \(Y\). The probability \(P\) will be \(e^Y / (1 + e^Y )\).
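For instance, taking the coefficients from the report above, the link value for a hypothetical 40-year-old smoker (an invented case, just for illustration) can be converted to a probability directly; `plogis()` is base R’s logistic function, \(e^Y / (1 + e^Y)\):

```
# link value Y for a hypothetical 40-year-old smoker,
# assembled from the coefficients in the report above
Y = -7.599221 + 0.123683 * 40 + 0.204699
# probability P = e^Y / (1 + e^Y), about 0.08
exp(Y) / (1 + exp(Y))
# plogis() applies the same transformation
plogis(Y)
```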

## 16.2 Fitted Model Values

Logistic regression involves two different kinds of fitted values: the intermediate “link” value \(Y\) and the probability \(P\). The `fitted` operator returns the probabilities:

`probs = fitted(mod)`

There is one fitted probability value for each case.

The link values can be obtained via the `predict` operator:

```
linkvals = predict(mod, type="link")
head(linkvals)
```

```
## 1 2 3 4 5 6
## -4.5498092 -5.1682251 1.3869832 0.6875514 0.3165019 -2.6945616
```

Notice that the link values are not necessarily between zero and one.

The `predict` operator can also be used to calculate the probability values:

```
response_vals = predict(mod, type="response")
head(response_vals)
```

```
## 1 2 3 4 5 6
## 0.010458680 0.005662422 0.800110184 0.665421999 0.578471493 0.063295027
```
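The two kinds of fitted values are connected by the logistic transform: applying \(e^Y / (1 + e^Y)\) to the link values reproduces the response values. A quick check, using the model from above:

```
# plogis() computes e^Y / (1 + e^Y); applied to the link
# values it should reproduce the response values case by case
all.equal(plogis(predict(mod, type = "link")),
          predict(mod, type = "response"))
```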

This is particularly useful when you want to use `predict` to find the model values for inputs other than the original data frame used for fitting.
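For example, to find the fitted probabilities for some new cases (the ages and smoking statuses below are invented purely for illustration), supply them as a data frame via the `newdata` argument:

```
# hypothetical new cases, with the same variable names as Whickham
newcases = data.frame(age = c(30, 50, 70),
                      smoker = c("No", "Yes", "Yes"))
predict(mod, newdata = newcases, type = "response")
```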

## 16.3 Which Level is “Yes”?

In fitting a logistic model, it’s crucial that the response variable be categorical, with two levels. It happens that in the `Whickham` data, the `outcome` variable fits the bill: the levels are `Alive` and `Dead`.

The `glm` software will automatically recode the response variable as 0/1. The question is, which level gets mapped to 1? In some sense, it makes no difference, since there are only two levels. But if you’re talking about the probability of dying, it’s nice not to mistake that for the probability of staying alive. So make sure that you know which level in the response variable corresponds to 1: it’s the second level.
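A quick way to see the ordering is `levels()`, which lists the levels in order; the second one is the one coded as 1:

```
levels(Whickham$outcome)
```

```
## [1] "Alive" "Dead"
```

Since `Dead` is the second level, the fitted probabilities from `mod` describe the probability of dying.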

Here is an easy way to make sure which level has been coded as “Yes”. First, fit the all-cases-the-same model, `outcome ~ 1`. The fitted model value \(P\) from this model will be the proportion of cases for which the outcome was “Yes.”

`mod2 = glm( outcome ~ 1, data = Whickham, family = "binomial")`

`fitted(mod2)`

Every case gets the same fitted value: about 0.28. So, 28% of the cases were “Yes.” But which of the two levels is “Yes”? Find out just by looking at the proportion of each level:

`tally( ~ outcome, data = Whickham, format = "proportion")`

```
##
## Alive Dead
## 0.7191781 0.2808219
```

If you want to dictate which of the two levels is going to be encoded as 1, you can use a comparison operation to do so:

```
mod3 = glm(outcome == "Alive" ~ 1, data = Whickham,
           family = "binomial")
mod3
```

In this model, “Yes” means `Alive`.
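As a check, the fitted values from `mod3` should all be about 0.719, the `Alive` proportion from the tally above:

```
# every case gets the same fitted value: the proportion Alive
head(fitted(mod3))
```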

## 16.4 Analysis of Deviance

The same basic logic used in analysis of variance applies to logistic regression, although the quantity being broken down into parts is not the sum of squares of the residuals but, rather, the **deviance**.

The `anova` software will take apart a logistic model, term by term, using the order specified in the model.

`anova(mod, test="Chisq")`

```
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: outcome
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 1313 1560.32
## age 1 613.81 1312 946.51 <2e-16 ***
## smoker 1 1.49 1311 945.02 0.2228
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Notice the second argument, `test="Chisq"`, which instructs `anova` to calculate a p-value for each term. This involves a slightly different test than the F test used in linear-model ANOVA.

The format of the ANOVA table for logistic regression is somewhat different from that used in linear models, but the concepts of degrees of freedom and partitioning still apply. The basic idea is to ask whether the reduction in deviance accomplished with the model terms is greater than what would be expected if random terms were used instead.
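Concretely, each p-value in the table is the upper tail of a chi-squared distribution with the term’s degrees of freedom, so the table’s p-values can be reproduced with `pchisq()` (small differences come from the rounding of the displayed deviances):

```
# smoker reduced the deviance by 1.49 on 1 degree of freedom
pchisq(1.49, df = 1, lower.tail = FALSE)
# age reduced the deviance by 613.81 on 1 degree of freedom;
# the upper-tail probability is essentially zero
pchisq(613.81, df = 1, lower.tail = FALSE)
```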