Chapter 11 Modeling randomness

One of the most surprising outcomes of the revolution in computing technology has been the discovery of diverse uses for randomness in the analysis of data and in science generally. Most young people have little trouble with the idea of a computer generating random data; they see it in computer games and simulations. Older people, raised with the idea that computers do mathematical operations and that such operations are completely deterministic, sometimes find computer-generated randomness suspect. Indeed, conventional algebraic notation (\(+\), \(-\), \(\sqrt{ }\), \(\cos\), and so on) has no notation for “generate at random.”

One of the simplest operators for generating random events is resample(). It takes two arguments: the first is a set of items to choose from at random, the second is how many events to generate. Each item is equally likely to be chosen. For example, here is a simulation of a coin flip:

coin <- c("H","T")
resample(coin, 5)
## [1] "T" "T" "H" "T" "T"

The first command creates an object named coin that holds the possible outcomes of each event. The second command generates five events, each one a random choice from the outcomes in coin.

Another example is rolling dice. First, construct a set of the possible outcomes: the numbers 1, 2, 3, 4, 5, 6.

die <- seq(1,6)
die
## [1] 1 2 3 4 5 6

Then generate random events. Here is a roll of two dice.

resample(die,2)
## [1] 3 5

The resample() function is also useful for selecting cases at random from a data frame. You have already seen it put to use in generating sampling distributions via the bootstrap. This technique will be further developed in later chapters.
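
To illustrate, here is a minimal sketch of drawing rows at random from a data frame. It assumes the mosaic package (which supplies resample()) is loaded and uses R's built-in mtcars data frame purely as an example; since the rows are chosen at random, your output will differ from run to run.

library(mosaic)            # provides resample()
resample(mtcars, 3)        # three rows drawn at random, with replacement
boot <- resample(mtcars)   # with no size given: a bootstrap resample of the rows
nrow(boot)                 # same number of rows as mtcars (32)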

11.1 Random Draws from Probability Models

Although resample() is useful for random sampling, it can work only with finite sets of possible outcomes such as H/T or 1/2/3/4/5/6 or the cases in a data frame. By default in resample(), the underlying probability model is equiprobability — each possible outcome is equally likely. You can specify another probability model by using the prob = argument to resample(). For instance, to flip coins that are very likely (90% of the time, on average) to come up heads:

resample( coin, 10, prob = c(.9,.1))
##  [1] "T" "H" "H" "T" "T" "H" "H" "H" "T" "H"

R provides other operators that allow draws to be made from outcome sets that are infinite.

For example, the rnorm() function makes random draws from a normal probability distribution. The required first argument tells how many draws to make. Optional named arguments let you specify the mean and standard deviation of the particular normal distribution that you want. To illustrate, here is a set of 15 random numbers from a normal distribution with mean 1000 and standard deviation 75:

samps <- rnorm(15, mean = 1000, sd = 75)
samps
##  [1] 1171.4984  895.8354  979.0908  990.0009 1047.6963  978.6810  800.7658  816.9650 1099.0085
## [10]  977.0021  866.4019  987.1062 1091.1006 1142.1395  967.7148

In this example, the output was assigned to an object named samps so that additional computations can be carried out on the values. For instance, here are the mean and standard deviation of the sample:

mean(samps)
## [1] 987.4005
sd(samps)
## [1] 110.9166

Don’t be surprised that the mean and standard deviation of the sample don’t match exactly the parameters that were set with the arguments mean = 1000, sd = 75. The sample was drawn at random and so the sample statistics are going to vary from one sample to the next. Part of the statistical methodology to be studied in later chapters has to do with determining how close the statistics calculated from a sample are likely to be to the parameters of the underlying population.
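
To see this sample-to-sample variation directly, you can draw several samples and compute the mean of each. Here is one way to do that, using base R's replicate(); your numbers will differ, since each sample is random.

# The mean of each of five separate random samples of size 15:
# each value will be near 1000, but none will match it exactly.
replicate(5, mean(rnorm(15, mean = 1000, sd = 75)))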

Often you will generate very large samples. In these situations you usually won't want to display all the values, just do calculations with them. The practical limits of “large” depend on the computer you are using and how much time you are willing to spend on a calculation. For an operator like rnorm and the others introduced in this chapter, it's feasible to generate samples of size 10,000 or 100,000 on an ordinary laptop computer.

samps <- rnorm(100000, mean = 1000, sd = 75)
mean( samps )
## [1] 999.6824
sd( samps )
## [1] 75.23447

Notice that the sample mean and standard deviation are quite close to the population parameters in this large sample. (Remember not to use commas as punctuation in large numbers in R: write 100000, not 100,000.)

The simulations that you will do in later chapters will be much more elaborate than the simple draws here. Because each trial will involve much more computation, even with today's computers you will often want to use only a few hundred trials.

11.2 Standard Probability Models

R provides a large set of operators like rnorm for different probability models. All of these operators work in the same way:

  • Each has a required first argument that gives the number of draws to make.

  • Each has a set of optional named arguments that set the parameters of the particular probability distribution you want.

All the operators start with the letter r — standing for “random” — followed by the name of the probability model:

Family        R name   Parameters                                                        Nature
Normal        rnorm    mean, sd                                                          continuous
Uniform       runif    min, max                                                          continuous
Binomial      rbinom   size, prob                                                        discrete
Poisson       rpois    lambda (the average rate)                                         discrete
Exponential   rexp     rate (the same rate as in the Poisson model)                      continuous
Lognormal     rlnorm   meanlog, sdlog (the mean and sd of the natural logarithm)         continuous
\(\chi^2\)    rchisq   df (degrees of freedom)                                           continuous
t             rt       df (degrees of freedom)                                           continuous
F             rf       df1, df2 (degrees of freedom in the numerator and denominator)    continuous
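
As a quick illustration of this common pattern, here is the uniform model from the table above: the first argument gives the number of draws and the named arguments set the parameters. (The output is random, so yours will differ.)

# Five draws from a uniform distribution on the interval from 0 to 10.
runif(5, min = 0, max = 10)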

To use these operators, you first must choose a particular probability model based on the setting that applies in your situation. This setting will usually indicate what the population parameters should be. Some examples:

  • You are in charge of a hiring committee that is going to interview three candidates selected from a pool of job applicants that is 63% female. How many of the interviewees will be female? Modeling this as random selection from the applicant pool, a binomial model is appropriate. The size of each trial is 3 and the probability of being female is 0.63:
samps <- rbinom(40, size = 3, prob = 0.63)
samps
##  [1] 2 2 2 3 2 2 3 2 1 1 2 1 3 2 2 2 2 1 2 2 1 3 2 3 2 2 3 3 1 3 2 2 2 1 1 2 1 3 1 2

There are 40 trials here, since the first argument was set to 40. Remember, each of the trials is a simulation of one hiring event. In the first simulated event, two of the three interviewees were female; in the fourth, all three were; in the ninth, only one was. Typically, you will summarize all the simulations, for example to see how likely each possible outcome is:

table(samps)
## samps
##  1  2  3 
## 10 21  9
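To turn these counts into estimated probabilities, divide by the number of trials; with the 40 trials above this gives:
prop.table( table(samps) )
## samps
##     1     2     3 
## 0.250 0.525 0.225
So, in this simulation, about half of the hiring events had two female interviewees, about a quarter had only one, and about a quarter had all three.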
  • You want to simulate the number of customers who come into a store over the course of an hour. The average rate is 15 per hour. To simulate a situation where customers arrive randomly, the Poisson model is appropriate:
rpois(25, lambda = 15)
##  [1] 17 15 19 25 10 17 15 12 11 15 19 17 13 13 13 12 18 14 21 17 15 11 17 18 11
  • You want to generate a simulation of the interval between earthquakes, as in the earthquake-intervals example. To simulate the random intervals with a typical rate of 0.03 earthquakes per year, you would use:
rexp( 15, rate = 0.03 )
##  [1]  49.5057149  37.6415609 177.9802139   0.1406587  29.8780978  42.6115631   0.1706456  40.2302083
##  [9]  45.6183367 210.6326415  63.0320137  35.7448324   2.7317871  16.6463322  89.4471805

Notice the huge variation in the intervals, from less than two months to more than 210 years between earthquakes.

11.3 Quantiles and Coverage Intervals

You will often need to compute coverage intervals in order to describe the range of likely outcomes from a random process. R provides a family of operators for this purpose: a separate operator for each named probability model. The operators all begin with q, standing for quantile. In all cases, the first argument specifies which quantiles you want, given as probabilities between 0 and 1; the optional named arguments set the parameters of the distribution.

Remember that to find a 95% coverage interval you need the 0.025 and 0.975 quantiles. For a 99% interval, you need the 0.005 and 0.995 quantiles.

To illustrate, here are 95% coverage intervals for a few probability models.

  • A normal distribution with mean 0 and standard deviation 1:
qnorm( c(0.025, 0.975), mean = 0, sd = 1)
## [1] -1.959964  1.959964
  • The hiring committee situation modelled by a binomial distribution with size = 3 and prob = 0.63:
qbinom( c(0.025, 0.975), size = 3, prob = 0.63)
## [1] 0 3

Perhaps you are surprised to see that the coverage interval includes all the possible outcomes. That’s because the number of cases in each trial (\(n = 3\)) is quite small.

  • The number of customers entering a store during an hour, as modelled by a Poisson distribution with an average rate of 15 per hour:
qpois( c(0.025, 0.975), lambda = 15)
## [1]  8 23
  • The interval between earthquakes modelled by an exponential distribution with a typical rate of 0.03 earthquakes per year:
qexp( c(.025, .975), rate = 0.03)
## [1]   0.8439269 122.9626485
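
Similarly, for a 99% coverage interval you would use the 0.005 and 0.995 quantiles. For the normal distribution with mean 0 and standard deviation 1:

qnorm( c(0.005, 0.995), mean = 0, sd = 1)
## [1] -2.575829  2.575829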

You can also use the q operators to find the value that would be at a particular percentile. For example, the exponential model with rate = 0.03 gives the 25th percentile of the interval between earthquakes as:

qexp( .25, rate = 0.03)
## [1] 9.589402

A quarter of the time, the interval between earthquakes will be 9.59 years or less.

It's entirely feasible to approximate percentiles and coverage intervals by combining the random-number generators with qdata(). For example, here is the 95% coverage interval from a normal distribution with mean 0 and standard deviation 1:

samps <- rnorm(10000, mean = 0, sd = 1)
qdata( samps, c(.025, .975) )
##        quantile     p
## 2.5%  -1.961533 0.025
## 97.5%  1.892216 0.975

The disadvantage of this approach is that it is a simulation, so the results will vary randomly from run to run. By making the sample size large enough — here it is \(n = 10000\) — you can reduce the random variation. The q operators, in contrast, use mathematical analysis to give you what is effectively an infinite sample size, so it's advisable to use them when you can. However, for many of the techniques to be introduced in later chapters you will have to generate a random sample and then apply qdata() to approximate the coverage intervals.

11.4 Percentiles

A percentile computation applies to situations where you have a measured value and you want to know where that value ranks relative to the entire set of possible outcomes. You have already seen percentiles computed from samples; they also apply to probability models.

It's easy to confuse percentiles with quantiles because they are so closely related. Mathematically, the percentile operators are the inverses of the quantile operators. To remember which is which, distinguish them by the type of argument that you give to the operator:

  • Percentile (e.g. pnorm) The input argument is a measured value, something that could be the output of a single draw from the probability distribution. The output is always a number between 0 and 1 — a percentile.

  • Quantile (e.g. qnorm) The input is a percentile, a number between 0 and 1. The output is on the scale of the measured variable.
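
Because the two operations are inverses, applying one and then the other returns the value you started with. For instance, using a normal model with mean 600 and standard deviation 100 (the test-score model in the example that follows):

qnorm( pnorm(670, mean = 600, sd = 100), mean = 600, sd = 100 )
## [1] 670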

Example: You have just gotten your score, 670, on a professional school admissions test. According to the information published by the testing company, the scores are normally distributed with a mean of 600 and a standard deviation of 100. So, your ranking on the test, as indicated by a percentile, is:

pnorm(670, mean = 600, sd = 100)
## [1] 0.7580363

Your score is at about the 75th percentile.

Example: Unfortunately, the professional school that you want to go to accepts only students with scores in the top 15 percent, that is, at or above the 85th percentile. Your score, at the 75.8th percentile, isn't good enough. So, you will study some more and take practice tests until your score improves. How well will you need to score to reach the 85th percentile?

qnorm(0.85, mean = 600, sd = 100)
## [1] 703.6433
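
So a score of about 704 will put you at the 85th percentile.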