Chapter 2 Data in R
Many of the R functions used in this text are written especially for the text to enhance convenience and clarity of purpose. To access these functions, you will need to load the mosaic package at the beginning of your session. Loading a package is simple:
require(mosaic)
You need to do this only once in each session of R, and on systems such as RStudio the package will generally be reloaded automatically. (If you get an error message, it’s likely that the mosaic package has not been installed on your system. Use the package installation menu in R to install mosaic, after which the require() function will load the package.)
mosaic itself loads other packages it in turn depends on. If a command you see in this text does not work for you, be sure that
mosaic is loaded.
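If you are unsure whether the package is installed, you can check before loading it. The following is a minimal sketch using base R's requireNamespace(); the variable name have_mosaic is only an illustrative choice.

```r
# Check whether the mosaic package is installed, without attaching it yet.
have_mosaic <- requireNamespace("mosaic", quietly = TRUE)

if (have_mosaic) {
  # Attach the package so its functions can be called by name.
  library(mosaic)
} else {
  # install.packages() fetches the package from CRAN; run it once,
  # then load the package again.
  message("mosaic is not installed; try install.packages(\"mosaic\")")
}
```

Either branch leaves the session usable, so the check can sit at the top of a script without stopping it.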
Data used in statistical modeling are usually organized into tables, often created using spreadsheet software. Most people presume that the same software used to create a table of data should be used to display and analyze it. This is part of the reason for the popularity of spreadsheet programs such as Excel and Google Spreadsheets.
For serious statistical work, it’s helpful to take another approach that strictly separates the processes of data collection and of data analysis: use one program to create data files and another program to analyze the data stored in those files.
By doing this, one guarantees that the original data are not modified accidentally in the process of analyzing them. This also makes it possible to perform many different analyses of the data; modelers often create and compare many different models of the same data.
2.2 Reading Tabular Data into R
Data are central to statistics, and the tabular arrangement of data is very common. Accordingly, R provides a large number of ways to read in tabular data. These vary depending on how the data are stored, where they are located, etc., but they generally take on similar forms that will become familiar to you with use.
This text makes use of several datasets, and most of these are available to you in the package mosaicData. You can load this into your workspace with the now familiar command
require(mosaicData)
or by checking the box next to mosaicData in the packages tab in RStudio. Once this is done, you can refer to a dataset by its name in mosaicData (clicking on mosaicData in the packages tab in RStudio will bring up an index of names and associated codebooks).
An often used classic dataset residing in mosaicData is the height data collected by Sir Francis Galton. The following command shows the first few records:
head(Galton)
##   family father mother sex height nkids
## 1      1   78.5   67.0   M   73.2     4
## 2      1   78.5   67.0   F   69.2     4
## 3      1   78.5   67.0   F   69.0     4
## 4      1   78.5   67.0   F   69.0     4
## 5      2   75.5   66.5   M   73.5     4
## 6      2   75.5   66.5   M   72.5     4
Although mosaicData contains many of our text’s datasets, it does not contain all of them, and you will want to analyze your own data, generated and then stored in tabular form. The most common method of reading tabular data, for the purposes of this book, is the R function read.csv() which, not surprisingly, reads in .csv (comma-separated values) files. These are text files that can be generated by a spreadsheet. read.csv() imports tabular data (in .csv format) into R from anywhere on your computer or on the web.
Reading in a data table with read.csv() is simply a matter of knowing the name and location of the file. For instance, one data table used in examples in this book is swim100m.csv. All of the .csv files of data mentioned in the text are available on the web at http://tiny.cc/mosaic/, so to read in this data table and create an object in R that contains the data, use a command like this:
Swim <- read.csv("http://tiny.cc/mosaic/swim100m.csv")
The part of this command that requires creativity is choosing a name for the R object that will hold the data. In the above command it is called
Swim, but you might prefer another name, e.g.,
Sdata or even
Ralph. Beginning with a capital letter is standard practice, but not required. Remember, R is case sensitive. Of course, it’s sensible to choose names that are short, easy to type and remember, and remind you what the contents of the object are about.
To help you identify data tables that can be accessed through
read.csv(), examples in this book will be marked with a flag containing the name of the file.
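The same function reads files stored on your own computer. As a sketch, the example below types in six cases of the swim100m data (values copied from the listing of Swim in this chapter), saves them to a temporary file standing in for a spreadsheet export, and reads that file back. The names csv_text, csv_path, and SwimHead are illustrative choices, not names used elsewhere in the text.

```r
# Six cases of swim100m.csv written out as comma-separated text.
csv_text <- "year,time,sex
1905,65.8,M
1908,65.6,M
1910,62.8,M
1912,61.6,M
1918,61.4,M
1920,60.4,M"

# Save the text to a throwaway file, as a spreadsheet export would.
csv_path <- tempfile(fileext = ".csv")
writeLines(csv_text, csv_path)

# read.csv() takes the file location (a path or a URL) as a quoted string
# and returns the table as an R object.
SwimHead <- read.csv(csv_path)
SwimHead
```

The first line of the file supplies the variable names; each later line becomes one case.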
2.3 Data Frames
The type of R object created by read.csv() is called a data frame and is essentially a tabular layout. To illustrate, here are the first several cases of the Swim data frame created by the previous use of read.csv():
head(Swim)
##   year time sex
## 1 1905 65.8   M
## 2 1908 65.6   M
## 3 1910 62.8   M
## 4 1912 61.6   M
## 5 1918 61.4   M
## 6 1920 60.4   M
What do you think a function might be called that prints out the last several cases? Try it.
Note that the
head() function, one of several functions that operate on data frames, takes the R object that you created, not the quoted name of the data file.
Data frames, like tabular data generally, involve variables and cases. In R, each of the variables is given a name. You can refer to the variable by name in a couple of different ways. To see the variable names in a data frame, something you might want to do to remind yourself of how names are spelled and capitalized, use the names() function:
names(Swim)
## [1] "year" "time" "sex"
Another way to get quick information about the variables in a data frame is with summary():
summary(Swim)
##       year           time       sex
##  Min.   :1905   Min.   :47.84   F:31
##  1st Qu.:1924   1st Qu.:53.64   M:31
##  Median :1956   Median :56.88
##  Mean   :1952   Mean   :59.92
##  3rd Qu.:1976   3rd Qu.:65.20
##  Max.   :2004   Max.   :95.00
To see how many cases there are in a data frame, use nrow():
nrow(Swim)
## [1] 62
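Two other base R functions, dim() and str(), give a compact overview of a data frame. They are not used in the text itself; the sketch below applies them to six cases of the swim data, typed in by hand under the illustrative name SwimHead.

```r
# Six cases of the swim data, entered directly as a data frame.
SwimHead <- data.frame(
  year = c(1905, 1908, 1910, 1912, 1918, 1920),
  time = c(65.8, 65.6, 62.8, 61.6, 61.4, 60.4),
  sex  = c("M", "M", "M", "M", "M", "M")
)

# dim() reports the number of cases, then the number of variables.
dim(SwimHead)

# str() prints one line per variable: its name, type, and first values.
str(SwimHead)
```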
2.4 Variables in Data Frames
Perhaps the most common operation on a data frame is to refer to the values in a single variable. The two ways you will most commonly use involve functions with a data = argument and the direct use of the $ notation.
The $ notation is the most basic, if not the most intuitive, way of referring to a variable in a data frame. Here we find the mean record time (time) in the dataset we’ve named Swim:
mean(Swim$time)
## [1] 59.92419
Think of this as referring to the variable by both its family name (the data frame’s name,
Swim) and its given name (
time), something like Clinton$Hillary.
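To see what the $ notation hands back, here is a small sketch on six cases of the swim data, typed in by hand as SwimHead (an illustrative name). It also shows the double-bracket form, a base R alternative that takes the variable name as a quoted string.

```r
# Six cases of the swim data, entered directly as a data frame.
SwimHead <- data.frame(
  year = c(1905, 1908, 1910, 1912, 1918, 1920),
  time = c(65.8, 65.6, 62.8, 61.6, 61.4, 60.4),
  sex  = c("M", "M", "M", "M", "M", "M")
)

# The $ notation extracts one variable as an ordinary vector ...
SwimHead$time

# ... which can be passed to any function that works on vectors.
mean_time <- mean(SwimHead$time)
mean_time

# Double brackets with a quoted name extract the same vector.
same <- identical(SwimHead$time, SwimHead[["time"]])
same
```

Note that the mean here is for six cases only, so it differs from the value for the full Swim data frame.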
Most of the statistical modeling functions you will encounter in this book are designed to work with data frames and allow you to refer directly to variables within a data frame. For instance:
mean( ~ time, data = Swim)
## [1] 59.92419
min( ~ time, data = Swim)
## [1] 47.84
The data = argument tells the function which data frame to pull the variable from. The use of the tilde (~) identifies the first argument as a model formula, which is necessary if the data = argument is to be used. Leaving off that argument or the tilde leads to an error.
The advantage of the
data = approach becomes evident when you construct statements that involve more than one variable within a data frame. For instance, here’s a calculation of the mean time separately for the different sexes:
mean( time ~ sex, data = Swim )
##        F        M
## 65.19226 54.65613
mean( Swim$time ~ Swim$sex )
##        F        M
## 65.19226 54.65613
You will see much more of the tilde starting in the chapter on simple models. It’s the R notation for “broken down by” or “versus.”
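For comparison, base R can do the same "broken down by" calculation without the mosaic package. The sketch below uses tapply() and aggregate() on five cases of the kidsfeet data (values copied from the sampling listing later in this chapter); KidsFive, by_sex, and agg are illustrative names.

```r
# Five cases of the kidsfeet data: foot length (length) and sex (B or G).
KidsFive <- data.frame(
  name   = c("Laura", "Lee", "Mike", "Heather", "Josh"),
  length = c(24.0, 26.7, 24.2, 25.5, 25.2),
  sex    = c("G", "G", "B", "G", "B")
)

# tapply() applies mean() to length separately within each level of sex.
by_sex <- tapply(KidsFive$length, KidsFive$sex, mean)
by_sex

# aggregate() accepts the same formula and data = style:
# read "length ~ sex" as "length broken down by sex".
agg <- aggregate(length ~ sex, data = KidsFive, FUN = mean)
agg
```

Both calls give one mean per group; they differ only in the shape of the result (a named vector versus a small data frame).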
The ability of mean(), min(), median(), and similar functions to handle the data = format is provided by the mosaic package. When you encounter a function that can’t handle the data = format, use the $ notation.
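As an illustration, base R's which.min() works on a plain vector and has no data = argument. The sketch below, on six hand-entered cases of the swim data (SwimHead is an illustrative name), refers to the variables with the $ notation and, as an alternative not covered in the text, with base R's with().

```r
# Six cases of the swim data, entered directly as a data frame.
SwimHead <- data.frame(
  year = c(1905, 1908, 1910, 1912, 1918, 1920),
  time = c(65.8, 65.6, 62.8, 61.6, 61.4, 60.4),
  sex  = c("M", "M", "M", "M", "M", "M")
)

# which.min() gives the position of the smallest value in a vector,
# so the data frame is named explicitly with $.
fastest_year <- SwimHead$year[which.min(SwimHead$time)]
fastest_year

# with() evaluates an expression using the data frame's variables,
# which avoids repeating the data frame's name.
fastest_year_2 <- with(SwimHead, year[which.min(time)])
fastest_year_2
```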
2.5 Adding a New Variable
Sometimes you will compute a new quantity from the existing variables and want to treat this as a new variable. Adding a new variable to a data frame can be done with the
$ notation. For instance, here is how to create a new variable in
Swim that holds the
time converted from seconds to minutes:
Swim$minutes = Swim$time/60
The new variable appears just like the old ones:
head(Swim, n = 3L)
##   year time sex  minutes
## 1 1905 65.8   M 1.096667
## 2 1908 65.6   M 1.093333
## 3 1910 62.8   M 1.046667
You could also, if you want, redefine an existing variable, for instance:
Swim$time = Swim$time/60
head(Swim, n = 3L)
##   year     time sex  minutes
## 1 1905 1.096667   M 1.096667
## 2 1908 1.093333   M 1.093333
## 3 1910 1.046667   M 1.046667
Such assignment operations do not change the original file (e.g. the swim100m.csv file) from which the data were read, only the data frame in the current session of R. This is an advantage, since it means that your data in the data file stay in their original state and therefore won’t be corrupted by operations made during analysis.
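You can confirm this for yourself. The following sketch writes six cases of the swim data to a temporary file, adds a variable to the data frame read from it, and then reads the file a second time. The names csv_path, Swim6, and Fresh are illustrative, and the temporary file merely stands in for swim100m.csv.

```r
# Six cases of the swim data, saved to a throwaway CSV file.
SwimHead <- data.frame(
  year = c(1905, 1908, 1910, 1912, 1918, 1920),
  time = c(65.8, 65.6, 62.8, 61.6, 61.4, 60.4),
  sex  = c("M", "M", "M", "M", "M", "M")
)
csv_path <- tempfile(fileext = ".csv")
write.csv(SwimHead, csv_path, row.names = FALSE)

# Read the file and add a new variable to the data frame in the session.
Swim6 <- read.csv(csv_path)
Swim6$minutes <- Swim6$time / 60
names(Swim6)

# Reading the file again shows that it still holds only the original variables.
Fresh <- read.csv(csv_path)
names(Fresh)
```

Changes reach a file only if you write the data frame out explicitly, for example with write.csv().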
2.6 Sampling from a Sampling Frame
Much of statistical analysis is concerned with the consequences of drawing a sample from the population. Ideally, you will have a sampling frame that lists every member of the population from which the sample is to be drawn. With this in hand, you could treat the individual cases in the sampling frame as if they were cards in a deck. To pick your random sample, shuffle the deck and deal out the desired number of cards.
When doing real work in the field, you would use the randomly dealt cards to locate the real-world cases they correspond to. Sometimes in this book, however, in order to let you explore the consequences of sampling, you will select a sample from an existing data set. The
deal() function performs this, taking as an argument the data frame to be used in the selection and the number of cases to sample.
For example, the
kidsfeet.csv data set has \(n=39\) cases.
Kids <- read.csv("http://tiny.cc/mosaic/kidsfeet.csv")
nrow(Kids)
## [1] 39
Here’s how to take a random sample of five of the cases:
deal(Kids, 5)
##       name birthmonth birthyear length width sex biggerfoot domhand orig.id
## 23   Laura          9        88   24.0   8.3   G          R       L      23
## 19     Lee          6        88   26.7   9.0   G          L       L      19
## 29    Mike         11        88   24.2   8.9   B          L       R      29
## 20 Heather          3        88   25.5   9.5   G          R       R      20
## 4     Josh          1        88   25.2   9.8   B          L       R       4
The results returned by deal() will never contain the same case more than once, just as if you were dealing cards from a shuffled deck. In contrast, resample() replaces each case after it is dealt so that it can appear more than once in the result. You wouldn’t want to do this to select from a sampling frame, but it turns out that there are valuable statistical uses for this sort of sampling with replacement. You’ll make use of resampling in a later chapter.
resample(Kids, 5)
##          name birthmonth birthyear length width sex biggerfoot domhand orig.id
## 17   Caroline         12        87   24.0   8.7   G          R       L      17
## 39     Alisha          9        88   24.6   8.8   G          L       R      39
## 13        Cal          8        87   26.1   9.1   B          L       R      13
## 35      Peter          4        88   24.7   8.6   B          R       L      35
## 17.1 Caroline         12        87   24.0   8.7   G          R       L      17
Notice that Caroline was sampled twice.
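The difference between the two kinds of sampling can also be sketched in base R by drawing row numbers with sample(). This is an illustration of the idea on five hand-entered cases of the kidsfeet data, not necessarily how mosaic implements deal() and resample(); KidsFive, dealt, and redrawn are illustrative names, and the seed is arbitrary.

```r
# Five cases of the kidsfeet data, entered directly as a data frame.
KidsFive <- data.frame(
  name   = c("Laura", "Lee", "Mike", "Heather", "Josh"),
  length = c(24.0, 26.7, 24.2, 25.5, 25.2),
  sex    = c("G", "G", "B", "G", "B")
)
n <- nrow(KidsFive)

set.seed(1)  # arbitrary seed so the draws can be repeated

# Without replacement: each row number can be drawn at most once,
# like dealing cards from a shuffled deck.
dealt <- KidsFive[sample(n, size = 3), ]
dealt

# With replacement: a row number goes back in the deck after each draw,
# so the same case may appear more than once.
redrawn <- KidsFive[sample(n, size = 5, replace = TRUE), ]
redrawn
```

When a row is drawn more than once, R marks the repeats with suffixed row names such as 17.1, as in the Caroline example above.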