Workshop overview and materials

Workshop description

This is an intermediate/advanced R course appropriate for those with basic knowledge of R. It is intended for those already comfortable with using R for data analysis who wish to move on to writing their own functions. To the extent possible this workshop uses real-world examples: concepts are introduced as they are needed for a realistic analysis task. In the course of working through a realistic project we will learn about interacting with web services, regular expressions, iteration, functions, control flow, and more.

Prerequisite: basic familiarity with R, such as acquired from an introductory R workshop.

Materials and setup

Everyone should have R installed – if not:

Materials for this workshop include slides, example data sets, and example code.

Launch RStudio on Athena

  • To start R type these commands in the terminal:
  • Open up today’s R script
    • In RStudio, go to File => Open File
    • Locate and open the RProgramming.R script in the Rintro folder in your home directory
  • Go to Tools => Set working directory => To source file location (more on the working directory later)
  • I encourage you to add your own notes to this file!

Example project overview

Throughout this workshop we will return to a running example that involves acquiring, processing, and analyzing data from the Displaced Worker Survey (DWS). In this context we will learn about finding and using R packages, importing and manipulating data, writing functions, and more. The web page has been mirrored at http://tutorials.iq.harvard.edu/cps-uniform-data-extracts/cps-displaced-worker-survey/cps-dws-data for convenience.

Extracting elements from html

It is common for data to be made available on a website somewhere, whether by a government agency, research group, or other organization. Often the data you want is spread over many files, and retrieving it all one file at a time is tedious and time-consuming. Such is the case with the CPS data we will be using today.

The Center for Economic and Policy Research has helpfully compiled DWS data going back to 1994 [1]. Although we could open a web browser and download these files one at a time, it will be faster and easier to instruct R to do that for us. Doing it this way will also give us an excuse to talk about html parsing, regular expressions, package management, and other useful techniques.

Our goal is to download all the Stata data sets from http://tutorials.iq.harvard.edu/cps-uniform-data-extracts/cps-displaced-worker-survey/cps-dws-data/. In order to do that we need a list of the Uniform Resource Locators (URLs) of those files. The URLs we need are right there as links in the ceprdata.org webpage. All we have to do is read that data in a way R can understand.

Packages for parsing html

In order to extract the data URLs from the ceprdata.org website we need a package for parsing XML and HTML. How do we find such a package?

  • Task views: https://cran.r-project.org/web/views/WebTechnologies.html
  • R package search: http://www.r-pkg.org/search.html?q=html+xml
  • Web search: https://www.google.com/search?q=R+parse+html+xml&ie=utf-8&oe=utf-8

For parsing html in R I recommend either the xml2 package or the rvest package, with the former being more flexible and the latter being more user-friendly. Let’s use the friendlier one.

Extracting information from web pages with the rvest package

Our first task is to read the web page into R. We can do that using the read_html function. Next we want to find all the links (the <a> tags) and extract their href attributes. To give a better sense of this, here is what the html for the 2010 data file link looks like:
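
The markup below is a sketch rather than a verbatim copy of the page (the link text in particular is assumed), but the anchor tag is essentially:

<a href="/wp-content/cps/data/cepr_dws_2010_dta.zip">cepr_dws_2010_dta.zip</a>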

We want the href part, i.e., “/wp-content/cps/data/cepr_dws_2010_dta.zip”.

We can get all the <a> elements using the html_nodes function, and then extract the href attributes using the html_attr function, like this:

## install.packages("rvest")
library(rvest)

## read the web page into R
dataPage <- read_html("http://tutorials.iq.harvard.edu/cps-uniform-data-extracts/cps-displaced-worker-survey/cps-dws-data/")

## find the link ("a") elements.
allAnchors <- html_nodes(dataPage, "a")
head(allAnchors, 15)

## extract the link ("href") attributes
allLinks <- html_attr(allAnchors, "href")
head(allLinks, 15)

Just the data please – regular expressions to the rescue

Looking at the output from the previous example you might notice a problem; we’ve matched all the URLs on the web page. Some of those (the ones that end in .zip) are the ones we want, others are menu links that we don’t want. How can we separate the data links from the other links on the page?

One answer is to use regular expressions to identify the links we want. Regular expressions are useful in general (not just in R!) and it is a good idea to be familiar with at least the basics. For our present purpose it will be more than enough to use a regular expression that matches strings starting with /wp and ending with .zip.

In regular expressions ^, ., *, and $ are special characters with the following meanings:

  • ^ matches the beginning of the string
  • . matches any character
  • * repeats the last character zero or more times
  • $ matches the end of the string
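
To make these concrete, here is a small illustration using made-up example strings (the pattern is the one we will use below):

## which of these made-up strings start with /wp and end with .zip?
grepl("^/wp.*\\.zip$",
      c("/wp-content/cps/data/file.zip", # TRUE
        "/about/",                       # FALSE: does not end in .zip
        "file.zip"))                     # FALSE: does not start with /wp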

If you have not been introduced to regular expressions yet a nice interactive regex tester is available at http://www.regexr.com/ and an interactive tutorial is available at http://www.regexone.com/.

R comes with a grep function that can be used to search for patterns in strings, but for more sophisticated string manipulation I recommend the stringi package. The function names are more verbose, but it provides much more complete and robust string handling than is available in base R. For our relatively simple needs grep will suffice, but if you need to do extensive string manipulation in R the stringi package is the way to go.

dataLinks <- grep("^/wp.*\\.zip$", allLinks, value = TRUE)
head(dataLinks)

(Note that the backslashes in the above example are used to escape the . so that it is matched literally instead of matching any character as it normally would in a regular expression.)
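
For comparison, here is a sketch of the same filtering using the stringi package (assuming it is installed); stri_subset_regex plays the role of grep with value = TRUE:

## install.packages("stringi")
library(stringi)
## keep only the links that match the pattern
stri_subset_regex(allLinks, "^/wp.*\\.zip$")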

Finally, the data links we’ve extracted are relative to the ceprdata website. To make them valid we need to prepend http://tutorials.iq.harvard.edu to each one. We can do that using the paste function.

dataLinks <- paste("http://tutorials.iq.harvard.edu", dataLinks, sep = "")
head(dataLinks)

Exercise 0: html parsing

The http://ceprdata.org/ website provides code books for the DWS data in .pdf format. Links to these code books are available on the documentation page at http://ceprdata.org/cps-uniform-data-extracts/cps-displaced-worker-survey/cps-dws-documentation/. Parse this page and extract the links to the code books.

Exercise 0 prototype

## read in the html page
ceprDoc <- read_html(
  "http://ceprdata.org/cps-uniform-data-extracts/cps-displaced-worker-survey/cps-dws-documentation/"
)
## get the codebook links
ceprCodeBookLinks <- html_attr(#extract attributes
  html_nodes(#from nodes
    ceprDoc,#in ceprDoc
    ## This uses an xpath expression to select just the codebook links.
    ## You could alternatively download all the links and filter them
    ## with a regular expression. Use whatever works and is comfortable!
    ## There is more than one right way.
    xpath = '//*[@id="content"]/article/div/ul[1]//a'),#matching this xpath
  'href' #extract href attributes
)

Downloading files in R

Now that we have a vector of URLs pointing to the data files we want to download, we want to iterate over the elements and download each file. We can use the download.file function to download the data files.

The download.file function requires a URL as the first argument, and a file name as the second argument. We can use the basename function to strip off the location part of the URL, leaving only the file name. We could do this verbosely by writing one line for each file:

## download.file(dataLinks[1], basename(dataLinks[1]))
## download.file(dataLinks[2], basename(dataLinks[2]))
## ...
## download.file(dataLinks[n], basename(dataLinks[n]))

but that is too much typing. First, it would be more convenient if the download.file function defaulted to destfile = basename(url). Fortunately it is very easy to write your own functions in R. We can write a wrapper around the download.file function like this:

## a simple function to make downloading files easier
downloadFile <- function(url, # url to download
                         destfile = basename(url), # default name to save to
                         outdir = "./dataSets/", # default directory to save in
                         ... # other named arguments passed to download.file
                         ){
  ## create output directory if it doesn't exist
  if(!dir.exists(outdir)) {
    dir.create(outdir)
  }
  ## download the file using the specified url, output directory, and file name
  download.file(url = url, destfile = paste(outdir, destfile, sep = ""), ...)
}
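
As a quick check of the wrapper, a single call would look like this (commented out here so it does not run unintentionally):

## downloadFile(dataLinks[1]) # saves into ./dataSets/ using the file's own name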

Using this function we can download the data files more conveniently, but we haven’t yet addressed how to avoid typing out separate function calls for each file we need to download. For that we need iteration.

Downloading all the files

To download all the files conveniently we want to iterate over the vector of URLs and download each one. We can carry out this iteration in several ways, including using a for loop, or using one of the apply family of functions.

for and while loops in R work much the same as they do in other programming languages. The apply family of functions apply a function to each element of an object.
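
As a minimal illustration of these styles (using a throwaway vector rather than our download task):

## loop style: print the square of each number
for(i in 1:3) print(i^2)

## while style: the same thing with an explicit counter
i <- 1
while(i <= 3) {
  print(i^2)
  i <- i + 1
}

## apply style: return the squares as a vector
sapply(1:3, function(i) i^2)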

Iterating using for-loop

One way to download the data files is to use a for-loop to iterate over the contents of our vector of URLs. Some people will tell you to avoid for-loops in R but this is nonsense. Loops are convenient and useful, and while they are not the best tool for all situations calling for iteration they are perfectly appropriate for downloading a series of files. If you’ve used a for loop in any other language you will probably find the R implementation to be very similar.

For loops in R have the following general structure: for(<placeholder> in <thing to iterate over>) {do stuff with placeholder}. In our case we want to iterate over dataLinks and download each one, so this becomes

str(dataLinks)

## make a directory to store the data
dir.create("dataSets")

for(link in dataLinks) {
    downloadFile(link, outdir = "dataSets/")
}

Iterating over vectors and lists with the sapply function

The sapply function iterates over a vector or list and applies a function to each element. To start, let’s use sapply to download all the displaced worker survey data files:

## download all the dws data
sapply(dataLinks,
       downloadFile,
       outdir = "dataSets/")

Iterating in parallel with the mclapply function

The mclapply function iterates over a vector or list and applies a function to each element using multiple CPU cores (where available; it relies on forking, so on Windows it can only use a single core). Let’s use mclapply to download all the displaced worker survey data files:

## download all the dws data
library(parallel)
mclapply(dataLinks,
         downloadFile,
         outdir = "dataSets/",
         mc.cores = detectCores())

We can now use what we’ve learned about iteration to unzip all the files in the dataSets directory, a task I leave to you.

Exercise 1: Iterate and extract

Use a for loop or *apply function to unzip each of the .zip files in the dataSets directory.

BONUS (optional): calculate the size of each extracted file and calculate the difference in size between each .dta file and the .zip file it was extracted from.

Exercise 1 prototype

zipFiles <- list.files("dataSets", pattern = "\\.zip$", full.names=TRUE)

## using sapply
sapply(zipFiles, unzip, exdir = "dataSets")
## using a for loop
for(f in zipFiles) unzip(f, exdir = "dataSets")

## Calculating compression ratios
dataFiles <- list.files("dataSets", pattern = "\\.dta$", full.names = TRUE)

uncompSize <- round(file.size(dataFiles) / 1024^2)
compSize <- round(file.size(zipFiles) / 1024^2)

cbind(zipFile = paste0(basename(zipFiles), ": ", compSize, "Mb"),
      dtaFile = paste0(basename(dataFiles), ": ", uncompSize, "Mb"),
      diff = paste0(round(uncompSize - compSize), "Mb"),
      compression_ratio = round(uncompSize / compSize, digits = 3))

Importing and inspecting data and meta-data

Our next goal is to read in the data that we downloaded and extracted earlier. The data are stored as Stata data sets, which can be imported using the read.dta function in the foreign package. Let’s start by reading just the first data set.

## attach the foreign package so we can read stata files
library(foreign)

## get a list of all the stata files in the dataSets directory
dataFiles <- list.files("dataSets", pattern = "\\.dta$", full.names=TRUE)

## read in the first one
ceprData1 <- read.dta(dataFiles[1])

Now that we’ve read in some of the data we want to get some more information about it.

Mode and length

Information about objects in R is stored as attributes of the object. All R objects have a storage mode and a length. Since all objects in R have these attributes we refer to them as intrinsic attributes. We can get the value of these intrinsic attributes using the mode and length functions respectively. For example, what is the mode and length of our ceprData1 object?

mode(ceprData1)
length(ceprData1)

Other properties of data.frames

So far we’ve seen that ceprData1 is a list of length 178. Actually ceprData1 is a special kind of list called a data.frame. We can see that by asking R what the class of the object is.

class(ceprData1)

A data.frame in R is a list whose elements are of equal length. It is a rectangular structure with rows and columns. In addition to the mode and length that all objects in R have, data.frames also have dimensions, (col)names, and rownames.

dim(ceprData1)
names(ceprData1)
c(head(rownames(ceprData1)), "...", tail(rownames(ceprData1)))

Additional attributes

OK, so far we know that ceprData1 is a data.frame with 156246 rows and 178 columns, and that the variables have terribly cryptic names like cjpporg and ljagric. What do we actually have here? One way to answer the question is

browseURL(ceprCodeBookLinks[1])

and that is a good answer actually. In our case the meta-data has also been embedded in the .dta files by our friends at ceprdata.org. This meta-data has been attached to the ceprData1 data.frame in the form of additional attributes.

The system used by R for storing attributes of this kind is simple. Arbitrary attributes can be set using the attr function, and retrieved using either attr or the attributes function. Let’s take a quick look at this system before using it to access the ceprData1 meta-data.

x <- 1:10
attributes(x)
attr(x, "description") <- "This is vector of integers from 1 to 10"
attributes(x)
attr(x, "how_many") <- "There are ten things in this vector"
attributes(x)
attr(x, "description")

As we’ve seen, additional attributes can be accessed via the attributes function. Let’s see what other attributes our ceprData1 object has.

ceprDataInfo <- attributes(ceprData1)
mode(ceprDataInfo)
class(ceprDataInfo)
length(ceprDataInfo)
names(ceprDataInfo)

Let’s iterate over the attributes of ceprData1 and get some more information about the available meta-data:

t(sapply(attributes(ceprData1),
         function(x) {
           c(mode = mode(x), class = class(x), length = length(x))
         })
  )

Extracting useful meta-data

We can extract elements from lists in a few different ways:

## extract by name
ceprDataInfo$datalabel #using $
ceprDataInfo["datalabel"] #using [
ceprDataInfo[["datalabel"]] #using [[, note the difference

## by position
ceprDataInfo[1]; ceprDataInfo[[1]] # note the difference

## by logical index
ceprDataInfo[c(TRUE, TRUE, rep(FALSE, 10))]
ceprDataInfo[sapply(ceprDataInfo, length) == 1]

Note that [ indexing on a list returns a list, and [[ indexing returns whatever is contained in a single element of the list. Hadley Wickham’s photo illustrating this distinction may help [2].
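
The difference also shows up in code. This quick check is not part of the original example, and assumes datalabel is stored as a character string (as it is for files read with read.dta):

class(ceprDataInfo["datalabel"])   # "list": [ returns a smaller list
class(ceprDataInfo[["datalabel"]]) # "character": [[ returns the element itself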

You may have noticed during our earlier investigation of ceprDataInfo that many of the elements have length 178. That number might be familiar:

dim(ceprData1)

Exercise 2

Extract elements from ceprDataInfo that will help you understand what each column in ceprData1 contains. Include at least the variable names and var.labels as well as any other information that you think will be useful.

Bonus (optional): supplement the ceprDataInfo you extracted in step one with the mode, class, etc. of each column in ceprData1

Exercise 2 prototype

ceprCodebook <- data.frame(
  ceprDataInfo[
    sapply(ceprDataInfo, length) == ncol(ceprData1)
  ])

ceprCodebook$mode <- sapply(ceprData1, mode)
ceprCodebook$class <- sapply(ceprData1, class)
ceprCodebook$n_distinct <- sapply(ceprData1, function(x) length(unique(x)))

rbind(head(ceprCodebook), tail(ceprCodebook))

Aggregation

Now that we have the data read in, and we know what is in each column, I want to calculate the proportion displaced by year/rural/gender. I can do that using the aggregate function (the data.table and dplyr packages provide advanced aggregation capabilities, but aggregate is available in base R and works well for many things).

ceprData1 <- aggregate(ceprData1["dw"],
                       by = ceprData1[c("year", "rural", "female")],
                       FUN = mean, na.rm = TRUE)
ceprData1

Exercise 3

Now that we have a process for importing and aggregating the data we can apply it to all the data files we downloaded earlier. We can do that by wrapping the read.dta and aggregate code in a function and applying that function to each element of dataFiles using the sapply function, or using a for loop. Go ahead and give it a try! Note that this exercise is intentionally challenging; read the documentation, search stackoverflow.com, and use any other resources at your disposal as you attempt it.

Exercise 3 prototype

library(foreign)

ceprData <- mclapply(dataFiles, function(x) {
  tmp <- read.dta(x)
  return(aggregate(tmp["dw"],
                   by = tmp[c("year", "rural", "female")],
                   FUN = mean, na.rm = TRUE))})

Finishing touches

We now have a list of aggregated data.frames. The next step is to stack each element of the list so that we end up with one big data.frame instead of a list of small ones. We can stack two data.frames using the rbind function:

ceprData[[1]]; ceprData[[2]]
rbind(ceprData[[1]], ceprData[[2]])

and we can apply this operation to every element in the list using the do.call function.

str(ceprData <- do.call("rbind", ceprData))

Data formatting

Our final step before plotting our data is to format the values for rural and female. Currently these values are stored as 0/1 dummy codes, but I would like for the values to be spelled out.

Earlier we saw how to extract elements of R objects using bracket notation. To replace elements we use the replacement form, which looks like this:

 ceprData[["rural"]] <- factor(ceprData[["rural"]],
                               levels = c(0, 1),
                               labels = c("Non-rural", "Rural"))

 ceprData[["gender"]] <- factor(ceprData[["female"]],
                               levels = c(1, 0),
                               labels = c("Female", "Male"))

ceprData$displaced_percent <- ceprData$dw * 100

str(ceprData)

Plotting

Now we can take a look at the trends in worker displacement over the last few years.

library(ggplot2)
library(directlabels)
ceprPlot <- ggplot(ceprData, aes(x = year, y = displaced_percent, color = gender)) +
  geom_line() +
  geom_point() +
  facet_wrap(~rural)
direct.label(ceprPlot)

What else?

If there is anything else you want to learn how to do, now is the time to ask!

Go forth and code!

You now know everything you could possibly want to know about R. OK maybe not! But you do know how to manipulate character strings with regular expressions, write your own functions, execute code conditionally, iterate using for or sapply, inspect and modify attributes, and extract and replace object elements. There’s a lot more to learn, but that’s a pretty good start. As you go forth and write your own R code here are some resources that may be helpful.

Additional reading and resources

  • http://heather.cs.ucdavis.edu/~matloff/R/RProg.pdf
  • Calling C and Fortran from R (by Charles Geyer, UMinn): http://www.stat.umn.edu/~charlie/rc/
  • State of the Art in Parallel Computing with R (Schmidberger et al.): http://www.jstatsoft.org/v31/i01/paper

Things that may surprise you

There are an unfortunately large number of surprises in R programming. Some of these “gotchas” are common problems in other languages, many are unique to R. We will only cover a few – for a more comprehensive discussion please see http://www.burns-stat.com/pages/Tutor/R_inferno.pdf

Floating point comparison

Floating point arithmetic is not exact:

.1 == .3/3

Solution: use all.equal():

all.equal(.1, .3/3)

Missing values

R does not exclude missing values by default – a single missing value in a vector means that many things are unknown:

x <- c(1:10, NA, 12:20)
c(mean(x), sd(x), median(x), min(x), max(x))

NA is not equal to anything, not even NA

NA == NA

Solutions: use the na.rm = TRUE option when calculating, and is.na to test for missing values.
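
Applied to the x defined above, that looks like:

mean(x, na.rm = TRUE) # drop the NA before averaging
which(is.na(x))       # locate the missing value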

Automatic type conversion

Automatic type conversion happens a lot, which is often useful but makes it easy to miss mistakes:

# combining values coerces them to the most general type
(x <- c(TRUE, FALSE, 1, 2, "a", "b"))
str(x)

# comparisons convert arguments to most general type
1 > "a"

Maybe this is what you expect... I would like to at least get a warning!

Optional argument inconsistencies

Functions you might expect to work similarly don’t always:

mean(1, 2, 3, 4, 5)*5
sum(1, 2, 3, 4, 5)

Why are these different?!?

args(mean)
args(sum)

Ouch. mean averages only its first argument (the remaining values are matched to optional arguments such as trim or simply ignored), so mean(1, 2, 3, 4, 5) is just 1; sum, on the other hand, adds up all of its arguments, giving 15. That is not nice at all!

Trouble with Factors

Factors sometimes behave as numbers, and sometimes as characters, which can be confusing!

(x <- factor(c(5, 5, 6, 6), levels = c(6, 5)))

str(x)

as.character(x)
# here is where people sometimes get lost...
as.numeric(x)
# you probably want
as.numeric(as.character(x))

Feedback

Footnotes

[1]Center for Economic and Policy Research. 2012. CPS Displaced Worker Uniform Extracts, Version 1.02. Washington, DC.
[2]Photo by Hadley Wickham via https://twitter.com/hadleywickham/status/643381054758363136/photo/1. Used by permission.