Workshop description

Data scientists are known and celebrated for modeling and visually displaying information, but down in the data science engine room there is a lot of less glamorous work to be done. Before data can be used effectively it must often be cleaned, corrected, and reformatted. This workshop introduces the basic tools needed to make your data behave, including data reshaping, regular expressions and other text manipulation tools.

This is an intermediate/advanced R course appropriate for those with basic knowledge of R. If you need a refresher we recommend the the IQSS R intro.

The lesson notes are available at https://dss.iq.harvard.edu

Prerequisites and Preparation

Prior to the workshop you should:

To start the workshop: - Download the 'R Data Wrangling` workshop materials from http://tutorials.iq.harvard.edu/R/RDataWrangling.zip. - Open Rstudio and create a new project (File ==> New Project ==> Existing Directory and select the RDataWrangling folder you downloaded and extracted earlier. - Open a new R script (File ==> New File ==> R script) and save it as Notes.R.

Example project overview

It is common for data to be made available on a website somewhere, either by a government agency, research group, or other organizations and entities. Often the data you want is spread over many files, and retrieving it all one file at a time is tedious and time consuming. Such is the case with the baby names data we will be using today.

The UK Office for National Statistics provides yearly data on the most popular baby names going back to 1996. The data is provided separately for boys and girls and is stored in Excel spreadsheets.

I have downloaded all the excel files containing boys names data from https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/livebirths/datasets/babynamesenglandandwalesbabynamesstatisticsboys and made them available at http://tutorials.iq.harvard.edu/R/RDataWrangling/data/boysNames.zip.

Our mission is to extract and graph the top 100 boys names in England and Wales for every year since 1996. There are several things that make this challenging.

Problems with the data

While it was good of the UK Office for National Statistics to provide baby name data, they were not very diligent about arranging it in a convenient or consistent format.

Exercise 0

Our mission is to extract and graph the top 100 boys names in England and Wales for every year since 1996. There are several things that make this challenging.

  1. Locate the file named 1996boys_tcm77-254026.xlsx and open it in a spreadsheet. (If you don’t have a spreadsheet program installed on your computer you can downloads one from https://www.libreoffice.org/download/download/). What issues can you identify that might make working with these data more difficult?

  2. Locate the file named 2015boysnamesfinal.xlsx and open it in a spreadsheet. In what ways is the format different than the format of 1996boys_tcm77-254026.xlsx? How might these differences make it more difficult to work with these data?

Exercise 0 Prototype

  1. Locate the file named 1996boys_tcm77-254026.xlsx and open it in a spreadsheet. (If you don’t have a spreadsheet program installed on your computer you can downloads one from https://www.libreoffice.org/download/download/). What issues can you identify that might make working with these data more difficult?

The data does not start on row one. Headers are on row 7, followed by a blank line, followed by the actual data.

The data is stored in an inconvenient way, with ranks 1-50 in the first set of columns and ranks 51-100 in a separate set of columns.

There are notes below the data.

  1. Locate the file named 2015boysnamesfinal.xlsx and open it in a spreadsheet. In what ways is the format different than the format of 1996boys_tcm77-254026.xlsx? How might these differences make it more difficult to work with these data?

The worksheet containing the data of interest is in different positions and has different names from one year to the next. However, it always includes “Table 1” in the worksheet name.

Some years include columns for “changes in rank”, others do not.

These differences will make it more difficult to automate re-arranging the data since we have to write code that can handle different input formats.

Useful data manipulation packages

As you can see, the data is in quite a messy state. Note that this is not a contrived example; this is exactly the way the data came to us from the UK government website! Let’s start cleaning and organizing it. The tidyverse suite of packages provides many modern conveniences that will make this job easier.

## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.7
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Working with Excel worksheets

Each Excel file contains a worksheet with the baby names data we want. Each file also contains additional supplemental worksheets that we are not currently interested in. As noted above, the worksheet of interest differs from year to year, but always has “Table 1” in the sheet name.

The first step is to get a vector of file names.

Now that we’ve told R the names of the data files we can start working with them. For example, the first file is

## [1] "data/boys/1996boys_tcm77-254026.xlsx"

and we can use the excel_sheets function from the readxl package to list the worksheet names from this file.

## [1] "Contents"                     "Table 1 - Top 100 boys, E&W" 
## [3] "Table 2-Top 10 boys by month" "Table 3 - Boys names - E&W"

Iterating over file names with map

Now that we know how to retrieve the names of the worksheets in an Excel file we could start writing code to extract the sheet names from each file, e.g.,

## [1] "Contents"                     "Table 1 - Top 100 boys, E&W" 
## [3] "Table 2-Top 10 boys by month" "Table 3 - Boys names - E&W"
## [1] "Contents"                        "Table 1 - Top 100 boys, E&W"    
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"  
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month" 
## [7] "Table 6 - Boys names - E&W"
##  [1] "Contents"             "Metadata"             "Terms and Conditions"
##  [4] "Table 1"              "Table 2"              "Table 3"             
##  [7] "Table 4"              "Table 5"              "Table 6"             
## [10] "Related Publications"

This is not a terrible idea for a small number of files, but it is more convenient to let R do the iteration for us. We could use a for loop, or sapply, but the map family of functions from the purrr package gives us a more consistent alternative, so we’ll use that.

## [[1]]
## [1] "Contents"                     "Table 1 - Top 100 boys, E&W" 
## [3] "Table 2-Top 10 boys by month" "Table 3 - Boys names - E&W"  
## 
## [[2]]
## [1] "Contents"                        "Table 1 - Top 100 boys, E&W"    
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"  
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month" 
## [7] "Table 6 - Boys names - E&W"     
## 
## [[3]]
## [1] "Contents"                        "Table 1 - Top 100 boys' names"  
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"  
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month" 
## [7] "Table 6 - Boys names - E&W"     
## 
## [[4]]
## [1] "Contents"                        "Table 1 - Top 100 boys' names"  
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"  
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month" 
## [7] "Table 6 - Boys names - E&W"     
## 
## [[5]]
## [1] "Contents"                        "Table 1 - Top 100 boys, E&W"    
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"  
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month" 
## [7] "Table 6 - Boys names - E&W"     
## 
## [[6]]
## [1] "Contents"                        "Table 1 - Top 100 boys, E&W"    
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"  
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month" 
## [7] "Table 6 - Boys names - E&W"     
## 
## [[7]]
## [1] "Contents"                        "Table 1 - Top 100 boys, E&W"    
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"  
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month" 
## [7] "Table 6 - Boys names - E&W"     
## 
## [[8]]
## [1] "Contents"                        "Table 1 - Top 100 boys, E&W"    
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"  
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month" 
## [7] "Table 6 - Boys names - E&W"     
## 
## [[9]]
## [1] "Contents"                        "Table 1 - Top 100 boys, E&W"    
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"  
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month" 
## [7] "Table 6 - Boys names - E&W"     
## 
## [[10]]
## [1] "Contents"                        "Table 1 - Top 100 boys, E&W"    
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"  
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month" 
## [7] "Table 6 - Boys names - E&W"     
## 
## [[11]]
## [1] "Contents"                        "Table 1 - Top 100 boys, E&W"    
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"  
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month" 
## [7] "Table 6 - Boys names - E&W"     
## 
## [[12]]
## [1] "Contents"                        "Table 1 - Top 100 boys, E&W"    
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"  
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month" 
## [7] "Table 6 - Boys names - E&W"     
## 
## [[13]]
## [1] "Contents"                        "Table 1 - Top 100 boys' names"  
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"  
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month" 
## [7] "Table 6 - Boys names - E&W"     
## 
## [[14]]
## [1] "Contents"                        "Table 1 - Top 100 boys' names"  
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"  
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month" 
## [7] "Table 6 - Boys names - E&W"     
## 
## [[15]]
## [1] "Contents"                        "Table 1 - Top 100 boys, E&W"    
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"  
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month" 
## [7] "Table 6 - Boys names - E&W"     
## 
## [[16]]
##  [1] "Contents"                        "Metadata"                       
##  [3] "Terms and Conditions"            "Table 1 - Top 100 boys, E&W"    
##  [5] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"  
##  [7] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month" 
##  [9] "Table 6 - Boys names - E&W"      "Related Publications"           
## 
## [[17]]
##  [1] "Contents"                        "Metadata"                       
##  [3] "Terms and Conditions"            "Table 1 - Top 100 boys, E&W"    
##  [5] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"  
##  [7] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month" 
##  [9] "Table 6 - Boys names - E&W"      "Related Publications"           
## 
## [[18]]
##  [1] "Contents"                        "Metadata"                       
##  [3] "Terms and Conditions"            "Table 1 - Top 100 boys, E&W"    
##  [5] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"  
##  [7] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month" 
##  [9] "Table 6 - Boys names - E&W"      "Related Publications"           
## 
## [[19]]
##  [1] "Contents"                        "Metadata"                       
##  [3] "Terms and Conditions"            "Table 1 - Top 100 boys, E&W"    
##  [5] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"  
##  [7] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month" 
##  [9] "Table 6 - Boys names - E&W"      "Related Publications"           
## 
## [[20]]
##  [1] "Contents"             "Metadata"             "Terms and Conditions"
##  [4] "Table 1"              "Table 2"              "Table 3"             
##  [7] "Table 4"              "Table 5"              "Table 6"             
## [10] "Related Publications"

Filtering strings using regular expressions

In order extract the correct worksheet names we need a way to extract strings containing “Table 1”. Base R provides some string manipulation capabilities (see ?regex, ?sub and ?grep), but we will use the stringr package because it is more user-friendly.

The stringr package provides functions to detect, locate, extract, match, replace, combine and split strings (among other things).

Here we want to detect the pattern “Table 1”, and only return elements with this pattern. We can do that using the str_subset function. The first argument to str_subset is character vector we want to search in. The second argument is a regular expression matching the pattern we want to retain.

If you are not familiar with regular expressions, http://www.regexr.com/ is a good place to start.

Now that we know how to filter character vectors using str_subset we can identify the correct sheet in a particular Excel file. For example,

## [1] "Table 1 - Top 100 boys, E&W"

Writing your own functions

The map* functions are useful when you want to apply a function to a list or vector of inputs and obtain the return values. This is very convenient when a function already exists that does exactly what you want. In the examples above we mapped the excel_sheets function to the elements of a vector containing file names. But now there is no function that both retrieves worksheet names and subsets them. Fortunately, writing functions in R is easy.

Now we can map this new function over our vector of file names.

## [[1]]
## [1] "Table 1 - Top 100 boys, E&W"
## 
## [[2]]
## [1] "Table 1 - Top 100 boys, E&W"
## 
## [[3]]
## [1] "Table 1 - Top 100 boys' names"
## 
## [[4]]
## [1] "Table 1 - Top 100 boys' names"
## 
## [[5]]
## [1] "Table 1 - Top 100 boys, E&W"
## 
## [[6]]
## [1] "Table 1 - Top 100 boys, E&W"
## 
## [[7]]
## [1] "Table 1 - Top 100 boys, E&W"
## 
## [[8]]
## [1] "Table 1 - Top 100 boys, E&W"
## 
## [[9]]
## [1] "Table 1 - Top 100 boys, E&W"
## 
## [[10]]
## [1] "Table 1 - Top 100 boys, E&W"
## 
## [[11]]
## [1] "Table 1 - Top 100 boys, E&W"
## 
## [[12]]
## [1] "Table 1 - Top 100 boys, E&W"
## 
## [[13]]
## [1] "Table 1 - Top 100 boys' names"
## 
## [[14]]
## [1] "Table 1 - Top 100 boys' names"
## 
## [[15]]
## [1] "Table 1 - Top 100 boys, E&W"
## 
## [[16]]
## [1] "Table 1 - Top 100 boys, E&W"
## 
## [[17]]
## [1] "Table 1 - Top 100 boys, E&W"
## 
## [[18]]
## [1] "Table 1 - Top 100 boys, E&W"
## 
## [[19]]
## [1] "Table 1 - Top 100 boys, E&W"
## 
## [[20]]
## [1] "Table 1"

Reading Excel data files

Now that we know the correct worksheet from each file we can actually read those data into R. We can do that using the read_excel function.

We’ll start by reading the data from the first file, just to check that it works. Recall that the actual data starts on row 7, so we want to skip the first 6 rows.

## Observations: 59
## Variables: 7
## $ X__1     <chr> NA, "1", "2", "3", "4", "5", "6", "7", "8", "9", "10"...
## $ Name     <chr> NA, "JACK", "DANIEL", "THOMAS", "JAMES", "JOSHUA", "M...
## $ Count    <dbl> NA, 10779, 10338, 9603, 9385, 7887, 7426, 6496, 6193,...
## $ X__2     <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ X__3     <dbl> NA, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 6...
## $ Name__1  <chr> NA, "DOMINIC", "NICHOLAS", "BRANDON", "RHYS", "MARK",...
## $ Count__1 <dbl> NA, 1519, 1385, 1337, 1259, 1222, 1192, 1186, 1135, 1...

Exercise 1

  1. Write a function that takes a file name as an argument and reads the worksheet containing “Table 1” from that file. Don’t forget to skip the first 6 rows.

  2. Test your function by using it to read one of the boys names Excel files.

  3. Use the map function to read data from all the Excel files, using the function you wrote in step 1.

Exercise 1 solution

## Observations: 59
## Variables: 7
## $ X__1     <chr> NA, "1", "2", "3", "4", "5", "6", "7", "8", "9", "10"...
## $ Name     <chr> NA, "JACK", "DANIEL", "THOMAS", "JAMES", "JOSHUA", "M...
## $ Count    <dbl> NA, 10779, 10338, 9603, 9385, 7887, 7426, 6496, 6193,...
## $ X__2     <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ X__3     <dbl> NA, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 6...
## $ Name__1  <chr> NA, "DOMINIC", "NICHOLAS", "BRANDON", "RHYS", "MARK",...
## $ Count__1 <dbl> NA, 1519, 1385, 1337, 1259, 1222, 1192, 1186, 1135, 1...

Data cleanup

Now that we’ve read in the data we still have some cleanup to do. Specifically, we need to:

  1. fix column names
  2. get rid of blank row and the top and the notes at the bottom
  3. get rid of extraneous “changes in rank” columns if they exist
  4. transform the side-by-side tables layout to a single table.

In short, we want to go from this: