Data scientists are known and celebrated for modeling and visually displaying information, but down in the data science engine room there is a lot of less glamorous work to be done. Before data can be used effectively it must often be cleaned, corrected, and reformatted. This workshop introduces the basic tools needed to make your data behave, including data reshaping, regular expressions and other text manipulation tools.

This is an intermediate/advanced R course appropriate for those with basic knowledge of R. If you need a refresher we recommend the the IQSS R intro.

The lesson notes are available at https://dss.iq.harvard.edu

Prior to the workshop you should:

- install R from https://cran.r-project.org/
- install RStudio from https://www.rstudio.com/products/rstudio/download/#download
- install the tidyverse package in R with
`install.packages("tidyverse")`

and attach it with`library(tidyverse)`

To start the workshop: - Download the 'R Data Wrangling` workshop materials from http://tutorials.iq.harvard.edu/R/RDataWrangling.zip. - Open Rstudio and create a new project (`File ==> New Project ==> Existing Directory`

and select the `RDataWrangling`

folder you downloaded and extracted earlier. - Open a new R script (`File ==> New File ==> R script`

) and save it as `Notes.R`

.

It is common for data to be made available on a website somewhere, either by a government agency, research group, or other organizations and entities. Often the data you want is spread over many files, and retrieving it all one file at a time is tedious and time consuming. Such is the case with the baby names data we will be using today.

The UK Office for National Statistics provides yearly data on the most popular baby names going back to 1996. The data is provided separately for boys and girls and is stored in Excel spreadsheets.

I have downloaded all the excel files containing boys names data from https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/livebirths/datasets/babynamesenglandandwalesbabynamesstatisticsboys and made them available at http://tutorials.iq.harvard.edu/R/RDataWrangling/data/boysNames.zip.

Our mission is to extract and graph the **top 100** boys names in England and Wales for every year since 1996. There are several things that make this challenging.

While it was good of the UK Office for National Statistics to provide baby name data, they were not very diligent about arranging it in a convenient or consistent format.

Our mission is to extract and graph the **top 100** boys names in England and Wales for every year since 1996. There are several things that make this challenging.

Locate the file named

`1996boys_tcm77-254026.xlsx`

and open it in a spreadsheet. (If you don’t have a spreadsheet program installed on your computer you can downloads one from https://www.libreoffice.org/download/download/). What issues can you identify that might make working with these data more difficult?Locate the file named

`2015boysnamesfinal.xlsx`

and open it in a spreadsheet. In what ways is the format different than the format of`1996boys_tcm77-254026.xlsx`

? How might these differences make it more difficult to work with these data?

- Locate the file named
`1996boys_tcm77-254026.xlsx`

and open it in a spreadsheet. (If you don’t have a spreadsheet program installed on your computer you can downloads one from https://www.libreoffice.org/download/download/). What issues can you identify that might make working with these data more difficult?

The data does not start on row one. Headers are on row 7, followed by a blank line, followed by the actual data.

The data is stored in an inconvenient way, with ranks 1-50 in the first set of columns and ranks 51-100 in a separate set of columns.

There are notes below the data.

- Locate the file named
`2015boysnamesfinal.xlsx`

and open it in a spreadsheet. In what ways is the format different than the format of`1996boys_tcm77-254026.xlsx`

? How might these differences make it more difficult to work with these data?

The worksheet containing the data of interest is in different positions and has different names from one year to the next. However, it always includes “Table 1” in the worksheet name.

Some years include columns for “changes in rank”, others do not.

These differences will make it more difficult to automate re-arranging the data since we have to write code that can handle different input formats.

As you can see, the data is in quite a messy state. Note that this is not a contrived example; this is exactly the way the data came to us from the UK government website! Let’s start cleaning and organizing it. The `tidyverse`

suite of packages provides many modern conveniences that will make this job easier.

`## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──`

```
## ✔ ggplot2 3.0.0 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.7
## ✔ tidyr 0.8.1 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
```

```
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
```

Each Excel file contains a worksheet with the baby names data we want. Each file also contains additional supplemental worksheets that we are not currently interested in. As noted above, the worksheet of interest differs from year to year, but always has “Table 1” in the sheet name.

The first step is to get a vector of file names.

Now that we’ve told R the names of the data files we can start working with them. For example, the first file is

`## [1] "data/boys/1996boys_tcm77-254026.xlsx"`

and we can use the `excel_sheets`

function from the *readxl* package to list the worksheet names from this file.

```
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2-Top 10 boys by month" "Table 3 - Boys names - E&W"
```

`map`

Now that we know how to retrieve the names of the worksheets in an Excel file we could start writing code to extract the sheet names from each file, e.g.,

```
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2-Top 10 boys by month" "Table 3 - Boys names - E&W"
```

```
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
```

```
## [1] "Contents" "Metadata" "Terms and Conditions"
## [4] "Table 1" "Table 2" "Table 3"
## [7] "Table 4" "Table 5" "Table 6"
## [10] "Related Publications"
```

This is not a terrible idea for a small number of files, but it is more convenient to let R do the iteration for us. We could use a `for`

loop, or `sapply`

, but the `map`

family of functions from the *purrr* package gives us a more consistent alternative, so we’ll use that.

```
## [[1]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2-Top 10 boys by month" "Table 3 - Boys names - E&W"
##
## [[2]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[3]]
## [1] "Contents" "Table 1 - Top 100 boys' names"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[4]]
## [1] "Contents" "Table 1 - Top 100 boys' names"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[5]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[6]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[7]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[8]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[9]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[10]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[11]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[12]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[13]]
## [1] "Contents" "Table 1 - Top 100 boys' names"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[14]]
## [1] "Contents" "Table 1 - Top 100 boys' names"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[15]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[16]]
## [1] "Contents" "Metadata"
## [3] "Terms and Conditions" "Table 1 - Top 100 boys, E&W"
## [5] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [7] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [9] "Table 6 - Boys names - E&W" "Related Publications"
##
## [[17]]
## [1] "Contents" "Metadata"
## [3] "Terms and Conditions" "Table 1 - Top 100 boys, E&W"
## [5] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [7] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [9] "Table 6 - Boys names - E&W" "Related Publications"
##
## [[18]]
## [1] "Contents" "Metadata"
## [3] "Terms and Conditions" "Table 1 - Top 100 boys, E&W"
## [5] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [7] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [9] "Table 6 - Boys names - E&W" "Related Publications"
##
## [[19]]
## [1] "Contents" "Metadata"
## [3] "Terms and Conditions" "Table 1 - Top 100 boys, E&W"
## [5] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [7] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [9] "Table 6 - Boys names - E&W" "Related Publications"
##
## [[20]]
## [1] "Contents" "Metadata" "Terms and Conditions"
## [4] "Table 1" "Table 2" "Table 3"
## [7] "Table 4" "Table 5" "Table 6"
## [10] "Related Publications"
```

In order extract the correct worksheet names we need a way to extract strings containing “Table 1”. Base R provides some string manipulation capabilities (see `?regex`

, `?sub`

and `?grep`

), but we will use the *stringr* package because it is more user-friendly.

The *stringr* package provides functions to *detect*, *locate*, *extract*, *match*, *replace*, *combine* and *split* strings (among other things).

Here we want to detect the pattern “Table 1”, and only return elements with this pattern. We can do that using the `str_subset`

function. The first argument to `str_subset`

is character vector we want to search in. The second argument is a *regular expression* matching the pattern we want to retain.

If you are not familiar with regular expressions, http://www.regexr.com/ is a good place to start.

Now that we know how to filter character vectors using `str_subset`

we can identify the correct sheet in a particular Excel file. For example,

`## [1] "Table 1 - Top 100 boys, E&W"`

The `map*`

functions are useful when you want to apply a function to a list or vector of inputs and obtain the return values. This is very convenient when a function already exists that does exactly what you want. In the examples above we mapped the `excel_sheets`

function to the elements of a vector containing file names. But now there is no function that both retrieves worksheet names and subsets them. Fortunately, writing functions in R is easy.

Now we can map this new function over our vector of file names.

```
## [[1]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[2]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[3]]
## [1] "Table 1 - Top 100 boys' names"
##
## [[4]]
## [1] "Table 1 - Top 100 boys' names"
##
## [[5]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[6]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[7]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[8]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[9]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[10]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[11]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[12]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[13]]
## [1] "Table 1 - Top 100 boys' names"
##
## [[14]]
## [1] "Table 1 - Top 100 boys' names"
##
## [[15]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[16]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[17]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[18]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[19]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[20]]
## [1] "Table 1"
```

Now that we know the correct worksheet from each file we can actually read those data into R. We can do that using the `read_excel`

function.

We’ll start by reading the data from the first file, just to check that it works. Recall that the actual data starts on row 7, so we want to skip the first 6 rows.

```
tmp <- read_excel(
boy.file.names[1],
sheet = get.data.sheet.name(boy.file.names[1],
pattern = "Table 1"),
skip = 6)
library(dplyr, quietly=TRUE)
glimpse(tmp)
```

```
## Observations: 59
## Variables: 7
## $ X__1 <chr> NA, "1", "2", "3", "4", "5", "6", "7", "8", "9", "10"...
## $ Name <chr> NA, "JACK", "DANIEL", "THOMAS", "JAMES", "JOSHUA", "M...
## $ Count <dbl> NA, 10779, 10338, 9603, 9385, 7887, 7426, 6496, 6193,...
## $ X__2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ X__3 <dbl> NA, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 6...
## $ Name__1 <chr> NA, "DOMINIC", "NICHOLAS", "BRANDON", "RHYS", "MARK",...
## $ Count__1 <dbl> NA, 1519, 1385, 1337, 1259, 1222, 1192, 1186, 1135, 1...
```

Write a function that takes a file name as an argument and reads the worksheet containing “Table 1” from that file. Don’t forget to skip the first 6 rows.

Test your function by using it to read

*one*of the boys names Excel files.Use the

`map`

function to read data from all the Excel files, using the function you wrote in step 1.

```
## 1. Write a function that takes a file name as an argument and reads
## the worksheet containing "Table 1" from that file.
read.baby.names <- function(file) {
sheet.name <- str_subset(excel_sheets(file), "Table 1")
read_excel(file, sheet = sheet.name, skip = 6)
}
## 2. Test your function by using it to read *one* of the boys names
## Excel files.
glimpse(read.baby.names(boy.file.names[1]))
```

```
## Observations: 59
## Variables: 7
## $ X__1 <chr> NA, "1", "2", "3", "4", "5", "6", "7", "8", "9", "10"...
## $ Name <chr> NA, "JACK", "DANIEL", "THOMAS", "JAMES", "JOSHUA", "M...
## $ Count <dbl> NA, 10779, 10338, 9603, 9385, 7887, 7426, 6496, 6193,...
## $ X__2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ X__3 <dbl> NA, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 6...
## $ Name__1 <chr> NA, "DOMINIC", "NICHOLAS", "BRANDON", "RHYS", "MARK",...
## $ Count__1 <dbl> NA, 1519, 1385, 1337, 1259, 1222, 1192, 1186, 1135, 1...
```

Now that we’ve read in the data we still have some cleanup to do. Specifically, we need to:

- fix column names
- get rid of blank row and the top and the notes at the bottom
- get rid of extraneous “changes in rank” columns if they exist
- transform the side-by-side tables layout to a single table.

In short, we want to go from this: