Workshop overview and materials

Workshop description

This is an intermediate/advanced R course appropriate for those with basic knowledge of R. It is intended for those already comfortable with using R for data analysis who wish to move on to writing their own functions.

Prerequisite: basic familiarity with R, such as acquired from an introductory R workshop.

To the extent possible this workshop uses real-world examples. That is to say that concepts are introduced as they are needed for a realistic analysis task. In the course of working through a realistic project we will learn about regular expressions, iteration, functions, control flow and more.

This workshop uses the tidyverse package which provides more consistent file IO (readr), data manipulation (dplyr, tidyr) and functional programming (purrr) tools for R.

Materials and setup

NOTE: skip this section if you are running these examples in your web browser at tutorials-live.iq.harvard.edu

Everyone should have R installed – if not:

Materials for this workshop consist of notes and example code.

Example project overview

It is common for data to be made available on a website by a government agency, research group, or other organization. Often the data you want is spread over many files, and retrieving it one file at a time is tedious and time-consuming. Such is the case with the baby names data we will be using today.

The UK Office for National Statistics provides yearly data on the most popular baby names going back to 1996. The data is provided separately for boys and girls. These data have been cleaned up and copied to http://tutorials.iq.harvard.edu/example_data/baby_names/. Although we could open a web browser and download files one at a time, it will be faster and easier to instruct R to do that for us. Doing it this way will also give us an excuse to talk about iteration, text processing, and other useful techniques.

Extracting information from a text file (string manipulation)

Our first goal is to download all the .csv files from http://tutorials.iq.harvard.edu/example_data/baby_names/EW.

In order to do that we need a list of the Uniform Resource Locators (URLs) of those files. The URLs we need are right there as links in the web page. All we need to do is extract them.

Introduction to the tidyverse

In order to extract the data URLs from the website we will use functions for manipulating strings. What functions or packages should we use? Here are some tools to help us decide.

Task views
https://cran.r-project.org/web/views/NaturalLanguageProcessing.html
R package search
http://www.r-pkg.org/search.html?q=string
Web search
https://www.google.com/search?q=R+string+manipulation&ie=utf-8&oe=utf-8

Base R provides some string manipulation capabilities (see ?regex, ?sub and ?grep), but I recommend either the stringr or stringi packages. The stringr package is more user-friendly, and the stringi package is more powerful and complex. Let’s use the friendlier one. While we’re attaching packages we’ll also attach the tidyverse package which provides more modern versions of many basic functions in R.

## install.packages("tidyverse")
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
library(stringr)

Packages in the tidyverse are often more convenient to use with pipes than with nested function calls. The pipe operator is written %>%, and it inserts the value on its left as the first argument to the function on its right. Here it is in action:

(x <- rnorm(5)) # extra parens around expression is a shortcut to assign and print
## [1]  0.3275039 -1.3261635  0.9757538 -1.5788018 -1.5597637
## nested function calls to sort and round
round(sort(x), digits = 2)
## [1] -1.58 -1.56 -1.33  0.33  0.98
## pipeline that does the same thing
x %>%
  sort() %>%
  round(digits = 2)
## [1] -1.58 -1.56 -1.33  0.33  0.98
## nested function calls to sample letters, convert to uppercase and sort.
sort(
  str_to_upper(
    sample(letters,
           5,
           replace = TRUE)))
## [1] "J" "O" "T" "X" "Z"
## pipeline that does the same thing:
letters %>%
  sample(5, replace = TRUE) %>%
  str_to_upper() %>%
  sort()
## [1] "J" "K" "M" "Q" "Z"

The examples in this workshop use the pipe when it makes examples easier to follow.

Reading text over http

Our first task is to read the web page into R. Most of the file IO functions in R can read either from a local path or from a URL. We can read text into R line-by-line using the read_lines function.

base.url <- "http://tutorials.iq.harvard.edu"
baby.names.path <- "example_data/baby_names/EW" 
baby.names.url <- str_c(base.url, baby.names.path, sep = "/")

baby.names.page <- read_lines(baby.names.url)
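
The str_c call above joins single strings, but str_c is also vectorised, which becomes useful once we need one URL per file. A minimal sketch, assuming a made-up vector of file names (example.files is hypothetical and not part of the workshop data):

```r
library(stringr)

base.url <- "http://tutorials.iq.harvard.edu"
baby.names.path <- "example_data/baby_names/EW"

## hypothetical file names, for illustration only
example.files <- c("file_a.csv", "file_b.csv")

## the single base.url and path are recycled against each file name
example.urls <- str_c(base.url, baby.names.path, example.files, sep = "/")
example.urls
```

The result has one element per file name, each prefixed with the same base URL and path.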

Now that we have some values to play with let’s take a closer look. R has nice tools for inspecting object attributes such as the storage type, length, and class.

## what is base.url?
c(type = typeof(base.url),
  length = length(base.url),
  class = class(base.url))
##        type      length       class 
## "character"         "1" "character"
c(type = typeof(baby.names.page),
  length = length(baby.names.page),
  class = class(baby.names.page))

The str (structure) and glimpse functions give a nice overview.

str(baby.names.page)
##  chr [1:53] "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 3.2 Final//EN\">" ...

The S3 class system in R allows functions to define methods for specific classes. If we know the class of an object we can see what functions have methods specific to that class.

methods(class = class(base.url))
##  [1] all.equal                as.data.frame           
##  [3] as.Date                  as_function             
##  [5] as.POSIXlt               as.raster               
##  [7] coerce                   coerce<-                
##  [9] escape                   formula                 
## [11] getDLLRegisteredRoutines Ops                     
## [13] recode                  
## see '?methods' for accessing help and source code

Note that these methods are not exhaustive – we can do other things with these objects as well. methods(class = ) just tells you which functions have specific methods for working with objects of that class.
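
To make the idea of class-specific methods concrete, here is a minimal sketch of S3 dispatch. The generic describe and its two methods are hypothetical names invented for this illustration; they are not part of base R or the tidyverse.

```r
## a generic function: UseMethod picks a method based on class(x)
describe <- function(x, ...) UseMethod("describe")

## method used for objects of class "character"
describe.character <- function(x, ...) {
  paste("character vector of length", length(x))
}

## fallback method used when no class-specific method exists
describe.default <- function(x, ...) {
  "no specific method for this class"
}

describe("http://tutorials.iq.harvard.edu") # dispatches to describe.character
describe(1:3)                               # falls back to describe.default
```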

The typeof function tells us how the object is stored in memory. “character” is one of the six atomic types in R. The others are “logical”, “integer”, “double” (numeric), “complex”, and “raw”. Objects whose type is one of these are atomic vectors; they are the building blocks of most data structures in R.
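
As a quick illustration of the atomic types, the sketch below builds one value of each type and asks typeof about it. The object names are made up for this example.

```r
## one example value of each atomic type
atomic.examples <- list(TRUE, 1L, 1.5, 2+3i, "a", as.raw(1))

## typeof reports the storage type of each value
atomic.types <- vapply(atomic.examples, typeof, character(1))
atomic.types

## numbers typed without an L suffix are stored as doubles
typeof(1)   # "double"
typeof(1L)  # "integer"
```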

Now that we know what we’re working with we can proceed to find and extract all the links in the page. Let’s start by printing the last few lines of baby.names.page.

cat(tail(baby.names.page), sep = "\n")
## <tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td><td><a href="girls_2014.csv">girls_2014.csv</a></td><td align="right">06-Oct-2016 13:12  </td><td align="right">144K</td></tr>
## <tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td><td><a href="girls_2015.csv">girls_2015.csv</a></td><td align="right">06-Oct-2016 13:12  </td><td align="right">145K</td></tr>
## <tr><th colspan="5"><hr></th></tr>
## </table>
## <address>Apache/2.2.3 (Red Hat) Server at tutorials.iq.harvard.edu Port 80</address>
## </body></html>

We want to extract the “href” attributes, i.e., “girls_2014.csv” and “girls_2015.csv” in the above snippet. We can do that using the str_extract function, but in order to use it effectively we need to know something about regular expressions.

String manipulation with the stringr package

Regular expressions are useful in general (not just in R!) and it is a good idea to be familiar with at least the basics.

In regular expressions ^, ., *, +, $, [] and \ are special characters with the following meanings:

^
matches the beginning of the string
.
matches any character
*
repeats the previous match zero or more times
+
repeats the previous match one or more times
$
matches the end of the string
[]
specifies ranges of characters: [a-z] matches lower case letters
\
escapes special meaning: ‘.’ means “anything”, ‘\.’ means “.”

Here’s how it works in R using the stringr package. Note that to create a backslash in an R string you must escape it with another backslash.

## make up some example data
user.info <- c("Dexter Bacon dbacon@gmail.com 32",
               "Angelica Sampson not available 28",
               "Roberta Modela roberta.modela@harvard.edu 26"
               )
## regex that matches emails (simplified, not realistic)
##                                               
email.regex <- "([a-z0-9]+@[a-z0-9]+\\.[a-z]+)"

Which users have an email address?

str_detect(user.info, email.regex)
## [1]  TRUE FALSE  TRUE

Keep only users with an email address

str_subset(user.info, email.regex)
## [1] "Dexter Bacon dbacon@gmail.com 32"            
## [2] "Roberta Modela roberta.modela@harvard.edu 26"

Extract the email addresses

str_extract(user.info, email.regex)
## [1] "dbacon@gmail.com"   NA                   "modela@harvard.edu"

Replace email addresses with a mailto link

str_replace(user.info, email.regex, "<a href='mailto:\\1'>\\1</a>")
## [1] "Dexter Bacon <a href='mailto:dbacon@gmail.com'>dbacon@gmail.com</a> 32"              
## [2] "Angelica Sampson not available 28"                                                   
## [3] "Roberta Modela roberta.<a href='mailto:modela@harvard.edu'>modela@harvard.edu</a> 26"

If you have not been introduced to regular expressions yet, a nice interactive regex tester is available at http://www.regexr.com/ and an interactive tutorial is available at http://www.regexone.com/.
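
To see the anchors, the character range and the escaped dot working together, here is a small sketch on made-up strings (example.strings and the result names are hypothetical and unrelated to the baby names data):

```r
library(stringr)

## made-up strings, for illustration only
example.strings <- c("report_2014.txt", "notes_2014.txt",
                     "report_2014.txt.bak", "report_2014xtxt")

## ^ anchors the match to the start of the string
starts.with.report <- str_detect(example.strings, "^report")

## $ anchors the match to the end of the string
ends.with.txt <- str_detect(example.strings, "txt$")

## [0-9]+ is one or more digits; \\. is a literal dot
digits.dot.txt <- str_detect(example.strings, "[0-9]+\\.txt$")

## an unescaped . matches any character, including the "x" in the last string
digits.any.txt <- str_detect(example.strings, "[0-9]+.txt$")
```

Comparing digits.dot.txt with digits.any.txt shows why the dot usually needs escaping: only the escaped version rejects "report_2014xtxt".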

Exercise 1: string manipulation and regular expressions

Our job now is to match the file names using regular expressions. To get started let’s copy an example string into the interactive regex tester at http://www.regexr.com/ and work with it until we find a regular expression that works.

  1. Open http://www.regexr.com/ in your web browser and paste this string into the “Text” box (in the lower right-hand section of the page):
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td><td><a href="girls_2014.csv">girls_2014.csv</a></td><td align="right">06-Oct-2016 13:12  </td><td align="right">144K</td></tr>
     Then find a regular expression that matches ‘girls_2014.csv’ and nothing else.
  2. Assign the regular expression you found to the name ‘girl.file.regex’ in R. Replace any backslashes with a double backslash.
  3. Extract the girl file names from baby.names.page and assign the values to the name ‘girl.file.names’.
  4. Repeat steps 1-3 for boys.
  5. Use the str_c function to prepend baby.names.url to girl.file.names and boy.file.names. Make sure to separate with a forward slash (“/”).

Exercise 1 prototype

## 1. Make sure you replace whatever example text was presented in the regexr.com site.

## 2. regular expression for girl csv files
girl.file.regex <- "girls_[0-9]*\\.csv"

## 3. extract girl file names
girl.file.names <- baby.names.page %>%
  str_subset(pattern = girl.file.regex) %>%
  str_extract(pattern = girl.file.regex)

## 4. same thing for boys
boy.file.regex <- "boys_[0-9]*\\.csv"
boy.file.names <- baby.names.page %>%
  str_subset(pattern = boy.file.regex) %>%
  str_extract(pattern = boy.file.regex)

## 5. construct URLs
(girl.file.names <- str_c(baby.names.url, girl.file.names, sep = "/"))
(boy.file.names <- str_c(baby.names.url, boy.file.names, sep = "/"))
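
One way to sanity-check a pattern like the one in the prototype is to try it on a couple of lines before running it on the whole page. A small sketch, using an abbreviated copy of the girls_2014.csv row shown earlier and one line that should not match (sample.lines is a made-up name):

```r
library(stringr)

## two lines in the style of the directory listing; the first is abbreviated
sample.lines <- c('<tr><td><a href="girls_2014.csv">girls_2014.csv</a></td></tr>',
                  '<tr><th colspan="5"><hr></th></tr>')

girl.file.regex <- "girls_[0-9]*\\.csv"

## str_extract returns the first match in each string, or NA if there is none
extracted.all <- str_extract(sample.lines, girl.file.regex)
extracted.all

## str_subset drops the non-matching lines first, so no NA values remain
extracted.matching <- str_extract(str_subset(sample.lines, girl.file.regex),
                                  girl.file.regex)
extracted.matching
```

This is why the prototype calls str_subset before str_extract: without the subset step, every line of the page that lacks a file name would contribute an NA.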

Reading all the files (iteration, functions)

Reading .csv files

As mentioned earlier, we can read files directly from the internet. For example, we can read the first girls names file like this:

girl.names.1 <- read_csv(girl.file.names[1], na = "")
## Parsed with column specification:
## cols(
##   Rank = col_double(),
##   Name = col_character(),
##   Count = col_double()
## )
head(girl.names.1)
## # A tibble: 6 × 3
##    Rank    Name Count
##   <dbl>   <chr> <dbl>
## 1     1  SOPHIE  7087
## 2     2   CHLOE  6824
## 3     3 JESSICA  6711
## 4     4   EMILY  6415
## 5     5  LAUREN  6299
## 6     6  HANNAH  5916

Notice that we selected the first element of girl.file.names using [. This is called bracket extraction and it is a very useful feature of the R language.

Extracting and replacing vector elements

Elements of R objects can be extracted and replaced using bracket notation. Atomic vectors can be indexed in several ways: by position, by name, or with a logical vector. Let’s start by making some example data to work with.

example.int.1 <- c(10, 11, 12, 13, 14, 15)
str(example.int.1)
##  num [1:6] 10 11 12 13 14 15

The names attribute can be used for indexing if it exists. The names attribute can be extracted and set using the names function, like this:

names(example.int.1) # no names yet, let's add some
## NULL
names(example.int.1) <- c("a1", "a2", "b1", "b2", "c1", "c2")
names(example.int.1)
## [1] "a1" "a2" "b1" "b2" "c1" "c2"

Indexing can be done by position.

## extract by position
example.int.1[1]
## a1 
## 10
example.int.1[c(1, 3, 5)]
## a1 b1 c1 
## 10 12 14

If an object has names you can index by name.

## extract by name
example.int.1[c("c2", "a1")]
## c2 a1 
## 15 10

Finally, objects can be indexed with a logical vector, extracting only the TRUE elements.

## logical extraction 
(one.names <- str_detect(names(example.int.1), "1"))
## [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE
example.int.1[one.names]
## a1 b1 c1 
## 10 12 14
example.int.1[example.int.1 > 12]
## b2 c1 c2 
## 13 14 15

Replacement works by assigning a value to an extraction.

example.int.2 <- example.int.1

## replace by position
example.int.2[1] <- 100

## replace by name
example.int.2["a2"] <- 200
example.int.2
##  a1  a2  b1  b2  c1  c2 
## 100 200  12  13  14  15
## logical replacement
(lt14 <- example.int.2 < 14)
##    a1    a2    b1    b2    c1    c2 
## FALSE FALSE  TRUE  TRUE FALSE FALSE
example.int.2[lt14] <- 0
example.int.2
##  a1  a2  b1  b2  c1  c2 
## 100 200   0   0  14  15
## "replace" non-existing element
example.int.2[c("z1", "2")] <- -10 

## compare vectors to see the changes we made
example.int.1
## a1 a2 b1 b2 c1 c2 
## 10 11 12 13 14 15
example.int.2
##  a1  a2  b1  b2  c1  c2  z1   2 
## 100 200   0   0  14  15 -10 -10

Extracting and replacing list elements

List elements can be extracted and replaced in the same way as elements of atomic vectors. In addition, [[ can be used to extract or replace the contents of a list element. Here is how it works:

## make up an example list
example.list.1 <- list(a1 = c(a = 1, b = 2, c = 3),
                     a2 = c(4, 5, 6),
                     b1 = c("a", "b", "c", "d"),
                     b2 = c("e", "f", "g", "h"))
str(example.list.1)
## List of 4
##  $ a1: Named num [1:3] 1 2 3
##   ..- attr(*, "names")= chr [1:3] "a" "b" "c"
##  $ a2: num [1:3] 4 5 6
##  $ b1: chr [1:4] "a" "b" "c" "d"
##  $ b2: chr [1:4] "e" "f" "g" "h"

Extract by position.

str(example.list.1[c(1, 3)])
## List of 2
##  $ a1: Named num [1:3] 1 2 3
##   ..- attr(*, "names")= chr [1:3] "a" "b" "c"
##  $ b1: chr [1:4] "a" "b" "c" "d"
str(example.list.1[1])
## List of 1
##  $ a1: Named num [1:3] 1 2 3
##   ..- attr(*, "names")= chr [1:3] "a" "b" "c"
str(example.list.1[[1]]) # note the difference between [ and [[
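
The difference between [ and [[ is easy to miss in str output, so here is a small check. It uses a cut-down copy of the example list; the names small.list, with.single and with.double are made up for this sketch.

```r
small.list <- list(a1 = c(a = 1, b = 2, c = 3),
                   b1 = c("a", "b", "c", "d"))

## single brackets return a shorter list that still wraps the element
with.single <- small.list[1]
is.list(with.single)   # TRUE
length(with.single)    # 1

## double brackets return the element itself, here a numeric vector
with.double <- small.list[[1]]
is.list(with.double)   # FALSE
length(with.double)    # 3

## arithmetic works on the contents, not on the wrapping list
with.double * 2
```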

Extract by name.

str(example.list.1[c("a1", "a2")])
(a.names <- str_detect(names(example.list.1), "a"))
## [1]  TRUE  TRUE FALSE FALSE
str(example.list.1[a.names])
## List of 2
##  $ a1: Named num [1:3] 1 2 3
##   ..- attr(*, "names")= chr [1:3] "a" "b" "c"
##  $ a2: num [1:3] 4 5 6

Bracket extraction can be chained to extract a nested element.

str(example.list.1[["a1"]][c("a", "c")])
##  Named num [1:2] 1 3
##  - attr(*, "names")= chr [1:2] "a" "c"

Logical extraction.

(el.length <- map_int(example.list.1, length))
## a1 a2 b1 b2 
##  3  3  4  4
(el.length4 <- el.length == 4)
##    a1    a2    b1    b2 
## FALSE FALSE  TRUE  TRUE
str(example.list.1[el.length4])
## List of 2
##  $ b1: chr [1:4] "a" "b" "c" "d"
##  $ b2: chr [1:4] "e" "f" "g" "h"

As with vectors, replacement works by assigning a value to an extraction.

example.list.2 <- example.list.1

## replace by position
example.list.2[[1]] <- c(a = 11, b = 12, c = 13)

## replace by name
example.list.2[["a2"]] <- c(10, 20, 30)

## iterate and replace by name
example.list.2[c("a1", "a2")] <- map(example.list.2[c("a1", "a2")],
                                        function(x) x * 100)

## logical replacement with iteration
(el.length <- map(example.list.2, length))
## $a1
## [1] 3
## 
## $a2
## [1] 3
## 
## $b1
## [1] 4
## 
## $b2
## [1] 4
(el.length4 <- el.length == 4)
##    a1    a2    b1    b2 
## FALSE FALSE  TRUE  TRUE
example.list.2[el.length4] <- map(example.list.2[el.length4],
                                     function(x) str_c("letter", x, sep = "_"))

## "replace" non-existing element
example.list.2[["c"]] <- list(x = letters[1:5], y = 1:5)
## compare lists to see the changes we made
str(example.list.1)
## List of 4
##  $ a1: Named num [1:3] 1 2 3
##   ..- attr(*, "names")= chr [1:3] "a" "b" "c"
##  $ a2: num [1:3] 4 5 6
##  $ b1: chr [1:4] "a" "b" "c" "d"
##  $ b2: chr [1:4] "e" "f" "g" "h"
str(example.list.2)
## List of 5
##  $ a1: Named num [1:3] 1100 1200 1300
##   ..- attr(*, "names")= chr [1:3] "a" "b" "c"
##  $ a2: num [1:3] 1000 2000 3000
##  $ b1: chr [1:4] "letter_a" "letter_b" "letter_c" "letter_d"
##  $ b2: chr [1:4] "letter_e" "letter_f" "letter_g" "letter_h"
##  $ c :List of 2
##   ..$ x: chr [1:5] "a" "b" "c" "d" ...
##   ..$ y: int [1:5] 1 2 3 4 5

Back to the problem at hand: reading data into a list

Using our knowledge of bracket extraction we could start reading in the data files like this:

boys <- list()
girls <- list()

boys[[1]] <- read_csv(boy.file.names[1], na = "")
## Parsed with column specification:
## cols(
##   Rank = col_double(),
##   Name = col_character(),
##   Count = col_double()
## )
boys[[2]] <- read_csv(boy.file.names[2], na = "")
## Parsed with column specification:
## cols(
##   Rank = col_double(),
##   Name = col_character(),
##   Count = col_double()
## )
## ...
girls[[1]] <- read_csv(girl.file.names[1], na = "")
## Parsed with column specification:
## cols(
##   Rank = col_double(),
##   Name = col_character(),
##   Count = col_double()
## )
girls[[2]] <- read_csv(girl.file.names[2], na = "")
## Parsed with column specification:
## cols(
##   Rank = col_double(),
##   Name = col_character(),
##   Count = col_double()
## )
## ...

Exercise 2: String manipulation, extraction and replacement

We saw in the previous example one way to start reading the baby name data, using positional bracket extraction and replacement. In this exercise we will improve on this method by doing the same thing using named extraction and replacement. The first step is to extract the years from boy.file.names and girl.file.names and assign them to the names attribute of those vectors, so that files can be selected by year.

  1. Use the list function to create empty lists named boys and girls.
  2. Write a regular expression that matches digits 0-9 repeated one or more times and use it to extract the years from boy.file.names and girl.file.names (use str_extract).
  3. Use the assignment form of the names function to assign the years vectors from step two to the names of boy.file.names and girl.file.names respectively.
  4. Extract the element named “2015” from girl.file.names and pass it as the argument to read_csv, assigning the result to a new element of the girls list named “2015”. Repeat for elements “2014” and “2013”.
  5. Repeat step four using boy.file.names and the boys list.

Exercise 2 prototype

  ## 1. create lists to store the data
  boys <- list()
  girls <- list()

  ## 2. extract years and assign to names
  boy.years <- str_extract(boy.file.names, "[0-9]+")
  girl.years <- str_extract(girl.file.names, "[0-9]+")
  names(girl.file.names) <- girl.years
  names(boy.file.names) <- boy.years

  ## 3. read three years of girls names (2015, 2014, 2013)
  girls[["2015"]] <- read_csv(girl.file.names["2015"], na = "")
## Parsed with column specification:
## cols(
##   Rank = col_double(),
##   Name = col_character(),
##   Count = col_double()
## )
  girls[["2014"]] <- read_csv(girl.file.names["2014"], na = "")
## Parsed with column specification:
## cols(
##   Rank = col_double(),
##   Name = col_character(),
##   Count = col_double()
## )
  girls[["2013"]] <- read_csv(girl.file.names["2013"], na = "")
## Parsed with column specification:
## cols(
##   Rank = col_double(),
##   Name = col_character(),
##   Count = col_double()
## )
  ## 4. read three years of boys names (2015, 2014, 2013)
  boys[["2015"]] <- read_csv(boy.file.names["2015"], na = "")
## Parsed with column specification:
## cols(
##   Rank = col_double(),
##   Name = col_character(),
##   Count = col_double()
## )
  boys[["2014"]] <- read_csv(boy.file.names["2014"], na = "")
## Parsed with column specification:
## cols(
##   Rank = col_double(),
##   Name = col_character(),
##   Count = col_double()
## )
  boys[["2013"]] <- read_csv(boy.file.names["2013"], na = "")
## Parsed with column specification:
## cols(
##   Rank = col_double(),
##   Name = col_character(),
##   Count = col_double()
## )
  ## 5. examine the result
  str(boys, max.level = 2)
  str(girls, max.level = 2)

Iterating using the map function

With a small number of files reading each one separately isn’t too bad, but it obviously doesn’t scale well. To read all the files conveniently we instead want to instruct R to iterate over the vector of URLs for us and download each one. We can carry out this iteration in several ways, including using one of the map* functions in the purrr package. Here is how it works.

## make up some example data
list.1 <- list(a = sample(1:5, 20, replace = TRUE),
               b = c(NA, sample(1:10, 20, replace = TRUE)),
               c = sample(10:15, 20, replace = TRUE))

Calculate the mean of every entry

map.1 <- map(list.1, mean)
str(map.1)
## List of 3
##  $ a: num 2.95
##  $ b: num NA
##  $ c: num 11.9

Calculate the mean of every entry, passing na.rm argument.

map.1 <- map(list.1, mean, na.rm = TRUE)
str(map.1)
## List of 3
##  $ a: num 2.95
##  $ b: num 6.5
##  $ c: num 11.9

Calculate the mean of every entry, returning a numeric vector instead of a list.

map.2 <- map_dbl(list.1, mean)
str(map.2)
##  Named num [1:3] 2.95 NA 11.9
##  - attr(*, "names")= chr [1:3] "a" "b" "c"

Calculate the mean of every entry, returning a character vector.

map.3 <- map_chr(list.1, mean)
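
Depending on the installed version of purrr, the map_chr call above may warn or fail, because recent releases (1.0.0 and later, as far as I am aware) no longer convert numbers to character automatically. One hedged workaround is to do the conversion explicitly inside the mapped function. The list fixed.list below is made up for this sketch so that the result is reproducible.

```r
library(purrr)

## small fixed example so the result does not depend on sample()
fixed.list <- list(a = c(1, 2, 3), b = c(NA, 4, 6))

## convert each mean to character explicitly before map_chr collects them
map.chr.example <- map_chr(fixed.list, function(x) as.character(mean(x)))
map.chr.example
```

The missing value in b still propagates, so the second element is NA rather than a number stored as text.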

Calculate summaries (map returns a list).

map.4 <- map(list.1, summary)
str(map.4)
## List of 3
##  $ a:Classes 'summaryDefault', 'table'  Named num [1:6] 1 1.75 3 2.95 4 5
##   .. ..- attr(*, "names")= chr [1:6] "Min." "1st Qu." "Median" "Mean" ...
##  $ b:Classes 'summaryDefault', 'table'  Named num [1:7] 2 5 6.5 6.5 8.25 10 1
##   .. ..- attr(*, "names")= chr [1:7] "Min." "1st Qu." "Median" "Mean" ...
##  $ c:Classes 'summaryDefault', 'table'  Named num [1:6] 10 10.8 12 11.9 13 ...
##   .. ..- attr(*, "names")= chr [1:6] "Min." "1st Qu." "Median" "Mean" ...
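
The map functions are one of several ways to iterate. For comparison, here is a sketch of the same kind of mean calculation written as an explicit for loop and with lapply, the base R function that plays a role similar to map. The list fixed.list is made up so that the results are reproducible.

```r
fixed.list <- list(a = c(1, 2, 3), b = c(NA, 4, 6), c = c(10, 12, 14))

## explicit loop: create an empty list, then fill it in by name
loop.means <- list()
for (nm in names(fixed.list)) {
  loop.means[[nm]] <- mean(fixed.list[[nm]], na.rm = TRUE)
}

## lapply does the same iteration in one call; extra arguments such as
## na.rm are passed on to mean, just as they are with map
lapply.means <- lapply(fixed.list, mean, na.rm = TRUE)

## both approaches give one mean per list element
str(loop.means)
str(lapply.means)
```

The loop makes the bookkeeping visible (create a container, walk over the names, store each result), which is the part that map and lapply take care of for us.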

Writing your own functions

The map* functions are useful when you want to apply a function to a list or vector of inputs and obtain the return values. This is very convenient when a function already exists that does exactly what you want. In the examples above we mapped mean and summary to the elements of a list. But what if there is no existing function that does exactly what we want? Suppose that rather than the set of statistics reported by the summary function we want to summarize each element in the list by calculating the length, mean, and standard deviation? In that case we will need to write a function that does what we want. Fortunately, writing functions in R is easy.

## define new function that returns the length, mean, and standard deviation
my.summary <- function(x) {
  n <- length(x)
  avg <- mean(x)
  std.dev <- sd(x)
  return(c(N = n, Mean = avg, Standard.Deviation = std.dev))
}

## test it out
my.summary(list.1[[1]])
##                  N               Mean Standard.Deviation 
##          20.000000           2.950000           1.538112

Our new function can be used just like any other function in R. For example, we can iterate over list.1 and apply our new function to each element.

map(list.1, my.summary)
## $a
##                  N               Mean Standard.Deviation 
##          20.000000           2.950000           1.538112 
## 
## $b
##                  N               Mean Standard.Deviation 
##                 21                 NA                 NA 
## 
## $c
##                  N               Mean Standard.Deviation 
##          20.000000          11.900000           1.552587

Note that you can use the special ... notation to pass named arguments without needing to define them all. For example:

## improve our function by allowing arbitrary arguments to be passed into the body
my.summary <- function(x, ...) {
  n <- length(x)
  avg <- mean(x, ...)
  std.dev <- sd(x, ...)
  return(c(N = n, Mean = avg, Standard.Deviation = std.dev))
}

Now we can try out the new version of our summary function. Calling it without any additional arguments works fine for the first element of the list:

my.summary(list.1[[1]])
##                  N               Mean Standard.Deviation 
##          20.000000           2.950000           1.538112

However, it doesn’t work so well on elements containing missing values:

my.summary(list.1[[2]])
##                  N               Mean Standard.Deviation 
##                 21                 NA                 NA

Even though our function does not have an na.rm argument, we can pass it to mean and sd via ...

my.summary(list.1[[2]], na.rm = TRUE) 
##                  N               Mean Standard.Deviation 
##          21.000000           6.500000           2.438723

Often when writing functions you want to skip some part of the function body under some conditions. For example, we might want to omit missing values only if they exist:

## improve our function by removing missing values if they exist.
my.summary <- function(x, ...) {
  if(any(is.na(x))) {
    x <- na.omit(x)
  }
  n <- length(x)
  avg <- mean(x, ...)
  std.dev <- sd(x, ...)
  return(c(N = n, Mean = avg, Standard.Deviation = std.dev))
}

Conditional execution is also useful for argument checking, among other things.

my.summary <- function(x, mean.only = FALSE, ...) {
  if(!is.numeric(x)) {
    stop("x is not numeric.")
  }
  if(any(is.na(x))) {
    x <- na.omit(x)
  }
  if(mean.only) {
    stats <- c(N = length(x), Mean = mean(x))
  } else {
    stats <- c(N = length(x), Mean = mean(x), Standard.Deviation = sd(x))
  }
  return(stats)
}
map(list.1, my.summary)
## $a
##                  N               Mean Standard.Deviation 
##          20.000000           2.950000           1.538112 
## 
## $b
##                  N               Mean Standard.Deviation 
##          20.000000           6.500000           2.438723 
## 
## $c
##                  N               Mean Standard.Deviation 
##          20.000000          11.900000           1.552587
map(list.1, my.summary, mean.only = TRUE)
## $a
##     N  Mean 
## 20.00  2.95 
## 
## $b
##    N Mean 
## 20.0  6.5 
## 
## $c
##    N Mean 
## 20.0 11.9
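
The argument check is the one branch of my.summary that the examples above never reach. The sketch below uses a cut-down copy of the function (check.summary is a made-up name) and tryCatch to show what a caller sees when the check fails.

```r
## cut-down version of my.summary that keeps only the argument check
check.summary <- function(x) {
  if(!is.numeric(x)) {
    stop("x is not numeric.")
  }
  c(N = length(x), Mean = mean(x))
}

## numeric input passes the check
check.summary(c(1, 2, 3))

## character input triggers stop(); tryCatch turns the error into its message
result <- tryCatch(check.summary(c("a", "b")),
                   error = function(e) conditionMessage(e))
result
```

Without tryCatch the error would simply halt the call, which is usually what we want when a function receives input it cannot handle.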

OK, now that we know how to write functions let’s get back to the problem at hand. We want to read each file in the girl.file.names and boy.file.names vectors.

Exercise 3: Iteration, file IO, functions

We know how to read csv files using read_csv. We know how to iterate using map. All we need to do now is put the two things together.

  1. Use the map and read_csv functions to read all the girls data into an object named girls.data.
  2. Do the same thing for the boys data (name the object boys.data).
  3. Inspect the boys and girls data lists. How many elements do they have? What class are they?
  4. Write a function that returns the class, typeof, and length of its argument. map this function to girls.data and boys.data.

Exercise 3 prototype

## 1. read in the girls names data
girls.data <- map(girl.file.names, read_csv, na = "")
## Parsed with column specification:
## cols(
##   Rank = col_double(),
##   Name = col_character(),
##   Count = col_double()
## )
## ... (the same message is printed once for each of the remaining files)
## 2. read in the boys names data
boys.data <- map(boy.file.names, read_csv, na = "")
## Parsed with column specification:
## cols(
##   Rank = col_double(),
##   Name = col_character(),
##   Count = col_double()
## )
## ... (the same message is printed once for each of the remaining files)
## 3. inspect boys and girls names data
length(boys.data)
length(girls.data)
class