Welcome

Materials and setup

NOTE: skip this section if you are not running R locally (e.g., if you are running R in your browser using a remote Jupyter server)

You should have R installed –if not:

Notes and examples for this workshop are available at

Start RStudio and open a new R script:

  • On Windows, click the Start button and search for RStudio. On Mac, RStudio will be in your Applications folder.
  • In RStudio, go to File -> New File -> R Script

What is R?

R is a programming language designed for statistical computing. Notable characteristics include:

  • Vast capabilities, wide range of statistical and graphical techniques
  • Very popular in academia, growing popularity in business: http://r4stats.com/articles/popularity/
  • Written primarily by statisticians
  • FREE (no monetary cost and open source)
  • Excellent community support: mailing list, blogs, tutorials
  • Easy to extend by writing new functions

Whatever you’re trying to do, you’re probably not the first to try doing it in R. Chances are good that someone has already written a package for that.

Graphical User Interfaces (GUIs)

There are many different ways you can interact with R. See the Data Science Tools workshop notes for details.

For this workshop I encourage you to use RStudio; it is a good R-specific IDE that mostly just works.

Launch RStudio (skip if not using RStudio)

Note: skip this section if you are not using RStudio (e.g., if you are running these examples in a Jupyter notebook).

  • Start the RStudio program
  • In RStudio, go to File -> New File -> R Script

The window in the upper-left is your R script. This is where you will write instructions for R to carry out.

The window in the lower-left is the R console. This is where results will be displayed.

Exercise 0

The purpose of this exercise is to give you an opportunity to explore the interface provided by RStudio (or whichever GUI you’ve decided to use). You may not know how to do these things; that’s fine! This is an opportunity to figure it out.

Also keep in mind that we are living in a golden age of tab completion. If you don’t know the name of an R function, try guessing the first two or three letters and pressing TAB. If you guessed correctly the function you are looking for should appear in a pop up!


  1. Try to get R to add 2 plus 2.

  2. Try to calculate the square root of 10.

  3. R includes extensive documentation, including a file named “An introduction to R”. Try to find this help file.

R basics

Function calls

The general form for calling R functions is function_name(argument1 = value1, argument2 = value2, ...).

Arguments can be matched by name; unnamed arguments will be matched by position.
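As a small illustration of argument matching, the sketch below calls base R's round function three ways. The number being rounded is arbitrary and chosen only for the example.

```r
# round() takes the arguments x (the number) and digits (decimal places).
by.position <- round(3.14159, 2)               # matched by position
by.name     <- round(x = 3.14159, digits = 2)  # matched by name
reordered   <- round(digits = 2, x = 3.14159)  # named arguments may come in any order

# All three calls give the same result
by.position
by.name
reordered
```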

Assignment

Values can be assigned names and used in subsequent operations

  • The <- operator (less than followed by a dash) is used to save values
  • The name on the left gets the value on the right.
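A minimal sketch of assignment, using x as an example name for the square root of 10:

```r
# Save the square root of 10 under the name x
x <- sqrt(10)

# Typing the name by itself prints the stored value
x

# The stored value can be used in later calculations
x * 2
```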
## [1] 3.162278

Asking R for help

You can ask R for help using the help function, or the ? shortcut.

The help function can be used to look up the documentation for a function, or to look up the documentation to a package. We can learn how to use the stats package by reading its documentation like this:
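The calls described above might look like the following; sqrt is used only as a convenient example of a function to look up.

```r
# Look up the documentation for the help function itself
help(help)

# The ? shortcut does the same thing for a single function
?sqrt

# Look up the documentation for a whole package
help(package = "stats")
```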

Example project: baby names!

General goals

I would like to know what the most popular baby names are. In the course of answering this question we will learn to call R functions, install and load packages, assign values to names, read and write data, and more.

Data sets

The examples in this workshop use the baby names data provided by the governments of the United States and the United Kingdom. A cleaned and merged version of these data is available at http://tutorials.iq.harvard.edu/R/Rintro/dataSets/babyNames.csv.

Getting data into R

R has data reading functionality built-in – see e.g., help(read.table). However, faster and more robust tools are available, and so to make things easier on ourselves we will use a contributed package called readr instead. This requires that we learn a little bit about packages in R.

Installing and using R packages

A large number of contributed packages are available. If you are looking for a package for a specific task, https://cran.r-project.org/web/views/ and https://r-pkg.org are good places to start.

You can install a package in R using the install.packages() function. Once a package is installed you may use the library function to attach it so that it can be used.
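A sketch of the two steps for the readr package. The requireNamespace check and the repos argument are additions for convenience (they avoid re-installing and avoid a prompt for a CRAN mirror); a plain install.packages("readr") also works in an interactive session.

```r
# Install readr from CRAN only if it is not already installed.
# Installation needs an internet connection and is done once per computer.
if (!requireNamespace("readr", quietly = TRUE)) {
  install.packages("readr", repos = "https://cloud.r-project.org")
}

# Attach the package so its functions can be called by name.
# This is done once per R session.
library(readr)
```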

Readers for common file types

In order to read data from a file, you have to know what kind of file it is. The table below lists functions that can import data from common plain-text formats.

Data Type                  Function
comma separated            read_csv()
tab separated              read_tsv()
other delimited formats    read_delim()
whitespace separated       read_table()
fixed width                read_fwf()

Note: You may be confused by the existence of similar functions, e.g., read.csv and read.delim. These are older base R functions that tend to be slower and less robust than the readr functions. One way to tell them apart is that the faster, more robust versions use underscores in their names (e.g., read_csv) while the older functions use dots (e.g., read.csv). My advice is to use the newer, more robust versions, i.e., the ones with underscores.

Exercise 1

The purpose of this exercise is to practice reading data into R.

  1. Open the help page for the read_csv function. How can you limit the number of rows to be read in?

  2. Read just the first 10 rows of “http://tutorials.iq.harvard.edu/R/Rintro/dataSets/babyNames.csv”.

  3. Once you have successfully read in the first 10 rows, read the whole file, assigning the result to the name baby.names.

Exercise 1 solution
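One possible approach uses the n_max argument of read_csv. The sketch below writes a few hand-typed rows to a temporary file (example.file, a stand-in so the code runs without an internet connection) and reads them back; the names first.rows and all.rows are also just for illustration.

```r
library(readr)

# Stand-in for the real file: a few rows typed by hand
example.file <- tempfile(fileext = ".csv")
writeLines(c("Year,Sex,Name,Count",
             "1996,Female,aaliyah,802",
             "1996,Female,aarin,5",
             "1996,Female,aaron,26",
             "1996,Female,abagail,87"),
           example.file)

# n_max limits the number of data rows that are read
first.rows <- read_csv(example.file, n_max = 2)
first.rows

# Without n_max the whole file is read
all.rows <- read_csv(example.file)
all.rows
```

For the workshop data, the same calls should work with the babyNames.csv URL in place of example.file, using n_max = 10 for the first step and assigning the full result to baby.names.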

## Parsed with column specification:
## cols(
##   Year = col_integer(),
##   Sex = col_character(),
##   Name = col_character(),
##   Count = col_integer()
## )
## Parsed with column specification:
## cols(
##   Year = col_integer(),
##   Sex = col_character(),
##   Name = col_character(),
##   Count = col_integer()
## )

Data Manipulation

data.frame objects

Usually data read into R will be stored as a data.frame

  • A data.frame is a list of vectors of equal length
    • Each vector in the list forms a column
    • Each column can be a different type of vector
    • Typically columns are variables and the rows are observations

A data.frame has two dimensions, corresponding to the number of rows and the number of columns (in that order).
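To make the structure concrete, here is a small sketch that builds a data.frame by hand and inspects its dimensions. The object name people and its contents are made up for the example.

```r
# Each argument to data.frame() becomes a column; all columns have equal length
people <- data.frame(
  name = c("ann", "bob", "cy"),   # character column
  age  = c(31, 45, 27),           # numeric column
  stringsAsFactors = FALSE        # keep text as character in older R versions
)

dim(people)    # number of rows, then number of columns
nrow(people)
ncol(people)
str(people)    # shows the type of each column
```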

Tools for manipulating data.frame objects

R has decent data manipulation tools built-in – see e.g., help(Extract). However, these tools are powerful and complex and often overwhelm beginners. To make things easier on ourselves we will use a contributed package called dplyr instead.

## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Checking imported data

It is always a good idea to examine the imported data set. Usually we want the result to be a data.frame.

## [1] "tbl_df"     "tbl"        "data.frame"

We can get more information about R objects using the glimpse function.
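Both checks can be sketched on a small hand-typed tibble. Here bn.sample is a hypothetical stand-in for the imported baby.names data, so the printed dimensions will differ from the full data set.

```r
library(dplyr)

# Stand-in for the imported data, with the same four columns
bn.sample <- tibble(
  Year  = c(1996L, 1996L, 1996L),
  Sex   = c("Female", "Female", "Female"),
  Name  = c("aaliyah", "aarin", "aaron"),
  Count = c(802L, 5L, 26L)
)

class(bn.sample)    # a tibble is a special kind of data.frame
glimpse(bn.sample)  # one line per column: name, type, first few values
```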

## Observations: 197,106
## Variables: 4
## $ Year  <int> 1996, 1996, 1996, 1996, 1996, 1996, 1996, 1996, 1996, 19...
## $ Sex   <chr> "Female", "Female", "Female", "Female", "Female", "Femal...
## $ Name  <chr> "aaliyah", "aarin", "aaron", "abagail", "abbey", "abbi",...
## $ Count <int> 802, 5, 26, 87, 510, 5, 311, 235, 17, 1402, 8, 5, 5, 6, ...

Filter data.frame rows

You can extract subsets of data.frames using filter to select rows meeting some condition.
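A minimal sketch of filter, assuming a small hand-typed excerpt (bn.sample, a hypothetical name) in place of the full baby.names data:

```r
library(dplyr)

# A few rows standing in for the full data
bn.sample <- tibble(
  Year  = c(1996L, 1996L, 1997L, 1997L),
  Sex   = c("Female", "M", "Female", "M"),
  Name  = c("jill", "jack", "jill", "ashley"),
  Count = c(306L, 4240L, 254L, 63L)
)

# Keep only the rows where Name is exactly "jill"
jill.rows <- filter(bn.sample, Name == "jill")
jill.rows
```

The first argument is the data and the second is the condition; rows for which the condition is TRUE are kept.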

## # A tibble: 19 x 4
##     Year Sex    Name  Count
##    <int> <chr>  <chr> <int>
##  1  1996 Female jill    306
##  2  1997 Female jill    254
##  3  1998 Female jill    206
##  4  1999 Female jill    169
##  5  2000 Female jill    168
##  6  2001 Female jill    130
##  7  2002 Female jill     85
##  8  2003 Female jill     83
##  9  2004 Female jill     53
## 10  2005 Female jill     61
## 11  2006 Female jill    108
## 12  2007 Female jill     42
## 13  2008 Female jill     25
## 14  2009 Female jill     34
## 15  2010 Female jill     31
## 16  2011 Female jill      5
## 17  2013 Female jill     13
## 18  2014 Female jill     18
## 19  2015 Female jill      7
## # A tibble: 2 x 4
##    Year Sex    Name  Count
##   <int> <chr>  <chr> <int>
## 1  1996 Female jill    306
## 2  1996 M      jack   4240

In the previous example we used == to filter rows. Other relational and logical operators are listed below.

Operator Meaning
== equal to
!= not equal to
> greater than
>= greater than or equal to
< less than
<= less than or equal to
%in% contained in

These operators may be combined with & (and) or | (or).
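The operators above can be combined as in the following sketch, which again uses a small hand-typed excerpt (bn.sample) rather than the full data. The result names are chosen only for illustration.

```r
library(dplyr)

bn.sample <- tibble(
  Year  = c(1996L, 1996L, 1997L, 2000L),
  Sex   = c("Female", "M", "Female", "M"),
  Name  = c("jill", "jack", "jill", "ashley"),
  Count = c(306L, 4240L, 254L, 44L)
)

# & requires both conditions to be true
big.1996 <- filter(bn.sample, Year == 1996 & Count > 1000)
big.1996

# | requires at least one condition to be true
either.name <- filter(bn.sample, Name == "jill" | Name == "jack")
either.name

# %in% is a compact alternative to several == tests joined by |
in.names <- filter(bn.sample, Name %in% c("jill", "jack"))
in.names
```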

Exercise 2: Data Extraction

Read in the “babyNames.csv” file if you have not already done so, assigning the result to baby.names. The file is located at “http://tutorials.iq.harvard.edu/R/Rintro/dataSets/babyNames.csv”.

  1. Extract data for the name “ashley”.

  2. Restrict the previous extraction to include only years between 2000 and 2004.

Exercise 2 solution
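One way to approach both steps is sketched below on a hand-typed excerpt (bn.sample); with the full data the same filter calls would be applied to baby.names. The names ashley and ashley.00.04 are illustrative.

```r
library(dplyr)

bn.sample <- tibble(
  Year  = c(1996L, 1999L, 2000L, 2000L, 2004L, 2000L),
  Sex   = c("Female", "Female", "Female", "M", "Female", "Female"),
  Name  = c("ashley", "ashley", "ashley", "ashley", "ashley", "jill"),
  Count = c(23676L, 18132L, 17997L, 44L, 14370L, 168L)
)

# 1. Extract data for the name "ashley"
ashley <- filter(bn.sample, Name == "ashley")
ashley

# 2. Restrict the previous extraction to the years 2000 through 2004
ashley.00.04 <- filter(ashley, Year >= 2000 & Year <= 2004)
ashley.00.04
```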

## # A tibble: 39 x 4
##     Year Sex    Name   Count
##    <int> <chr>  <chr>  <int>
##  1  1996 Female ashley 23676
##  2  1996 M      ashley    64
##  3  1997 Female ashley 20895
##  4  1997 M      ashley    63
##  5  1998 Female ashley 19871
##  6  1998 M      ashley    27
##  7  1999 Female ashley 18132
##  8  1999 M      ashley    28
##  9  2000 Female ashley 17997
## 10  2000 M      ashley    44
## # ... with 29 more rows
## # A tibble: 10 x 4
##     Year Sex    Name   Count
##    <int> <chr>  <chr>  <int>
##  1  2000 Female ashley 17997
##  2  2000 M      ashley    44
##  3  2001 Female ashley 16524
##  4  2001 M      ashley    33
##  5  2002 Female ashley 15339
##  6  2002 M      ashley    18
##  7  2003 Female ashley 14512
##  8  2003 M      ashley    31
##  9  2004 Female ashley 14370
## 10  2004 M      ashley    57

Adding, removing, and modifying data.frame columns

You can modify data.frames using the mutate function. It works like this:
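A minimal sketch, assuming a hand-typed excerpt (bn.sample) in place of the full data. It adds a Thousands column computed from Count.

```r
library(dplyr)

bn.sample <- tibble(
  Year  = c(1996L, 1996L, 1996L),
  Sex   = c("Female", "Female", "Female"),
  Name  = c("aaliyah", "aarin", "aaron"),
  Count = c(802L, 5L, 26L)
)

# Add a new column; existing columns can be used in the calculation
bn.sample <- mutate(bn.sample, Thousands = Count / 1000)
bn.sample
```

Using the name of an existing column on the left of = overwrites that column instead of adding a new one.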

## # A tibble: 197,106 x 5
##     Year Sex    Name     Count Thousands
##    <int> <chr>  <chr>    <int>     <dbl>
##  1  1996 Female aaliyah    802     0.802
##  2  1996 Female aarin        5     0.005
##  3  1996 Female aaron       26     0.026
##  4  1996 Female abagail     87     0.087
##  5  1996 Female abbey      510     0.51 
##  6  1996 Female abbi         5     0.005
##  7  1996 Female abbie      311     0.311
##  8  1996 Female abbigail   235     0.235
##  9  1996 Female abbigale    17     0.017
## 10  1996 Female abby      1402     1.40 
## # ... with 197,096 more rows

Often one needs to replace values conditionally, as in the following example:
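One way to do this is with dplyr's case_when, sketched below on a hand-typed excerpt (bn.sample). This is an assumption about the approach; base R's ifelse is another option. The cutoffs rely on the data starting in 1996.

```r
library(dplyr)

bn.sample <- tibble(
  Year  = c(1996L, 2004L, 2015L),
  Name  = c("aaliyah", "ashley", "zane"),
  Count = c(802L, 14370L, 6L)
)

# Conditions are checked in order; the first one that is TRUE wins,
# and TRUE on the last line acts as "everything else"
bn.sample <- mutate(bn.sample,
                    Decade = case_when(
                      Year < 2000 ~ "1990's",
                      Year < 2010 ~ "2000's",
                      TRUE        ~ "2010's"
                    ))
bn.sample
```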

## # A tibble: 6 x 6
##    Year Sex    Name    Count Thousands Decade
##   <int> <chr>  <chr>   <int>     <dbl> <chr> 
## 1  1996 Female aaliyah   802     0.802 1990's
## 2  1996 Female aarin       5     0.005 1990's
## 3  1996 Female aaron      26     0.026 1990's
## 4  1996 Female abagail    87     0.087 1990's
## 5  1996 Female abbey     510     0.51  1990's
## 6  1996 Female abbi        5     0.005 1990's
## # A tibble: 6 x 6
##    Year Sex   Name    Count Thousands Decade
##   <int> <chr> <chr>   <int>     <dbl> <chr> 
## 1  2015 Male  william    19     0.019 2010's
## 2  2015 Male  wyatt      29     0.029 2010's
## 3  2015 Male  xavier     11     0.011 2010's
## 4  2015 Male  zander      6     0.006 2010's
## 5  2015 Male  zane        6     0.006 2010's
## 6  2015 Male  zayden      6     0.006 2010's

Exercise 3: Data manipulation

Read in the “babyNames.csv” file if you have not already done so, assigning the result to baby.names. The file is located at “http://tutorials.iq.harvard.edu/R/Rintro/dataSets/babyNames.csv”.

  1. If you look at unique(baby.names$Sex) you’ll notice that some records indicate Male with "M", while other records use "Male". Correct this by replacing "M" with "Male".

  2. Create a column named “Popular” containing TRUE in rows where Count is greater than 30000 and FALSE otherwise.

  3. Filter the baby names data to display only the popular names.

Exercise 3 solution
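A possible solution, sketched on a hand-typed excerpt (bn.sample) instead of the full baby.names data. Popular is stored here as a logical column, which matches the <lgl> column in the printed result; wrap the comparison in as.integer() if you prefer 1 and 0.

```r
library(dplyr)

bn.sample <- tibble(
  Year  = c(1996L, 1996L, 1996L),
  Sex   = c("M", "Male", "Female"),
  Name  = c("jack", "michael", "emily"),
  Count = c(4240L, 38322L, 25150L)
)

# 1. Replace "M" with "Male", leaving all other values unchanged
bn.sample <- mutate(bn.sample, Sex = ifelse(Sex == "M", "Male", Sex))

# 2. Flag rows where Count is greater than 30000
bn.sample <- mutate(bn.sample, Popular = Count > 30000)

# 3. Keep only the popular names
popular.names <- filter(bn.sample, Popular)
popular.names
```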

## # A tibble: 17 x 7
##     Year Sex   Name        Count Thousands Decade Popular
##    <int> <chr> <chr>       <int>     <dbl> <chr>  <lgl>  
##  1  1996 Male  christopher 30870      30.9 1990's TRUE   
##  2  1996 Male  jacob       31864      31.9 1990's TRUE   
##  3  1996 Male  matthew     32031      32.0 1990's TRUE   
##  4  1996 Male  michael     38322      38.3 1990's TRUE   
##  5  1997 Male  jacob       34090      34.1 1990's TRUE   
##  6  1997 Male  matthew     31480      31.5 1990's TRUE   
##  7  1997 Male  michael     37505      37.5 1990's TRUE   
##  8  1998 Male  jacob       35958      36.0 1990's TRUE   
##  9  1998 Male  matthew     31091      31.1 1990's TRUE   
## 10  1998 Male  michael     36569      36.6 1990's TRUE   
## 11  1999 Male  jacob       35306      35.3 1990's TRUE   
## 12  1999 Male  matthew     30388      30.4 1990's TRUE   
## 13  1999 Male  michael     33854      33.9 1990's TRUE   
## 14  2000 Male  jacob       34418      34.4 2000's TRUE   
## 15  2000 Male  michael     31992      32.0 2000's TRUE   
## 16  2001 Male  jacob       32487      32.5 2000's TRUE   
## 17  2002 Male  jacob       30509      30.5 2000's TRUE

Grouping and Aggregation

So far we’ve seen that “Jacob”, “Matthew”, and “Michael” tend to be popular names. That isn’t very satisfying, because it leaves us wanting to know which girls’ names are popular, and perhaps how popularity has changed over time. To answer these questions we need to operate on groups within the data rather than on the whole data structure at once. The dplyr package makes this easy to do using the group_by function.
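A sketch of grouped operations on a hand-typed excerpt (bn.sample). The column name max_count follows the printed output; the object names bn.grouped and top.names are illustrative.

```r
library(dplyr)

bn.sample <- tibble(
  Year  = c(1996L, 1996L, 1996L, 1996L, 1997L, 1997L),
  Sex   = c("Female", "Female", "Male", "Male", "Female", "Female"),
  Name  = c("emily", "jill", "michael", "jacob", "emily", "jill"),
  Count = c(25150L, 306L, 38322L, 31864L, 25731L, 254L)
)

# Mark the groups: every combination of Year and Sex
bn.grouped <- group_by(bn.sample, Year, Sex)

# max() is now computed separately within each Year/Sex group
bn.grouped <- mutate(bn.grouped, max_count = max(Count))

# Keep the row holding the largest Count in each group
top.names <- filter(bn.grouped, Count == max_count)
top.names

# Remove the grouping information when it is no longer needed
top.names <- ungroup(top.names)
```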

## # A tibble: 6 x 8
## # Groups:   Year, Sex [1]
##    Year Sex    Name    Count Thousands Decade Popular max_count
##   <int> <chr>  <chr>   <int>     <dbl> <chr>  <lgl>       <dbl>
## 1  1996 Female aaliyah   802     0.802 1990's FALSE       25150
## 2  1996 Female aarin       5     0.005 1990's FALSE       25150
## 3  1996 Female aaron      26     0.026 1990's FALSE       25150
## 4  1996 Female abagail    87     0.087 1990's FALSE       25150
## 5  1996 Female abbey     510     0.51  1990's FALSE       25150
## 6  1996 Female abbi        5     0.005 1990's FALSE       25150
## # A tibble: 6 x 8
## # Groups:   Year, Sex [1]
##    Year Sex   Name    Count Thousands Decade Popular max_count
##   <int> <chr> <chr>   <int>     <dbl> <chr>  <lgl>       <dbl>
## 1  2015 Male  william    19     0.019 2010's FALSE       19485
## 2  2015 Male  wyatt      29     0.029 2010's FALSE       19485
## 3  2015 Male  xavier     11     0.011 2010's FALSE       19485
## 4  2015 Male  zander      6     0.006 2010's FALSE       19485
## 5  2015 Male  zane        6     0.006 2010's FALSE       19485
## 6  2015 Male  zayden      6     0.006 2010's FALSE       19485
## # A tibble: 40 x 8
## # Groups:   Year, Sex [40]
##     Year Sex    Name    Count Thousands Decade Popular max_count
##    <int> <chr>  <chr>   <int>     <dbl> <chr>  <lgl>       <dbl>
##  1  1996 Female emily   25150      25.2 1990's FALSE       25150
##  2  1996 Male   michael 38322      38.3 1990's TRUE        38322
##  3  1997 Female emily   25731      25.7 1990's FALSE       25731
##  4  1997 Male   michael 37505      37.5 1990's TRUE        37505
##  5  1998 Female emily   26179      26.2 1990's FALSE       26179
##  6  1998 Male   michael 36569      36.6 1990's TRUE        36569
##  7  1999 Female emily   26537      26.5 1990's FALSE       26537
##  8  1999 Male   jacob   35306      35.3 1990's TRUE        35306
##  9  2000 Female emily   25953      26.0 2000's FALSE       25953
## 10  2000 Male   jacob   34418      34.4 2000's TRUE        34418
## # ... with 30 more rows

Note that the data remains grouped until you change the groups by running group_by again or remove grouping information with ungroup.

Grouping can be useful when modifying a data.frame with mutate or extracting subsets with filter, but it really shines when combined with summarize. For example, suppose that we want the most popular girl and boy names for each decade. In this case we need to summarize the data by summing the Count column for each name within each Sex X Decade group, and then keep the name with the largest total in each group.
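The following is a sketch of that two-step summary, assuming a hand-typed excerpt (bn.sample) that covers only four names in the 1990's. Because the excerpt omits rows, its totals will not necessarily match totals from the full data.

```r
library(dplyr)

bn.sample <- tibble(
  Decade = "1990's",
  Sex    = rep(c("Female", "Male"), each = 8),
  Name   = rep(c("emily", "jill", "michael", "jacob"), each = 4),
  Year   = rep(1996:1999, times = 4),
  Count  = c(25150, 25731, 26179, 26537,   # emily
             306, 254, 206, 169,           # jill
             38322, 37505, 36569, 33854,   # michael
             31864, 34090, 35958, 35306)   # jacob
)

# Step 1: total Count for each name within each Decade/Sex combination
by.name <- group_by(bn.sample, Decade, Sex, Name)
totals  <- summarize(by.name, Count = sum(Count))

# summarize() drops the last grouping variable (Name), so totals is
# still grouped by Decade and Sex.
# Step 2: keep the name with the largest total in each group
top.by.decade <- filter(totals, Count == max(Count))
top.by.decade
```

Recent versions of dplyr print a message about the remaining grouping after summarize; it is informational and can be ignored here.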

## # A tibble: 6 x 5
## # Groups:   Decade, Sex [6]
##   Decade Sex    Name     Count Thousands
##   <chr>  <chr>  <chr>    <int>     <dbl>
## 1 1990's Female emily   103597      104.
## 2 1990's Male   michael 146430      146.
## 3 2000's Female emily   223612      224.
## 4 2000's Male   jacob   273690      274.
## 5 2010's Female sophia  121787      122.
## 6 2010's Male   jacob   112227      112.

In the previous example we used sum and max, two examples of basic statistics functions in R. Other basic statistics functions include:

  • mean
  • median
  • sd
  • var
  • min
  • quantile
  • length

Exporting Data

Now that we have made some changes to our data set, we might want to save those changes to a file.
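One option is readr's write_csv, the counterpart of read_csv. The sketch below writes a small hand-typed data.frame (bn.sample) to a temporary location; the file name babyNames_example.csv is made up for the example.

```r
library(readr)

bn.sample <- data.frame(
  Year  = c(1996L, 1996L),
  Sex   = c("Female", "Male"),
  Name  = c("emily", "michael"),
  Count = c(25150L, 38322L)
)

# Choose where to save; tempdir() is used so the example leaves no clutter
out.file <- file.path(tempdir(), "babyNames_example.csv")

# Write the data as a comma separated file
write_csv(bn.sample, out.file)

# Read it back to confirm the round trip
check <- read_csv(out.file)
check
```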

Saving and loading R workspaces

In addition to importing individual datasets, R can save and load entire workspaces
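A sketch of a save and restore cycle using the base R functions save.image and load. The file name myWorkspace.RData and the two example objects are assumptions for illustration. Note that rm(list = ls()) deletes every object in the workspace, so only run it when you have a saved copy.

```r
# Two example objects so the workspace is not empty
x <- 10
fname <- "example"

# List the objects currently in the workspace
ls()

# Save every object in the workspace to a single file
save.image(file = file.path(tempdir(), "myWorkspace.RData"))

# Remove everything from the workspace
rm(list = ls())
ls()

# Restore the saved objects
load(file.path(tempdir(), "myWorkspace.RData"))
ls()
```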

## [1] "baby.names"   "bn.by.decade" "fname"        "oname"       
## [5] "x"
## character(0)
## [1] "baby.names"   "bn.by.decade" "fname"        "oname"       
## [5] "x"

Exercise 4

  1. Calculate the total number of children born.

  2. Filter the data to extract data from 2004 and calculate the total number of children born in that year.

  3. Calculate the number of boys and girls born each year. Assign the result to the name births.by.year.

Exercise 4 solution
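A possible solution, sketched on a hand-typed excerpt (bn.sample) in which Sex has already been cleaned as in Exercise 3. With the full data the same calls would be applied to baby.names. The names total, bn.2004, and total.2004 are illustrative.

```r
library(dplyr)

bn.sample <- tibble(
  Year  = c(1996L, 1996L, 2004L, 2004L),
  Sex   = c("Female", "Male", "Female", "Male"),
  Name  = c("emily", "michael", "ashley", "ashley"),
  Count = c(25150, 38322, 14370, 57)
)

# 1. Total number of children born
total <- summarize(bn.sample, Total = sum(Count))
total

# 2. Total for 2004 only
bn.2004    <- filter(bn.sample, Year == 2004)
total.2004 <- summarize(bn.2004, Total = sum(Count))
total.2004

# 3. Number of boys and girls born each year
births.by.year <- summarize(group_by(bn.sample, Year, Sex),
                            Count = sum(Count))
births.by.year
```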

## # A tibble: 1 x 1
##      Total
##      <int>
## 1 64519299
## # A tibble: 1 x 1
##     Total
##     <int>
## 1 3294392
## # A tibble: 40 x 3
## # Groups:   Year [?]
##     Year Sex      Count
##    <int> <chr>    <int>
##  1  1996 Female 1497121
##  2  1996 Male   1728057
##  3  1997 Female 1478802
##  4  1997 Male   1714552
##  5  1998 Female 1496488
##  6  1998 Male   1733194
##  7  1999 Female 1497247
##  8  1999 Male   1736147
##  9  2000 Female 1527247
## 10  2000 Male   1769871
## # ... with 30 more rows

Basic graphs

R has decent plotting tools built-in – see e.g., help(plot). However, to make things easier on ourselves we will use a contributed package called ggplot2 instead.

First, we’ll plot the number of boys and girls born each year.
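A minimal sketch of such a plot using ggplot and geom_line. The births.by.year values below are typed in by hand for the first five years so that the example is self-contained; in the workshop this object would come from Exercise 4.

```r
library(ggplot2)

births.by.year <- data.frame(
  Year  = rep(1996:2000, each = 2),
  Sex   = rep(c("Female", "Male"), times = 5),
  Count = c(1497121, 1728057, 1478802, 1714552, 1496488,
            1733194, 1497247, 1736147, 1527247, 1769871)
)

# Year on the x axis, Count on the y axis, one colored line per Sex
p <- ggplot(births.by.year, aes(x = Year, y = Count, color = Sex)) +
  geom_line()

# Printing the plot object draws it
p
```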

Next, we’ll select the most popular girls’ names and plot their popularity over time.
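One way this could be done is sketched below on a hand-typed excerpt of three girls' names (girls). The cutoff of 20000 is an arbitrary assumption used to decide which names count as popular, not a value from the workshop.

```r
library(dplyr)
library(ggplot2)

girls <- tibble(
  Name  = rep(c("emily", "ashley", "jill"), each = 5),
  Year  = rep(1996:2000, times = 3),
  Count = c(25150, 25731, 26179, 26537, 25953,   # emily
            23676, 20895, 19871, 18132, 17997,   # ashley
            306, 254, 206, 169, 168)             # jill
)

# Keep every row of a name whose highest yearly Count reached the cutoff
by.name       <- group_by(girls, Name)
popular.girls <- ungroup(filter(by.name, max(Count) >= 20000))

# One line per name, showing how its Count changes over time
p <- ggplot(popular.girls, aes(x = Year, y = Count, color = Name)) +
  geom_line()
p
```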

Exercise 5

  1. Add a new column to the baby.names data equal to the proportion of boys and girls born each year with each name. That is, calculate Proportion = Count/sum(Count) for each Year X Sex group.

  2. Filter the baby.names data, retaining only the most popular girl and boy names for each year.

  3. Plot proportion over time to see changes in the proportion of parents choosing the most popular name of the year.