Welcome

Materials and setup

NOTE: skip this section if you are not running R locally (e.g., if you are running R in your browser using a remote Jupyter server)

You should have R installed. If not, install it before continuing.

Materials for this workshop consist of notes and example code.

Start RStudio and open a new R script:

  • On Windows, click the Start button and search for rstudio. On Mac, RStudio will be in your Applications folder.
  • In RStudio, go to File -> Open File and open the Rintro.R file you downloaded earlier.

What is R?

R is a programming language designed for statistical computing. Notable characteristics include:

  • Vast capabilities, wide range of statistical and graphical techniques
  • Very popular in academia, growing popularity in business: http://r4stats.com/articles/popularity/
  • Written primarily by statisticians
  • FREE (no monetary cost and open source)
  • Excellent community support: mailing list, blogs, tutorials
  • Easy to extend by writing new functions

Whatever you’re trying to do, you’re probably not the first to try doing it in R. Chances are good that someone has already written a package for that.

Graphical User Interfaces (GUIs)

There are many different ways you can interact with R.

For this workshop I encourage you to use RStudio; it is a good R-specific IDE that mostly just works.

Launch RStudio (skip if not using RStudio)

Note: skip this section if you are not using RStudio (e.g., if you are running these examples in a Jupyter notebook).

  • Start the RStudio program
  • In RStudio, go to File => New File => R Script

The window in the upper-left is your R script. This is where you will write instructions for R to carry out.

The window in the lower-left is the R console. This is where results will be displayed.

Exercise 0

The purpose of this exercise is to give you an opportunity to explore the interface provided by RStudio (or whichever GUI you’ve decided to use). You may not know how to do these things; that’s fine! This is an opportunity to figure it out.

Also keep in mind that we are living in a golden age of tab completion. If you don’t know the name of an R function, try guessing the first two or three letters and pressing TAB. If you guessed correctly, the function you are looking for should appear in a pop-up!


  1. Try to get R to add 2 plus 2.
  2. Try to calculate the square root of 10.
  3. There is an R package named car. Try to install this package.
  4. R includes extensive documentation, including a file named “An introduction to R”. Try to find this help file.
  5. Open a new web browser or tab, go to http://cran.r-project.org/web/views/ and skim the topic closest to your field/interests.

Exercise 0 solution

## [1] 4
## [1] 4
## [1] 3.162278
## [1] 3.162278
## Installing package into '/home/izahn/R/x86_64-pc-linux-gnu-library/3.4'
## (as 'lib' is unspecified)
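
The commands behind the output above are not shown. One possible set of answers is sketched below; it is not necessarily the workshop's own solution. The install and help-browser lines are commented out because they need internet access or an interactive session.

```r
## 1. Two ways to add 2 plus 2
2 + 2
sum(2, 2)

## 2. Two ways to calculate the square root of 10
sqrt(10)
10^(1/2)

## 3. Install the car package (uncomment to run; needs internet access)
# install.packages("car")

## 4. Open the HTML help index, which links to "An Introduction to R"
# help.start()
```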

R basics

Function calls

The general form for calling R functions is FunctionName(arg.1 = value.1, arg.2 = value.2, ..., arg.n = value.n).

Arguments can be matched by name; unnamed arguments will be matched by position.
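
As an illustration of the two matching rules, the sketch below calls the base R function seq three ways. The choice of seq and the result names are only for this example.

```r
## Arguments matched by name: the order does not matter
by.name <- seq(to = 10, from = 1, by = 3)

## Unnamed arguments matched by position: from, to, by
by.position <- seq(1, 10, 3)

## Both styles can be mixed in one call
mixed <- seq(1, 10, by = 3)

## All three calls produce the same sequence
by.name
```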

R packages

R packages can be installed from the Comprehensive R Archive Network (CRAN) using the install.packages function. Installing a package puts a copy of the package on your local computer, but does not make it available for use. To use an installed package you must attach it using the library function.

For this workshop we will use a suite of packages called “the tidyverse”. The tidyverse provides improved replacements for much of the basic data manipulation functionality in R. We can install and attach this package as follows:
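
A minimal sketch of the install-then-attach steps follows. The requireNamespace check is an addition here so that the package is only downloaded when it is missing; calling install.packages("tidyverse") directly works too.

```r
## Install once per computer (skipped if tidyverse is already installed)
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

## Attach in every new R session before using tidyverse functions
library(tidyverse)
```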

Asking R for help

You can ask R for help using the help function, or the ? shortcut.

The help function can be used to look up the documentation for a function, or to look up the documentation for a package. We can learn how to use the stats package by reading its documentation like this:
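
The calls below sketch the three ways of asking for help described here; sqrt is just an example topic.

```r
help(help)               # documentation for the help function itself
?sqrt                    # shortcut, same as help(sqrt)
help(package = "stats")  # overview of the stats package
```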

Assignment

Values can be assigned names and used in subsequent operations.

  • The <- operator (less than followed by a dash) is used to save values
  • The name on the left gets the value on the right.
## [1] 3.162278
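
The value printed above is the square root of 10. A sketch of an assignment that produces it is shown below; the name x is an arbitrary choice.

```r
## Save the square root of 10 under the name x
x <- sqrt(10)

## Typing a name prints the value stored under it
x

## Named values can be used in later calculations
x^2
```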

Example project: baby names!

General goals

I would like to know what the most popular baby names are. In the course of answering this question we will learn to call R functions, install and load packages, assign values to names, read and write data, and more.

Data sets

The examples in this workshop use the baby names data provided by the governments of the United States and the United Kingdom. A cleaned and merged version of these data is available at http://tutorials.iq.harvard.edu/R/Rintro/dataSets/babyNames.csv.

Getting data into R

Readers for common file types

In order to read data from a file, you have to know what kind of file it is. The table below lists the functions that can import data from common file formats.

Data type                 Function       Package
comma separated (.csv)    read_csv()     readr
other delimited formats   read_delim()   readr
R (.Rds)                  read_rds()     readr
Stata (.dta)              read_dta()     haven
SPSS (.sav)               read_spss()    haven
SAS (.sas7bdat)           read_sas()     haven
Excel (.xls, .xlsx)       read_excel()   readxl

Exercise 2

The purpose of this exercise is to practice reading data into R.

  1. Install and attach the tidyverse package if you have not yet done so.

  2. Open the help page for the read_csv function. How can you limit the number of rows to be read in?

  3. Read just the first 10 rows of “http://tutorials.iq.harvard.edu/R/Rintro/dataSets/babyNames.csv”.
  4. Once you have successfully read in the first 10 rows, read the whole file, assigning the result to the name baby.names.

Exercise 2 solution

## Parsed with column specification:
## cols(
##   Location = col_character(),
##   Year = col_integer(),
##   Sex = col_character(),
##   Name = col_character(),
##   Count = col_double(),
##   Percent = col_double(),
##   Name.length = col_integer()
## )
## Parsed with column specification:
## cols(
##   Location = col_character(),
##   Year = col_integer(),
##   Sex = col_character(),
##   Name = col_character(),
##   Count = col_double(),
##   Percent = col_double(),
##   Name.length = col_integer()
## )
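
The code that produced the two column specifications above is not shown. The sketch below follows the same steps but, so that it runs offline, reads a tiny stand-in file written to a temporary location. The stand-in file and the name first.rows are assumptions for this example; in the exercise, pass the babyNames.csv URL to read_csv and use n_max = 10.

```r
library(readr)

## Stand-in for the workshop file (hypothetical contents)
csv.file <- tempfile(fileext = ".csv")
writeLines(c("Name,Count",
             "sophie,7087",
             "chloe,6824",
             "jessica,6711"),
           csv.file)

## ?read_csv documents the n_max argument, which limits the rows read
first.rows <- read_csv(csv.file, n_max = 2)

## Read the whole file and assign the result to a name
baby.names <- read_csv(csv.file)
```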

Checking imported data

It is always a good idea to examine the imported data set. Usually we want the result to be a data.frame or a tibble.

## [1] "tbl_df"     "tbl"        "data.frame"

We can get more information about R objects using the glimpse function.

## Observations: 1,966,001
## Variables: 7
## $ Location    <chr> "England and Wales", "England and Wales", "England...
## $ Year        <int> 1996, 1996, 1996, 1996, 1996, 1996, 1996, 1996, 19...
## $ Sex         <chr> "Female", "Female", "Female", "Female", "Female", ...
## $ Name        <chr> "sophie", "chloe", "jessica", "emily", "lauren", "...
## $ Count       <dbl> 7087, 6824, 6711, 6415, 6299, 5916, 5866, 5828, 52...
## $ Percent     <dbl> 2.3942729, 2.3054210, 2.2672450, 2.1672444, 2.1280...
## $ Name.length <int> 6, 5, 7, 5, 6, 6, 9, 7, 3, 5, 7, 5, 7, 4, 4, 7, 5,...
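
The two checks above can be sketched on a small stand-in tibble. bn.small is a hypothetical object built here so the example runs without downloading the data; with the real data you would pass baby.names instead.

```r
library(dplyr)

## Small hypothetical stand-in for baby.names
bn.small <- tibble(Name  = c("sophie", "chloe"),
                   Count = c(7087, 6824))

class(bn.small)    # a tibble is also a data.frame
glimpse(bn.small)  # one line per column: name, type, first values
```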

Data Manipulation

data.frame objects

Usually data read into R will be stored as a data.frame.

  • A data.frame is a list of vectors of equal length
    • Each vector in the list forms a column
    • Each column can be a different type of vector
    • Typically columns are variables and the rows are observations

A data.frame has two dimensions corresponding to the number of rows and the number of columns (in that order).
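
To make the structure concrete, here is a minimal sketch using base R only. The object df and its contents are made up for the example.

```r
## A small data.frame built from two vectors of equal length
df <- data.frame(Name  = c("sophie", "chloe", "jessica"),
                 Count = c(7087, 6824, 6711))

dim(df)   # number of rows, then number of columns
nrow(df)
ncol(df)
str(df)   # each column is a vector with its own type
```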

Slice and filter data.frame rows

You can extract subsets of data.frames using slice to select rows by number and filter to select rows that match some condition. It works like this:

## # A tibble: 3 x 7
##   Location           Year Sex    Name    Count Percent Name.length
##   <chr>             <int> <chr>  <chr>   <dbl>   <dbl>       <int>
## 1 England and Wales  1996 Female sophie   7087    2.39           6
## 2 England and Wales  1996 Female chloe    6824    2.31           5
## 3 England and Wales  1996 Female jessica  6711    2.27           7
## # A tibble: 29 x 7
##    Location           Year Sex    Name  Count Percent Name.length
##    <chr>             <int> <chr>  <chr> <dbl>   <dbl>       <int>
##  1 England and Wales  1996 Female jill   8.00 0.00270           4
##  2 AZ                 1996 Female jill   5.00 0.0176            4
##  3 CA                 1996 Female jill  28.0  0.0122            4
##  4 CT                 1996 Female jill   8.00 0.0491            4
##  5 DE                 1996 Female jill   5.00 0.158             4
##  6 FL                 1996 Female jill  13.0  0.0179            4
##  7 GA                 1996 Female jill   7.00 0.0169            4
##  8 IA                 1996 Female jill  12.0  0.0840            4
##  9 IL                 1996 Female jill  20.0  0.0280            4
## 10 IN                 1996 Female jill  11.0  0.0334            4
## 11 KS                 1996 Female jill   5.00 0.0402            4
## 12 KY                 1996 Female jill   5.00 0.0255            4
## 13 MA                 1996 Female jill  15.0  0.0466            4
## 14 MD                 1996 Female jill   9.00 0.0364            4
## 15 MI                 1996 Female jill  14.0  0.0269            4
## 16 MN                 1996 Female jill  13.0  0.0523            4
## 17 MO                 1996 Female jill   9.00 0.0305            4
## 18 NJ                 1996 Female jill  10.0  0.0236            4
## 19 NY                 1996 Female jill  23.0  0.0221            4
## 20 OH                 1996 Female jill  13.0  0.0210            4
## 21 OR                 1996 Female jill   9.00 0.0541            4
## 22 PA                 1996 Female jill  12.0  0.0198            4
## 23 RI                 1996 Female jill   5.00 0.117             4
## 24 TN                 1996 Female jill   5.00 0.0169            4
## 25 TX                 1996 Female jill  20.0  0.0143            4
## 26 UT                 1996 Female jill   9.00 0.0591            4
## 27 VA                 1996 Female jill   9.00 0.0264            4
## 28 WA                 1996 Female jill   5.00 0.0175            4
## 29 WI                 1996 Female jill   7.00 0.0276            4
## # A tibble: 80 x 7
##    Location           Year Sex    Name     Count Percent Name.length
##    <chr>             <int> <chr>  <chr>    <dbl>   <dbl>       <int>
##  1 England and Wales  1996 Female jill      8.00 0.00270           4
##  2 England and Wales  1996 M      jack  10779    3.38              4
##  3 AK                 1996 M      jack     12.0  0.337             4
##  4 AL                 1996 M      jack     35.0  0.140             4
##  5 AR                 1996 M      jack     20.0  0.138             4
##  6 AZ                 1996 Female jill      5.00 0.0176            4
##  7 AZ                 1996 M      jack     59.0  0.176             4
##  8 CA                 1996 Female jill     28.0  0.0122            4
##  9 CA                 1996 M      jack    525    0.207             4
## 10 CO                 1996 M      jack     88.0  0.354             4
## 11 CT                 1996 Female jill      8.00 0.0491            4
## 12 CT                 1996 M      jack    100    0.521             4
## 13 DC                 1996 M      jack     17.0  0.340             4
## 14 DE                 1996 Female jill      5.00 0.158             4
## 15 DE                 1996 M      jack     13.0  0.339             4
## 16 FL                 1996 Female jill     13.0  0.0179            4
## 17 FL                 1996 M      jack    159    0.187             4
## 18 GA                 1996 Female jill      7.00 0.0169            4
## 19 GA                 1996 M      jack    105    0.217             4
## 20 HI                 1996 M      jack      8.00 0.140             4
## 21 IA                 1996 Female jill     12.0  0.0840            4
## 22 IA                 1996 M      jack     47.0  0.283             4
## 23 ID                 1996 M      jack     15.0  0.205             4
## 24 IL                 1996 Female jill     20.0  0.0280            4
## 25 IL                 1996 M      jack    323    0.399             4
## 26 IN                 1996 Female jill     11.0  0.0334            4
## 27 IN                 1996 M      jack     80.0  0.213             4
## 28 KS                 1996 Female jill      5.00 0.0402            4
## 29 KS                 1996 M      jack     50.0  0.333             4
## 30 KY                 1996 Female jill      5.00 0.0255            4
## 31 KY                 1996 M      jack     32.0  0.138             4
## 32 LA                 1996 M      jack     29.0  0.109             4
## 33 MA                 1996 Female jill     15.0  0.0466            4
## 34 MA                 1996 M      jack    165    0.444             4
## 35 MD                 1996 Female jill      9.00 0.0364            4
## 36 MD                 1996 M      jack     94.0  0.327             4
## 37 ME                 1996 M      jack     17.0  0.295             4
## 38 MI                 1996 Female jill     14.0  0.0269            4
## 39 MI                 1996 M      jack    179    0.301             4
## 40 MN                 1996 Female jill     13.0  0.0523            4
## 41 MN                 1996 M      jack    237    0.834             4
## 42 MO                 1996 Female jill      9.00 0.0305            4
## 43 MO                 1996 M      jack     78.0  0.226             4
## 44 MS                 1996 M      jack     12.0  0.0793            4
## 45 MT                 1996 M      jack     15.0  0.369             4
## 46 NC                 1996 M      jack     78.0  0.170             4
## 47 ND                 1996 M      jack      7.00 0.185             4
## 48 NE                 1996 M      jack     42.0  0.428             4
## 49 NH                 1996 M      jack     18.0  0.291             4
## 50 NJ                 1996 Female jill     10.0  0.0236            4
## 51 NJ                 1996 M      jack    149    0.306             4
## 52 NM                 1996 M      jack     14.0  0.130             4
## 53 NV                 1996 M      jack     22.0  0.211             4
## 54 NY                 1996 Female jill     23.0  0.0221            4
## 55 NY                 1996 M      jack    399    0.336             4
## 56 OH                 1996 Female jill     13.0  0.0210            4
## 57 OH                 1996 M      jack    169    0.241             4
## 58 OK                 1996 M      jack     32.0  0.168             4
## 59 OR                 1996 Female jill      9.00 0.0541            4
## 60 OR                 1996 M      jack     59.0  0.300             4
## 61 PA                 1996 Female jill     12.0  0.0198            4
## 62 PA                 1996 M      jack    114    0.167             4
## 63 RI                 1996 Female jill      5.00 0.117             4
## 64 RI                 1996 M      jack     21.0  0.379             4
## 65 SC                 1996 M      jack     23.0  0.117             4
## 66 SD                 1996 M      jack     12.0  0.288             4
## 67 TN                 1996 Female jill      5.00 0.0169            4
## 68 TN                 1996 M      jack     51.0  0.148             4
## 69 TX                 1996 Female jill     20.0  0.0143            4
## 70 TX                 1996 M      jack    168    0.108             4
## 71 UT                 1996 Female jill      9.00 0.0591            4
## 72 UT                 1996 M      jack     30.0  0.158             4
## 73 VA                 1996 Female jill      9.00 0.0264            4
## 74 VA                 1996 M      jack     87.0  0.223             4
## 75 VT                 1996 M      jack     13.0  0.510             4
## 76 WA                 1996 Female jill      5.00 0.0175            4
## 77 WA                 1996 M      jack     92.0  0.273             4
## 78 WI                 1996 Female jill      7.00 0.0276            4
## 79 WI                 1996 M      jack    118    0.404             4
## 80 WV                 1996 M      jack      8.00 0.0871            4
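
The calls behind the three tables above are not shown. The sketch below performs the same kinds of selections on a small stand-in data set; bn.small and the result names are hypothetical, and the exact conditions used in the workshop may differ.

```r
library(dplyr)

## Small hypothetical stand-in for baby.names
bn.small <- data.frame(Year = c(1996, 1996, 1996, 1997),
                       Sex  = c("Female", "Female", "M", "Female"),
                       Name = c("sophie", "jill", "jack", "jill"))

## Rows 1 through 3, selected by position
first.three <- slice(bn.small, 1:3)

## Rows where both conditions hold
jill.1996 <- filter(bn.small, Year == 1996 & Name == "jill")

## Rows from 1996 where the name is either jack or jill
jack.or.jill <- filter(bn.small,
                       Year == 1996 & (Name == "jack" | Name == "jill"))
```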

Select data.frame columns

slice and filter are used to extract rows. select is used to extract columns.

## # A tibble: 10 x 2
##    Name      Count
##    <chr>     <dbl>
##  1 sophie     7087
##  2 chloe      6824
##  3 jessica    6711
##  4 emily      6415
##  5 lauren     6299
##  6 hannah     5916
##  7 charlotte  5866
##  8 rebecca    5828
##  9 amy        5206
## 10 megan      4948
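
A two-column result like the one above can be obtained with select. The sketch uses a hypothetical stand-in, bn.small, in place of baby.names.

```r
library(dplyr)

## Small hypothetical stand-in for baby.names
bn.small <- data.frame(Year  = c(1996, 1996),
                       Name  = c("sophie", "chloe"),
                       Count = c(7087, 6824))

## Keep only the Name and Count columns
name.count <- select(bn.small, Name, Count)
name.count
```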

You can also conveniently select a single column using $, like this:

## [1] 39.09729
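
The number above is consistent with the mean of the Count column. A sketch of the same pattern on made-up data (bn.small is hypothetical):

```r
## Small hypothetical stand-in for baby.names
bn.small <- data.frame(Name  = c("sophie", "chloe"),
                       Count = c(7087, 6824))

## $ extracts one column as a plain vector
bn.small$Count

## The vector can be passed to any function that accepts one
mean.count <- mean(bn.small$Count)
mean.count
```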

In the previous example we used == to filter rows. Other relational and logical operators are listed below.

Operator   Meaning
==         equal to
!=         not equal to
>          greater than
>=         greater than or equal to
<          less than
<=         less than or equal to
%in%       contained in
&          and
|          or
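
Each operator returns one TRUE or FALSE per element, which is what filter uses to keep or drop rows. The vectors below are made up for illustration.

```r
name  <- c("sophie", "jack", "jill", "chloe")
count <- c(7087, 10779, 8, 6824)

## Equal to a single value
is.jill <- name == "jill"

## Contained in a set of values
jack.or.jill <- name %in% c("jack", "jill")

## Both conditions must hold
big.not.jack <- count > 1000 & name != "jack"

## At least one condition must hold
extreme <- count < 10 | count > 10000

is.jill
```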

Adding, removing, and modifying data.frame columns

You can modify data.frames using the mutate() function. It works like this:

## Observations: 1,966,001
## Variables: 8
## $ Location    <chr> "England and Wales", "England and Wales", "England...
## $ Year        <int> 1996, 1996, 1996, 1996, 1996, 1996, 1996, 1996, 19...
## $ Sex         <chr> "Female", "Female", "Female", "Female", "Female", ...
## $ Name        <chr> "sophie", "chloe", "jessica", "emily", "lauren", "...
## $ Count       <dbl> 7087, 6824, 6711, 6415, 6299, 5916, 5866, 5828, 52...
## $ Percent     <dbl> 2.3942729, 2.3054210, 2.2672450, 2.1672444, 2.1280...
## $ Name.length <int> 6, 5, 7, 5, 6, 6, 9, 7, 3, 5, 7, 5, 7, 4, 4, 7, 5,...
## $ Thousands   <dbl> 7.087, 6.824, 6.711, 6.415, 6.299, 5.916, 5.866, 5...
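
The new Thousands column above looks like Count divided by 1000. A sketch of a mutate call that does this, on a hypothetical stand-in for baby.names:

```r
library(dplyr)

## Small hypothetical stand-in for baby.names
bn.small <- data.frame(Name  = c("sophie", "chloe"),
                       Count = c(7087, 6824))

## Add a column computed from an existing column
bn.small <- mutate(bn.small, Thousands = Count / 1000)
bn.small
```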

Often one needs to replace values conditionally. The case_when function can be used for this purpose:

## Observations: 1,966,001
## Variables: 9
## $ Location    <chr> "England and Wales", "England and Wales", "England...
## $ Year        <int> 1996, 1996, 1996, 1996, 1996, 1996, 1996, 1996, 19...
## $ Sex         <chr> "Female", "Female", "Female", "Female", "Female", ...
## $ Name        <chr> "sophie", "chloe", "jessica", "emily", "lauren", "...
## $ Count       <dbl> 7087, 6824, 6711, 6415, 6299, 5916, 5866, 5828, 52...
## $ Percent     <dbl> 2.3942729, 2.3054210, 2.2672450, 2.1672444, 2.1280...
## $ Name.length <int> 6, 5, 7, 5, 6, 6, 9, 7, 3, 5, 7, 5, 7, 4, 4, 7, 5,...
## $ Thousands   <dbl> 7.087, 6.824, 6.711, 6.415, 6.299, 5.916, 5.866, 5...
## $ Country     <chr> "UK", "UK", "UK", "UK", "UK", "UK", "UK", "UK", "U...
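
The Country column above could be built with case_when along these lines. The rule used here (England and Wales is "UK", everything else is "US") is an assumption inferred from the output, and bn.small is a hypothetical stand-in.

```r
library(dplyr)

## Small hypothetical stand-in for baby.names
bn.small <- data.frame(Location = c("England and Wales", "MA", "CA"),
                       Name     = c("sophie", "jill", "jack"))

## Each condition ~ value pair is tried in order; TRUE catches the rest
bn.small <- mutate(bn.small,
                   Country = case_when(Location == "England and Wales" ~ "UK",
                                       TRUE ~ "US"))
bn.small
```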

Exporting Data

Now that we have made some changes to our data set, we might want to save those changes to a file.
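
A minimal sketch of writing a data set with readr follows. The file names and the temporary directory are choices made for this example; the writer functions mirror the readers listed earlier.

```r
library(readr)

## Small hypothetical stand-in for baby.names
bn.small <- data.frame(Name  = c("sophie", "chloe"),
                       Count = c(7087, 6824))

## Write to a temporary directory so nothing is overwritten
csv.out <- file.path(tempdir(), "babyNames.csv")
rds.out <- file.path(tempdir(), "babyNames.rds")

write_csv(bn.small, csv.out)  # plain text, readable by other programs
write_rds(bn.small, rds.out)  # R's own format, keeps column types
```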

Saving and loading R workspaces

In addition to importing individual datasets, R can save and load entire workspaces.

## [1] "baby.names"              "births.by.year"         
## [3] "bn"                      "foo"                    
## [5] "ma.baby.names"           "name.length.by.location"
## [7] "popular.girl.names"      "x"
## character(0)

Load the “myWorkspace.RData” file and check that the workspace is restored.

## [1] "baby.names"              "births.by.year"         
## [3] "bn"                      "foo"                    
## [5] "ma.baby.names"           "name.length.by.location"
## [7] "popular.girl.names"      "x"
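
The save, clear, and restore cycle shown above can be sketched as follows. To keep the example small it saves two made-up objects with save; save.image(file = ...) does the same for every object in the workspace. The file location is an assumption.

```r
## Two objects to stand in for a workspace
x   <- 1:3
foo <- "bar"

workspace.file <- file.path(tempdir(), "myWorkspace.RData")

## Save the named objects (save.image() would save everything)
save(x, foo, file = workspace.file)

## Remove them, then confirm they are gone
rm(x, foo)
exists("foo", inherits = FALSE)

## Load the file to bring the objects back under their original names
load(workspace.file)
ls()
```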

Exercise 3: Data manipulation

Read in the “babyNames.csv” file if you have not already done so, assigning the result to baby.names. The file is located at “http://tutorials.iq.harvard.edu/R/Rintro/dataSets/babyNames.csv”.

  1. Create a column named “Proportion” equal to Percent divided by 100.
  2. Filter baby.names to include only names given to at least 3 percent of girls. Save this to an SPSS data set named “popularGirlNames.sav”.
  3. (Bonus) If you look at unique(baby.names$Sex) you’ll notice that some records indicate Male with "M", while other records use "Male". Correct this by replacing "M" with "Male".

Exercise 3 solution
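
No solution code is shown for this exercise. One possible approach is sketched below on a small made-up data set (bn.small is hypothetical, and the file is written to a temporary directory); with the real data, replace bn.small with baby.names. It assumes the haven package is installed.

```r
library(dplyr)

## Small hypothetical stand-in for baby.names
bn.small <- data.frame(Sex     = c("Female", "Female", "M", "Male"),
                       Name    = c("emily", "jill", "jack", "james"),
                       Percent = c(3.2, 0.5, 3.4, 2.1))

## 1. Proportion is Percent divided by 100
bn.small <- mutate(bn.small, Proportion = Percent / 100)

## 2. Names given to at least 3 percent of girls, saved as an SPSS file
popular.girl.names <- filter(bn.small, Sex == "Female" & Percent >= 3)
haven::write_sav(popular.girl.names,
                 file.path(tempdir(), "popularGirlNames.sav"))

## 3. (Bonus) recode "M" as "Male", leaving other values unchanged
bn.small <- mutate(bn.small,
                   Sex = case_when(Sex == "M" ~ "Male",
                                   TRUE ~ Sex))
unique(bn.small$Sex)
```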

Basic Statistics and Graphs

Basic statistics

Descriptive statistics of single variables are straightforward:

## [1] 5.978752
## [1] 6
## [1] 1.430346
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   5.000   6.000   5.979   7.000  18.000
##    Location              Year          Sex                Name          
##  Length:1966001     Min.   :1996   Length:1966001     Length:1966001    
##  Class :character   1st Qu.:2001   Class :character   Class :character  
##  Mode  :character   Median :2006   Mode  :character   Mode  :character  
##                     Mean   :2006                                        
##                     3rd Qu.:2011                                        
##                     Max.   :2015                                        
##      Count            Percent          Name.length       Thousands      
##  Min.   :    3.0   Min.   :0.000861   Min.   : 1.000   Min.   : 0.0030  
##  1st Qu.:    6.0   1st Qu.:0.010574   1st Qu.: 5.000   1st Qu.: 0.0060  
##  Median :   11.0   Median :0.029709   Median : 6.000   Median : 0.0110  
##  Mean   :   39.1   Mean   :0.105799   Mean   : 5.979   Mean   : 0.0391  
##  3rd Qu.:   25.0   3rd Qu.:0.090410   3rd Qu.: 7.000   3rd Qu.: 0.0250  
##  Max.   :10779.0   Max.   :4.519774   Max.   :18.000   Max.   :10.7790  
##    Country            Proportion       
##  Length:1966001     Min.   :8.610e-06  
##  Class :character   1st Qu.:1.057e-04  
##  Mode  :character   Median :2.971e-04  
##                     Mean   :1.058e-03  
##                     3rd Qu.:9.041e-04  
##                     Max.   :4.520e-02

Some of these functions (e.g., summary) will also work with data.frames and other types of objects; others (such as sd) will not.
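
The functions involved can be sketched on a short vector. The ten values are the first ten name lengths visible in the glimpse output earlier; the object names are arbitrary.

```r
name.length <- c(6, 5, 7, 5, 6, 6, 9, 7, 3, 5)

mean(name.length)     # arithmetic mean
median(name.length)   # middle value
sd(name.length)       # standard deviation
summary(name.length)  # minimum, quartiles, mean, and maximum

## summary also accepts a whole data.frame and reports on each column
summary(data.frame(Name.length = name.length))
```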

Statistics by grouping variable(s)

The summarize function, combined with group_by, can be used to calculate statistics by grouping variable. Here is how it works.

## # A tibble: 2 x 3
##   Country `mean(Name.length)` `sd(Name.length)`
##   <chr>                 <dbl>             <dbl>
## 1 UK                     6.15              1.82
## 2 US                     5.96              1.37

You can group by multiple variables:

## # A tibble: 4 x 4
## # Groups:   Country [?]
##   Country Sex    `mean(Name.length)` `sd(Name.length)`
##   <chr>   <chr>                <dbl>             <dbl>
## 1 UK      Female                6.35              1.88
## 2 UK      Male                  5.90              1.71
## 3 US      Female                6.08              1.37
## 4 US      Male                  5.81              1.35
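
Tables like the two above come from grouping first and summarizing second. The sketch below does this on a hypothetical stand-in, bn.small; the result names are also made up.

```r
library(dplyr)

## Small hypothetical stand-in for baby.names
bn.small <- data.frame(Country     = rep(c("UK", "US"), each = 4),
                       Sex         = rep(c("Female", "Female", "Male", "Male"), 2),
                       Name.length = c(6, 5, 4, 5, 4, 7, 4, 6))

## One row of statistics per country
by.country <- summarize(group_by(bn.small, Country),
                        mean(Name.length),
                        sd(Name.length))

## One row per combination of country and sex
by.country.sex <- summarize(group_by(bn.small, Country, Sex),
                            mean(Name.length),
                            sd(Name.length))
by.country.sex
```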

Save R output to a file

Earlier we learned how to write a data set to a file. But what if we want to write something that isn’t in a nice rectangular format, like the output of summary? For that we can use the sink() function:

## [1] "This is the summary of baby.names \n"
##    Location              Year          Sex                Name          
##  Length:1966001     Min.   :1996   Length:1966001     Length:1966001    
##  Class :character   1st Qu.:2001   Class :character   Class :character  
##  Mode  :character   Median :2006   Mode  :character   Mode  :character  
##                     Mean   :2006                                        
##                     3rd Qu.:2011                                        
##                     Max.   :2015                                        
##      Count            Percent          Name.length       Thousands      
##  Min.   :    3.0   Min.   :0.000861   Min.   : 1.000   Min.   : 0.0030  
##  1st Qu.:    6.0   1st Qu.:0.010574   1st Qu.: 5.000   1st Qu.: 0.0060  
##  Median :   11.0   Median :0.029709   Median : 6.000   Median : 0.0110  
##  Mean   :   39.1   Mean   :0.105799   Mean   : 5.979   Mean   : 0.0391  
##  3rd Qu.:   25.0   3rd Qu.:0.090410   3rd Qu.: 7.000   3rd Qu.: 0.0250  
##  Max.   :10779.0   Max.   :4.519774   Max.   :18.000   Max.   :10.7790  
##    Country            Proportion       
##  Length:1966001     Min.   :8.610e-06  
##  Class :character   1st Qu.:1.057e-04  
##  Mode  :character   Median :2.971e-04  
##                     Mean   :1.058e-03  
##                     3rd Qu.:9.041e-04  
##                     Max.   :4.520e-02
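
A sketch of the sink pattern follows. It summarizes the built-in cars data set and writes to a temporary file so that it runs anywhere; both are substitutions for this example, not the workshop's own file or data.

```r
out.file <- file.path(tempdir(), "output.txt")

## Start diverting output; split = TRUE also shows it in the console
sink(file = out.file, split = TRUE)
print("This is the summary of cars")
print(summary(cars))

## Stop diverting output; forgetting this keeps sending output to the file
sink()

## Check what was written
saved.lines <- readLines(out.file)
saved.lines
```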

Exercise 4

  1. Calculate the total number of children born.
  2. Filter the data to extract only Massachusetts (Location “MA”), and calculate the total number of children born in Massachusetts.
  3. Group and summarize the data to calculate the number of children born each year. Assign the result to the name births.by.year.
  4. Calculate the average number of characters in baby names (using the “Name.length” column).
  5. Group and summarize to calculate the average number of characters in baby names for each location. Assign the result to the name name.length.by.location.

Exercise 4 solution

## [1] 76865321
## [1] 1232841
## [1] 5.978752
## # A tibble: 52 x 2
##    Location          `mean(Name.length)`
##    <chr>                           <dbl>
##  1 AK                               5.90
##  2 AL                               6.04
##  3 AR                               5.95
##  4 AZ                               6.01
##  5 CA                               6.00
##  6 CO                               5.93
##  7 CT                               5.94
##  8 DC                               5.86
##  9 DE                               5.95
## 10 England and Wales                6.15
## 11 FL                               6.03
## 12 GA                               6.03
## 13 HI                               5.76
## 14 IA                               5.87
## 15 ID                               5.90
## 16 IL                               5.98
## 17 IN                               5.94
## 18 KS                               5.93
## 19 KY                               5.97
## 20 LA                               5.93
## 21 MA                               5.91
## 22 MD                               5.89
## 23 ME                               5.93
## 24 MI                               5.94
## 25 MN                               5.84
## 26 MO                               5.92
## 27 MS                               6.07
## 28 MT                               5.82
## 29 NC                               5.99
## 30 ND                               5.79
## 31 NE                               5.91
## 32 NH                               5.95
## 33 NJ                               5.91
## 34 NM                               6.03
## 35 NV                               5.97
## 36 NY                               5.92
## 37 OH                               5.95
## 38 OK                               5.93
## 39 OR                               5.91
## 40 PA                               5.90
## 41 RI                               5.95
## 42 SC                               6.02
## 43 SD                               5.82
## 44 TN                               6.00
## 45 TX                               6.05
## 46 UT                               5.87
## 47 VA                               5.94
## 48 VT                               5.87
## 49 WA                               5.90
## 50 WI                               5.92
## 51 WV                               5.97
## 52 WY                               5.89
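
The code behind the output above is not shown. One way to carry out the five steps is sketched below on a small made-up data set; bn.small and the column name Total are hypothetical, and with the real data you would use baby.names in their place.

```r
library(dplyr)

## Small hypothetical stand-in for baby.names
bn.small <- data.frame(Location    = c("MA", "MA", "CA", "CA"),
                       Year        = c(1996, 1997, 1996, 1997),
                       Count       = c(15, 9, 28, 30),
                       Name.length = c(4, 5, 4, 7))

## 1. Total number of children born
sum(bn.small$Count)

## 2. Total for Massachusetts only
ma.baby.names <- filter(bn.small, Location == "MA")
sum(ma.baby.names$Count)

## 3. Number of children born each year
births.by.year <- summarize(group_by(bn.small, Year), Total = sum(Count))

## 4. Average number of characters in names
mean(bn.small$Name.length)

## 5. Average number of characters for each location
name.length.by.location <- summarize(group_by(bn.small, Location),
                                     mean(Name.length))
name.length.by.location
```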

Wrap-up

Help us make this workshop better!

Additional resources