Welcome

Materials and setup

NOTE: skip this section if you are not running R locally (e.g., if you are running R in your browser using a remote Jupyter server)

You should have R installed –if not:

Materials for this workshop consist of notes and example code.

Start RStudio and open the workshop R script:

  • On Windows, click the Start button and search for RStudio. On Mac, RStudio will be in your Applications folder.
  • In RStudio, go to File -> Open File and open the Rintro.R file you downloaded earlier.

What is R?

R is a programming language designed for statistical computing. Notable characteristics include:

  • Vast capabilities, wide range of statistical and graphical techniques
  • Very popular in academia, growing popularity in business: http://r4stats.com/articles/popularity/
  • Written primarily by statisticians
  • FREE (no monetary cost and open source)
  • Excellent community support: mailing list, blogs, tutorials
  • Easy to extend by writing new functions

Whatever you’re trying to do, you’re probably not the first to try doing it in R. Chances are good that someone has already written a package for that.

Graphical User Interfaces (GUIs)

There are many different ways you can interact with R. See for details.

For this workshop I encourage you to use RStudio; it is a good R-specific IDE that mostly just works.

Launch RStudio (skip if not using RStudio)

Note: skip this section if you are not using RStudio (e.g., if you are running these examples in a Jupyter notebook).

  • Start the RStudio program
  • In RStudio, go to File => New File => R Script

The window in the upper-left is your R script. This is where you will write instructions for R to carry out.

The window in the lower-left is the R console. This is where results will be displayed.

Exercise 0

The purpose of this exercise is to give you an opportunity to explore the interface provided by RStudio (or whichever GUI you’ve decided to use). You may not know how to do these things; that’s fine! This is an opportunity to figure it out.

Also keep in mind that we are living in a golden age of tab completion. If you don’t know the name of an R function, try guessing the first two or three letters and pressing TAB. If you guessed correctly the function you are looking for should appear in a pop up!


  1. Try to get R to add 2 plus 2.
## write your answer here
  2. Try to calculate the square root of 10.
## write your answer here
  3. There is an R package named car. Try to install this package.
## write your answer here
  4. R includes extensive documentation, including a file named “An introduction to R”. Try to find this help file.
  5. Open a new web browser or tab, go to http://cran.r-project.org/web/views/ and skim the topic closest to your field/interests.

Exercise 0 solution

## 1. 2 plus 2
2 + 2
## [1] 4
## or
sum(2, 2)
## [1] 4
## 2. square root of 10:
sqrt(10)
## [1] 3.162278
## or
10^(1/2)
## [1] 3.162278
## 3. Install the "car" package:

## In RStudio, go to the "Packages" tab and click the "Install" button.
## Search in the pop-up window and click "Install".

## Alternatively, use the `install.packages` function like this:
install.packages("car")
## Installing package into '/home/izahn/R/x86_64-pc-linux-gnu-library/3.4'
## (as 'lib' is unspecified)
## --- Please select a CRAN mirror for use in this session ---
## 4. Find "An Introduction to R".

## Go to the main help page by running help.start() or using the GUI
## menu, find and click on the link to "An Introduction to R".

## 5. Go to <http://cran.r-project.org/web/views/> and skim the topic
##     closest to your field/interests.

## I like the machine learning topic.

R basics

Function calls

The general form for calling R functions is

## FunctionName(arg.1 = value.1, arg.2 = value.2, ..., arg.n = value.n)

Arguments can be matched by name; unnamed arguments will be matched by position.
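As a small illustration of argument matching, the sketch below calls the base R function seq (whose first three arguments are from, to, and by) in three equivalent ways. The variable names are made up for this example.

```r
## seq() generates a sequence; its first three arguments are from, to, and by
by.position <- seq(1, 10, 3)                   # all matched by position
by.name     <- seq(from = 1, to = 10, by = 3)  # all matched by name
reordered   <- seq(by = 3, to = 10, from = 1)  # named arguments can go in any order

by.position
## [1]  1  4  7 10
identical(by.position, by.name) && identical(by.name, reordered)
## [1] TRUE
```

Naming arguments is more typing, but it makes calls easier to read and protects against getting the order wrong.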

R packages

R packages can be installed from the Comprehensive R Archive Network (CRAN) using the install.packages function. Installing a package puts a copy of the package on your local computer, but does not make it available for use. To use an installed package you must attach it using the library function.

For this workshop we will use a suite of packages called “the tidyverse”. The tidyverse provides improved replacements for much of the basic data manipulation functionality in R. We can install and attach this package as follows:

install.packages("tidyverse")
library("tidyverse")
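The difference between an installed package and an attached package can be seen without downloading anything. The sketch below uses the tools package, chosen here only because it ships with R but is not attached by default.

```r
## "tools" ships with R, so it is installed but not attached by default
requireNamespace("tools", quietly = TRUE)  # TRUE if the package is installed

## functions in an installed package can be called with the :: prefix ...
tools::file_ext("babyNames.csv")
## [1] "csv"

## ... or by their bare name once the package is attached with library()
library("tools")
file_ext("babyNames.csv")
## [1] "csv"
"package:tools" %in% search()  # confirm that tools is now attached
## [1] TRUE
```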

Asking R for help

You can ask R for help using the help function, or the ? shortcut.

help(help)

The help function can be used to look up the documentation for a function, or for a whole package. We can learn how to use the stats package by reading its documentation like this:

help(package = "stats")
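A few related ways of finding help are sketched below. The help calls are commented out because they open the help viewer and are best run interactively; the pattern passed to apropos is just an example.

```r
## two equivalent ways to open the help page for sqrt (run interactively)
# help(sqrt)
# ?sqrt

## if you only remember part of a name, apropos() lists matching objects
## on the search path; the pattern is a regular expression
matches <- apropos("^read\\.csv")
matches

## args() shows the arguments of a function without opening its help page
args(seq_len)
```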

Assignment

Values can be assigned names and used in subsequent operations

  • The <- operator (less than followed by a dash) is used to save values
  • The name on the left gets the value on the right.
sqrt(10) ## calculate square root of 10; result is not stored anywhere
## [1] 3.162278
x <- sqrt(10) # assign result to a variable named x
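To show what "used in subsequent operations" means in practice, here is a minimal sketch; the name y is made up for this example.

```r
x <- sqrt(10)     # store the square root of 10 under the name x
y <- x^2          # x can now be used in later calculations
all.equal(y, 10)  # y equals 10 up to floating-point rounding
## [1] TRUE
x <- 5            # assigning again replaces the old value of x ...
y                 # ... but y keeps the value that was computed earlier
## [1] 10
```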

Example project: baby names!

General goals

I would like to know what the most popular baby names are. In the course of answering this question we will learn to call R functions, install and load packages, assign values to names, read and write data, and more.

Data sets

The examples in this workshop use the baby names data provided by the governments of the United States and the United Kingdom. A cleaned and merged version of these data is in dataSets/babyNames.csv.

Getting data into R

The “working directory” and listing files

R knows the directory it was started in, and refers to this as the “working directory”. Since our workshop examples are in the Rintro folder, we should all take a moment to set that as our working directory.

getwd() # what is my current working directory?
# setwd("~/Desktop/Rintro") # change directory

Note that “~” means “my home directory” but that this can mean different things on different operating systems. You can also use the Files tab in RStudio to navigate to a directory, then click “More -> Set as working directory”.
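The sketch below shows how relative file paths depend on the working directory. It uses a temporary folder and a made-up file name (notes.txt) so that it can be run anywhere without touching your workshop files.

```r
old.dir <- getwd()    # remember where we started
scratch <- tempdir()  # a temporary folder that exists in every R session
setwd(scratch)        # change the working directory

## relative paths are now resolved against the scratch folder
file.create("notes.txt")
file.exists("notes.txt")
## [1] TRUE
file.exists(file.path(scratch, "notes.txt"))  # same file, full path
## [1] TRUE

setwd(old.dir)        # go back to the original directory
```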


It can be convenient to list files in a directory without leaving R

list.files("dataSets") # list files in the dataSets folder
## [1] "babyNames.csv"     "example.csv"       "example.dta"      
## [4] "example.rds"       "myWorkspace.RData"

Readers for common file types

In order to read data from a file, you have to know what kind of file it is. The table below lists the functions that can import data from common file formats.

data type                function      package
comma separated (.csv)   read_csv()    readr
other delimited formats  read_delim()  readr
R (.Rds)                 read_rds()    readr
Stata (.dta)             read_dta()    haven
SPSS (.sav)              read_spss()   haven
SAS (.sas7bdat)          read_sas()    haven
Excel (.xls, .xlsx)      read_excel()  readxl
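One way to keep the table above at hand is to store it as a lookup from file extension to reader. This is only an illustrative sketch: suggest_reader is a hypothetical helper written for this example, not part of readr, haven, or readxl.

```r
## lookup built from the table above: file extension -> reader function
readers <- c(csv = "readr::read_csv", rds = "readr::read_rds",
             dta = "haven::read_dta", sav = "haven::read_spss",
             sas7bdat = "haven::read_sas",
             xls = "readxl::read_excel", xlsx = "readxl::read_excel")

## suggest_reader is a hypothetical helper, not part of any package
suggest_reader <- function(path) {
  ext <- tolower(tools::file_ext(path))  # text after the last dot
  unname(readers[ext])                   # NA if the extension is not listed
}

suggest_reader("dataSets/babyNames.csv")
## [1] "readr::read_csv"
suggest_reader("dataSets/example.dta")
## [1] "haven::read_dta"
```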

Exercise 2

The purpose of this exercise is to practice reading data into R. The data in “dataSets/babyNames.csv” is moderately tricky to read, making it a good data set to practice on.

  1. Install and attach the tidyverse package if you have not yet done so.

  2. Open the help page for the read_csv function. How can you limit the number of rows to be read in?

## write your answer here
  3. Read just the first 10 rows of “dataSets/babyNames.csv”.
## write your answer here
  4. Once you have successfully read in the first 10 rows, read the whole file, assigning the result to the name baby.names.
## write your answer here

Exercise 2 solution

## read ?read_csv

## limit rows with n_max argument
read_csv("http://tutorials.iq.harvard.edu/R/Rintro/dataSets/babyNames.csv", n_max = 10)
## Parsed with column specification:
## cols(
##   Location = col_character(),
##   Year = col_integer(),
##   Sex = col_logical(),
##   Name = col_character(),
##   Count = col_double(),
##   Percent = col_double(),
##   Name.length = col_integer()
## )
## read all the data; "??c????" forces the third column (Sex) to character
baby.names <- read_csv("http://tutorials.iq.harvard.edu/R/Rintro/dataSets/babyNames.csv", col_types = "??c????")
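In the first 10 rows, Sex was guessed to be a logical column because “F” can be read as FALSE, which is why the full read overrides the guess for that column. The compact string "??c????" means “guess, guess, character, guess, ...”. The sketch below shows the more verbose equivalent, naming the column explicitly, on a tiny temporary file made up for this example (the two rows are taken from the glimpse output in the next section).

```r
library(readr)

## write a tiny example file to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("Name,Sex,Count",
             "sophie,F,7087",
             "chloe,F,6824"), tmp)

## cols() names the columns to control; the other columns are still guessed
small <- read_csv(tmp, col_types = cols(Sex = col_character()))
class(small$Sex)
## [1] "character"
```

Naming the column is longer than the compact string, but it keeps working if the column order in the file changes.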

Checking imported data

It is always a good idea to examine the imported data set; usually we want the result to be a data.frame or a tibble.

class(baby.names) # check to see that it is a data.frame
## [1] "tbl_df"     "tbl"        "data.frame"

We can get more information about R objects using the glimpse function.

glimpse(baby.names) # details
## Observations: 1,966,001
## Variables: 7
## $ Location    <chr> "England and Wales", "England and Wales", "England...
## $ Year        <int> 1996, 1996, 1996, 1996, 1996, 1996, 1996, 1996, 19...
## $ Sex         <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", ...
## $ Name        <chr> "sophie", "chloe", "jessica", "emily", "lauren", "...
## $ Count       <dbl> 7087, 6824, 6711, 6415, 6299, 5916, 5866, 5828, 52...
## $ Percent     <dbl> 2.3942729, 2.3054210, 2.2672450, 2.1672444, 2.1280...
## $ Name.length <int> 6, 5, 7, 5, 6, 6, 9, 7, 3, 5, 7, 5, 7, 4, 4, 7, 5,...

Data Manipulation

data.frame objects

Usually data read into R will be stored as a data.frame

  • A data.frame is a list of vectors of equal length
    • Each vector in the list forms a column
    • Each column can be a different type of vector
    • Typically columns are variables and the rows are observations

A data.frame has two dimensions, corresponding to the number of rows and the number of columns (in that order).
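These properties can be checked directly. The small data.frame below is made up for this example, using three rows from the baby names output above.

```r
df <- data.frame(name  = c("sophie", "chloe", "jessica"),
                 count = c(7087, 6824, 6711))

is.list(df)         # a data.frame is a list ...
## [1] TRUE
sapply(df, length)  # ... whose elements (the columns) all have length 3
sapply(df, class)   # each column can be a different type of vector
dim(df)             # number of rows first, then number of columns
## [1] 3 2
```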

Slice and filter data.frame rows

You can extract subsets of data.frames using slice to select rows by number and filter to select rows that match some condition. It works like this:

## make up some example data
(example.df <- data.frame(id  = rep(letters[1:4], each = 4),
                          t   = rep(1:4, times = 4),
                          var1 = runif(16),
                          var2 = sample(letters[1:3], 16, replace = TRUE)))
##    id t        var1 var2
## 1   a 1 0.954657345    a
## 2   a 2 0.241159034    b
## 3   a 3 0.657320597    b
## 4   a 4 0.934645616    a
## 5   b 1 0.452993804    c
## 6   b 2 0.214453464    b
## 7   b 3 0.256636372    a
## 8   b 4 0.196116040    a
## 9   c 1 0.988018868    a
## 10  c 2 0.007759574    c
## 11  c 3 0.465041328    b
## 12  c 4 0.095405366    c
## 13  d 1 0.270309668    c
## 14  d 2 0.263918635    a
## 15  d 3 0.449754772    a
## 16  d 4 0.072133932    c
## rows 2 and 4
slice(example.df, c(2, 4))
## # A tibble: 2 x 4
##       id     t      var1   var2
##   <fctr> <int>     <dbl> <fctr>
## 1      a     2 0.2411590      b
## 2      a     4 0.9346456      a
## rows where id == "a"
filter(example.df, id == "a")
##   id t      var1 var2
## 1  a 1 0.9546573    a
## 2  a 2 0.2411590    b
## 3  a 3 0.6573206    b
## 4  a 4 0.9346456    a
## rows where id is either "a" or "b"
filter(example.df, id %in% c("a", "b"))
##   id t      var1 var2
## 1  a 1 0.9546573    a
## 2  a 2 0.2411590    b
## 3  a 3 0.6573206    b
## 4  a 4 0.9346456    a
## 5  b 1 0.4529938    c
## 6  b 2 0.2144535    b
## 7  b 3 0.2566364    a
## 8  b 4 0.1961160    a

Select data.frame columns

slice and filter are used to extract rows; select is used to extract columns.

select(example.df, id, var1)
##    id        var1
## 1   a 0.954657345
## 2   a 0.241159034
## 3   a 0.657320597
## 4   a 0.934645616
## 5   b 0.452993804
## 6   b 0.214453464
## 7   b 0.256636372
## 8   b 0.196116040
## 9   c 0.988018868
## 10  c 0.007759574
## 11  c 0.465041328
## 12  c 0.095405366
## 13  d 0.270309668
## 14  d 0.263918635
## 15  d 0.449754772
## 16  d 0.072133932
select(example.df, id, t, var1)
##    id t        var1
## 1   a 1 0.954657345
## 2   a 2 0.241159034
## 3   a 3 0.657320597
## 4   a 4 0.934645616
## 5   b 1 0.452993804
## 6   b 2 0.214453464
## 7   b 3 0.256636372
## 8   b 4 0.196116040
## 9   c 1 0.988018868
## 10  c 2 0.007759574
## 11  c 3 0.465041328
## 12  c 4 0.095405366
## 13  d 1 0.270309668
## 14  d 2 0.263918635
## 15  d 3 0.449754772
## 16  d 4 0.072133932

You can also conveniently select a single column using $, like this:

example.df$t
##  [1] 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

Data manipulation commands can be combined:

filter(select(example.df,
              id,
              var1),
       id == "a")
##   id      var1
## 1  a 0.9546573
## 2  a 0.2411590
## 3  a 0.6573206
## 4  a 0.9346456

In the previous example we used == to filter rows where id was “a”. Other relational and logical operators are listed below.

Operator  Meaning
==        equal to
!=        not equal to
>         greater than
>=        greater than or equal to
<         less than
<=        less than or equal to
%in%      contained in
&         and
|         or
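The operators above can be combined inside filter. The sketch below uses a small fixed data set (made up for this example, so the results are reproducible) rather than the randomly generated example.df.

```r
library(dplyr)

## a small fixed data set so that the results are reproducible
scores <- data.frame(id    = c("a", "a", "b", "b", "c", "c"),
                     t     = c(1, 2, 1, 2, 1, 2),
                     score = c(0.9, 0.2, 0.4, 0.7, 0.1, 0.8))

## rows where id is not "a" AND score is at least 0.5
filter(scores, id != "a" & score >= 0.5)
##   id t score
## 1  b 2   0.7
## 2  c 2   0.8

## rows where t equals 1 OR score is greater than 0.75
filter(scores, t == 1 | score > 0.75)
##   id t score
## 1  a 1   0.9
## 2  b 1   0.4
## 3  c 1   0.1
## 4  c 2   0.8
```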

Adding, removing, and modifying data.frame columns

You can modify data.frames using the mutate() function. It works like this:

example.df
##    id t        var1 var2
## 1   a 1 0.954657345    a
## 2   a 2 0.241159034    b
## 3   a 3 0.657320597    b
## 4   a 4 0.934645616    a
## 5   b 1 0.452993804    c
## 6   b 2 0.214453464    b
## 7   b 3 0.256636372    a
## 8   b 4 0.196116040    a
## 9   c 1 0.988018868    a
## 10  c 2 0.007759574    c
## 11  c 3 0.465041328    b
## 12  c 4 0.095405366    c
## 13  d 1 0.270309668    c
## 14  d 2 0.263918635    a
## 15  d 3 0.449754772    a
## 16  d 4 0.072133932    c
## modify example.df and assign the modified data.frame the name example.df
example.df <- mutate(example.df,
       var2 = var1/t, # replace the values in var2
       var3 = seq_along(t), # create a new column named var3
       var4 = factor(letters[t]),
       t = NULL # delete the column named t
       )
## examine our changes
example.df
##    id        var1        var2 var3 var4
## 1   a 0.954657345 0.954657345    1    a
## 2   a 0.241159034 0.120579517    2    b
## 3   a 0.657320597 0.219106866    3    c
## 4   a 0.934645616 0.233661404    4    d
## 5   b 0.452993804 0.452993804    5    a
## 6   b 0.214453464 0.107226732    6    b
## 7   b 0.256636372 0.085545457    7    c
## 8   b 0.196116040 0.049029010    8    d
## 9   c 0.988018868 0.988018868    9    a
## 10  c 0.007759574 0.003879787   10    b
## 11  c 0.465041328 0.155013776   11    c
## 12  c 0.095405366 0.023851341   12    d
## 13  d 0.270309668 0.270309668   13    a
## 14  d 0.263918635 0.131959318   14    b
## 15  d 0.449754772 0.149918257   15    c
## 16  d 0.072133932 0.018033483   16    d

Exporting Data

Now that we have made some changes to our data set, we might want to save those changes to a file.

# write data to a .csv file
write_csv(example.df, "example.csv")

# write data to an R file
write_rds(example.df, "example.rds")

# write data to a Stata file
library(haven)
write_dta(example.df, path = "example.dta")

Saving and loading R workspaces

In addition to importing individual data sets, R can save and load entire workspaces.

ls() # list objects in our workspace
## [1] "baby.names" "example.df" "x"
save.image(file="myWorkspace.RData") # save workspace 
rm(list=ls()) # remove all objects from our workspace 
ls() # list stored objects to make sure they are deleted
## character(0)

Load the “myWorkspace.RData” file and check that it is restored

load("myWorkspace.RData") # load myWorkspace.RData
ls() # list objects
## [1] "baby.names" "example.df" "x"
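save.image stores everything in the workspace. If you only want to keep a few objects, the related base R function save takes the object names explicitly. The sketch below writes to a temporary file (an assumption made so the example does not overwrite myWorkspace.RData).

```r
x <- sqrt(10)
scratch.file <- tempfile(fileext = ".RData")

save(x, file = scratch.file)   # save only x, not the whole workspace
rm(x)                          # remove x ...
exists("x", inherits = FALSE)  # ... and confirm that it is gone
## [1] FALSE
load(scratch.file)             # load() restores x under its original name
x
## [1] 3.162278
```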

Exercise 3: Data manipulation

Read in the “babyNames.csv” file if you have not already done so, assigning the result to baby.names. The file is located at “http://tutorials.iq.harvard.edu/R/Rintro/dataSets/babyNames.csv”.

  1. Filter baby.names to show only names given to at least 5 percent of boys.
## write your answer here
  2. Create a column named “Proportion” equal to Percent divided by 100.
## write your answer here
  3. Filter baby.names to include only names given to at least 3 percent of girls. Save this to a Stata data set named “popularGirlNames.dta”.
## write your answer here

Exercise 3 solution

filter(baby.names, Sex == "M" & Percent >= 5)
## # A tibble: 0 x 7
## # ... with 7 variables: Location <chr>, Year <int>, Sex <chr>, Name <chr>,
## #   Count <dbl>, Percent <dbl>, Name.length <int>
baby.names <- mutate(baby.names, Proportion = Percent/100)

popular.girl.names <- filter(baby.names, Sex == "F" & Percent >= 3)

library(haven) # provides write_dta
write_dta(popular.girl.names, path = "popularGirlNames.dta")

Basic Statistics and Graphs

Basic statistics

Descriptive statistics of single variables are straightforward:

sum(example.df$var1) # calculate sum of var 1
## [1] 6.520324
mean(example.df$var1)
## [1] 0.4075203
median(example.df$var1)
## [1] 0.2671142
sd(example.df$var1) # calculate standard deviation of var1
## [1] 0.318886
var(example.df$var1)
## [1] 0.1016883
## summaries of individual columns
summary(example.df$var1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00776 0.20987 0.26711 0.40752 0.51311 0.98802
summary(example.df$var2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00388 0.07642 0.14094 0.24774 0.24282 0.98802
## summary of whole data.frame
summary(example.df)
##  id         var1              var2              var3       var4 
##  a:4   Min.   :0.00776   Min.   :0.00388   Min.   : 1.00   a:4  
##  b:4   1st Qu.:0.20987   1st Qu.:0.07642   1st Qu.: 4.75   b:4  
##  c:4   Median :0.26711   Median :0.14094   Median : 8.50   c:4  
##  d:4   Mean   :0.40752   Mean   :0.24774   Mean   : 8.50   d:4  
##        3rd Qu.:0.51311   3rd Qu.:0.24282   3rd Qu.:12.25        
##        Max.   :0.98802   Max.   :0.98802   Max.   :16.00

Some of these functions (e.g., summary) will also work with data.frames and other types of objects; others (such as sd) will not.
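When a function only accepts vectors, one common workaround is to apply it to each column in turn with sapply. A minimal sketch, using a small data.frame made up for this example:

```r
df <- data.frame(var1 = c(0.2, 0.4, 0.9),
                 var3 = c(1, 2, 3))

summary(df)     # works on the whole data.frame

## sd(df) would stop with an error, because sd() expects a numeric vector;
## sapply() calls sd once per column and collects the results
sapply(df, sd)
```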

Statistics by grouping variable(s)

The summarize function can be used to calculate statistics by grouping variable. Here is how it works.

summarize(group_by(example.df, id), mean(var1), sd(var1))
## # A tibble: 4 x 3
##       id `mean(var1)` `sd(var1)`
##   <fctr>        <dbl>      <dbl>
## 1      a    0.6969456  0.3327803
## 2      b    0.2800499  0.1180474
## 3      c    0.3890563  0.4457757
## 4      d    0.2640293  0.1542263

You can group by multiple variables:

summarize(group_by(example.df, id, var3), mean(var1), sd(var1))
## # A tibble: 16 x 4
## # Groups:   id [?]
##        id  var3 `mean(var1)` `sd(var1)`
##    <fctr> <int>        <dbl>      <dbl>
##  1      a     1  0.954657345         NA
##  2      a     2  0.241159034         NA
##  3      a     3  0.657320597         NA
##  4      a     4  0.934645616         NA
##  5      b     5  0.452993804         NA
##  6      b     6  0.214453464         NA
##  7      b     7  0.256636372         NA
##  8      b     8  0.196116040         NA
##  9      c     9  0.988018868         NA
## 10      c    10  0.007759574         NA
## 11      c    11  0.465041328         NA
## 12      c    12  0.095405366         NA
## 13      d    13  0.270309668         NA
## 14      d    14  0.263918635         NA
## 15      d    15  0.449754772         NA
## 16      d    16  0.072133932         NA
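The NA values in the sd(var1) column above arise because every id/var3 combination contains a single row, and the standard deviation of a single value is undefined. The sketch below shows this on a small data set made up for this example, and also shows how to give the summary columns readable names.

```r
library(dplyr)

grades <- data.frame(id    = c("a", "a", "b"),
                     score = c(2, 4, 10))

## name = expression gives each summary column a readable name;
## n() counts the rows in each group
by.id <- summarize(group_by(grades, id),
                   n          = n(),
                   mean.score = mean(score),
                   sd.score   = sd(score))
by.id$n
## [1] 2 1
by.id$mean.score
## [1]  3 10
is.na(by.id$sd.score)  # group "b" has only one row, so its sd is NA
## [1] FALSE  TRUE
```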

Save R output to a file

Earlier we learned how to write a data set to a file. But what if we want to write something that isn’t in a nice rectangular format, like the output of summary? For that we can use the sink() function:

sink(file="output.txt", split=TRUE) # start logging
print("This is the summary of example.df \n")
## [1] "This is the summary of example.df \n"
print(summary(example.df))
##  id         var1              var2              var3       var4 
##  a:4   Min.   :0.00776   Min.   :0.00388   Min.   : 1.00   a:4  
##  b:4   1st Qu.:0.20987   1st Qu.:0.07642   1st Qu.: 4.75   b:4  
##  c:4   Median :0.26711   Median :0.14094   Median : 8.50   c:4  
##  d:4   Mean   :0.40752   Mean   :0.24774   Mean   : 8.50   d:4  
##        3rd Qu.:0.51311   3rd Qu.:0.24282   3rd Qu.:12.25        
##        Max.   :0.98802   Max.   :0.98802   Max.   :16.00
sink() ## sink with no arguments turns logging off
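Notice that print displayed the \n literally in the output above. If you want an actual line break in the log, cat is the usual choice. The sketch below logs to a temporary file (an assumption, so that output.txt is left alone) and reads the log back in to check what was written.

```r
log.file <- tempfile(fileext = ".txt")

sink(file = log.file)               # start logging to the file only
cat("Summary of a small vector\n")  # cat() turns \n into a line break
print(summary(c(1, 2, 3, 4)))
sink()                              # stop logging

logged <- readLines(log.file)       # read the log back in
logged[1]
## [1] "Summary of a small vector"
```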

Exercise 4

  1. Calculate the total number of children born.
## write your answer here
  2. Filter the data to extract only Massachusetts (Location “MA”), and calculate the total number of children born in Massachusetts.
## write your answer here
  3. Group and summarize the data to calculate the number of children born each year. Assign the result to the name births.by.year.
## write your answer here
  4. Calculate the average number of characters in baby names (using the “Name.length” column).
## write your answer here
  5. Group and summarize to calculate the average number of characters in baby names for each location. Assign the result to the name name.length.by.location.

Exercise 4 solution

sum(baby.names$Count)
## [1] 76865321
sum(filter(baby.names, Location == "MA")$Count)
## [1] 1232841
births.by.year <- summarize(group_by(baby.names, Year), sum(Count))

mean(baby.names$Name.length)
## [1] 5.978752
name.length.by.location <- summarize(group_by(baby.names, Location), mean(Name.length))

Basic graphics: Frequency bars

Thanks to classes and methods, you can plot() many kinds of objects:

plot(example.df$var4)

Basic graphics: Boxplots by group

Thanks to classes and methods, you can plot() many kinds of objects:

plot(select(example.df, id, var1))
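What plot draws depends on the class of its input. The sketch below spells this out on a small data set made up for this example. The id column is wrapped in factor() explicitly because, from R 4.0.0 onward, data.frame no longer converts strings to factors by default. The pdf(file = NULL) line sends the plots to a throwaway device so the code can run anywhere; leave it out (along with dev.off) to see the plots.

```r
df <- data.frame(id   = factor(c("a", "a", "b", "b", "b")),
                 var1 = c(0.2, 0.4, 0.1, 0.6, 0.9))

table(df$id)                # the frequencies that the bar chart will show

pdf(file = NULL)            # throwaway graphics device; omit to see the plots
plot(df$id)                 # a factor gives a bar chart of frequencies
plot(df$var1)               # a numeric vector is plotted against its index
plot(var1 ~ id, data = df)  # numeric by factor gives boxplots by group
invisible(dev.off())        # close the device
```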