Welcome

Materials and setup

NOTE: skip this section if you are not running R locally (e.g., if you are running R in your browser using a remote Jupyter server)

You should have R installed –if not:

Notes and examples for this workshop are available at

Start RStudio and create a new project:

  • On Windows, click the Start button and search for RStudio. On Mac, RStudio will be in your Applications folder.
  • In RStudio, go to File -> New Project.
  • Choose New Directory and New Project.
  • Choose a name and location for your new project directory.

Workshop goals and approach

In this workshop you will

  • learn R basics,
  • learn about the R package ecosystem,
  • practice reading files and manipulating data in R.

A more general goal is to get you comfortable with R so that it seems less scary and mystifying than it perhaps does now. Note that this is by no means a complete or thorough introduction to R! It’s just enough to get you started.

This workshop is relatively informal, example-oriented, and hands-on. We won’t spend much time examining language features in detail. Instead we will work through an example and learn some things about R along the way.

As an example project we will analyze the popularity of baby names in the US from 1960 through 2017. Among the questions we will use R to answer are:

  • In which year did your name achieve peak popularity?
  • How many children were born each year?
  • What are the most popular names overall? For girls? For boys?

Graphical User Interfaces (GUIs)

There are many different ways you can interact with R. See the Data Science Tools workshop notes for details.

For this workshop I encourage you to use RStudio; it is a good R-specific IDE that mostly just works.

Launch RStudio (skip if not using RStudio)

Note: skip this section if you are not using RStudio (e.g., if you are running these examples in a Jupyter notebook).

  • Start the RStudio program
  • In RStudio, go to File -> New File -> R Script

The window in the upper-left is your R script. This is where you will write instructions for R to carry out.

The window in the lower-left is the R console. This is where results will be displayed.

Exercise 0

The purpose of this exercise is to give you an opportunity to explore the interface provided by RStudio (or whichever GUI you’ve decided to use). You may not know how to do these things; that’s fine! This is an opportunity to figure it out.

Also keep in mind that we are living in a golden age of tab completion. If you don’t know the name of an R function, try guessing the first two or three letters and pressing TAB. If you guessed correctly, the function you are looking for should appear in a pop-up!


  1. Try to get R to add 2 plus 2.
  2. Try to calculate the square root of 10.
  3. R includes extensive documentation, including a manual named “An introduction to R”. Use the RStudio help pane to locate this manual.

R basics

Function calls

The general form for calling R functions is

Arguments can be matched by name; unnamed arguments will be matched by position.
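As a small illustration of both matching rules, here is a sketch using the base R function seq(), whose first arguments are named from, to, and by. The variable names are only for illustration.

```r
# Unnamed arguments are matched by position: from = 1, to = 10
by.position <- seq(1, 10)

# Named arguments can be given in any order
by.name <- seq(to = 10, from = 1)

# The two styles can be mixed: positions for from and to, a name for by
mixed <- seq(1, 10, by = 3)
mixed
```

Naming arguments is usually worth the extra typing once a call has more than two or three of them.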

Assignment

Values can be assigned names and used in subsequent operations

  • The <- operator (less than followed by a dash) is used to save values
  • The name on the left gets the value on the right.
## [1] 3.162278

Names should start with a letter, and contain only letters, numbers, underscores, and periods.
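A minimal sketch of assignment, using the square root of 10 from Exercise 0. The names x and x.squared are arbitrary choices that follow the naming rule above.

```r
# Assign the square root of 10 to the name x
x <- sqrt(10)

# Typing a name on its own prints the value stored under it
x

# Stored values can be used in later operations
x.squared <- x^2
x.squared
```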

Asking R for help

You can ask R for help using the help function, or the ? shortcut.

The help function can be used to look up the documentation for a function, or to look up the documentation for a package. We can learn how to use the stats package by reading its documentation like this:
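The calls below sketch the three ways of asking for help just described. The choice of sqrt as the function to look up is only an example.

```r
# Look up the documentation for a function
help(sqrt)

# The ? shortcut does the same thing
?sqrt

# Look up the documentation for a whole package
help(package = "stats")
```

In RStudio the results open in the help pane; in a plain R console they are shown as text.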

Getting data into R

R has data reading functionality built-in – see e.g., help(read.table). However, faster and more robust tools are available, and so to make things easier on ourselves we will use a contributed package called readr instead. This requires that we learn a little bit about packages in R.

Installing and using R packages

A large number of contributed packages are available. If you are looking for a package for a specific task, https://cran.r-project.org/web/views/ and https://r-pkg.org are good places to start.

You can install a package in R using the install.packages() function. Once a package is installed you may use the library function to attach it so that it can be used.
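For the readr package used in this workshop, the two steps might look like the sketch below. The requireNamespace() check and the repos mirror URL are additions made here so that the code can be re-run safely; a plain install.packages("readr") also works in an interactive session.

```r
# Install readr from CRAN only if it is not already installed
# (this step needs an internet connection; the mirror is an assumption)
if (!requireNamespace("readr", quietly = TRUE)) {
  install.packages("readr", repos = "https://cloud.r-project.org")
}

# Attach the package so its functions can be called by name
library(readr)
```

Installation is needed once per computer, while library() has to be called in every new R session.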

Readers for common file types

In order to read data from a file, you have to know what kind of file it is. The table below lists functions that can import data from common plain-text formats.

Data Type                 Function
comma separated           read_csv()
tab separated             read_tsv()
other delimited formats   read_delim()
whitespace separated      read_table()
fixed width               read_fwf()

Note: you may be confused by the existence of similar functions, e.g., read.csv and read.delim. These are legacy functions that tend to be slower and less robust than the readr functions. One way to tell them apart is that the faster, more robust versions use underscores in their names (e.g., read_csv) while the older functions use dots (e.g., read.csv). My advice is to use the more robust newer versions, i.e., the ones with underscores.
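To see read_csv() in action without needing a network connection, the sketch below writes a tiny CSV file to a temporary location and reads it back. The file is a stand-in with the same four-column layout as the baby names data, and the names toy.file and toy.names are hypothetical.

```r
library(readr)

# Write a tiny CSV file to a temporary location
toy.file <- tempfile(fileext = ".csv")
writeLines(c("Name,Sex,Count,Year",
             "Alex,Boys,7348,1992",
             "Mark,Boys,8743,1992"),
           toy.file)

# Read it back; read_csv() guesses a type for each column
toy.names <- read_csv(toy.file)
toy.names
```

The same call works with a URL in place of the file path, which is what Exercise 1 asks for.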

Baby names data

The examples in this workshop use US baby names data retrieved from https://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-level-data. A cleaned and merged version of these data is available at http://tutorials.iq.harvard.edu/data/babyNames.csv.

Exercise 1: Reading the baby names data

Make sure you have installed the readr package and attached it with library(readr).

Baby names data are available at "http://tutorials.iq.harvard.edu/data/babyNames.csv".

  1. Open the read_csv help page to determine how to use it to read in data.

  2. Read the baby names data using the read_csv function and assign the result with the name baby.names.

  3. BONUS (optional): Save the baby.names data as a Stata data set babynames.dta and as an R data set babynames.rds.

Popularity of your name

In this section we will pull out specific names and examine changes in their popularity over time.

The baby.names object we created in the last exercise is a data.frame. There are many other data structures in R, but for now we’ll focus on working with data.frames.

R has decent data manipulation tools built-in – see e.g., help(Extract). However, these tools are powerful and complex and often overwhelm beginners. To make things easier on ourselves we will use a contributed package called dplyr instead.

Filtering and arranging data

One way to find the year in which your name was the most popular is to filter the data, keeping just the rows corresponding to your name, and then arrange (sort) by Count.

To demonstrate these techniques we’ll try to determine whether “Alex” or “Mark” was more popular in 1992. We start by filtering the data so that we keep only rows where Year is equal to 1992 and Name is either “Alex” or “Mark”.

## # A tibble: 4 x 4
##   Name  Sex   Count  Year
##   <chr> <chr> <dbl> <dbl>
## 1 Alex  Girls   366  1992
## 2 Mark  Girls    20  1992
## 3 Mark  Boys   8743  1992
## 4 Alex  Boys   7348  1992

Notice that we can combine conditions using & (AND) and | (OR).
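A filter call along these lines would produce the four rows shown above. This is a sketch rather than the workshop's original code: toy.names is a small stand-in for baby.names, built from the four 1992 rows in the output plus two made-up rows that the filter should drop.

```r
library(dplyr)

# Toy stand-in for baby.names: the four rows shown above plus two
# made-up rows (Alex in 1993, Jane in 1992) that should be dropped
toy.names <- data.frame(
  Name  = c("Alex", "Mark", "Mark", "Alex", "Alex", "Jane"),
  Sex   = c("Girls", "Girls", "Boys", "Boys", "Boys", "Girls"),
  Count = c(366, 20, 8743, 7348, 100, 200),
  Year  = c(1992, 1992, 1992, 1992, 1993, 1992)
)

# Keep rows where Year is 1992 AND Name is either "Alex" OR "Mark"
alex.mark <- filter(toy.names,
                    Year == 1992 & (Name == "Alex" | Name == "Mark"))
alex.mark
```

The parentheses around the | condition matter: without them, & binds more tightly than | and rows for “Mark” from any year would be kept.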

In this case it’s pretty easy to see that “Mark” is more popular, but to make it even easier we can arrange the data so that the most popular name is listed first.

## # A tibble: 4 x 4
##   Name  Sex   Count  Year
##   <chr> <chr> <dbl> <dbl>
## 1 Mark  Girls    20  1992
## 2 Alex  Girls   366  1992
## 3 Alex  Boys   7348  1992
## 4 Mark  Boys   8743  1992

## # A tibble: 4 x 4
##   Name  Sex   Count  Year
##   <chr> <chr> <dbl> <dbl>
## 1 Mark  Boys   8743  1992
## 2 Alex  Boys   7348  1992
## 3 Alex  Girls   366  1992
## 4 Mark  Girls    20  1992
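The two orderings shown above, smallest Count first and then largest Count first, can be sketched with arrange() and desc() from dplyr. The alex.mark data frame is rebuilt here from the four rows in the output so that the example runs on its own.

```r
library(dplyr)

# The four filtered rows from the output above
alex.mark <- data.frame(
  Name  = c("Alex", "Mark", "Mark", "Alex"),
  Sex   = c("Girls", "Girls", "Boys", "Boys"),
  Count = c(366, 20, 8743, 7348),
  Year  = 1992
)

# Sort by Count, smallest first (the default)
ascending <- arrange(alex.mark, Count)
ascending

# Wrap the column in desc() to put the largest Count first
descending <- arrange(alex.mark, desc(Count))
descending
```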

Other logical operators

In the previous example we used == to filter rows. Other relational and logical operators are listed below.

Operator   Meaning
==         equal to
!=         not equal to
>          greater than
>=         greater than or equal to
<          less than
<=         less than or equal to
%in%       contained in

These operators may be combined with & (and) or | (or).
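A short sketch of a few of these operators inside filter(), again using a small stand-in data frame. The Jane row is made up for illustration.

```r
library(dplyr)

toy.names <- data.frame(
  Name  = c("Alex", "Mark", "Mark", "Alex", "Jane"),
  Sex   = c("Girls", "Girls", "Boys", "Boys", "Girls"),
  Count = c(366, 20, 8743, 7348, 200),
  Year  = 1992
)

# %in% keeps rows whose Name appears in the vector on the right;
# it is a compact alternative to Name == "Alex" | Name == "Mark"
two.names <- filter(toy.names, Name %in% c("Alex", "Mark"))
two.names

# != and >= combined with &: non-girl rows with a Count of at least 1000
popular.boys <- filter(toy.names, Sex != "Girls" & Count >= 1000)
popular.boys
```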

Exercise 2: Peak popularity of your name

In this exercise you will discover the year your name reached its maximum popularity.

Read in the “babyNames.csv” file if you have not already done so, assigning the result to baby.names. The file is located at "http://tutorials.iq.harvard.edu/data/babyNames.csv".

Make sure you have installed the dplyr package and attached it with library(dplyr).

  1. Use filter to extract data for your name (or another name of your choice).
  2. Arrange the data you produced in step 1 above by Count. In which year was the name most popular?
  3. BONUS (optional): Filter the data to extract only the row containing the most popular boys’ name in 1999.

Exercise 2 solution

## # A tibble: 111 x 4
##    Name   Sex   Count  Year
##    <chr>  <chr> <dbl> <dbl>
##  1 George Boys  14063  1960
##  2 George Boys  13638  1961
##  3 George Boys  12553  1962
##  4 George Boys  12084  1963
##  5 George Boys  11793  1964
##  6 George Boys  10683  1965
##  7 George Boys   9942  1966
##  8 George Boys   9702  1967
##  9 George Boys   9388  1968
## 10 George Boys   9203  1969
## # … with 101 more rows

## # A tibble: 1 x 4
##   Name  Sex   Count  Year
##   <chr> <chr> <dbl> <dbl>
## 1 Jacob Boys  35361  1999
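One way the three steps could be written is sketched below. It is not necessarily the code that produced the output above. To keep it runnable offline it uses a toy data frame built from a few of the rows shown, plus one made-up 1999 row (Michael, with an invented Count) so that the bonus step has something to compare against. With the real data, replace toy.names with baby.names.

```r
library(dplyr)

# Toy stand-in for baby.names; the Michael row is hypothetical
toy.names <- data.frame(
  Name  = c("George", "George", "George", "Jacob", "Michael"),
  Sex   = "Boys",
  Count = c(14063, 13638, 12553, 35361, 30000),
  Year  = c(1960, 1961, 1962, 1999, 1999)
)

# 1. Extract the rows for one name
george <- filter(toy.names, Name == "George")

# 2. Arrange by Count, largest first; the top row is the peak year
george.sorted <- arrange(george, desc(Count))
george.sorted

# 3. BONUS: keep boys in 1999, then only the row with the largest Count
boys.1999 <- filter(toy.names, Sex == "Boys" & Year == 1999)
top.1999 <- filter(boys.1999, Count == max(Count))
top.1999
```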