Introduction

Workshop description

  • This is an intermediate/advanced R course
  • Appropriate for those with basic knowledge of R
  • This is not a statistics course!
  • Learning objectives:
    • Learn the R formula interface
    • Specify factor contrasts to test specific hypotheses
    • Perform model comparisons
    • Run and interpret variety of regression models in R

Materials and Setup

Lab computer users: Log in using the user name and password on the board to your left.

Laptop users:

Everyone:

Launch RStudio

  • Open the RStudio program from the Windows start menu
  • Create a project in the Rstatistics folder you downloaded earlier:
    • File => New Project => Existing Directory => Browse and select the Rstatistics folder.

Set working directory

It is often helpful to start your R session by setting your working directory so you don’t have to type the full path names to your data and other files

You might also start by listing the files in your working directory

## [1] "/home/izahn/Documents/Work/Classes/IQSS_Stats_Workshops/R/Rstatistics"
## [1] "Exam.rds"          "NatHealth2008MI"   "NatHealth2011.rds"
## [4] "states.dta"        "states.rds"

Load the states data

The states.dta data comes from http://anawida.de/teach/SS14/anawida/4.linReg/data/states.dta.txt and appears to have originally appeared in Statistics with Stata by Lawrence C. Hamilton.

##      names                      var.labels
## 14    csat        Mean composite SAT score
## 15    vsat           Mean verbal SAT score
## 16    msat             Mean math SAT score
## 17 percent       % HS graduates taking SAT
## 18 expense Per pupil expenditures prim&sec
## 19  income Median household income, $1,000
## 20    high             % adults HS diploma
## 21 college         % adults college degree

Linear regression

Examine the data before fitting models

Start by examining the data to check for problems.

##     expense          csat       
##  Min.   :2960   Min.   : 832.0  
##  1st Qu.:4352   1st Qu.: 888.0  
##  Median :5000   Median : 926.0  
##  Mean   :5236   Mean   : 944.1  
##  3rd Qu.:5794   3rd Qu.: 997.0  
##  Max.   :9259   Max.   :1093.0
##            expense       csat
## expense  1.0000000 -0.4662978
## csat    -0.4662978  1.0000000

Plot the data before fitting models

Plot the data to look for multivariate outliers, non-linear relationships etc.

Linear regression example

  • Linear regression models can be fit with the lm() function
  • For example, we can use lm to predict SAT scores based on per-pupal expenditures:
## 
## Call:
## lm(formula = csat ~ expense, data = states.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -131.811  -38.085    5.607   37.852  136.495 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.061e+03  3.270e+01   32.44  < 2e-16 ***
## expense     -2.228e-02  6.037e-03   -3.69 0.000563 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 59.81 on 49 degrees of freedom
## Multiple R-squared:  0.2174, Adjusted R-squared:  0.2015 
## F-statistic: 13.61 on 1 and 49 DF,  p-value: 0.0005631

Why is the association between expense and SAT scores negative?

Many people find it surprising that the per-capita expenditure on students is negatively related to SAT scores. The beauty of multiple regression is that we can try to pull these apart. What would the association between expense and SAT scores be if there were no difference among the states in the percentage of students taking the SAT?

## 
## Call:
## lm(formula = csat ~ expense + percent, data = states.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -62.921 -24.318   1.741  15.502  75.623 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 989.807403  18.395770  53.806  < 2e-16 ***
## expense       0.008604   0.004204   2.046   0.0462 *  
## percent      -2.537700   0.224912 -11.283 4.21e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.62 on 48 degrees of freedom
## Multiple R-squared:  0.7857, Adjusted R-squared:  0.7768 
## F-statistic: 88.01 on 2 and 48 DF,  p-value: < 2.2e-16

The lm class and methods

OK, we fit our model. Now what?

  • Examine the model object:
## [1] "lm"
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"
## [1] "add1.lm"                   "alias.lm"                 
## [3] "anova.lm"                  "case.names.lm"            
## [5] "coerce,oldClass,S3-method" "confint.lm"               
## [7] "cooks.distance.lm"         "deviance.lm"              
## [9] "dfbeta.lm"
  • Use function methods to get more information about the fit
##                    2.5 %        97.5 %
## (Intercept) 995.01753164 1126.44735626
## expense      -0.03440768   -0.01014361

Linear Regression Assumptions

  • Ordinary least squares regression relies on several assumptions, including that the residuals are normally distributed and homoscedastic, the errors are independent and the relationships are linear.
  • Investigate these assumptions visually by plotting your model: