- This is an intermediate/advanced R course
- Appropriate for those with basic knowledge of R
- This is not a statistics course!
- Learning objectives:
- Learn the R formula interface
- Specify factor contrasts to test specific hypotheses
- Perform model comparisons
- Run and interpret variety of regression models in R

Lab computer users: Log in using the user name and password on the board to your left.

Laptop users:

- you should have R installedâ€“if not, open a web browser and go to http://cran.r-project.org and download and install it
- also helpful to install RStudo (download from http://rstudio.com)

Everyone:

- Download materials from http://tutorials.iq.harvard.edu/R/Rstatistics.zip
- Extract materials from RStatistics.zip (on lab machines
*right-click -> WinZip -> Extract to here*) and move to your desktop.

- Open the RStudio program from the Windows start menu
- Create a project in the Rstatistics folder you downloaded earlier:
`File => New Project => Existing Directory => Browse`

and select the Rstatistics folder.

It is often helpful to start your R session by setting your working directory so you donâ€™t have to type the full path names to your data and other files

```
# set the working directory
# setwd("~/Desktop/Rstatistics")
# setwd("C:/Users/dataclass/Desktop/Rstatistics")
```

You might also start by listing the files in your working directory

`## [1] "/home/izahn/Documents/Work/Classes/IQSS_Stats_Workshops/R/Rstatistics"`

```
## [1] "Exam.rds" "NatHealth2008MI" "NatHealth2011.rds"
## [4] "states.dta" "states.rds"
```

The *states.dta* data comes from http://anawida.de/teach/SS14/anawida/4.linReg/data/states.dta.txt and appears to have originally appeared in *Statistics with Stata* by Lawrence C. Hamilton.

```
# read the states data
states.data <- readRDS("dataSets/states.rds")
#get labels
states.info <- data.frame(attributes(states.data)[c("names", "var.labels")])
#look at last few labels
tail(states.info, 8)
```

```
## names var.labels
## 14 csat Mean composite SAT score
## 15 vsat Mean verbal SAT score
## 16 msat Mean math SAT score
## 17 percent % HS graduates taking SAT
## 18 expense Per pupil expenditures prim&sec
## 19 income Median household income, $1,000
## 20 high % adults HS diploma
## 21 college % adults college degree
```

Start by examining the data to check for problems.

```
# summary of expense and csat columns, all rows
sts.ex.sat <- subset(states.data, select = c("expense", "csat"))
summary(sts.ex.sat)
```

```
## expense csat
## Min. :2960 Min. : 832.0
## 1st Qu.:4352 1st Qu.: 888.0
## Median :5000 Median : 926.0
## Mean :5236 Mean : 944.1
## 3rd Qu.:5794 3rd Qu.: 997.0
## Max. :9259 Max. :1093.0
```

```
## expense csat
## expense 1.0000000 -0.4662978
## csat -0.4662978 1.0000000
```

Plot the data to look for multivariate outliers, non-linear relationships etc.

- Linear regression models can be fit with the
`lm()`

function - For example, we can use
`lm`

to predict SAT scores based on per-pupal expenditures:

```
# Fit our regression model
sat.mod <- lm(csat ~ expense, # regression formula
data=states.data) # data set
# Summarize and print the results
summary(sat.mod) # show regression coefficients table
```

```
##
## Call:
## lm(formula = csat ~ expense, data = states.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -131.811 -38.085 5.607 37.852 136.495
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.061e+03 3.270e+01 32.44 < 2e-16 ***
## expense -2.228e-02 6.037e-03 -3.69 0.000563 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 59.81 on 49 degrees of freedom
## Multiple R-squared: 0.2174, Adjusted R-squared: 0.2015
## F-statistic: 13.61 on 1 and 49 DF, p-value: 0.0005631
```

Many people find it surprising that the per-capita expenditure on students is negatively related to SAT scores. The beauty of multiple regression is that we can try to pull these apart. What would the association between expense and SAT scores be if there were no difference among the states in the percentage of students taking the SAT?

```
##
## Call:
## lm(formula = csat ~ expense + percent, data = states.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -62.921 -24.318 1.741 15.502 75.623
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 989.807403 18.395770 53.806 < 2e-16 ***
## expense 0.008604 0.004204 2.046 0.0462 *
## percent -2.537700 0.224912 -11.283 4.21e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31.62 on 48 degrees of freedom
## Multiple R-squared: 0.7857, Adjusted R-squared: 0.7768
## F-statistic: 88.01 on 2 and 48 DF, p-value: < 2.2e-16
```

OK, we fit our model. Now what?

- Examine the model object:

`## [1] "lm"`

```
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
```

```
## [1] "add1.lm" "alias.lm"
## [3] "anova.lm" "case.names.lm"
## [5] "coerce,oldClass,S3-method" "confint.lm"
## [7] "cooks.distance.lm" "deviance.lm"
## [9] "dfbeta.lm"
```

- Use function methods to get more information about the fit

```
## 2.5 % 97.5 %
## (Intercept) 995.01753164 1126.44735626
## expense -0.03440768 -0.01014361
```

- Ordinary least squares regression relies on several assumptions, including that the residuals are normally distributed and homoscedastic, the errors are independent and the relationships are linear.
- Investigate these assumptions visually by plotting your model: