Data Management in Stata

Table of Contents

Introduction

Materials and Setup

  • Lab computer log in:
    • USERNAME: dataclass
    • PASSWORD: on the board to your left
  • Workshop materials:

Workshop Description

  • This is an Introduction to data management in Stata
  • Assumes basic knowledge of Stata
  • Not appropriate for people already well familiar with Stata
  • If you are catching on before the rest of the class, experiment with command features described in help files

Organization

  • Please feel free to ask questions at any point if they are relevant to the current topic (or if you are lost!)
  • There will be a Q&A after class for more specific, personalized questions
  • Collaboration with your neighbors is encouraged
  • If you are using a laptop, you will need to adjust paths accordingly

Opening Files in Stata

  • Look at bottom left hand corner of Stata screen
    • This is the directory Stata is currently reading from
  • Files are located in the StataDatMan folder in your home directory
  • Start by telling Stata where to look for these
// change directory
cd "~/tutorials/Stata/StataDatMan"

// Use dir to see what is in the directory:
dir
dir dataSets

// use the gss data set
use dataSets/gss.dta
set more off

cd "~/tutorials/Stata/StataDatMan"
/nfs/www/edu-harvard-iq-tutorials/Stata/StataDatMan


dir

total 100
drwxrwsr-x. 2 apache tutorwww  4096 Oct  9 08:44 dataSets/
-rwxrwxr-x. 1 izahn  tutorwww  1302 Oct  9 08:44 Exercises.do*
drwxrwsr-x. 2 apache tutorwww  4096 Oct  9 08:44 images/
drwxrwsr-x. 4 apache tutorwww  4096 Oct  9 08:44 StataDatMan/
-rwxrwxr-x. 1 izahn  tutorwww 17446 Oct  9 08:44 StataDatMan.do*
-rwxrwxr-x. 1 izahn  tutorwww 38153 Oct  9 08:44 StataDatMan.html*
-rwxrwxr-x. 1 izahn  tutorwww 20463 Oct  9 08:44 StataDatMan.org*
dir dataSets

total 2644
-rwxrwxr-x. 1 izahn tutorwww 275705 Oct  9 08:44 gss1.dta*
-rwxrwxr-x. 1 izahn tutorwww 263324 Oct  9 08:44 gss2.dta*
-rwxrwxr-x. 1 izahn tutorwww 532880 Oct  9 08:44 gssAddObserve.dta*
-rwxrwxr-x. 1 izahn tutorwww 527005 Oct  9 08:44 gssAppend.dta*
-rwxrwxr-x. 1 izahn tutorwww 527005 Oct  9 08:44 gsscompare1.dta*
-rwxrwxr-x. 1 izahn tutorwww 538755 Oct  9 08:44 gss.dta*
-rwxrwxr-x. 1 izahn tutorwww   1139 Oct  9 08:44 marital.dta*


use dataSets/gss.dta

Generating and replacing variables

Basic Data Manipulation Commands

Basic commands you'll use for generating new variables or recoding existing variables:

  • gen
  • egen
  • replace
  • recode

Many different means of accomplishing the same thing in Stata – find what is comfortable (and easy) for you!

Generate and Replace

The replace command is often used with logic statements. Available logical operators include the following:

Operator Meaning
== equal to
!= not equal to
> greater than
>= greater than or equal to
< less than
<= less than or equal to
& and
| or

For example:

//create "hapnew" variable
gen hapnew = .
//set to 0 if happy equals 1
replace hapnew=0 if happy==1 
//set to 1 if happy both and hapmar are greater than 3
replace hapnew=1 if happy>3 & hapmar>3 
// tabulate the new 
tab hapnew
gen hapnew = .
(1,419 missing values generated)

replace hapnew=0 if happy==1 
(435 real changes made)

replace hapnew=1 if happy>3 & hapmar>3 
(4 real changes made)

tab hapnew

     hapnew |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        435       99.09       99.09
          1 |          4        0.91      100.00
------------+-----------------------------------
      Total |        439      100.00

Recode

The recode command is basically generate and replace combined. You can recode an existing variable OR use recode to create a new variable (via the gen option).

// recode the wrkstat variable 
recode wrkstat (1=8) (2=7) (3=6) (4=5) (5=4) (6=3) (7=2) (8=1)
// recode wrkstat into a new variable named wrkstat2
recode wrkstat (1=8), gen(wrkstat2)
// tabulate workstat
tab wrkstat
recode wrkstat (1=8) (2=7) (3=6) (4=5) (5=4) (6=3) (7=2) (8=1)
(wrkstat: 1419 changes made)

recode wrkstat (1=8), gen(wrkstat2)
(32 differences between wrkstat and wrkstat2)

tab wrkstat

      LABOR FRCE |
          STATUS |      Freq.     Percent        Cum.
-----------------+-----------------------------------
WORKING FULLTIME |         32        2.26        2.26
WORKING PARTTIME |        155       10.92       13.18
TEMP NOT WORKING |         34        2.40       15.57
UNEMPL, LAID OFF |        214       15.08       30.66
         RETIRED |         29        2.04       32.70
          SCHOOL |         35        2.47       35.17
   KEEPING HOUSE |        146       10.29       45.45
           OTHER |        774       54.55      100.00
-----------------+-----------------------------------
           Total |      1,419      100.00

The table below illustrates common forms of recoding

Rule Example Meaning
#=# 3=1 3 recoded to 1
##=# 2. =9 2 and . recoded to 9
#/# = # 1/5=4 1 through 5 recoded to 4
nonmissing=# nonmiss=8 nonmissing recoded to 8
missing=# miss=9 missing recoded to 9

egen

The egen command ("extensions" to the gen command) provides convenient methods for performing many common data manipulation tasks.

For example, we can use egen to create a new variable that counts the number of "yes" responses on computer, email and internet use:

// count number of yes on use comp email and net 
egen compuser= anycount(usecomp usemail usenet), values(1)
tab compuser
egen compuser= anycount(usecomp usemail usenet), values(1)
tab compuser

    usecomp |
    usemail |
usenet == 1 |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        623       43.90       43.90
          1 |        142       10.01       53.91
          2 |         78        5.50       59.41
          3 |        576       40.59      100.00
------------+-----------------------------------
      Total |      1,419      100.00

Here are some additional examples of egen in action:

// assess how much missing data each participant has:
egen countmiss = rowmiss(age-wifeft)
codebook countmiss
// compare values on multiple variables
egen ftdiff=diff(wkftwife wkfthusb)
codebook ftdiff
egen countmiss = rowmiss(age-wifeft)
codebook countmiss

-------------------------------------------------------------------------------
countmiss                                                           (unlabeled)
-------------------------------------------------------------------------------

                  type:  numeric (float)

                 range:  [0,7]                        units:  1
         unique values:  6                        missing .:  0/1,419

            tabulation:  Freq.  Value
                           296  0
                           215  1
                           113  2
                             7  3
                           782  6
                             6  7

egen ftdiff=diff(wkftwife wkfthusb)
codebook ftdiff

-------------------------------------------------------------------------------
ftdiff                                                   diff wkftwife wkfthusb
-------------------------------------------------------------------------------

                  type:  numeric (float)

                 range:  [0,1]                        units:  1
         unique values:  2                        missing .:  0/1,419

            tabulation:  Freq.  Value
                         1,169  0
                           250  1

You will need to refer to the documentation to discover what else egen can do: type "help egen" in Stata to get a complete list of functions.

Exercise 1: Generate, Replace, Recode & Egen

Open the gss.dta data.

  1. Generate a new variable that represents the squared value of age.
  2. Generate a new variable equal to "1" if income is greater than "19".
  3. Create a new variable that counts the number of missing responses for each respondent. What is the maximum number of missing variables?

Exercise 1 prototype   prototype

Open the gss.dta data.

  1. Generate a new variable that represents the squared value of age.
use dataSets/gss.dta, clear
gen age2 = age^2
use dataSets/gss.dta, clear
gen age2 = age^2
  1. Generate a new variable equal to "1" if income is greater than "19".
describe income
label list income
recode income (99=.) (98=.)
gen highincome =0 if income != .
replace highincome=1 if income>19
sum highincome
describe income

              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
income          double  %10.0g     income     TOTAL FAMILY INCOME FOR LAST YEAR
label list income
income:
           0 NAP
           1 UNDER $1 000
           2 $1 000 TO 2 999
           3 $3 000 TO 3 999
           4 $4 000 TO 4 999
           5 $5 000 TO 5 999
           6 $6 000 TO 6 999
           7 $7 000 TO 7 999
           8 $8 000 TO 9 999
           9 $10000 TO 12499
          10 $12500 TO 14999
          11 $15000 TO 17499
          12 $17500 TO 19999
          13 $20000 TO 22499
          14 $22500 TO 24999
          15 $25000 TO 29999
          16 $30000 TO 34999
          17 $35000 TO 39999
          18 $40000 TO 49999
          19 $50000 TO 59999
          20 $60000 TO 74999
          21 $75000 TO $89999
          22 $90000 - $109999
          23 $110000 OR OVER
          24 REFUSED
          98 DK
          99 NA
recode income (99=.) (98=.)
(income: 65 changes made)
gen highincome =0 if income != .
(65 missing values generated)
replace highincome=1 if income>19
(487 real changes made)
sum highincome

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
  highincome |      1,419    .3431994    .4749448          0          1
  1. Create a new variable that counts the number of missing responses for each respondent. What is the maximum number of missing variables?
egen nmissing = rowmiss(_all)
sum nmissing
egen nmissing = rowmiss(_all)
sum nmissing

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    nmissing |      1,419    4.820296    3.684793          0         10

By processing

The "bysort" Command

Sometimes, you'd like to create variables based on different categories of a single variable. For example, say you want to look at happiness based on whether an individual is male or female. The "bysort" prefix does just this:

// tabulate happy separately for male and female 
bysort sex: tab happy
// generate summary statistics using bysort 
bysort state: egen stateincome = mean(income)
bysort degree: egen degreeincome = mean(income)
bysort marital: egen marincomesd = sd(income)
bysort sex: tab happy

-------------------------------------------------------------------------------
-> sex = Male

      GENERAL |
    HAPPINESS |      Freq.     Percent        Cum.
--------------+-----------------------------------
   VERY HAPPY |        189       30.39       30.39
 PRETTY HAPPY |        350       56.27       86.66
NOT TOO HAPPY |         73       11.74       98.39
           NA |         10        1.61      100.00
--------------+-----------------------------------
        Total |        622      100.00

-------------------------------------------------------------------------------
-> sex = Female

      GENERAL |
    HAPPINESS |      Freq.     Percent        Cum.
--------------+-----------------------------------
   VERY HAPPY |        246       30.87       30.87
 PRETTY HAPPY |        447       56.09       86.95
NOT TOO HAPPY |         84       10.54       97.49
           DK |          1        0.13       97.62
           NA |         19        2.38      100.00
--------------+-----------------------------------
        Total |        797      100.00

bysort state: egen stateincome = mean(income)
variable state not found
r(111);
bysort degree: egen degreeincome = mean(income)
bysort marital: egen marincomesd = sd(income)

By prefix vs. by options

Some commands won't work with by prefix, but instead have a by option:

// generate separate histograms for female and male 
hist nethrs, by(sex)

histBysex.png

Missing values

Missing Values

You always need to consider how missing values are coded when recoding variables.

  • Stata's symbol for a missing value is "."
  • Stata interprets "." as a large value
  • Easy to make mistakes!

To identify highly educated women, we might use the command:

// generate and replace without considering missing values
gen hi_ed=0
replace hi_ed=1 if wifeduc>15
// What happens to our missing values?
tab hi_ed, mi nola
gen hi_ed=0
replace hi_ed=1 if wifeduc>15
(944 real changes made)

tab hi_ed, mi nola

      hi_ed |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        475       33.47       33.47
          1 |        944       66.53      100.00
------------+-----------------------------------
      Total |      1,419      100.00

It looks like around 66% have higher education, but look closer:

// gen hi_ed2, but don't set a value if wifeduc is missing
gen hi_ed2 = 0 if wifeduc != . 
// only replace non-missing
replace hi_ed2=1 if wifeduc >15 & wifeduc !=. 
//check to see that missingness is preserved
tab hi_ed2, mi
gen hi_ed2 = 0 if wifeduc != . 
(797 missing values generated)

replace hi_ed2=1 if wifeduc >15 & wifeduc !=. 
(147 real changes made)

 |        797       56.17      100.00
------------+-----------------------------------
      Total |      1,419      100.00

The correct value is 10%. Moral of the story? Be careful with missing values and remember that Stata considers missing values to be large!

Bulk Conversion to Missing Values

Often the data collection/generating procedure will have used some other value besides "." to represent missing values. The mvdecode command will convert all these values to missing. For example:

mvdecode _all, mv(999)
mvdecode _all, mv(999)
  • The "\all" command tells Stata to do this to all variables
  • Use this command carefully!
    • If you have any variables where "999" is a legitimate value, Stata is going to recode it to missing
    • As an alternative, you could list var names separately rather than using "\all"

Variable types

Variable Types

Stata uses two main types of variables: String and Numeric. To be able to perform any mathematical operations, your variables need to be in a numeric format. Stata can store numbers with differing levels of precision, as described in the table below.

type Minimum Maximum being 0 bytes
byte -127 100 +/-1 1
int -32,767 32,740 +/-1 2
long -2,147,483,647 2,147,483,620 +/-1 4
float -1.70141173319*1038 1.70141173319*1038 +/-10-38 4
double -8.9884656743*10307 8.9884656743*10307 +/-10-323 8
  • Precision for float is 3.795x10-8.
  • Precision for double is 1.414x10-16.

Converting to and from Strings

Stata provides several ways to convert to and from strings. You can use tostring and destring to convert from one type to the other:

// convert degree to a string
tostring degree, gen(degree_s)
// and back to a number
destring degree_s, gen(degree_n)
tostring degree, gen(degree_s)
degree_s generated as str1

destring degree_s, gen(degree_n)
degree_s has all characters numeric; degree_n generated as byte

Use decode and encode to convert to/from variable labels:

// convert degree to a descriptive string
decode degree, gen(degree_s2)
// and back to a number with labels
encode degree_s2, gen(degree_n2)
decode degree, gen(degree_s2)

encode degree_s2, gen(degree_n2)

Converting Strings to Date/Time

Often date/time variables start out as strings – You'll need to convert them to numbers using one of the conversion functions listed below.

Format Meaning String-to-numeric conversion function
%tc milliseconds clock(string, mask)
%td days date(string, mask)
%tw weeks weekly(string, mask)
%tm months monthly(string, mask)
%tq quarters quarterly(string, mask)
%ty years yearly(string, mask)

Date/time variables are stored as the number of units elapsed since 01jan1960 00:00:00.000. For example, the date function returns the number of days since that time, and the clock function returns the number of milliseconds since that time.

// create string variable and convert to date
gen date = "November 9 2020"
gen date1 = date(date, "MDY")
list date1 in 1/5
gen date = "November 9 2020"
gen date1 = date(date, "MDY")
list date1 in 1/5

     +-------+
     | date1 |
     |-------|
  1. | 22228 |
  2. | 22228 |
  3. | 22228 |
  4. | 22228 |
  5. | 22228 |
     +-------+

Formatting Numbers as Dates

Once you have converted the string to a number you can format it for display. You can simply accept the defaults used by your formatting string or provide details to customize it.

// format so humans can read the date
format date1 %d
list date1 in 1/5
// format with detail
format date1 %tdMonth_dd,_CCYY
list date1 in 1/5
format date1 %d
list date1 in 1/5

     +-----------+
     |     date1 |
     |-----------|
  1. | 09nov2020 |
  2. | 09nov2020 |
  3. | 09nov2020 |
  4. | 09nov2020 |
  5. | 09nov2020 |
     +-----------+

format date1 %tdMonth_dd,_CCYY
list date1 in 1/5

     +------------------+
     |            date1 |
     |------------------|
  1. | November 9, 2020 |
  2. | November 9, 2020 |
  3. | November 9, 2020 |
  4. | November 9, 2020 |
  5. | November 9, 2020 |
     +------------------+

Exercise 2: Missing Values, String Conversion, and by Processing

  1. Recode values "99" and "98" on the variable, "hrs1" as "missing."
  2. Recode the marital variable into a "string" variable and then back into a numeric variable.
  3. Create a new variable that associates each individual with the average number of hours worked among individuals with matching educational degrees (see the last "by" example for inspiration).

Exercise 2 prototype   prototype

  1. Recode values "99" and "98" on the variable, "hrs1" as "missing."
use dataSets/gss.dta, clear
sum hrs1
recode hrs1 (99=.) (98=.) 
sum hrs1
use dataSets/gss.dta, clear
sum hrs1

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        hrs1 |      1,419    27.64905    24.30189         -1         99
recode hrs1 (99=.) (98=.) 
(hrs1: 11 changes made)
sum hrs1

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        hrs1 |      1,408    27.09233    23.56243         -1         89
  1. Recode the marital variable into a "string" variable and then back into a numeric variable.
tostring marital, gen(marstring)
destring marstring, gen(mardstring)
//compare with
decode marital, gen(marital_s)
encode marital_s, gen(marital_n)

describe marital marstring mardstring marital_s marital_n
sum marital marstring mardstring marital_s marital_n
use dataSets/gss.dta, clear
tostring marital, gen(marstring)
marstring generated as str1
destring marstring, gen(mardstring)
marstring has all characters numeric; mardstring generated as byte

decode marital, gen(marital_s)
encode marital_s, gen(marital_n)

describe marital marstring mardstring marital_s marital_n

              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
marital         double  %10.0g     marital    MARITAL STATUS
marstring       str1    %9s                   MARITAL STATUS
mardstring      byte    %10.0g                MARITAL STATUS
marital_s       str13   %13s                  MARITAL STATUS
marital_n       long    %13.0g     marital_n
                                              MARITAL STATUS
sum marital marstring mardstring marital_s marital_n

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
     marital |      1,419    2.560958    1.656606          1          5
   marstring |          0
  mardstring |      1,419    2.560958    1.656606          1          5
   marital_s |          0
   marital_n |      1,419     2.46864     1.11641          1          5
  1. Create a new variable that associates each individual with the average number of hours worked among individuals with matching educational degrees (see the last "by" example for inspiration).
bysort degree: egen hrsdegree = mean(hrs1)
tab hrsdegree
tab hrsdegree degree
bysort degree: egen hrsdegree = mean(hrs1)
variable hrsdegree already defined
r(110);
tab hrsdegree

  hrsdegree |      Freq.     Percent        Cum.
------------+-----------------------------------
   15.83951 |        243       17.12       17.12
   25.42857 |          7        0.49       17.62
   27.13793 |        725       51.09       68.71
   31.90099 |        101        7.12       75.83
    35.6483 |        236       16.63       92.46
   36.38679 |        106        7.47       99.93
         40 |          1        0.07      100.00
------------+-----------------------------------
      Total |      1,419      100.00
tab hrsdegree degree

           |              Respondent's highest degree
 hrsdegree | Less than  High scho  Junior co   Bachelor   Graduate |     Total
-----------+-------------------------------------------------------+----------
  15.83951 |       243          0          0          0          0 |       243 
  25.42857 |         0          0          0          0          0 |         7 
  27.13793 |         0        725          0          0          0 |       725 
  31.90099 |         0          0        101          0          0 |       101 
   35.6483 |         0          0          0        236          0 |       236 
  36.38679 |         0          0          0          0        106 |       106 
        40 |         0          0          0          0          0 |         1 
-----------+-------------------------------------------------------+----------
     Total |       243        725        101        236        106 |     1,419 


           | Respondent's highest
           |        degree
 hrsdegree |        DK         NA |     Total
-----------+----------------------+----------
  15.83951 |         0          0 |       243 
  25.42857 |         0          7 |         7 
  27.13793 |         0          0 |       725 
  31.90099 |         0          0 |       101 
   35.6483 |         0          0 |       236 
  36.38679 |         0          0 |       106 
        40 |         1          0 |         1 
-----------+----------------------+----------
     Total |         1          7 |     1,419

Merging, appending, and joining

Appending Datasets

Sometimes you have observations in two different datasets, or you'd like to add observations to an existing dataset. In this case you can use the append command to add observations to the end of the observations in the master dataset. For example:

clear
// from the append help file
webuse even
list
webuse odd
list
// Append even data to the end of the odd data
append using "http://www.stata-press.com/data/r14/even"
list
clear
clear

webuse even
(6th through 8th even numbers)
list

     +---------------+
     | number   even |
     |---------------|
  1. |      6     12 |
  2. |      7     14 |
  3. |      8     16 |
     +---------------+
webuse odd
(First five odd numbers)
list

     +--------------+
     | number   odd |
     |--------------|
  1. |      1     1 |
  2. |      2     3 |
  3. |      3     5 |
  4. |      4     7 |
  5. |      5     9 |
     +--------------+

append using "http://www.stata-press.com/data/r14/even"
list

     +---------------------+
     | number   odd   even |
     |---------------------|
  1. |      1     1      . |
  2. |      2     3      . |
  3. |      3     5      . |
  4. |      4     7      . |
  5. |      5     9      . |
     |---------------------|
  6. |      6     .     12 |
  7. |      7     .     14 |
  8. |      8     .     16 |
     +---------------------+
clear

To keep track of where observations came from, use the generate option as shown below:

webuse odd
append using "http://www.stata-press.com/data/r14/even", generate(observesource)
list
clear
webuse odd
(First five odd numbers)
 ce)
list

     +--------------------------------+
     | number   odd   observ~e   even |
     |--------------------------------|
  1. |      1     1          0      . |
  2. |      2     3          0      . |
  3. |      3     5          0      . |
  4. |      4     7          0      . |
  5. |      5     9          0      . |
     |--------------------------------|
  6. |      6     .          1     12 |
  7. |      7     .          1     14 |
  8. |      8     .          1     16 |
     +--------------------------------+
clear

There is a "force" option will allow for data type mismatches, but again this is not recommended.

Remember, append is for adding observations (i.e., rows) from a second data set.

Merging Datasets

You can merge variables from a second dataset to the dataset you're currently working with.

  • Current active dataset = master dataset
  • Dataset you'd like to merge with master = using dataset

There are different ways that you might be interested in merging data:

  • Two datasets with same participant pool, one row per participant (1:1)
  • A dataset with one participant per row with a dataset with multiple rows per participant (1:many or many:1)

Before you begin:

  • Identify the "ID" that you will use to merge your two datasets
  • Determine which variables you'd like to merge
  • In Stata >= 11, data does NOT have to be sorted
  • Variable types must match across datasets (there is a "force" option to get around this, but not recommended)
// Adapted from the merge help page
webuse autosize 
list
webuse autoexpense
list

webuse autosize
merge 1:1 make using "http://www.stata-press.com/data/r14/autoexpense"
list
clear

// keep only the matches (AKA "inner join")
webuse autosize, clear
merge 1:1 make using "http://www.stata-press.com/data/r14/autoexpense", keep(match) nogen
list
clear
webuse autosize 
(1978 Automobile Data)
list

     +------------------------------------+
     | make               weight   length |
     |------------------------------------|
  1. | Toyota Celica       2,410      174 |
  2. | BMW 320i            2,650      177 |
  3. | Cad. Seville        4,290      204 |
  4. | Pont. Grand Prix    3,210      201 |
  5. | Datsun 210          2,020      165 |
     |------------------------------------|
  6. | Plym. Arrow         3,260      170 |
     +------------------------------------+
webuse autoexpense
(1978 Automobile Data)
list

     +---------------------------------+
     | make                price   mpg |
     |---------------------------------|
  1. | Toyota Celica       5,899    18 |
  2. | BMW 320i            9,735    25 |
  3. | Cad. Seville       15,906    21 |
  4. | Pont. Grand Prix    5,222    19 |
  5. | Datsun 210          4,589    35 |
     +---------------------------------+

webuse autosize
(1978 Automobile Data)
merge 1:1 make using "http://www.stata-press.com/data/r14/autoexpense"

    Result                           # of obs.
    -----------------------------------------
    not matched                             1
        from master                         1  (_merge==1)
        from using                          0  (_merge==2)

    matched                                 5  (_merge==3)
    -----------------------------------------
list

     +---------------------------------------------------------------------+
     | make               weight   length    price   mpg            _merge |
     |---------------------------------------------------------------------|
  1. | BMW 320i            2,650      177    9,735    25       matched (3) |
  2. | Cad. Seville        4,290      204   15,906    21       matched (3) |
  3. | Datsun 210          2,020      165    4,589    35       matched (3) |
  4. | Plym. Arrow         3,260      170        .     .   master only (1) |
  5. | Pont. Grand Prix    3,210      201    5,222    19       matched (3) |
     |---------------------------------------------------------------------|
  6. | Toyota Celica       2,410      174    5,899    18       matched (3) |
     +---------------------------------------------------------------------+
clear


webuse autosize, clear
(1978 Automobile Data)
 match) nogen

    Result                           # of obs.
    -----------------------------------------
    not matched                             0
    matched                                 5  
    -----------------------------------------
list

     +---------------------------------------------------+
     | make               weight   length    price   mpg |
     |---------------------------------------------------|
  1. | BMW 320i            2,650      177    9,735    25 |
  2. | Cad. Seville        4,290      204   15,906    21 |
  3. | Datsun 210          2,020      165    4,589    35 |
  4. | Pont. Grand Prix    3,210      201    5,222    19 |
  5. | Toyota Celica       2,410      174    5,899    18 |
     +---------------------------------------------------+
clear

Remember, merge is for adding variables (i.e., columns) from a second data set.

Merge Options

There are several options that provide more fine-grain control over what happens to non-id columns contained in both data sets. If you've carefully cleaned and prepared the data prior to merging this shouldn't be an issue, but here are some details about how stata handles this situation.

  • In standard merge, the master dataset is the authority and WON'T CHANGE
  • If your master dataset has missing data and some of those values are not missing in your using dataset, specify "update" – this will fill in missing data in master
  • If you want data from your using dataset to overwrite that in your master, specify "replace update" – this will replace master data with using data UNLESS the value is missing in the using dataset

Many-to-many merges

Stata allows you to specify merges like merge m:m id using newdata.dta, but I have never seen this do anything useful. To quote the official Stata manual:

m:m specifies a many-to-many merge and is a bad idea. In an m:m merge, observations are matched within equal values of the key variable(s), with the first observation being matched to the first; the second, to the second; and so on. If the master and using have an unequal number of observations within the group, then the last observation of the shorter group is used repeatedly to match with subsequent observations of the longer group. Thus m:m merges are dependent on the current sort order—something which should never happen. Because m:m merges are such a bad idea, we are not going to show you an example. If you think that you need an m:m merge, then you probably need to work with your data so that you can use a 1:m or m:1 merge. Tips for this are given in Troubleshooting m:m merges below

(emphasis added).

If you are thinking about using merge m:m chances are good that you actually need joinby. Here is a quick example, modified from the joinby help page.

clear
webuse parent
list
webuse children
list
// Complete and utter nonsense!
merge m:m family_id using http://www.stata-press.com/data/r14/parent 
// You want joinby instead
clear
webuse children
joinby family_id using http://www.stata-press.com/data/r14/parent
clear
webuse parent
(Data on Parents)
list

     +--------------------------------+
     | family~d   parent~d   x1    x3 |
     |--------------------------------|
  1. |     1030         10   39   600 |
  2. |     1025         11   20   643 |
  3. |     1025         12   27   721 |
  4. |     1026         13   30   760 |
  5. |     1026         14   26   668 |
     |--------------------------------|
  6. |     1030         15   32   684 |
     +--------------------------------+
webuse children
file http://www.stata-press.com/data/r14/children.dta not found
r(601);
list

     +--------------------------------+
     | family~d   parent~d   x1    x3 |
     |--------------------------------|
  1. |     1030         10   39   600 |
  2. |     1025         11   20   643 |
  3. |     1025         12   27   721 |
  4. |     1026         13   30   760 |
  5. |     1026         14   26   668 |
     |--------------------------------|
  6. |     1030         15   32   684 |
     +--------------------------------+

merge m:m family_id using http://www.stata-press.com/data/r14/parent 

    Result                           # of obs.
    -----------------------------------------
    not matched                             0
    matched                                 6  (_merge==3)
    -----------------------------------------

clear
webuse children
file http://www.stata-press.com/data/r14/children.dta not found
r(601);
joinby family_id using http://www.stata-press.com/data/r14/parent
variable family_id not found
r(111);

Remeber, merge m:m is old and broken; do not use. Anytime you think you might want m:m you should use joinby instead.

Creating summarized data sets

Collapse

Collapse will take master data and create a new dataset of summary statistics

  • Useful in hierarchical linear modeling if you'd like to create aggregate, summary statistics
  • Can generate group summary data for many descriptive stats
  • Can also attach weights

Before you collapse:

  • Save your master dataset and then save it again under a new name (this will prevent collapse from writing over your original data_
  • Consider issues of missing data. Do you want Stata to use all possible observations? If not, the cw (casewise) option will make casewise deletions

Collapse Example

// Adapted from the collapse help page
clear
webuse college
list
// mean and sd by hospital
collapse (mean) mean_gpa = gpa mean_hour = hour (sd) sd_gpa = gpa sd_hour = hour, by(year)
list
clear
clear
webuse college
list

     +----------------------------+
     | gpa   hour   year   number |
     |----------------------------|
  1. | 3.2     30      1        3 |
  2. | 3.5     34      1        2 |
  3. | 2.8     28      1        9 |
  4. | 2.1     30      1        4 |
  5. | 3.8     29      2        3 |
     |----------------------------|
  6. | 2.5     30      2        4 |
  7. | 2.9     35      2        5 |
  8. | 3.7     30      3        4 |
  9. | 2.2     35      3        2 |
 10. | 3.3     33      3        3 |
     |----------------------------|
 11. | 3.4     32      4        5 |
 12. | 2.9     31      4        2 |
     +----------------------------+

 our, by(year)
list

     +--------------------------------------------------+
     | year   mean_gpa   mean_h~r     sd_gpa    sd_hour |
     |--------------------------------------------------|
  1. |    1        2.9       30.5   .6055301   2.516612 |
  2. |    2   3.066667   31.33333   .6658328    3.21455 |
  3. |    3   3.066667   32.66667   .7767453   2.516612 |
  4. |    4       3.15       31.5   .3535534   .7071068 |
     +--------------------------------------------------+
clear

You could also generate different statistics for multiple variables

Exercise 3: Merge, Append, and Collapse

Open the gss2.dta dataset. This dataset contains only half of the variables that are in the complete gss dataset.

  1. Merge dataset gss1.dta with dataset gss2.dta. The identification variable is "id."
  2. Open the gss.dta dataset and merge in data from the "marital.dta" dataset, which includes income information grouped by individuals' marital status. The marital dataset contains collapsed data regarding average statistics of individuals based on their marital status.
  3. Open the gssAppend.dta dataset and Create a new dataset that combines the observations in gssAppend.dta with those in gssAddObserve.dta.
  4. Open the gss.dta dataset. Create a new dataset that summarizes mean and standard deviation of income based on individuals' degree status ("degree"). In the process of creating this new dataset, rename your three new variables.

Exercise 3 prototype   prototype

Open the gss2.dta dataset. This dataset contains only half of the variables that are in the complete gss dataset.

  1. Merge dataset gss1.dta with dataset gss2.dta. The identification variable is "id."
use dataSets/gss2.dta, clear
merge 1:1 id using dataSets/gss1.dta
save gss3.dta, replace
  1. Open the gss.dta dataset and merge in data from the "marital.dta" dataset, which includes income information grouped by individuals' marital status. The marital dataset contains collapsed data regarding average statistics of individuals based on their marital status.
use dataSets/gss.dta, clear
merge m:1 marital using dataSets/marital.dta, nogenerate replace update
save gss4.dta, replace
  1. Open the gssAppend.dta dataset and Create a new dataset that combines the observations in gssAppend.dta with those in gssAddObserve.dta.
use dataSets/gssAppend.dta, clear
append using dataSets/gssAddObserve, generate(observe)
  1. Open the gss.dta dataset. Create a new dataset that summarizes mean and standard deviation of income based on individuals' degree status ("degree"). In the process of creating this new dataset, rename your three new variables.
use dataSets/gss.dta, clear
save collapse2.dta, replace
use collapse2.dta, clear
collapse (mean) meaninc=income (sd) sdinc=income, by(marital)

Wrap-up

Help Us Make This Workshop Better

Additional resources

These workshop notes by Harvard University are licensed Creative Commons License. Presented by Data Science Services at IQSS