Screen Scraping: A Hands-on Introduction

Table of Contents

1 Goals

  • Get a working python environment installed on your own computer
  • Make python seem less scary
  • Understand some of the differences between python and other languages
  • Understand what screen scraping is all about
  • Learn the tools to scrape web sites (and other structured text) effectively

This may take some time! Depending on how quickly we go, it could take a number of sessions - my intention is to play it by ear and see what we need to focus on.

1.1 Who am I?

My name is Alex Storer, and I'm part of the Data Science Services team at IQSS. I have a PhD in Computational Neuroscience, and have done a lot of programming and scripting to interact with data.

Our team can help you with your research questions, both with the statistics and the technology. If you want to chat with us, simply e-mail support@help.hmdc.harvard.edu.

1.2 What is this page?

This is a tutorial that I wrote using org-mode in Emacs. It is hosted here:

http://www.people.fas.harvard.edu/~astorer/scraping/scraping.html

You can always find details about our ongoing workshops here:

http://dss.iq.harvard.edu

2 Basic Python

Python is a powerful interpreted language that people often use for scraping. We'll highlight here a few of the most helpful features for understanding Python code and writing scrapers. This is by no means a complete or thorough introduction to Python! It's just enough to get by.

2.1 Installation

Python comes in two modern flavors, version 2 and version 3. There are some important language differences between them, and in practice, almost everyone uses version 2. To install it, go here and select the relevant operating system.

2.1.1 IDE

An IDE, or Integrated Development Environment, is used to facilitate programming. A good IDE does things like code highlighting, error checking, one-click running, and easy integration across multiple files. An example of a crappy IDE is Notepad. I like to use Emacs. Most people prefer something else.

2.1.2 Wing IDE 101

For this session, I recommend Wing 101. It's a free version of a more fully-featured IDE, but for beginners, it's perfect. If you don't already have an IDE that you're invested in, or you want your intro to python to be as painless as possible, you should install it. It's cross platform.

  • Getting Started in Wing
    Once you have Wing installed, you might want to use the tutorial to learn how to navigate around in it.

    ./img/tutorial.jpg

    Opening the tutorial in Wing 101.

2.2 Further Python Resources

But wait, I want to spend four months becoming a Python guru!

Dude, you're awesome. Here are some resources that will help you:

2.3 Diving In

In Wing, there is a window open called the Python Shell.

  • If you know R, think of this just like the R command line
  • If you've never programmed before, think of this as a graphing calculator
print 2+4
6

2.3.1 Basic Text Handling

  • Of course, this graphing calculator can handle text, too!
mystr = "Hello, World!"
print mystr
print len(mystr)
Hello, World!
13
Python Code           | R Code                  | English Translation
print 2+4             | print(2+4)              | Print the value of 2+4
mystr = 'Hello World' | mystr <- 'Hello World'  | Assign the string "Hello World" to the variable mystr
len(mystr)            | nchar(mystr)            | How "long" is the variable mystr? Note: R can tell you how long it is, but if you want the number of characters, that's what you need to ask for.

Note to Stata Users:
Assigning a variable is not the same as adding a "column" to your dataset.

2.3.2 Indexing and Slicing

Get the first element of a string.

  • Note: Python counts from 0. This is a common convention in most languages constructed by computer scientists.
mystr = "Dogs in outer space"
print mystr[0]
D

Get the last element of a string

mystr = "Dogs in outer space"
print mystr[-1]
print mystr[len(mystr)-1]
e
e
mystr = "Dogs in outer space"
print mystr[1:3]
print mystr[3:]
print mystr[:-3]
og
s in outer space
Dogs in outer sp

2.3.3 Including Other Packages

  • By default, python doesn't include every possible "package"
    • This is similar to R, but unlike Matlab
    • Use the import statement to load a library
import math
print math.sin(math.pi)
1.22464679915e-16

After we import from a package, we have to access sub-elements of that package using the . operator. Notice also that while the value 1.22464679915e-16 is very nearly 0, the math module doesn't know that sin(π) = 0. There are smarter modules for doing math in Python, like scipy and numpy. Some people love using Python for Math. I think it makes more sense to use R.

  • If you want to import something into your namespace
    • from math import <myfunction> or
    • from math import *
from math import *
print sin(pi)
1.22464679915e-16

2.3.4 Objects and methods

Python makes extensive use of objects. An object has

  • Methods: functions that work only on that type of object
  • Fields: data that only that type of object has

For example, let's imagine a fruit object. A fruit might have a field called hasPeel, which tells you whether this fruit is peeled. It could also have a method called peel, which alters the state of the fruit.
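
Here is a minimal sketch of what that might look like in code. The Fruit class, its hasPeel field, and its peel method are all hypothetical, just to make the idea concrete:

class Fruit(object):
    def __init__(self):
        self.hasPeel = True    # field: data that this object carries around

    def peel(self):            # method: a function that acts on this object
        self.hasPeel = False

banana = Fruit()
print banana.hasPeel
banana.peel()
print banana.hasPeel
True
False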

str = "THE World is A BIG and BEAUTIFUL place.  "
print str.upper()
name = "Alex Storer"
print name.swapcase()
THE WORLD IS A BIG AND BEAUTIFUL PLACE.  
aLEX sTORER

Here we defined two strings, str and name, and used these to invoke string methods which affect the case of the string.

  • You can write your own objects and methods
  • Objects can be sub-classes of other objects
    • e.g., a psychologist is a type of researcher, who does everything a researcher does but also some other things only a psychologist does.

2.3.5 Defining Functions

You can write your own functions, pieces of code that can be used to take specific inputs and give outputs. You can create a function by using the def command.

def square(x):
    return x*x
print square(9)
81

Pay close attention to the whitespace that is used in Python! Unlike other languages, it is not ignored. Everything with the same indentation is at the same level. Above, the statement return x*x is part of the square function, but the following line is outside of the function definition.

2.3.6 Logical Flow

./img/decision-tree.png

The xkcd guide to writing good code

You can think of this logical process as pseudocode.

IF do things right
   ---> code well
OTHERWISE
   ---> do things fast

A lot of programming is figuring out how to fit things into this sort of if/else structure. Let's look at an example in Python.

  • The method find returns the index of the first location of a string match
mystr = "This is one cool looking string!"
if mystr.find("string")>len(mystr)/2:
    print "The word 'string' is in the second half"
else:
    print "The word 'string is not in the second half"
The word 'string' is in the second half

What happens if the word "string" is not there at all?

  • The method find returns -1 if the string isn't found
mystr = "I don't know about you, but I only use velcro."
print mystr.find("string")
if mystr.find("string")>len(mystr)/2:
    print "The word 'string' is in the second half"
elif mystr.find("string")>=0:
    print "The word 'string is not in the second half"
else:
    print "The word 'string' isn't there!"
-1
The word 'string' isn't there!
  • Important Note: In Python, most everything evaluates to True. Exceptions include 0, None, False, and empty strings and containers. This means that you can say things like if (result) where the result may be a computation, a string search, or anything like that. As long as it evaluates to True, it will work!
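    A quick demonstration of which values count as True and which do not (try it in the shell):

    print bool(0), bool(None), bool("")
    print bool(42), bool("hi"), bool([1, 2])
    False False False
    True True True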

2.3.7 Review

  • if, elif and else can be used to control the flow of a program
  • strings are a type of a object, and have a number of methods that come with them, including find, upper and swapcase
    • methods are called using mystring.method()
    • The list of methods for strings can be found in the Python documentation
  • def can be used to define a function
    • The return statement determines what the function returns

2.4 For Loops

The for loop is a major component of how python is used. You can iterate over lots of different things, and python is smart enough to know how to do it.

  • Note: the following is what's called pseudocode - something that looks like code, but isn't going to run. It's a helpful way to clarify the steps that you need to take to get things to work.
for (item in container):
    process item
    print item
print "done processing items!"

Notice the use of the <TAB> (or spacing) - that's how python knows whether we're inside the loop or not!

2.4.1 Example

str = "Daddy ran to help Ann.  Up and down went the seesaw."      
for word in str.split():
    print word
Daddy
ran
to
help
Ann.
Up
and
down
went
the
seesaw.

Notice the use of str.split(): this is an example of calling a method of a string object. It returns a list of words after splitting the string on whitespace.
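
By default, split breaks on whitespace, but you can also hand it an explicit separator, which is handy for simple structured text. A quick sketch:

line = "Schmoe,Joseph,934-292-2390"
print line.split(",")
['Schmoe', 'Joseph', '934-292-2390']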

2.5 Lists

  • A list is a data type that can hold anything.
  • Lists are iterable (you can pass them to a for loop)
  • You can .append, .extend, and otherwise manipulate lists - see the sketch after the example below. Python Documentation
mylist = ['dogs',1,4,"fishes",["hearts","clovers"],list]  
for element in mylist:
    print element    
mylist.reverse()
print mylist
dogs
1
4
fishes
['hearts', 'clovers']
<type 'list'>
[<type 'list'>, ['hearts', 'clovers'], 'fishes', 4, 1, 'dogs']
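
The difference between .append and .extend trips people up: append adds its argument as a single element, while extend adds each element of its argument one by one. A quick sketch:

mylist = [1, 2]
mylist.append([3, 4])   # the whole list goes in as one element
print mylist
mylist = [1, 2]
mylist.extend([3, 4])   # each element is added individually
print mylist
[1, 2, [3, 4]]
[1, 2, 3, 4]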

2.6 Exercise

  1. Write a function that takes in a string, and outputs the square of its length.
  2. Write a function that returns the number of capitalized letters in a string. Hint: try using lower and the == operator
  3. Write a function that returns everything in a string up to "dog", and returns "not found" if "dog" is not present.

2.6.1 Exercise Solutions

  • Exercise 1:
    Write a function that takes in a string, and outputs the square of its length.

    Notice that a function can call another function that you wrote.

    def square(x):
        return x*x
    
    def sqlen(x):
        return square(len(x))
    
    print sqlen("Feet")
    
    16
    
  • Exercise 2
    Write a function that returns the number of capitalized letters in a string.
    def numcaps(x):
        lowerstr = x.lower()
        ncaps = 0
        for i in range(len(x)):
            if lowerstr[i]!=x[i]:
                ncaps += 1
        return ncaps
    
    teststr = "Dogs and Cats are both Animals"
    print teststr, "has", str(numcaps(teststr)), "capital letters"
    
    Dogs and Cats are both Animals has 3 capital letters
    
  • Exercise 3
    def findDog(x):
        mylist = x.split("dog")
        if len(mylist) < 2:
            return "not found"
        else:
            return mylist[0]
    print findDog("i have a dog but not a cat")
    print findDog("i have a fish but not a cat")
    print findDog("i have a dog but not a dogwood")
    
    
    i have a 
    not found
    i have a 
    

2.7 dict type

A dict, short for dictionary, is a helpful data structure in Python for building mappings between inputs and outputs.

http://code.google.com/edu/languages/google-python-class/images/dict.png

2.7.1 Examples

mydict = dict()
mydict["dogs"] = 14
mydict["fish"] = "slumberland"
mydict["dogs"]+= 3
print mydict
{'fish': 'slumberland', 'dogs': 17}
len(mydict["fish"])

One of the nice things about python is that even when very condensed, it is still readable. People talk about coding in a pythonic way, meaning to write very tight, readable code.

print dict([(x, x**2) for x in (2, 4, 6)]) 
{2: 4, 4: 16, 6: 36}

Let's use a dictionary to store word counts from a sentence.

str = "Up and down went the seesaw. Up it went.  Down it went.  Up, up, up!"
print str
for i in [",",".","!"]:
    str = str.replace(i," ")
print str
str = str.lower()
print str
print set(str.lower().split())
Up and down went the seesaw. Up it went.  Down it went.  Up, up, up!
Up and down went the seesaw  Up it went   Down it went   Up  up  up 
up and down went the seesaw  up it went   down it went   up  up  up 
set(['and', 'up', 'it', 'down', 'seesaw', 'went', 'the'])

We see that a set contains an unordered collection of the elements of the list returned by split(). Let's make a dictionary with keys that are pulled from this set.

str = "Up and down went the seesaw. Up it went.  Down it went.  Up, up, up!"
for i in [",",".","!"]:
    str = str.replace(i," ")
words = str.lower().split()
d = dict.fromkeys(set(words),0)
print d
for w in words:
    d[w]+=1
print d
{'and': 0, 'down': 0, 'seesaw': 0, 'went': 0, 'the': 0, 'up': 0, 'it': 0}
{'and': 1, 'down': 2, 'seesaw': 1, 'went': 3, 'the': 1, 'up': 5, 'it': 2}
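
A common alternative that skips the fromkeys step is the get method, which looks up a key but returns a default value if the key is missing. A minimal sketch, assuming the words list from above:

d = dict()
for w in words:
    d[w] = d.get(w, 0) + 1   # start from 0 the first time we see w
print d
{'and': 1, 'down': 2, 'seesaw': 1, 'went': 3, 'the': 1, 'up': 5, 'it': 2}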

2.7.2 Writing to CSV

A very useful feature of dictionaries is that there is an easy method to write them out to a CSV (comma-separated values) file.

import csv
f = open('/tmp/blah.csv','w')
nums = [1,2,3]
c = csv.DictWriter(f,nums)
for i in range(0,10):
    c.writerow(dict([(x, x**i) for x in nums]))
f.close()

This writes out the following csv file:

1,1,1
1,2,3
1,4,9
1,8,27
1,16,81
1,32,243
1,64,729
1,128,2187
1,256,6561
1,512,19683    
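
To read the file back in, the csv module also provides readers. A minimal sketch (note that every value comes back as a string):

import csv
f = open('/tmp/blah.csv','r')
for row in csv.reader(f):
    print row
f.close()
['1', '1', '1']
['1', '2', '3']
['1', '4', '9']
...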

2.7.3 A Note on File Objects

  • Think about file objects like a book
    • If a file is open, you don't want other people to mess with it
    • Files can be opened for reading or writing
    • There are methods to move around an open file
  • Close the book when you're done reading it!
  • Python documentation on "File I/O" is here
English                        | Python                    | Output
Open blah.txt just for reading | f = open('blah.txt','r')  | file object f
Get the next line in a file    | str = f.readline()        | string containing a single line
Get the entire file            | str = f.read()            | string containing entire file
Go to the beginning of a file  | f.seek(0)                 | None
Close blah.txt                 | f.close()                 | None

To play with this, download this file somewhere on your hard drive. I'm putting it on my hard drive as /tmp/gaga.txt. On Windows, it may look more like C:\temp\gaga.txt - just make sure you get the path correct when you tell Python where to look!

f = open('/tmp/gaga.txt','r')
print f
str = f.read()
print "str has length: ", len(str)
str2 = f.read()
print "str2 has length: ", len(str2)
f.seek(0)
str3 = f.readline()
print "str3 has length: ", len(str3)
f.close()
<open file '/tmp/gaga.txt', mode 'r' at 0x10045e8a0>
str has length:  1220
str2 has length:  0
str3 has length:  77

You'll use file objects a lot. As we see them, I'll try to point out what's important about them.
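
One way to make sure the book always gets closed is the with statement, which closes the file automatically when the block ends, even if an error occurs partway through. A quick sketch using the same file:

with open('/tmp/gaga.txt','r') as f:
    print len(f.read())
1220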

2.7.4 Exercise

  • Exercise 1
    Write a function that counts the number of unique letters in a word.
  • Exercise 2
    Write a function that takes in a string, and returns a dict that tells you how many words there are with each number of unique letters.
    "Dogs and cats are all animals"
     dogs and cats are al  animls
     4    3   4    3   2   6
     {2: 1, 3: 2, 4: 2, 6: 1}
    
  • Exercise 3
    Loop over a list of strings, and write a csv that contains a column for each number and a row for each string.
     1,2,3,4,5,6,7,8,9,10,11,12,13
     2,3,2,3,4,5,2,3,2,1 , 0, 0, 0
     5,2,1,0,1,2,0,0,0,0 , 0, 0, 0
     etc.
    

2.7.5 Exercise Solutions

  • Exercise 1
    Write a function that counts the number of unique letters in a word.
    def uniqueletters(w):
        d = dict()
        for char in w:
            d[char] = 1
        return len(d.keys())
    print uniqueletters("dog")
    print uniqueletters("dogged")
    
    
    3
    4
    
  • Exercise 2
    Write a function that takes in a string, and returns a dict that tells you how many words there are with each number of unique letters.
    "Dogs and cats are all animals"
     dogs and cats are al  animls
     4    3   4    3   2   6
     {2: 1, 3: 2, 4: 2, 6: 1}
    
    def uniqueletters(w):
        d = dict()
        for char in w:
            d[char] = 1
        return len(d.keys())
    
    def wordcounter(str):
        d = dict()
        for w in str.split():
            u = uniqueletters(w)
            if u in d.keys():           
                d[u]+=1
            else:
                d[u] = 1
        return d
    
    print wordcounter("Dogs and cats are all animals")
    
    
    {2: 1, 3: 2, 4: 2, 6: 1}
    
  • Exercise 3
    Loop over a list of strings, and write a csv that contains a column for each number and a row for each string.
     1,2,3,4,5,6,7,8,9,10,11,12,13
     2,3,2,3,4,5,2,3,2,1 , 0, 0, 0
     5,2,1,0,1,2,0,0,0,0 , 0, 0, 0
     etc.
    
    import csv
    def uniqueletters(w):
        d = dict()
        for char in w:
            d[char] = 1
        return len(d.keys())
    
    def wordcounter(str):
        d = dict()
        for w in str.split():
            u = uniqueletters(w)
            if u in d.keys():           
                d[u]+=1
            else:
                d[u] = 1
        return d
    
    def listwriter(l):
        emptydict = dict([(x, 0) for x in range(1,26)])
        f = open('/tmp/blah.csv','w')
        c = csv.DictWriter(f,sorted(emptydict.keys())) 
        c.writeheader()    
        for str in l:
            c.writerow(dict(emptydict.items()+wordcounter(str).items()))
        f.close()
    
    listwriter(["Five score years ago, a great American, in whose symbolic shadow we stand today, signed the Emancipation Proclamation.",
                "We observe today not a victory of party, but a celebration of freedom -- symbolizing an end, as well as a beginning -- signifying renewal, as well as change.", 
                "So, first of all, let me assert my firm belief that the only thing we have to fear is fear itself -- nameless, unreasoning, unjustified terror which paralyzes needed efforts to convert retreat into advance."])
    
    

    Here is the resulting CSV file:

    1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25
    1,2,1,2,5,3,0,1,2,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
    5,8,4,1,2,4,3,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
    1,7,6,9,4,2,3,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
    

3 Regular Expressions

Regular expressions are a framework for doing complicated text matching and manipulation.

3.1 A first example

For example, consider the following text:

Joseph Schmoe (15), Phone:(934) 292-2390, SSN:295-48-2311

A first guess for a rule to get the area code would be to find a grouping of three numbers. Let's look at the source code for this in python.

import re  
str = "Joseph Schmoe (15), Phone:(934) 292-2390, SSN:295-48-2311"  
print re.findall("\d\d\d",str)
['934', '292', '239', '295', '231']

3.1.1 What the code does

  • import re
    • tells python to use the regular expression library. (Like library(zelig))
  • str = ...
    • defines a string
    • Python will figure out that the type is a string based on the fact that it's in quotes
    • There is a difference between
      foo = '333'
      

      and

      foo = 333     
      
  • re.findall("\d\d\d",str)
    • From the re library, call the findall function
      • When in doubt, Google it.
        • By the way, googling things effectively is the most important modern research skill there is.
    • Finds all of the matches of the regular expression \d\d\d in str
      • Returns it as a list
import re
str = "Joseph Schmoe (15), Phone:(934) 292-2390, SSN:295-48-2311"  
print re.findall("\((\d\d\d)\)",str)
['934']

3.1.2 Different expressions

"Joseph Schmoe (15), Phone:(934) 292-2390, SSN:295-48-2311"  
English                                                  | Regex        | findall Output
Any three numbers                                        | \d\d\d       | ['934', '292', '239', '295', '231']
Any three numbers that start with (                      | \(\d\d\d     | ['(934']
One or more adjacent numbers                             | \d+          | ['15', '934', '292', '2390', '295', '48', '2311']
One or more numbers in parentheses                       | \(\d+\)      | ['(15)', '(934)']
Three numbers in parentheses                             | \(\d\d\d\)   | ['(934)']
Three numbers in parentheses, but group only the numbers | \((\d\d\d)\) | ['934']

3.2 Further examples

import re
str = "Joseph Schmoe, Bowling High Score:(225), Phone:(934) 292-2390"  
print re.findall("\w+:\((\d+)\)",str)
['225', '934']
  • The \w is code for any alphanumeric character and the underscore.
  • The : is code for only the character :.
import re
str = "I called his phone after he phoned me, but he has two phones!"  
print re.findall("phone\w*",str)
['phone', 'phoned', 'phones']
  • We match all instances of "phone" with any number of characters after it
    • Note the difference between \w+ (1 or more) and \w* (0 or more)
import re
str = "I called his phone after he phoned me, but he has two phones!"  
print re.findall("phone\w+",str)
['phoned', 'phones']

3.3 Other helpful regex tools

Regular expressions are extremely powerful, and are used extensively for text processing. Here are some good places to look for regex help:

  • Python re library has documentation of how to use regex in python with examples
    • I can never remember regex syntax, so I go here all the time.
  • Regexr is an interactive regex checker
  • Textbooks on regex will tell you not just how to use them, but how they are implemented. They help answer the question "what is the best regex for this situation?"

3.4 Exercises

This file contains 100 blogs about dogs in a structured text format that may be familiar to you.

  • Exercise 1
    Use regular expressions to parse this file and write a csv file containing the article number and the number of words. (I'm going to start by downloading it to my hard drive, but if you're macho you'll want to figure out how to use the urllib module to parse it without downloading.)
  • Exercise 2
    Write a CSV file that investigates whether articles contain certain words. In particular, do dog bloggers write more about 'pets' or 'companions'?

3.5 Solutions

  • Exercise 1
    import csv, re
    f = open('/tmp/example.txt')
    fp = open('/tmp/result.csv','wb')
    
    c = csv.DictWriter(fp,["Article Number","Words"]) 
    articlenum = 0
    for line in f:
        d = dict()
        r = re.match("LENGTH:\s*(\d+)",line)
        if r:
            articlenum+=1
            d["Article Number"] = articlenum
            d["Words"] = r.groups()[0]
            c.writerow(d)        
    f.close()
    fp.close()        
    
    

    The result.csv file is:

    1,305
    2,303
    3,425
    4,275
    5,197
    6,615
    7,281
    8,466
    9,692
    10,656
    11,294
    12,674
    13,1455
    14,1454
    15,1063
    16,1066
    17,512
    18,433
    19,294
    20,528
    21,758
    22,497
    23,598
    24,957
    25,163
    26,661
    27,616
    28,521
    29,331
    30,275
    31,266
    32,762
    33,365
    34,781
    35,753
    36,442
    37,1251
    38,462
    39,230
    40,281
    41,564
    42,510
    43,316
    44,1060
    45,402
    46,990
    47,392
    48,536
    49,509
    50,636
    51,973
    52,234
    53,675
    54,416
    55,488
    56,487
    57,546
    58,596
    59,326
    60,312
    61,369
    62,1507
    63,2398
    64,183
    65,1718
    66,280
    67,302
    68,302
    69,1326
    70,549
    71,460
    72,302
    73,288
    74,288
    75,269
    76,308
    77,2241
    78,515
    79,526
    80,320
    81,400
    82,301
    83,302
    84,263
    85,297
    86,300
    87,953
    88,308
    89,1019
    90,787
    91,307
    92,371
    93,512
    94,303
    95,285
    96,302
    97,666
    98,490
    99,551
    100,411
    
  • Exercise 2

    Let's begin just by checking some basic regular expressions.

    import re
    str = "A competition between Pets and Animal Companions!  How do you refer to your dog?"
    print "\w*:"
    print re.findall("\w*",str)
    print "[p]et:"
    print re.findall("[p]et",str)
    print "[pP]et:"
    print re.findall("[pP]et",str)
    
    \w*:
    ['A', '', 'competition', '', 'between', '', 'Pets', '', 'and', '', 'Animal', '', 'Companions', '', '', '', 'How', '', 'do', '', 'you', '', 'refer', '', 'to', '', 'your', '', 'dog', '', '']
    [p]et:
    ['pet']
    [pP]et:
    ['pet', 'Pet']
    

    Great! So we know how to match "pet" or "Pet", but it still matches "competition"! Let's write out some patterns that we would like to match:

    Do Match
    I own a dog - pets are great!
    Do you have a pet?
    Pets are wonderful.
    I've got to tell you–pets are the best!
    Don't Match
    Great competition!
    Petabytes of data are needed.
    I went to the petting zoo with my companion!
    She owns a whippet.

    It looks to me like we need the word "pet" with a space or punctuation at the beginning or the end, with an optional s at the end.

    [-,\s.;][pP]et

    [-,\s.;]   Either a dash, a comma, whitespace, a period, or a semicolon
    [pP]       Either p or P
    et         the letters et
    import re
    strlist = ["I own a dog - pets are great!",  "Do you have a pet?", "Pets are wonderful.", "I've got to tell you--pets are the best!", "Great competition!", "Petabytes of data are needed.", "I went to the petting zoo with my companion!", "She owns a whippet."]
    for str in strlist:
        print str    
        print re.findall("[-,\s.;][pP]et",str)
    
    
    I own a dog - pets are great!
    [' pet']
    Do you have a pet?
    [' pet']
    Pets are wonderful.
    []
    I've got to tell you--pets are the best!
    ['-pet']
    Great competition!
    []
    Petabytes of data are needed.
    []
    I went to the petting zoo with my companion!
    [' pet']
    She owns a whippet.
    []
    

    This isn't good enough! We're going to need to change the endings, too.

    [-,\s.;][pP]et[s]?[.\s.;-]

    [-,\s.;]   Either a dash, a comma, whitespace, a period, or a semicolon
    [pP]       Either p or P
    et         the letters et
    [s]?       an optional s
    [.\s.;-]   Either a period, whitespace, a semicolon, or a dash
    import re
    strlist = ["I own a dog - pets are great!",  "Do you have a pet?", "Pets are wonderful.", "I've got to tell you--pets are the best!", "Great competition!", "Petabytes of data are needed.", "I went to the petting zoo with my companion!", "She owns a whippet."]
    for str in strlist:
        print str    
        print re.findall("[-,\s.;?][pP]et[s]?[,\s.;-?]",str)
    
    
    I own a dog - pets are great!
    [' pets ']
    Do you have a pet?
    [' pet?']
    Pets are wonderful.
    []
    I've got to tell you--pets are the best!
    ['-pets ']
    Great competition!
    []
    Petabytes of data are needed.
    []
    I went to the petting zoo with my companion!
    []
    She owns a whippet.
    []
    

    We're almost there! We just need to make it so a string can also begin with Pets.

    ^[pP]et[s]?[.\s.;-]

    ^          Only match at the beginning of the string
    [pP]       Either p or P
    et         the letters et
    [s]?       an optional s
    [.\s.;-]   Either a period, whitespace, a semicolon, or a dash

    So we will either match the regular expression ^[pP]et[s]?[.\s.;-] or the expression [-,\s.;?][pP]et[s]?[,\s.;-?]. The syntax for this is the pipe operator |.

    Our regular expression just to check for pets is:

    [-,\s.;?][pP]et[s]?[,\s.;-?]|^[pP]et[s]?[,\s.;-?]

    This looks like a sloppy mess, but we built it up by hand ourselves, and it's really not so bad!

    import re
    strlist = ["I own a dog - pets are great!",  "Do you have a pet?", "Pets are wonderful.", "I've got to tell you--pets are the best!", "Great competition!", "Petabytes of data are needed.", "I went to the petting zoo with my companion!", "She owns a whippet."]
    for str in strlist:
        print str    
        print re.findall("[-,\s.;?][pP]et[s]?[,\s.;-?]|^[pP]et[s]?[,\s.;-?]",str)  
    
    I own a dog - pets are great!
    [' pets ']
    Do you have a pet?
    [' pet?']
    Pets are wonderful.
    ['Pets ']
    I've got to tell you--pets are the best!
    ['-pets ']
    Great competition!
    []
    Petabytes of data are needed.
    []
    I went to the petting zoo with my companion!
    []
    She owns a whippet.
    []
    

    Having constructed this regex for pets, we can now do the same for companion. Because the word companion isn't going to be inside other words the way pet is, we don't have to be as careful. Let's say we need to match companion and companions, but not companionship. We can copy the same regex we used for pets, but remove the gunk from the beginning (although including it probably wouldn't hurt).

    Let's try: [cC]ompanion[s]?[,\s.;-?]

    Note: Remember to use re.match to match the beginning of the string only, and re.search to match anywhere!
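
    A quick sketch of the difference:

    import re
    print re.match("pet", "my pet rock")    # None: "pet" is not at the start
    print re.search("pet", "my pet rock")   # a match object, since "pet" appears somewhere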

    import csv, re
    f = open('/tmp/example.txt')
    fp = open('/tmp/pets.csv','wb')
    
    c = csv.DictWriter(fp,["Article Number","Words","Pet","Companion"]) 
    articlenum = 0
    for line in f:
        r = re.match("LENGTH:\s*(\d+)",line)
        if r:
            if articlenum>0:
                c.writerow(d)           
            d = dict()    
            articlenum+=1
            d["Article Number"] = articlenum
            d["Words"] = r.groups()[0]
            d["Pet"] = 0
            d["Companion"] = 0
        else:       
            pets = re.search("[-,\s.;?][pP]et[s]?[,\s.;-?]|^[pP]et[s]?[,\s.;-?]",line)
            companions = re.search("[cC]ompanion[s]?[,\s.;-?]",line)
            if pets:
                d["Pet"] = 1
            if companions:
                d["Companion"] = 1
    
    c.writerow(d)   # write out the final article, which has no LENGTH line after it
    f.close()
    fp.close()  
    
    

    Let's take a look at the csv file.

    1,305,0,0
    2,303,1,0
    3,425,1,0
    4,275,1,0
    5,197,0,0
    6,615,0,0
    7,281,1,1
    8,466,1,0
    9,692,1,0
    10,656,0,0
    11,294,1,0
    12,674,0,0
    13,1455,1,0
    14,1454,1,0
    15,1063,1,0
    16,1066,1,0
    17,512,0,0
    18,433,1,0
    19,294,1,0
    20,528,1,0
    21,758,1,0
    22,497,0,0
    23,598,0,0
    24,957,0,0
    25,163,0,0
    26,661,0,0
    27,616,0,1
    28,521,0,0
    29,331,0,1
    30,275,1,0
    31,266,1,0
    32,762,0,0
    33,365,0,0
    34,781,0,1
    35,753,0,0
    36,442,0,0
    37,1251,0,0
    38,462,0,0
    39,230,0,0
    40,281,0,0
    41,564,0,0
    42,510,1,0
    43,316,1,0
    44,1060,1,1
    45,402,1,0
    46,990,1,0
    47,392,0,0
    48,536,1,0
    49,509,1,0
    50,636,1,0
    51,973,1,0
    52,234,0,0
    53,675,1,0
    54,416,1,0
    55,488,1,0
    56,487,1,0
    57,546,1,0
    58,596,1,0
    59,326,1,0
    60,312,1,0
    61,369,0,0
    62,1507,0,1
    63,2398,1,0
    64,183,1,0
    65,1718,1,0
    66,280,1,0
    67,302,0,0
    68,302,1,0
    69,1326,1,0
    70,549,1,0
    71,460,1,0
    72,302,1,0
    73,288,1,0
    74,288,0,0
    75,269,0,0
    76,308,0,0
    77,2241,0,0
    78,515,1,1
    79,526,0,0
    80,320,1,0
    81,400,0,0
    82,301,1,0
    83,302,1,0
    84,263,1,0
    85,297,1,0
    86,300,0,0
    87,953,0,0
    88,308,1,0
    89,1019,1,0
    90,787,1,0
    91,307,0,0
    92,371,0,0
    93,512,1,0
    94,303,1,0
    95,285,0,0
    96,302,1,0
    97,666,0,0
    98,490,0,0
    99,551,1,1
    

4 Web Sites

4.1 Example: Egypt Independent / المصري اليوم

4.1.1 Aside: "Brittleness"

  • A brittle system is one that breaks easily when things change
  • For example, between early April and late April of 2012, Egypt Independent transitioned from
    http://www.egyptindependent.com/node/725861
    

    to a new URL naming scheme that involves the title:

    http://www.egyptindependent.com/news/european-union-will-keep-mubarak-assets-ice-illicit-gains-authority-head-says
    

    All scrapers are brittle.

    • The assumptions you're forced to make about how information is organized on a given website will not hold forever.
    • In fact, the legality of scraping is not entirely clear, and some sites may not be interested in you hammering their servers!

4.1.2 Metadata

Sometimes, metadata is included which tells us important things about our article

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="msvalidate.01" content="F1F61CF0E5EC4EC2940FCA062AB13A53" />
<meta name="google-site-verification" content="Q8FKHdNoQ2EH7SH1MzwH_JNcgVgMYeCgFnzNlXlR4N0" />
<title>European Union will keep Mubarak assets on ice, Illicit Gains Authority head says | Egypt Independent</title>
<!-- tC490Uh18j-7O_rp7nG0_e6U9QY -->
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link rel="canonical" href="http://www.egyptindependent.com/node/725861" />
<meta name="keywords" content="Assem al-Gohary, corruption, EU, freezing  Mubarak’s assets, Hosni Mubarak, Illicit Gains Authority (IGA), News, Top stories" />
<meta name="description" content="The European Union will continue to freeze the assets of former President Hosni Mubarak, his family and other former officials although Egypt has thus far been unsuccessful in recovering funds siphoned abroad by the regime." />
<meta name="abstract" content="Al-Masry Al-Youm - Egypt&#039;s leading independent media group المصرى اليوم للصحافة والنشر هى مؤسسة إعلامية مصرية مستقلة تأسست عام  ,2003." />
  • Keywords, abstract, description and title are all clear
  • Lots of other gunk that isn't relevant to us!
  • Pulling information out of this document requires that we know how they organize their metadata!
    • What if keywords were called terms?

4.1.3 Body

The actual body of the article can be found by right-clicking on the text we're interested in from Chrome or Firefox and selecting "Inspect Element"

<div class="panel-region-separator"></div><div class="panel-pane pane-node-body" >    
  <div class="pane-content">
    <p>The European Union will continue to freeze the assets of former President Hosni Mubarak, his family and other former officials although Egypt has thus far been unsuccessful in recovering funds siphoned abroad by the regime.</p>
<p>The Illicit Gains Authority (IGA), the judicial committee responsible for recovering the money, on Wednesday received an official notification from the European Union, confirming its freeze on the assets would be renewed another year as of 19 March, state-run MENA news service reported on Wednesday.&nbsp;</p>
<p>&ldquo;This was in response to a request by Egypt,&rdquo; the state news agency quoted IGA head Assem al-Gohary as saying.&nbsp;</p>
<p>Egypt formally asked European Union countries earlier this month to continue freezing funds belonging to Mubarak, his two sons and other members of his administration.</p>
<p>Shortly after Mubarak was forced to step down in February 2011, the public prosecutor ordered that the foreign assets of the deposed president and his family be frozen.</p>
<p>Mubarak&#39;s actual worth is still unknown after more than a year of investigations into his foreign and domestic assets. Last year claims that Mubarak, in his nearly 30-year reign as head of state, may have amassed a fortune of up to US$70 billion &mdash; greater than that of Microsoft&#39;s Bill Gates &mdash; helped drive the protests that eventually brought him down.</p>
<p>Last year Swiss authorities also froze Mubarak&rsquo;s assets, acting more speedily than when the EU froze the assets of another deposed North African ruler, former Tunisian President Zine al-Abidine Ben Ali.</p>
<p>On Wednesday, the IGA met with the Swiss ambassador in Cairo to discuss the difficulties it faces in recovering those funds, in light of the obligations of the United Nations Convention Against Corruption on the member states, reported MENA.</p>
<p>Gohary once estimated the frozen assets at 410 million Swiss francs (LE2.7 billion), which Egypt is trying to repatriate in cooperation with the Foreign Ministry.</p>
  </div>

All of the body is included in the panel-pane pane-node-body section of this site, within the sub-section pane-content. Our "algorithm" for getting this information out will require finding that exact section of the page and pulling the data out of it. If you don't do this, any terms that are on the sidebar will end up in your analysis!

4.1.4 Scraping Articles

Every News Feature is on a page in the following scheme:

http://www.egyptindependent.com/subchannel/News%20features?page=5

And this paper goes back 77 pages, to April, 2009.

Investigating the source for a single search page can tell us what we have to do to get at the relevant information:

<div class="views-row views-row-4 views-row-even">
  <div class="views-field-field-published-date-value">
    <span class="field-content"><span class="date-display-single">09 Feb 2012</span></span>
  </div>
  <div class="views-field-title">
    <span class="field-content"><a href="http://www.egyptindependent.com/node/647936">Parliament Review: A week of comedy and disappointment</a></span>
  </div> 
  <div class="views-field-body">
    <span class="field-content">This week&rsquo;s parliamentary sessions had the public joking about airing future sessions on comedy channels instead of news, and those who abstained from the polls telling those who participated, in hope of having a legitimate authority...</span>
  </div>  

Our algorithm to scrape articles from this page will be as follows:

  1. Initialize FOO=1
  2. Go to http://www.egyptindependent.com/subchannel/News%20features?page=FOO
  3. Repeat until complete:
    1. Find the next occurrence of views-row...
    2. Find the sub-field called views-field-field-published-date-value and retrieve its value (the date)
    3. Find the sub-field called views-field-title and retrieve its value (the title)
    4. Follow the link from above
    5. Within the link, find the meta-data keywords and retrieve their values (the keywords)
    6. Within the link, find the panel-pane pane-node-body section, and retrieve the text (the article itself)

4.1.5 Scraping Exercise!

Not all web sites are designed in the same way. Go to the site of your choice, and figure out how to get the articles you're interested in. Write out pseudocode that will tell you:

  1. How to download individual articles
  2. How to get the Author of an article
  3. How to get the Title of an article
  4. How to get the Date of an article
  5. How to get the text of the article

If you need a site to practice on that isn't too challenging, check out Roger Ebert's Blog.

5 Web scraping in python

Now that we know how we want to scrape and have some grasp on the tools that are necessary, let's try and pull the articles and their metadata off of this website.

import urllib
baseurl = "http://www.egyptindependent.com/subchannel/News%20features?page="
destpath = "/tmp/"
npages = 10 # note: range(1,npages) stops at npages-1, so this fetches pages 1-9
for i in range(1,npages):
    urllib.urlretrieve (baseurl+str(i),destpath+"page"+str(i)+".html")   
Note: Windows users, you may need your destination to be specified using two slashes, e.g. C://Python27//tmp//

If we take a look at what exists after running this script, we can see that it worked.

bash-3.2$ ls /tmp/page*
/tmp/page1.html      /tmp/page3.html /tmp/page5.html /tmp/page7.html /tmp/page9.html
/tmp/page2.html      /tmp/page4.html /tmp/page6.html /tmp/page8.html   

Aside: The os module

If you're doing lots of things in a script that will involve files or paths, but you want it to work cross-platform, consider using the os and os.path modules. Do things like

  • change the current directory
  • get the directory or filename of a file
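
A minimal sketch of the sort of thing these modules can do (the paths here are just examples):

import os, os.path
print os.getcwd()                           # where am I right now?
os.chdir('/tmp')                            # change the current directory
print os.path.dirname('/tmp/page1.html')    # /tmp
print os.path.basename('/tmp/page1.html')   # page1.html
print os.path.join('/tmp', 'page1.html')    # joins with the right separator for your OS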

5.1 Using ElementTree

Here is a very basic html tree which we can work with.

import urllib
fileloc = 'http://www.people.fas.harvard.edu/~astorer/scraping/test.html'
f = urllib.urlopen(fileloc)
print f.read()

<html>
    <head>
        <title>Example page</title>
    </head>
    <body>
        <p>Moved to <a href="http://example.org/">example.org</a>
        or <a href="http://example.com/">example.com</a>.</p>
    </body>
</html>

  • The ElementTree is a hierarchical structure of Elements.
  • list() returns a list of the children of a single Element
  • An Element contains
    • A tag (what kind of element is it)
    • text of what lives in the element
from xml.etree.ElementTree import ElementTree
fileloc = '/Users/astorer/Work/presentations/scraping/test.html'
tree = ElementTree()
tree.parse(fileloc)   
elem = tree.find('body')
print elem
print list(elem)
elem = tree.find('body/p')
print elem
print list(elem)
print elem.tag
print elem.text
<Element 'body' at 0x1004c5310>
[<Element 'p' at 0x1004c5350>]
<Element 'p' at 0x1004c5350>
[<Element 'a' at 0x1004c5390>, <Element 'a' at 0x1004c53d0>]
p
Moved to 
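
ElementTree can also iterate over every element with a given tag, wherever it lives in the tree, which is often simpler than spelling out the full path. A quick sketch using the same test.html (tree.iter is available in Python 2.7; adjust fileloc to wherever you saved the file):

from xml.etree.ElementTree import ElementTree
fileloc = '/tmp/test.html'   # hypothetical location of test.html
tree = ElementTree()
tree.parse(fileloc)
for a in tree.iter('a'):     # every <a> element, at any depth
    print a.get('href'), a.text
http://example.org/ example.org
http://example.com/ example.com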

5.2 Using lxml

Now let's see how we can parse out the list of article URLs from an HTML page. Our basic approach isn't going to work here, and we need to install an external package.

5.2.1 Installing a Package

External packages can be easily installed in Python using the easy_install command. The only challenge is making sure that if you have multiple versions of Python installed, you are installing the libraries to the correct location. I'm on a Mac, but the default Python version on a Mac is 2.6, and I prefer using 2.7. Make sure you install the setuptools for 2.7 following these instructions. Then, run

sudo easy_install-2.7 lxml

If you have no idea what I'm talking about, it will probably be fine if you simply use the following:

sudo easy_install lxml

On windows, try doing

easy_install lxml

To verify that this installed for you, open up python, and type

import lxml

If you get an error, check your setup and try reinstalling.

5.2.2 Using lxml

lxml will generate an ElementTree for us after parsing the xml. Let's review some of the functions that will be useful for us in this example.

English                                                               | Python
Construct a parser                                                    | lxml.etree.HTMLParser()
Parse an HTML file                                                    | lxml.etree.parse(file,parser)
Get all instances of <span class="...">                               | MyTree.xpath('.//span[@class="..."]')
Get all instances of <span class="date"> within <div class="article"> | MyTree.xpath('.//div[@class="article"]/span[@class="date"]')
Make a list of tuples that we can iterate over                        | zip(iterable1,iterable2,...)
Encode a string foo as UTF-8                                          | foo.encode("UTF-8")

The xpath syntax is described in more detail here. Briefly, we are finding every occurrence of spans with the class date-display-single, no matter where they live in the tree. Then we can iterate over them to get the actual dates. Similarly, we can iterate over all links that are within the <span class="field-content"> that are within the <div class="views-field-title">, and zip it with the dates to iterate over both simultaneously. Notice that whenever foreign characters are used, Python may be unable to display them unless we encode the string as UTF-8 first. The following code makes this explicit.

from lxml import etree
fname = '/tmp/page1.html'
fp = open(fname, 'rb')
parser = etree.HTMLParser()
tree   = etree.parse(fp, parser)
dateelems = tree.xpath('.//span[@class="date-display-single"]')
linkelems = tree.xpath('.//div[@class="views-field-title"]/span[@class="field-content"]/a')     
for (d,l) in zip(dateelems,linkelems):
    print d.text
    print l.get('href')         
    print l.text.encode("utf-8")
13 Apr 2012
http://www.egyptindependent.com/news/muslim-brotherhood-returns-street-politics-fills-square
Muslim Brotherhood returns to street politics, fills square
12 Apr 2012
http://www.egyptindependent.com/news/meet-your-presidential-candidate-omar-suleiman-phantom
Meet your presidential candidate: Omar Suleiman, the phantom
11 Apr 2012
http://www.egyptindependent.com/news/administrative-court-ruling-leaves-transition-timetable-disarray
Administrative court ruling leaves transition timetable in disarray
11 Apr 2012
http://www.egyptindependent.com/news/shater-faces-early-hiccups-campaign-trail
Shater faces early hiccups on campaign trail
11 Apr 2012
http://www.egyptindependent.com/news/new-alternatives-may-bolster-moussa%E2%80%99s-chances
New alternatives may bolster Moussa’s chances
09 Apr 2012
http://www.egyptindependent.com/news/profile-kamal-al-helbawy-defector-conscience
Profile: Kamal al-Helbawy, a defector of conscience
09 Apr 2012
http://www.egyptindependent.com/news/suleiman-president-game-changer-or-set-plan
Suleiman for president: Game changer or set plan?
09 Apr 2012
http://www.egyptindependent.com/news/despite-prison-time-revolutionaries-uniform-continue-struggle
Despite prison time, revolutionaries in uniform continue struggle
06 Apr 2012
http://www.egyptindependent.com/news/parliament-review-constitution-crisis-continues-brothers-spread-their-wings
Parliament Review: Constitution crisis continues as Brothers spread their wings
06 Apr 2012
http://www.egyptindependent.com/news/profile-april-6-genealogy-youth-movement
Profile: April 6, genealogy of a youth movement

5.2.3 XPath Examples

  • Get all links under <div class="views-field-title">
    from lxml import etree
    fname = '/tmp/page1.html'
    fp = open(fname, 'rb')
    parser = etree.HTMLParser()
    tree   = etree.parse(fp, parser)
    elems = tree.xpath('.//div[@class="views-field-title"]//a')
    for e in elems:
        print e.text.encode('utf-8')
    
    Muslim Brotherhood returns to street politics, fills square
    Meet your presidential candidate: Omar Suleiman, the phantom
    Administrative court ruling leaves transition timetable in disarray
    Shater faces early hiccups on campaign trail
    New alternatives may bolster Moussa’s chances
    Profile: Kamal al-Helbawy, a defector of conscience
    Suleiman for president: Game changer or set plan?
    Despite prison time, revolutionaries in uniform continue struggle
    Parliament Review: Constitution crisis continues as Brothers spread their wings
    Profile: April 6, genealogy of a youth movement
    Parliament Review: A week of political exclusion and inclusion
    Friday's protest to unite, or further divide
    On campaign trail, Moussa speaks villagers' language
    Elections commission's disqualifications dubious, say experts
    Lawyers blame flawed justice system for acquittal of protesters’ accused killers
    Halloween!
    An Ode to Love
    Letters to Treze
    علشان مننشاش المغربي
    Wedding dance of "Beja" tribe
    Egyptian protester passes out after harassment
    "Pearly Pink Flower"
    I Cry!
    "The Amazing Bibliotheca Alexandrina!"
    am agree  title
    am agree  title
    Divorce between  Margaret Scobey and Mr. Baradei
    
  • Get all clickable images

    These will look like:

    <a href="www.webpage.com"><img src="laksjdasldkj.jpg"></a>
    
    from lxml import etree
    fname = '/tmp/page1.html'
    fp = open(fname, 'rb')
    parser = etree.HTMLParser()
    tree   = etree.parse(fp, parser)
    elems = tree.xpath('.//a/img')
    for e in elems:
        print e.get('src')
    
    /sites/default/files/img/english_logo.png
    http://www.egyptindependent.com//sites/default/files/imagecache/video_thumbnail/video/2011/03/28/22597/lshn_mnnshsh_lmgrby_87049_1558797204.jpg
    http://www.egyptindependent.com//sites/default/files/imagecache/video_thumbnail/video/2010/05/09/3685/wedding_dance_17135_1577232530.jpg
    http://www.egyptindependent.com//sites/default/files/imagecache/video_thumbnail/video/2010/04/14/2252/rabw_14731_451336279.jpg
    http://www.egyptindependent.com//sites/default/files/imagecache/news-featured/photo/2011/11/13/27866/pink_flower.jpg
    http://www.egyptindependent.com//sites/default/files/imagecache/news-featured/photo/2011/11/09/27866/ips.jpg
    http://www.egyptindependent.com//sites/default/files/imagecache/news-featured/photo/2011/10/29/27866/amazing.jpg
    http://www.egyptindependent.com//sites/default/files/imagecache/news-featured/caricature/2010/04/14/120/piioioioioi.jpg
    http://www.egyptindependent.com//sites/default/files/imagecache/news-featured/caricature/2010/04/13/120/wewweweww.jpg
    http://www.egyptindependent.com//sites/default/files/imagecache/news-featured/caricature/2010/04/05/489/separation.jpg
    /sites/default/files/W300.jpg
    

5.2.4 lxml Exercise

Write a csv file that contains every image along with the location that it links to. If the webpage has:

<a href="www.webpage.com"><img src="laksjdasldkj.jpg"></a>

Your entry in the csv file would look like:

www.webpage.com, laksjdasldkj.jpg

Hint: use the elt.getparent() method to query elements 'above' a given element elt.

5.2.5 Solutions to exercise

import csv
from lxml import etree

fname = '/tmp/page1.html'
fp = open(fname, 'rb')
f = open('/tmp/links.csv','w')
entries = ["Image","Link"]
c = csv.DictWriter(f,entries)

parser = etree.HTMLParser()
tree   = etree.parse(fp, parser)
lnkelems = tree.xpath('.//a/img')
for lnk in lnkelems:
    d = dict()
    d["Image"] = lnk.get('src')
    d["Link"] = lnk.getparent().get('href')
    c.writerow(d)

fp.close()
f.close()

The resulting file is a CSV file:

/sites/default/files/img/english_logo.png,/
http://www.egyptindependent.com//sites/default/files/imagecache/video_thumbnail/video/2011/03/28/22597/lshn_mnnshsh_lmgrby_87049_1558797204.jpg,http://www.egyptindependent.com/node/377786
http://www.egyptindependent.com//sites/default/files/imagecache/video_thumbnail/video/2010/05/09/3685/wedding_dance_17135_1577232530.jpg,http://www.egyptindependent.com/node/40195
http://www.egyptindependent.com//sites/default/files/imagecache/video_thumbnail/video/2010/04/14/2252/rabw_14731_451336279.jpg,http://www.egyptindependent.com/node/26345
http://www.egyptindependent.com//sites/default/files/imagecache/news-featured/photo/2011/11/13/27866/pink_flower.jpg,http://www.egyptindependent.com/node/514135
http://www.egyptindependent.com//sites/default/files/imagecache/news-featured/photo/2011/11/09/27866/ips.jpg,http://www.egyptindependent.com/node/513040
http://www.egyptindependent.com//sites/default/files/imagecache/news-featured/photo/2011/10/29/27866/amazing.jpg,http://www.egyptindependent.com/node/509791
http://www.egyptindependent.com//sites/default/files/imagecache/news-featured/caricature/2010/04/14/120/piioioioioi.jpg,http://www.egyptindependent.com/node/26275
http://www.egyptindependent.com//sites/default/files/imagecache/news-featured/caricature/2010/04/13/120/wewweweww.jpg,http://www.egyptindependent.com/node/26269
http://www.egyptindependent.com//sites/default/files/imagecache/news-featured/caricature/2010/04/05/489/separation.jpg,http://www.egyptindependent.com/node/24767
/sites/default/files/W300.jpg,http://www.almasryalyoum.com/en/your-guide

5.2.6 Downloading articles from each page

Goal: A file with the dates, titles and location of each article. Save each article in html form to the hard drive.

from lxml import etree
import csv     
import urllib
import re

f = open('/tmp/files.csv','w')
entries = ["Day","Month","Year","Title","Remote","Local"]
c = csv.DictWriter(f,entries)


destpath = '/tmp/'
fname = '/tmp/page1.html'
fp = open(fname, 'rb')
parser = etree.HTMLParser()
tree   = etree.parse(fp, parser)
dateelems = tree.xpath('.//div[@class="views-field-field-published-date-value"]/span[@class="field-content"]/span[@class="date-display-single"]')
linkelems = tree.xpath('.//div[@class="panel-pane pane-views pane-subchannel-news subchannel-pane"]//div[@class="views-field-title"]/span[@class="field-content"]/a')

for (d,l) in zip(dateelems,linkelems):
    entry = dict()
    myDate = d.text.split()
    urlname = l.get('href')
    print urlname
    entry["Day"] = myDate[0]
    entry["Month"] = myDate[1]
    entry["Year"] = myDate[2]
    remotename = re.match('.*/(.*)',urlname)
    dest = destpath+remotename.group(1)+".html"
    urllib.urlretrieve (urlname,dest)
    entry["Local"] = dest
    entry["Remote"] = urlname
    entry["Title"] = l.text.encode("utf-8")
    c.writerow(entry)
    print entry

f.close()
fp.close()

http://www.egyptindependent.com/news/muslim-brotherhood-returns-street-politics-fills-square
{'Remote': 'http://www.egyptindependent.com/news/muslim-brotherhood-returns-street-politics-fills-square', 'Title': 'Muslim Brotherhood returns to street politics, fills square', 'Month': 'Apr', 'Year': '2012', 'Local': '/tmp/muslim-brotherhood-returns-street-politics-fills-square.html', 'Day': '13'}
http://www.egyptindependent.com/news/meet-your-presidential-candidate-omar-suleiman-phantom
{'Remote': 'http://www.egyptindependent.com/news/meet-your-presidential-candidate-omar-suleiman-phantom', 'Title': 'Meet your presidential candidate: Omar Suleiman, the phantom', 'Month': 'Apr', 'Year': '2012', 'Local': '/tmp/meet-your-presidential-candidate-omar-suleiman-phantom.html', 'Day': '12'}
http://www.egyptindependent.com/news/administrative-court-ruling-leaves-transition-timetable-disarray
{'Remote': 'http://www.egyptindependent.com/news/administrative-court-ruling-leaves-transition-timetable-disarray', 'Title': 'Administrative court ruling leaves transition timetable in disarray', 'Month': 'Apr', 'Year': '2012', 'Local': '/tmp/administrative-court-ruling-leaves-transition-timetable-disarray.html', 'Day': '11'}
http://www.egyptindependent.com/news/shater-faces-early-hiccups-campaign-trail
{'Remote': 'http://www.egyptindependent.com/news/shater-faces-early-hiccups-campaign-trail', 'Title': 'Shater faces early hiccups on campaign trail', 'Month': 'Apr', 'Year': '2012', 'Local': '/tmp/shater-faces-early-hiccups-campaign-trail.html', 'Day': '11'}
http://www.egyptindependent.com/news/new-alternatives-may-bolster-moussa%E2%80%99s-chances
{'Remote': 'http://www.egyptindependent.com/news/new-alternatives-may-bolster-moussa%E2%80%99s-chances', 'Title': 'New alternatives may bolster Moussa\xe2\x80\x99s chances', 'Month': 'Apr', 'Year': '2012', 'Local': '/tmp/new-alternatives-may-bolster-moussa%E2%80%99s-chances.html', 'Day': '11'}
http://www.egyptindependent.com/news/profile-kamal-al-helbawy-defector-conscience
{'Remote': 'http://www.egyptindependent.com/news/profile-kamal-al-helbawy-defector-conscience', 'Title': 'Profile: Kamal al-Helbawy, a defector of conscience', 'Month': 'Apr', 'Year': '2012', 'Local': '/tmp/profile-kamal-al-helbawy-defector-conscience.html', 'Day': '09'}
http://www.egyptindependent.com/news/suleiman-president-game-changer-or-set-plan
{'Remote': 'http://www.egyptindependent.com/news/suleiman-president-game-changer-or-set-plan', 'Title': 'Suleiman for president: Game changer or set plan?', 'Month': 'Apr', 'Year': '2012', 'Local': '/tmp/suleiman-president-game-changer-or-set-plan.html', 'Day': '09'}
http://www.egyptindependent.com/news/despite-prison-time-revolutionaries-uniform-continue-struggle
{'Remote': 'http://www.egyptindependent.com/news/despite-prison-time-revolutionaries-uniform-continue-struggle', 'Title': 'Despite prison time, revolutionaries in uniform continue struggle', 'Month': 'Apr', 'Year': '2012', 'Local': '/tmp/despite-prison-time-revolutionaries-uniform-continue-struggle.html', 'Day': '09'}
http://www.egyptindependent.com/news/parliament-review-constitution-crisis-continues-brothers-spread-their-wings
{'Remote': 'http://www.egyptindependent.com/news/parliament-review-constitution-crisis-continues-brothers-spread-their-wings', 'Title': 'Parliament Review: Constitution crisis continues as Brothers spread their wings', 'Month': 'Apr', 'Year': '2012', 'Local': '/tmp/parliament-review-constitution-crisis-continues-brothers-spread-their-wings.html', 'Day': '06'}
http://www.egyptindependent.com/news/profile-april-6-genealogy-youth-movement
{'Remote': 'http://www.egyptindependent.com/news/profile-april-6-genealogy-youth-movement', 'Title': 'Profile: April 6, genealogy of a youth movement', 'Month': 'Apr', 'Year': '2012', 'Local': '/tmp/profile-april-6-genealogy-youth-movement.html', 'Day': '06'}

The resulting file is a CSV file:

13,Apr,2012,"Muslim Brotherhood returns to street politics, fills square",http://www.egyptindependent.com/news/muslim-brotherhood-returns-street-politics-fills-square,/tmp/muslim-brotherhood-returns-street-politics-fills-square.html
12,Apr,2012,"Meet your presidential candidate: Omar Suleiman, the phantom",http://www.egyptindependent.com/news/meet-your-presidential-candidate-omar-suleiman-phantom,/tmp/meet-your-presidential-candidate-omar-suleiman-phantom.html
11,Apr,2012,Administrative court ruling leaves transition timetable in disarray,http://www.egyptindependent.com/news/administrative-court-ruling-leaves-transition-timetable-disarray,/tmp/administrative-court-ruling-leaves-transition-timetable-disarray.html
11,Apr,2012,Shater faces early hiccups on campaign trail,http://www.egyptindependent.com/news/shater-faces-early-hiccups-campaign-trail,/tmp/shater-faces-early-hiccups-campaign-trail.html
11,Apr,2012,New alternatives may bolster Moussa’s chances,http://www.egyptindependent.com/news/new-alternatives-may-bolster-moussa%E2%80%99s-chances,/tmp/new-alternatives-may-bolster-moussa%E2%80%99s-chances.html
09,Apr,2012,"Profile: Kamal al-Helbawy, a defector of conscience",http://www.egyptindependent.com/news/profile-kamal-al-helbawy-defector-conscience,/tmp/profile-kamal-al-helbawy-defector-conscience.html
09,Apr,2012,Suleiman for president: Game changer or set plan?,http://www.egyptindependent.com/news/suleiman-president-game-changer-or-set-plan,/tmp/suleiman-president-game-changer-or-set-plan.html
09,Apr,2012,"Despite prison time, revolutionaries in uniform continue struggle",http://www.egyptindependent.com/news/despite-prison-time-revolutionaries-uniform-continue-struggle,/tmp/despite-prison-time-revolutionaries-uniform-continue-struggle.html
06,Apr,2012,Parliament Review: Constitution crisis continues as Brothers spread their wings,http://www.egyptindependent.com/news/parliament-review-constitution-crisis-continues-brothers-spread-their-wings,/tmp/parliament-review-constitution-crisis-continues-brothers-spread-their-wings.html
06,Apr,2012,"Profile: April 6, genealogy of a youth movement",http://www.egyptindependent.com/news/profile-april-6-genealogy-youth-movement,/tmp/profile-april-6-genealogy-youth-movement.html

5.2.7 Exercise

Modify the above code so that instead of iterating over only the first page, it iterates over all pages.

  • Consider using the glob library to look for all of the html files in a directory.
  • Can you do this so you don't save the pages, but parse them directly?
    • Use google and the python documentation to help figure it out!
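
If you want to check your approach, here is a minimal sketch of both ideas, reusing the glob pattern and base URL from above:

import glob, urllib
from lxml import etree

parser = etree.HTMLParser()

# Option 1: loop over the pages already saved on disk
for fname in glob.glob('/tmp/page*.html'):
    tree = etree.parse(open(fname, 'rb'), parser)
    # ...pull out dates and links as before...

# Option 2: parse a page straight from the web, without saving it
baseurl = "http://www.egyptindependent.com/subchannel/News%20features?page="
tree = etree.parse(urllib.urlopen(baseurl + "1"), parser)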

    Now that we've seen lxml in action, let's figure out how to use it to pull out just the text of the article. Recall that all of the original text is in the following tags:

    <div class="panel-pane pane-node-body" >    
    <div class="pane-content">
    

5.3 Stripping text

Can be included if there's interest!

5.4 Parallelization to increase speed

Can be included if there's interest!

Written by Alex Storer (support@help.hmdc.harvard.edu) for IQSS - Spring, 2012