Python: A Hands-on Introduction

1 Goals

  • Get a working python environment installed on your own computer
  • Make python seem less scary
  • Understand some of the differences between python and other languages
  • Understand what screen scraping is all about
  • Learn the tools to scrape web sites (and other structured text) effectively

This may take some time! Depending on how quickly we go, it could take a number of sessions - my intention is to play it by ear and see what we need to focus on.

1.1 Who am I?

My name is Alex Storer, and I'm part of the Data Science Services team at IQSS. I have a PhD in Computational Neuroscience, and have done a lot of programming and scripting to interact with data.

Our team can help you with your research questions, both with the statistics and the technology. If you want to chat with us, simply e-mail support@help.hmdc.harvard.edu.

1.2 What is this page?

This is a tutorial that I wrote using org-mode in emacs. It is hosted here:

http://www.people.fas.harvard.edu/~astorer/scraping/scraping.html

You can always find details about our ongoing workshops here:

http://dss.iq.harvard.edu

2 Basic Python

Python is a powerful interpreted language that people often use for scraping. We'll highlight here a few of the most helpful features for understanding Python code and writing scrapers. This is by no means a complete or thorough introduction to Python! It's just enough to get by.

2.1 Installation

Python comes in two modern flavors, version 2 and version 3. There are some important language differences between them, and in practice, almost everyone uses version 2. To install it, go to the Python downloads page (python.org) and select the relevant operating system.

2.1.1 IDE

An IDE, or Integrated Development Environment, is used to facilitate programming. A good IDE does things like code highlighting, error checking, one-click-running, and easy integration across multiple files. An example of a crappy IDE is notepad. I like to use emacs. Most people prefer something else.

2.1.2 Wing IDE 101

For this session, I recommend Wing 101. It's a free version of a more fully-featured IDE, but for beginners, it's perfect. If you don't already have an IDE that you're invested in, or you want your intro to python to be as painless as possible, you should install it. It's cross platform.

  • Getting Started in Wing
    Once you have Wing installed, you might want to use the tutorial to learn how to navigate around in it.

    [Image ./img/tutorial.jpg: Opening the tutorial in Wing 101.]

2.2 Further Python Resources

But wait, I want to spend four months becoming a Python guru!

Dude, you're awesome. Here are some resources that will help you:

2.3 Diving In

In Wing, there is a window open called the Python Shell.

  • If you know R, think of this just like the R command line
  • If you've never programmed before, think of this as a graphing calculator
print 2+4
6

2.3.1 Basic Text Handling

  • Of course, this graphing calculator can handle text, too!
mystr = "Hello, World!"
print mystr
print len(mystr)
Hello, World!
13
Python Code              R Code                     English Translation
print 2+4                print(2+4)                 Print the value of 2+4
mystr = 'Hello World'    mystr <- 'Hello World'     Assign the string "Hello World" to the variable mystr
len(mystr)               nchar(mystr)               How "long" is the variable mystr? Note: R can tell you how long it is, but if you want the number of characters, that's what you need to ask for.

Note to Stata Users:
Assigning a variable is not the same as adding a "column" to your dataset.

2.3.2 Indexing and Slicing

Get the first element of a string.

  • Note: Python counts from 0. This is a common convention in most languages constructed by computer scientists.
mystr = "Dogs in outer space"
print mystr[0]
D

Get the last element of a string

mystr = "Dogs in outer space"
print mystr[-1]
print mystr[len(mystr)-1]
e
e
mystr = "Dogs in outer space"
print mystr[1:3]
print mystr[3:]
print mystr[:-3]
og
s in outer space
Dogs in outer sp
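
A slice like mystr[a:b] starts at index a and stops just before index b; leaving out either end means "from the beginning" or "to the end". A small sketch with the same string:

mystr = "Dogs in outer space"
print mystr[0:4]    # characters 0 through 3: "Dogs"
print mystr[8:]     # from index 8 to the end: "outer space"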

2.3.3 Including Other Packages

  • By default, python doesn't include every possible "package"
    • This is similar to R, but unlike Matlab
    • Use the import statement to load a library
import math
print math.sin(math.pi)
1.22464679915e-16

After we import from a package, we have to access sub-elements of that package using the . operator. Notice also that while the value 1.22464679915e-16 is very nearly 0, the math module doesn't know that sin(π) = 0. There are smarter modules for doing math in Python, like scipy and numpy. Some people love using Python for Math. I think it makes more sense to use R.

  • If you want to import something into your namespace
    • from math import <myfunction> or
    • from math import *
from math import *
print sin(pi)
1.22464679915e-16

2.3.4 Objects and methods

Python makes extensive use of objects. An object has

  • Methods: functions that work only on that type of object
  • Fields: data that only that type of object has

For example, let's imagine a fruit object. A fruit might have a field called hasPeel, which tells you whether this fruit is peeled. It could also have a method called peel, which alters the state of the fruit.
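
Here is a minimal sketch of what such a fruit object might look like as a Python class (the class and names are invented purely for illustration; def, used to define the methods, is covered a little further below):

class Fruit:
    def __init__(self):
        self.hasPeel = True      # a field: data stored on this object
    def peel(self):
        self.hasPeel = False     # a method: a function that changes the object's state

banana = Fruit()
print banana.hasPeel             # True
banana.peel()
print banana.hasPeel             # False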

str = "THE World is A BIG and BEAUTIFUL place.  "
print str.upper()
name = "Alex Storer"
print name.swapcase()
THE WORLD IS A BIG AND BEAUTIFUL PLACE.  
aLEX sTORER

Here we defined two strings, str and name, and used these to invoke string methods which affect the case of the string.

  • You can write your own objects and methods
  • Objects can be sub-classes of other objects
    • e.g., a psychologist is a type of researcher, who does everything a researcher does but also some other things only a psychologist does.

2.3.5 Defining Functions

You can write your own functions, pieces of code that can be used to take specific inputs and give outputs. You can create a function by using the def command.

def square(x):
    return x*x
print square(9)
81

Pay close attention to the whitespace that is used in Python! Unlike other languages, it is not ignored. Everything with the same indentation is in the same level. Above, the statement return x*x is part of the square function, but the following line is outside of the function definition.
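
As a small illustration of how indentation defines what is inside the function (the names here are just for the example):

def greet(name):
    msg = "Hello, " + name       # indented: part of the function
    return msg                   # still part of the function
print greet("Python")            # not indented: outside the function definition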

2.3.6 Logical Flow

[Image ./img/decision-tree.png: The xkcd guide to writing good code]

You can think about this logical process as being in pseudocode.

IF do things right
   ---> code well
OTHERWISE
   ---> do things fast

A lot of programming is figuring out how to fit things into this sort of if/else structure. Let's look at an example in Python.

  • The method find returns the index of the first location of a string match
mystr = "This is one cool looking string!"
if mystr.find("string")>len(mystr)/2:
    print "The word 'string' is in the second half"
else:
    print "The word 'string is not in the second half"
The word 'string' is in the second half

What happens if the word "string" is not there at all?

  • The method find returns -1 if the string isn't found
mystr = "I don't know about you, but I only use velcro."
print mystr.find("string")
if mystr.find("string")>len(mystr)/2:
    print "The word 'string' is in the second half"
elif mystr.find("string")>=0:
    print "The word 'string is not in the second half"
else:
    print "The word 'string' isn't there!"
-1
The word 'string' isn't there!
  • Important Note: In Python, almost everything evaluates to True. Exceptions include 0, None, empty strings, and empty containers. This means that you can say things like if (result) where the result may be a computation, a string search, or anything like that. As long as it evaluates to True, it will work!
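
For example, here is a small sketch of using a result directly as a condition. One caveat: find returns the index of the match, and an index of 0 (a match at the very start of the string) also evaluates to False.

mystr = "dogs and cats"
if mystr.find("cats"):
    print "'cats' is in there"
# careful: 'dogs' is found at index 0, and 0 evaluates to False
if mystr.find("dogs"):
    print "this never prints, even though 'dogs' is there"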

2.3.7 Review

  • if, elif and else can be used to control the flow of a program
  • strings are a type of a object, and have a number of methods that come with them, including find, upper and swapcase
    • methods are called using mystring.method()
    • The list of methods for strings can be found in the Python documentation
  • def can be used to define a function
    • The return statement determines what the function returns

2.4 For Loops

The for loop is a major component of how python is used. You can iterate over lots of different things, and python is smart enough to know how to do it.

  • Note: the following is what's called pseudocode - something that looks like code, but isn't going to run. It's a helpful way to clarify the steps that you need to take to get things to work.
for (item in container):
    process item
    print item
print "done processing items!"

Notice the use of the <TAB> (or spacing) - that's how python knows whether we're inside the loop or not!

2.4.1 Example

str = "Daddy ran to help Ann.  Up and down went the seesaw."      
for word in str.split():
    print word
Daddy
ran
to
help
Ann.
Up
and
down
went
the
seesaw.

Notice the use of str.split(): this is an example of calling a method of a string object. It returns a list of words after splitting the string on whitespace.

2.5 Lists

  • A list is a data type that can hold anything.
  • Lists are iterable (you can pass them to a for loop)
  • You can .append, .extend, and otherwise manipulate lists (see the Python documentation)
mylist = ['dogs',1,4,"fishes",["hearts","clovers"],list]  
for element in mylist:
    print element    
mylist.reverse()
print mylist
dogs
1
4
fishes
['hearts', 'clovers']
<type 'list'>
[<type 'list'>, ['hearts', 'clovers'], 'fishes', 4, 1, 'dogs']
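
A quick sketch of the difference between append (add one item) and extend (add each item of another list):

mylist = ['dogs', 'cats']
mylist.append(['fish', 'birds'])    # adds the whole list as a single element
print mylist                        # ['dogs', 'cats', ['fish', 'birds']]
mylist = ['dogs', 'cats']
mylist.extend(['fish', 'birds'])    # adds each element individually
print mylist                        # ['dogs', 'cats', 'fish', 'birds']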

2.6 Exercise

  1. Write a function that takes in a string, and outputs the square of its length.
  2. Write a function that returns the number of capitalized letters in a string. Hint: try using lower() and the == operator
  3. Write a function that returns everything in a string up to "dog", and returns "not found" if the string is not present.

2.6.1 Exercise Solutions

  • Exercise 1:
    Write a function that takes in a string, and outputs the square of its length.

    Notice that a function can call another function that you wrote.

    def square(x):
        return x*x
    
    def sqlen(x):
        return square(len(x))
    
    print sqlen("Feet")
    
    16
    
  • Exercise 2
    Write a function that returns the number of capitalized letters in a string.
    def numcaps(x):
        lowerstr = x.lower()
        ncaps = 0
        for i in range(len(x)):
            if lowerstr[i]!=x[i]:
                ncaps += 1
        return ncaps
    
    teststr = "Dogs and Cats are both Animals"
    print teststr, "has", str(numcaps(teststr)), "capital letters"
    
    Dogs and Cats are both Animals has 3 capital letters
    
  • Exercise 3
    def findDog(x):
        mylist = x.split("dog")
        if len(mylist) < 2:
            return "not found"
        else:
            return mylist[0]    
        return mylist
    print findDog("i have a dog but not a cat")
    print findDog("i have a fish but not a cat")
    print findDog("i have a dog but not a dogwood")
    
    
    i have a 
    not found
    i have a 
    

2.7 dict type

A dict, short for dictionary, is a helpful data structure in Python for building mappings between inputs and outputs.

http://code.google.com/edu/languages/google-python-class/images/dict.png

2.7.1 Examples

mydict = dict()
mydict["dogs"] = 14
mydict["fish"] = "slumberland"
mydict["dogs"]+= 3
print mydict
{'fish': 'slumberland', 'dogs': 17}
len(mydict["fish"])

Let's use a dictionary to store word counts from a sentence.

str = "Up and down went the seesaw. Up it went.  Down it went.  Up, up, up!"
print str
for i in [",",".","!"]:
    str = str.replace(i," ")
print str
str = str.lower()
print str
print set(str.lower().split())
Up and down went the seesaw. Up it went.  Down it went.  Up, up, up!
Up and down went the seesaw  Up it went   Down it went   Up  up  up 
up and down went the seesaw  up it went   down it went   up  up  up 
set(['and', 'up', 'it', 'down', 'seesaw', 'went', 'the'])

We see that a set contains an unordered collection of the elements of the list returned by split(). Let's make a dictionary with keys that are pulled from this set.

str = "Up and down went the seesaw. Up it went.  Down it went.  Up, up, up!"
for i in [",",".","!"]:
    str = str.replace(i," ")
words = str.lower().split()
d = dict.fromkeys(set(words),0)
print d
for w in words:
    d[w]+=1
print d
{'and': 0, 'down': 0, 'seesaw': 0, 'went': 0, 'the': 0, 'up': 0, 'it': 0}
{'and': 1, 'down': 2, 'seesaw': 1, 'went': 3, 'the': 1, 'up': 5, 'it': 2}
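
An alternative sketch that skips the pre-initialization step by using the get method of a dict, which returns a default value when a key is missing:

str = "Up and down went the seesaw. Up it went.  Down it went.  Up, up, up!"
for i in [",",".","!"]:
    str = str.replace(i," ")
d = dict()
for w in str.lower().split():
    d[w] = d.get(w, 0) + 1    # 0 if we haven't seen w yet
print d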

2.7.2 Writing to CSV

A very useful feature of dictionaries is that the csv module's DictWriter makes it easy to write them out to a CSV (comma-separated values) file.

import csv
f = open('blah.csv','w')
nums = [1,2,3]
c = csv.DictWriter(f,nums)
for i in range(0,10):
    d = dict()
    for x in nums:
        d[x] = x**i
    c.writerow(d)
f.close()

This writes out the following csv file:

1,1,1
1,2,3
1,4,9
1,8,27
1,16,81
1,32,243
1,64,729
1,128,2187
1,256,6561
1,512,19683    

A more concise way to construct this dictionary is to use a list comprehension, which lets us make a list in one line:

print [(x, x**2) for x in range(0,10)]
[(0, 0), (1, 1), (2, 4), (3, 9), (4, 16), (5, 25), (6, 36), (7, 49), (8, 64), (9, 81)]

We can then make a dict out of this list of tuples:

print dict([(x, x**2) for x in range(0,10)])
{0: 0, 1: 1, 2: 4, 3: 9, 4: 16, 5: 25, 6: 36, 7: 49, 8: 64, 9: 81}

Finally, we can construct the entire CSV file as we did earlier:

import csv
f = open('blah.csv','w')
nums = [1,2,3]
c = csv.DictWriter(f,nums)
for i in range(0,10):
    c.writerow(dict([(x, x**i) for x in nums]))
f.close()
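
If you also want a header row with the column names, DictWriter has a writeheader method (available in Python 2.7 and later); a minimal sketch of the same loop with a header:

import csv
f = open('blah.csv','w')
nums = [1,2,3]
c = csv.DictWriter(f,nums)
c.writeheader()                  # writes "1,2,3" as the first row
for i in range(0,10):
    c.writerow(dict([(x, x**i) for x in nums]))
f.close()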

2.7.3 A Note on File Objects

  • Think about file objects like a book
    • If a file is open, you don't want other people to mess with it
    • Files can be opened for reading or writing
    • There are methods to move around an open file
  • Close the book when you're done reading it!
  • Python documentation on "File I/O" is here
English                          Python                     Output
Open blah.txt just for reading   f = open('blah.txt','r')   file object f
Get the next line in a file      str = f.readline()         string containing a single line
Get the entire file              str = f.read()             string containing entire file
Go to the beginning of a file    f.seek(0)                  None
Close blah.txt                   f.close()                  None

To play with this, download this file somewhere on your hard drive. I'm putting it on my hard drive as gaga.txt. On Windows, it may look more like C:\temp\gaga.txt - just make sure you get the path correct when you tell Python where to look!

f = open('gaga.txt','r')
print f
str = f.read()
print "str has length: ", len(str)
str2 = f.read()
print "str2 has length: ", len(str2)
f.seek(0)
str3 = f.readline()
print "str3 has length: ", len(str3)
f.close()

You'll use file objects a lot. As we see them, I'll try to point out what's important about them.
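
Python also has a with statement that closes the file for you automatically when the block ends, which saves you from forgetting f.close(); a small sketch using the same gaga.txt:

with open('gaga.txt','r') as f:
    str = f.readline()
print "str has length: ", len(str)
# f is closed automatically here, even if an error occurred inside the block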

2.7.4 Exercise

  • Exercise 1
    Write a function that counts the number of unique letters in a word.
  • Exercise 2
    Write a function that takes in a string, and returns a dict that tells you how many words there are with each number of unique letters.
    "Dogs and cats are all animals"
     dogs and cats are al  animls
     4    3   4    3   2   6
     {2: 1, 3: 2, 4: 2, 6: 1}
    
  • Exercise 3
    Write a function that takes as input a list of strings, and for each string writes a row to a csv file that shows how many words of N unique letters there are.

    For example:

     listwriter(["Five score years ago, a great American, in whose symbolic shadow we stand today, signed the Emancipation Proclamation.",
                 "We observe today not a victory of party, but a celebration of freedom -- symbolizing an end, as well as a beginning -- signifying renewal, as well as change.", 
                 "So, first of all, let me assert my firm belief that the only thing we have to fear is fear itself -- nameless, unreasoning, unjustified terror which paralyzes needed efforts to convert retreat into advance."])
    

    And our output file should look something like:

     1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25
     1,2,1,2,5,3,0,1,2,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
     5,8,4,1,2,4,3,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
     1,7,6,9,4,2,3,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
    

    You should use csv.DictWriter.

2.7.5 Exercise Solutions

  • Exercise 1
    Write a function that counts the number of unique letters in a word.
    def uniqueletters(w):
        d = dict()
        for char in w:
            d[char] = 1
        return len(d.keys())
    print uniqueletters("dog")
    print uniqueletters("dogged")
    
    
    3
    4
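
    A shorter alternative uses Python's built-in set type, which keeps only the unique elements of a sequence (a small equivalent sketch):

    def uniqueletters(w):
        return len(set(w))

    print uniqueletters("dogged")   # prints 4, just like above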
    
  • Exercise 2
    Write a function that takes in a string, and returns a dict that tells you how many words there are with each number of unique letters.
    "Dogs and cats are all animals"
     dogs and cats are al  animls
     4    3   4    3   2   6
     {2: 1, 3: 2, 4: 2, 6: 1}
    
    def uniqueletters(w):
        d = dict()
        for char in w:
            d[char] = 1
        return len(d.keys())
    
    def wordcounter(str):
        d = dict()
        for w in str.split():
            u = uniqueletters(w)
            if u in d.keys():           
                d[u]+=1
            else:
                d[u] = 1
        return d
    
    print wordcounter("Dogs and cats are all animals")
    
    
    {2: 1, 3: 2, 4: 2, 6: 1}
    
  • Exercise 3
    Write a function that takes as input a list of strings and writes a csv file that contains a column for each number of unique letters and a row for each string.
     1,2,3,4,5,6,7,8,9,10,11,12,13
     2,3,2,3,4,5,2,3,2,1,0,0,0
     5,2,1,0,1,2,0,0,0,0,0,0,0
     etc.
    
    import csv
    def uniqueletters(w):
        d = dict()
        for char in w:
            d[char] = 1
        return len(d.keys())
    
    def wordcounter(str):
        d = dict()
        for w in str.split():
            u = uniqueletters(w)
            if u in d.keys():           
                d[u]+=1
            else:
                d[u] = 1
        return d
    
    def listwriter(l):
        f = open('blah.csv','w')
        c = csv.DictWriter(f,range(1,27)) 
        c.writeheader()    
        for str in l:
            partialdict = wordcounter(str)
            fulldict = dict.fromkeys(range(1,27),0)  
            for k in partialdict.keys():
                fulldict[k] = partialdict[k]
            c.writerow(fulldict)
        f.close()
    
    listwriter(["Five score years ago, a great American, in whose symbolic shadow we stand today, signed the Emancipation Proclamation.",
                "We observe today not a victory of party, but a celebration of freedom -- symbolizing an end, as well as a beginning -- signifying renewal, as well as change.", 
                "So, first of all, let me assert my firm belief that the only thing we have to fear is fear itself -- nameless, unreasoning, unjustified terror which paralyzes needed efforts to convert retreat into advance."])
    
    

    The call to listwriter above produces blah.csv with a header row (the numbers 1 through 26) followed by one row of unique-letter counts for each input string, as in the expected output shown in the exercise.
    

3 Regular Expressions

Regular expressions are a framework for doing complicated manipulation on text.

3.1 A first example

For example, consider the following text:

Joseph Schmoe (15), Phone:(934) 292-2390, SSN:295-48-2311

A first guess for a rule to get the area code would be to find a grouping of three numbers. Let's look at the source code for this in python.

import re  
str = "Joseph Schmoe (15), Phone:(934) 292-2390, SSN:295-48-2311"  
print re.findall("\d\d\d",str)
['934', '292', '239', '295', '231']

3.1.1 What the code does

  • import re
    • tells python to use the regular expression library. (Like library(zelig))
  • str = ...
    • defines a string
    • Python will figure out that the type is a string based on the fact that it's in quotes
    • There is a difference between
      foo = '333'
      

      and

      foo = 333     
      
  • re.findall("\d\d\d",str)
    • From the re library, call the findall function
      • When in doubt, Google it.
        • By the way, googling things effectively is the most important modern research skill there is.
    • Finds all of the matches of the regular expression \d\d\d in str
      • Returns them as a list
import re
str = "Joseph Schmoe (15), Phone:(934) 292-2390, SSN:295-48-2311"  
print re.findall("\((\d\d\d)\)",str)
['934']

3.1.2 Different expressions

"Joseph Schmoe (15), Phone:(934) 292-2390, SSN:295-48-2311"  
English                                                    Regex          findall Output
Any three numbers                                          \d\d\d         ['934', '292', '239', '295', '231']
Any three numbers that start with (                        \(\d\d\d       ['(934']
One or more adjacent numbers                               \d+            ['15', '934', '292', '2390', '295', '48', '2311']
One or more numbers in parenthesis                         \(\d+\)        ['(15)', '(934)']
Three numbers in parenthesis                               \(\d\d\d\)     ['(934)']
Three numbers in parenthesis, but group only the numbers   \((\d\d\d)\)   ['934']

3.2 Further examples

import re
str = "Joseph Schmoe, Bowling High Score:(225), Phone:(934) 292-2390"  
print re.findall("\w+:\((\d+)\)",str)
['225', '934']
  • The \w is code for any alphanumeric character and the underscore.
  • The : is code for only the character :.
import re
str = "I called his phone after he phoned me, but he has two phones!"  
print re.findall("phone\w*",str)
['phone', 'phoned', 'phones']
  • We match all instances of "phone" with any number of characters after it
    • Note the difference between \w+ (1 or more) and \w* (0 or more)
import re
str = "I called his phone after he phoned me, but he has two phones!"  
print re.findall("phone\w+",str)
['phoned', 'phones']

3.3 Other helpful regex tools

Regular expressions are extremely powerful, and are used extensively for text processing. Here are some good places to look for regex help:

  • Python re library has documentation of how to use regex in python with examples
    • I can never remember regex syntax, so I go here all the time.
  • Regexr is an interactive regex checker
  • Textbooks on regex will tell you not just how to use them, but how they are implemented, which helps answer the question "what is the best regex for this situation?"

3.4 Exercises

This file contains 100 blogs about dogs in a structured text format that may be familiar to you.

  • Exercise 1
    Use regular expressions to parse this file and write a csv file containing the article number and the number of words. (I'm going to start by downloading it to my hard drive, but if you're macho, you'll want to figure out how to use the urllib module to parse it without downloading.)
  • Exercise 2
    Write a CSV file that investigates whether articles contain certain words. In particular, do dog bloggers write more about 'pets' or 'companions'?

3.5 Solutions

  • Exercise 1
    import csv, re
    f = open('example.txt')
    fp = open('result.csv','wb')
    
    c = csv.DictWriter(fp,["Article Number","Words"]) 
    articlenum = 0
    for line in f:
        d = dict()
        r = re.match("LENGTH:\s*(\d+)",line)
        if r:
            articlenum+=1
            d["Article Number"] = articlenum
            d["Words"] = r.groups()[0]
            c.writerow(d)        
    f.close()
    fp.close()        
    
    

    The result.csv file is:

    1,305
    2,303
    3,425
    4,275
    5,197
    6,615
    7,281
    8,466
    9,692
    10,656
    11,294
    12,674
    13,1455
    14,1454
    15,1063
    16,1066
    17,512
    18,433
    19,294
    20,528
    21,758
    22,497
    23,598
    24,957
    25,163
    26,661
    27,616
    28,521
    29,331
    30,275
    31,266
    32,762
    33,365
    34,781
    35,753
    36,442
    37,1251
    38,462
    39,230
    40,281
    41,564
    42,510
    43,316
    44,1060
    45,402
    46,990
    47,392
    48,536
    49,509
    50,636
    51,973
    52,234
    53,675
    54,416
    55,488
    56,487
    57,546
    58,596
    59,326
    60,312
    61,369
    62,1507
    63,2398
    64,183
    65,1718
    66,280
    67,302
    68,302
    69,1326
    70,549
    71,460
    72,302
    73,288
    74,288
    75,269
    76,308
    77,2241
    78,515
    79,526
    80,320
    81,400
    82,301
    83,302
    84,263
    85,297
    86,300
    87,953
    88,308
    89,1019
    90,787
    91,307
    92,371
    93,512
    94,303
    95,285
    96,302
    97,666
    98,490
    99,551
    100,411
    
  • Exercise 2

    Let's begin just by checking some basic regular expressions

    import re
    str = "A competition between Pets and Animal Companions!  How do you refer to your dog?"
    print "\w*:"
    print re.findall("\w*",str)
    print "[p]et:"
    print re.findall("[p]et",str)
    print "[pP]et:"
    print re.findall("[pP]et",str)
    
    \w*:
    ['A', '', 'competition', '', 'between', '', 'Pets', '', 'and', '', 'Animal', '', 'Companions', '', '', '', 'How', '', 'do', '', 'you', '', 'refer', '', 'to', '', 'your', '', 'dog', '', '']
    [p]et:
    ['pet']
    [pP]et:
    ['pet', 'Pet']
    

    Great! So we know how to match "pet" or "Pet", but it still matches "competition"! Let's write out some patterns that we would like to match:

    Do Match
    I own a dog - pets are great!
    Do you have a pet?
    Pets are wonderful.
    I've got to tell you–pets are the best!
    Don't Match
    Great competition!
    Petabytes of data are needed.
    I went to the petting zoo with my companion!
    She owns a whippet.

    It looks to me like we need the word "pet" with a space or punctuation at the beginning or the end, with an optional s at the end.

    [-,\s.;][pP]et

    [-,\s.;]   Either a dash, a comma, whitespace, a period, or a semicolon
    [pP]       Either p or P
    et         the letters et
    import re
    strlist = ["I own a dog - pets are great!",  "Do you have a pet?", "Pets are wonderful.", "I've got to tell you--pets are the best!", "Great competition!", "Petabytes of data are needed.", "I went to the petting zoo with my companion!", "She owns a whippet."]
    for str in strlist:
        print str    
        print re.findall("[-,\s.;][pP]et",str)
    
    
    I own a dog - pets are great!
    [' pet']
    Do you have a pet?
    [' pet']
    Pets are wonderful.
    []
    I've got to tell you--pets are the best!
    ['-pet']
    Great competition!
    []
    Petabytes of data are needed.
    []
    I went to the petting zoo with my companion!
    [' pet']
    She owns a whippet.
    []
    

    This isn't good enough! We're going to need to change the endings, too.

    [-,\s.;?][pP]et[s]?[,\s.;-?]

    [-,\s.;?]   Either a dash, a comma, whitespace, a period, a semicolon, or a question mark
    [pP]        Either p or P
    et          the letters et
    [s]?        an optional s
    [,\s.;-?]   Either a comma, whitespace, a period, a semicolon, a dash, or a question mark
    import re
    strlist = ["I own a dog - pets are great!",  "Do you have a pet?", "Pets are wonderful.", "I've got to tell you--pets are the best!", "Great competition!", "Petabytes of data are needed.", "I went to the petting zoo with my companion!", "She owns a whippet."]
    for str in strlist:
        print str    
        print re.findall("[-,\s.;?][pP]et[s]?[,\s.;-?]",str)
    
    
    I own a dog - pets are great!
    [' pets ']
    Do you have a pet?
    [' pet?']
    Pets are wonderful.
    []
    I've got to tell you--pets are the best!
    ['-pets ']
    Great competition!
    []
    Petabytes of data are needed.
    []
    I went to the petting zoo with my companion!
    []
    She owns a whippet.
    []
    

    We're almost there! We just need to make it so a string can also begin with Pets.

    ^[pP]et[s]?[,\s.;-?]

    ^           Only match at the beginning of the string
    [pP]        Either p or P
    et          the letters et
    [s]?        an optional s
    [,\s.;-?]   Either a comma, whitespace, a period, a semicolon, a dash, or a question mark

    So we will either match the regular expression ^[pP]et[s]?[,\s.;-?] or the expression [-,\s.;?][pP]et[s]?[,\s.;-?]. The syntax for this is the pipe operator |.

    Our regular expression just to check for pets is:

    [-,\s.;?][pP]et[s]?[,\s.;-?]|^[pP]et[s]?[,\s.;-?]

    This looks like a sloppy mess, but we built it up by hand ourselves, and it's really not so bad!

    import re
    strlist = ["I own a dog - pets are great!",  "Do you have a pet?", "Pets are wonderful.", "I've got to tell you--pets are the best!", "Great competition!", "Petabytes of data are needed.", "I went to the petting zoo with my companion!", "She owns a whippet."]
    for str in strlist:
        print str    
        print re.findall("[-,\s.;?][pP]et[s]?[,\s.;-?]|^[pP]et[s]?[,\s.;-?]",str)  
    
    I own a dog - pets are great!
    [' pets ']
    Do you have a pet?
    [' pet?']
    Pets are wonderful.
    ['Pets ']
    I've got to tell you--pets are the best!
    ['-pets ']
    Great competition!
    []
    Petabytes of data are needed.
    []
    I went to the petting zoo with my companion!
    []
    She owns a whippet.
    []
    

    Having constructed this regex for pets, we can now do the same for companion. Because the word companion isn't going to be inside words the way pet is, we don't have to be as careful. Let's say we need to match companion and companions, but not companionship. We can copy the same regex for pets, but remove the gunk from the beginning (although it probably can't hurt for correctness to include it!)

    Let's try: [cC]ompanion[s]?[,\s.;-?]

    Note: Remember to use re.match to match the beginning of the string only, and re.search to match anywhere!
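
    A tiny sketch of that difference (re.match anchors at the start of the string; re.search scans the whole string):

    import re
    print re.match("pet", "pets are great")      # a match object: the string starts with "pet"
    print re.match("pet", "my pet is great")     # None: "pet" is not at the beginning
    print re.search("pet", "my pet is great")    # a match object: found anywhere in the string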

    import csv, re
    f = open('example.txt')
    fp = open('pets.csv','wb')
    
    c = csv.DictWriter(fp,["Article Number","Words","Pet","Companion"]) 
    articlenum = 0
    for line in f:
        r = re.match("LENGTH:\s*(\d+)",line)
        if r:
            if articlenum>0:
                c.writerow(d)           
            d = dict()    
            articlenum+=1
            d["Article Number"] = articlenum
            d["Words"] = r.groups()[0]
            d["Pet"] = 0
            d["Companion"] = 0
        else:       
            pets = re.search("[-,\s.;?][pP]et[s]?[,\s.;-?]|^[pP]et[s]?[,\s.;-?]",line)
            companions = re.search("[cC]ompanion[s]?[,\s.;-?]",line)
            if pets:
                d["Pet"] = 1
            if companions:
                d["Companion"] = 1
    
    f.close()
    fp.close()  
    
    

    Let's take a look at the csv file.

    1,305,0,0
    2,303,1,0
    3,425,1,0
    4,275,1,0
    5,197,0,0
    6,615,0,0
    7,281,1,1
    8,466,1,0
    9,692,1,0
    10,656,0,0
    11,294,1,0
    12,674,0,0
    13,1455,1,0
    14,1454,1,0
    15,1063,1,0
    16,1066,1,0
    17,512,0,0
    18,433,1,0
    19,294,1,0
    20,528,1,0
    21,758,1,0
    22,497,0,0
    23,598,0,0
    24,957,0,0
    25,163,0,0
    26,661,0,0
    27,616,0,1
    28,521,0,0
    29,331,0,1
    30,275,1,0
    31,266,1,0
    32,762,0,0
    33,365,0,0
    34,781,0,1
    35,753,0,0
    36,442,0,0
    37,1251,0,0
    38,462,0,0
    39,230,0,0
    40,281,0,0
    41,564,0,0
    42,510,1,0
    43,316,1,0
    44,1060,1,1
    45,402,1,0
    46,990,1,0
    47,392,0,0
    48,536,1,0
    49,509,1,0
    50,636,1,0
    51,973,1,0
    52,234,0,0
    53,675,1,0
    54,416,1,0
    55,488,1,0
    56,487,1,0
    57,546,1,0
    58,596,1,0
    59,326,1,0
    60,312,1,0
    61,369,0,0
    62,1507,0,1
    63,2398,1,0
    64,183,1,0
    65,1718,1,0
    66,280,1,0
    67,302,0,0
    68,302,1,0
    69,1326,1,0
    70,549,1,0
    71,460,1,0
    72,302,1,0
    73,288,1,0
    74,288,0,0
    75,269,0,0
    76,308,0,0
    77,2241,0,0
    78,515,1,1
    79,526,0,0
    80,320,1,0
    81,400,0,0
    82,301,1,0
    83,302,1,0
    84,263,1,0
    85,297,1,0
    86,300,0,0
    87,953,0,0
    88,308,1,0
    89,1019,1,0
    90,787,1,0
    91,307,0,0
    92,371,0,0
    93,512,1,0
    94,303,1,0
    95,285,0,0
    96,302,1,0
    97,666,0,0
    98,490,0,0
    99,551,1,1
    

4 Web Sites

4.1 Example: Egypt Independent / المصري اليوم

4.1.1 Aside: "Brittleness"

  • A brittle system is one that is not resistant to change
  • For example, between early April and late April of 2012, Egypt Independent transitioned from
    http://www.egyptindependent.com/node/725861
    

    to a new URL naming scheme that involves the title:

    http://www.egyptindependent.com/news/european-union-will-keep-mubarak-assets-ice-illicit-gains-authority-head-says
    

    All scrapers are brittle.

    • The assumptions you're forced to make about how information is organized on a given website will not hold forever.
    • In fact, the legality of scraping is not entirely clear, and some sites may not be interested in you hammering their servers!

4.1.2 Metadata

Sometimes, metadata is included which tells us important things about our article

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="msvalidate.01" content="F1F61CF0E5EC4EC2940FCA062AB13A53" />
<meta name="google-site-verification" content="Q8FKHdNoQ2EH7SH1MzwH_JNcgVgMYeCgFnzNlXlR4N0" />
<title>European Union will keep Mubarak assets on ice, Illicit Gains Authority head says | Egypt Independent</title>
<!-- tC490Uh18j-7O_rp7nG0_e6U9QY -->
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link rel="canonical" href="http://www.egyptindependent.com/node/725861" />
<meta name="keywords" content="Assem al-Gohary, corruption, EU, freezing  Mubarak’s assets, Hosni Mubarak, Illicit Gains Authority (IGA), News, Top stories" />
<meta name="description" content="The European Union will continue to freeze the assets of former President Hosni Mubarak, his family and other former officials although Egypt has thus far been unsuccessful in recovering funds siphoned abroad by the regime." />
<meta name="abstract" content="Al-Masry Al-Youm - Egypt&#039;s leading independent media group المصرى اليوم للصحافة والنشر هى مؤسسة إعلامية مصرية مستقلة تأسست عام  ,2003." />
  • Keywords, abstract, description and title are all clear
  • Lots of other gunk that isn't relevant to us!
  • Pulling information out of this document requires that we know how they organize their metadata!
    • What if keywords were called terms?

4.1.3 Body

The actual body of the article can be found by right-clicking on the text we're interested in, in Chrome or Firefox, and selecting "Inspect Element".

<div class="panel-region-separator"></div><div class="panel-pane pane-node-body" >    
  <div class="pane-content">
    <p>The European Union will continue to freeze the assets of former President Hosni Mubarak, his family and other former officials although Egypt has thus far been unsuccessful in recovering funds siphoned abroad by the regime.</p>
<p>The Illicit Gains Authority (IGA), the judicial committee responsible for recovering the money, on Wednesday received an official notification from the European Union, confirming its freeze on the assets would be renewed another year as of 19 March, state-run MENA news service reported on Wednesday.&nbsp;</p>
<p>&ldquo;This was in response to a request by Egypt,&rdquo; the state news agency quoted IGA head Assem al-Gohary as saying.&nbsp;</p>
<p>Egypt formally asked European Union countries earlier this month to continue freezing funds belonging to Mubarak, his two sons and other members of his administration.</p>
<p>Shortly after Mubarak was forced to step down in February 2011, the public prosecutor ordered that the foreign assets of the deposed president and his family be frozen.</p>
<p>Mubarak&#39;s actual worth is still unknown after more than a year of investigations into his foreign and domestic assets. Last year claims that Mubarak, in his nearly 30-year reign as head of state, may have amassed a fortune of up to US$70 billion &mdash; greater than that of Microsoft&#39;s Bill Gates &mdash; helped drive the protests that eventually brought him down.</p>
<p>Last year Swiss authorities also froze Mubarak&rsquo;s assets, acting more speedily than when the EU froze the assets of another deposed North African ruler, former Tunisian President Zine al-Abidine Ben Ali.</p>
<p>On Wednesday, the IGA met with the Swiss ambassador in Cairo to discuss the difficulties it faces in recovering those funds, in light of the obligations of the United Nations Convention Against Corruption on the member states, reported MENA.</p>
<p>Gohary once estimated the frozen assets at 410 million Swiss francs (LE2.7 billion), which Egypt is trying to repatriate in cooperation with the Foreign Ministry.</p>
  </div>

All of the body is included in the panel-pane pane-node-body section of this site, within the sub-section pane-content. Our "algorithm" for getting this information out will require finding this exact section of the page and pulling the data out from there. If you don't do this, any terms that are on the sidebar will end up in your analysis!

4.1.4 Scraping Articles

Every News Feature is on a page in the following scheme:

http://www.egyptindependent.com/subchannel/News%20features?page=5

And this paper goes back 77 pages, to April, 2009.

Investigating the source for a single search page can tell us what we have to do to get at the relevant information:

<div class="views-row views-row-4 views-row-even">
  <div class="views-field-field-published-date-value">
    <span class="field-content"><span class="date-display-single">09 Feb 2012</span></span>
  </div>
  <div class="views-field-title">
    <span class="field-content"><a href="http://www.egyptindependent.com/node/647936">Parliament Review: A week of comedy and disappointment</a></span>
  </div> 
  <div class="views-field-body">
    <span class="field-content">This week&rsquo;s parliamentary sessions had the public joking about airing future sessions on comedy channels instead of news, and those who abstained from the polls telling those who participated, in hope of having a legitimate authority...</span>
  </div>  

Our algorithm to scrape articles from this page will be as follows:

  1. Initialize FOO=1
  2. Go to http://www.egyptindependent.com/subchannel/News%20features?page=FOO
  3. Repeat until complete:
    1. Find the next occurrence of views-row...
    2. Find the sub-field called views-field-field-published-date-value and retrieve its value (the date)
    3. Find the sub-field called views-field-title and retrieve its value (the title)
    4. Follow the link from above
    5. Within the link, find the meta-data keywords and retrieve their values (the keywords)
    6. Within the link, find the panel-pane pane-node-body section, and retrieve the text (the article itself)

4.1.5 Scraping Exercise!

Not all web sites are designed in the same way. Go to the site of your choice, and figure out how to get the articles you're interested in. Write out pseudocode that will tell you:

  1. How to download individual articles
  2. How to get the Author of an article
  3. How to get the Title of an article
  4. How to get the Date of an article
  5. How to get the text of the article

If you need a site to practice on that isn't too challenging, check out Roger Ebert's blog.

5 Web scraping technology

There are two major ways to actually get our data from the web. One is Selenium, which pops up a real browser that you can control from Python. The other is to fetch and parse the page's raw HTML text with a Python library - the most popular are BeautifulSoup and lxml.

                             Selenium              BeautifulSoup   lxml
Can be run from a terminal   No                    Yes             Yes
Speed                        Slow                  Fast            Very Fast
JavaScript                   Rendered in browser   Not supported   Not supported
XPath                        Supported             Not supported   Supported
Installation                 Easy                  Easy            Challenging (Mac)

Now that we know how we want to scrape and have some grasp on the tools that are necessary, let's try and pull the articles and their metadata off of this website.

5.1 Package installation

Unfortunately, although Python is platform independent, installing modules can vary a lot depending on whether you use Unix/Mac/Windows. The basic goal is to go to the terminal and type this:

easy_install selenium

And the selenium package should be installed correctly for you.

  • Mac

    You shouldn't need to install anything new or change the path on a Mac. The only concern is that if you've installed a newer version of Python than the one that came with your Mac, you need to use the correct version of easy_install.

    If you get an error about permissions, try the following instead:

    sudo easy_install selenium

5.2 Running on a schedule

A lot of web scrapers are designed to run every day, to harvest new information. Once you start thinking about this genre of Python script, you need to think about System Administration in addition to writing Python code. Essentially, you need a computer that is on at the same time every day that has a job scheduler on it to run given scripts.

I would recommend using Amazon's EC2 cloud hosting, along with the cron Linux utility to schedule when your Python script will run. Describing this setup is outside the scope of this discussion, but it is not so hard, and we can help you if you decide it's a service you need.

5.3 Using an API

An API, or an Application Programming Interface, is a way for web sites and other services to let you use their data in a controlled way.

Whenever an API is available, you should use it rather than scraping.

Here are examples of things you can do with an API:

  • Get the friends of a given Twitter user
  • Get the image of a Google Street View camera at a specific lat/lon coordinate
  • Get Amazon's price for Oreo cookies
  • Get demographic data for a neighborhood from Zillow
  • Get the text transcript of the latest XKCD comic

The Sunlight Foundation has an API for looking for requests from the Congressional Library, along with an online class for learning how to use it!

http://www.codecademy.com/courses/python-intermediate-en-D56TP?curriculum_id=50e5d11681c3a77e29002f95
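
To give a flavor of what this looks like from Python, here is a minimal sketch that requests a JSON response from a purely hypothetical API endpoint and parses it; real APIs document their own URLs, parameters, and (often) authentication requirements:

import urllib2
import json

# hypothetical URL - substitute the endpoint documented by the API you are using
url = "http://api.example.com/v1/friends?user=somebody"
response = urllib2.urlopen(url)
data = json.loads(response.read())   # many APIs return JSON, which parses into dicts and lists
print data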

5.4 Web scraping: selenium

5.4.1 Getting Links

from selenium import webdriver  
import time

browser = webdriver.Firefox()
thisurl = 'http://www.egyptindependent.com/subchannel/News%20features'
browser.get(thisurl)

time.sleep(10)
nextpage = [False]
all_links = []

while len(nextpage)>0:
    if nextpage[0]:
        nextpage[0].click()
        time.sleep(10)
    elems = browser.find_elements_by_xpath("//div[@class='view-content']/h3/a")
    for e in elems:
        all_links.append(e.get_attribute('href'))
    nextpage = browser.find_elements_by_xpath("//li[@class='pager-next last']/a")

Let's go through this code in some more detail:

We begin by importing the necessary libraries, and then starting a new Firefox browser. The browser.get() command navigates this browser to a given URL.

from selenium import webdriver  
import time

browser = webdriver.Firefox()
thisurl = 'http://www.egyptindependent.com/subchannel/News%20features'
browser.get(thisurl)

Unfortunately, there is no clear way to tell if the page is done loading. The easiest strategy is to just wait a long-ish amount of time (I choose 10 seconds). We will use a list to hold the URLs of the pages we want to download, focusing now on just building that list.

We also want to know if we need to click on the next page. We will keep these results in a list which will normally contain the clickable browser element for the next page. If it's empty, we're done, and if it says False, we're at the beginning.

time.sleep(10)
nextpage = [False]
all_links = []

As long as there is a next page, we will click on it, and then get the links to news articles. The elements can be found by their XPath - think of it as an address in a tree:

//div[@class='foo']/h3/a

This will find any <a> tags, but only if they are directly beneath an <h3> tag which is directly beneath a <div class="foo"> tag located anywhere in the document.

while len(nextpage)>0:
    if nextpage[0]:
        nextpage[0].click()
        time.sleep(10)
    elems = browser.find_elements_by_xpath("//div[@class='view-content']/h3/a")

Links in HTML are shaped like this:

<a href='http://www.google.com'>My favorite search engine</a>

The href property shows where the link goes to.

For each element in our list of matching a tags, we will get the href attribute and append it to our list.

Finally, we will find the clickable link to the next page, again using the XPath syntax.

for e in elems:
    all_links.append(e.get_attribute('href'))
nextpage = browser.find_elements_by_xpath("//li[@class='pager-next last']/a")

5.4.2 Parsing Contents

Once we have the list of links, it's time to go through them and organize the parts of the data that we care about. We'll make a dict for each page, which will contain information on the author, the content of the page's main article, and the number of tweets.

We'll still use the XPath syntax to pull out our information, which we need to get by inspecting the elements of the pages. Unfortunately, the Twitter part of this is included in what's called an iframe, which is basically another web page, embedded in the original page. We need to find this iframe and then switch to it before using the XPath here. This is done with the switch_to_frame method:

twitterbox = browser.find_elements_by_xpath("//iframe[@class='twitter-share-button twitter-count-horizontal']")[0]
browser.switch_to_frame(twitterbox)

Here is the full code for collecting the information in a list of dictionaries.

alldata = []

for url in all_links:
    d = dict()
    d['url'] = url
    browser.get(url)
    time.sleep(10)
    textelems = browser.find_elements_by_xpath("//div[@class='panel-pane pane-node-body']")
    d['articletext'] = textelems[0].text
    authorelems = browser.find_elements_by_xpath("//div[@class='field field-type-nodereference field-field-source']")
    d['author'] = authorelems[0].text
    twitterbox = browser.find_elements_by_xpath("//iframe[@class='twitter-share-button twitter-count-horizontal']")[0]
    browser.switch_to_frame(twitterbox)
    twitterelems = browser.find_elements_by_xpath("//html")
    d['tweets'] = twitterelems[0].find_element_by_id('count').text
    alldata.append(d)

5.4.3 Saving our scraped information

We will save this as JSON, which stands for JavaScript Object Notation. It is a human-readable and concise format for saving information. Here is an example of what JSON looks like:

[{"url": "http://www.egyptindependent.com/news/final-issue-triumph-practice", "articletext": "When I signed my contract with Al-Masry Al-Youm in April four years ago, I was troubled by the thought of committing full time to a job in journalism....", "tweets": "161", "author": "Lina Attalah"}]

Python can save and read these JSON objects, and other programs can as well. JSON is a popular emerging format, and you can use it in, e.g., R:

http://cran.r-project.org/web/packages/rjson/rjson.pdf

import json
f = open('egypt.json','w')
json.dump(alldata,f)
f.close()
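
Reading the data back in later is just as easy; here is a small sketch using json.load:

import json
f = open('egypt.json','r')
alldata = json.load(f)    # the list of dictionaries we saved above
f.close()
print len(alldata)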

5.4.4 Exercise!

Here is the White House blog:

http://www.whitehouse.gov/blog

Download the blogs from the most recent three pages. Save the text of the article, the Author, the date published, the time published and the list of "Related Topics".

5.4.5 Solution

from selenium import webdriver  
import time
import re

browser = webdriver.Firefox()
thisurl = 'http://www.whitehouse.gov/blog'
browser.get(thisurl)

time.sleep(10)
nextpage = [False]
all_links = []
numpages = 0
while numpages<3:
    numpages = numpages+1
    if nextpage[0]:
        nextpage[0].click()
        time.sleep(10)
    elems = browser.find_elements_by_xpath("//div[@class='blog-home-title']//a")
    for e in elems:
        all_links.append(e.get_attribute('href'))
    nextpage = browser.find_elements_by_xpath("//li[@class='pager-next last']//a")

alldata = []

for url in all_links:
    d = dict()
    d['url'] = url
    browser.get(url)
    time.sleep(5)
    titleelems = browser.find_elements_by_xpath("//div[@id='content']//h2")
    d['articletitle'] = titleelems[0].text

    textelems = browser.find_elements_by_xpath("//div[@class='content-inner']")
    d['articletext'] = textelems[0].text
    authordateelems = browser.find_elements_by_xpath("//div[@class='post-info-user']")
    infotext = authordateelems[0].text.split('\n')
    d['author'] = infotext[0]
    d['date'] = infotext[1]
    d['time'] = infotext[2]    
    alldata.append(d)

5.4.6 Further Topics

  • Handling time-outs
  • Error handling
  • Using the Chrome webdriver and the XPath copier

5.5 Web scraping: lxml

Now that we've seen selenium in action, let's try to pull the articles and their metadata off of the same website using lxml.


import urllib
baseurl = "http://www.egyptindependent.com/subchannel/News%20features?page="
destpath = ""
npages = 10  # range(1,npages) below fetches pages 1 through 9
for i in range(1,npages):
    urllib.urlretrieve (baseurl+str(i),destpath+"page"+str(i)+".html")   

Note: Windows users, you may need to write your destination path with doubled backslashes (or forward slashes), e.g. C:\\Python27\\tmp\\

If we take a look at what exists after running this script, we can see that it worked.

bash-3.2$ ls /tmp/page*
/tmp/page1.html      /tmp/page3.html /tmp/page5.html /tmp/page7.html /tmp/page9.html
/tmp/page2.html      /tmp/page4.html /tmp/page6.html /tmp/page8.html   

Aside: The os module

If you're doing lots of things in a script that will involve files or paths, but you want it to work cross-platform, consider using the os and os.path modules. Do things like

  • change the current directory
  • get the directory or filename of a file
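
A small sketch of the kind of thing these modules do (the paths are just examples):

import os
print os.getcwd()                            # the current working directory
os.chdir(os.getcwd())                        # change the current directory (here, a harmless no-op)
print os.path.dirname('/tmp/page1.html')     # '/tmp'
print os.path.basename('/tmp/page1.html')    # 'page1.html'
print os.path.join('/tmp', 'page1.html')     # joins path pieces with the right separator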

5.5.1 Using ElementTree

Here is a very basic html tree which we can work with.

import urllib
fileloc = 'http://www.people.fas.harvard.edu/~astorer/scraping/test.html'
f = urllib.urlopen(fileloc)
print f.read()

<html>
    <head>
        <title>Example page</title>
    </head>
    <body>
        <p>Moved to <a href="http://example.org/">example.org</a>
        or <a href="http://example.com/">example.com</a>.</p>
    </body>
</html>

  • The ElementTree is a hierarchical structure of Elements.
  • list() returns a list of the children of a single Element
  • An Element contains
    • A tag (what kind of element is it)
    • text of what lives in the element
from xml.etree.ElementTree import ElementTree
fileloc = '/Users/astorer/Work/presentations/scraping/test.html'
tree = ElementTree()
tree.parse(fileloc)   
elem = tree.find('body')
print elem
print list(elem)
elem = tree.find('body/p')
print elem
print list(elem)
print elem.tag
print elem.text
<Element 'body' at 0x1004dcc10>
[<Element 'p' at 0x1004dcc50>]
<Element 'p' at 0x1004dcc50>
[<Element 'a' at 0x1004dcc90>, <Element 'a' at 0x1004dccd0>]
p
Moved to 

5.5.2 Using lxml

Now let's see how we can parse out the list of article URLs from an xml page. Our basic approach isn't going to work here, and we need to install an external package.

  • Installing a Package

    External python packages can be easily installed using the easy_install command from the terminal.

    Note: One challenge is in making sure that if you have multiple versions of Python installed, you are installing the libraries to the correct location. I'm on a mac, but the Python version on my mac is 2.6, and I prefer using 2.7. Make sure you install the setuptools for 2.7 following these instructions.

    The lxml package is a little more complicated to install than other packages. Normally, typing easy_install packagename is sufficient to install a package, but because lxml depends on routines written in C, it needs a few extra tools.

    The most up to date instructions for installing lxml are online here:

    http://lxml.de/installation.html

    To verify that this installed for you, open up python, and type

    import lxml
    

    If you get an error, check your setup and try reinstalling.

  • Using lxml

    lxml will generate an ElementTree for us after parsing the xml. Let's review some of the functions that will be useful for us in this example.

    English                                                                  Python
    Construct a parser                                                       lxml.etree.HTMLParser()
    Parse an HTML file                                                       lxml.etree.parse(file,parser)
    Get all instances of <span class="...">                                  MyTree.xpath('.//span[@class="..."]')
    Get all instances of <span class="date"> within <div class="article">    MyTree.xpath('.//div[@class="article"]/span[@class="date"]')
    Make a list of tuples that we can iterate over                           zip(iterable1,iterable2,...)
    Encode a string foo as unicode (UTF-8)                                   foo.encode("UTF-8")

    The xpath syntax is described in more detail here. Briefly, we are finding every occurrence of spans with the class date-display-single, no matter where they live in the tree. Then we can iterate over them to get the actual dates. Similarly, we can iterate over all links that are within the <span class="field content"> that are within the <div class="views-field-title"> and zip them with the dates to iterate over both simultaneously. Notice that whenever foreign characters are used, Python may be unable to display them unless we first encode the string as UTF-8. The following code makes this explicit.


    from lxml import etree
    fname = 'page1.html'
    fp = open(fname, 'rb')
    parser = etree.HTMLParser()
    tree   = etree.parse(fp, parser)
    dateelems = tree.xpath('.//span[@class="date-display-single"]')
    linkelems = tree.xpath('.//div[@class="views-field-title"]/span[@class="field-content"]/a')     
    for (d,l) in zip(dateelems,linkelems):
        print d.text
        print l.get('href')         
        print l.text.encode("utf-8")
    
  • XPath Examples
    • Get all links under <div class="views-field-title">
      from lxml import etree
      fname = 'page1.html'
      fp = open(fname, 'rb')
      parser = etree.HTMLParser()
      tree   = etree.parse(fp, parser)
      elems = tree.xpath('.//div[@class="views-field-title"]//a')
      for e in elems:
          print e.text.encode('utf-8')
      
    • Get all clickable images

      These will look like:

      <a href="www.webpage.com"><img src="laksjdasldkj.jpg"></a>
      
      from lxml import etree
      fname = 'page1.html'
      fp = open(fname, 'rb')
      parser = etree.HTMLParser()
      tree   = etree.parse(fp, parser)
      elems = tree.xpath('.//a/img')
      for e in elems:
          print e.get('src')
      
      /sites/default/files/W300.jpg
      
  • lxml Exercise
    Write a csv file that contains every image along with the location that it links to. If the webpage has:
    <a href="www.webpage.com"><img src="laksjdasldkj.jpg"></a>
    

    Your entry in the csv file would look like:

    www.webpage.com, laksjdasldkj.jpg
    

    Hint: use the elt.getparent() method to query elements 'above' a given element elt.

  • Solutions to exercise
    import csv
    from lxml import etree
    
    fname = 'page1.html'
    fp = open(fname, 'rb')
    f = open('links.csv','w')
    entries = ["Image","Link"]
    c = csv.DictWriter(f,entries)
    
    parser = etree.HTMLParser()
    tree   = etree.parse(fp, parser)
    lnkelems = tree.xpath('.//a/img')
    for lnk in lnkelems:
        d = dict()
        d["Image"] = lnk.get('src')
        d["Link"] = lnk.getparent().get('href')
        c.writerow(d)
    
    fp.close()
    f.close()
    
    

    The resulting file is a CSV file:

    /sites/default/files/W300.jpg,http://www.almasryalyoum.com/en/your-guide
    
  • Downloading articles from each page
    Goal: A file with the dates, titles and location of each article. Save each article in html form to the hard drive.
    from lxml import etree
    import csv     
    import urllib
    import re
    
    f = open('files.csv','w')
    entries = ["Day","Month","Year","Title","Remote","Local"]
    c = csv.DictWriter(f,entries)
    
    
    destpath = ''
    fname = 'page1.html'
    fp = open(fname, 'rb')
    parser = etree.HTMLParser()
    tree   = etree.parse(fp, parser)
    dateelems = tree.xpath('.//div[@class="views-field-field-published-date-value"]/span[@class="field-content"]/span[@class="date-display-single"]')
    linkelems = tree.xpath('.//div[@class="panel-pane pane-views pane-subchannel-news subchannel-pane"]//div[@class="views-field-title"]/span[@class="field-content"]/a')
    
    for (d,l) in zip(dateelems,linkelems):
        entry = dict()
        myDate = d.text.split()
        urlname = l.get('href')
        print urlname
        entry["Day"] = myDate[0]
        entry["Month"] = myDate[1]
        entry["Year"] = myDate[2]
        remotename = re.match('.*/(.*)',urlname)
        dest = destpath+remotename.group(1)+".html"
        urllib.urlretrieve (urlname,dest)
        entry["Local"] = dest
        entry["Remote"] = urlname
        entry["Title"] = l.text.encode("utf-8")
        c.writerow(entry)
        print entry
    
    f.close()
    fp.close()
    
    

    The resulting file is a CSV file with one row per article, and each article is saved locally in HTML form.

  • Exercise

    Modify the above code so that instead of iterating over only the first page, it iterates over all pages.

    • Consider using the glob library to look for all of the html files in a directory.
    • Can you do this so you don't save the pages, but parse them directly?
      • Use google and the python documentation to help figure it out!

      Now that we've seen lxml in action, let's figure out how to use it to pull out just the text of the article. Recall that all of the original text is in the following tags:

      <div class="panel-pane pane-node-body" >    
      <div class="pane-content">
      

5.5.3 Stripping text

Can be included if there's interest!

5.5.4 Parallelization to increase speed

Can be included if there's interest!

Date: June, 2013

Author: Alex Storer
