A couple of weeks ago David Ziegler posted an article on how to extract excerpts from articles using Python and BeautifulSoup. It works well, but I would like to suggest some improvements by using lxml instead. The problem is fairly simple: get the title and the description out of the head, and if there is no description, try to pull some content out of the body. The first two are easy and the last one is a pain, but Python has tools that make our lives easier. BeautifulSoup is the go-to library for web scraping in Python, but it suffers when it comes to performance. lxml is definitely faster, in this case about 3 times so.
When coding this I used pretty much the exact same method as David, just with lxml's functions instead. To fetch the page we use cookielib and urllib2, like so (the snippets in this post all come from the body of one function, hence the return):
import urllib2
import cookielib
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
try:
    response = opener.open(url).read()
except urllib2.URLError:
    return (None, None)
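The snippet above targets Python 2. On Python 3 the same modules live under different names: cookielib became http.cookiejar, and urllib2 was split into urllib.request and urllib.error. A minimal sketch of the equivalent fetch, wrapped in a hypothetical fetch_html helper that is not part of the original code, might look like this:

```python
import http.cookiejar
import urllib.error
import urllib.request


def fetch_html(url, timeout=10):
    """Return the raw response body for url, or None if the request fails.

    fetch_html and its timeout default are illustrative assumptions,
    not names taken from the original post.
    """
    cj = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cj)
    )
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError:
        return None
```

Returning None on failure mirrors the early return in the original, which gives up on the page as soon as the request raises URLError.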
From there we can use lxml's fromstring function to create an HtmlElement like so:
from lxml.html import fromstring
doc = fromstring(response)
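As a quick sanity check of what fromstring hands back, the sketch below parses a small inline document instead of a live response. The sample markup is made up for illustration; the attributes used (tag, body, findtext, text_content) are standard on lxml's HtmlElement.

```python
from lxml.html import fromstring

# Made-up sample page, standing in for the fetched response.
sample = (
    "<html><head><title>Example page</title></head>"
    "<body><p>Hello, world.</p></body></html>"
)

doc = fromstring(sample)

root_tag = doc.tag                    # tag name of the root element
title = doc.findtext(".//title")      # text of the first <title>, if any
body_text = doc.body.text_content()   # all text found under <body>
```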
Now the fun part: cleaning the HTML. BeautifulSoup makes you work a little for this, but lxml comes with this functionality built in. By default the clean_html method strips out almost everything, including the page structure, leaving you with just content. In this case we leave the page structure and meta tags intact and strip the heading tags. Note that remove_tags drops only the tags themselves and keeps their text; kill_tags would drop the text as well. safe_attrs_only needs to be False too, otherwise it will strip the content attribute from the meta tags.
from lxml.html.clean import Cleaner
cleaner = Cleaner(
    meta=False,
    safe_attrs_only=False,
    page_structure=False,
    remove_tags=['h1', 'h2', 'h3', 'h4', 'h5', 'h6'],
)
doc = cleaner.clean_html(doc)
This has removed all script tags, style tags, comments and everything else nasty from the page. Once that is done we can get the title and description from the head element. If there is no description, we remove the head and fall back to the body text.
description = None
try:
    path = '/html/head/meta[@content][@name="description"]'
    description = doc.xpath(path)[0].get("content")
except IndexError:
    pass
title = None
try:
    title = doc.xpath('/html/head/title')[0].text_content().strip()
except IndexError:
    pass
if not description:
    # Get rid of the head element
    doc.head.drop_tree()
    # Taken from http://bit.ly/zsXZt
    p_texts = [p.strip() for p in doc.text_content().split('\n')]
    description = max((len(p), p) for p in p_texts)[1].strip()[0:255]
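The fallback is easier to see in isolation. Below is a minimal sketch of the same longest-line heuristic as a standalone function. The names longest_line_excerpt and limit are hypothetical, and the sketch uses max(..., key=len) rather than the (len(p), p) tuples above, which only changes how ties between equally long lines are broken.

```python
def longest_line_excerpt(text, limit=255):
    """Return the longest line of text, stripped and cut to limit chars."""
    lines = [line.strip() for line in text.split("\n")]
    # split() always yields at least one item, so max() never sees an
    # empty sequence; an all-whitespace input simply returns "".
    return max(lines, key=len)[:limit]


# Made-up page text: navigation, one real paragraph, and a footer.
sample_text = (
    "Menu\n"
    "  Home | About  \n"
    "This is the first real paragraph of the page.\n"
    "Footer"
)
excerpt = longest_line_excerpt(sample_text)
```

The heuristic assumes the longest line is real body content, so a page with a very long navigation line or unbroken boilerplate can fool it.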
The longest-line fallback is taken directly from David's post. Now the results. Note that I left the page download out of the timing, so the numbers cover only the time it took to parse the HTML. It takes about 1.32 seconds to process using BeautifulSoup, while lxml takes around 0.33 seconds. Not the most scientific study, but it confirmed the performance increase for me.
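For anyone who wants to repeat the comparison, a rough timing harness could look like the sketch below. It is an assumption about how such a measurement might be set up, not the code behind the numbers above. time_parser is a hypothetical helper, and the stand-in callable should be swapped for the BeautifulSoup or lxml extraction function being measured.

```python
import timeit


def time_parser(parse, html, repeat=5, number=10):
    """Return the best average time in seconds for one parse(html) call."""
    timings = timeit.repeat(lambda: parse(html), repeat=repeat, number=number)
    # Take the minimum run: it is the least affected by other system load.
    return min(timings) / number


# Stand-in input and parser so the sketch runs on its own; replace len
# with the real extraction function and html with an already-fetched page.
html = (
    "<html><head><title>t</title></head><body>"
    + "<p>x</p>" * 1000
    + "</body></html>"
)
elapsed = time_parser(len, html)
```

Passing in an already-downloaded page keeps network time out of the measurement, the same way the figures above exclude it.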
You can find the full code on GitHub here: http://gist.github.com/138642