screeley.com

A Faster Python Script for Extracting Excerpts from Articles

July 1

A couple of weeks ago David Ziegler posted an article on how to extract excerpts from articles using Python and BeautifulSoup. It works well, but I would like to suggest some improvements by using lxml instead. It's a fairly simple problem: get the title and the description out of the head, and if there is no description, try to pull some content out of the body. The first two are easy and the last one is a pain, but Python has tools that make our lives easier. BeautifulSoup is the go-to library for web scraping in Python, but it suffers when it comes to performance. lxml is definitely faster, and in this case it parsed the page in roughly a quarter of the time.

When coding this I used pretty much the same method as David, just with lxml's functions instead. To retrieve the page behind the link we use cookielib and urllib2 like so:

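    import urllib2
    import cookielib

    # This snippet sits inside a function that is passed `url`,
    # hence the bare return below.
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

    try:
        response = opener.open(url).read()
    except urllib2.URLError:
        # The page could not be fetched, so there is nothing to extract
        return (None, None)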
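
On Python 3 the same two modules live under different names: cookielib became http.cookiejar, and urllib2 was split into urllib.request and urllib.error. A minimal sketch of the same fetch step follows, assuming a hypothetical helper named `fetch` and a 10-second timeout that the original script does not use:

```python
import http.cookiejar
import urllib.error
import urllib.request


def fetch(url, timeout=10):
    """Return the raw response body for url, or None on URLError.

    `fetch` and `timeout` are illustrative names, not part of the
    original script.
    """
    cj = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cj))
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError:
        # Same failure case the original handles with urllib2.URLError
        return None
```

A caller would return `(None, None)` whenever `fetch` gives back `None`, which matches what the original snippet does.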

From there we can use lxml's fromstring function to create an HtmlElement like so:

    from lxml.html import fromstring
    doc = fromstring(response)

Now the fun part: cleaning the HTML. BeautifulSoup makes you work a little for this, but lxml comes with this functionality built in. By default the clean_html method strips out almost everything, including the page structure, leaving you with just content. In this case we leave the page structure and meta tags intact and strip the heading tags. Note that remove_tags drops only the tags themselves and pulls their text up into the parent; kill_tags is the option that discards the content too. safe_attrs_only needs to be False as well, otherwise the content attribute is stripped from the meta tags.

    from lxml.html.clean import Cleaner
    cleaner = Cleaner(
        meta=False,
        safe_attrs_only=False,
        page_structure=False,
        remove_tags=['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])

    doc = cleaner.clean_html(doc)

This removes all script tags, style tags, comments and everything else nasty from the page. Once that is done we can get the title and description from the head element and, if there is no description, drop the head and fall back to the body text.

    description = None
    try:
        path = '/html/head/meta[@content][@name="description"]'
        description = doc.xpath(path)[0].get("content")
    except IndexError:
        pass
    
    title = None
    try:
        title = doc.xpath('/html/head/title')[0].text_content().strip()
    except IndexError:
        pass
    
    if not description:
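        # Get rid of the head element so only body text is left
        doc.head.drop_tree()
        # Taken from http://bit.ly/zsXZt: use the longest line of text,
        # capped at 255 characters, as the description
        p_texts = [p.strip() for p in doc.text_content().split('\n')]
        description = max((len(p), p) for p in p_texts)[1].strip()[0:255]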
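
The fallback is easy to try on its own, without lxml or a network connection. The sketch below repeats the longest-line heuristic on plain text. The function name `longest_line_excerpt`, the `limit` parameter and the sample text are made up for illustration, and `max(..., key=len)` stands in for the `(len(p), p)` tuple trick, so ties go to the first of the longest lines rather than the lexicographically largest one.

```python
def longest_line_excerpt(text, limit=255):
    """Return the longest stripped line of text, cut to limit characters."""
    lines = [line.strip() for line in text.split('\n')]
    # max() with key=len returns the first of the longest lines
    longest = max(lines, key=len)
    return longest[:limit]


# Hypothetical page text, standing in for doc.text_content()
page_text = """
Home | About | Contact

lxml parses and cleans HTML quickly, which makes it a good fit for excerpts.

Copyright 2009
"""
print(longest_line_excerpt(page_text))
```

The heuristic assumes that navigation and footer lines are short and that the longest line is real article text, which is why the headings are stripped before this step.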

The last part is taken directly from David's post. Now for the results. Note that I left the link retrieval time out when clocking this, so it's only the time it took to parse the HTML. It takes about 1.32 seconds to process using BeautifulSoup, while lxml takes around 0.33 seconds, roughly a quarter of the time. Not the most scientific study, but it validated the performance increase for me.
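
For anyone who wants to repeat the comparison, a small harness along these lines would time only the parsing step. It is a sketch, not the script used for the numbers above: `time_parser`, `fake_parser` and the sample HTML are placeholders, and the real BeautifulSoup and lxml functions would be passed in instead.

```python
import time


def time_parser(parse, html, repeats=5):
    """Return the best wall-clock time in seconds over repeats calls."""
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        parse(html)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best


def fake_parser(html):
    # Placeholder workload; swap in the BeautifulSoup or lxml version here
    return html.lower().count('<p>')


sample_html = '<html><body>' + '<p>Some text</p>' * 1000 + '</body></html>'
print('%.6f seconds' % time_parser(fake_parser, sample_html))
```

Fetching the page once and passing the same string to both parsers keeps network time out of the measurement, as in the comparison above.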

You can find the full code on GitHub here: http://gist.github.com/138642

Comments

lxml is great! thanks for sharing this script.

This is cool. A company I was consulting for insisted on using the newest version of Beautiful Soup instead of the one that actually works, so instead I switched it out for lxml. If you can actually get lxml to compile, it's awesome, but installing all the dependencies can be a pain.

I'm a developer out of Boston MA and I work for a consulting firm specializing in open source technologies.

This space will deal with the work I've participated in using the Django framework to build applications for enterprise clients.

Finally, I hate the word blog and Drupal.
