A couple of weeks ago David Ziegler posted an article on how to extract excerpts from articles using Python and BeautifulSoup. It works well, but I would like to suggest some improvements by using lxml instead.

It's a fairly simple problem: get the title and the description out of the head, and if there is no description, try to pull some content out of the body. The first two are easy and the last one sucks, but Python has tools that make our lives easier. BeautifulSoup is the go-to for web scraping in Python, but it suffers when it comes to performance. lxml is definitely faster, and in this case about three times so.
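To make the three steps concrete, here is a minimal sketch of how they might look with lxml. It is not David's code or a definitive implementation: the function name `extract_excerpt`, the `max_chars` cutoff, and the "first non-empty paragraph" fallback are all assumptions chosen for illustration.

```python
import lxml.html


def extract_excerpt(html, max_chars=200):
    """Return (title, description) for an HTML string.

    Hypothetical helper: the name, the max_chars cutoff and the
    paragraph fallback are illustrative assumptions.
    """
    doc = lxml.html.fromstring(html)

    # Step 1: the title is the text of the first <title> element, if any.
    title = (doc.findtext(".//title") or "").strip()

    # Step 2: the description is the content of <meta name="description">.
    metas = doc.xpath("//meta[@name='description']/@content")
    description = metas[0].strip() if metas else ""

    # Step 3: no description, so fall back to the first non-empty <p>
    # in the body, with whitespace collapsed and the text truncated.
    if not description:
        for p in doc.xpath("//body//p"):
            text = " ".join(p.text_content().split())
            if text:
                description = text[:max_chars]
                break

    return title, description
```

Calling `extract_excerpt(page_source)` on a fetched page returns a `(title, description)` tuple. A real page will likely need a smarter fallback than "first paragraph", since navigation and ad blocks often come before the article text.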
I'm a developer out of San Francisco, CA, working at a startup.
This space will deal with the work I've participated in using the Django framework to build applications for enterprise clients.
Finally, you should follow me on Twitter.