feed parsing in calibre (for SONY Reader PRS 500)

Since Sony's elibrary software only runs on Windows, I had to find other software to manage ebooks on my Sony Reader PRS-500 under Ubuntu. So far, I think calibre (formerly named libprs500) is the best tool I can get. It not only lets me transfer documents to my PRS-500, but can also convert documents and RSS feeds to reader-compatible formats.

The RSS feed parsing part is really powerful! If the feed you want is simple, or in a standard format, you can easily add it to calibre through the UI. Sometimes, however, you have to tweak things to get better output. For example, you may want to fetch the "printer version" of each article in the feed, or prune unnecessary parts of the page (related links, comments, etc.).
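As a rough illustration of the "printer version" idea, here is a minimal sketch of a URL rewrite in plain Python. The `print=1` query parameter and the example.com addresses are purely hypothetical; every site has its own pattern, so check how the print link on your site actually looks. In a calibre recipe, this kind of logic would normally go into the recipe's `print_version(self, url)` method.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def to_print_url(url, flag=("print", "1")):
    """Return `url` with a (hypothetical) print flag in its query string."""
    parts = urlsplit(url)
    # Keep the existing query parameters, dropping any old copy of the flag.
    query = [(k, v)
             for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != flag[0]]
    query.append(flag)
    return urlunsplit(parts._replace(query=urlencode(query)))


print(to_print_url("http://example.com/story/123"))
print(to_print_url("http://example.com/story/123?page=2"))
```

Rebuilding the query string with `urllib.parse` instead of gluing strings together keeps the result valid whether or not the article URL already has parameters.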

Thanks to the underlying "Universal Feed Parser", calibre gives you a rich library for customizing the output of your feeds. Because I can't read long articles in front of my notebook, I always want to put more things onto my PRS-500. Today I tried to customize one feed from Japan, from the CNET Japan site.

The feed is generated by FeedBurner. The link element of each article is different from the real URL; I think CNET does some redirection after receiving the request. So the simple Add and Run does not work for this feed. I looked into the feed and found that the original link is actually included in the "feedburner:origLink" element, which the feed parser exposes under the key feedburner_origlink. By entering Advanced mode in the calibre settings, I can see the Python class created for this feed. That's where we can make our modifications. First of all, I define a method named get_article_url. calibre calls this method for each article to find out which URL to download, so I can just return the right URL from it.
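To make the difference between the two links concrete, here is a small standard-library sketch that reads a FeedBurner-style item and prefers "feedburner:origLink" over the plain link. The sample item and its example.com URLs are made up for illustration, and the helper name real_article_url is my own; calibre itself goes through Universal Feed Parser rather than ElementTree.

```python
import xml.etree.ElementTree as ET

# Namespace that FeedBurner uses for its extension elements.
FEEDBURNER_NS = "http://rssnamespace.org/feedburner/ext/1.0"

# Made-up item in the shape FeedBurner produces (URLs are placeholders).
SAMPLE_ITEM = """
<item xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
  <title>Sample entry</title>
  <link>http://feeds.example.com/~r/sample/~3/abc123/</link>
  <feedburner:origLink>http://www.example.com/blog/sample-entry.html</feedburner:origLink>
</item>
"""


def real_article_url(item):
    """Prefer the original link; fall back to <link> if it is missing."""
    orig = item.findtext("{%s}origLink" % FEEDBURNER_NS)
    if orig and orig.strip():
        return orig.strip()
    link = item.findtext("link")
    return link.strip() if link and link.strip() else None


item = ET.fromstring(SAMPLE_ITEM.strip())
print(real_article_url(item))
```

The fallback to the plain link is a design choice for feeds that mix FeedBurner and non-FeedBurner entries; without it, an entry lacking the original link would yield no URL at all.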

I also disabled the CSS from the original site, kept only the div sections I want, and changed the font size.
Here is the final result:

from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1217218562(BasicNewsRecipe):
    title = u'Japanese'
    oldest_article = 7
    max_articles_per_feed = 100
    # Ignore the stylesheets of the original site
    no_stylesheets = True
    # Keep only the elements with these ids
    keep_only_tags = [dict(id=['block_rblog_leaf', 'block_comment'])]
    feeds = [(u'CNET Blog', u'http://feeds.japan.cnet.com/cnet/blog'),
             (u'CNET News', u'http://feeds.japan.cnet.com/cnet/rss')]
    extra_css = 'p {font-size:12pt}'

    def get_article_url(self, article):
        # Return the original URL stored by FeedBurner for this entry
        return article.get('feedburner_origlink', None)
Once you get familiar with the Python class, the output is at your command. :)
