2008/07/28

feed parsing in calibre (for SONY Reader PRS 500)

P1100010 (by plateaukao)
Olympus E300
Annecy.France
2008.01.10

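I woke up in the morning to a window full of sunshine.
Even with the woods standing in the way,
the light couldn't help slipping in through the gaps.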

****
Since Sony's eLibrary software only runs on Windows, I had to find another program to manage ebooks on my Sony Reader PRS-500 under Ubuntu. So far, I think calibre (formerly named libprs500) is the best tool I can get. It not only lets me transfer documents to my PRS-500, but can also convert documents and RSS feeds into reader-compatible formats.

The RSS feed parsing part is really powerful! If the feed you want is simple, or in a standard format, you can easily add it to calibre through the UI. Sometimes, however, you have to tweak things to get better output. For example, you may want to fetch the "printer version" of each article in the feed, or prune unnecessary parts of the page (related links, comments, etc.).
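To make the "printer version" idea concrete, here is a minimal sketch. calibre recipes can define a print_version method that receives an article URL and returns the URL to download instead. The `?print=1` pattern and the example.com addresses below are made up for illustration; every site has its own scheme, so treat this as a starting point only. I wrote it as a plain function so it runs without calibre; in a recipe it would be a method taking `self` as well.

```python
from urllib.parse import urlsplit, urlunsplit


def print_version(url):
    """Return a hypothetical printer-friendly URL for an article URL.

    Assumption: the site serves its print layout when 'print=1' is
    present in the query string. Real sites differ.
    """
    parts = urlsplit(url)
    # Append to an existing query string, or start a new one.
    query = parts.query + '&print=1' if parts.query else 'print=1'
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       query, parts.fragment))


if __name__ == '__main__':
    print(print_version('http://example.com/article/123'))
    print(print_version('http://example.com/article?id=5'))
```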

Thanks to the underlying "Universal Feed Parser", calibre gives you a rich library for customizing the output of your feeds. Because I can't read long articles in front of my notebook, I always want to put more onto my PRS-500. Today I tried customizing a feed from Japan, from the CNET Japan site.

The feed is generated by FeedBurner. The link element of each article is different from the real URL; I think CNET does some redirection after receiving the request. So the simple Add and Run does not work for this feed. I looked into the feed and found that the original link is actually included in the "feedburner:origLink" element. By entering Advanced mode in the calibre settings, I can see the Python class created for this feed. That's where we can make our modifications. First of all, I define a method named get_article_url. calibre calls this method to get the URL of each article, so I can simply return the right URL from it.
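To show what the difference between the two links looks like, here is a small sketch using only the Python standard library. The sample feed and its example.com URLs are invented for illustration (they are not the real CNET feed), and the helper name original_links is my own. The namespace URI is the one FeedBurner feeds declare for the feedburner prefix.

```python
import xml.etree.ElementTree as ET

# Namespace bound to the "feedburner" prefix in FeedBurner feeds.
FEEDBURNER_NS = 'http://rssnamespace.org/feedburner/ext/1.0'

# Invented sample: <link> is the redirecting URL, origLink the real one.
SAMPLE_FEED = """<rss version="2.0"
     xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Example article</title>
      <link>http://feeds.example.com/~r/example/~3/123456/</link>
      <feedburner:origLink>http://www.example.com/blog/article-1.html</feedburner:origLink>
    </item>
  </channel>
</rss>
"""


def original_links(feed_xml):
    """Return (link, origLink) pairs for every item in the feed."""
    root = ET.fromstring(feed_xml)
    pairs = []
    for item in root.iter('item'):
        link = item.findtext('link')
        # Namespaced tags are addressed as '{namespace-uri}localname'.
        orig = item.findtext('{%s}origLink' % FEEDBURNER_NS)
        pairs.append((link, orig))
    return pairs


if __name__ == '__main__':
    for link, orig in original_links(SAMPLE_FEED):
        print('feed link:    ', link)
        print('original link:', orig)
```

The Universal Feed Parser flattens such namespaced elements into keys of the form prefix_name in lowercase, which is why the recipe looks up 'feedburner_origlink'.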

I also disabled the original site's CSS, extracted only the div sections I want, and changed the font size.
Here is the final result:

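# BasicNewsRecipe lives in calibre.web.feeds.news
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1217218562(BasicNewsRecipe):
    title = u'Japanese'
    oldest_article = 7  # days
    max_articles_per_feed = 100
    no_stylesheets = True  # ignore the original site's CSS
    # keep only the elements with these ids
    keep_only_tags = [dict(id=['block_rblog_leaf', 'block_comment'])]
    feeds = [
        (u'CNET Blog', u'http://feeds.japan.cnet.com/cnet/blog'),
        (u'CNET News', u'http://feeds.japan.cnet.com/cnet/rss'),
    ]
    extra_css = 'p {font-size:12pt}'

    def get_article_url(self, article):
        # the feed parser exposes feedburner:origLink under this key
        return article.get('feedburner_origlink', None)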
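One thing to watch out for: if an entry has no feedburner:origLink, returning None leaves that article without a URL. A possible variant, sketched here as a standalone function (the name pick_article_url is my own, and plain dicts stand in for the dict-like feed entries), falls back to the ordinary link. I have only reasoned about this on sample data, so take it as a suggestion rather than a tested recipe.

```python
def pick_article_url(article):
    """Prefer the original link, fall back to the regular feed link.

    'article' is assumed to be dict-like, as parsed feed entries are.
    """
    return article.get('feedburner_origlink') or article.get('link')


if __name__ == '__main__':
    with_orig = {
        'feedburner_origlink': 'http://www.example.com/a',
        'link': 'http://feeds.example.com/x',
    }
    without_orig = {'link': 'http://feeds.example.com/x'}
    print(pick_article_url(with_orig))
    print(pick_article_url(without_orig))
```

Inside the recipe class, the same one-line body would go into get_article_url(self, article).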
Once you get familiar with the Python class, the output is at your command. : )

