2008/07/28

feed parsing in calibre (for SONY Reader PRS 500)

P1100010 (by plateaukao)
Olympus E300
Annecy.France
2008.01.10

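Waking up in the morning, I saw the window filled with sunlight.
Even with the woods standing in the way,
the light still couldn't help slipping in through the gaps.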

****
Since Sony's elibrary software runs only on Windows, I had to find another program to manage ebooks on my Sony Reader PRS-500 under Ubuntu. So far, I think calibre (formerly named libprs500) is the best tool I can get. It not only lets me transfer documents to my PRS-500, but can also convert documents and RSS feeds into reader-compatible formats.

I found the RSS feed parsing really powerful! If the feed you want is simple or in a standard format, you can easily add it to calibre through the UI. Sometimes, however, you have to tweak things to get better output. For example, you may want to fetch the "printer version" of each article in the feed, or prune unnecessary information from the page (related links, comments, etc.).
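
As a rough illustration of the "printer version" idea: calibre's BasicNewsRecipe lets a recipe override a print_version(url) method that maps an article URL to the URL that is actually downloaded. The sketch below shows only the URL rewriting, as a standalone function. The name to_print_url, the print=1 query parameter, and the example.com address are all made up for illustration; every site has its own scheme for printer-friendly pages.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def to_print_url(url):
    """Return url with a hypothetical print=1 query parameter added."""
    parts = urlsplit(url)
    # Keep the existing query parameters, dropping any stale 'print' value.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != 'print']
    query.append(('print', '1'))
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == '__main__':
    # Placeholder URL, not a real article.
    print(to_print_url('http://example.com/story?id=42'))
```

Inside a recipe, the same logic would presumably live in print_version, with the rewriting rule adapted to whatever the target site really uses.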

Thanks to the underlying "Universal Feed Parser", calibre gives you a rich library for customizing the output of your feeds. Because I can't read long articles on my notebook's screen, I always want to put more things on my PRS-500. Today I tried to customize a feed from Japan, from the CNET Japan site.

The feed is generated by FeedBurner. The link element of each article differs from the real URL; I think CNET does some redirection after receiving the request, so the simple Add and Run does not work for this feed. I looked into the feed and found that the original link is actually included in the "feedburner:origLink" element. Entering Advanced mode in the calibre settings shows the Python class created for this feed, and that is where we can make our modifications. First of all, I defined a function named get_article_url. calibre calls this function to get the URL of each article, so I can simply return the right URL from it.
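
To make the "feedburner:origLink" idea concrete, here is a minimal sketch that reads it straight out of a feed using only the Python standard library. The sample feed, its example.com URLs, and the helper name original_links are invented for illustration; the namespace URI is the one FeedBurner feeds normally declare. As far as I know, Universal Feed Parser exposes the same element under the key feedburner_origlink (prefix, underscore, lower-cased name).

```python
import xml.etree.ElementTree as ET

# Namespace FeedBurner uses for its extension elements.
FEEDBURNER_NS = 'http://rssnamespace.org/feedburner/ext/1.0'

# Made-up sample feed; the URLs are placeholders, not real CNET addresses.
SAMPLE_RSS = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
  <channel>
    <title>Sample</title>
    <item>
      <title>First post</title>
      <link>http://feeds.example.com/~r/sample/~3/1/</link>
      <feedburner:origLink>http://www.example.com/blog/first</feedburner:origLink>
    </item>
    <item>
      <title>Second post</title>
      <link>http://www.example.com/blog/second</link>
    </item>
  </channel>
</rss>
"""


def original_links(rss_text):
    """Return (title, url) pairs, preferring feedburner:origLink over link."""
    root = ET.fromstring(rss_text.encode('utf-8'))
    pairs = []
    for item in root.iter('item'):
        # Namespaced tags are looked up as '{namespace-uri}localname'.
        orig = item.findtext('{%s}origLink' % FEEDBURNER_NS)
        pairs.append((item.findtext('title'), orig or item.findtext('link')))
    return pairs


if __name__ == '__main__':
    for title, url in original_links(SAMPLE_RSS):
        print(title, url)
```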

I also disabled the CSS styles from the original site, kept only the div sections I wanted, and changed the font size.
Here is the final result:

class AdvancedUserRecipe1217218562(BasicNewsRecipe):
    title = u'Japanese'
    oldest_article = 7
    max_articles_per_feed = 100
    no_stylesheets = True
    keep_only_tags = [dict(id=['block_rblog_leaf','block_comment'])]
    feeds = [(u'CNET Blog', u'http://feeds.japan.cnet.com/cnet/blog'),(u'CNET News', u'http://feeds.japan.cnet.com/cnet/rss')]
    extra_css = 'p {font-size:12pt}'

    def get_article_url(self, article):
        return article.get('feedburner_origlink',None)

Once you get familiar with the Python class, the output is at your command. : )
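
One possible refinement, not part of my recipe: get_article_url as written returns None for any entry that has no feedburner:origLink. The sketch below falls back to the entry's ordinary link in that case. It is written as a standalone function over a dict-like entry so it can be tried without calibre; the name pick_article_url and the sample entries are hypothetical.

```python
def pick_article_url(article):
    """Prefer the FeedBurner original link, else fall back to the entry link."""
    url = article.get('feedburner_origlink') or article.get('link')
    # Strip stray whitespace; return None when the entry has no URL at all.
    return url.strip() if url else None


if __name__ == '__main__':
    # Placeholder entries standing in for parsed feed items.
    with_orig = {'link': 'http://feeds.example.com/~r/x/1',
                 'feedburner_origlink': 'http://www.example.com/a'}
    without_orig = {'link': 'http://www.example.com/b'}
    print(pick_article_url(with_orig))
    print(pick_article_url(without_orig))
```

In the recipe, the body of get_article_url could be replaced with the same two lines.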
