2008/07/28

feed parsing in calibre (for SONY Reader PRS 500)

P1100010 (by plateaukao)
Olympus E300
Annecy.France
2008.01.10

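Waking up in the morning, I saw sunlight filling the window.
Even with the woods standing in the way,
the light still could not help slipping in through the gaps.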

****
Since Sony's eLibrary software only runs on Windows, I had to find another program to manage ebooks on my Sony Reader PRS-500 under Ubuntu. So far, I think calibre (formerly named libprs500) is the best tool I can get. It not only lets me transfer documents to my PRS-500, but can also convert documents or RSS feeds to reader-compatible formats.

As for the RSS feed parsing, I found it really powerful! If the feed you want to access is simple or in a standard format, you can easily add it to calibre through the UI. However, sometimes you have to tweak something to get better output. For example, you may want to fetch the "printer version" of each article in the feed, or prune unnecessary information from the page (related links, comments, etc.).
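To illustrate the "printer version" case: a calibre recipe can define a print_version method that receives an article URL and returns the URL of the printable page. The sketch below shows only the URL mapping as a standalone function. The "print=1" query parameter and the example.com URLs are made up for illustration; a real site will have its own pattern.

```python
def print_version(url):
    """Map an article URL to a printer-friendly URL.

    Assumption: the (hypothetical) site serves a printable page when
    'print=1' is added to the query string.
    """
    # Append with '&' if the URL already has a query string, else with '?'
    separator = '&' if '?' in url else '?'
    return url + separator + 'print=1'


if __name__ == '__main__':
    print(print_version('http://www.example.com/article/42'))
```

Inside a recipe class, the same logic would be written as a method taking (self, url).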

Thanks to the underlying "Universal Feed Parser", calibre provides a rich library for customizing the output of your feeds. Because I can't read long articles in front of my notebook, I always want to put more things onto my PRS-500. Today, I tried to customize one feed from Japan, from the CNET Japan site.

The feed is generated by FeedBurner. The link element of each article is different from the real URL; I think CNET does some redirection after receiving the request, so the simple Add and Run does not work for this feed. I looked into the feed and found that the original link is actually included in the "feedburner:origLink" element. By entering the Advanced mode in the calibre settings, I can see the Python class created for this feed. That's where we can make our modifications. First of all, I define a method named get_article_url. The system calls this method to get the URL of each article, so I can simply return the right URL from it.
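To make the "feedburner:origLink" element more concrete, here is a minimal sketch that pulls it out of a raw feed using only Python's standard library. It is not how calibre does it internally (calibre's feed parser exposes the value under the key 'feedburner_origlink'); the sample feed and its example.com URLs are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URI that FeedBurner declares for its extension elements
FEEDBURNER_NS = 'http://rssnamespace.org/feedburner/ext/1.0'

# Invented sample feed; titles and URLs are placeholders
SAMPLE_FEED = """<rss version="2.0" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
  <channel>
    <title>Sample</title>
    <item>
      <title>First post</title>
      <link>http://feeds.example.com/~r/sample/~3/123/</link>
      <feedburner:origLink>http://www.example.com/blog/first-post</feedburner:origLink>
    </item>
    <item>
      <title>Second post</title>
      <link>http://www.example.com/blog/second-post</link>
    </item>
  </channel>
</rss>"""


def original_links(feed_xml):
    """Return (title, origLink) pairs; origLink is None when the item has none."""
    root = ET.fromstring(feed_xml)
    pairs = []
    for item in root.iter('item'):
        title = item.findtext('title')
        # Namespaced elements are looked up as '{namespace-uri}localname'
        orig = item.findtext('{%s}origLink' % FEEDBURNER_NS)
        pairs.append((title, orig))
    return pairs


if __name__ == '__main__':
    for title, orig in original_links(SAMPLE_FEED):
        print(title, orig)
```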

Also, I disabled the CSS styles from the original site, kept only the div sections I wanted, and changed the font size.
Here is the final result:

class AdvancedUserRecipe1217218562(BasicNewsRecipe):
    title = u'Japanese'
    oldest_article = 7
    max_articles_per_feed = 100
    # Ignore the stylesheets of the original site
    no_stylesheets = True
    # Keep only the div sections I want
    keep_only_tags = [dict(id=['block_rblog_leaf','block_comment'])]
    feeds = [(u'CNET Blog', u'http://feeds.japan.cnet.com/cnet/blog'),(u'CNET News', u'http://feeds.japan.cnet.com/cnet/rss')]
    # Change the font size
    extra_css = 'p {font-size:12pt}'

    def get_article_url(self, article):
        # Return the real article URL stored in feedburner:origLink
        return article.get('feedburner_origlink',None)
Once you get familiar with the Python class, the output is at your command. : )
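One possible refinement, which is not part of my recipe: if some articles in a feed lack the original link, get_article_url could fall back to the regular link instead of returning None. The sketch below is written as a standalone function over a plain dict so it can be tried outside calibre; the 'link' key and the sample entries are assumptions for illustration.

```python
def get_article_url(article):
    """Prefer FeedBurner's original link; otherwise use the regular link.

    'article' is assumed to be a dict-like feed entry.
    """
    return article.get('feedburner_origlink') or article.get('link')


if __name__ == '__main__':
    # Invented entries with placeholder URLs
    with_orig = {'link': 'http://feeds.example.com/~r/x/1',
                 'feedburner_origlink': 'http://www.example.com/post/1'}
    without_orig = {'link': 'http://www.example.com/post/2'}
    print(get_article_url(with_orig))
    print(get_article_url(without_orig))
```

In the recipe class, the function would take (self, article) as in the class above.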


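China App Business Models (中國 App 商業模式) -- 王泌

A systematic introduction to several dozen of the bigger apps in China in recent years, covering their main business models, their investors, and what makes each one distinctive. For anyone who wants to understand the Chinese app (internet services) market, it makes a good introductory book. The material is already two or three years old, and a lot has changed in that time, but it is still a good starting point.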