17.11.12

Calibre, Python, reading papers in e-ink

Yesterday I set up the excellent IPython Notebook on my Windows machine. It is essentially an interactive web interface to the Python shell that lets you record everything you have done and annotate it using Markdown and MathJax. In many ways, it is very similar to using RStudio, RMarkdown, and knitr to generate Markdown and HTML reports from R.

I don't actually do any of my scientific coding in Python, but that may change. My motivation for wanting to learn some Python comes from the fact that Calibre is written in Python. Calibre provides a nice method for taking RSS feeds, parsing them, and spitting out the results as something an e-reader can understand (and really, it supports many different e-reading platforms and formats, including ePub and the Kindle formats).

Although I have been using my first generation iPad to read scientific publications as PDFs for 2 1/2 years now, the recent experiment of Genome Biology providing the ENCODE publications as ePub made me try reading scientific publications on my 3rd gen Kindle. I loved it! Even without color figures, and with the rather small screen, the experience was simply amazing. It works especially well because I read many papers more for information intake than for marking up. Highlighting works well on the Kindle, and I can even retrieve the highlights and notes from the text file that holds them on the Kindle itself. And if I really need to mark up a Kindle doc, I can use the Kindle app on my iPad or my computer.

Most scientific publications are made available as HTML pages or PDFs. So in theory, we should be able to easily generate an ePub or Kindle-format file from the raw HTML using Calibre. However, for some reason the powers that be in the e-journal publishing world decided that figures and tables should not actually be part of the HTML document. I really don't know why, because they are in the PDF, and it is not that hard to do in the HTML (see an example paper I did here in HTML).

What this ultimately means is that to generate an e-reader compatible document, we need to actually modify the HTML. We need to go in, find the elements that tell us where the figure and table pages are, parse them, and get the actual files. I figured out how to do this for at least one journal in R using the XML package, and was going to create a package that would take a series of links or DOIs and process them.
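To make the "find the elements" step concrete, here is a minimal Python sketch using only the standard library. It is not the R code I mentioned, and the markup is made up: the class names `figure-link` and `table-link`, the sample HTML, and the `journal.example.org` URLs are all assumptions for illustration. Every journal marks its figure and table pages differently, so the matching rule would have to be adapted per journal.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FigureLinkParser(HTMLParser):
    """Collect hrefs of <a> tags whose class marks a figure or table page."""

    def __init__(self, base_url, marker_classes=("figure-link", "table-link")):
        super().__init__()
        self.base_url = base_url
        # Hypothetical class names; a real journal will use its own markup.
        self.marker_classes = set(marker_classes)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = set((attr_map.get("class") or "").split())
        href = attr_map.get("href")
        if href and classes & self.marker_classes:
            # Resolve relative links against the article URL.
            self.links.append(urljoin(self.base_url, href))


def find_figure_links(html, base_url):
    """Return absolute URLs of the figure and table pages linked from html."""
    parser = FigureLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up article fragment, standing in for a downloaded full-text page.
SAMPLE_HTML = """
<html><body>
<p>Results are in <a class="figure-link" href="/content/1/2/F1.html">Figure 1</a>
and <a class="table-link" href="T1.html">Table 1</a>.</p>
<p>See also <a href="/about">the journal page</a>.</p>
</body></html>
"""

links = find_figure_links(
    SAMPLE_HTML, "http://journal.example.org/content/1/2/full.html"
)
print(links)
```

Each returned URL would then be fetched and parsed in turn to pull out the actual image or table, which could be spliced back into the article HTML before conversion.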

But I get many of my papers by RSS feed for specific journals. And Calibre, as I already mentioned, has some nice functions for automatically processing RSS feeds and generating e-reader compatible docs from them. And Calibre is written in Python, and new RSS processing recipes are written in Python. Therefore, I guess I'm going to learn me some Python!
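As a first taste of what the RSS side involves, here is a small standard-library sketch that pulls article titles and links out of a feed. It is not a Calibre recipe, and the feed itself is invented: the `SAMPLE_FEED` contents and the `parse_feed` helper are assumptions for illustration. It also assumes a plain RSS 2.0 feed without XML namespaces, which may not hold for every journal.

```python
import xml.etree.ElementTree as ET

# Invented feed, standing in for a journal's table-of-contents feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Journal</title>
    <item>
      <title>First paper</title>
      <link>http://journal.example.org/content/1/1/full.html</link>
    </item>
    <item>
      <title>Second paper</title>
      <link>http://journal.example.org/content/1/2/full.html</link>
    </item>
  </channel>
</rss>
"""


def parse_feed(xml_text):
    """Return (title, link) pairs for every <item> that has a link."""
    root = ET.fromstring(xml_text)
    articles = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        if link:
            articles.append((title, link))
    return articles


articles = parse_feed(SAMPLE_FEED)
for title, link in articles:
    print(title, link)
```

Calibre handles this part itself once a recipe lists its feeds, so the real work in a recipe would be the per-article HTML cleanup rather than the feed parsing.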
