5.4.13

Reproducible publications as R packages... and markdown vignettes!

I hadn't really considered it before, but R packages are an interesting way to distribute reproducible papers. You have functions in the R directory, any required data files in data, a DESCRIPTION file that describes the package and all its dependencies, and finally the vignette, generally in inst/doc. Vignettes in R traditionally had to be written in LaTeX, with Sweave used to mix the prose and the code.
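As a rough sketch of that layout, the base R snippet below builds a bare-bones package skeleton in a temporary directory. The package name `mypaper`, the DESCRIPTION field values, and the function `summarise_results` are all made up for illustration. It also uses a `vignettes/` directory for the vignette source, which newer versions of R prefer over `inst/doc`.

```r
# Hypothetical package name and contents, for illustration only.
pkg <- file.path(tempdir(), "mypaper")

# Directories for code, data, and the vignette source.
for (d in c("R", "data", "vignettes")) {
  dir.create(file.path(pkg, d), recursive = TRUE, showWarnings = FALSE)
}

# DESCRIPTION is a plain "Field: value" file; write.dcf() ships with base R.
desc <- data.frame(
  Package = "mypaper",
  Title = "Code and Data for a Reproducible Paper",
  Version = "0.1.0",
  Description = "Functions, data and the manuscript for one paper.",
  License = "GPL-3",
  Imports = "stats",
  stringsAsFactors = FALSE
)
write.dcf(desc, file = file.path(pkg, "DESCRIPTION"))

# One analysis function that the paper would call.
writeLines(
  "summarise_results <- function(x) c(mean = mean(x), sd = stats::sd(x))",
  file.path(pkg, "R", "summarise_results.R")
)

# List the files written so far (empty directories are not listed).
list.files(pkg, recursive = TRUE)
```

Anyone who installs such a package gets the functions, the data, and the declared dependencies in one step, which is the whole appeal.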

It is very easy to imagine that the vignette produced could be a journal article, so that to get the reproducible article along with the underlying data and functions, you simply install the package. I don't know why I hadn't realized this before. It is actually a really neat and potentially powerful idea. I would not be surprised if this was part of the logic behind incorporating Sweave vignettes in R packages.

So why isn't it more common to distribute papers as packages? I am guessing it is largely because of the requirement to write the document in LaTeX. I've written a vignette, and I can't say I really cared for the experience, basically because getting the LaTeX part to work was a painful process.

However, with the release of R 3.0.0, it is now possible to define alternative vignette engines. The only one I know of so far is knitr, which supports generating PDF from Rnw (LaTeX with an alternative syntax to Sweave) and HTML from Rmd, or R markdown, files. Markdown is so much easier to write, and to my mind the generated HTML is much easier to read. On top of that, customizing the look with CSS is probably much more familiar to many people who are programming nowadays, another big plus.
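To make the engine declaration concrete, here is a minimal sketch, again in base R and with the hypothetical package `mypaper`, that writes the two pieces involved: a `VignetteBuilder` field in DESCRIPTION and a `\VignetteEngine` line at the top of the Rmd file. The HTML-comment header follows the convention knitr documented for R 3.0.0; the file name `paper.Rmd` and the vignette body are assumptions.

```r
# Hypothetical package location, for illustration only.
pkg <- file.path(tempdir(), "mypaper")
dir.create(file.path(pkg, "vignettes"), recursive = TRUE, showWarnings = FALSE)

# DESCRIPTION tells R which package builds the vignettes.
desc <- data.frame(
  Package = "mypaper",
  Title = "Code and Data for a Reproducible Paper",
  Version = "0.1.0",
  Description = "Functions, data and the manuscript for one paper.",
  License = "GPL-3",
  Suggests = "knitr",
  VignetteBuilder = "knitr",
  stringsAsFactors = FALSE
)
write.dcf(desc, file = file.path(pkg, "DESCRIPTION"))

# Build the chunk fence from single backticks to keep this example readable.
fence <- paste(rep("`", 3), collapse = "")

# The vignette itself names the engine inside an HTML comment.
vignette_lines <- c(
  "<!--",
  "%\\VignetteEngine{knitr::knitr}",
  "%\\VignetteIndexEntry{A Reproducible Paper}",
  "-->",
  "",
  "# A Reproducible Paper",
  "",
  paste0(fence, "{r}"),
  "summary(cars)",
  fence
)
writeLines(vignette_lines, file.path(pkg, "vignettes", "paper.Rmd"))
```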

Besides having to write in LaTeX, the process of changing, building, loading, and documenting packages used to be pretty cumbersome. However, Hadley Wickham has been doing a lot to change that with his devtools package, which makes it quite easy to rebuild a package and load it back up. This has now been integrated into the latest versions of RStudio, making it rather easy to work on a package and immediately work with any changes. In addition, the testthat package makes it easier to run automated tests, and roxygen2 makes it easy to document the custom functions used by your vignette.
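One way that edit-and-reload loop might look in code is sketched below. The helper name `rebuild` is hypothetical; the devtools calls it wraps (`document()`, `load_all()`, `test()`) are real functions, but the order and the choice to run all three on every pass are just one possible arrangement.

```r
# Hypothetical helper: refresh docs, reload code, and run tests for a package.
rebuild <- function(pkg = ".") {
  # Skip quietly when devtools is not available.
  if (!requireNamespace("devtools", quietly = TRUE)) {
    message("devtools is not installed; skipping")
    return(invisible(FALSE))
  }
  devtools::document(pkg)  # regenerate help files from roxygen2 comments
  devtools::load_all(pkg)  # load the package code without installing it
  devtools::test(pkg)      # run the testthat tests under tests/
  invisible(TRUE)
}
```

Calling `rebuild("path/to/package")` after each change would then stand in for the manual build, install, and restart cycle.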

So, I know I will be using Yihui's guide to switch my own package to a markdown vignette, and I will probably try to do my next paper as a self-contained R package as well. How about you?

Edit: As Carl pointed out below, pandoc is very useful for converting markdown to other formats that may be required for formal submission to a journal.
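For instance, a conversion to Word format could be driven from R roughly as follows. The helper name `md_to_docx` is hypothetical, and the sketch assumes the `pandoc` command-line tool is installed and on the PATH; it does nothing if it is not.

```r
# Hypothetical helper: convert a markdown file to .docx with the pandoc CLI.
md_to_docx <- function(input, output = sub("\\.md$", ".docx", input)) {
  # Do nothing when pandoc cannot be found.
  if (!nzchar(Sys.which("pandoc"))) {
    message("pandoc was not found on the PATH; nothing converted")
    return(invisible(FALSE))
  }
  # Equivalent to running: pandoc input.md -o input.docx
  status <- system2("pandoc", c(shQuote(input), "-o", shQuote(output)))
  invisible(status == 0)
}
```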

6 comments:

  1. Couldn't agree more. My last two papers were done this way (ROxygen, knitr, markdown, etc) https://github.com/cboettig/treeBASE, appearing in MEE (10.1111/j.2041-210X.2012.00247.x)

    and https://github.com/cboettig/rfishbase, appearing in J. Fish Biology (10.1111/j.1095-8649.2012.03464.x)

    Admittedly these were about R packages, but I have resolved to do this for my research papers as well. (drafts online, e.g. multiple_uncertainty & pdg_control)

    I'd add a note that Gentleman and Lang (2007, 10.1198/106186007X178663) propose this idea with Sweave, but it is much easier now with Rmd. I'd also mention pandoc -- J Fish Biology didn't take the pdf, so pandoc let me generate a .docx from the markdown.

  2. You have written what I intended to write for my next blog post :) I said something similar here a while ago: http://permalink.gmane.org/gmane.comp.lang.r.devel/32952

    I need to push knitr 1.2 to CRAN before I can shout out loud why everybody needs to upgrade to R 3.0.0 for a wonderful world of reproducible research.

  3. Sorry to steal your thunder, Yihui. It's funny, I realized that Sweave and knitr were great for producing reproducible documents, but I hadn't thought about the data and custom function aspects. That is, I had been thinking in terms of a separate package with the functions, a separate data file, and then the document itself. But a single package would make it much easier, and take care of dependencies, etc.

    I really hope this approach catches on for reproducible documents in general, using R and the other engines that knitr will support.

    Replies
    1. You were not stealing from me -- we were both stealing from Duncan Temple Lang and Robert Gentleman :) I made some comments on their paper in a book chapter (http://www.crcpress.com/product/isbn/9781466561595). Basically, they were hoping something like XML would rule the world, but I do not think that is going to happen... There are tons of extremely powerful authoring languages in the world like LaTeX and XML, but I believe the "weak" languages like Markdown will win over the masses.

  4. This comment has been removed by the author.

  5. I like the idea. I took a similar approach with my latest paper that we're working on formatting for submission, but modified it slightly. In our case, we were working on a new software methodology, so I wanted to ensure that we had the code developed as a stand-alone package which could be reused without having to download all sorts of extra data and defunct functions.

    So I released the methods/useful code in one package/repo: https://github.com/QBRC/ENA
    And the research/knit-able document in another: https://github.com/QBRC/ena-research

    I kind of like this approach as it allows me to set aside the functions which I may want to re-use later on down the road, but I can certainly see the benefit of having an entire project wrapped up in one package (data size permitting).