R packages are an interesting way to distribute reproducible papers. You have functions in the R directory, any required data files in data, a DESCRIPTION file that describes the package and all its dependencies, and finally the vignette, generally in inst/doc. Vignettes in R traditionally had to be written in LaTeX, using Sweave to mix the prose and the code.

It is very easy to imagine that the vignette produced could be a journal article, so that to get the reproducible article, along with access to the underlying data and functions, you simply need to install the package. I don't know why I hadn't realized this before. It is actually a really neat, and potentially powerful, idea. I would not be surprised if this was part of the logic of incorporating Sweave vignettes in R packages.

One wonders why it isn't more common to distribute papers as packages. I am guessing it is likely due to the requirement to use LaTeX for writing the document. I've written a vignette, and I can't say I really cared for the experience, basically because getting the LaTeX part to work was a painful process.

However, with the release of R 3.0.0, it is now possible to define alternative vignette engines. Currently the only one I know of is knitr, which supports generating PDF from Rnw (LaTeX with an alternative syntax to Sweave) and HTML from Rmd, or R markdown, files. Markdown is so much easier to write, and in my mind the generated HTML is much easier to read. In addition, customizing the look with CSS is probably much more familiar to many people who are programming nowadays, another big plus.

In addition to having to write in LaTeX, the process of changing, building, loading, and documenting packages was pretty cumbersome. However, Hadley Wickham has been doing a lot to change that with his devtools package, which makes it quite easy to re-build a package and load it back up. This has now been integrated into the latest versions of RStudio, making it rather easy to work on a package and immediately work with any changes. In addition, the testthat package makes it easier to run automated tests, and roxygen2 makes it easy to document the custom functions used by your vignette.

So, I know I will be using Yihui's guide to switch my own package to a markdown vignette, and will probably try to do my next paper as a self-contained R package as well. How about you?

Edit: As Carl pointed out below, pandoc is very useful for converting markdown to other formats that may be required for formal submission to a journal.
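
For anyone making the same switch, here is a minimal sketch of the two pieces of package metadata that opt a package into the knitr vignette engine on R 3.0.0. The package name mypaper, the title, the version, and the vignette file name are hypothetical placeholders, and only the fields relevant to vignettes are shown (a real DESCRIPTION needs more, such as Author and License). This is an illustration under those assumptions, not a copy of Yihui's guide. First, the DESCRIPTION file declares knitr as the vignette builder:

```text
Package: mypaper
Title: A Reproducible Paper Distributed as an R Package
Version: 0.1
Depends: R (>= 3.0.0)
Suggests: knitr
VignetteBuilder: knitr
```

Second, the R markdown vignette itself (assumed here to live at vignettes/mypaper.Rmd) names the engine in an HTML comment near the top of the file:

```text
<!--
Hypothetical vignette header; the index entry text is a placeholder.
%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{The paper itself}
-->
```

With both in place, building the package should run the Rmd through knitr instead of Sweave and produce the HTML vignette.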
Couldn't agree more. My last two papers were done this way (roxygen2, knitr, markdown, etc.): https://github.com/cboettig/treeBASE, appearing in MEE (10.1111/j.2041-210X.2012.00247.x), and https://github.com/cboettig/rfishbase, appearing in J. Fish Biology (10.1111/j.1095-8649.2012.03464.x).
Admittedly these were about R packages, but I have resolved to do this for my research papers as well. (drafts online, e.g. multiple_uncertainty & pdg_control)
I'd add a note that Gentleman and Lang (2007, 10.1198/106186007X178663) proposed this idea with Sweave, but it is much easier now with Rmd. I'd also mention pandoc: J. Fish Biology didn't take the PDF, so pandoc let me generate a .docx from the markdown.
You have written what I intended to write for my next blog post :) I said something similar here a while ago: http://permalink.gmane.org/gmane.comp.lang.r.devel/32952
I need to push knitr 1.2 to CRAN before I can shout out loud why everybody needs to upgrade to R 3.0.0 for a wonderful world of reproducible research.
Sorry to steal your thunder, Yihui. It's funny: I realized that Sweave and knitr were great for producing reproducible documents, but I hadn't thought about the data and custom function aspects. That is, I had thought more about creating a separate package with the functions, a separate data file, and then the document itself. But a single package would make it much easier, and take care of dependencies, etc.
I really hope this approach catches on for general reproducible documents using R, and for the other engines that knitr will support.
You were not stealing from me; we were both stealing from Duncan Temple Lang and Robert Gentleman :) I made some comments on their paper in a book chapter (http://www.crcpress.com/product/isbn/9781466561595). Basically, they were hoping something like XML would rule the world, but I do not think that is going to happen... There are tons of extremely powerful authoring languages in the world, like LaTeX and XML, but I believe the "weak" languages like Markdown will win over the masses.
I like the idea. I took a similar approach with my latest paper, which we're now formatting for submission, but modified it slightly. In our case, we were working on a new software methodology, so I wanted to ensure that we had the code developed as a stand-alone package which could be reused without having to download all sorts of extra data and defunct functions.
So I released the methods/useful code in one package/repo: https://github.com/QBRC/ENA
And the research/knit-able document in another: https://github.com/QBRC/ena-research
I kind of like this approach as it allows me to set aside the functions which I may want to re-use later on down the road, but I can certainly see the benefit of having an entire project wrapped up in one package (data size permitting).