28.11.12

Hive Plots Using R and Cytoscape

Hive Plots??

I found out about hive plots this past summer, and although I thought they looked incredibly useful and awesome, I didn't have a personal use for them at the time, and therefore put off doing anything with them. That recently changed when I encountered some particularly nasty hairballs of force-directed graphs. Unfortunately, the HiveR package does not create interactive hive plots (at least in 2D), and that is particularly important for me. I don't necessarily want to compare networks (one of the selling points made by Martin Krzywinski), but I do want to explore the networks that I create. For that reason I have been a big fan of the RCytoscape Bioconductor package since I encountered it, as it allows me to easily create graphs in R, and then interactively and programmatically explore them in Cytoscape.

So last week I decided to see how hard it would be to generate a hive plot that could be visualized and interacted with in Cytoscape. For this example I'm going to use the data in the HiveR package, and take advantage of the structures already encoded there, because they are useful.

Load Data

require(RCytoscape)
require(HiveR)
require(graph)
options(stringsAsFactors = F)
dataDir <- file.path(system.file("extdata", package = "HiveR"), "E_coli")
EC1 <- dot2HPD(file = file.path(dataDir, "E_coli_TF.dot"), node.inst = NULL, 
    edge.inst = file.path(dataDir, "EdgeInst_TF.csv"), desc = "E coli gene regulatory network (RegulonDB)", 
    axis.cols = rep("grey", 3))
str(EC1)
## List of 5
##  $ nodes    :'data.frame':   1597 obs. of  6 variables:
##   ..$ id    : int [1:1597] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ lab   : chr [1:1597] "pstB" "hybE" "fadE" "phnF" ...
##   ..$ axis  : int [1:1597] 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ radius: num [1:1597] 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ size  : num [1:1597] 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ color : chr [1:1597] "transparent" "transparent" "transparent" "transparent" ...
##  $ edges    :'data.frame':   3893 obs. of  4 variables:
##   ..$ id1   : int [1:3893] 932 612 932 1510 1510 413 528 652 1396 400 ...
##   ..$ id2   : int [1:3893] 832 620 51 525 797 151 5 1058 1396 1559 ...
##   ..$ weight: num [1:3893] 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ color : chr [1:3893] "red" "red" "green" "green" ...
##  $ desc     : chr "E coli gene regulatory network (RegulonDB)"
##  $ axis.cols: chr [1:3] "grey" "grey" "grey"
##  $ type     : chr "2D"
##  - attr(*, "class")= chr "HivePlotData"

Process Data

So here we have the data. The nodes element is a data frame with the id, a label describing the node, which axis the node belongs on, its radius (how far out on the axis the node should be placed), and a size. These are all modifiable attributes that can be changed depending on how one wants to map different pieces of data. This of course is the beauty of hive plots: the layout depends on node attributes that the user decides on, rather than on a force-directed algorithm.

In this case, we have a transcription factor regulation network. I will point you to the previous links as to why a normal force-directed network diagram is not really that informative for these types of networks. I'm not out to convince you that hive plots are useful; if you don't get it from the publication and examples, then you should stop here. This is more about how to do some calculations to lay them out and work with them in Cytoscape.

Bryan has implemented some nice functions to work with this type of network and perform simple calculations to assign axes and locations based on properties of the nodes. For example, it is easy to locate nodes on an axis based on the total number of edges.

EC2 <- mineHPD(EC1, option = "rad <- tot.edge.count")
sumHPD(EC2)
##  E coli gene regulatory network (RegulonDB)
##  This hive plot data set contains 1597 nodes on 1 axes and 3893 edges.
##  It is a  2D data set.
## 
##      Axis 1 has 1597 nodes spanning radii from 1 to 434 
## 
##      Axes 1 and 1 share 3893 edges

And then we assign the axis to be plotted on based on whether a node's edges are incoming (sink), outgoing (source), or both (manager). These are the types of decisions that determine whether you get anything insightful or useful out of a hive plot, and changing these options can of course change the conclusions you draw about a particular network.

EC3 <- mineHPD(EC2, option = "axis <- source.man.sink")
sumHPD(EC3)
##  E coli gene regulatory network (RegulonDB)
##  This hive plot data set contains 1597 nodes on 3 axes and 3893 edges.
##  It is a  2D data set.
## 
##      Axis 1 has 45 nodes spanning radii from 1 to 83 
##      Axis 2 has 1416 nodes spanning radii from 1 to 11 
##      Axis 3 has 136 nodes spanning radii from 2 to 434 
## 
##      Axes 1 and 2 share 400 edges
##      Axes 1 and 3 share 21 edges
##      Axes 3 and 2 share 3158 edges
##      Axes 3 and 3 share 314 edges

We also remove any zero-length edges, i.e. those that start and end on the same point.

EC4 <- mineHPD(EC3, option = "remove zero edge")
## 
##   125 edges that start and end on the same point were removed

And finally, re-order the edges by color (not sure how this would affect plotting using Cytoscape).

edges <- EC4$edges
edgesR <- subset(edges, color == "red")
edgesG <- subset(edges, color == "green")
edgesO <- subset(edges, color == "orange")
edges <- rbind(edgesO, edgesG, edgesR)
EC4$edges <- edges
EC4$edges$weight = 0.5

Calculate Node Locations

In this case we have three axes, so we are going to place them at 0, 120, and 240 degrees. However, we need to work in radians, because converting from polar to Cartesian coordinates uses cosine and sine, and R's trigonometric functions expect radians.

r2xy <- function(inRad, inPhi) {
    # sin for x and cos for y (rather than the usual cos/sin), so that
    # phi = 0 points straight up the y axis
    x <- inRad * sin(inPhi)
    y <- inRad * cos(inPhi)
    cbind(x, y)
}

tmpDat <- EC4$nodes[, c("id", "axis", "radius")]
tmpDat$radius <- tmpDat$radius * 3  # bump it up as cytoscape coordinates are small
tmpDat$phi <- ((tmpDat$axis - 1) * 120) * (pi/180)

nodeXY <- r2xy(tmpDat$radius, tmpDat$phi)  # contains cartesian coordinates
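As a quick sanity check (a standalone sketch that repeats r2xy so it runs on its own), the three axis directions come out 120 degrees apart, with axis 1 pointing straight up:

```r
# repeat r2xy so this chunk is self-contained; with sin for x and cos for y,
# phi = 0 maps to the positive y axis
r2xy <- function(inRad, inPhi) {
    x <- inRad * sin(inPhi)
    y <- inRad * cos(inPhi)
    cbind(x, y)
}

# where do the tips of three length-100 axes land?
axisXY <- r2xy(rep(100, 3), (0:2) * 120 * (pi/180))
round(axisXY, 1)
# axis 1: (0, 100), axis 2: (86.6, -50), axis 3: (-86.6, -50)
```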

Create GraphNEL

Initialize the graph with the nodes and the edges.

hiveGraph <- new("graphNEL", nodes = as.character(EC4$nodes$id), edgemode = "directed")
hiveGraph <- addEdge(as.character(EC4$edges$id1), as.character(EC4$edges$id2), 
    hiveGraph)

We also want to attach what we know about the nodes and edges to the graph, so that we can modify colors and other visual properties based on those attributes. For example, in this case we might want to set the node color based on the axis it is on. Using attributes means we are not stuck with the colors that were previously assigned.

nodeDataDefaults(hiveGraph, "nodeType") <- ""
attr(nodeDataDefaults(hiveGraph, "nodeType"), "class") <- "STRING"

nodeTypes <- c(`1` = "source", `2` = "man", `3` = "sink")
nodeData(hiveGraph, as.character(EC4$nodes$id), "nodeType") <- nodeTypes[as.character(EC4$nodes$axis)]

edgeDataDefaults(hiveGraph, "interactionType") <- ""
attr(edgeDataDefaults(hiveGraph, "interactionType"), "class") <- "STRING"

interactionType <- c(red = "repressor", green = "activator", orange = "dual")
edgeData(hiveGraph, as.character(EC4$edges$id1), as.character(EC4$edges$id2), 
    "interactionType") <- interactionType[EC4$edges$color]

Transfer to Cytoscape

ccHive <- CytoscapeWindow("hiveTest", hiveGraph)
displayGraph(ccHive)
## [1] "nodeType"
## [1] "label"
## [1] "interactionType"
## interactionType 
##            TRUE

Now let's move those nodes to their positions based on the hive plot calculations.

setNodePosition(ccHive, as.character(EC4$nodes$id), nodeXY[, 1], nodeXY[, 2])
fitContent(ccHive)
setDefaultNodeSize(ccHive, 5)
## [1] TRUE

And set the colors based on attributes:

nodeColors <- hcl(h = c(0, 120, 240), c = 55, l = 45)  # darker for the nodes
edgeColors <- hcl(h = c(0, 120, 60), c = 45, l = 75)  # lighter for the edges

setNodeColorRule(ccHive, "nodeType", c("source", "man", "sink"), nodeColors, 
    "lookup")
setNodeBorderColorRule(ccHive, "nodeType", c("source", "man", "sink"), nodeColors, 
    "lookup")
setEdgeColorRule(ccHive, "interactionType", c("repressor", "activator", "dual"), 
    edgeColors, "lookup")
setNodeFontSizeDirect(ccHive, as.character(EC4$nodes$id), 0)
redraw(ccHive)
fitContent(ccHive)
saveImage(ccHive, file.path(getwd(), "nonScaled.png"), "PNG")

(Figure: the resulting hive plot, nonScaled.png)

This view doesn't help us a whole lot, unfortunately. What if we normalize the radii for each axis to use a maximum value of 100?

useMax <- 100
invisible(sapply(c(1, 2, 3), function(inAxis) {
    isCurr <- EC4$nodes$axis == inAxis
    currMax <- max(EC4$nodes$radius[isCurr])
    scaleFact <- useMax/currMax
    EC4$nodes$radius[isCurr] <<- EC4$nodes$radius[isCurr] * scaleFact
}))
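The same per-axis rescaling can also be done without the `<<-` assignment; here is a minimal sketch using base R's ave() on toy radius and axis vectors (made up for illustration, not the actual HiveR data):

```r
# toy example: rescale radii so each group (axis) reaches a maximum of 100;
# ave() replaces each value with the group maximum, computed per axis
radius <- c(10, 20, 5, 50)
axisID <- c(1, 1, 2, 2)
scaled <- radius / ave(radius, axisID, FUN = max) * 100
scaled
# 50 100 10 100
```

This is equivalent to the sapply loop above, just vectorized over the grouping variable.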

tmpDat <- EC4$nodes[, c("id", "axis", "radius")]
tmpDat$radius <- tmpDat$radius * 3  # bump it up as cytoscape coordinates are small
tmpDat$phi <- ((tmpDat$axis - 1) * 120) * (pi/180)

nodeXY <- r2xy(tmpDat$radius, tmpDat$phi)

setNodePosition(ccHive, as.character(EC4$nodes$id), nodeXY[, 1], nodeXY[, 2])
fitContent(ccHive)
redraw(ccHive)
fitContent(ccHive)
saveImage(ccHive, file.path(getwd(), "scaledAxes.png"), "PNG")

(Figure: the hive plot with scaled axes, scaledAxes.png)

This looks pretty awesome! And I can zoom in on it, and examine it, and look at various properties! And I get the full scripting power of R if I want to do anything else with it, such as selecting sets of edges or nodes and then querying who is attached to whom.

Disadvantages

We don't get the arced edges. This kind of sucks, but from what little I have done with these, it is actually not that big a deal. It would be cool if there were a way to do it, however. I do see that the web version of Cytoscape does allow you to set a value for how much “arcness” you want on an edge.

This does mean that any plot with only two axes would need special consideration. Instead of doing two axes end to end (using 180 deg), it might be better to make them parallel to each other.
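A hedged sketch of what that might look like: two vertical axes side by side, with a made-up axisGap spacing parameter (parallelAxisXY is my own illustration, not part of HiveR or RCytoscape):

```r
# hypothetical layout for a two-axis hive plot: instead of placing the axes
# 180 degrees apart, run both vertically, offset horizontally by axisGap
parallelAxisXY <- function(radius, axisID, axisGap = 200) {
    x <- ifelse(axisID == 1, 0, axisGap)  # axis 1 at x = 0, axis 2 at x = axisGap
    y <- radius                           # radius runs straight up each axis
    cbind(x, y)
}

parallelAxisXY(c(10, 50, 30), c(1, 1, 2))
# x: 0 0 200, y: 10 50 30
```

The resulting coordinates could be handed to setNodePosition exactly as above.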

With more than three axes, line crossings may become a problem. In that case, it may be worth looking to see if there are ways to tell Cytoscape in what order to draw edges. I don't know if that is possible using the XMLRPC pipe that is used by RCy.

RCy Tip

If you want to know how the image will look when saving a network to an image, use showGraphicsDetails(obj, TRUE).

Other Visualizations

Of course, I had just wrapped my head around using hive plots in my own work when I encountered ISB's BioFabric. Given how they represent networks, could we find a way to draw that in Cytoscape??

Cleanup

deleteWindow(ccHive)
## [1] TRUE

Session Info

Sys.time()
## [1] "2012-11-28 16:16:41 EST"
sessionInfo()
## R version 2.15.1 (2012-06-22)
## Platform: x86_64-pc-mingw32/x64 (64-bit)
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] HiveR_0.2-7      RCytoscape_1.6.5 XMLRPC_0.2-4     graph_1.34.0    
## [5] knitr_0.8       
## 
## loaded via a namespace (and not attached):
##  [1] BiocGenerics_0.2.0 bipartite_1.18     digest_0.5.2      
##  [4] evaluate_0.4.2     formatR_0.6        grid_2.15.1       
##  [7] igraph_0.6-2       lattice_0.20-10    MASS_7.3-18       
## [10] plyr_1.7.1         RColorBrewer_1.0-5 RCurl_1.95-1.1    
## [13] RFOC_2.0-02        rgl_0.92.892       sna_2.2-0         
## [16] splines_2.15.1     stringr_0.6.1      survival_2.36-14  
## [19] tkrgl_0.7          tnet_3.0.8         tools_2.15.1      
## [22] vegan_2.0-4        XML_3.9-4.1        xtable_1.7-0

Where

Published on Blogger. Source hosted on Github

17.11.12

Calibre, Python, reading papers in e-ink

Yesterday I set up the excellent iPython Notebook on my Windows machine. This is essentially an interactive web interface to the Python shell that lets you record everything you have done, and mark it up using Markdown and MathJax. In many ways, it is very similar to using RStudio, RMarkdown, and knitr to generate Markdown and html reports from R.

I don't actually do any of my scientific coding in Python, but that may change. My motivation for wanting to learn some Python comes from the fact that Calibre is written in Python. Calibre provides a nice method for taking RSS feeds, parsing them, and spitting out the results as something an e-reader can understand (and really, it supports many different e-reading platforms, including ePub and Kindle formats).

Although I have been using my first generation iPad to read scientific publications from PDF for 2 1/2 years now, the recent experiment of Genome Biology providing the ENCODE publications as ePub made me try reading scientific publications on my 3rd gen Kindle. I loved it! Even without the color figures, and the rather small screen, the experience was simply amazing. Especially given that many papers I am reading more for information intake than for marking up, it really works. And if I really need to mark up a Kindle doc, I can use the Kindle app on my iPad, or my computer. But highlighting works well. I can even retrieve the highlights and notes from the text file that holds them on the Kindle itself.

Most scientific publications are made available as HTML pages, or PDF. So in theory, we should be able to easily generate an ePub or Kindle format using Calibre from the raw HTML. However, for some reason the powers that be in the e-journal publishing world decided that figures and tables should not actually be part of the document. I really don't know why, because they are in the PDF, and it is not that hard to do in the HTML (see an example paper I did here in HTML).

What this ultimately means is that to generate an e-reader compatible document, we need to actually modify the HTML. We need to go in, find the elements that tell us where the figure and table pages are, parse them, and get the actual files. I actually figured out how to do this for at least one journal in R using the XML package, and was going to create a package that would take a series of links or DOIs and process them.

But I get many of my papers by RSS feed for specific journals. And Calibre, as I already mentioned, has some nice functions for automatically processing RSS feeds and generating e-reader compatible docs from them. And Calibre is written in Python, and new RSS processing recipes are written in Python. Therefore, I guess I'm going to learn me some Python!

10.10.12

sfLapply vs lapply in R

I've been using the snowfall package for a while to enable parallel processing in R on my Windows machine, where I have an 8 core processor. I discovered today that the function sfLapply will only work with an object that has a class of list. This is really important, because there are many objects that are “list-like” and behave like lists, but don't actually have class list, and sfLapply won't accept them.

Let's whip up an example.

require(snowfall)
require(GenomicRanges)

gr <- GRanges(seqnames = Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)), 
    ranges = IRanges(1:10, width = 10:1, names = head(letters, 10)), strand = Rle(strand(c("-", 
        "+", "*", "+", "-")), c(1, 2, 2, 3, 2)), score = 1:10, GC = seq(1, 0, 
        length = 10))

gr
## GRanges with 10 ranges and 2 elementMetadata cols:
##     seqnames    ranges strand |     score                GC
##        <Rle> <IRanges>  <Rle> | <integer>         <numeric>
##   a     chr1  [ 1, 10]      - |         1                 1
##   b     chr2  [ 2, 10]      + |         2 0.888888888888889
##   c     chr2  [ 3, 10]      + |         3 0.777777777777778
##   d     chr2  [ 4, 10]      * |         4 0.666666666666667
##   e     chr1  [ 5, 10]      * |         5 0.555555555555556
##   f     chr1  [ 6, 10]      + |         6 0.444444444444444
##   g     chr3  [ 7, 10]      + |         7 0.333333333333333
##   h     chr3  [ 8, 10]      + |         8 0.222222222222222
##   i     chr3  [ 9, 10]      - |         9 0.111111111111111
##   j     chr3  [10, 10]      - |        10                 0
##   ---
##   seqlengths:
##    chr1 chr2 chr3
##      NA   NA   NA

class(gr)
## [1] "GRanges"
## attr(,"package")
## [1] "GenomicRanges"

grList <- split(gr, seqnames(gr))
grList
## GRangesList of length 3:
## $chr1 
## GRanges with 3 ranges and 2 elementMetadata cols:
##     seqnames    ranges strand |     score                GC
##        <Rle> <IRanges>  <Rle> | <integer>         <numeric>
##   a     chr1   [1, 10]      - |         1                 1
##   e     chr1   [5, 10]      * |         5 0.555555555555556
##   f     chr1   [6, 10]      + |         6 0.444444444444444
## 
## $chr2 
## GRanges with 3 ranges and 2 elementMetadata cols:
##     seqnames  ranges strand | score                GC
##   b     chr2 [2, 10]      + |     2 0.888888888888889
##   c     chr2 [3, 10]      + |     3 0.777777777777778
##   d     chr2 [4, 10]      * |     4 0.666666666666667
## 
## $chr3 
## GRanges with 4 ranges and 2 elementMetadata cols:
##     seqnames   ranges strand | score                GC
##   g     chr3 [ 7, 10]      + |     7 0.333333333333333
##   h     chr3 [ 8, 10]      + |     8 0.222222222222222
##   i     chr3 [ 9, 10]      - |     9 0.111111111111111
##   j     chr3 [10, 10]      - |    10                 0
## 
## ---
## seqlengths:
##  chr1 chr2 chr3
##    NA   NA   NA
class(grList)
## [1] "GRangesList"
## attr(,"package")
## [1] "GenomicRanges"

So we have the GRanges object gr, and a GRangesList in grList. Now let's try some parallel execution, finding the overlaps of each list element with the original object.

This is the function we will use in parallel:

returnOverlaps <- function(inObj1, inObj2) {
    findOverlaps(inObj1, inObj2, type = "any")
}
sfInit(parallel = T, cpus = 2)
## Warning: Unknown option on commandline: options(encoding
## R Version:  R version 2.15.0 (2012-03-30)
## snowfall 1.84 initialized (using snow 0.3-10): parallel execution on 2
## CPUs.
sfLibrary(GenomicRanges)
## Library GenomicRanges loaded.
## Library GenomicRanges loaded in cluster.
## Warning: 'keep.source' is deprecated and will be ignored

overlap <- sfLapply(grList, returnOverlaps, gr)
## Error: 2 nodes produced errors; first error: no method for coercing this
## S4 class to a vector

sfStop()
## Stopping cluster

OK, we get the error no method for coercing this S4 class to a vector. It seemed kind of cryptic, at least to me. What about using normal lapply?

overlap <- lapply(grList, returnOverlaps, gr)

overlap
## $chr1
## Hits of length 7
## queryLength: 3
## subjectLength: 10
##   queryHits subjectHits 
##    <integer>   <integer> 
##  1         1           1 
##  2         1           5 
##  3         2           1 
##  4         2           5 
##  5         2           6 
##  6         3           5 
##  7         3           6 
## 
## $chr2
## Hits of length 9
## queryLength: 3
## subjectLength: 10
##   queryHits subjectHits 
##    <integer>   <integer> 
##  1         1           2 
##  2         1           3 
##  3         1           4 
##  4         2           2 
##  5         2           3 
##  6         2           4 
##  7         3           2 
##  8         3           3 
##  9         3           4 
## 
## $chr3
## Hits of length 8
## queryLength: 4
## subjectLength: 10
##   queryHits subjectHits 
##    <integer>   <integer> 
##  1         1           7 
##  2         1           8 
##  3         2           7 
##  4         2           8 
##  5         3           9 
##  6         3          10 
##  7         4           9 
##  8         4          10

This works without any errors. Odd. It was only when I was trying to get this to work using llply from the plyr package that I saw a message about as.default.list or something like that. So maybe we have to convert the grList to a good and proper list first?

sfInit(parallel = T, cpus = 2)
## Warning: Unknown option on commandline: options(encoding
## snowfall 1.84 initialized (using snow 0.3-10): parallel execution on 2
## CPUs.
sfLibrary(GenomicRanges)
## Library GenomicRanges loaded.
## Library GenomicRanges loaded in cluster.
## Warning: 'keep.source' is deprecated and will be ignored

grList <- as.list(grList)
overlap2 <- sfLapply(grList, returnOverlaps, gr)

sfStop()
## Stopping cluster

overlap2
## $chr1
## Hits of length 7
## queryLength: 3
## subjectLength: 10
##   queryHits subjectHits 
##    <integer>   <integer> 
##  1         1           1 
##  2         1           5 
##  3         2           1 
##  4         2           5 
##  5         2           6 
##  6         3           5 
##  7         3           6 
## 
## $chr2
## Hits of length 9
## queryLength: 3
## subjectLength: 10
##   queryHits subjectHits 
##    <integer>   <integer> 
##  1         1           2 
##  2         1           3 
##  3         1           4 
##  4         2           2 
##  5         2           3 
##  6         2           4 
##  7         3           2 
##  8         3           3 
##  9         3           4 
## 
## $chr3
## Hits of length 8
## queryLength: 4
## subjectLength: 10
##   queryHits subjectHits 
##    <integer>   <integer> 
##  1         1           7 
##  2         1           8 
##  3         2           7 
##  4         2           8 
##  5         3           9 
##  6         3          10 
##  7         4           9 
##  8         4          10

It works! I'm not exactly sure why there is this difference, but I thought perhaps I could save someone else a few hours figuring it out.
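My best guess at the difference: lapply() coerces its first argument with as.list() before looping, while the parallel version has to split the object up and ship the pieces to the workers, and splitting a list-like S4 object that lacks a list coercion method is presumably what fails. The implicit coercion that lapply performs is easy to demonstrate with plain base R:

```r
# lapply happily takes a non-list because it coerces with as.list() internally;
# doing the coercion explicitly gives the identical result, which is exactly
# the manual as.list() step that sfLapply needed above
x <- 1:3                                 # an integer vector, not a list
res1 <- lapply(x, function(i) i * 2)     # implicit coercion inside lapply
res2 <- lapply(as.list(x), function(i) i * 2)  # explicit coercion
identical(res1, res2)
# TRUE
```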

SessionInfo:

sessionInfo()

R version 2.15.0 (2012-03-30)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] GenomicRanges_1.8.13 IRanges_1.14.4       BiocGenerics_0.2.0  
[4] snowfall_1.84        snow_0.3-10          knitr_0.8.1         

loaded via a namespace (and not attached):
[1] digest_0.5.2   evaluate_0.4.2 formatR_0.6    plyr_1.7.1    
[5] stats4_2.15.0  stringr_0.6.1  tools_2.15.0  

Posted on Blogger. Rmd, md

9.10.12

Writing papers using R Markdown

I have been watching the activity in RStudio and knitr for a while, and have even been using Rmd (R markdown) files in my own work as a way to easily provide commentary on an actual dataset analysis. Yihui has proposed writing papers in markdown and posting them to a blog as a way to host a statistics journal, and lots of people are now using knitr as a way to create reproducible blog posts that include code (including yours truly).

The idea of writing a paper that includes the necessary code to perform the analysis, is readable in its raw form, and that someone else could actually run was pretty appealing. Unfortunately, I had not had the time or opportunity to try it until recently, when our group submitted a conference paper that included a lot of analysis in R and seemed like the perfect opportunity. (I will link to the paper here when I hear more, or get clearance from my PI.) Originally we wrote the paper in Microsoft® Word, but after submission I decided to see what it would have taken to write it as an Rmd document that could then generate markdown or html.

It turned out that it was not that hard, but it did force me to do some things differently. This is what I want to discuss here.

Advantages

I actually found it much easier to have the text together with the analysis (in contrast to keeping them separate in a Word document), and upon doing the conversion I discovered some numerical errors that had crept in from copying numerical results over by hand (being able to insert variables directly into the text is a very nice thing). In addition, the Word template for the submission didn't play nice with automatic table and figure numbering, so our table and figure numbers got messed up in the submission. Overall, I'd say it worked out better with the Rmd file, even though I had to write functions to handle table and figure numbering properly myself (see below).

Tables and Figures

As I'm sure most of you know, Word (and other WYSIWYG editors) can keep track of your object numbers; this is especially nice for keeping your figure and table numbers straight. Of course, there is no such ability built into a static text file, but I found it easy to write a couple of functions for this. The way I came up with is to have a variable that contains a label for each figure or table, a function that increments the counter when new figures or tables are added, and a function that prints the number associated with a particular label. This does require a bit of forethought on the part of the writer, because you may have to add a table or figure label to the variable long before you actually create the object, but as long as you use sane (i.e. descriptive) labels, it shouldn't be a big deal. Let me show you what I mean.

Counting

incCount <- function(inObj, useName) {
    nObj <- length(inObj)
    useNum <- max(inObj) + 1
    inObj <- c(inObj, useNum)
    names(inObj)[nObj + 1] <- useName
    inObj
}
figCount <- c(`_` = 0)
tableCount <- c(`_` = 0)

The incCount function is very simple: it takes an object, checks the maximum count, and then appends an incremented value with the supplied name. In this example, I initialized the figCount and tableCount objects with a nonsensical named value of zero.

Now in the process of writing, I decide I'm going to need a table on the amount of time spent by post-docs writing blog posts in different years of their post-doc training. Let's call this t.blogPostDocs. Notice that this is a fairly descriptive name. We can assign it a number like so:

tableCount <- incCount(tableCount, "t.blogPostDocs")
tableCount
##              _ t.blogPostDocs 
##              0              1

Inserting

So now we have a variable with a named number we can refer to. But how do we insert it into the text? We are going to use another function that will let us insert either the text with a link, or just the text itself.

pasteLabel <- function(preText, inObj, objName, insLink = TRUE) {
    objNum <- inObj[objName]

    useText <- paste(preText, objNum, sep = " ")
    if (insLink) {
        useText <- paste("[", useText, "](#", objName, ")", sep = "")
    }
    useText
}

This function allows us to insert the table number like so:

r I(pasteLabel("Table", tableCount, "t.blogPostDocs"))

This would be inserted into a normal inline code block. The I makes sure that the text will appear as normal text, and not get formatted as a code block. The default behavior is to insert as a relative link, thereby enabling the use of relative links to link where a table / figure is mentioned to its actual location. For example, we can insert the anchor link like so:

<a id="t.blogPostDocs"></a>

Markdown Tables

Followed by the actual table text. This brings up the subject of markdown tables. I also wrote a function (thanks to Yihui again) that transforms a normal R data.frame into a markdown table.

tableCat <- function(inFrame) {
    outText <- paste(names(inFrame), collapse = " | ")
    outText <- c(outText, paste(rep("---", ncol(inFrame)), collapse = " | "))
    invisible(apply(inFrame, 1, function(inRow) {
        outText <<- c(outText, paste(inRow, collapse = " | "))
    }))
    return(outText)
}

Let's see it in action.

postDocBlogs <- data.frame(PD = c("p1", "p2", "p3"), NBlog = c(4, 10, 2), Year = c(1, 
    4, 2))
postDocBlogs
##   PD NBlog Year
## 1 p1     4    1
## 2 p2    10    4
## 3 p3     2    2

postDocInsert <- tableCat(postDocBlogs)
postDocInsert
## [1] "PD | NBlog | Year" "--- | --- | ---"   "p1 |  4 | 1"      
## [4] "p2 | 10 | 4"       "p3 |  2 | 2"

To actually insert it into the text, use a code chunk with results='asis' and echo=FALSE.

cat(postDocInsert, sep = "\n")
PD NBlog Year
p1 4 1
p2 10 4
p3 2 2

Before inserting the table though, you might want an inline code with the table number and caption, like this:

I(pasteLabel("Table", tableCount, "t.blogPostDocs", FALSE)) This is the number of blog posts and year of training for post-docs.

Table 1 This is the number of blog posts and year of training for post-docs.

Remember for captions to set the insLink variable to FALSE so that you don't generate a link from the caption.

Figures

Oftentimes, you will have code that generates the figure, and then you want to insert the figure at a different point. This is accomplished by the judicious use of echo and include chunk options.

For example, we can create a ggplot2 figure and store it in a variable in one chunk, and then print it in a later chunk to actually insert it into the text body.

require(ggplot2)
plotData <- data.frame(x = rnorm(1000, 1, 5), y = rnorm(1000, 0, 2))
plotKeep <- ggplot(plotData, aes(x = x, y = y)) + geom_point()
figCount <- incCount(figCount, "f.randomFigure")

And now we decide to actually insert it using print(plotKeep) with the option of echo=FALSE:

(Figure: plot of chunk figureInsert)

Figure 1. A random figure.

Numerical result formatting

When R prints a number, it normally likes to do so with lots of digits. This is probably not what you want, either in a table or when reporting a number in a sentence. You can control it with the format function, which returns the number as a character string with the requested number of significant digits; you can save that string to a variable, or call format directly on a value inline.
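For example (with a made-up number):

```r
x <- 10/3
x                                     # prints 3.333333
format(x, digits = 3)                 # returns the character string "3.33"
formatC(x, digits = 2, format = "f")  # fixed notation, 2 decimal places: "3.33"
```

Inline, something like r format(x, digits = 3) drops the trimmed number straight into a sentence.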

Echo and Include

This brings up the issue of how to keep the code from appearing in the text body. I found that, depending on the particulars, either echo=FALSE or include=FALSE would do the job. This is meant to be a reproducible paper, but a paper nonetheless, and therefore the code should not end up in the text body.

References

One thing I haven't done yet is convert all the references. I am planning to try using the knitcitations package. I will probably post on that experience.

HTML generation

Because I use RStudio, I set up a modified function for generating a full html version of the paper, changing the default RStudio markdown render options like so:

require(markdown)  # provides markdownHTMLOptions and markdownToHTML
htmlOptions <- markdownHTMLOptions(defaults = TRUE)
htmlOptions <- htmlOptions[htmlOptions != "hard_wrap"]
markdownToHTML(inputFile, outputFile, options = htmlOptions)

This should be added to a .Rprofile file either in your home directory or in the directory you start R in (this is especially useful for modification on a per-project basis).

I do this because when I write my documents, I want the source to be readable online. If this is a GitHub-hosted repo, that means being displayed in the GitHub file browser, which does not do line wrapping. So I set a 120-character line width in my editor, and try very hard to stick to that.

Function source

You can find the previously mentioned functions in a github gist here.

Post source

The source files for this blog post can be found at: Rmd, md, and html.

Posted on October 9, 2012, at http://robertmflight.blogspot.com/2012/10/writing-papers-using-r-markdown.html

Edit: added section on formatting numerical results

Edit: added session info

R version 2.15.0 (2012-03-30)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggplot2_0.9.2.1 knitr_0.8.1    

loaded via a namespace (and not attached):
 [1] colorspace_1.1-1   dichromat_1.2-4    digest_0.5.2      
 [4] evaluate_0.4.2     formatR_0.6        grid_2.15.0       
 [7] gtable_0.1.1       labeling_0.1       MASS_7.3-21       
[10] memoise_0.1        munsell_0.4        plyr_1.7.1        
[13] proto_0.3-9.2      RColorBrewer_1.0-5 reshape2_1.2.1    
[16] scales_0.2.2       stringr_0.6.1      tools_2.15.0