Showing posts with label rstats.

11.7.13

R Interface for Teaching

Kaitlin Thaney asked on Twitter last week about using Ramnath Vaidyanathan's new interactive R notebook for teaching.

Now, to be clear up front, I am not trying to be mean to Ramnath, discredit his work, or the effort that went into that project. I think it is really cool, and has some rather interesting potential applications, but I don't really think it is the right interface for teaching R. I would argue that the best interface for teaching R right now is RStudio. Keep reading to find out why.

IPython Notebook

First, I believe Ramnath when he says he was inspired by the IPython Notebook, which makes it very nice to do interactive, reproducible Python coding. Software Carpentry has used it very successfully to help teach Python to scientists.

However, the IPython Notebook is an interesting beast for this purpose. You are able to mix markdown blocks and code blocks. In addition, it is extremely simple to break up your calculations into units of related code, and re-run those units as needed. This is particularly useful when writing new functions, because you can put the function definition and a test that displays output in one block, and the actual computations in subsequent blocks. It makes it very easy to keep re-running the same block of code until it is correct, which allows one to interactively explore changes to functions. This is awesome for learning Python and prototyping functions.

In addition to being able to repeatedly write -> run -> modify in a loop, you can also insert prose describing what is going on in the form of markdown, a nice lightweight syntax that generates HTML. So it becomes relatively easy to document the why of something.

R Notebook

Unfortunately, the R notebook that Ramnath has put up is not quite the same beast. It is an Ace editor window coupled to an R process that knits the markdown and displays the resultant html. This is really cool, and I think will be useful in many other applications, but not for teaching in an interactive environment.

RStudio as a Teaching Environment

Let's think. We want something that lets us repeatedly write -> run -> modify on small code blocks in R, but it would be great if it were also some kind of document that could be shared and re-run.

I would argue that the editor environment in RStudio when writing R markdown (Rmd) files is the solution. R code chunks behave much the same as in the IPython Notebook, in that they are colored differently, set apart, have syntax highlighting, and can easily be run repeatedly using the code chunk menu. Everything outside of code chunks is assumed to be markdown, making it easy to insert documentation and explanation. The code from the chunks is sent to an attached R session, where results are displayed and objects can be further investigated if required.

This infrastructure supplies an interactive back and forth between editor and execution environment, with the ability to easily group together units of code.
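The pattern looks like this in practice; a minimal Rmd file (chunk name and contents are illustrative), where the prose is markdown and each chunk can be run on its own from the chunk menu:

````markdown
Some markdown prose explaining the *why* of the analysis.

```{r exploreChunk}
# re-run just this chunk from the chunk menu while iterating
x <- rnorm(100)
summary(x)
```
````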

In addition, RStudio has git integration baked in, so it becomes easy to get started with some basic version control.

Finally, RStudio is cross-platform, has tab completion among other standard IDE goodies, and it's free.

Feedback

I've gotten some feedback on Twitter about this, and I want to update this post to address it.

Hard to Install

One comment was that installing R, RStudio, and the necessary packages might be hard. True, it might be. However, I have done multiple installs of R, RStudio, Python, and the IPython Notebook on both Linux and Windows, and I would argue the level of difficulty is comparable.

Moving from Presentation to Coding

I think this is always difficult, especially if you have a PowerPoint and your code is in another application. However, the latest dev version of RStudio (download) now includes the ability to view markdown-based presentations in an attached window. This is potentially one of the nicest features for presentations that involve editing actual code.

Edit: added download links for the RStudio preview

5.6.13

Tim Horton's Density

Inspired by this post, I wanted to examine the locations and density of Tim Hortons restaurants in Canada. Using Statistics Canada census data, Foursquare is queried for Tims locations around each census dissemination area.


Setup


options(stringsAsFactors = F)
require(timmysDensity)
require(plyr)
require(maps)
require(ggplot2)
require(geosphere)

Statistics Canada Census Data


The actual Statistics Canada data at the dissemination block level can be downloaded from here. You will want to download the Excel format, open it, and then save it as delimited text using a non-standard delimiter; I used a semi-colon (;).


censusData <- read.table("../timmysData/2011_92-151_XBB_XLSX.csv", header = F, 
    sep = ";", quote = "")
censusData <- censusData[, 1:17]
names(censusData) <- c("DBuid", "DBpop2011", "DBtdwell2011", "DBurdwell2011", 
    "DBarea", "DB_ir2011", "DAuid", "DAlamx", "DAlamy", "DAlat", "DAlong", "PRuid", 
    "PRname", "PRename", "PRfname", "PReabbr", "PRfabbr")
censusData$DBpop2011 <- as.numeric(censusData$DBpop2011)
censusData$DBpop2011[is.na(censusData$DBpop2011)] <- 0

censusData$DBtdwell2011 <- as.numeric(censusData$DBtdwell2011)
censusData$DBtdwell2011[is.na(censusData$DBtdwell2011)] <- 0

From this data we get block level:


  • populations (DBpop2011)
  • total private dwellings (DBtdwell2011)
  • private dwellings occupied by usual residents (DBurdwell2011)
  • block land area (DBarea)
  • dissemination area id (DAuid)
  • representative point x coordinate in Lambert projection (DAlamx)
  • rep. point y coordinate in Lambert projection (DAlamy)
  • rep. point latitude (DAlat)
  • rep. point longitude (DAlong)

This should be everything we need to do the investigation we want.


Dissemination Area Long. and Lat.


We need to find the unique dissemination areas and extract their latitudes and longitudes for querying other databases. Note that the longitude and latitude provided here are actually population-weighted representative locations. However, given how small the areas are, I don't think using them will be a problem for Foursquare. Because the areas are what we have location data for, we will summarize everything at the area level, summing the population counts for all the blocks within an area.


uniqAreas <- unique(censusData$DAuid)

summarizeArea <- function(areaID) {
    areaData <- censusData[(censusData$DAuid == areaID), ]
    outData <- data.frame(uid = areaID, lamx = areaData[1, "DAlamx"], lamy = areaData[1, 
        "DAlamy"], lat = areaData[1, "DAlat"], long = areaData[1, "DAlong"], 
        pop = sum(areaData[, "DBpop2011"]), dwell = sum(areaData[, "DBtdwell2011"]), 
        prov = areaData[1, "PRename"])
    return(outData)
}
areaData <- adply(uniqAreas, 1, summarizeArea)
.sessionInfo <- sessionInfo()
.timedate <- Sys.time()
write.table(areaData, file = "../timmysData/areaData.txt", sep = "\t", row.names = F, 
    col.names = T)
save(areaData, .sessionInfo, .timedate, file = "../timmysData/areaDataFile.RData", 
    compress = "xz")
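The same per-area summary can also be sketched in base R with `aggregate` and `merge`; a toy data frame stands in here for the full censusData read above:

```r
# Base-R sketch of summarizeArea: sum populations and dwellings per
# dissemination area, keeping one row's coordinates per area.
toy <- data.frame(DAuid = c(1, 1, 2),
    DAlat = c(45.1, 45.1, 46.2), DAlong = c(-75.3, -75.3, -76.4),
    DBpop2011 = c(5, 7, 3), DBtdwell2011 = c(2, 3, 1))
firstRows <- toy[!duplicated(toy$DAuid), c("DAuid", "DAlat", "DAlong")]
sums <- aggregate(toy[, c("DBpop2011", "DBtdwell2011")],
    by = list(DAuid = toy$DAuid), FUN = sum)
areaSummary <- merge(firstRows, sums, by = "DAuid")
areaSummary  # one row per DA, with summed pop and dwelling counts
```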

Run queries on Foursquare


Load up the data and verify what we have.


load("../timmysData/areaDataFile.RData")
head(areaData)

Generate queries and run


For each dissemination area (DA), we will query Foursquare using the DA's latitude and longitude as the location, with the search string "tim horton".


Foursquare limits the number of userless requests to 5000 / hr. To make sure we stay under this limit, the runQueries function will only run 5000 queries an hour.


runQueries(areaData, idFile = "../timmysData/clientid.txt", secretFile = "../timmysData/clientsecret.txt", 
    outFile = "../timmysData/timmysLocs2.txt")
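The rate-limiting idea can be sketched in a few lines; throttledRun here is a hypothetical stand-in for the package's runQueries, which handles the actual Foursquare calls:

```r
# Sketch: pause for the rest of the hour after every `limit` queries.
# doQuery is a placeholder for the function doing one Foursquare request.
throttledRun <- function(queries, limit = 5000, doQuery = identity) {
    results <- vector("list", length(queries))
    for (i in seq_along(queries)) {
        if (i > 1 && (i - 1) %% limit == 0) {
            Sys.sleep(3600)  # wait out the hour before the next batch
        }
        results[[i]] <- doQuery(queries[[i]])
    }
    results
}

out <- throttledRun(as.list(1:3))  # small run, no sleeping needed
```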

Clean up the results


Due to the small size of the DAs, we end up with a lot of duplicate entries, so let's remove them.


cleanUpResults("../timmysData/timmysLocs2.txt")

Visualize Locations


First, let's read in the data and make sure that we actually have Tims locations.


# read in and clean up the data
timsLocs <- scan(file = "../timmysData/timmysLocs2.txt", what = character(), 
    sep = "\n")
timsLocs <- strsplit(timsLocs, ":")

timsName <- sapply(timsLocs, function(x) {
    x[1]
})
timsLat <- sapply(timsLocs, function(x) {
    x[2]
})
timsLong <- sapply(timsLocs, function(x) {
    x[3]
})

locData <- data.frame(description = timsName, lat = as.numeric(timsLat), long = as.numeric(timsLong))
hasNA <- is.na(locData[, "lat"]) | is.na(locData[, "long"])
locData <- locData[!(hasNA), ]

timsStr <- c("tim hortons", "tim horton's")

hasTims <- (grepl(timsStr[1], locData$description, ignore.case = T)) | (grepl(timsStr[2], 
    locData$description, ignore.case = T))

locData <- locData[hasTims, ]
timsLocs <- locData
rm(timsName, timsLat, timsLong, hasNA, locData, hasTims, timsStr)
.timedate <- Sys.time()
.sessionInfo <- sessionInfo()
save(timsLocs, .timedate, .sessionInfo, file = "../timmysData/timsLocs.RData", 
    compress = "xz")

Put them on a map


data(timsLocs)
data(areaDataFile)
canada <- map_data("world", "canada")

p <- ggplot(legend = FALSE) + geom_polygon(data = canada, aes(x = long, y = lat, 
    group = group)) + theme(panel.background = element_blank()) + theme(panel.grid.major = element_blank()) + 
    theme(panel.grid.minor = element_blank()) + theme(axis.text.x = element_blank(), 
    axis.text.y = element_blank()) + theme(axis.ticks = element_blank()) + xlab("") + 
    ylab("")

sp <- timsLocs[1, c("lat", "long")]

p2 <- p + geom_point(data = timsLocs[, c("lat", "long")], aes(x = long, y = lat), 
    colour = "green", size = 1, alpha = 0.5)

print(p2)

[plot of chunk mapIt: Tims locations plotted on a map of Canada]


How far??


Now let's calculate the minimum distance from each DA to a Tims location.


queryLocs <- matrix(c(timsLocs$long, timsLocs$lat), nrow = nrow(timsLocs), ncol = 2, 
    byrow = F)  # these are the tims locations
distLocs <- matrix(c(areaData$long, areaData$lat), nrow = nrow(areaData), ncol = 2, 
    byrow = F)  # the census centers
allDists <- apply(distLocs, 1, function(x) {
    min(distHaversine(x, queryLocs))  # minimum distance from each DA to any Tims location
})

From the allDists variable above, we can determine that the maximum distance any census dissemination area (DA) is from a Tim Hortons is 51.5 km (about 32 miles). This is based on distances calculated "as the crow flies", but still, that is pretty close. Assuming roads, the furthest a Canadian should have to travel for their Timmys fix is less than an hour.
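For reference, the "as the crow flies" great-circle distance used by distHaversine can be sketched directly (city coordinates are approximate):

```r
# Haversine great-circle distance; points are c(longitude, latitude)
# in degrees, result in km (mean Earth radius).
haversineKm <- function(p1, p2, radiusKm = 6371) {
    toRad <- pi / 180
    dLat <- (p2[2] - p1[2]) * toRad
    dLon <- (p2[1] - p1[1]) * toRad
    a <- sin(dLat / 2)^2 +
        cos(p1[2] * toRad) * cos(p2[2] * toRad) * sin(dLon / 2)^2
    2 * radiusKm * asin(sqrt(a))
}

d <- haversineKm(c(-75.70, 45.42), c(-79.38, 43.65))  # Ottawa to Toronto, ~350 km
```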


totPopulation <- sum(areaData$pop, na.rm = T)
lessDist <- seq(50, 51.6 * 1000, 50)  # distances are in meters, so multiply km by 1000

percPop <- sapply(lessDist, function(inDist) {
    isLess <- allDists < inDist
    sum(areaData$pop[isLess], na.rm = T)/totPopulation * 100
})

plotDistPerc <- data.frame(distance = lessDist, population = percPop, logDist = log10(lessDist))
ggplot(plotDistPerc, aes(x = logDist, y = population)) + geom_point() + xlab("Log10 Distance") + 
    ylab("% Population")

[plot of chunk percPopulation: % of population vs. log10 distance to the nearest Tims]


What gets really interesting is how much of the population lives within a given distance of a Timmys. Summing up the percentage of the population within each distance, the plot above shows that 50% of the population is within 316.2278 meters (10^2.5 m) of a Tim Hortons location.


I guess Canadians really do like their Tim Hortons Coffee (and donuts!).


Replication


All of the necessary processed data and code is available in the R package timmysDensity. You can install it using devtools. The original data files are linked in the relevant sections above.


library(devtools)
install_github("timmysDensity", "rmflight")

Caveats


I originally did this work based on a different set of data, whose original source I have not been able to locate. I have not compared these results to that data to verify their accuracy. When I do so, I will update the package, vignette, and blog post.


Posted


This work exists as the vignette of timmysDensity, on my blog, and independently as the front page of the GitHub repo.


Disclaimer


Tim Hortons was not involved in the creation or preparation of this work. I am not regularly updating the location information obtained from Foursquare; it is only valid as of May 31, 2013. All code used in preparing these results was written by me, except where code from other R packages was used. All opinions and conclusions are my own, and do not reflect the views of anyone else or any institution I may be associated with.


Other information when this vignette was generated


Session Info


sessionInfo()

## R version 3.0.0 (2013-04-03)
## Platform: x86_64-unknown-linux-gnu (64-bit)
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=C                 LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] knitr_1.2           markdown_0.5.4      geosphere_1.2-28   
## [4] sp_1.0-9            ggplot2_0.9.3.1     maps_2.3-2         
## [7] plyr_1.8            timmysDensity_0.0.3 devtools_1.2       
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_1.2-2   dichromat_2.0-0    digest_0.6.3      
##  [4] evaluate_0.4.3     formatR_0.7        grid_3.0.0        
##  [7] gtable_0.1.2       httr_0.2           labeling_0.1      
## [10] lattice_0.20-15    lubridate_1.3.0    MASS_7.3-26       
## [13] memoise_0.1        munsell_0.4        parallel_3.0.0    
## [16] proto_0.3-10       RColorBrewer_1.0-5 RCurl_1.95-4.1    
## [19] reshape2_1.2.2     RJSONIO_1.0-3      scales_0.2.3      
## [22] stringr_0.6.2      tools_3.0.0        whisker_0.3-2

Time and Date


Sys.time()

## [1] "2013-06-05 14:04:49 EDT"

30.5.13

Storing package data in custom environments

If you do R package development, sometimes you want to be able to store variables specific to your package without cluttering up the user's workspace. One way to do this is by modifying the global options, as the grDevices and parallel packages do. Sometimes this doesn't seem to work quite right (see this issue, for example).

Another way is to create an environment within your package that only package functions can see, and therefore read from and modify. You get a space to put package-specific stuff, the user can't see or modify it directly, and you just need to write functions that do the appropriate things to that environment (adding variables, reading them, etc.). This sounds great in principle, but I wasn't clear on how to do it, even after reading the help page on environments, the R documentation, or Hadley's excellent writeup. From all these sources I could glean that one can create environments, name them, modify them, and so on, but I wasn't sure how to make this work within a package.

I checked out the knitcitations package to see how it was done. When I looked, I realized that it was pretty obvious in retrospect. In zzz.R, initialize the environment, assigning it to a variable. When you need to work with the variables inside, this variable will be accessible to your package, and you simply use the get and assign functions like you would if you were doing anything on the command line.
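The pattern can be sketched in a few lines; the environment and accessor names here are illustrative, not from any particular package:

```r
# In the package's zzz.R: a private environment for package state.
# parent = emptyenv() keeps lookups from falling through to other scopes.
pkgData <- new.env(parent = emptyenv())

# Exported accessors are the only way users touch the environment.
setPkgVariable <- function(name, value) {
    assign(name, value, envir = pkgData)
}

getPkgVariable <- function(name) {
    get(name, envir = pkgData)
}
```

Because pkgData lives in the package namespace, every package function sees it, while the user's workspace stays clean.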

To make sure I had it figured out, I created a very tiny package to create a custom environment and functions for modifying it. Please feel free to examine, download, install (using devtools) and see for yourself.

I have at least two projects where I know I will use this, and I'm sure others might find it useful as well.

5.4.13

Reproducible publications as R packages... and markdown vignettes!

I hadn't really considered it before, but R packages are an interesting way to distribute reproducible papers. You have functions in the R directory, any required data files in data, a DESCRIPTION file that describes the package and all its dependencies, and finally the vignette, generally in inst/doc. Vignettes in R traditionally had to be written in LaTeX, using Sweave to mix the prose and code parts.

It is very easy to imagine the vignette being a journal article; to get the reproducible article along with the underlying data and functions, you would simply install the package. I don't know why I hadn't realized this before. It is a really neat, and potentially powerful, idea. I would not be surprised if this was actually part of the logic of incorporating Sweave vignettes into R packages.

So why isn't it more common to distribute papers as packages? I am guessing it is the requirement to use LaTeX for writing the document. I've written a vignette, and I can't say I really cared for the experience, mostly because getting the LaTeX part to work was a painful process.

However, with the release of R 3.0.0, it is now possible to define alternative vignette engines. Currently the only one I know of is knitr, which supports generating PDF from Rnw files (LaTeX with an alternative syntax to Sweave's) and HTML from Rmd (R markdown) files. Markdown is so much easier to write, and to my mind the generated HTML is much easier to read. Customizing the look with CSS is also probably more familiar to many people doing programming nowadays, another big plus.
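Wiring this up takes two pieces: a declaration in DESCRIPTION, and an engine line in the vignette source itself (the vignette filename below is illustrative):

```
# In DESCRIPTION:
VignetteBuilder: knitr
Suggests: knitr

# At the top of vignettes/myVignette.Rmd:
<!--
%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{My vignette title}
-->
```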

In addition to the LaTeX requirement, the process of changing, building, loading, and documenting packages used to be pretty cumbersome. However, Hadley Wickham has been doing a lot to change that with his devtools package, which makes it quite easy to re-build a package and load it back up. This has now been integrated into the latest versions of RStudio, making it rather easy to work on a package and immediately use any changes. In addition, the testthat package makes it easier to run automated tests, and roxygen2 makes it easy to document the custom functions used by your vignette.

So, I know I will be using Yihui's guide to switch my own package to a markdown vignette, and will probably try to do my next paper as a self-contained R package as well. How about you??

Edit: As Carl pointed out below, pandoc is very useful for converting markdown to other formats that may be required for formal submission to a journal.

28.11.12

Hive Plots Using R and Cytoscape

Hive Plots??

I found out about hive plots this past summer, and although I thought they looked incredibly useful and awesome, I didn't have a personal use for them at the time, and therefore put off doing anything with them. That recently changed when I encountered some particularly nasty hairballs of force-directed graphs. Unfortunately, the HiveR package does not create interactive hive plots (at least for 2D), and that is particularly important for me. I don't necessarily want to compare networks (one of the selling points made by Martin Krzywinski), but I do want to be able to explore the networks that I create. For that reason I have been a big fan of the RCytoscape Bioconductor package since I encountered it, as it allows me to easily create graphs in R, and then interactively and programmatically explore them in Cytoscape.

So I decided last week to see how hard it would be to generate a hive plot that could be visualized and interacted with in Cytoscape. For this example I'm going to use the data in the HiveR package, and actually use the structures already encoded, because they are useful.

Load Data

require(RCytoscape)
require(HiveR)
require(graph)
options(stringsAsFactors = F)
dataDir <- file.path(system.file("extdata", package = "HiveR"), "E_coli")
EC1 <- dot2HPD(file = file.path(dataDir, "E_coli_TF.dot"), node.inst = NULL, 
    edge.inst = file.path(dataDir, "EdgeInst_TF.csv"), desc = "E coli gene regulatory network (RegulonDB)", 
    axis.cols = rep("grey", 3))
str(EC1)
## List of 5
##  $ nodes    :'data.frame':   1597 obs. of  6 variables:
##   ..$ id    : int [1:1597] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ lab   : chr [1:1597] "pstB" "hybE" "fadE" "phnF" ...
##   ..$ axis  : int [1:1597] 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ radius: num [1:1597] 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ size  : num [1:1597] 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ color : chr [1:1597] "transparent" "transparent" "transparent" "transparent" ...
##  $ edges    :'data.frame':   3893 obs. of  4 variables:
##   ..$ id1   : int [1:3893] 932 612 932 1510 1510 413 528 652 1396 400 ...
##   ..$ id2   : int [1:3893] 832 620 51 525 797 151 5 1058 1396 1559 ...
##   ..$ weight: num [1:3893] 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ color : chr [1:3893] "red" "red" "green" "green" ...
##  $ desc     : chr "E coli gene regulatory network (RegulonDB)"
##  $ axis.cols: chr [1:3] "grey" "grey" "grey"
##  $ type     : chr "2D"
##  - attr(*, "class")= chr "HivePlotData"

Process Data

So here we have the data. The nodes element is a data frame with the id, a label describing the node, the axis the node belongs on, its radius (how far out on the axis the node sits), and a size. These are all modifiable attributes that can be changed depending on how one wants to map different pieces of data. This of course is the beauty of hive plots: the resulting networks depend on attributes that the user decides on.

In this case, we have a transcription factor regulation network. I am going to point you to the previous links as to why a normal force-directed network diagram is not really that informative for these types of networks. I'm not out to convince you that HivePlots are useful, if you don't get it from the publication and examples, then you should stop here. This is more about how to do some calculations to lay them out and work with them in Cytoscape.

Bryan has implemented some nice functions to work with this type of network and perform simple calculations to assign axes and locations based on properties of the nodes. For example, it is easy to locate nodes on an axis based on the total number of edges.

EC2 <- mineHPD(EC1, option = "rad <- tot.edge.count")
sumHPD(EC2)
##  E coli gene regulatory network (RegulonDB)
##  This hive plot data set contains 1597 nodes on 1 axes and 3893 edges.
##  It is a  2D data set.
## 
##      Axis 1 has 1597 nodes spanning radii from 1 to 434 
## 
##      Axes 1 and 1 share 3893 edges

And then we assign the axis to be plotted on based on whether a node's edges are incoming (sink), outgoing (source), or both (manager). These are the types of decisions that influence whether you get anything insightful or useful out of a hive plot, and changing these options can of course change the conclusions you will draw from a particular network.

EC3 <- mineHPD(EC2, option = "axis <- source.man.sink")
sumHPD(EC3)
##  E coli gene regulatory network (RegulonDB)
##  This hive plot data set contains 1597 nodes on 3 axes and 3893 edges.
##  It is a  2D data set.
## 
##      Axis 1 has 45 nodes spanning radii from 1 to 83 
##      Axis 2 has 1416 nodes spanning radii from 1 to 11 
##      Axis 3 has 136 nodes spanning radii from 2 to 434 
## 
##      Axes 1 and 2 share 400 edges
##      Axes 1 and 3 share 21 edges
##      Axes 3 and 2 share 3158 edges
##      Axes 3 and 3 share 314 edges

We also remove any nodes that have zero edges.

EC4 <- mineHPD(EC3, option = "remove zero edge")
## 
##   125 edges that start and end on the same point were removed

And finally, re-order the edges (I'm not sure how this affects plotting in Cytoscape).

edges <- EC4$edges
edgesR <- subset(edges, color == "red")
edgesG <- subset(edges, color == "green")
edgesO <- subset(edges, color == "orange")
edges <- rbind(edgesO, edgesG, edgesR)
EC4$edges <- edges
EC4$edges$weight = 0.5

Calculate Node Locations

In this case we have three axes, so we place them at 0, 120, and 240 degrees. However, we need to work in radians, because the conversion from polar to cartesian coordinates uses cosine and sine, which in R take radians.

r2xy <- function(inRad, inPhi) {
    x <- inRad * sin(inPhi)
    y <- inRad * cos(inPhi)
    cbind(x, y)
}

tmpDat <- EC4$nodes[, c("id", "axis", "radius")]
tmpDat$radius <- tmpDat$radius * 3  # bump it up as cytoscape coordinates are small
tmpDat$phi <- ((tmpDat$axis - 1) * 120) * (pi/180)

nodeXY <- r2xy(tmpDat$radius, tmpDat$phi)  # contains cartesian coordinates
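A quick sanity check of the convention (repeating r2xy so the snippet stands alone): with x = r * sin(phi) and y = r * cos(phi), angles are measured clockwise from the positive y-axis, so axis 1 points straight up:

```r
# r2xy as defined above: angles measured clockwise from the y-axis
r2xy <- function(inRad, inPhi) {
    x <- inRad * sin(inPhi)
    y <- inRad * cos(inPhi)
    cbind(x, y)
}

r2xy(1, 0)           # axis 1: (0, 1), straight up
r2xy(1, 2 * pi / 3)  # axis 2: (0.866, -0.5), 120 degrees clockwise
```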

Create GraphNEL

Initialize the graph with the nodes and the edges.

hiveGraph <- new("graphNEL", nodes = as.character(EC4$nodes$id), edgemode = "directed")
hiveGraph <- addEdge(as.character(EC4$edges$id1), as.character(EC4$edges$id2), 
    hiveGraph)

We also want to put information we know about the nodes and edges in the graph, so that we can modify colors and stuff based on those attributes. For example, in this case we might want to modify the node color based on the axis it is on. Using attributes means we are not stuck using the colors that we previously assigned.

nodeDataDefaults(hiveGraph, "nodeType") <- ""
attr(nodeDataDefaults(hiveGraph, "nodeType"), "class") <- "STRING"

nodeTypes <- c(`1` = "source", `2` = "man", `3` = "sink")
nodeData(hiveGraph, as.character(EC4$nodes$id), "nodeType") <- nodeTypes[as.character(EC4$nodes$axis)]

edgeDataDefaults(hiveGraph, "interactionType") <- ""
attr(edgeDataDefaults(hiveGraph, "interactionType"), "class") <- "STRING"

interactionType <- c(red = "repressor", green = "activator", orange = "dual")
edgeData(hiveGraph, as.character(EC4$edges$id1), as.character(EC4$edges$id2), 
    "interactionType") <- interactionType[EC4$edges$color]

Transfer to Cytoscape

ccHive <- CytoscapeWindow("hiveTest", hiveGraph)
displayGraph(ccHive)
## [1] "nodeType"
## [1] "label"
## [1] "interactionType"
## interactionType 
##            TRUE

Now let's move those nodes to their positions based on the hive plot calculations.

setNodePosition(ccHive, as.character(EC4$nodes$id), nodeXY[, 1], nodeXY[, 2])
fitContent(ccHive)
setDefaultNodeSize(ccHive, 5)
## [1] TRUE

And set the colors based on attributes:

nodeColors <- hcl(h = c(0, 120, 240), c = 55, l = 45)  # darker for the nodes
edgeColors <- hcl(h = c(0, 120, 60), c = 45, l = 75)  # lighter for the edges

setNodeColorRule(ccHive, "nodeType", c("source", "man", "sink"), nodeColors, 
    "lookup")
setNodeBorderColorRule(ccHive, "nodeType", c("source", "man", "sink"), nodeColors, 
    "lookup")
setEdgeColorRule(ccHive, "interactionType", c("repressor", "activator", "dual"), 
    edgeColors, "lookup")
setNodeFontSizeDirect(ccHive, as.character(EC4$nodes$id), 0)
redraw(ccHive)
fitContent(ccHive)
saveImage(ccHive, file.path(getwd(), "nonScaled.png"), "PNG")

[nonScaled.png: hive plot with unscaled axis radii]

This view doesn't help us a whole lot, unfortunately. What if we normalize the radii for each axis to use a maximum value of 100?

useMax <- 100
invisible(sapply(c(1, 2, 3), function(inAxis) {
    isCurr <- EC4$nodes$axis == inAxis
    currMax <- max(EC4$nodes$radius[isCurr])
    scaleFact <- useMax/currMax
    EC4$nodes$radius[isCurr] <<- EC4$nodes$radius[isCurr] * scaleFact
}))

tmpDat <- EC4$nodes[, c("id", "axis", "radius")]
tmpDat$radius <- tmpDat$radius * 3  # bump it up as cytoscape coordinates are small
tmpDat$phi <- ((tmpDat$axis - 1) * 120) * (pi/180)

nodeXY <- r2xy(tmpDat$radius, tmpDat$phi)

setNodePosition(ccHive, as.character(EC4$nodes$id), nodeXY[, 1], nodeXY[, 2])
fitContent(ccHive)
redraw(ccHive)
fitContent(ccHive)
saveImage(ccHive, file.path(getwd(), "scaledAxes.png"), "PNG")

[scaledAxes.png: hive plot with axes scaled to a common maximum radius]

This looks pretty awesome! And I can zoom in on it, examine it, and look at various properties! And I get the full scripting power of R if I want to do anything else with it, such as selecting sets of edges or nodes and querying who is attached to whom.

Disadvantages

We don't get the arced edges. This kind of sucks, but from what little I have done with these, it is actually not that big a deal. It would be cool if there was a way to do it, however. I do see that the web version of Cytoscape does allow you to set a value for how much "arcness" you want on an edge.

This does mean that any plot with only two axes would need special consideration. Instead of doing two axes end to end (using 180 deg), it might be better to make them parallel to each other.

With more than three axes, line crossings may become a problem. In that case, it may be worth looking to see if there are ways to tell Cytoscape in what order to draw edges. I don't know if that is possible using the XMLRPC pipe that is used by RCy.

RCy Tip

If you want to know how the image will look when saving a network to an image, use showGraphicsDetails(obj, TRUE).

Other Visualizations

Of course, I had just wrapped my head around using hive plots in my own work when I encountered ISB's BioFabric. Given how they represent networks, could we find a way to draw that in Cytoscape??

Cleanup

deleteWindow(ccHive)
## [1] TRUE

Session Info

Sys.time()
## [1] "2012-11-28 16:16:41 EST"
sessionInfo()
## R version 2.15.1 (2012-06-22)
## Platform: x86_64-pc-mingw32/x64 (64-bit)
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] HiveR_0.2-7      RCytoscape_1.6.5 XMLRPC_0.2-4     graph_1.34.0    
## [5] knitr_0.8       
## 
## loaded via a namespace (and not attached):
##  [1] BiocGenerics_0.2.0 bipartite_1.18     digest_0.5.2      
##  [4] evaluate_0.4.2     formatR_0.6        grid_2.15.1       
##  [7] igraph_0.6-2       lattice_0.20-10    MASS_7.3-18       
## [10] plyr_1.7.1         RColorBrewer_1.0-5 RCurl_1.95-1.1    
## [13] RFOC_2.0-02        rgl_0.92.892       sna_2.2-0         
## [16] splines_2.15.1     stringr_0.6.1      survival_2.36-14  
## [19] tkrgl_0.7          tnet_3.0.8         tools_2.15.1      
## [22] vegan_2.0-4        XML_3.9-4.1        xtable_1.7-0

Where

Published on Blogger. Source hosted on GitHub