16.8.12

Show me yours

romunov at “danganothererror” recently posted about his personal setup for working with R and challenged others to share theirs. Here is mine.

I use RStudio, maximized on one monitor of a two-monitor setup. This gives me multiple editor panes for scripts, function writing, and package development; an integrated R console; workspace and history browsers; and panes for files, plots, packages, and help. For working on multiple projects, I use RStudio's project feature, which keeps project-specific information (directory, saved sessions if you want them, integrated Git repos), together with multiple desktops via Dexpot on Windows.

RStudio also has Markdown-to-HTML support built in, and they are adding a bunch of package-development support, using a lot of the work Hadley Wickham has done with the excellent devtools package.

I like it a lot, and much prefer it to my previous Notepad++ / NppToR / R GUI setup.

Source markdown at https://github.com/rmflight/blogPosts/blob/master/showmeyours.md

Posted at: http://robertmflight.blogspot.com/2012/08/show-me-yours.html

15.8.12

Loving Markdown!

Ok, so for those who don't know, the folks at RStudio recently teamed up with Yihui to add some really nice report-authoring options to RStudio, using the knitr package's ability to turn a combination of Markdown and R code into HTML.

I have to admit, this has really changed how I work. Previously, I generally had R scripts that I would run, then summarize the results in a separate document as a report on what I had done. I know, many like to talk about Sweave, the literate-programming format R uses to generate vignettes demonstrating package functionality, but have you ever tried to write a Sweave document?

You need to know a fair amount of LaTeX, and even then it can be difficult to get the output you want. In addition, reading the raw file can be quite painful (I know; I have my own Bioconductor package that I wrote a Sweave vignette for).

Writing R Markdown documents just feels different. When I read the raw source of a Markdown document, I can actually read it, code and all. What is really sweet is that instead of describing what I am doing in code comments, I write it out in full in the document and let the code blocks do the actual calculations. Best of all, to regenerate the report I simply re-knit it to produce a new HTML file.

It is so much easier to work with that I will probably even switch how I write my blog posts, using a Markdown document as the source. For now, that means writing a .md file, converting it to HTML with the markdown package, and pasting the HTML into Blogger. You can see a good explanation of that process on Jeffrey Horner's blog here and here.

When I combine this with a github repo for storage, it also means I have some other place to keep the raw source of my blog posts, as well as easily read and edit the text. For example, you can read the raw text that was used for this post.

Source of this post is at https://github.com/rmflight/blogPosts/blob/master/rmarkdown_post_150812.md. Published at http://robertmflight.blogspot.com/2012/08/loving-markdown.html

Journal Club: 15.08.12

I just came back from our bioinformatics group's journal club (a rather loose association of various researchers at UofL interested in and doing bioinformatics), where we discussed this recent paper:

Google Goes Cancer: Improving Outcome Prediction for Cancer Patients by Network-Based Ranking of Marker Genes

Besides the catchy title that makes one believe that perhaps Google is getting into cancer research (maybe they are and we don't know it yet), there were some interesting aspects to this paper.

Premise

The premise is that combining gene expression data with network data yields better associations between gene expression and a particular disease endpoint. This is carried out using the TRANSFAC transcription factor–target gene database as the network, the correlation of each gene's expression with disease status as the measure of that gene's importance to the disease, and Google's PageRank as the means of transferring network knowledge to the gene expression data. They call their method NetRank.

Note that the general idea had already been tried in this paper on GeneRank.

Implementation

Rank the genes against disease status (poor or good prognosis) using one of several methods (SAM, t-test, fold-change, correlation, NetRank). Pick the n top genes, and develop a predictive model using a support vector machine. Wash, rinse, repeat several times to find the best set, varying the number of top genes and the number of samples used in the training set.
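The wash-rinse-repeat loop above can be sketched in base R. This is not the paper's pipeline (they used an SVM; a simple nearest-centroid classifier stands in here, and the data are simulated), just the shape of per-split feature selection plus repeated evaluation:

```r
# Simulated data: `expr` is genes x samples, `status` is a 0/1 prognosis vector,
# with 10 genuinely informative genes. All names and numbers here are made up.
set.seed(42)
nGenes <- 200; nSamp <- 60
expr <- matrix(rnorm(nGenes * nSamp), nrow = nGenes)
status <- rep(c(0, 1), each = nSamp / 2)
expr[1:10, status == 1] <- expr[1:10, status == 1] + 2

mcCV <- function(expr, status, nTop = 10, nIter = 50, trainFrac = 0.7) {
  acc <- replicate(nIter, {
    train <- sample(length(status), floor(trainFrac * length(status)))
    # rank genes by absolute correlation with outcome, on training data only
    sc <- abs(apply(expr[, train], 1, cor, y = status[train]))
    top <- order(sc, decreasing = TRUE)[1:nTop]
    # nearest-centroid prediction on the held-out samples
    c0 <- rowMeans(expr[top, train][, status[train] == 0, drop = FALSE])
    c1 <- rowMeans(expr[top, train][, status[train] == 1, drop = FALSE])
    test <- setdiff(seq_along(status), train)
    pred <- apply(expr[top, test, drop = FALSE], 2, function(x)
      as.numeric(sum((x - c1)^2) < sum((x - c0)^2)))
    mean(pred == status[test])
  })
  mean(acc)
}
mcCV(expr, status)
```

The important detail is that gene ranking happens inside each split, on training samples only; ranking on all the data first would leak information into the test sets.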

For NetRank, the top genes were chosen using a sub-optimization over d, the damping factor in the PageRank algorithm that determines how much information is transferred between connected genes. The best value of d determined in this study was 0.3.

All other methods used just the 8000 genes that passed filtering, but NetRank used all the genes on the array, with those that were filtered out having their initial correlations set to 0, so that they remained in the network representation.
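My reading of the NetRank idea can be sketched as a personalized-PageRank power iteration in base R. This is not the authors' code; the toy network and correlation values are made up, and d = 0.3 matches the value reported above:

```r
# NetRank-style iteration: r = (1 - d) * r0 + d * W %*% r, where r0 holds the
# raw gene-disease correlations and W is the column-normalized adjacency matrix.
netRank <- function(adj, r0, d = 0.3, tol = 1e-8, maxIter = 100) {
  deg <- pmax(colSums(adj), 1)      # avoid dividing by zero for isolated genes
  W <- sweep(adj, 2, deg, "/")      # each gene spreads its rank evenly to neighbors
  r <- r0
  for (i in seq_len(maxIter)) {
    rNew <- (1 - d) * r0 + d * (W %*% r)
    if (max(abs(rNew - r)) < tol) break
    r <- rNew
  }
  drop(r)
}

# Toy example: 4 genes in a chain; gene 1 has the strongest raw correlation,
# so its neighbor (gene 2) picks up rank through the network.
adj <- matrix(0, 4, 4)
adj[1, 2] <- adj[2, 1] <- adj[2, 3] <- adj[3, 2] <- adj[3, 4] <- adj[4, 3] <- 1
r0 <- c(0.9, 0.1, 0.1, 0.1)
netRank(adj, r0)
```

With d = 0 this reduces to the raw correlations; as d grows, more of a gene's score comes from its neighbors, which is exactly the knob the sub-optimization tunes.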

(Figure: Monte Carlo cross-validation scheme)

Did it work?

From the paper, it appears to have worked. Using Monte Carlo cross-validation, they were able to achieve prediction accuracies over 70%, better than any of the other methods they used to associate genes with the disease, including SAM, t-test, fold-change, and raw correlation.

(Figure: NetRank feature selection performance)

Issues

As we discussed the article, some questions did come up.

  1. What was the variation in d depending on the size of the training set?
  2. How consistent were the genes that came out as biomarkers?
    • It would be nice to try this methodology on a series of independent but related cancer datasets (e.g. breast or lung cancer) and see how consistent the lists are. This was done here.
  3. What happens if the genes that don't pass filtering are removed from the network entirely?
  4. Were the problems reported with unfiltered genes due to having only two disease states (poor and good prognosis) against which to calculate the correlation of expression?
  5. How many iterations does it take to achieve convergence?
  6. The genes they came up with are fairly well-known cancer genes. We were kind of surprised that they didn't seem to come up with novel genes associated directly with pancreatic cancer.
  7. Why is d so variable depending on the cancer examined?

Things to try

  • Could we improve on this by looking for the top-ranked cliques instead of just the top-ranked genes, i.e. take the top gene, remove everything in its immediate neighborhood, and then go on to the next one?
  • What would happen if we used a directed network based on connected Reactome or KEGG pathways?

The markdown source of this post is here.

13.7.12

Creating custom CDF for Affy chips in R / Bioconductor
What?

For those who don't know, CDF (chip definition file) files define which probes belong to which probesets, and they are necessary for any of the standard summarization methods such as RMA.

Why?

Because we can, and because custom definitions have been shown to be quite useful. See the information over at Brainarray.

Why not somewhere else?

A lot of the time, other people create custom CDF files based on their own criteria and subsequently make them available for others to use (see Brainarray for an example of what some are doing, as well as PlandbAffy).

But perhaps you have a really nifty idea for reorganizing the probesets on an Affymetrix chip to perform a custom analysis, you don't want to go to the trouble of actually creating the CDF files and Bioconductor packages normally required, and yet you want to test and develop your analysis method.

How?

It turns out you are in luck. At least for AffyBatch objects in Bioconductor (created by calling ReadAffy), the CDF information is stored as an attached environment that can be easily hacked and modified to your heart's content. Environments in R are quite important and useful; I wouldn't have come up with this if I hadn't been working in R for the past couple of years, but I figured someone else might benefit from this knowledge.

The environment

In R, one can access an environment like so:

get("objName", envName)  # get the value of an object in the environment
ls(envName)              # list the objects in the environment

What is also very cool is that one can extract the objects in an environment to a list, and create a new environment from a list using list2env. Using this approach, we can create our own definition of probesets that standard Bioconductor routines will use to summarize probes into probesets.

A couple of disclaimers:

  • I have only tried this on 3' expression arrays
  • There might be a better way to do this, but I couldn't find it (let me know in the comments)
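The list-to-environment round trip at the heart of this trick can be shown with plain base R (the object names here are made up, and no Bioconductor is needed to see the mechanics):

```r
# Round-trip: environment -> list -> (modified) list -> new environment.
e1 <- new.env()
assign("a", 1:3, envir = e1)
assign("b", letters[1:2], envir = e1)

contents <- mget(ls(e1), envir = e1)  # pull everything out as a named list
contents$a <- contents$a * 10         # modify whatever we like
e2 <- list2env(contents)              # build a fresh environment from the list

get("a", envir = e2)
## [1] 10 20 30
```

The same pattern, applied to a CDF environment, is what the rest of this post does.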

Example

require(affy)
require(estrogen)
require(hgu95av2cdf)

datadir = system.file("extdata", package = "estrogen")

pd = read.AnnotatedDataFrame(file.path(datadir, "estrogen.txt"), 
    header = TRUE, sep = "", row.names = 1)
pData(pd)
##              estrogen time.h
## low10-1.cel    absent     10
## low10-2.cel    absent     10
## high10-1.cel  present     10
## high10-2.cel  present     10
## low48-1.cel    absent     48
## low48-2.cel    absent     48
## high48-1.cel  present     48
## high48-2.cel  present     48

celDat = ReadAffy(filenames = rownames(pData(pd)), phenoData = pd, 
    verbose = TRUE, celfile.path = datadir)
## 1 reading J:/R150_libraries/estrogen/extdata/low10-1.cel ...instantiating an AffyBatch (intensity a 409600x8 matrix)...done.
## Reading in : J:/R150_libraries/estrogen/extdata/low10-1.cel
## Reading in : J:/R150_libraries/estrogen/extdata/low10-2.cel
## Reading in : J:/R150_libraries/estrogen/extdata/high10-1.cel
## Reading in : J:/R150_libraries/estrogen/extdata/high10-2.cel
## Reading in : J:/R150_libraries/estrogen/extdata/low48-1.cel
## Reading in : J:/R150_libraries/estrogen/extdata/low48-2.cel
## Reading in : J:/R150_libraries/estrogen/extdata/high48-1.cel
## Reading in : J:/R150_libraries/estrogen/extdata/high48-2.cel

This loads the packages, reads in the raw data, and gets everything ready for us to use. Now, let's see what is in the actual CDF environment.

topProbes <- head(ls(hgu95av2cdf))  # get a list of probesets
topProbes
## [1] "100_g_at"  "1000_at"   "1001_at"   "1002_f_at" "1003_s_at" "1004_at"  

exSet <- get(topProbes[1], hgu95av2cdf)
exSet
##           pm     mm
##  [1,] 175218 175858
##  [2,] 356689 357329
##  [3,] 227696 228336
##  [4,] 237919 238559
##  [5,] 275173 275813
##  [6,] 203444 204084
##  [7,] 357984 358624
##  [8,] 368524 369164
##  [9,] 285352 285992
## [10,] 304510 305150
## [11,] 159937 160577
## [12,] 223929 224569
## [13,] 282764 283404
## [14,] 270003 270643
## [15,] 303343 303983
## [16,] 389048 389688

We can see here that the first probeset, 100_g_at, has 16 perfect-match and mismatch probes associated with it.

Let's summarize the original data using RMA.

rma1 <- exprs(rma(celDat))
## Background correcting
## Normalizing
## Calculating Expression

head(rma1)
##           low10-1.cel low10-2.cel high10-1.cel high10-2.cel low48-1.cel
## 100_g_at        9.643       9.741        9.537        9.354       9.592
## 1000_at        10.398      10.254       10.004        9.904      10.375
## 1001_at         5.718       5.881        5.860        5.954       5.961
## 1002_f_at       5.513       5.802        5.571        5.608       5.390
## 1003_s_at       7.784       8.008        8.038        7.835       7.926
## 1004_at         7.289       7.604        7.489        7.772       7.522
##           low48-2.cel high48-1.cel high48-2.cel
## 100_g_at        9.571        9.476        9.531
## 1000_at        10.034       10.345        9.863
## 1001_at         6.021        5.981        6.285
## 1002_f_at       5.495        5.508        5.630
## 1003_s_at       8.139        7.995        8.233
## 1004_at         7.600        7.456        7.675

Now let's get the data as a list, and then create a new environment to be used for summarization.

allSets <- ls(hgu95av2cdf)
allSetDat <- mget(allSets, hgu95av2cdf)

allSetDat[1]
## $`100_g_at`
##           pm     mm
##  [1,] 175218 175858
##  [2,] 356689 357329
##  [3,] 227696 228336
##  [4,] 237919 238559
##  [5,] 275173 275813
##  [6,] 203444 204084
##  [7,] 357984 358624
##  [8,] 368524 369164
##  [9,] 285352 285992
## [10,] 304510 305150
## [11,] 159937 160577
## [12,] 223929 224569
## [13,] 282764 283404
## [14,] 270003 270643
## [15,] 303343 303983
## [16,] 389048 389688
## 

hgu2 <- list2env(allSetDat)
celDat@cdfName <- "hgu2"

rma2 <- exprs(rma(celDat))
## Background correcting
## Normalizing
## Calculating Expression
head(rma2)
##           low10-1.cel low10-2.cel high10-1.cel high10-2.cel low48-1.cel
## 100_g_at        9.643       9.741        9.537        9.354       9.592
## 1000_at        10.398      10.254       10.004        9.904      10.375
## 1001_at         5.718       5.881        5.860        5.954       5.961
## 1002_f_at       5.513       5.802        5.571        5.608       5.390
## 1003_s_at       7.784       8.008        8.038        7.835       7.926
## 1004_at         7.289       7.604        7.489        7.772       7.522
##           low48-2.cel high48-1.cel high48-2.cel
## 100_g_at        9.571        9.476        9.531
## 1000_at        10.034       10.345        9.863
## 1001_at         6.021        5.981        6.285
## 1002_f_at       5.495        5.508        5.630
## 1003_s_at       8.139        7.995        8.233
## 1004_at         7.600        7.456        7.675

What about removing the MM columns? RMA only uses the PM probes, so it should still work.

allSetDat <- lapply(allSetDat, function(x) {
    x[, 1, drop = FALSE]  # keep only the pm column, preserving matrix structure
})

allSetDat[1]
## $`100_g_at`
##           pm
##  [1,] 175218
##  [2,] 356689
##  [3,] 227696
##  [4,] 237919
##  [5,] 275173
##  [6,] 203444
##  [7,] 357984
##  [8,] 368524
##  [9,] 285352
## [10,] 304510
## [11,] 159937
## [12,] 223929
## [13,] 282764
## [14,] 270003
## [15,] 303343
## [16,] 389048
## 

hgu3 <- list2env(allSetDat)
celDat@cdfName <- "hgu3"
rma3 <- exprs(rma(celDat))
## Background correcting
## Normalizing
## Calculating Expression
head(rma3)
##           low10-1.cel low10-2.cel high10-1.cel high10-2.cel low48-1.cel
## 100_g_at        9.643       9.741        9.537        9.354       9.592
## 1000_at        10.398      10.254       10.004        9.904      10.375
## 1001_at         5.718       5.881        5.860        5.954       5.961
## 1002_f_at       5.513       5.802        5.571        5.608       5.390
## 1003_s_at       7.784       8.008        8.038        7.835       7.926
## 1004_at         7.289       7.604        7.489        7.772       7.522
##           low48-2.cel high48-1.cel high48-2.cel
## 100_g_at        9.571        9.476        9.531
## 1000_at        10.034       10.345        9.863
## 1001_at         6.021        5.981        6.285
## 1002_f_at       5.495        5.508        5.630
## 1003_s_at       8.139        7.995        8.233
## 1004_at         7.600        7.456        7.675

What if we only want to use the first 5 probesets?

allSetDat <- allSetDat[1:5]
hgu4 <- list2env(allSetDat)
celDat@cdfName <- "hgu4"
celDat
## AffyBatch object
## size of arrays=640x640 features (22 kb)
## cdf=hgu4 (5 affyids)
## number of samples=8
## number of genes=5
## annotation=hgu95av2
## notes=
rma4 <- exprs(rma(celDat))
## Background correcting
## Normalizing
## Calculating Expression
rma4
##           low10-1.cel low10-2.cel high10-1.cel high10-2.cel low48-1.cel
## 100_g_at        9.463       9.555        9.449        9.402       9.448
## 1000_at        10.183      10.010       10.010        9.970      10.102
## 1001_at         5.944       6.005        5.944        6.090       6.237
## 1002_f_at       5.787       5.846        5.817        5.815       5.763
## 1003_s_at       7.751       7.769        7.913        7.864       7.861
##           low48-2.cel high48-1.cel high48-2.cel
## 100_g_at        9.458        9.401        9.431
## 1000_at        10.010       10.197        9.890
## 1001_at         6.148        6.189        6.207
## 1002_f_at       5.763        5.741        5.755
## 1003_s_at       7.918        7.863        7.929
dim(rma4)
## [1] 5 8

Custom CDF

To generate our custom CDF, we are going to make up our own probeset names and take random probes from all of the probes on the chip. The actual criteria for which probes belong together can be anything the author chooses.

maxIndx <- 640 * 640  # total number of probe positions on the 640x640 array

customCDF <- lapply(seq(1, 100), function(x) {
    tmp <- matrix(sample(maxIndx, 20), nrow = 20, ncol = 1)  # 20 random probe indices
    colnames(tmp) <- "pm"
    return(tmp)
})

names(customCDF) <- seq(1, 100)

hgu5 <- list2env(customCDF)
celDat@cdfName <- "hgu5"
rma5 <- exprs(rma(celDat))
## Background correcting
## Normalizing
## Calculating Expression

head(rma5)
##     low10-1.cel low10-2.cel high10-1.cel high10-2.cel low48-1.cel
## 1         6.938       6.851        6.831        6.742       6.894
## 10        6.663       6.610        6.615        6.612       6.610
## 100       5.256       5.499        5.420        5.205       5.352
## 11        7.529       7.807        7.552        7.539       7.728
## 12        6.163       6.216        6.230        6.091       6.071
## 13        6.311       6.404        6.264        6.290       6.257
##     low48-2.cel high48-1.cel high48-2.cel
## 1         6.636        6.862        6.576
## 10        6.572        6.685        6.481
## 100       5.256        5.281        5.258
## 11        7.792        7.682        7.590
## 12        6.058        6.204        6.159
## 13        6.249        6.275        6.313

I hope this information is useful to someone else. I know it made my life a lot easier.

sessionInfo()
## R version 2.15.0 (2012-03-30)
## Platform: x86_64-pc-mingw32/x64 (64-bit)
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] knitr_0.6.3          hgu95av2cdf_2.10.0   estrogen_1.8.8      
## [4] BiocInstaller_1.4.7  rat2302cdf_2.10.0    AnnotationDbi_1.18.1
## [7] affy_1.34.0          Biobase_2.16.0       BiocGenerics_0.2.0  
## 
## loaded via a namespace (and not attached):
##  [1] affyio_1.24.0         DBI_0.2-5             digest_0.5.2         
##  [4] evaluate_0.4.2        formatR_0.5           IRanges_1.14.4       
##  [7] markdown_0.5.2        parser_0.0-16         plyr_1.7.1           
## [10] preprocessCore_1.18.0 Rcpp_0.9.10           RSQLite_0.11.1       
## [13] stats4_2.15.0         stringr_0.6           tools_2.15.0         
## [16] zlibbioc_1.2.0       

Edit: added session information.

Source of the code that generated the output is on Github

16.4.12

Firefox Keywords + Javascript = Magic!

Here is something particularly useful that I keep rediscovering for various reasons. Firefox allows you to assign any bookmark a keyword and use that keyword as a shortcut in the address bar. For example, if I create a bookmark that points to "http://www.google.com", go to "Show All Bookmarks", click on that bookmark, and fill in the "Keyword" field with "google", I can then simply type "google" in the address bar to go there. This has been in Firefox since a very early version.

This gets even cooler: you can set these keyword bookmarks up to point at search services, so that whatever follows the keyword is passed along as the search string and you are redirected to the results.

From a Bioinformatics perspective, I could set up a custom search on the UCSC Genome browser that uses this string as the location:

And then give this bookmark the keyword "ratgenome". I could then search for genes in the rat genome from the Firefox address bar with "ratgenome 'genename'"; for example, "ratgenome brca1" searches for brca1 in rat.

Now, how often do you look in just one place for something, though? What if I wanted to search NCBI's Gene database, Ensembl, GeneCards, and the UCSC Genome Browser? Now we create a simple javascript command, and save that as our bookmark with a keyword:
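The original snippet isn't shown here, but a bookmarklet along these lines would do it. The site URLs below are illustrative guesses at each site's query parameter, worth double-checking against the actual search forms before relying on them:

```javascript
// Sketch of a multi-site gene search. In a bookmark, the whole thing would be
// one line prefixed with "javascript:", with the query replaced by %s.
function geneSearchUrls(query) {
  var q = encodeURIComponent(query);
  return [
    "https://www.ncbi.nlm.nih.gov/gene/?term=" + q,
    "https://www.ensembl.org/Multi/Search/Results?q=" + q,
    "https://www.genecards.org/Search/Keyword?queryString=" + q,
    "https://genome.ucsc.edu/cgi-bin/hgTracks?position=" + q
  ];
}

// In the browser, each URL would be opened in its own tab:
// geneSearchUrls("%s").forEach(function (u) { window.open(u); });
```

Firefox substitutes whatever follows the keyword for "%s" before running the javascript.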

If I supply the keyword "genesearch", I can now simply type "genesearch brca1" and have all four sets of search results open as tabs. This can be set up for almost any site, as long as you can figure out the search-string parameters, replacing the part of the URL that defines your query with "%s".

Inspiration from here.

2.4.12

Bioconductor, Git and SVN multiple branches

So the Bioconductor SVN repo is set up in the standard trunk/branches way, but as a developer you only have write access to your package directory, which is either "trunk/x/x/packageName" (devel) or "branches/releaseNum/x/x/packageName" (release). This is not what Git expects if you want to track release and devel in the same Git repository. How do you set things up so you can keep track of everything in a single repository, as you might normally want?

I found some suggestions here, and show them below:

Clone your SVN repo:
git svn clone https://svnRepo/trunk/x/x/packageName packageName
Now add information about the branches:
git config --add svn-remote.releaseNum.url https://svnRepo/branches/releaseNum/x/x/packageName/
git config --add svn-remote.releaseNum.fetch :refs/remotes/releaseNum/
Now fetch and create a local branch tied to the remote:
git svn fetch releaseNum
git checkout -b local-releaseNum -t releaseNum
From these two branches, you should be able to create normal Git branches, work on them, and then go back and merge, rebase, and dcommit as usual. Although I'm sure you could clone each of these into separate Git repos, you would then lose the ability to diff between them. I hope this saves someone else a couple of hours of searching the web.

Originally posted on my Wiki.

14.3.12

Tweetdeck oddities

I've started using TweetDeck to keep track of multiple Twitter accounts and Facebook, and it's pretty cool. One thing that seems odd, though: if you add multiple accounts, the "Home" column shows the feed from the first Twitter account added plus the Facebook feed, but not a second Twitter feed. It took me a while to figure this out. The easiest solution was to have three columns: a timeline for each Twitter account, plus the Facebook news feed.

I didn't see anything about this in the help, and there don't appear to be any options to change what appears in the "Home" column.