4.9.13

Moved!

For those who didn't know, previously I was working with Hunter Moseley at the Center for Regulatory and Environmental Analytical Metabolomics (CREAM) at the University of Louisville. In July, the principal investigators associated with CREAM were offered an opportunity to create the Center for Systems Biology in the Markey Cancer Center at the University of Kentucky. I was offered the chance to move too, and I did. We officially moved our computing facilities on August 30, and are temporarily housed in a wet lab while we await the completion of renovations in our new space.

With the move, I'm also planning to attempt a move of my blogging platform to a github hosted version written completely in markdown, probably using samatha. This will take some work, but I think it will be better for what I want in a blogging platform overall. By generating static HTML, I shouldn't be dependent on what Blogger is doing, and it integrates better with the other things I am trying to do as well (especially with keeping the source of blog posts hosted on github too; you did know I was doing that, right??).

16.8.13

Reproducible Methods (or bad bioinformatics methods sections)

Science is built on the whole idea of being able to reproduce results, i.e. if I publish something, it should be possible for someone else to reproduce it using the description of the methods in the publication. As biological sciences have become increasingly reliant on computational methods, this has become a bigger and bigger issue, especially as the results of experiments become dependent on independently developed computational code, or on sophisticated software packages that have multiple versions and a variety of settings that can affect the output. For further discussion on this issue, you might want to read 1, 2.


I recently read a couple of different publications that really made me realize how big a problem this is. I want to spend some time showing what the problem is in these publications, and why we should be concerned about the current state of computational analytical reproducibility in life-sciences.


In both the articles mentioned below, I do not believe that I, or anyone not associated with the project, would be able to generate even approximately similar results based solely on the raw data and the description of methods provided. Ultimately, this is a failure both of those doing the analysis and of the reviewers who reviewed the work, and it is a rather deplorable situation for a field that prides itself on verification of results. This is why I'm saying these are bad bioinformatics methods sections.


Puthanveettil et al., Synaptic Transcriptome


Puthanveettil et al., 2013 published a paper earlier this year in PNAS titled “A strategy to capture and characterize the synaptic transcriptome”. Although the primary development reported is a new method of characterizing RNA complexes that are carried by kinesin, much of the subsequent analysis is bioinformatic in nature.


For example, they used BLAST searches to identify the RNA molecules, and a cutoff value is reported in the results. However, functional characterization using Gene Ontology (GO) was carried out by “Bioinformatics analyses” (see the top of pg 3 in the PDF). There is no mention of where the GO terms came from, which annotation source was used, or which software was used: not in the results, discussion, or methods, or the supplemental methods. The microarray data analysis isn't too badly described, but the 454 sequencing data processing isn't really described at all.


My point is that even given their raw data, I'm not sure I would be able to even approximate their results based on what is reported in the methods section.


Gulsuner et al., Schizophrenia SNPs


Gulsuner et al published a paper in Cell in August 2013 titled “Spatial and Temporal Mapping of De Novo Mutations in Schizophrenia to a Fetal Prefrontal Cortical Network”. This one also looks really nice: they look for de novo mutations (i.e. new mutations in offspring not present in parents or siblings) that disrupt genes belonging to a heavily connected network, and they also examine gene co-expression over brain development time-scales. Sounds really cool, and the results seem legit, based on my reading of the manuscript. I was really impressed that they even used randomly generated networks to control the false discovery rate!


However, almost all of the analysis again depends on a lot of different bioinformatic software. I do have to give the authors props: they actually give the full version of each tool used. But there is no mention of tool-specific settings (which can generate vastly different results; see the Exome Sequencing section of the methods).


Then there is this bombshell: “The predicted functional impact of each candidate de novo missense variant was assessed with in silico tools.” (near the top of pg 525 of the PDF). Rrrreeeaaaalllly now. No actual statement of which tools were used, although the subsequent wording and references provided imply that they were PolyPhen2, SIFT, and the Grantham method. But shouldn't that have been stated up front? Along with any settings that were changed from default??


There is no raw data available, only their reported SNPs. There isn't even a list of all the SNPs that were potentially considered, so that I could at least start from those and re-run the later analysis. I have to take their word for it (although I am glad that at least the SNPs they used in later analyses are reported).


Finally, the random network generation. I'd like to be able to see that code, go over it, and see exactly what it was doing to verify it was done correctly. It likely was, based on the histograms provided, but still, this is where small errors creep in and result in invalid results.


As above, even if the raw data were available (I didn't see an SRA accession or any other download link), I'm not sure I could reproduce or verify the results.


What to do??


How do we fix this problem? I think scripts and workflows used to run any type of bioinformatic analysis have to become first-class research objects, and we have to teach scientists to write and use them in a way that makes them first-class research objects. So in the same way that a biologist might ask for verification of immunostaining, etc., bioinformaticians should ask that, given known input, a script generates reasonable output.


I know there has been discussion on this before, and disagreement, especially given the exploratory nature of research. However, once you've got something working right, you should be able to test it. Reviewers should be asking if it is testable, or the code should be available for others to test.
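As a minimal sketch of what I mean (a made-up example, not code from either paper), even a small analysis function plus a testthat check on known input gives reviewers and readers something concrete to verify:

library(testthat)

# hypothetical example: count candidate variants per gene from a vector of gene hits
count_variants <- function(variant_genes) {
    as.data.frame(table(gene = variant_genes), stringsAsFactors = FALSE)
}

# given known input, the script should generate known, reasonable output
test_that("count_variants gives the expected counts", {
    counts <- count_variants(c("TP53", "TP53", "BRCA1"))
    expect_equal(counts$Freq[counts$gene == "TP53"], 2)
    expect_equal(counts$Freq[counts$gene == "BRCA1"], 1)
})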


I also think we as a community should do more to point out the problem, i.e. when we see it, point it out to others. I've done that here, but I don't know how much of it should be formal. Maybe we need a new hashtag, #badbioinfomethodsection, and post links to papers that do this. Conversely, we should also point to examples where it is done right (#goodbioinfomethodsection??). And if you are a bioinformatician or a biologist who does a lot of coding, share your code, and at least supply it as supplemental materials. Oh, and maybe take a Software Carpentry class, and look up git.


Posted on August 16, 2013 at http://robertmflight.blogspot.com/2013/08/reproducible-methods-or-bad.html, raw markdown at https://github.com/rmflight/blogPosts/blob/master/reproducible_methods.md

11.7.13

R Interface for Teaching

Kaitlin Thaney asked on Twitter last week about using Ramnath Vaidyanathan's new interactive R notebook 1 2 for teaching.

Now, to be clear up front, I am not trying to be mean to Ramnath, discredit his work, or the effort that went into that project. I think it is really cool, and has some rather interesting potential applications, but I don't really think it is the right interface for teaching R. I would argue that the best interface for teaching R right now is RStudio. Keep reading to find out why.

iPython Notebook

First, I believe Ramnath when he says he was inspired by the iPython Notebook, which makes it so very nice to do interactive, reproducible Python coding. Software Carpentry has been using it very successfully to help teach Python to scientists.

However, the iPython Notebook is an interesting beast for this purpose. You are able to mix markdown blocks and code blocks. In addition, it is extremely simple to break up your calculations into units of related code, and re-run those units as needed. This is particularly useful when writing new functions, because you can write the function definition, and a test that displays output in one block, and then the actual computations in subsequent blocks. It makes it very easy to keep re-running the same block of code over and over until it is correct, which allows one to interactively explore changes to functions. This is awesome for learning Python and prototyping functions.

In addition to being able to repeatedly write -> run -> modify in a loop, you can also insert prose describing what is going on in the form of markdown. This is a nice lightweight syntax that generates html. So it becomes relatively easy to document the why of something.

R Notebook

Unfortunately, the R notebook that Ramnath has put up is not quite the same beast. It is an Ace editor window coupled to an R process that knits the markdown and displays the resultant html. This is really cool, and I think will be useful in many other applications, but not for teaching in an interactive environment.

RStudio as a Teaching Environment

Let's think. We want something that lets us repeatedly write -> run -> modify on small code blocks in R, but it would be great if it was also some kind of document that could be shared and re-run.

I would argue that the editor environment in RStudio when writing R markdown (Rmd) files is the solution. R code blocks behave much the same as in iPython notebook, in that they are colored differently, set apart, have syntax highlighting, and can be easily repeatedly run using the code chunk menu. Outside of code blocks is assumed to be markdown, making it easy to insert documentation and explanation. The code from the code blocks is sent to an attached R session, where objects can be further investigated if required, and results are displayed.

This infrastructure supplies an interactive back and forth between editor and execution environment, with the ability to easily group together units of code.
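To make this concrete, here is a minimal sketch of an Rmd file (the content is made up, purely for illustration); each ```{r} chunk can be re-run on its own from the chunk menu, with the output and objects going to the attached R session:

First we simulate some data.

```{r simulateData}
set.seed(42)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
```

Then we fit a model and look at the coefficients.

```{r fitModel}
fit <- lm(y ~ x)
summary(fit)$coefficients
```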

In addition, RStudio has git integration baked in, so it becomes easy to get started with some basic version control.

Finally, RStudio is cross-platform, has tab completion among other standard IDE goodies, and it's free.

Feedback

I've gotten some feedback on twitter about this, and I want to update this post to address it.

Hard to Install

One comment was that installing R, RStudio and the necessary packages might be hard. True, it might be. However, I have done multiple installs of R, RStudio, Python, and iPython Notebook on both Linux and Windows, and I would argue that the level of difficulty is comparable.

Moving from Presentation to Coding

I think this is always difficult, especially if you have a powerpoint and your code is in another application. However, the latest dev version of RStudio (download) now includes the ability to view markdown-based presentations in an attached window. This is probably one of the nicest features for giving presentations that actually involve editing code.

Edit: added download links for RStudio preview

5.6.13

Tim Horton's Density

Inspired by this post, I wanted to examine the locations and density of Tim Hortons restaurants in Canada. Using Statistics Canada census data, each census dissemination area is queried on Foursquare for Tims locations.


Setup


options(stringsAsFactors = F)
require(timmysDensity)
require(plyr)
require(maps)
require(ggplot2)
require(geosphere)

Statistics Canada Census Data


The actual Statistics Canada data at the dissemination block level can be downloaded from here. You will want to download the Excel format, read it, and then save it as either tab-delimited or CSV using a non-standard delimiter; I used a semi-colon (;).


censusData <- read.table("../timmysData/2011_92-151_XBB_XLSX.csv", header = F, 
    sep = ";", quote = "")
censusData <- censusData[, 1:17]
names(censusData) <- c("DBuid", "DBpop2011", "DBtdwell2011", "DBurdwell2011", 
    "DBarea", "DB_ir2011", "DAuid", "DAlamx", "DAlamy", "DAlat", "DAlong", "PRuid", 
    "PRname", "PRename", "PRfname", "PReabbr", "PRfabbr")
censusData$DBpop2011 <- as.numeric(censusData$DBpop2011)
censusData$DBpop2011[is.na(censusData$DBpop2011)] <- 0

censusData$DBtdwell2011 <- as.numeric(censusData$DBtdwell2011)
censusData$DBtdwell2011[is.na(censusData$DBtdwell2011)] <- 0

From this data we get block level:


  • populations (DBpop2011)
  • total private dwellings (DBtdwell2011)
  • private dwellings occupied by usual residents (DBurdwell2011)
  • block land area (DBarea)
  • dissemination area id (DAuid)
  • representative point x coordinate in Lambert projection (DAlamx)
  • rep. point y coordinate in Lambert projection (DAlamy)
  • rep. point latitude (DAlat)
  • rep. point longitude (DAlong)

This should be everything we need to do the investigation we want.


Dissemination Area Long. and Lat.


We need to find the unique dissemination areas and get their latitudes and longitudes for querying other databases. Note that the longitude and latitude provided here are actually population-weighted representative locations. However, given the small size of the dissemination areas, I don't think using them will be a problem for Foursquare. Because the areas are what we have location data for, we will summarize everything at the area level, summing the population counts for all the blocks within an area.


uniqAreas <- unique(censusData$DAuid)

summarizeArea <- function(areaID) {
    areaData <- censusData[(censusData$DAuid == areaID), ]
    outData <- data.frame(uid = areaID, lamx = areaData[1, "DAlamx"], lamy = areaData[1, 
        "DAlamy"], lat = areaData[1, "DAlat"], long = areaData[1, "DAlong"], 
        pop = sum(areaData[, "DBpop2011"]), dwell = sum(areaData[, "DBtdwell2011"]), 
        prov = areaData[1, "PRename"])
    return(outData)
}
areaData <- adply(uniqAreas, 1, summarizeArea)
.sessionInfo <- sessionInfo()
.timedate <- Sys.time()
write.table(areaData, file = "../timmysData/areaData.txt", sep = "\t", row.names = F, 
    col.names = T)
save(areaData, .sessionInfo, .timedate, file = "../timmysData/areaDataFile.RData", 
    compress = "xz")

Run queries on Foursquare


Load up the data and verify what we have.


load("../timmysData/areaDataFile.RData")
head(areaData)

Generate queries and run


For each dissemination area (DA), we will use the latitude and longitude of the DA as the location for the query, along with the search string "tim horton".


Foursquare limits the number of userless requests to 5000 per hour. To make sure we stay under this limit, the runQueries function will only run 5000 queries an hour.


runQueries(areaData, idFile = "../timmysData/clientid.txt", secretFile = "../timmysData/clientsecret.txt", 
    outFile = "../timmysData/timmysLocs2.txt")

Clean up the results


Due to the small size of the DAs, we have a lot of duplicate entries. Now let's remove all of the duplicates.


cleanUpResults("../timmysData/timmysLocs2.txt")

Visualize Locations


First let's read in the data and make sure that we have Tims locations.


# read in and clean up the data
timsLocs <- scan(file = "../timmysData/timmysLocs2.txt", what = character(), 
    sep = "\n")
timsLocs <- strsplit(timsLocs, ":")

timsName <- sapply(timsLocs, function(x) {
    x[1]
})
timsLat <- sapply(timsLocs, function(x) {
    x[2]
})
timsLong <- sapply(timsLocs, function(x) {
    x[3]
})

locData <- data.frame(description = timsName, lat = as.numeric(timsLat), long = as.numeric(timsLong))
hasNA <- is.na(locData[, "lat"]) | is.na(locData[, "long"])
locData <- locData[!(hasNA), ]

timsStr <- c("tim hortons", "tim horton's")

hasTims <- (grepl(timsStr[1], locData$description, ignore.case = T)) | (grepl(timsStr[2], 
    locData$description, ignore.case = T))

locData <- locData[hasTims, ]
timsLocs <- locData
rm(timsName, timsLat, timsLong, hasNA, locData, hasTims, timsStr)
.timedate <- Sys.time()
.sessionInfo <- sessionInfo()
save(timsLocs, .timedate, .sessionInfo, file = "../timmysData/timsLocs.RData", 
    compress = "xz")

Put them on a map


data(timsLocs)
data(areaDataFile)
canada <- map_data("world", "canada")

p <- ggplot(legend = FALSE) + geom_polygon(data = canada, aes(x = long, y = lat, 
    group = group)) + theme(panel.background = element_blank()) + theme(panel.grid.major = element_blank()) + 
    theme(panel.grid.minor = element_blank()) + theme(axis.text.x = element_blank(), 
    axis.text.y = element_blank()) + theme(axis.ticks = element_blank()) + xlab("") + 
    ylab("")

sp <- timsLocs[1, c("lat", "long")]

p2 <- p + geom_point(data = timsLocs[, c("lat", "long")], aes(x = long, y = lat), 
    colour = "green", size = 1, alpha = 0.5)

print(p2)

[Figure (chunk mapIt): Tim Hortons locations plotted as points over a map of Canada]


How far??


And now let's also calculate the minimum distance of each DA from the Timmys locations.


queryLocs <- matrix(c(timsLocs$long, timsLocs$lat), nrow = nrow(timsLocs), ncol = 2, 
    byrow = F)  # these are the tims locations
distLocs <- matrix(c(areaData$long, areaData$lat), nrow = nrow(areaData), ncol = 2, 
    byrow = F)  # the census DA centers
# for each DA center, keep only the minimum distance to any Tims location, so
# that allDists lines up with the rows (and populations) of areaData
allDists <- apply(distLocs, 1, function(x) {
    min(distHaversine(x, queryLocs))
})

From the allDists variable above, we can determine that the maximum distance any census dissemination area (DA) is from a Tim Hortons is 51.5 km (31.9815 miles). This is based on distances calculated "as the crow flies", but still, that is pretty close. Assuming roads, the furthest a Canadian should have to travel to get their Timmys fix is less than an hour.


totPopulation <- sum(areaData$pop, na.rm = T)
lessDist <- seq(50, 51.6 * 1000, 50)  # distances are in meters, so multiply 51.6 km by 1000 to work in meters

percPop <- sapply(lessDist, function(inDist) {
    isLess <- allDists < inDist
    sum(areaData$pop[isLess], na.rm = T)/totPopulation * 100
})

plotDistPerc <- data.frame(distance = lessDist, population = percPop, logDist = log10(lessDist))
ggplot(plotDistPerc, aes(x = logDist, y = population)) + geom_point() + xlab("Log10 Distance") + 
    ylab("% Population")

[Figure (chunk percPopulation): % of population vs. log10 distance (m) to the nearest Tim Hortons]


What gets really interesting is how much of the population lives within a given distance of a Timmys. Summing up the percentage of the population within given distances, the plot above shows that 50% of the population is within 316.2278 meters of a Tim Hortons location.


I guess Canadians really do like their Tim Hortons Coffee (and donuts!).


Replication


All of the necessary processed data and code is available in the R package timmysDensity. You can install it using devtools. The original data files are linked in the relevant sections above.


library(devtools)
install_github("timmysDensity", "rmflight")

Caveats


I originally did this work based on a different set of data, for which I have not been able to locate the original source. I have not compared these results to that data to verify their accuracy. When I do, I will update the package, vignette and blog post.


Posted


This work exists as the vignette of timmysDensity, on my web-blog, and independently as the front page for the GitHub repo.


Disclaimer


Tim Hortons was not involved in the creation or preparation of this work. I am not regularly updating the location information obtained from Foursquare, it is only valid for May 31, 2013. All code used in preparing these results was written by me, except in the case where code from other R packages was used. All opinions and conclusions are my own, and do not reflect the views of anyone else or any institution I may be associated with.


Other information when this vignette was generated


Session Info


sessionInfo()

## R version 3.0.0 (2013-04-03)
## Platform: x86_64-unknown-linux-gnu (64-bit)
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=C                 LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] knitr_1.2           markdown_0.5.4      geosphere_1.2-28   
## [4] sp_1.0-9            ggplot2_0.9.3.1     maps_2.3-2         
## [7] plyr_1.8            timmysDensity_0.0.3 devtools_1.2       
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_1.2-2   dichromat_2.0-0    digest_0.6.3      
##  [4] evaluate_0.4.3     formatR_0.7        grid_3.0.0        
##  [7] gtable_0.1.2       httr_0.2           labeling_0.1      
## [10] lattice_0.20-15    lubridate_1.3.0    MASS_7.3-26       
## [13] memoise_0.1        munsell_0.4        parallel_3.0.0    
## [16] proto_0.3-10       RColorBrewer_1.0-5 RCurl_1.95-4.1    
## [19] reshape2_1.2.2     RJSONIO_1.0-3      scales_0.2.3      
## [22] stringr_0.6.2      tools_3.0.0        whisker_0.3-2

Time and Date


Sys.time()

## [1] "2013-06-05 14:04:49 EDT"

30.5.13

Storing package data in custom environments

If you do R package development, sometimes you want to be able to store variables specific to your package without cluttering up the user's workspace. One way to do this is by modifying the global options; this is done by the packages grDevices and parallel. Sometimes this doesn't seem to work quite right (see this issue for example).

Another way to do this is to create an environment within your package that only package functions will be able to see, and therefore read from and modify. You get a space to put package-specific stuff, the user can't see it or modify it directly, and you just need to write functions that do the appropriate things to that environment (adding variables, reading them, etc). This sounds great in theory, but I wasn't clear on how to do it, even after reading the help page on environments, the R documentation, or even Hadley's excellent writeup. From all these sources, I could glean that one can create environments, name them, modify them, etc, but I wasn't sure how to work with this within a package.

I checked out the knitcitations package to see how it was done. When I looked, I realized that it was pretty obvious in retrospect. In zzz.R, initialize the environment, assigning it to a variable. When you need to work with the variables inside, this variable will be accessible to your package, and you simply use the get and assign functions like you would if you were doing anything on the command line.
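Here is a minimal sketch of the pattern (the variable and function names are placeholders of my own, not taken from knitcitations or my package):

# in zzz.R: create the package-private environment once, when the package loads
pkgVars <- new.env(parent = emptyenv())

# package functions can then read from and write to it
setPkgVar <- function(name, value) {
    assign(name, value, envir = pkgVars)
}

getPkgVar <- function(name) {
    get(name, envir = pkgVars)
}

# e.g. from inside another package function:
# setPkgVar("lastQuery", Sys.time())
# getPkgVar("lastQuery")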

To make sure I had it figured out, I created a very tiny package to create a custom environment and functions for modifying it. Please feel free to examine, download, install (using devtools) and see for yourself.

I have at least two projects where I know I will use this, and I'm sure others might find it useful as well.

10.5.13

Writing up scientific results and literate programming

As an academic researcher, my primary purpose is to find some new insight, and subsequently communicate this insight to the general public. The process of doing this is traditionally thought to be:


  1. from observations of the world, generate a hypothesis
  2. design experiments to test the hypothesis
  3. analyse the results of the experiments to determine if the hypothesis is correct
  4. write a report to communicate the results to others (academics and / or the general public)

And then repeat.


This is the way people envision it happening. And I would imagine that in some rare cases, this is what actually happens. However, I think many researchers would agree that this is not what normally happens. In the process of doing steps 3 and 4, the hypothesis in 1 gets modified, which modifies the experiments in 2, and so on and so forth. This makes scientific discovery a very iterative process, oftentimes right up to the report writing.


It can take a long time to figure this out. I'll never forget a professor during my PhD who suggested that you write the paper first, and then figure out what experiments you should do to generate the results that would support or disprove the hypothesis you made in the paper. At the time I thought he was nuts, but when you start writing stuff and looking at how all the steps of experimenting and reporting become intertwined, it doesn't seem like a bad idea. note1


Literate programming


What does this have to do with literate programming? For those who don't know, literate programming is a way to mix code and prose together in one document (in R we use knitr & Sweave; Python now has the iPython notebook as an option). This literate programming paradigm (combined with markdown as the markup language instead of latex, thanks to knitr) is changing how I actually write my papers and do research in general.


How that changes writing


As I've previously described 12, RStudio makes the use of knitr and the generation of literate documents using computations in R incredibly easy. Because my writing environment and programming environment are tightly coupled together, I can easily start writing what looks like a shareable, readable publication as soon as I start writing code. Couple this together with a version control system like git, and you have a way to follow the complete provenance of a publication from start to finish.


Instead of commenting my code to explain why I am doing something, I explain what I am doing in the prose, and then write the code to carry out that analysis. This changes my writing and coding style, and makes the interplay among the steps of writing scientific reports above much more explicit. I think it is a good thing, and will hopefully make my writing and research more productive.


Notes


1. I am not suggesting that one only do experiments that will support the hypothesis, but writing out the paper at least gives you a framework for doing the experiments and the initial analysis. One should always be willing to modify the publication / hypothesis based on what the experiments tell you.


Sources


Published 10.05.13 here


The source markdown for this document is available here

9.4.13

Why I wear barefoot shoes


Two weeks ago I ordered SoftStar Shoes Runamoc Dash, and this evening, I laced up a new pair of Xero Shoes huaraches. Why would an overweight non-runner have ordered two pairs of "barefoot" shoes?? In the spring of 2010, a lot of publicity was given to the barefoot running phenomenon. I was interested, because I had a lot of knee problems growing up, and of course one of the claims of barefoot running is that it helps with various foot, leg, and knee problems. I did a lot of reading about it, and found "Invisible Shoes", a newly created enterprise out of Denver, Colorado. For $20, they would send a sheet of Vibram sole that you then cut out to make your shoes, keeping them on with the included nylon string.

That first summer, I didn't do a lot of walking, but mostly drove to work, and any time I had to walk any distance in my new shoes I ended up with a lot of stretched skin on the balls of my feet. But I still really liked wearing them. They required me to walk in a completely different way, and seemed to help with some foot issues.

By the second summer I had figured out the bus system in the city, and was walking 0.7 miles each way to the bus stop (1.4 miles each day, sometimes more). Very quickly I got used to the shoes, and I started to notice that I was able to walk further, without pain, and stand for much longer periods. Seriously, prior to this I could maybe be on my feet for an hour before developing serious pain in my knees, but by mid summer I could easily be on my feet for 3 hours without experiencing major pain.

I put so many miles on those shoes that I started to develop a much thinner spot on one where the ball of my foot struck the ground first, and I broke the laces on each shoe twice where they are held on by a knot on the bottom. I wore them everywhere, and really enjoyed them. It was amazing to actually feel the ground you were walking on. The one time I went hiking on nature trails with them was awesome!

Unfortunately, winter came, and my feet do not like the cold (even the little bit of cold we get here in Louisville, KY). I reluctantly put up my Invisible Shoes, and went back to my sneakers. The following summer, I remained in my sneakers, dealing with sweaty feet, and sore feet, legs and knees. Last fall, I resolved to find a solution that would enable me to wear barefoot type shoes most of the year.

I went to my local outdoor gear stores and looked at options, finding nothing I really liked or considered truly barefoot (no more than 6mm of contact, and no drop from heel to toe). I did, however, find SoftStar Shoes, and their various options for barefoot style shoes. This past Christmas season, I had some extra money, and four weeks ago I finally ordered the Runamoc Dash. It took about two weeks to have them made and shipped, but it was worth the wait. I have an extremely comfortable, good-looking barefoot shoe that I can wear pretty much anywhere. This includes the gym, as my gym has a no open toe shoe policy.

I had also kept abreast of the changes at Invisible Shoes, with their switch to Xero Shoes, and a more curved sole that kept one from catching the toe on the ground (a problem I had, especially when I got tired). Last week I ordered a pair, and I have just finished punching the toe hole and lacing them up. This is good, because the temperature just started climbing into the 80s today, and it will be nice to have a very cool pair of barefoot shoes.

If you want to try barefoot shoes, why not give the Xero option a try? It is relatively inexpensive, and you might like them so much that you want to get a pair of SoftStar's. Oh, and they don't tend to develop a funky smell like some other brands.

Some Caveats

As with most things, there are always some warnings, things to take note of that I forgot to mention above. If you look around at other barefoot sites you will probably see some of these, but here they are:
  • If you don't modify how you walk in these, your feet and legs will probably take a beating from heel striking on pavement. Without any cushion, all the force gets transmitted up your leg, and it will hurt.
  • If you do change how you walk to a mid- or fore-foot landing, it will take time for your legs to adjust. Man, did my feet and my upper calves (just behind my kneecap) ache the first few days in my Dash's, and my calves would start to cramp after just 1/2 mile of walking. But after a week or so, I didn't have any problems. Don't expect to walk 2 miles or run a marathon immediately after switching.
  • Some people may develop more problems. As usual, as more people are trying barefoot running / walking, some people are finding it is probably not for them. Whether this is really true, or they just never figure out how to change their stride, is open. But that's why I would try the inexpensive option first.
  • You will probably walk slower, and / or take more steps to move the same distance. If you are coming down on your mid- to fore-foot, you can't take as big a step. So it may take effort to keep up with others who are walking in regular shoes. On the other hand, more steps means more energy, which probably amounts to a higher rate of caloric consumption (or so those of us with a spare tire to lose can hope, anyway).

5.4.13

Reproducible publications as R packages... and markdown vignettes!

I hadn't really considered it before, but R packages are an interesting way to distribute reproducible papers. You have functions in the R directory, any required data files in data, a DESCRIPTION file that describes the package and all its dependencies, and finally you have the vignette, generally in inst/doc. Vignettes in R traditionally had to be written in Latex and use Sweave to mix the written and code parts.

It is very easy to imagine that the vignette produced could be a journal article, and therefore, to get the reproducible article and access to the underlying data and functions, you simply need to install the package. I don't know why I hadn't realized this before. It is actually a really neat, and potentially powerful, idea. I would not be surprised if this was actually part of the logic of incorporating Sweave vignettes into R packages.

One wonders why distributing papers as packages isn't more common. I am guessing that it is likely due to the requirement to use Latex for writing the document. I've written a vignette, and I can't say I really cared for the experience, basically because getting the Latex part to work was a painful process.

However, with the release of R 3.0.0, it is now possible to define alternative vignette engines. Currently the only one I know of is knitr, which supports generating PDF from Rnw (latex with an alternative syntax to Sweave) and HTML from Rmd, or R markdown, files. Markdown is so much easier to write, and in my mind the HTML generated is much easier to read. In addition, customization of the look using CSS is probably much more familiar to many people who are doing programming nowadays, another big plus.
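As a rough sketch of what the switch involves (following Yihui's guide; the file name and index entry below are just placeholders), you declare knitr as the vignette builder in DESCRIPTION:

Suggests: knitr
VignetteBuilder: knitr

and then tag the Rmd vignette (e.g. vignettes/myPaper.Rmd) so R knows which engine to use:

<!--
%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{My reproducible paper}
-->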

In addition to having to write using Latex, the process of changing, building, loading, and documenting packages used to be pretty cumbersome. However, Hadley Wickham has been doing a lot to change that with his devtools package, which makes it quite easy to re-build a package and load it back up. This has now been integrated into the latest versions of RStudio, making it rather easy to work on a package and immediately work with any changes. In addition, the testthat package makes it easier to run automated tests, and roxygen2 makes it easy to document the custom functions used by your vignette.
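A minimal sketch of that workflow from the R console (assuming the package lives in a directory called myPaperPkg; the name is a placeholder):

library(devtools)

load_all("myPaperPkg")   # re-load the package code after edits, without a full install
document("myPaperPkg")   # regenerate the roxygen2 documentation
test("myPaperPkg")       # run the testthat tests
build("myPaperPkg")      # build the package tarball, vignette and all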

So, I know I will be using Yihui's guide to switch my own package to use a markdown vignette, and will probably try to do my next paper as a self-contained R package as well. How about you??

Edit: As Carl pointed out below, pandoc is very useful for converting markdown to other formats that may be required for formal submission to a journal.

10.1.13

Jan 10, 2013 Roundup

Here are some things I noticed today that others might benefit from:

Probe to Probe Mapping

Neil Saunders highlighted a problem in software that is supposed to work with different array types, CNAmet:
"the problem of mapping measurements between different array types is not dealt with by CNAmet": or anything else in my experience (link)
There is a software package that does these types of interconversions, Absolute ID Convert, which uses genomic coordinate mapping to interconvert between array IDs. The paper was published in BMC Bioinformatics last year (link) by a PhD student from my group, who has gone on to a postdoc at Harvard.

Sequencing Errors due to Sample Processing

Nick Loman (twitter, blog) has an interesting post highlighting recent developments in errors in Next-Generation Sequencing arising from sample preparation protocols. Casey Bergman is maintaining a collection of papers related to sequencing errors.

Protein Annotation Biases

Iddo Friedberg has a new arXiv preprint examining biases in the experimental annotations of protein function, and their effect on our understanding of protein function. I haven't read it yet, but from the abstract, it looks like it is definitely something to check out.

Free Statistical Methods Ebook

If you work with any amount of data (likely if you read this blog regularly), then you might be interested in the free ebook of "The Elements of Statistical Learning" (Hastie, Tibshirani, and Friedman), which is a book on a wide variety of machine learning methods, written by experts in the field. I downloaded my copy, you should too.