An article in this week's edition of Nature on providing source code for journal articles that depend on new or original computer programs to analyze data (link) led me to two new resources:
1 - An article on scientists' ability to write code (link)
2 - An actual course focused on teaching scientists how to write computer code, known as "Software Carpentry". The materials for the course are posted online, with full lecture content as well as videos. If all you have is a basic introduction to programming, this might be useful.
As for my own programming experience: I took one intro to programming course during my Masters, did a lot of Matlab programming during my PhD (including GUI development), and then learned R, object-oriented programming, and R package development during my PostDoc. I think I am going to work through this course. I have finally started implementing unit tests and using version control, but I am sure there are lessons to be learned from it.
23.2.12
17.2.12
Debugging R using "recover"
I just discovered today that one way to easily insert oneself into a mis-performing function in R is to use "recover", via options(error=recover).
This allows you to enter any of the functions involved in the error, at the point where the error occurred. Hope someone else finds this useful.
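Here is a minimal sketch of what this looks like in practice. The two functions are made-up examples, used only to produce a nested error; the only real pieces are options() and recover.

```r
# Hypothetical functions, used only to illustrate a nested error.
inner_step <- function(x) {
  if (!is.numeric(x)) stop("x must be numeric")
  sqrt(x)
}
outer_step <- function(x) inner_step(x) + 1

# Switch the error handler to recover, keeping the old setting.
old <- options(error = recover)

# In an interactive session, running the next line lists the active
# calls and lets you pick one to browse at the point of failure:
# outer_step("a")

# Restore the previous error handler when done.
options(old)
```

Restoring the old option afterwards is just a habit of mine; you can leave recover switched on for the whole session if you prefer.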
13.2.12
Celebration of teaching and learning
The Delphi center at UofL puts on a teaching conference, I believe each year, and this year the focus is on incorporating digital media into class instruction. I have some trepidation regarding some of the language (wide vs deep reading).
I have been excited by a lot of what is being talked about, however. Check out @DelphiCelebrate and #CLbrT2012 for some of the Twitter chatter.
9.2.12
Bioconductor Packaging: Lessons Learned
1. Having non-standard library locations is a pain
If you don't know, R functionality is generally enabled by loading packages, which are self-contained folders of function definitions. Many packages depend on others, so to use the functionality in one package, you have to load a bunch of others. This normally isn't a big problem, unless you store your installed packages in a non-default location. For some practical reasons, this was my situation. The catch is that the files that tell R where to find the packages are not necessarily read when "building" and "checking" the package. The best way to avoid this problem: either install the packages to the default location, or define the package locations via "R_LIBS" in the "Renviron.site" file in R_HOME/etc/.
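A quick way to see what R is actually using is sketched below. It only inspects the current session, and the library path in the final comment is a made-up placeholder.

```r
# Library locations this R session searches, in order.
.libPaths()

# Value of R_LIBS as seen by this session ("" if it is not set).
Sys.getenv("R_LIBS")

# Location of the site-wide environment file under R_HOME/etc/.
renviron_site <- file.path(R.home("etc"), "Renviron.site")
file.exists(renviron_site)

# A line like the following in that file would add a custom library
# (placeholder path, adjust to your own setup):
# R_LIBS=/path/to/my/R/library
```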
2. Long examples will really slow things down.
The Bioconductor guidelines suggest that running "R CMD check" shouldn't take longer than 4 or 5 minutes, including running examples and running the code in the package vignette. If you have examples that take any length of time to run, or the code in your vignette takes any length of time, you will quickly run over time. I got around this by pre-generating the stuff that took a long time and saving it in the package data.
Which actually brings up another point. I originally did not do this due to issues that R 2.12 had with remembering the classes of objects when I would reload them from the associated data file. This does not seem to be an issue with R 2.14.1 or 2.15 (current dev version).
And that brings up the issue of size. If I do the basic "build" process, my package size is ~2MB, but there is an option "--resave-data" to compress the associated data files more than the default.
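The pre-generate-and-save approach can be sketched roughly as follows. The function, class name and file name are made up for illustration, and a temporary directory stands in for the package's data/ folder.

```r
# Stand-in for an expensive computation (hypothetical).
slow_result <- function() {
  structure(list(values = cumsum(1:1000)), class = "myResult")
}

precomputed <- slow_result()

# In a package this file would live in the data/ directory; a temporary
# directory is used here so the sketch runs anywhere.
rda_file <- file.path(tempdir(), "precomputed.RData")
save(precomputed, file = rda_file, compress = "xz")

# Reload into a fresh environment and check that the class survived.
env <- new.env()
load(rda_file, envir = env)
class(env$precomputed)
```

Saving with compress = "xz" is one way to shrink the file up front; passing "--resave-data" to "R CMD build" asks the build step to recompress the data files instead.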
Do you have anything you've learned the hard way about writing R or Bioconductor packages?
Labels:
bioconductor,
packages,
R
26.1.12
Gene Ontology flattened
In our weekly journal club last week, we looked at an interesting method for discovering genes that are related to one another functionally (link). One of the things they did in the paper was to use:
"A flattened representation of the GO hierarchy ... and stores the annotations as Boolean arrays in which the presence and absence of annotations is recorded (Huang et al., 2007). This representation implicitly contains the ontological relations and allows the inclusion of non-ontological annotations as part of the array. This avoids the inference of relationships through the hierarchical structure of GO."
This is relatively easy to do using the GO.db database in Bioconductor:
This will generate a list, each element of which is a logical vector that indicates the presence or absence of each GO ID. Note that this takes advantage of the "GO2ALLEGS" table in the organism database, which has the GO annotations for all the genes based on annotation to any ancestor GO IDs as well as direct annotation. It would be easy to verify that this does match the flattened representation by getting the direct annotation, and then using "GOBPANCESTOR" to generate a full annotation list. Remember, a gene is indirectly annotated to all the ancestor terms in the GO directed acyclic graph. "GO2ALLEGS" is the easiest way to get this information that I know of in Bioconductor.
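A minimal sketch of the flattening step described above, written against a plain named list so it runs without Bioconductor. The helper name flatten_go and the toy gene IDs are made up for illustration. The commented lines at the end show how the real mapping could be pulled from an organism package, assuming human (org.Hs.eg.db).

```r
# Turn a GO ID -> genes mapping into one logical vector per gene,
# marking the presence or absence of every GO ID.
flatten_go <- function(go2genes) {
  go_ids <- names(go2genes)
  # One row per (gene, GO ID) pair; unique() drops repeats that come
  # from the same annotation appearing with several evidence codes.
  pairs <- unique(data.frame(
    gene = unlist(go2genes, use.names = FALSE),
    go = rep(go_ids, lengths(go2genes)),
    stringsAsFactors = FALSE
  ))
  gene2go <- split(pairs$go, pairs$gene)
  lapply(gene2go, function(ids) setNames(go_ids %in% ids, go_ids))
}

# Toy mapping with made-up gene IDs.
toy <- list("GO:0008150" = c("1", "2", "3"),
            "GO:0006915" = c("2", "3"))
flat <- flatten_go(toy)
flat[["1"]]

# With Bioconductor installed, the real mapping could be obtained with:
# library(org.Hs.eg.db)
# go2genes <- as.list(org.Hs.egGO2ALLEGS)
# flat <- flatten_go(go2genes)
```

On a full genome this produces one vector of length "number of GO IDs" per gene, so a logical matrix may be a more compact way to hold the same information.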
Labels:
bioconductor,
R,
rstats
24.1.12
R NameSpaces and Classes
I have been developing a Bioconductor package as part of my research at UofL, and a lot of it depends on classes from another package. The classes in the other package had a function call as part of their initialization that tended to break whenever these classes were extended as part of new classes.
So in R 2.13.0, it wasn't too hard to get around it:
This generates the following error:
This is due to how the "HyperGParams" class is defined. This happens if you try and extend any of the Category classes, but there is a workaround: give a valid "annotation" to the prototype:
Now, this will work in R 2.13.0, and it will work in R 2.14.1, but what if you want to combine this class with another in a new class?
Note that "GOHyperGParams" also has the "annotation" slot initialized to something useful to avoid the error above.
Now what happens in R 2.14.1?
We get the same error as before! Why?
In R 2.13.0, we were actually creating a new definition of "GOHyperGParams" in the local workspace. However, in 2.14.1, we are no longer allowed to create a duplicate class with the same name. Therefore, to modify it we need to explicitly modify the copy in the original package "Category", like so:
A final twist for embedding this in our new package is that we have to explicitly import the classes from the other package using "importClassesFrom" in the NAMESPACE file:
This was much looser in previous implementations, and I'm guessing this is going to help make coding better for developers in R, as they have to be a little more explicit about what is going on. It has taken some time for me to learn, as I have had only one programming course, way back, in C, and no formal OO training. Everything I have learned has come from writing my own package for Bioconductor.
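To illustrate the general pattern without depending on Category, here is a toy stand-in. The class names (BaseParams, BrokenParams, WorkingParams), the check in the initialize method, and the cutoff slot are all made up; this is not the Category code, only a sketch of why a valid "annotation" in the prototype helps.

```r
library(methods)

# Toy base class whose initialization insists on a usable annotation.
setClass("BaseParams", representation(annotation = "character"))
setMethod("initialize", "BaseParams", function(.Object, ...) {
  .Object <- callNextMethod(.Object, ...)
  if (length(.Object@annotation) != 1L) {
    stop("annotation must name a single annotation package")
  }
  .Object
})

# Extending without a prototype: creating an object fails because the
# inherited initialize method sees an empty annotation slot.
setClass("BrokenParams", contains = "BaseParams",
         representation(cutoff = "numeric"))
broken <- tryCatch(new("BrokenParams"),
                   error = function(e) conditionMessage(e))

# Extending with a valid annotation in the prototype works.
setClass("WorkingParams", contains = "BaseParams",
         representation(cutoff = "numeric"),
         prototype = prototype(annotation = "org.Hs.eg.db"))
ok <- new("WorkingParams", cutoff = 0.05)
ok@annotation
```

In a package, the matching NAMESPACE line would look something like importClassesFrom(Category, HyperGParams, GOHyperGParams), listing whichever classes you extend.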
Labels:
classes,
object oriented,
R,
rstats
20.1.12
RStudio New Features!
For anyone who is using RStudio, there are some new features to note in the next release (0.95).
1 - Multiple projects, multiple RStudio instances
If you are like me and often have large things running, and want to work on something else, then it is nice to be able to have multiple instances (copies if you will) of the editor and programming environment running. I often find myself working on multiple projects, and I don't like shutting down and coming back to something when all I really need is 10 minutes to work on the other thing. So I was very happy to see that you can now fire up multiple copies of RStudio. You can even tie particular instances of RStudio to a particular project, with its associated files, and history.
2 - Integrated version control
I admit, I don't use version control nearly as much as I should. I still tend to depend on keeping old bits of code in files and then running the pieces that I need. But version control has gotten a lot easier with integrated versioning in RStudio using Git or SVN. I had been using Mercurial, but due to the built-in integration with Git, I am switching over (not hard, as I only had about one directory that was actually using VC). What is really sweet about it is that you can just create a project based on a current directory, say you want to use VC, and it will initialize it and restart RStudio, and you are good to go.
One reason I will be using Git over SVN is the ability to stick with local repositories, whereas SVN would require a whole server set up on my machine or somewhere else. And it seems pretty easy to use.
As of 1500 EST on 20/01/12, the download page still showed the old version, but the documentation for version control is up. If you want the version with projects and version control, it is available as a preview release here: http://www.rstudio.org/download/preview.
Edit: The new version is available on the main download page.