I just came back from our Bioinformatics group (a rather loose association of various
researchers at UofL interested in and doing bioinformatics) journal club, where we
discussed this recent paper:
Google Goes Cancer: Improving Outcome Prediction for Cancer Patients by Network-Based Ranking of Marker Genes
Besides the catchy title that makes one believe that perhaps Google is getting into
cancer research (maybe they are and we don't know it yet), there were some interesting
aspects to this paper.
Premise
The premise is that combining gene expression data with network data can produce
better associations between gene expression and a particular disease endpoint.
This is carried out by using the TRANSFAC transcription factor - gene target
database as the network, the correlation of each gene's expression with disease
status as that gene's initial importance, and the Google PageRank algorithm as
the means to transfer the network knowledge to the gene expression data. They
call their method NetRank.
Note that the general idea had already been tried in this paper on GeneRank.
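To make the premise concrete, here is a minimal sketch of a NetRank-style calculation: a personalized PageRank in which the restart vector is built from each gene's correlation with the outcome. This is my own paraphrase in Python/numpy, not the authors' implementation; `adjacency` and `corrs` are assumed inputs.

```python
import numpy as np

def netrank(adjacency, corrs, d=0.3, tol=1e-8, max_iter=1000):
    """Personalized PageRank over a gene network (a sketch of the NetRank idea).

    adjacency: (n, n) matrix with adjacency[i, j] = 1 if gene j points to
               gene i (e.g. a TRANSFAC transcription factor -> target edge)
    corrs:     (n,) absolute correlations of expression with disease status;
               genes that fail expression filtering get 0 but stay in the network
    d:         damping factor controlling how much rank flows through the
               network (0 = correlations only; the paper's best value was 0.3)
    """
    n = len(corrs)
    # Column-normalize so each gene splits its rank among its targets.
    out_degree = adjacency.sum(axis=0).astype(float)
    out_degree[out_degree == 0] = 1.0            # leave sink columns as zeros
    transition = adjacency / out_degree

    restart = corrs / corrs.sum()                # teleport (restart) distribution
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_rank = (1 - d) * restart + d * transition @ rank
        if np.abs(new_rank - rank).sum() < tol:  # L1 convergence check
            break
        rank = new_rank
    return rank
```

One thing the sketch makes clear: with d = 0.3, most of a gene's final score still comes from its own correlation, and the network contributes a fairly gentle smoothing.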
Implementation
Rank the genes by their association with disease status (poor or good prognosis)
using one of several methods (SAM, t-test, fold-change, correlation, NetRank).
Pick the n top genes, and develop a predictive model using a support vector
machine. Wash, rinse, repeat several times to find the best set, varying the
number of top genes and the number of samples used in the training set.
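As a sketch of that evaluation loop (my reconstruction in scikit-learn terms, not the authors' code; `X`, `y`, and `gene_scores` are hypothetical inputs):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit
from sklearn.svm import SVC

def evaluate_top_genes(X, y, gene_scores, n_top, n_splits=100, train_size=0.6):
    """Monte Carlo cross-validation of an SVM built on the n_top ranked genes.

    X: (samples, genes) expression matrix; y: prognosis labels (poor/good);
    gene_scores: per-gene scores from a ranking method (SAM, t-test,
    fold-change, correlation, or NetRank).
    """
    top = np.argsort(gene_scores)[::-1][:n_top]  # indices of the top genes
    # Note: strictly, the ranking should be redone inside each training
    # split to avoid selection bias; it is fixed here to keep the sketch short.
    accuracies = []
    splitter = ShuffleSplit(n_splits=n_splits, train_size=train_size,
                            random_state=0)
    for train, test in splitter.split(X):
        clf = SVC(kernel="linear").fit(X[train][:, top], y[train])
        accuracies.append(clf.score(X[test][:, top], y[test]))
    return np.mean(accuracies)
```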
For NetRank, the top genes were decided using a sub-optimization that varied
d, the damping factor in the PageRank algorithm, which determines how much
information can be transferred to other genes. The best value of d determined
in this study was 0.3.
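Reusing the two sketches above, that sub-optimization amounts to an inner grid search on the training data, something like (all names hypothetical):

```python
import numpy as np

# Grid search for the damping factor, reusing netrank() and
# evaluate_top_genes() from the sketches above; X_train, y_train,
# adjacency, and corrs are placeholder inputs.
best_d, best_acc = None, -np.inf
for d in np.arange(0.1, 1.0, 0.1):
    scores = netrank(adjacency, corrs, d=d)
    acc = evaluate_top_genes(X_train, y_train, scores, n_top=20)
    if acc > best_acc:
        best_d, best_acc = d, acc
```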
All the other methods used just the 8000 genes that passed filtering, but NetRank
used all the genes on the array, with those that were filtered out having their
initial correlations set to 0 so that they were still present in the network
representation.
Did it work?
From the paper, it appears to have worked. Using Monte Carlo cross-validation,
they were able to achieve over 70% prediction accuracy. This was better than any
of the other methods they used to associate genes with the disease, including SAM,
t-test, fold-change, and raw correlation.
Issues
As we discussed the article, some questions did come up.
- What was the variation in d depending on the size of the training set?
- How consistent were the genes that came out as biomarkers?
- It would be nice to try this methodology on a series of independent but
related cancer datasets (e.g., breast or lung cancer) and see how consistent the
gene lists are. This was done here.
- What happens if the genes that don't pass filtering are removed from the network
entirely?
- Were the reported problems with unfiltered genes due to having only two
disease states (poor and good prognosis) against which to calculate the
correlation of expression?
- How many iterations does it take to achieve convergence?
- The genes they came up with are fairly well-known cancer genes. We were
kind of surprised that they didn't seem to come up with novel genes associated
directly with pancreatic cancer.
- Why is d so variable depending on the cancer examined?
Things to try
- Could we improve on this by looking for the top-ranked cliques instead of
just the top-ranked genes, i.e. take the top gene, remove anything in its
immediate neighborhood, and then go on to the next one? (See the sketch after
this list.)
- What would happen if we used a directed network based on connected Reactome
or KEGG pathways?
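For the first idea, a minimal greedy sketch, assuming a networkx graph of the TRANSFAC network and a dict of NetRank scores (both hypothetical here):

```python
import networkx as nx

def top_spread_genes(graph, scores, n_top):
    """Greedily take top-scoring genes while excluding each pick's immediate
    network neighborhood, so the markers come from distinct regions."""
    excluded, picked = set(), []
    # Walk genes from highest to lowest score.
    for gene in sorted(scores, key=scores.get, reverse=True):
        if gene in excluded:
            continue
        picked.append(gene)
        excluded.add(gene)
        if graph.has_node(gene):
            excluded.update(graph.neighbors(gene))  # drop the neighborhood
        if len(picked) == n_top:
            break
    return picked

# Toy usage with made-up scores:
g = nx.Graph([("TP53", "MDM2"), ("MDM2", "CDKN1A")])
print(top_spread_genes(g, {"TP53": 0.9, "MDM2": 0.8, "CDKN1A": 0.5}, 2))
# -> ['TP53', 'CDKN1A']  (MDM2 is skipped as a neighbor of TP53)
```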
The markdown source of this post is here.