Diary of R: February 2015

Thursday, February 12, 2015

Heatmaps in R for protein expression data

To use the heatmap functions, I need a numeric matrix of the data, so I will need a separate file for each ratio. I will try to build it using reshape.

For the log2 of normalized Tumor/Distant ratios:

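# keep GeneID, pt.num and the log2 T/D ratio column
df13 <- df11b[, c(1, 2, 11)]
df13b <- reshape(df13, idvar = "GeneID", timevar = "pt.num", direction = "wide")
# gene names become row names; the remaining columns form the numeric matrix
rownames(df13b) <- df13b[, 1]
df13b.m <- as.matrix(df13b[, 2:9])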
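To check the reshape step on something small, here is a minimal sketch with toy data. The data frame toy and its values are made up for illustration; only the column names GeneID, pt.num and log2NormT.D come from my data.

```r
# Toy long-format data: 3 genes x 2 patients (values are made up)
toy <- data.frame(
  GeneID = rep(c("A2M", "ABL1", "ACP1"), times = 2),
  pt.num = rep(1:2, each = 3),
  log2NormT.D = c(0.5, -1.2, 0.1, 0.3, -0.8, 0.4)
)

# One row per gene, one column per patient
toy.wide <- reshape(toy, idvar = "GeneID", timevar = "pt.num", direction = "wide")

# Move the gene names into the row names and keep only the numeric columns
rownames(toy.wide) <- as.character(toy.wide$GeneID)
toy.m <- as.matrix(toy.wide[, -1])

dim(toy.m)
colnames(toy.m)
```

The wide columns are named after the value column plus the patient number, so the matrix ends up with one column per patient.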

OK. So now I am in business for the heat maps. There are several packages to use: heatmap is built in, gplots has heatmap.2, and there are some made for expression data. But I have tried these packages, and my input data may look too different from what they expect.

library(gplots)  # provides heatmap.2
heatmap(df13b.m)
heatmap.2(df13b.m)

We can scale the data so that each column has mean=0 and sd=1 with the following (scale works column by column):

df13.scale<-as.matrix(scale(df13b.m))
heatmap(df13.scale)
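Since scale() works on columns (patients here), scaling each gene instead needs a transpose. A minimal sketch on a made-up matrix; the names m, col.scaled and row.scaled are just for illustration.

```r
set.seed(1)
# Toy matrix: 3 genes (rows) x 4 patients (columns), random values
m <- matrix(rnorm(12, mean = 10, sd = 3), nrow = 3,
            dimnames = list(c("g1", "g2", "g3"), paste0("pt", 1:4)))

col.scaled <- scale(m)         # each column (patient): mean 0, sd 1
row.scaled <- t(scale(t(m)))   # each row (gene): mean 0, sd 1

round(colMeans(col.scaled), 10)
round(rowMeans(row.scaled), 10)
apply(row.scaled, 1, sd)
```

Note that heatmap() also scales rows by default (scale = "row"), so heatmap(row.scaled, scale = "none") avoids scaling twice.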

The clustering looks different using the scaled data. It is difficult to compare the clustering in detail because the matrix is so large. We can look at smaller clusters and specify the number of clusters using hclust, as done here: http://www.r-bloggers.com/drawing-heatmaps-in-r/

Separate the clusters. First, cluster by row:

df13.hc<-hclust(dist(df13.scale))
plot(df13.hc)

But I could not get the clustering selection to work with my set in a meaningful way. Need to play with this option. At least we have a dataset to work with.
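One option I still need to try on the real data is cutree, which cuts an hclust tree into a chosen number of groups. A minimal sketch on made-up data; the matrix m, the choice k = 2 and the two simulated groups are assumptions for illustration, not my dataset.

```r
set.seed(42)
# Toy matrix: 10 "genes" x 4 "patients", drawn from two shifted groups
m <- rbind(
  matrix(rnorm(20, mean = -2), nrow = 5),
  matrix(rnorm(20, mean =  2), nrow = 5)
)
rownames(m) <- paste0("gene", 1:10)

hc <- hclust(dist(m))        # cluster the rows
groups <- cutree(hc, k = 2)  # assign each row to one of 2 clusters
table(groups)

# Pull out the rows of one cluster to plot it on its own
cluster1 <- m[groups == 1, , drop = FALSE]
```

Each subset such as cluster1 is still a numeric matrix, so it can go straight into heatmap() as long as it has at least two rows.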

Friday, February 6, 2015

Normalize the data- coding

Now I am back to normalizing the data. I have a datafile I think I can work with. I need to figure out how to normalize all data to a single gene.

http://stackoverflow.com/questions/28334066/normalize-all-data-to-single-gene-observation-in-r

Great! Another solution to my problem. I just love the Stack Overflow community. Great advice and a fun model to encourage participation.

So I normalized my data as suggested:

df11<- group_by(df10g,type,pt.num)
df11<- mutate(df11, normIGF2R= value/value[GeneID=="IGF2R"])
df11<-mutate(df11, log2normIGF2R = log2(normIGF2R))

But then I realized that I was going to try the T/D ratios of the normalized data, so I converted back to the format with T and D in separate columns, sorted by patient, to allow for the ratio columns:

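df11b <- group_by(df10b, pt.num)  # normalize within each patient, as in the long format above
df11b <- mutate(df11b, DnormIGF2R = D/D[GeneID=="IGF2R"], TnormIGF2R = T/T[GeneID=="IGF2R"])
df11b <- mutate(df11b, NormT.D = TnormIGF2R/DnormIGF2R, log2NormT.D = log2(NormT.D))
df11b <- as.data.frame(ungroup(df11b))  # back to a plain data frame for reshape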

So my data is now normalized and log2 transformed, with T/D ratios. I can compare all the data by heatmaps.
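As a sanity check on the arithmetic, here is the same per-patient normalization written in base R (split and lapply instead of dplyr) on toy numbers chosen so the answers are easy to verify by hand. The data frame toy and the helper normalize_patient are hypothetical names for this sketch.

```r
# Toy data: 2 genes x 2 patients, values chosen to give round results
toy <- data.frame(
  GeneID = rep(c("A2M", "IGF2R"), times = 2),
  pt.num = rep(1:2, each = 2),
  D = c(200, 100, 120, 30),
  T = c(400, 100,  60, 60)
)

# Divide D and T by the reference gene's value within one patient,
# then take the log2 of the normalized T/D ratio
normalize_patient <- function(d, ref = "IGF2R") {
  d$DnormIGF2R  <- d$D / d$D[d$GeneID == ref]
  d$TnormIGF2R  <- d$T / d$T[d$GeneID == ref]
  d$log2NormT.D <- log2(d$TnormIGF2R / d$DnormIGF2R)
  d
}

# Apply the helper to each patient separately and stack the results
toy.norm <- do.call(rbind, lapply(split(toy, toy$pt.num), normalize_patient))
toy.norm
```

The reference gene always comes out as 0 on the log2 scale, which is a quick way to confirm the normalization was applied per patient.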

Wednesday, February 4, 2015

Tidy up the data

OK. So I found some interesting ways to do the normalization, ratios, and log of the data, but I think I need to tidy my data first. I loaded the tidyr package and want to use gather and separate to make a tall, skinny list rather than a fat, wide one.

I posted a question about it here http://stackoverflow.com/questions/28310730/using-gather-to-tidy-dataset-in-r-attributes-are-not-identical

Some problem with the data is preventing me from using the gather function. When looking at the details of the data, it says there is a factor with 616 levels for one sample and a factor with 612 levels for another. Not sure why there is so much variability. There are no nulls in my dataset.

I tried another tactic. I used the reshape function like this:

sample1 <- reshape(sample, varying= 2:17, direction= 'long', sep= "", timevar="pt.num")

head(sample1)
    GeneID pt.num      D      T id
1.1    A2M      1 8876.5 8857.9  1
2.1   ABL1      1 2120.8 1664.9  2
3.1   ACP1      1 1266.6 1347.1  3
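For reference, a minimal sketch of how reshape with sep="" splits column names such as D1 and T1 into a variable name and a patient number. The data frame wide is a toy; the D2 and T2 values are made up.

```r
# Toy wide data: one D and one T column per patient
wide <- data.frame(
  GeneID = c("A2M", "ABL1"),
  D1 = c(8876.5, 2120.8), D2 = c(9000.0, 2000.0),
  T1 = c(8857.9, 1664.9), T2 = c(9100.0, 1500.0)
)

# sep = "" splits names where a letter is followed by a digit: D1 -> D and 1
long <- reshape(wide, varying = 2:5, direction = "long", sep = "",
                timevar = "pt.num")
long
```

The result has one row per gene and patient, with D and T side by side, which is the layout needed for the T/D ratio.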

Now that I have a decent file, I tried to calculate the T/D ratio using the mutate function:

df9b<-mutate(df9a, Ratio = T/D)

Warning message:

In Ops.factor(c(572L, 139L, 82L, 220L, 553L, 508L, 281L, 306L, 387L, :
/ not meaningful for factors

    GeneID pt.num       D       T  id Ratio
1      A2M      1  8876.5  8857.9   1    NA
2     ABL1      1  2120.8  1664.9   2    NA
3     ACP1      1  1266.6  1347.1   3    NA
4     ACP5      1 67797.6 24218.2   4    NA
5   ACVRL1      1   650.1   822.8   5    NA
6     ACY1      1  6264.8  7112.9   6    NA

139  CHEK2      1   571.9   524.9 139    NA

Another warning message! At first I could not see why 139L, 82L, etc. show up in the warning, or why the ratio is NA in every row. OK, it looks like the columns are factors and not numeric (the values ending in L are the integer codes of the factor levels, not row numbers). df8 had numeric values, so I will need to go back to that file and use it!
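A minimal sketch of the factor problem and one way to convert such columns back to numbers. The vector f and the data frame df are toy examples built from a few of the values above, not my real data.

```r
# A factor that looks numeric when printed
f <- factor(c("8876.5", "2120.8", "1266.6"))

as.numeric(f)                # integer level codes, not the measurements
as.numeric(as.character(f))  # the actual values

# Convert every factor column of a toy data frame, then take the ratio
df <- data.frame(D = factor(c("8876.5", "2120.8")),
                 T = factor(c("8857.9", "1664.9")))
df[] <- lapply(df, function(col) as.numeric(as.character(col)))
df$Ratio <- df$T / df$D
df
```

Running str() on the data frame before doing any arithmetic is a quick way to spot columns that came in as factors.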

df10<-reshape(df8, varying= 1:16, direction= 'long', sep= "", timevar="pt.num")
Error in split.default(nms, factor(nn[, 1L], levels = vn)) :
first argument must be a vector

df10 <- df8[c(1:16)]  # start from the numeric data in df8
df10 <- cbind(GeneID = rownames(df10), df10)
rownames(df10) <- NULL
df10a <- reshape(df10, varying = 2:17, direction = 'long', sep = "", timevar = "pt.num")
df10b <- mutate(df10a, Ratio = T/D)

OK. It worked. It was just a factor problem that got introduced!

Now the original approach works.

df10g<-gather(df10, pt.num.type, value, 2:17, -GeneID)
df10g<-separate(df10g,pt.num.type, c("type","pt.num"), sep=1)
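To see what the gather and separate steps do, here is a minimal sketch on toy data, assuming the tidyr package is installed and the columns are named like D1 and T1. The data frame wide and its D2 and T2 values are made up.

```r
library(tidyr)

# Toy wide data: one D and one T column per patient
wide <- data.frame(
  GeneID = c("A2M", "ABL1"),
  D1 = c(8876.5, 2120.8), D2 = c(9000.0, 2000.0),
  T1 = c(8857.9, 1664.9), T2 = c(9100.0, 1500.0),
  stringsAsFactors = FALSE
)

# Stack all measurement columns into key/value pairs
long <- gather(wide, pt.num.type, value, -GeneID)

# sep = 1 splits after the first character: "D1" -> type "D", pt.num "1"
long <- separate(long, pt.num.type, c("type", "pt.num"), sep = 1)
long
```

Here pt.num comes out as character; separate has a convert argument (convert = TRUE) that turns it into an integer when that is more convenient.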