Thursday, February 12, 2015

Heatmaps in R for protein expression data

To use the heatmap functions, I need a numeric matrix of the data, so I will have to make a separate file for each ratio. I will try to do it using reshape.

For the log2 of normalized Tumor/Distant ratios:

        df13 <- df11b[, c(1, 2, 11)]
        df13b <- reshape(df13, idvar = "GeneID", timevar = "pt.num", direction = "wide")
        rownames(df13b) <- df13b[, 1]
        df13b.m <- as.matrix(df13b[, 2:9])
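For reference, here is the same long-to-wide-to-matrix dance as a self-contained sketch on a toy data frame (the gene names, patient numbers, and values are all made up):

```r
# Toy long-format table standing in for df13: one row per gene per patient
df13 <- data.frame(GeneID  = rep(c("A2M", "ABL1"), each = 3),
                   pt.num  = rep(1:3, times = 2),
                   log2T.D = c(0.5, -1.2, 0.8, 1.1, 0.3, -0.4))

# Wide format: one row per gene, one column per patient
df13b <- reshape(df13, idvar = "GeneID", timevar = "pt.num",
                 direction = "wide")

# Move the gene IDs into the row names, then drop to a numeric matrix
rownames(df13b) <- df13b[, 1]
df13b.m <- as.matrix(df13b[, -1])
```

heatmap() will accept a matrix like df13b.m directly.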
Ok. So now I am in business for the heat maps. There are several packages to use. heatmap is built-in, gplots has heatmap.2, and there are some made specifically for expression data. But I have tried to use those packages and my input data may look too different to use them.

heatmap(df13b.m)
heatmap.2(df13b.m)

We can scale the data to mean=0 and sd=1 with the following:

df13.scale<-as.matrix(scale(df13b.m))
 heatmap(df13.scale)
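One thing worth remembering (and easy to verify): scale() standardizes each column, which here means each patient, not each gene. A toy check with made-up numbers:

```r
# scale() works column-wise: each COLUMN ends up with mean 0, sd 1
m <- matrix(c(1, 2, 3, 4, 10, 20, 30, 40), nrow = 4)
m.scale <- scale(m)

round(colMeans(m.scale), 10)  # 0 0
apply(m.scale, 2, sd)         # 1 1

# To standardize each ROW (per gene) instead, scale the transpose
m.rowscale <- t(scale(t(m)))
```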
The clustering looks different using the scaled data. It is difficult to compare the clustering in detail due to the large matrix. We can look at smaller clusters and specify the number of clusters using hclust, as done here: http://www.r-bloggers.com/drawing-heatmaps-in-r/

Separate the clusters. First, cluster by row:
df13.hc<-hclust(dist(df13.scale))
plot(df13.hc)
But I could not get the clustering selection to work with my set in a meaningful way. I need to play with this option. At least we have a dataset to work with.
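One option for the cluster selection is cutree(), which cuts the hclust tree into a fixed number of groups, and rect.hclust(), which draws those groups on the dendrogram. A minimal sketch on a made-up matrix (df13.scale would drop in the same way):

```r
# Hypothetical scaled matrix: 10 genes x 4 patients
set.seed(1)
m <- matrix(rnorm(40), nrow = 10,
            dimnames = list(paste0("gene", 1:10), paste0("pt", 1:4)))

# Cluster the rows and cut the tree into k = 3 groups
hc <- hclust(dist(m))
groups <- cutree(hc, k = 3)    # named vector: gene -> cluster number

names(groups)[groups == 1]     # the genes in cluster 1

# Draw the dendrogram with the three clusters boxed
plot(hc)
rect.hclust(hc, k = 3, border = "red")
```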

Friday, February 6, 2015

Normalize the data- coding

Now I am back to normalizing the data. I have a datafile I think I can work with. I need to figure out how to normalize all data to a single gene.

http://stackoverflow.com/questions/28334066/normalize-all-data-to-single-gene-observation-in-r

Great! Another solution to my problem. I just love the Stack Overflow community. Great advice and a fun model for encouraging participation.

So I normalized my data as suggested:

        df11<- group_by(df10g,type,pt.num)  
        df11<- mutate(df11, normIGF2R= value/value[GeneID=="IGF2R"])
        df11<-mutate(df11, log2normIGF2R = log2(normIGF2R))
But then I realized that I was going to try the T/D ratios of the normalized data, so I converted back to the format with T and D in separate columns, sorted by patient, to allow for the ratio columns:

        df11b <- mutate(df10b, DnormIGF2R = D/D[GeneID == "IGF2R"], TnormIGF2R = T/T[GeneID == "IGF2R"])
        df11b <- mutate(df11b, NormT.D = TnormIGF2R/DnormIGF2R, log2NormT.D = log2(NormT.D))

So my data is normalized, log2-transformed, and expressed as T/D ratios. I can compare all the data by heatmaps.
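For comparison, the same normalization can be sketched in base R without dplyr; the values below are made up, and IGF2R is included so the reference gene itself normalizes to 1:

```r
# Toy wide table: one row per gene, D and T columns for one patient
df10b <- data.frame(GeneID = c("IGF2R", "A2M", "ABL1"),
                    pt.num = 1,
                    D = c(2000, 8876.5, 2120.8),
                    T = c(4000, 8857.9, 1664.9))

# Within each patient, divide every gene by that patient's IGF2R value
norm.one.pt <- function(d) {
  d$DnormIGF2R <- d$D / d$D[d$GeneID == "IGF2R"]
  d$TnormIGF2R <- d$T / d$T[d$GeneID == "IGF2R"]
  d
}
df11b <- do.call(rbind, lapply(split(df10b, df10b$pt.num), norm.one.pt))

# Normalized tumor/distant ratio, then log2
df11b$NormT.D     <- df11b$TnormIGF2R / df11b$DnormIGF2R
df11b$log2NormT.D <- log2(df11b$NormT.D)
```

The IGF2R row comes out with NormT.D = 1 (log2 = 0), a quick sanity check that the normalization did what it should.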

Wednesday, February 4, 2015

Tidy up the data

Ok. So I found some interesting ways to do the normalization, ratios, and log of the data, but I think I need to tidy my data first. I loaded the tidyr package and want to use gather and separate to make a tall, skinny list rather than a fat, wide list.

I posted a question about it here http://stackoverflow.com/questions/28310730/using-gather-to-tidy-dataset-in-r-attributes-are-not-identical

Some problem with the data is preventing me from using the gather function. When looking at the details of the data, it says there is a Factor w/ 616 levels for one sample and a Factor w/ 612 levels for another. Not sure why there is so much variability. There are no nulls in my dataset.

I tried another tactic. I used the reshape function like this:

sample1 <- reshape(sample, varying= 2:17, direction= 'long', sep= "", timevar="pt.num")
head(sample1)
    GeneID pt.num      D      T id
1.1    A2M      1 8876.5 8857.9  1
2.1   ABL1      1 2120.8 1664.9  2
3.1   ACP1      1 1266.6 1347.1  3
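The part that makes this work is sep = "": reshape splits each varying column name into a leading variable name and the trailing digits, which become the time value (here, the patient number). A toy version with made-up values, assuming columns named like D1, T1, D2, T2:

```r
# Toy wide table: columns named <variable><patient>, e.g. D1, T1
sampdat <- data.frame(GeneID = c("A2M", "ABL1"),
                      D1 = c(8876.5, 2120.8), T1 = c(8857.9, 1664.9),
                      D2 = c(9000.0, 2000.0), T2 = c(8500.0, 1700.0))

# sep = "" splits "D1" into variable "D" and time 1
sampdat1 <- reshape(sampdat, varying = 2:5, direction = "long",
                    sep = "", timevar = "pt.num")
sampdat1[, c("GeneID", "pt.num", "D", "T")]
```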
 
Now that I have a decent file, I tried to do the ratio of T/D by using the mutate function:

 df9b<-mutate(df9a, Ratio = T/D)
Warning message:
In Ops.factor(c(572L, 139L, 82L, 220L, 553L, 508L, 281L, 306L, 387L,  :
  / not meaningful for factors
  GeneID pt.num       D       T id Ratio
1    A2M      1  8876.5  8857.9  1    NA
2   ABL1      1  2120.8  1664.9  2    NA
3   ACP1      1  1266.6  1347.1  3    NA
4   ACP5      1 67797.6 24218.2  4    NA
5 ACVRL1      1   650.1   822.8  5    NA
6   ACY1      1  6264.8  7112.9  6    NA
...
139  CHEK2      1   571.9   524.9 139  NA

Another warning message! And the Ratio is NA in all cases! The numbers in the warning (572L, 139L, 82L, etc.) are a clue: they are the integer level codes of a factor. So the columns are factors, not numeric. df8 had numeric values, so I will need to go back to that file and use it!
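This is the classic factor trap, and it also explains the integers in the warning above: as.numeric() on a factor returns the internal level codes, not the values. A minimal demonstration:

```r
# Numbers that were read in as a factor (e.g. via read.csv defaults)
f <- factor(c("8876.5", "2120.8", "1266.6"))

as.numeric(f)                # 3 2 1  <- the level codes, not the data!
as.numeric(as.character(f))  # 8876.5 2120.8 1266.6  <- the actual values
```

Reading the file with read.csv(..., stringsAsFactors = FALSE), or converting via as.character() before as.numeric(), avoids the problem.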
 df10 <- reshape(df8, varying = 1:16, direction = 'long', sep = "", timevar = "pt.num")
Error in split.default(nms, factor(nn[, 1L], levels = vn)) :
  first argument must be a vector
df10 <- df8[c(1:16)]
df10 <- cbind(GeneID = rownames(df10), df10)
rownames(df10) <- NULL
df10a <- reshape(df10, varying = 2:17, direction = 'long', sep = "", timevar = "pt.num")
df10b <- mutate(df10a, Ratio = T/D)

OK. It worked. It was just a factor problem that got introduced!

Now the original approach works.

df10g<-gather(df10, pt.num.type, value, 2:17, -GeneID)
df10g<-separate(df10g,pt.num.type, c("type","pt.num"), sep=1)
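For the record, what gather() and separate() are doing here can be sketched in base R on a toy table (made-up values; I am assuming key columns like T1 and D1, where the first character is the tissue type and the rest is the patient number):

```r
# Toy wide table standing in for df10
df10 <- data.frame(GeneID = c("A2M", "ABL1"),
                   T1 = c(8857.9, 1664.9), D1 = c(8876.5, 2120.8))

# gather: stack the value columns into key/value pairs
long <- data.frame(GeneID      = rep(df10$GeneID, times = 2),
                   pt.num.type = rep(c("T1", "D1"), each = 2),
                   value       = c(df10$T1, df10$D1))

# separate(..., sep = 1): split the key after the first character
long$type   <- substr(long$pt.num.type, 1, 1)
long$pt.num <- substr(long$pt.num.type, 2, nchar(long$pt.num.type))
```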

Friday, January 30, 2015

Import complete dataset and clean it

Ok. So I was able to get the data into the right format. But then I realized it was a hand-selected subset. I think I would like to look at a heatmap of all the data, normalized. First I will set the colnames and remove the unused rows at the top.

colnames(somaTD)<-c("ID","1D","1T","2D","2T","3D","3T","4D","4T","5D","5T","6D","6T","7D","7T","8D","8T")
df2<- df1[c(3:859), ]

So I will clean up the entire dataset. First I will need to eliminate the duplicate values, which is easier said than done. I can find how many duplicates there are based on the gene ID using this:

 duplicated(somaTD[,1])
and make a list of them using this:

somaTD[duplicated(somaTD[,1]),1]
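A small demonstration of what duplicated() reports (made-up gene IDs), plus the automatic alternative of keeping only the first occurrence of each:

```r
ids <- c("A2M", "ABL1", "A2M", "ACP1", "ABL1")

duplicated(ids)      # FALSE FALSE TRUE FALSE TRUE: flags later repeats
ids[duplicated(ids)] # "A2M" "ABL1": the names that occur more than once

first.only <- ids[!duplicated(ids)]  # "A2M" "ABL1" "ACP1"
```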
But in reality I have to decide which duplicates to keep in the set, or rename them when possible. So I did this manually in Excel. I reimported the data and want to select only T and D, like this:

somaTD <- subset(somdataT, select= c(2,6,7,9,10,12,13,15,16,18,19,21,22,24,25,27,28))
somaTD <- somaTD[2:859, ]

Now I have a complete, clean, non-duplicated set. I can make a set with background values eliminated by selecting only those genes whose mean is above 1 standard deviation above background. Make a matrix with the row means added:
df7.means <- rowMeans(df7)
df7.m <- cbind(df7, df7.means)
df8 <- subset(df7.m, df7.means > 450)
# eliminated the following genes:
df8.bkgd <- subset(df7.m, df7.means < 450)
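The same background filter on a toy table (the values and the LOWGENE name are made up; 450 is the cutoff used above):

```r
# Toy expression table: two patients, one gene below background
df7 <- data.frame(pt1 = c(8876.5, 300.2, 1266.6),
                  pt2 = c(8857.9, 410.8, 1347.1),
                  row.names = c("A2M", "LOWGENE", "ACP1"))

df7.means <- rowMeans(df7)
df7.m     <- cbind(df7, df7.means)

df8      <- subset(df7.m, df7.means > 450)   # keep: above background
df8.bkgd <- subset(df7.m, df7.means <= 450)  # the eliminated genes
```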

Next up is learning the mathematical functions to create ratios and do normalization to a single gene.

Thursday, January 29, 2015

Importing normalized data

If I were better at R, I could just do all the math in R. But I am still better at identifying patterns using basic excel functions and color coding. So now, I have a dataset that I would like to import into R Studio.
    Norm.data <- read.csv("Z:/Data/Lung data/somdataT-norm-ig2b-TD.csv")
    Norm.data <- na.omit(Norm.data)
Turn it into a data matrix using:
    Norm.data.m<- (data.matrix(Norm.data))
But there are no character row names?! I found two duplicate entries that may have caused errors in doing this and assigned them unique names. But I still got the error. Try setting the names using this:

     rownames(Norm.data) <- Norm.data[ ,1]
It is simpler to do this in one line (and it works):

Norm.data.rn.m <- data.frame(Norm.data[,-1], row.names=Norm.data[,1])
Norm.data.m<- (data.matrix(Norm.data.rn.m))
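The one-line version on a toy Norm.data (made-up gene names and values), showing that the row names survive into the matrix:

```r
Norm.data <- data.frame(GeneID = c("A2M", "ABL1"),
                        pt1 = c(1.2, -0.4), pt2 = c(0.8, 0.1))

# Drop column 1 into the row names, then convert to a numeric matrix
Norm.data.rn.m <- data.frame(Norm.data[, -1], row.names = Norm.data[, 1])
Norm.data.m <- data.matrix(Norm.data.rn.m)
```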
Now I can finally apply the simple version of heatmap to view the data:

heatmap(Norm.data.m)

To look at only a subset of the data for the patients with overlapping data:

Norm.data.OL <- subset(Norm.data.m, select=c(2,4:8))
 heatmap(Norm.data.OL)

Other notes:

I am using the subset function to select parts of the data.

      x <- subset(somdataT, select = c(4:28))
     gene <-  subset(somdataT, select =c(2))

This can also be used conditionally to select values above background:

somdat.bkgrd <- subset(somdataT, pt5>350, select = c(pt5,pt8))
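A toy version of the conditional subset (made-up values; 350 is the cutoff used above):

```r
somdataT <- data.frame(pt5 = c(8876.5, 120.3, 1266.6),
                       pt8 = c(8857.9, 300.0, 1347.1))

# Keep rows where pt5 is above background, returning only the two columns
somdat.bkgrd <- subset(somdataT, pt5 > 350, select = c(pt5, pt8))
```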

Wednesday, January 21, 2015

Normalization of Soma data

Verified my choice of normalization gene using NormFinder from MOMA. It compares variability among all candidate genes to determine the best candidate for producing the least systematic error.

The IGF2R has a stability value of 0.09, ranking it #52, but the lower-ranked candidates are mostly background (low values). Also looked at the values for the common normalization factors:

TCTP 0.145
GAPDH, liver 0.173
UBC9 0.317
PSA-ACT 0.412

TPT1 has a mean of 34915 and an SD of 4437, and will be quite a bit higher than most proteins in the dataset.

http://cancerres.aacrjournals.org/content/64/15/5245.full

From the paper:


Stability Value.

Having estimated both the intra- and intergroup variation, we combine the two into a stability value, which intuitively adds the two sources of variation and thus represents a practical measure of the systematic error that will be introduced when using the investigated gene.
Let αi be the mean of αig. For a future experiment, it is the distribution of zig − θg − αi that defines the stability of gene i. However, we find a distribution to be an impractical stability measure and therefore reduce it to a one-dimensional value by defining the stability value ρig to be the absolute value of the mean + 1 SD. We use a Bayesian argument to find the distribution of zig − θg − αi in a future experiment (see supplementary data for details). Finally, to implement our stability measure, we need to estimate zig − θg − αi. To this end, we make the assumption that the average of αig over the genes i = 1,…, k is independent of the group g. Then zig − θg − αi is naturally estimated by dig = zig − z̄i· − z̄·g + z̄··, where a bar indicates an average, and the stability value becomes [formula in the paper], where γ² is the variance of αig. We estimate γ² by [formula in the paper] if this is positive; otherwise, we set it to 0.
To get a single value for gene i, we let ρi be the average of ρig, g = 1,…, G. For details on the derivation and for a stability measure based on an average of several control genes, see the supplementary data.

Learning R with lung cancer dataset

In the first paper the protein expression data was used to look for global patterns in tumor expression versus normal tissue. But now we have additional data from genetic sequencing and can look for more specific differences in patterns rather than population patterns, or identify potential biomarkers. In the first paper much of the data was analyzed as a ratio, T/D. But here we can look for a normalization factor closer to a housekeeping gene. IGF2R (X3482) was selected because its expression level was close to the mean of the RFU and it had the smallest standard deviation. There is some evidence suggesting IGF1R may be important in cancer signalling networks, but IGF2R binds more specifically to IGF2 and lacks an intracellular kinase domain, making it less likely to participate in the signalling cascade. However, little data is available to suggest it is expressed in relatively similar amounts in tumor and normal tissue. GAPDH is present in our dataset but is at the top of the expression range, making it a poor choice for a normalization factor.

Then I looked at the log2 of the normalized data. I selected proteins that showed variability between tumor and distant, as well as variability between patients. This narrowed the pool from 850 to ~130. Now I want to plot the selected data.

I loaded the gplots package to use both Venn diagrams and heatmaps. There is a vignette for making Venn diagrams that I will have to explore. And I want to use some hierarchical clustering for the heatmaps, but it does not look like the standard heatmap function can do that part.

Here is a cute description of mapping using the packages
http://mannheimiagoesprogramming.blogspot.com/2012/06/drawing-heatmaps-in-r-with-heatmap2.html

But this description is more useful for clustering approaches (plus R code)
https://www.biostars.org/p/14156/