Diary of R: January 2015

Friday, January 30, 2015

Import complete dataset and clean it

Ok. So I was able to get the data into the right format. But then I realized it was a handselected subset. I think I would like to look at a heatmap of all the data normalized. First I will set the colnames and remove the unused rows at top.

colnames(somaTD)<-c("ID","1D","1T","2D","2T","3D","3T","4D","4T","5D","5T","6D","6T","7D","7T","8D","8T")
df2<- df1[c(3:859), ]

So I will clean up the entire dataset. First I will need to eliminate teh duplicate values. Which is easier said than done. I can find how many duplicates there are based on the gene Id using this:

duplicated(somaTD[,1])

and make a list of them using this

somaTD[duplicated(somaTD[,1]),1]

But in reality I have to decide which duplicates to keep in the set or rename when possible. So I did this manually in excel. I reimported the data and want to select only T and D like this:

somaTD <- subset(somdataT, select= c(2,6,7,9,10,12,13,15,16,18,19,21,22,24,25,27,28))
somaTD <- somaTD[2:859]

Now i have a complete, clean and non-duplicated set. I can make a set with background values eliminated by selecting only those with mean above 1 std dev above background. Make matrix with row.means added.

df7.means<-rowMeans(df7)
df7.m<-cbind(df7,df7.means)
df8 <- subset(df7.m, df7.means>450)

#eliminated the following genes-
df8.bkgd <- subset(df7.m, df7.means<450)

Next up is learning the mathmatical functions to create ratios and do normalization to a single gene.

Thursday, January 29, 2015

Importing normalized data

If I were better at R, I could just do all the math in R. But I am still better at identifying patterns using basic excel functions and color coding. So now, I have a dataset that I would like to import into R Studio.

Norm.data <- read.csv("Z:/Data/Lung data/somdataT-norm-ig2b-TD.csv")
Norm.data <- na.omit(Norm.data)

Turn it into a data matrix using

Norm.data.m<- (data.matrix(Norm.data))

But there are not character row names?! I found two duplicate entries that may have caused errors in doing this and assigned them unique names. But I still got the error. Try setting names using this

rownames(Norm.data) <- Norm.data[ ,1]

Simpler to use this function in one line (and works)

Norm.data.rn.m <- data.frame(Norm.data[,-1], row.names=Norm.data[,1])
Norm.data.m<- (data.matrix(Norm.data.rn.m))

Now I can finally apply the simple version of heatmap to view the data:

heatmap(Norm.data.m)

To look at only a subset of the data for the patients with overlapping data:

Norm.data.OL <- subset(Norm.data.m, select=c(2,4:8))
heatmap(Norm.data.OL)

Other notes:

I am using the subset function to select parts of the data.

x <- subset(somdataT, select = c(4:28))
gene <- subset(somdataT, select =c(2))

This can also be used conditionally to select above background

somdat.bkgrd <- subset(somdataT, pt5>350, select = c(pt5,pt8))

Wednesday, January 21, 2015

Normalization of Soma data

Verified my choice for normaliation gene using the Normfinder from MOMA. It compares variability among all candidate genes to determine the best candidate to use for producing the least systematic error.

The IGF2R has a stability value of 0.09 that ranks #52 but the other lower candidates are mostly background (low values). Also looked at the values for the common normalization factors.

TCTP	0.145
GAPDH, liver	0.173
UBC9	0.317
PSA-ACT	0.412

TPT1 has a mean of 34915 and SD 4437 and will be quite a bit higher than most proteins in the dataset.

http://cancerres.aacrjournals.org/content/64/15/5245.full

From the paper:

Stability Value.

Having estimated both the intra- and intergroup variation, we combine the two into a stability value, which intuitively adds the two sources of variation and thus represents a practical measure of the systematic error that will be introduced when using the investigated gene.

Let α_i be the mean of α_ig. For a future experiment, it is the distribution of z_ig − θ_g − α_i that defines the stability of gene i. However, we find a distribution to be an impractical stability measure and therefore reduce it to a one-dimensional value by defining the stability value ρ_ig to be the absolute value of the mean + 1 SD. We use a Bayesian argument to find the distribution of z_ig − θ_g − α_i in a future experiment (see supplementary data for details). Finally, to implement our stability measure, we need to estimate z_ig − θ_g − α_i. To this end, we make the assumption that the average of α_ig over the genes i = 1,… k is independent of the group g. Then z_ig − θ_g − α_i is naturally estimated by d_ig = z_ig − z̄_i· − z̄_·g + z̄_··, where a bar indicates an average, and the stability value becomes $\text{[math]}$ where γ² is the variance of α_ig. We estimate γ² by $\text{[math]}$ if this is positive; otherwise, we set it to 0.

To get a single value for gene i, we let ρ_i be the average of ρ_ig, g = 1,…, G. For details on the derivation and for a stability measure based on an average of several control genes, see the supplementary data.

Learning R with lung cancer dataset

In the first paper the protein expression data was used to look for global patterns with tumor expression versus normal tissue. But now we have additional data for genetic sequencing and can look for more specific differences in patterns rather than population patterns or identifying potential biomarkers. In the first paper much of the data was analyzed as a ratio- T/D. But here we can look for a normalization factor closer to a housekeeping gene. IGF2R (X3482) was selected for the expression level was close to the mean of the RFU and had the smallest standard deviation. There is some evidence suggesting IGF1R may be important in cancer signalling networks, but IGF2R binds more specifically to IGF2 and lacks an intracellular kinase binding domain making it less likely to participate in the signalling cascade. But little data is available to suggest it is expressed in relatively similar amounts in tumor and normal tissue. GAPDH is present in our dataset but is at the top of expression range making it a poor choice for a normalization factor.

Then I looked at the log2 of the normalized data. I selected proteins that showed variability between tumor and distant, as well as the varibility between patients. This narrowed the pool from 850 to ~130. Now I want to plot the selected data.

I loaded the gplots package to use both Venn diagrams and heatmaps. There is a vignette for making Venn diagrams. I will have to explore it. And I want to use some hierarchical clustering for the heatmaps. But it does not look like the standard heatmap can do that part.

Here is a cute description of mapping using the packages
http://mannheimiagoesprogramming.blogspot.com/2012/06/drawing-heatmaps-in-r-with-heatmap2.html

But this description is more useful for clustering approaches (plus R code)
https://www.biostars.org/p/14156/