Friday, January 30, 2015

Import complete dataset and clean it

Ok. So I was able to get the data into the right format. But then I realized it was a handselected subset. I think I would like to look at a heatmap of all the data normalized. First I will set the colnames and remove the unused rows at top.

colnames(somaTD)<-c("ID","1D","1T","2D","2T","3D","3T","4D","4T","5D","5T","6D","6T","7D","7T","8D","8T")
df2<- df1[c(3:859), ]

So I will clean up the entire dataset. First I will need to eliminate teh duplicate values. Which is easier said than done. I can find how many duplicates there are based on the gene Id using this:

 duplicated(somaTD[,1])
and make a list of them using this

somaTD[duplicated(somaTD[,1]),1]
But in reality I have to decide which duplicates to keep in the set or rename when possible. So I did this manually in excel. I reimported the data and want to select only T and D like this:

somaTD <- subset(somdataT, select= c(2,6,7,9,10,12,13,15,16,18,19,21,22,24,25,27,28))
somaTD <- somaTD[2:859]

Now i have a complete, clean and non-duplicated set. I can make a set with background values eliminated by selecting only those with mean above 1 std dev above background. Make matrix with row.means added.
df7.means<-rowMeans(df7)
df7.m<-cbind(df7,df7.means)
df8 <- subset(df7.m, df7.means>450)
#eliminated the following genes-
        df8.bkgd <- subset(df7.m, df7.means<450)

Next up is learning the mathmatical functions to create ratios and do normalization to a single gene.

Thursday, January 29, 2015

Importing normalized data

If I were better at R, I could just do all the math in R. But I am still better at identifying patterns using basic excel functions and color coding. So now, I have a dataset that I would like to import into R Studio.
    Norm.data <- read.csv("Z:/Data/Lung data/somdataT-norm-ig2b-TD.csv")
    Norm.data <- na.omit(Norm.data)
Turn it into a data matrix using
    Norm.data.m<- (data.matrix(Norm.data))
But there are not character row names?! I found two duplicate entries that may have caused errors in doing this and assigned them unique names. But I still got the error. Try setting names using this

     rownames(Norm.data) <- Norm.data[ ,1]
Simpler to use this function in one line (and works)

Norm.data.rn.m <- data.frame(Norm.data[,-1], row.names=Norm.data[,1])
Norm.data.m<- (data.matrix(Norm.data.rn.m))
Now I can finally apply the simple version of heatmap to view the data:

heatmap(Norm.data.m)

To look at only a subset of the data for the patients with overlapping data:

Norm.data.OL <- subset(Norm.data.m, select=c(2,4:8))
 heatmap(Norm.data.OL)

Other notes:

I am using the subset function to select parts of the data.

      x <- subset(somdataT, select = c(4:28))
     gene <-  subset(somdataT, select =c(2))

This can also be used conditionally to select above background 

somdat.bkgrd <- subset(somdataT, pt5>350, select = c(pt5,pt8))

Wednesday, January 21, 2015

Normalization of Soma data

Verified my choice for normaliation gene using the Normfinder from MOMA. It compares variability among all candidate genes to determine the best candidate to use for producing the least systematic error.

The IGF2R has a stability value of 0.09 that ranks #52 but the other lower candidates are mostly background (low values). Also looked at the values for the common normalization factors.

TCTP 0.145
GAPDH, liver 0.173
UBC9 0.317
PSA-ACT 0.412

TPT1 has a mean of 34915 and SD 4437 and will be quite a bit higher than most proteins in the dataset.

http://cancerres.aacrjournals.org/content/64/15/5245.full

From the paper:


Stability Value.

Having estimated both the intra- and intergroup variation, we combine the two into a stability value, which intuitively adds the two sources of variation and thus represents a practical measure of the systematic error that will be introduced when using the investigated gene.
Let αi be the mean of αig. For a future experiment, it is the distribution of zig − θg − αi that defines the stability of gene i. However, we find a distribution to be an impractical stability measure and therefore reduce it to a one-dimensional value by defining the stability value ρig to be the absolute value of the mean + 1 SD. We use a Bayesian argument to find the distribution of zig − θg − αi in a future experiment (see supplementary data for details). Finally, to implement our stability measure, we need to estimate zig − θg − αi. To this end, we make the assumption that the average of αig over the genes i = 1,… k is independent of the group g. Then zig − θg − αi is naturally estimated by dig = zig − i· − ·g + ··, where a bar indicates an average, and the stability value becomesFormulawhere γ2 is the variance of αig. We estimate γ2 byFormulaif this is positive; otherwise, we set it to 0.
To get a single value for gene i, we let ρi be the average of ρigg = 1,…, G. For details on the derivation and for a stability measure based on an average of several control genes, see the supplementary data.

Learning R with lung cancer dataset

In the first paper the protein expression data was used to look for global patterns with tumor expression versus normal tissue. But now we have additional data for genetic sequencing and can look for more specific differences in patterns rather than population patterns or identifying potential biomarkers. In the first paper much of the data was analyzed as a ratio- T/D. But here we can look for a normalization factor closer to a housekeeping gene. IGF2R (X3482) was selected for the expression level was close to the mean of the RFU and had the smallest standard deviation. There is some evidence suggesting IGF1R may be important in cancer signalling networks, but IGF2R binds more specifically to IGF2 and lacks an intracellular kinase binding domain making it less likely to participate in the signalling cascade. But little data is available to suggest it is expressed in relatively similar amounts in tumor and normal tissue. GAPDH is present in our dataset but is at the top of expression range making it a poor choice for a normalization factor.

Then I looked at the log2 of the normalized data. I selected proteins that showed variability between tumor and distant, as well as the varibility between patients. This narrowed the pool from 850 to ~130. Now I want to plot the selected data.

I loaded the gplots package to use both Venn diagrams and heatmaps. There is a vignette for making Venn diagrams. I will have to explore it. And I want to use some hierarchical clustering for the heatmaps. But it does not look like the standard heatmap can do that part.

Here is a cute description of mapping using the packages
http://mannheimiagoesprogramming.blogspot.com/2012/06/drawing-heatmaps-in-r-with-heatmap2.html

But this description is more useful for clustering approaches (plus R code)
https://www.biostars.org/p/14156/