Friday, January 30, 2015

Import complete dataset and clean it

OK, so I was able to get the data into the right format, but then I realized it was a hand-selected subset. I think I would like to look at a heatmap of all the data, normalized. First I will set the column names and remove the unused rows at the top.

colnames(somaTD) <- c("ID","1D","1T","2D","2T","3D","3T","4D","4T","5D","5T","6D","6T","7D","7T","8D","8T")
df2 <- df1[3:859, ]

So I will clean up the entire dataset. First I need to eliminate the duplicate values, which is easier said than done. I can flag which rows are duplicates based on the gene ID using this (wrapping it in sum() gives the count):

 duplicated(somaTD[,1])
and make a list of the duplicated IDs using this:

somaTD[duplicated(somaTD[,1]),1]
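
To look at every copy of a duplicated ID side by side before deciding what to keep, something like the following sketch may help. It runs on a made-up data frame (`toy`, with invented IDs and values) rather than the real somaTD, and the `make.unique()` step is only one possible way to rename duplicates automatically, not what was actually done here.

```r
# Toy stand-in for somaTD; the IDs and values are made up for illustration.
toy <- data.frame(
  ID = c("geneA", "geneB", "geneA", "geneC", "geneB", "geneA"),
  D1 = c(10, 20, 11, 30, 21, 12),
  stringsAsFactors = FALSE
)

# Number of rows that repeat an earlier ID
n_dup <- sum(duplicated(toy$ID))

# Every row whose ID occurs more than once, including the first occurrence,
# sorted so that copies of the same gene sit next to each other
all_copies <- toy[toy$ID %in% toy$ID[duplicated(toy$ID)], ]
all_copies <- all_copies[order(all_copies$ID), ]

# Possible alternative to hand-editing: make IDs unique by adding a suffix
toy$ID_unique <- make.unique(toy$ID)
```

Note that `duplicated()` only marks the second and later copies, which is why the `%in%` step is needed to pull out the first copy as well.
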
But in reality I have to decide which duplicates to keep in the set, or rename where possible, so I did this manually in Excel. I reimported the data and want to select only the T and D columns, like this:

somaTD <- subset(somdataT, select= c(2,6,7,9,10,12,13,15,16,18,19,21,22,24,25,27,28))
# keep rows 2 to 859; the trailing comma selects rows rather than columns
somaTD <- somaTD[2:859, ]

Now I have a complete, clean, non-duplicated set. I can make a set with background values eliminated by keeping only the genes whose mean is more than 1 standard deviation above background (the code uses 450 as that cutoff). First I make a matrix with the row means added.
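df7.means <- rowMeans(df7)
df7.m <- cbind(df7, df7.means)
# genes kept: mean signal above the cutoff
df8 <- subset(df7.m, df7.means > 450)
# genes eliminated as background (<= so a mean of exactly 450 is not dropped from both sets)
df8.bkgd <- subset(df7.m, df7.means <= 450)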
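
The 450 cutoff is typed in by hand above. As a rough sketch, the same filter could compute the cutoff from the background readings themselves. Everything here is an assumption for illustration: `bkgd` is a hypothetical vector of background signals and `expr` is a toy matrix, neither of which comes from the real data.

```r
# Hypothetical background readings (e.g. blank or negative-control signals)
bkgd <- c(380, 400, 420, 410, 390)

# Cutoff = background mean + 1 standard deviation
cutoff <- mean(bkgd) + sd(bkgd)

# Toy expression matrix: rows are genes, columns are samples
expr <- matrix(
  c(500, 520, 480,
    300, 310, 290,
    450, 430, 440),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2", "s3"))
)

# Split genes into kept and background sets; drop = FALSE keeps the result
# a matrix even when only one gene is left
gene_means <- rowMeans(expr)
kept    <- expr[gene_means >  cutoff, , drop = FALSE]
dropped <- expr[gene_means <= cutoff, , drop = FALSE]
```

Using `>` for one set and `<=` for the other guarantees every gene ends up in exactly one of them.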

Next up is learning the mathematical functions to create ratios and normalize to a single gene.
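
As a starting point, here is one possible sketch of both steps on a toy matrix. The gene names (including `refGene` as the reference gene), the values, and the choice of a log2 T/D ratio are all assumptions made for illustration; only the 1D/1T column naming follows the post.

```r
# Toy matrix with paired D/T columns, named like 1D, 1T, 2D, 2T
expr <- matrix(
  c(100, 200, 100,  50,
     50,  50,  25,  25,
    400, 100, 200, 400),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "refGene", "geneC"), c("1D", "1T", "2D", "2T"))
)

# Normalize to a single gene: divide each column by that sample's refGene value
ref  <- expr["refGene", ]
norm <- sweep(expr, 2, ref, "/")

# T/D ratio for each pair, on a log2 scale (0 means no change)
d_cols <- grep("D$", colnames(norm), value = TRUE)
t_cols <- sub("D$", "T", d_cols)
log_ratio <- log2(norm[, t_cols, drop = FALSE] / norm[, d_cols, drop = FALSE])
colnames(log_ratio) <- sub("D$", "", d_cols)
```

After this, `heatmap(log_ratio)` from base R would be one way to get the heatmap mentioned at the top of the post.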
