Wednesday, February 4, 2015

Tidy up the data

OK. So I found some interesting ways to do the normalization, ratios, and log of the data, but I think I need to tidy my data first. I loaded the tidyr package and want to use gather and separate to turn the wide table (one column per sample) into a tall, skinny one (one row per measurement).

I posted a question about it here: http://stackoverflow.com/questions/28310730/using-gather-to-tidy-dataset-in-r-attributes-are-not-identical

Some problem with the data is preventing me from using the gather function. When I look at the details of the data, one sample shows up as a factor with 616 levels and another as a factor with 612 levels. Not sure why there is so much variability. There are no nulls in my dataset.
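
A quick way to see which columns came in as factors is str() or sapply() with class. A factor has one level per distinct value, so 616 versus 612 levels would most likely just mean the two samples have different numbers of distinct values. Below is a minimal sketch on a made-up data frame; the names df, D1, T1 and the values are only for illustration, not my real data.

```r
# Toy data frame (hypothetical names and values, for illustration only)
df <- data.frame(
  GeneID = c("GeneA", "GeneB", "GeneC"),
  D1 = factor(c("10.5", "20.8", "10.5")),  # numeric-looking values stored as a factor
  T1 = c(8.9, 16.9, 13.1),                 # a genuinely numeric column
  stringsAsFactors = FALSE
)

# Compact overview of every column's type
str(df)

# Class of each column, and the names of the factor columns
col_classes <- sapply(df, class)
factor_cols <- names(df)[sapply(df, is.factor)]

# Number of levels = number of distinct values in the factor column
n_levels <- nlevels(df$D1)

print(col_classes)
print(factor_cols)
print(n_levels)
```

Any column listed in factor_cols would need converting before arithmetic will work on it.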

I tried another tactic. I used the reshape function like this:

sample1 <- reshape(sample, varying= 2:17, direction= 'long', sep= "", timevar="pt.num")
head(sample1)
    GeneID pt.num      D      T id
1.1    A2M      1 8876.5 8857.9  1
2.1   ABL1      1 2120.8 1664.9  2
3.1   ACP1      1 1266.6 1347.1  3
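
For reference, here is a small self-contained sketch of how that reshape call behaves. With sep = "", reshape splits each column name just before the first digit that follows a letter, so a column called D1 becomes measurement D at pt.num 1. The data frame wide and its values are made up for illustration, and I am assuming the real columns are named in the same letter-plus-number pattern.

```r
# Toy wide table (hypothetical values): one D and one T column per patient
wide <- data.frame(
  GeneID = c("GeneA", "GeneB"),
  D1 = c(10, 20), T1 = c(5, 40),
  D2 = c(30, 40), T2 = c(15, 10),
  stringsAsFactors = FALSE
)

# Columns 2:5 are split into measurement name (D, T) and patient number (1, 2)
long <- reshape(wide, varying = 2:5, direction = "long",
                sep = "", timevar = "pt.num")

# With numeric columns the T/D ratio is a plain division
long$Ratio <- long$T / long$D

print(long)
```

The result has one row per gene and patient, with D and T side by side, which is the shape needed for the ratio.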
 
Now that I have a decent file, I tried to compute the ratio T/D using the mutate function:

 df9b<-mutate(df9a, Ratio = T/D)
Warning message:
In Ops.factor(c(572L, 139L, 82L, 220L, 553L, 508L, 281L, 306L, 387L,  :
  / not meaningful for factors
  GeneID pt.num       D       T id Ratio
1    A2M      1  8876.5  8857.9  1    NA
2   ABL1      1  2120.8  1664.9  2    NA
3   ACP1      1  1266.6  1347.1  3    NA
4   ACP5      1 67797.6 24218.2  4    NA
5 ACVRL1      1   650.1   822.8  5    NA
6   ACY1      1  6264.8  7112.9  6    NA
139            CHEK2      1    571.9    524.9 139    NA 

Another warning message! The numbers in the warning (572L, 139L, 82L, ...) are not row numbers; they are the integer codes R stores internally for a factor. That is why the ratio is NA in every row: the D and T columns are factors, not numeric. df8 had numeric values, so I will go back to that file and use it.

> df10<-reshape(df8, varying= 1:16, direction= 'long', sep= "", timevar="pt.num")
Error in split.default(nms, factor(nn[, 1L], levels = vn)) :
  first argument must be a vector

That call failed, so I moved the gene names out of the row names into a GeneID column and ran reshape again:

> df10<-df10[c(1:16)]
> df10<- cbind(GeneID= rownames(df10), df10)
> rownames(df10) <- NULL
> df10a<-reshape(df10, varying= 2:17, direction= 'long', sep= "", timevar="pt.num")
> df10b<-mutate(df10a, Ratio = T/D)

OK. It worked. It was just a factor problem that got introduced!
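
If going back to a numeric copy had not been an option, the factor columns could probably have been converted in place instead. A minimal sketch on made-up values (the data frame df here is hypothetical): the important detail is going through as.character first, because as.numeric on a factor returns the internal level codes rather than the printed numbers.

```r
# Toy data frame with numeric-looking factor columns (hypothetical values)
df <- data.frame(
  D = factor(c("8876.5", "2120.8", "1266.6")),
  T = factor(c("8857.9", "1664.9", "1347.1"))
)

# Wrong: this would return the level codes (1, 2, 3, ...), not the values
codes <- as.numeric(df$D)

# Right: convert to character first, then to numeric
df$D <- as.numeric(as.character(df$D))
df$T <- as.numeric(as.character(df$T))

# Division now works and gives real ratios instead of NA
df$Ratio <- df$T / df$D

print(df)
```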

Now the original approach works.

df10g<-gather(df10, pt.num.type, value, 2:17, -GeneID)
df10g<-separate(df10g,pt.num.type, c("type","pt.num"), sep=1)
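
To show what those two lines do, here is a sketch on a tiny made-up table (the data frame wide and its values are hypothetical, and it assumes tidyr is installed). gather stacks the sample columns into key/value pairs, and separate with sep = 1 cuts each key after its first character, so D1 becomes type D and pt.num 1. The spread step at the end is my own addition to get D and T back side by side for the ratio.

```r
library(tidyr)

# Toy wide table (hypothetical values): one D and one T column per patient
wide <- data.frame(
  GeneID = c("GeneA", "GeneB"),
  D1 = c(10, 20), T1 = c(5, 40),
  D2 = c(30, 40), T2 = c(15, 10),
  stringsAsFactors = FALSE
)

# Stack every column except GeneID into (pt.num.type, value) pairs
tall <- gather(wide, pt.num.type, value, -GeneID)

# Split "D1" into type = "D" and pt.num = 1; convert = TRUE makes pt.num an integer
tall <- separate(tall, pt.num.type, c("type", "pt.num"), sep = 1, convert = TRUE)

# Put D and T back into their own columns, one row per gene and patient
by_pt <- spread(tall, type, value)
by_pt$Ratio <- by_pt$T / by_pt$D

print(tall)
print(by_pt)
```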
