average_triplicates_collapse_duplicates.Rmd

---
title: "Average triplicate columns with duplicate rows in R"
output: html_notebook
---

Below data frame contains triplicate columns for each gene and duplicate rows for some of the genes. We need to take average (mean) of the triplicates and at the same time, collapse duplicate rows (again take mean).
#### Create a data frame
```{r}
df2=structure(list(Gene = c("mut", "ACTIN", "ACTIN", "Pq", "UBQ", "UBQ", "Xa"), X_T0_R1 = c(0.344814469, 1.209073623, 1.071457953, 0.362842359, 1.014392244, 1.571055788, 0.570729408), X_T0_R2 = c(0.449930853, 1.031557118, 1.054965621, 0.522831228, 0.83300542, 0.967355216, 0.501057748), X_T0_R3 = c(0.601209073, 1.695796471, 1.052815987, 0.571729222, 1.391288288, 1.773644641, 0.453820027), X_T1_R1 = c(0.427800244, 1.308884798, 0.991302515, 0.329510681, 0.773414746, 1.029619555, 0.362504535), X_T1_R2 = c(0.418589633, 1.811507215, 1.206305091, 0.29886302, 0.895616224, 1.196317937, 0.408657559), X_T1_R3 = c(0.468263467, 1.352236153, 1.444060418, 0.359970383, 0.942421479, 2.388771681, 0.145078696), X_T2_R1 = c(0.300362616, 1.654754505, 1.109259911, 0.306699247, 0.585608303, 1.945573895, 0.270237172), X_T2_R2 = c(0.27920993, 1.573822163, 1.152985196, 0.310218502, 0.493783209, 1.573792123, 0.36659012), X_T2_R3 = c(1.792971556, 0.665809249, 0.778594892, 2.161999623, 1.888984449, 0.456632731, 1.631251843), X_T3_R1 = c(1.118011513, 0.570411874, 1.044634812, 1.213092011, 1.817947271, 0.234950383, 1.384650094), X_T3_R2 = c(1.008515071, 0.916509523, 0.905764637, 1.244132809, 0.752181246, 0.797524026, 1.010615689), X_T3_R3 = c(0.816620011, 0.740345088, 1.106478019, 0.899414205, 0.909160589, 0.672469518, 0.594865366), Y_T0_R1 = c(3.307846716, 0.027550169, 0.645327389, 2.887386508, 1.042465604, 0.05047425, 4.318466199), Y_T0_R2 = c(2.035398381, 0.633422527, 0.888069994, 2.062827838, 1.82433679, 0.500792593, 1.182188977), Y_T0_R3 = c(1.500168876, 0.877196975, 1.088593542, 1.392198697, 1.162069878, 0.470956741, 1.511890878), Y_T1_R1 = c(1.095875029, 0.777981021, 1.050238479, 1.17216374, 0.945470429, 0.40568268, 0.872396888), Y_T1_R2 = c(0.452742932, 0.352610874, 0.787861253, 0.477126035, 0.320200734, 1.826032539, 0.332244865), Y_T1_R3 = c(0.45960558, 0.478390214, 0.645688363, 0.395673468, 0.215407604, 0.759507568, 0.700730905), Y_T2_R1 = c(1.559068766, 0.062252184, 0.937463531, 0.994007758, 0.482591298, 1.269828631, 0.237326878), Y_T2_R2 = c(1.390406257, 0.215685731, 1.087380361, 1.018431329, 0.585660661, 1.05095161, 0.173209498), Y_T2_R3 = c(1.00828232, 0.376013801, 0.782410602, 0.906376375, 0.572489629, 1.359345852, 0.302963483), Y_T3_R1 = c(1.182635592, 0.117426355, 1.013642281, 0.967559933, 0.306328031, 1.231521805, 0.257804624), Y_T3_R2 = c(1.366839578, 0.341411017, 1.337125947, 0.943784803, 0.721978298, 1.10875345, 0.189978177), Y_T3_R3 = c(1.594404053, 0.209740069, 0.92384942, 0.897659445, 0.457172538, 1.543831721, 0.272475233), Z_T0_R1 = c(1.237203711, 0.233057698, 1.077219174, 1.156260667, 0.264806683, 1.591044318, 0.255767162), Z_T0_R2 = c(1.211301515, 0.251870699, 1.141522554, 1.194071909, 0.20882802, 1.533752995, 0.278059859), Z_T0_R3 = c(0.645425334, 1.53688617, 0.439888106, 0.819063313, 1.769224478, 0.250876057, 1.998822839), Z_T1_R1 = c(0.971645792, 0.671074934, 0.469502588, 1.312821698, 1.306039773, 1.40561198, 1.704347344), Z_T1_R2 = c(0.859830596, 1.580097955, 1.366461274, 1.24037716, 0.80578233, 1.116605654, 1.211928025), Z_T1_R3 = c(0.785228306, 1.286123696, 1.10243547, 0.996917372, 1.215506569, 0.683697612, 1.000232952), Z_T2_R1 = c(0.475576762, 2.673806674, 0.732913032, 0.763693301, 3.091813549, 0.347384763, 3.16064337), Z_T2_R2 = c(0.810829692, 1.590506889, 1.162262268, 1.367255133, 1.378518959, 0.677096267, 2.006934309), Z_T2_R3 = c(1.02507371, 2.164918846, 1.440885034, 1.185511625, 1.934374556, 0.460659928, 1.277191061), Z_T3_R1 = c(0.834953495, 2.155130232, 1.209137833, 0.934189133, 1.048650427, 0.704562113, 1.145400709), Z_T3_R2 = c(0.886903303, 0.237343684, 0.921370232, 0.737206101, 0.318232441, 1.314051524, 0.9314835), Z_T3_R3 = c(0.748710472, 0.501419194, 0.914476206, 0.641169316, 0.119979817, 1.187578276, 0.918544916)), row.names = c(NA, -7L), class = c("data.table", "data.frame"))
df2
```
#### Method 1 using tidyverse
```{r}
library(stringr)
library(tidyr)
gdf2 = gather(df2, "group", "Expression", -Gene)
gdf2$tgroup = apply(str_split_fixed(gdf2$group, "_", 3)[, c(1, 2)], 1, paste, collapse ="_")
library(dplyr)
gdf2 %>% group_by(Gene, tgroup) %>% summarize(expression_mean = mean(Expression)) %>% spread(tgroup, expression_mean)
```

#### Method 2: Using arrays. Doesn't store column names
```{r}
results=data.frame(apply(array(as.matrix(df2[,-1]), c(nrow(df2),3, ncol(df2)/3)),3, rowMeans))
results=cbind(df2$Gene, results)
library(plyr)
dresults=ddply(results,"df2$Gene",numcolwise(mean))
dresults
```

##### Method 3: Using loop
```{r}
df2_final=data.frame(matrix(nrow = nrow(df2)))
for (i in unique(gsub("_R[1-9]","",names(df2)))[-1]){
    df2_final[,i]=apply(df2[,grepl(gsub("_R[1-9]","",i),names(df2))],1, mean)
}
df2_final[,1]=df2[,1]
names(df2_final)[1]=names(df2)[1]
df2_final=ddply(df2_final,"Gene",numcolwise(mean))
df2_final
```