r - Tabulate responses for multiple columns by grouping variable with dplyr
Hi: I'm new to the plyr/dplyr family and am enjoying it. I can see its massive utility for my own work, but I'm still trying to get my head around it.
I have a data frame that looks like the one below.
1) How do I produce a table where each non-grouping variable shows the distribution of responses within each value of the grouping variable?
2) Note: I have missing values and would like to exclude them from the tabulation. I realize the summarize_each command will apply a function to each column, but I don't know how to handle the missing values issue in a simple way. I have seen some code suggesting that I have to filter out the missing values, but what if the missing values are scattered randomly through the non-grouping variables?
3) Fundamentally, is it best to use only complete cases with dplyr?
# library
library(dplyr)

# sample data
group <- sample(c('a', 'b', 'c'), 100, replace=TRUE)
var1 <- sample(c(1,2,3,4,5,NA), 100, replace=TRUE, prob=c(0.15,0.15,0.15,0.15,0.15,0.25))
var2 <- sample(c(1,2,3,4,5,NA), 100, replace=TRUE, prob=c(0.15,0.15,0.15,0.15,0.15,0.25))
var3 <- sample(c(1,2,3,4,5,NA), 100, replace=TRUE, prob=c(0.15,0.15,0.15,0.15,0.15,0.25))
df <- data.frame(group, var1, var2, var3)

# my code
out_df <- df %>% group_by(group)
out_df %>% summarise_each(funs(table))
You can get counts by group for each of var1, var2, and var3 if you "melt" your data frame into long form first. Melting will "stack" the three var columns into a single column (value) and create an additional column (variable) marking which rows go with which var.
library(dplyr)
library(reshape2)

# sample data
group <- sample(c('a', 'b', 'c'), 100, replace=TRUE)
var1 <- sample(c(1,2,3,4,5,NA), 100, replace=TRUE, prob=c(0.15,0.15,0.15,0.15,0.15,0.25))
var2 <- sample(c(1,2,3,4,5,NA), 100, replace=TRUE, prob=c(0.15,0.15,0.15,0.15,0.15,0.25))
var3 <- sample(c(1,2,3,4,5,NA), 100, replace=TRUE, prob=c(0.15,0.15,0.15,0.15,0.15,0.25))
df <- data.frame(group, var1, var2, var3)

out_df <- df %>%
  melt(id.vars="group") %>%                 # stack var1..var3 into long form
  filter(!is.na(value)) %>%                 # remove NA
  group_by(group, variable, value) %>%
  summarise(count=n()) %>%                  # rows per group/variable/value
  group_by(group, variable) %>%
  mutate(percent=count/sum(count))          # share within group/variable
You can stop the function chain at any point to look at the intermediate steps, which will help in understanding what each step is doing.
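As a rough illustration of stopping the chain early, the sketch below runs only the first one or two steps on a tiny made-up data frame. The name toy and its values are hypothetical and are not the random sample used above.

```r
library(dplyr)
library(reshape2)

# Small hypothetical data frame, not the random sample from the question
toy <- data.frame(
  group = c("a", "a", "b"),
  var1  = c(1, NA, 2),
  var2  = c(3, 3, NA)
)

# Stop after the melt step: the var columns are stacked into `value`,
# and `variable` records which column each row came from
long <- toy %>% melt(id.vars = "group")
print(long)

# Stop after the filter step: rows whose value is missing are gone
long_no_na <- long %>% filter(!is.na(value))
print(long_no_na)
```

Printing each intermediate object this way makes it easier to see where a later step goes wrong.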
Because we grouped by group, variable, and value, we end up with count giving us the number of rows for each combination of those three columns. Then we group by only group and variable to calculate the percentage of rows that each value of count contributes to each combination of those two grouping variables. (The second group_by is not essential, because dplyr drops the last grouping variable after a summarise operation, since there will be only one row for each combination of all the original grouping variables, but I prefer to regroup explicitly.)
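To check the point about the dropped grouping variable, here is a minimal sketch on hypothetical long-form data. It assumes a dplyr version whose summarise() removes the last grouping level by default, and it uses group_vars() to list the grouping that remains. Recent dplyr versions may also print a message about the resulting groups.

```r
library(dplyr)

# Hypothetical long-form data, already free of missing values
long <- data.frame(
  group    = c("a", "a", "a", "b"),
  variable = c("var1", "var1", "var1", "var1"),
  value    = c(1, 1, 2, 1)
)

counts <- long %>%
  group_by(group, variable, value) %>%
  summarise(count = n())

# Grouping that remains after summarise(): `value` has been dropped
remaining <- group_vars(counts)
print(remaining)

# So percentages are computed within group/variable even without regrouping
result <- counts %>% mutate(percent = count / sum(count))
print(result)
```

Regrouping explicitly, as in the answer, gives the same result and does not depend on this default.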
Here's the final result:
out_df
   group variable value count    percent
1      a     var1     1     6 0.26086957
2      a     var1     2     3 0.13043478
3      a     var1     3     6 0.26086957
4      a     var1     4     1 0.04347826
5      a     var1     5     7 0.30434783
...
41     c     var3     1     6 0.25000000
42     c     var3     2     5 0.20833333
43     c     var3     3     4 0.16666667
44     c     var3     4     2 0.08333333
45     c     var3     5     7 0.29166667
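On the question's concern about missing values scattered across columns, the following sketch compares row-wise complete cases with filtering in long form. The data frame toy is hypothetical. It is only meant to show that the long-form filter drops individual missing cells rather than whole rows.

```r
library(dplyr)
library(reshape2)

# Hypothetical data: two rows each have a missing value in a different column
toy <- data.frame(
  group = c("a", "a", "b"),
  var1  = c(1, NA, 2),
  var2  = c(NA, 3, 4)
)

# Row-wise complete cases: a row is dropped if ANY column is missing
complete_rows <- toy[complete.cases(toy), ]

# Long form: only the individual missing cells are dropped
long_cells <- toy %>%
  melt(id.vars = "group") %>%
  filter(!is.na(value))

nrow(complete_rows)  # 1
nrow(long_cells)     # 4
```

With scattered missing values, restricting to complete cases can discard valid responses in other columns, which is why the approach above filters after melting.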