r - Tabulate responses for multiple columns by grouping variable with dplyr -

Hi: I'm new to the plyr/dplyr family and enjoying it. I can see it's of massive utility for my own work, but I'm still trying to get my head around it.
I have a data frame that looks like the one below.

1) How can I produce a table that, for each non-grouping variable, shows the distribution of responses within each value of the grouping variable?

2) Note: I have missing values and would like to exclude them from the tabulation. I realize the summarize_each command will apply a function to each column, but I don't know how to handle the missing values issue in a simple way. I have seen some code that suggests I have to filter out the missing values, but what if the missing values are scattered randomly through the non-grouping variables?

3) Fundamentally, is it best to use only complete cases with dplyr?

    # library
    library(dplyr)

    # sample data
    group <- sample(c('a', 'b', 'c'), 100, replace=TRUE)
    var1 <- sample(c(1,2,3,4,5,NA), 100, replace=TRUE, prob=c(0.15,0.15,0.15,0.15,0.15,0.25))
    var2 <- sample(c(1,2,3,4,5,NA), 100, replace=TRUE, prob=c(0.15,0.15,0.15,0.15,0.15,0.25))
    var3 <- sample(c(1,2,3,4,5,NA), 100, replace=TRUE, prob=c(0.15,0.15,0.15,0.15,0.15,0.25))
    df <- data.frame(group, var1, var2, var3)

    # my code
    out_df <- df %>% group_by(group)
    out_df %>% summarise_each(funs(table))

You can get counts by group for each of var1, var2, and var3 if you "melt" your data frame into long form first, which will "stack" the three var columns into a single column (value) and create an additional column (variable) marking which rows go with which var.

    library(dplyr)
    library(reshape2)

    # sample data
    group <- sample(c('a', 'b', 'c'), 100, replace=TRUE)
    var1 <- sample(c(1,2,3,4,5,NA), 100, replace=TRUE, prob=c(0.15,0.15,0.15,0.15,0.15,0.25))
    var2 <- sample(c(1,2,3,4,5,NA), 100, replace=TRUE, prob=c(0.15,0.15,0.15,0.15,0.15,0.25))
    var3 <- sample(c(1,2,3,4,5,NA), 100, replace=TRUE, prob=c(0.15,0.15,0.15,0.15,0.15,0.25))

    df <- data.frame(group, var1, var2, var3)

    out_df <- df %>%
      melt(id.var="group") %>%                # stack var1..var3 into variable/value
      filter(!is.na(value)) %>%               # remove NA
      group_by(group, variable, value) %>%
      summarise(count=n()) %>%                # rows per group/variable/value
      group_by(group, variable) %>%
      mutate(percent=count/sum(count))        # share within each group/variable
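To make the "melt" step concrete, here is a minimal sketch on a tiny data frame. The name `toy` and its values are made up for illustration and are not part of the sample data above.

```r
library(reshape2)

# Tiny hypothetical data frame (not the sample data above)
toy <- data.frame(group = c("a", "b"),
                  var1  = c(1, NA),
                  var2  = c(3, 4))

# id.vars names the column(s) to keep as identifiers; every other
# column is stacked into variable/value pairs
toy_long <- melt(toy, id.vars = "group")
toy_long
```

The two rows and two var columns become four rows, one per original cell, so an NA in one var can be dropped without touching that row's values for the other vars.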

You can stop the function chain at any point to look at the intermediate steps, which will help in understanding what each step is doing.
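For example, a sketch that stops right after the first summarise, so you can inspect the raw counts before any percentages are added. It assumes a small made-up data frame, `toy`, rather than the random sample data, so that the counts are easy to check by eye.

```r
library(dplyr)
library(reshape2)

# Small hypothetical data frame, used only for illustration
toy <- data.frame(group = c("a", "a", "b", "b"),
                  var1  = c(1, 1, 2, NA),
                  var2  = c(NA, 3, 3, 3))

# Stop after the first summarise() to inspect the raw counts
# before any percentages are calculated
counts <- toy %>%
  melt(id.vars = "group") %>%
  filter(!is.na(value)) %>%
  group_by(group, variable, value) %>%
  summarise(count = n())
counts
```

Ending the chain one step earlier or later (after `melt`, after `filter`, and so on) works the same way.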

Because we grouped by group, variable, and value, we end up with count giving the number of rows for each combination of those three columns. Then we group by only group and variable to calculate the percentage of rows that each value of count contributes to each combination of those two grouping variables. (The second group_by is not essential, because dplyr drops the last grouping variable after a summarise operation (because there will be only one row for each combination of the original grouping variables), but I prefer to regroup explicitly.)
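If you want to confirm which grouping is still in effect after summarise, one way is `group_vars()` from dplyr. This is a minimal sketch on a made-up, already-long data frame (`toy` is a hypothetical name):

```r
library(dplyr)

# Hypothetical long-form data, used only for illustration
toy <- data.frame(group    = c("a", "a", "b"),
                  variable = c("var1", "var1", "var2"),
                  value    = c(1, 2, 1))

counts <- toy %>%
  group_by(group, variable, value) %>%
  summarise(count = n())

# group_vars() lists the grouping columns still in effect;
# value, the last grouping variable, is no longer among them
group_vars(counts)
```

In dplyr 1.0.0 and later, summarise also takes a `.groups` argument that lets you state this choice explicitly instead of regrouping afterwards.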

Here's the final result:

    out_df

       group variable value count    percent
    1      a     var1     1     6 0.26086957
    2      a     var1     2     3 0.13043478
    3      a     var1     3     6 0.26086957
    4      a     var1     4     1 0.04347826
    5      a     var1     5     7 0.30434783
    ...
    41     c     var3     1     6 0.25000000
    42     c     var3     2     5 0.20833333
    43     c     var3     3     4 0.16666667
    44     c     var3     4     2 0.08333333
    45     c     var3     5     7 0.29166667
