dplyr - Canonical way to select columns in R
I am comparing common "tidying" operations in dplyr and in "plain R" (see the output here and the source here to see what I mean).
I am having a hard time finding a "canonical" and concise way to select columns using variable names (by canonical, I mean pure plain R, understandable with a minimal understanding of R, so no "voodoo tricks").
Example:
    ## subset: columns from "var_1" to "var_2", excluding "var_3"

    ## dplyr:
    table %>% select(var_1:var_2, -var_3)

    ## plain R:
    r <- sapply(c("var_1", "var_2", "var_3"), function(x) which(names(table) == x))
    table[, setdiff(r[1]:r[2], r[3])]
Any suggestions to improve the plain R syntax?
Edit
I implemented some of the suggestions and compared the performance of the different syntaxes, and noticed that the use of match and subset leads to surprising drops in performance:
    # plain R, v1
    system.time(for (i in 1:100) {
      r <- sapply(c("size", "country"), function(x) which(names(cran_df) == x))
      cran_df[, r[1]:r[2]]
    })
    ##    user  system elapsed
    ##   0.006   0.000   0.007

    # plain R, using match
    system.time(for (i in 1:100) {
      r <- match(c("size", "country"), names(cran_df))
      cran_df[, r[1]:r[2]] %>% head(n = 3)
    })
    ##    user  system elapsed
    ##   0.056   0.028   0.084

    # plain R, using match and subset
    system.time(for (i in 1:100) {
      r <- match(c("size", "country"), names(cran_df))
      subset(cran_df, select = r[1]:r[2]) %>% head(n = 3)
    })
    ##    user  system elapsed
    ##  11.556   1.057  12.640

    # dplyr
    system.time(for (i in 1:100) select(cran_tbl_df, size:country))
    ##    user  system elapsed
    ##   0.034   0.000   0.034
It looks like the implementation of subset is sub-optimal...
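A self-contained sketch for re-running this kind of comparison follows, with the built-in mtcars standing in for cran_df (an assumption, since the original data is not shown). It times the selection alone, without the `%>% head(n=3)` step that only some of the variants above include, and it reports no numbers because timings depend on the machine and on the size of the data.

```r
# Sketch: time three base R ways of selecting a column range by name.
# mtcars replaces cran_df here (assumption), so the columns are mpg..wt.
df <- mtcars
n_rep <- 100

# Variant 1: look up each name with which()
time_which <- system.time(for (i in seq_len(n_rep)) {
  r <- sapply(c("mpg", "wt"), function(x) which(names(df) == x))
  res_which <- df[, r[1]:r[2]]
})

# Variant 2: look up both names at once with match()
time_match <- system.time(for (i in seq_len(n_rep)) {
  r <- match(c("mpg", "wt"), names(df))
  res_match <- df[, r[1]:r[2]]
})

# Variant 3: subset() with a name range in select
time_subset <- system.time(for (i in seq_len(n_rep)) {
  res_subset <- subset(df, select = mpg:wt)
})

# Elapsed seconds per variant, collected in one named vector
elapsed <- c(
  which  = time_which[["elapsed"]],
  match  = time_match[["elapsed"]],
  subset = time_subset[["elapsed"]]
)
print(elapsed)
```

On a data frame as small as mtcars the differences are likely to be negligible, so a larger table is needed to see anything resembling the gap reported above.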
You can use the built-in subset function, which can take a select argument that follows a similar (though not identical) syntax to dplyr::select. Note that dropping columns has to be done in a second step:
    t1 <- subset(table, select = var_1:var_2)
    t2 <- subset(t1, select = -var_3)

Or:

    subset(subset(table, select = var_1:var_2), select = -var_3)

For example:

    subset(subset(mtcars, select = c(mpg:wt)), select = -hp)
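If the two-step subset call feels clumsy, the same selection can be sketched in one step with match and setdiff. Note that select_range and its arguments are hypothetical names chosen for illustration, not an existing base R function; mtcars is used so the snippet runs as-is.

```r
# Sketch: select a range of columns by name and exclude some of them.
# select_range() is a hypothetical helper, not part of base R.
select_range <- function(df, from, to, exclude = character(0)) {
  # Positions of the two boundary columns
  idx <- match(c(from, to), names(df))
  if (anyNA(idx)) stop("unknown column name in 'from' or 'to'")
  # Names inside the range, minus the excluded ones
  keep <- setdiff(names(df)[idx[1]:idx[2]], exclude)
  # drop = FALSE keeps a data frame even if one column remains
  df[, keep, drop = FALSE]
}

res <- select_range(mtcars, "mpg", "wt", exclude = "hp")
names(res)
```

Working with names rather than positions in setdiff means that an excluded column lying outside the range is simply ignored instead of raising an error.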