Sharla Gelfand

here's what i know about tidyeval

there’s no shortage of resources about tidyeval (i’ve listed some at the bottom), but this is a collection of what i know.

there is really no “why” here, or not much. i’m more of a “how” person, so ymmv on the usefulness.

i won’t use mtcars or iris because i’m bored to death of them. let’s use a dataset of toronto subway delays from 2018 (available from toronto open data).

library(dplyr)

delays <- fs::dir_ls(here::here("static", "data", "ttc-delays", "delays")) %>%
  purrr::map_dfr(readxl::read_excel) %>%
  janitor::clean_names()

head(delays)
## # A tibble: 6 x 10
##   date                time  day   station code  min_delay min_gap bound
##   <dttm>              <chr> <chr> <chr>   <chr>     <dbl>   <dbl> <chr>
## 1 2018-04-01 00:00:00 00:27 Sund… ST GEO… MUSAN         8      12 W    
## 2 2018-04-01 00:00:00 07:56 Sund… FINCH … TUSC          0       0 S    
## 3 2018-04-01 00:00:00 08:00 Sund… YONGE … MUO           0       0 <NA> 
## 4 2018-04-01 00:00:00 09:50 Sund… KIPLIN… TUSC          0       0 W    
## 5 2018-04-01 00:00:00 10:18 Sund… VICTOR… MUSC          0       0 W    
## 6 2018-04-01 00:00:00 10:22 Sund… KENNED… EUNT          3       7 W    
## # … with 2 more variables: line <chr>, vehicle <dbl>

side note, but i can’t believe it’s that easy to read in 12 files and combine them. truly wild.

tidyeval time.

one variable

let’s say i want a function that returns the mean delay (min_delay is the delay, in minutes) based on a specific grouping, e.g. by station, maybe by day.

when writing the function, use enquo() to quote the variable, then !! to unquote it.

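grouped_mean_delay <- function(df, group_var){
  # capture the bare column name the caller typed, without evaluating it
  group_var <- enquo(group_var)

  df %>%
    # !! unquotes the captured column so group_by() sees it as if typed directly
    group_by(!!group_var) %>%
    summarise(mean_delay = mean(min_delay))
}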

when i use the function, i can just call grouped_mean_delay() and pass it whatever variable i want to group by, as a bare name without quotes.

delays %>%
  grouped_mean_delay(group_var = day)
## # A tibble: 7 x 2
##   day       mean_delay
##   <chr>          <dbl>
## 1 Friday          2.34
## 2 Monday          2.46
## 3 Saturday        2.89
## 4 Sunday          2.52
## 5 Thursday        1.99
## 6 Tuesday         2.26
## 7 Wednesday       2.04
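a shorter way to write the same thing, if your rlang is new enough (i believe 0.4.0 or later), is the "embrace" operator {{ }}, which does the enquo() and the !! in one step. here is a minimal sketch of that. toy_delays is a made-up stand-in for delays so the example runs on its own, and grouped_mean_delay_embrace() is just a name i picked.

```r
library(dplyr)

# hypothetical toy data standing in for `delays`
toy_delays <- tibble(
  day = c("Monday", "Monday", "Tuesday"),
  min_delay = c(2, 4, 6)
)

# same idea as grouped_mean_delay(), but {{ group_var }} replaces
# the enquo() + !! pair (assumes rlang >= 0.4.0)
grouped_mean_delay_embrace <- function(df, group_var) {
  df %>%
    group_by({{ group_var }}) %>%
    summarise(mean_delay = mean(min_delay))
}

embraced <- grouped_mean_delay_embrace(toy_delays, day)
embraced
```

the call looks exactly the same from the outside, so which one you use is mostly a matter of which rlang version you can count on.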

two variables, for two purposes

that’s nice, but i probably don’t always want the mean delay. what if i want the mean gap that the delay causes? the variable min_gap shows this – e.g. if min_gap is 12, then that delay caused a 12 minute gap between trains at that station.

i don’t really want to write a new function for every variable i might want to get the mean for, so it’d be nice to generalize grouped_mean_delay() to be a more general grouped mean.

you can do this the exact same way, and just add another argument for the variable you want the mean for.

grouped_mean <- function(df, group_var, mean_var){
  group_var <- enquo(group_var)
  mean_var <- enquo(mean_var)
  
  df %>%
    group_by(!!group_var) %>%
    summarise(mean = mean(!!mean_var))
}

delays %>%
  grouped_mean(group_var = day, 
               mean_var = min_gap)
## # A tibble: 7 x 2
##   day        mean
##   <chr>     <dbl>
## 1 Friday     3.46
## 2 Monday     3.48
## 3 Saturday   4.34
## 4 Sunday     3.67
## 5 Thursday   2.97
## 6 Tuesday    3.25
## 7 Wednesday  3.09

yes, there are ways to change the name of the output variable (i.e. so it’s not just mean). programming with dplyr talks about this, but i never really do it, so 💁
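for the curious, here is a rough sketch of one way the renaming can go: build the new name as a string, then use !! on the left-hand side together with := inside summarise(). grouped_mean_named() and toy_delays are hypothetical names for illustration, and rlang::as_name() assumes a reasonably recent rlang.

```r
library(dplyr)

# hypothetical toy data standing in for `delays`
toy_delays <- tibble(
  day = c("Monday", "Monday", "Tuesday"),
  min_gap = c(4, 8, 9)
)

grouped_mean_named <- function(df, group_var, mean_var) {
  group_var <- enquo(group_var)
  mean_var <- enquo(mean_var)

  # turn the captured column into a string and build the output name from it
  mean_name <- paste0("mean_", rlang::as_name(mean_var))

  df %>%
    group_by(!!group_var) %>%
    # := is needed because the left-hand side is unquoted with !!
    summarise(!!mean_name := mean(!!mean_var))
}

named <- grouped_mean_named(toy_delays, day, min_gap)
named
```

the output column here ends up called mean_min_gap instead of just mean.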

many variables, for the same purpose?

if i’m a curious person (i am), i probably want to be able to group by more than one thing at a time, e.g. by day and by subway line (line).

there’s a few ways you can do this.

pass the dots

the first, which literally blew my mind the first time i saw it, uses ..., and you pass the dots straight in when writing your function.

grouped_mean_delay_2 <- function(df, ...){
  df %>%
    group_by(...) %>%
    summarise(mean_delay = mean(min_delay))
}

delays %>%
  grouped_mean_delay_2(day, line)
## # A tibble: 63 x 3
## # Groups:   day [?]
##    day    line        mean_delay
##    <chr>  <chr>            <dbl>
##  1 Friday 16 MCCOWAN        0   
##  2 Friday 704 RAD BUS       0   
##  3 Friday BD                2.01
##  4 Friday SHP               2.04
##  5 Friday SRT               5.98
##  6 Friday YU                2.49
##  7 Friday YU / BD           0   
##  8 Friday YU/ BD            0   
##  9 Friday YU/BD             0   
## 10 Friday YUS               0   
## # … with 53 more rows

of course we have the added pleasure of the fact that this dataset isn’t coded consistently (three variants of YU/BD!), but that’s a topic for another post.
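one nice thing about the dots is that they take any number of variables, including none. here is a small sketch on made-up data (toy_delays is hypothetical, not the real ttc data) that calls the same function with zero, one, and two grouping variables.

```r
library(dplyr)

# hypothetical toy data standing in for `delays`
toy_delays <- tibble(
  day = c("Monday", "Monday", "Tuesday", "Tuesday"),
  line = c("BD", "YU", "BD", "BD"),
  min_delay = c(2, 4, 6, 8)
)

grouped_mean_delay_2 <- function(df, ...) {
  df %>%
    group_by(...) %>%
    summarise(mean_delay = mean(min_delay))
}

overall <- grouped_mean_delay_2(toy_delays)               # no grouping at all
by_day <- grouped_mean_delay_2(toy_delays, day)           # one variable
by_day_line <- grouped_mean_delay_2(toy_delays, day, line) # two variables

# summarise() only peels off the last grouping variable by default,
# which is why the printed output above still says "Groups: day"
group_vars(by_day_line)
```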

pass the vars()

the thing about passing the dots is that those ... are so mysterious. i definitely don’t always write documentation for my functions, so it’s nice to rely on named arguments that describe (even just a little!) what you should be throwing in there.

and sometimes it just doesn’t work! in my mind, there are two kinds of verbs in dplyr:

  1. verbs that take ..., like group_by() and select()
  2. (scoped) verbs that take vars(), like mutate_at() and summarise_at()

and you have to write your function using vars() a little differently, depending.

verbs that take ...

for verbs that take ..., you get to just pass the dots. but you cannot just pass the vars()! if you want to use a named argument, and vars(), then you have to expand the variables back out using !!! (that’s three bangs).

grouped_mean_delay_3 <- function(df, group_vars){
  df %>%
    group_by(!!!group_vars) %>%
    summarise(mean_delay = mean(min_delay))
}

delays %>%
  grouped_mean_delay_3(group_vars = vars(day, line))
## # A tibble: 63 x 3
## # Groups:   day [?]
##    day    line        mean_delay
##    <chr>  <chr>            <dbl>
##  1 Friday 16 MCCOWAN        0   
##  2 Friday 704 RAD BUS       0   
##  3 Friday BD                2.01
##  4 Friday SHP               2.04
##  5 Friday SRT               5.98
##  6 Friday YU                2.49
##  7 Friday YU / BD           0   
##  8 Friday YU/ BD            0   
##  9 Friday YU/BD             0   
## 10 Friday YUS               0   
## # … with 53 more rows

beauty.
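if it helps to see what is going on: vars() hands back a list of quoted columns (quosures), and !!! splices that list into the call as if you had typed each column separately. a minimal sketch on made-up data, where toy_delays and grouping are names i invented for the example:

```r
library(dplyr)

# hypothetical toy data standing in for `delays`
toy_delays <- tibble(
  day = c("Monday", "Monday", "Tuesday"),
  line = c("BD", "YU", "BD"),
  min_delay = c(2, 4, 6)
)

# vars() returns a list with one quoted column per element
grouping <- vars(day, line)
length(grouping)
rlang::is_quosure(grouping[[1]])

# splicing the list with !!! gives the same grouping as typing the columns out
spliced <- toy_delays %>% group_by(!!!grouping)
typed <- toy_delays %>% group_by(day, line)
identical(group_vars(spliced), group_vars(typed))
```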

(scoped) verbs that take vars()

the _at scoped verbs, like summarise_at(), don’t take ... as an argument.

the vars argument of summarise_at() specifically says it “takes a list of columns generated by vars()” (and some other things).

say we want the mean delay and the mean gap.

you can’t pass the dots here.

variable_mean_broken <- function(df, ...){
  df %>%
    summarise_at(..., mean)
}

delays %>%
  variable_mean_broken(min_delay, min_gap)
## Error in check_dot_cols(.vars, .cols): object 'min_delay' not found

but you can just pass the vars().

we need to pass in something that summarise_at() expects, specifically something that looks more like a vars() call. because summarise_at() expects something using vars(), we don’t need to do anything to expand the variables out.

it’s just like above: because group_by() expects ... arguments, we don’t need to do anything to those dots.

variable_mean <- function(df, mean_vars){
  df %>%
    summarise_at(mean_vars, mean)
}

delays %>%
  variable_mean(mean_vars = vars(min_delay, min_gap))
## # A tibble: 1 x 2
##   min_delay min_gap
##       <dbl>   <dbl>
## 1      2.33    3.42

beauty beauty.
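if you really do want to keep the dots in the function signature, one option that should work is wrapping them in vars() inside the function, so summarise_at() still gets the vars() call it expects. this is a sketch, not something from the examples above: variable_mean_dots() and toy_delays are hypothetical names.

```r
library(dplyr)

# hypothetical toy data standing in for `delays`
toy_delays <- tibble(
  min_delay = c(2, 4, 6),
  min_gap = c(4, 8, 9)
)

# the caller passes bare columns through ...,
# and vars(...) collects them into the list summarise_at() wants
variable_mean_dots <- function(df, ...) {
  df %>%
    summarise_at(vars(...), mean)
}

dots_means <- variable_mean_dots(toy_delays, min_delay, min_gap)
dots_means
```

the tradeoff is the same as before: the dots are convenient to call, but they only work once per function and they don’t tell the reader much.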

many variables, for many purposes?

i think this whole vars() thing really shines when you have many variables for many purposes. i knew about passing the dots, but i was like… how do you pass the dots… twice? jenny bryan’s like yeah, you don’t.

you use vars()!

if i want to group by many variables and get the mean for many variables, then i can just throw a bunch of vars() in:

general_grouped_mean <- function(df, group_vars, mean_vars){
  df %>%
    group_by(!!!group_vars) %>%
    summarise_at(mean_vars, mean)
}

delays %>%
  general_grouped_mean(group_vars = vars(line, day), 
                       mean_vars = vars(min_delay, min_gap))
## # A tibble: 63 x 4
## # Groups:   line [?]
##    line        day      min_delay min_gap
##    <chr>       <chr>        <dbl>   <dbl>
##  1 16 MCCOWAN  Friday        0       0   
##  2 16 MCCOWAN  Saturday      0       0   
##  3 704 RAD BUS Friday        0       0   
##  4 999         Monday        0       0   
##  5 999         Thursday      0       0   
##  6 999         Tuesday       0       0   
##  7 BD          Friday        2.01    2.91
##  8 BD          Monday        2.12    3.05
##  9 BD          Saturday      2.58    3.84
## 10 BD          Sunday        2.44    3.59
## # … with 53 more rows

the only thing here is that you have to use vars(), even if you’re just passing one variable. like, this doesn’t work:

delays %>%
  general_grouped_mean(group_vars = day,
                       mean_vars = vars(min_delay, min_gap))
## Error in quos(...): object 'day' not found

because the tidyeval method for passing in a single variable to group_by() is enquo() and then !! as described in the first bit,

and neither does this:

delays %>%
  general_grouped_mean(group_vars = vars(day, line),
                       mean_vars = min_delay)
## Error in check_dot_cols(.vars, .cols): object 'min_delay' not found

because summarise_at() requires a vars() call (or one of the other options described in the .vars argument).

this is still something i’m working out myself. how do i account for the possibility of a vars(), or just a variable on its own? is this what methods are for? s3? i totally know about that. for now, i will be a heavy user of the vars(), even when it’s overkill, a la

delays %>%
  general_grouped_mean(group_vars = vars(day),
                       mean_vars = vars(min_delay))
## # A tibble: 7 x 2
##   day       min_delay
##   <chr>         <dbl>
## 1 Friday         2.34
## 2 Monday         2.46
## 3 Saturday       2.89
## 4 Sunday         2.52
## 5 Thursday       1.99
## 6 Tuesday        2.26
## 7 Wednesday      2.04
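one possible answer to the "vars() or a single variable" question, offered as a sketch and not as the official way: capture the argument with enquo(), check whether what was typed is a call to vars(), and wrap it in vars() yourself if it isn’t. as_vars() and flexible_grouped_mean() are hypothetical helpers i made up for this, and toy_delays is made-up data. it only handles a bare column or a vars() call, nothing fancier like c(day, line) or strings.

```r
library(dplyr)

# hypothetical toy data standing in for `delays`
toy_delays <- tibble(
  day = c("Monday", "Monday", "Tuesday", "Tuesday"),
  line = c("BD", "YU", "BD", "BD"),
  min_delay = c(2, 4, 6, 8),
  min_gap = c(4, 8, 9, 11)
)

# hypothetical helper: always return something that looks like vars() output
as_vars <- function(quo) {
  if (rlang::quo_is_call(quo, "vars")) {
    # the caller typed vars(...), so evaluate it to get the list of columns
    rlang::eval_tidy(quo)
  } else {
    # the caller typed a single bare column, so wrap it in vars() for them
    vars(!!quo)
  }
}

flexible_grouped_mean <- function(df, group_vars, mean_vars) {
  group_quo <- enquo(group_vars)
  mean_quo <- enquo(mean_vars)

  df %>%
    group_by(!!!as_vars(group_quo)) %>%
    summarise_at(as_vars(mean_quo), mean)
}

bare <- flexible_grouped_mean(toy_delays, day, min_delay)
wrapped <- flexible_grouped_mean(toy_delays, vars(day), vars(min_delay))
multi <- flexible_grouped_mean(toy_delays, vars(day, line), vars(min_delay, min_gap))
```

bare and wrapped should come out the same, which is the whole point.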

passing functions

the last thing i know is something i just learned, because someone else asked about it on twitter. thanks benjamin gowan!

say i actually looked at my data and discovered the mean isn’t a great measure for delays, so i want the median, too.

the way to do this outside a function is:

delays %>% 
  summarise_at(vars(min_delay, min_gap), 
               funs(mean, median))
## # A tibble: 1 x 4
##   min_delay_mean min_gap_mean min_delay_median min_gap_median
##            <dbl>        <dbl>            <dbl>          <dbl>
## 1           2.33         3.42                0              0

so if you’re writing a function, it looks like this:

summary_by_var <- function(df, summary_vars, summary_funs){
  df %>%
    summarise_at(summary_vars, summary_funs)
}

if you’re just doing one summary function, it’s pretty easy to just pass it right in:

delays %>%
  summary_by_var(vars(min_delay, min_gap),
                 median)
## # A tibble: 1 x 2
##   min_delay min_gap
##       <dbl>   <dbl>
## 1         0       0

for multiple, you pass in a list of functions generated by funs(). i suggest you name them, otherwise your output will be ugly.

delays %>%
  summary_by_var(vars(min_delay, min_gap),
                 funs(mean = mean, median = median))
## # A tibble: 1 x 4
##   min_delay_mean min_gap_mean min_delay_median min_gap_median
##            <dbl>        <dbl>            <dbl>          <dbl>
## 1           2.33         3.42                0              0

you could actually use list(mean = mean, median = median) too, but i think the funs() case is clearer, since that’s what summarise_at()’s .funs argument says it takes.
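worth knowing: as far as i’m aware, funs() has been deprecated since dplyr 0.8.0, so on newer versions the list() form is the one that won’t warn. here is a small sketch of the list() version on made-up data (toy_delays is hypothetical), using the same summary_by_var() as above.

```r
library(dplyr)

# hypothetical toy data standing in for `delays`
toy_delays <- tibble(
  min_delay = c(2, 4, 6),
  min_gap = c(4, 8, 9)
)

summary_by_var <- function(df, summary_vars, summary_funs) {
  df %>%
    summarise_at(summary_vars, summary_funs)
}

# a named list of functions plays the same role as funs(mean = mean, median = median)
with_list <- summary_by_var(
  toy_delays,
  vars(min_delay, min_gap),
  list(mean = mean, median = median)
)
with_list
```

the names of the list are what end up as the suffixes on the output columns, same as with named funs().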

putting it all together

now we can put everything together: group by multiple things and summarise multiple variables using multiple functions.

grouped_summary <- function(df, group_vars, summary_vars, summary_funs){
  df %>%
    group_by(!!!group_vars) %>%
    summarise_at(summary_vars, summary_funs)
}

delays %>%
  grouped_summary(group_vars = vars(day, line),
                  summary_vars = vars(min_delay, min_gap),
                  summary_funs = funs(mean = mean,
                                      median = median))
## # A tibble: 63 x 6
## # Groups:   day [?]
##    day    line  min_delay_mean min_gap_mean min_delay_median min_gap_median
##    <chr>  <chr>          <dbl>        <dbl>            <dbl>          <dbl>
##  1 Friday 16 M…           0            0                   0              0
##  2 Friday 704 …           0            0                   0              0
##  3 Friday BD              2.01         2.91                0              0
##  4 Friday SHP             2.04         3.56                0              0
##  5 Friday SRT             5.98         9.31                4             10
##  6 Friday YU              2.49         3.66                0              0
##  7 Friday YU /…           0            0                   0              0
##  8 Friday YU/ …           0            0                   0              0
##  9 Friday YU/BD           0            0                   0              0
## 10 Friday YUS             0            0                   0              0
## # … with 53 more rows

wowwww.
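for comparison, here is a sketch of how the same function might look with across() and {{ }}, which i understand to be the newer style. this is a swapped-in technique, not what the post uses: it assumes dplyr 1.0.0 or later, grouped_summary_across() is a hypothetical name, and toy_delays is made-up data.

```r
library(dplyr)

# hypothetical toy data standing in for `delays`
toy_delays <- tibble(
  day = c("Monday", "Monday", "Tuesday", "Tuesday"),
  line = c("BD", "YU", "BD", "BD"),
  min_delay = c(2, 4, 6, 8),
  min_gap = c(4, 8, 9, 11)
)

# assumes dplyr >= 1.0.0 for across() and the .groups argument
grouped_summary_across <- function(df, group_vars, summary_vars, summary_funs) {
  df %>%
    group_by(across({{ group_vars }})) %>%
    summarise(across({{ summary_vars }}, summary_funs), .groups = "drop")
}

# many variables go in with c() instead of vars()
many <- grouped_summary_across(
  toy_delays,
  group_vars = c(day, line),
  summary_vars = c(min_delay, min_gap),
  summary_funs = list(mean = mean, median = median)
)

# a single bare variable works too, no wrapper needed
one <- grouped_summary_across(toy_delays, day, min_delay, mean)
```

note that across() orders the output columns by variable first (min_delay_mean, min_delay_median, ...), which is different from the summarise_at() output above.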

bye

there is a lot i don’t know still, and i am happy to be pointed in the direction of other information, or told if i’m spreading misinformation! i especially would like to know how to be able to pass vars() OR just a single variable without vars(). do i write an if statement and then use enquo() and !!? i don’t know! do you?

thanks to everyone that talks to me on twitter about tidyeval. there’s a lot out there 🌏

here is a collection of tidyeval resources that probably explain why things work the way they do: