# here's what i know about tidyeval

there’s no shortage of resources about tidyeval (i’ve listed some at the bottom), but this is a collection of what i know.

there is really no “why” here, or not much. i’m more of a “how” person, so ymmv on the usefulness.

i won’t use mtcars or iris because i’m bored to death of them. let’s use a dataset of toronto subway delays from 2018 (available from toronto open data)

```
library(dplyr)

delays <- fs::dir_ls(here::here("static", "data", "ttc-delays", "delays")) %>%
  purrr::map_dfr(readxl::read_excel) %>%
  janitor::clean_names()

head(delays)
```

```
## # A tibble: 6 x 10
## date time day station code min_delay min_gap bound
## <dttm> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 2018-04-01 00:00:00 00:27 Sund… ST GEO… MUSAN 8 12 W
## 2 2018-04-01 00:00:00 07:56 Sund… FINCH … TUSC 0 0 S
## 3 2018-04-01 00:00:00 08:00 Sund… YONGE … MUO 0 0 <NA>
## 4 2018-04-01 00:00:00 09:50 Sund… KIPLIN… TUSC 0 0 W
## 5 2018-04-01 00:00:00 10:18 Sund… VICTOR… MUSC 0 0 W
## 6 2018-04-01 00:00:00 10:22 Sund… KENNED… EUNT 3 7 W
## # … with 2 more variables: line <chr>, vehicle <dbl>
```

side note, but i can’t believe it’s that easy to read in 12 files and combine them. truly wild.

tidyeval time.

# one variable

let’s say i want a function that returns the mean delay (`min_delay` is the delay, in minutes) based on a specific grouping, e.g. by `station`, maybe by `day`.

when writing the function, use `enquo()` to quote the variable, then `!!` to unquote it.

```
grouped_mean_delay <- function(df, group_var){
  group_var <- enquo(group_var)
  df %>%
    group_by(!!group_var) %>%
    summarise(mean_delay = mean(min_delay))
}
```

when i use the function, i can just call `grouped_mean_delay()` and pass it whatever variable i want to group by, unquoted (no quotation marks needed).

```
delays %>%
  grouped_mean_delay(group_var = day)
```

```
## # A tibble: 7 x 2
## day mean_delay
## <chr> <dbl>
## 1 Friday 2.34
## 2 Monday 2.46
## 3 Saturday 2.89
## 4 Sunday 2.52
## 5 Thursday 1.99
## 6 Tuesday 2.26
## 7 Wednesday 2.04
```
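if you’re curious what `enquo()` actually hands back, here’s a small sketch. it assumes a made-up three-row `toy` tibble (not the real delays data) and a hypothetical helper called `capture_var()`. the point is just that the captured thing is a quosure (the expression plus the environment it came from), and nothing gets evaluated until you unquote it.

```r
library(dplyr)
library(rlang)

# made-up stand-in for the delays data (an assumption, not the real dataset)
toy <- tibble(
  day = c("Friday", "Friday", "Monday"),
  min_delay = c(2, 4, 6)
)

# hypothetical helper: capture the argument and return it untouched
capture_var <- function(var) {
  enquo(var)
}

captured <- capture_var(day)

is_quosure(captured)  # the captured argument is a quosure, not a value
as_label(captured)    # the captured expression, as a string

# unquoting with !! puts the expression back where dplyr can see it
toy_means <- toy %>%
  group_by(!!captured) %>%
  summarise(mean_delay = mean(min_delay))
```

so `group_by(!!captured)` ends up behaving like `group_by(day)`.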

# two variables, for two purposes

that’s nice, but i probably don’t always want the mean delay. what if i want the mean *gap* that the delay causes? the variable `min_gap` shows this – e.g. if `min_gap` is 12, then that delay caused a 12 minute gap between trains at that station.

i don’t really want to write a new function for every variable i might want to get the mean for, so it’d be nice to generalize `grouped_mean_delay()` to be a more general grouped mean.

you can do this the exact same way, and just add another argument for the variable you want the mean for.

```
grouped_mean <- function(df, group_var, mean_var){
  group_var <- enquo(group_var)
  mean_var <- enquo(mean_var)
  df %>%
    group_by(!!group_var) %>%
    summarise(mean = mean(!!mean_var))
}

delays %>%
  grouped_mean(group_var = day,
               mean_var = min_gap)
```

```
## # A tibble: 7 x 2
## day mean
## <chr> <dbl>
## 1 Friday 3.46
## 2 Monday 3.48
## 3 Saturday 4.34
## 4 Sunday 3.67
## 5 Thursday 2.97
## 6 Tuesday 3.25
## 7 Wednesday 3.09
```

yes, there are ways to change the name of the output variable (i.e. so it’s not just `mean`). programming with dplyr talks about this, but i never really do it, so 💁
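for the record, a minimal sketch of one way to do the renaming. it assumes a made-up `toy` tibble and a hypothetical function name, `grouped_mean_named()`. it builds the column name from the captured variable with `rlang::as_label()` and uses `:=` so the name can be unquoted on the left-hand side.

```r
library(dplyr)

# made-up stand-in for the delays data (an assumption, not the real dataset)
toy <- tibble(
  day = c("Friday", "Friday", "Monday"),
  min_gap = c(3, 5, 10)
)

# hypothetical variant of grouped_mean() that names its output column
grouped_mean_named <- function(df, group_var, mean_var) {
  group_var <- enquo(group_var)
  mean_var <- enquo(mean_var)

  # build a name like "mean_min_gap" from the captured expression
  mean_name <- paste0("mean_", rlang::as_label(mean_var))

  df %>%
    group_by(!!group_var) %>%
    # := allows unquoting on the left-hand side; a plain = would not
    summarise(!!mean_name := mean(!!mean_var))
}

named_out <- toy %>%
  grouped_mean_named(group_var = day, mean_var = min_gap)
```

here the summary column comes out as `mean_min_gap` instead of `mean`.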

# many variables, for the same purpose?

if i’m a curious person (i am), i probably want to be able to group by more than one thing at a time, e.g. by day *and* by subway line (`line`).

there’s a few ways you can do this.

## pass the dots

the first, which literally blew my mind the first time i saw it, uses `...`, and you pass the dots straight in when writing your function.

```
grouped_mean_delay_2 <- function(df, ...){
  df %>%
    group_by(...) %>%
    summarise(mean_delay = mean(min_delay))
}

delays %>%
  grouped_mean_delay_2(day, line)
```

```
## # A tibble: 63 x 3
## # Groups: day [?]
## day line mean_delay
## <chr> <chr> <dbl>
## 1 Friday 16 MCCOWAN 0
## 2 Friday 704 RAD BUS 0
## 3 Friday BD 2.01
## 4 Friday SHP 2.04
## 5 Friday SRT 5.98
## 6 Friday YU 2.49
## 7 Friday YU / BD 0
## 8 Friday YU/ BD 0
## 9 Friday YU/BD 0
## 10 Friday YUS 0
## # … with 53 more rows
```

of course we have the added pleasure of the fact that this dataset isn’t coded consistently (three variants of YU/BD!), but that’s a topic for another post.
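passing the dots also mixes fine with the `enquo()` / `!!` approach from earlier. here’s a sketch, assuming a made-up `toy` tibble and a hypothetical function name, `grouped_mean_dots()`: the dots carry the grouping columns, and one named argument carries the column to average.

```r
library(dplyr)

# made-up stand-in for the delays data (an assumption, not the real dataset)
toy <- tibble(
  day = c("Friday", "Friday", "Monday", "Monday"),
  line = c("BD", "YU", "BD", "BD"),
  min_gap = c(2, 4, 6, 8)
)

# hypothetical: dots for grouping, one captured argument for the mean
grouped_mean_dots <- function(df, mean_var, ...) {
  mean_var <- enquo(mean_var)
  df %>%
    group_by(...) %>%
    summarise(mean = mean(!!mean_var))
}

# mean_var is passed by name so the grouping columns all land in the dots
dots_out <- toy %>%
  grouped_mean_dots(mean_var = min_gap, day, line)
```

the catch is that `mean_var` has to be named in the call, otherwise the first grouping column gets matched to it by position.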

## pass the vars()

the thing about passing the dots is that those `...` are so mysterious. i definitely don’t always write documentation for my functions, so it’s nice to rely on named arguments that describe (even just a little!) what you should be throwing in there.

and sometimes it just doesn’t work! in my mind, there are two kinds of verbs in dplyr:

- verbs that take `...`, like `group_by()` and `select()`
- (scoped) verbs that take `vars()`, like `mutate_at()` and `summarise_at()`

and you have to write your function using `vars()` a little differently, depending.

### verbs that take `...`

for verbs that take `...`, you got to just pass the dots. but you cannot just pass the `vars()`! if you want to use a named argument, and `vars()`, then you have to expand the variables back out using `!!!` (that’s three bangs).

```
grouped_mean_delay_3 <- function(df, group_vars){
  df %>%
    group_by(!!!group_vars) %>%
    summarise(mean_delay = mean(min_delay))
}

delays %>%
  grouped_mean_delay_3(group_vars = vars(day, line))
```

```
## # A tibble: 63 x 3
## # Groups: day [?]
## day line mean_delay
## <chr> <chr> <dbl>
## 1 Friday 16 MCCOWAN 0
## 2 Friday 704 RAD BUS 0
## 3 Friday BD 2.01
## 4 Friday SHP 2.04
## 5 Friday SRT 5.98
## 6 Friday YU 2.49
## 7 Friday YU / BD 0
## 8 Friday YU/ BD 0
## 9 Friday YU/BD 0
## 10 Friday YUS 0
## # … with 53 more rows
```

beauty.
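why does `!!!` do the trick? a small sketch, assuming a made-up `toy` tibble: `vars()` just gives back a list of quosures, one per column, and `!!!` splices that list into `group_by()` as if each column had been typed out by hand.

```r
library(dplyr)

# made-up stand-in for the delays data (an assumption, not the real dataset)
toy <- tibble(
  day = c("Friday", "Friday", "Monday"),
  line = c("BD", "YU", "BD"),
  min_delay = c(1, 3, 5)
)

group_cols <- vars(day, line)

length(group_cols)              # one element per column passed to vars()
rlang::is_quosures(group_cols)  # it is a list of quosures

# splicing the list ...
spliced <- toy %>% group_by(!!!group_cols)

# ... should give the same grouping as typing the columns directly
typed <- toy %>% group_by(day, line)
```

`group_vars(spliced)` and `group_vars(typed)` should report the same grouping columns.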

### (scoped) verbs that take `vars()`

the `_at` scoped verbs, like `summarise_at()`, don’t take `...` as an argument.

the `vars` argument of `summarise_at()` specifically says it “takes a list of columns generated by `vars()`” (and some other things).

say we want the mean delay *and* the mean gap.

you can’t pass the dots here.

```
variable_mean_broken <- function(df, ...){
  df %>%
    summarise_at(..., mean)
}

delays %>%
  variable_mean_broken(min_delay, min_gap)
```

`## Error in check_dot_cols(.vars, .cols): object 'min_delay' not found`

but you *can* just pass the `vars()`.

we need to pass in something that `summarise_at()` expects, specifically something that looks more like a `vars()` call. because `summarise_at()` *expects* something using `vars()`, we don’t need to do anything to expand the variables out.

it’s just like above: because `group_by()` expects `...` arguments, we didn’t need to do anything to those dots.

```
variable_mean <- function(df, mean_vars){
  df %>%
    summarise_at(mean_vars, mean)
}

delays %>%
  variable_mean(mean_vars = vars(min_delay, min_gap))
```

```
## # A tibble: 1 x 2
## min_delay min_gap
## <dbl> <dbl>
## 1 2.33 3.42
```

beauty beauty.
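there is also a middle ground if you still want dots on the outside: wrap them in `vars()` inside the function. this is only a sketch, assuming a made-up `toy` tibble and a hypothetical function name, `variable_mean_dots()`. the dots can’t go straight into `summarise_at()`, but `vars(...)` turns them into the thing it expects.

```r
library(dplyr)

# made-up stand-in for the delays data (an assumption, not the real dataset)
toy <- tibble(
  min_delay = c(0, 3, 6),
  min_gap = c(0, 6, 12)
)

# hypothetical: take dots from the caller, hand summarise_at() a vars() list
variable_mean_dots <- function(df, ...) {
  df %>%
    summarise_at(vars(...), mean)
}

dots_means <- toy %>%
  variable_mean_dots(min_delay, min_gap)
```

the tradeoff is the one from before: the dots are back to being mysterious, and you only get one set of them per function.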

# many variables, for many purposes?

i think this whole `vars()` thing really shines when you have many variables for many purposes. i knew about passing the dots, but i was like… how do you pass the dots… twice? jenny bryan’s like yeah, you don’t.

you use `vars()`!

if i want to group by many variables *and* get the mean for many variables, then i can just throw a bunch of `vars()` in:

```
general_grouped_mean <- function(df, group_vars, mean_vars){
  df %>%
    group_by(!!!group_vars) %>%
    summarise_at(mean_vars, mean)
}

delays %>%
  general_grouped_mean(group_vars = vars(line, day),
                       mean_vars = vars(min_delay, min_gap))
```

```
## # A tibble: 63 x 4
## # Groups: line [?]
## line day min_delay min_gap
## <chr> <chr> <dbl> <dbl>
## 1 16 MCCOWAN Friday 0 0
## 2 16 MCCOWAN Saturday 0 0
## 3 704 RAD BUS Friday 0 0
## 4 999 Monday 0 0
## 5 999 Thursday 0 0
## 6 999 Tuesday 0 0
## 7 BD Friday 2.01 2.91
## 8 BD Monday 2.12 3.05
## 9 BD Saturday 2.58 3.84
## 10 BD Sunday 2.44 3.59
## # … with 53 more rows
```

the only thing here is that you *have* to use `vars()`, even if you’re just passing one variable. like, this doesn’t work:

```
delays %>%
  general_grouped_mean(group_vars = day,
                       mean_vars = vars(min_delay, min_gap))
```

`## Error in quos(...): object 'day' not found`

because the tidyeval method for passing in a single variable to `group_by()` is `enquo()` and then `!!`, as described in the first bit,

and neither does this:

```
delays %>%
  general_grouped_mean(group_vars = vars(day, line),
                       mean_vars = min_delay)
```

`## Error in check_dot_cols(.vars, .cols): object 'min_delay' not found`

because `summarise_at()` requires a `vars()` call (or one of the other options described in the `.vars` argument).

this is still something i’m working out myself. how do i account for the possibility of a `vars()`, or just a variable on its own? is this what methods are for? s3? i totally know about that. for now, i will be a heavy user of the `vars()`, even when it’s overkill, a la

```
delays %>%
  general_grouped_mean(group_vars = vars(day),
                       mean_vars = vars(min_delay))
```

```
## # A tibble: 7 x 2
## day min_delay
## <chr> <dbl>
## 1 Friday 2.34
## 2 Monday 2.46
## 3 Saturday 2.89
## 4 Sunday 2.52
## 5 Thursday 1.99
## 6 Tuesday 2.26
## 7 Wednesday 2.04
```
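one possible direction for the “`vars()` or a bare variable” question, as a sketch only and not a recommended pattern: capture the argument with `enquo()`, check whether the captured expression is a call to `vars()`, and only evaluate it in that case. `as_var_list()` and `flexible_grouped_mean()` are hypothetical names, and `toy` is made-up data.

```r
library(dplyr)

# made-up stand-in for the delays data (an assumption, not the real dataset)
toy <- tibble(
  day = c("Friday", "Friday", "Monday"),
  min_delay = c(1, 3, 5)
)

# hypothetical helper: turn a captured argument into a list of quosures,
# whether the caller wrote vars(a, b) or just a
as_var_list <- function(quo) {
  if (rlang::quo_is_call(quo, "vars")) {
    # the caller wrote vars(...): evaluate that call to get its quosures
    rlang::eval_tidy(quo)
  } else {
    # the caller wrote a bare variable: wrap it in a one-element list
    rlang::quos(!!quo)
  }
}

flexible_grouped_mean <- function(df, group_vars, mean_vars) {
  group_quo <- enquo(group_vars)
  mean_quo <- enquo(mean_vars)

  group_list <- as_var_list(group_quo)
  mean_list <- as_var_list(mean_quo)

  df %>%
    group_by(!!!group_list) %>%
    summarise_at(mean_list, mean)
}

# bare variables and vars() calls should now both be accepted
bare_out <- toy %>%
  flexible_grouped_mean(group_vars = day, mean_vars = min_delay)

vars_out <- toy %>%
  flexible_grouped_mean(group_vars = vars(day), mean_vars = vars(min_delay))
```

one limitation: this only looks at what was literally typed in the call, so a `vars()` list saved in a variable first (say `my_vars <- vars(day)`) would be mistaken for a column called `my_vars`.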

# passing functions

the last thing i know is something i *just* learned, because someone else asked about it on twitter. thanks benjamin gowan!

say i actually looked at my data and discovered the mean isn’t a great measure for delays, so i want the median, too.

the way to do this outside a function is:

```
delays %>%
  summarise_at(vars(min_delay, min_gap),
               funs(mean, median))
```

```
## # A tibble: 1 x 4
## min_delay_mean min_gap_mean min_delay_median min_gap_median
## <dbl> <dbl> <dbl> <dbl>
## 1 2.33 3.42 0 0
```

so if you’re writing a function, it looks like this:

```
summary_by_var <- function(df, summary_vars, summary_funs){
  df %>%
    summarise_at(summary_vars, summary_funs)
}
```

if you’re just doing one summary function, it’s pretty easy to just pass it right in:

```
delays %>%
  summary_by_var(vars(min_delay, min_gap),
                 median)
```

```
## # A tibble: 1 x 2
## min_delay min_gap
## <dbl> <dbl>
## 1 0 0
```

for multiple, you pass in a *list* of functions generated by `funs()`. i suggest you name them, otherwise your output will be ugly.

```
delays %>%
  summary_by_var(vars(min_delay, min_gap),
                 funs(mean = mean, median = median))
```

```
## # A tibble: 1 x 4
## min_delay_mean min_gap_mean min_delay_median min_gap_median
## <dbl> <dbl> <dbl> <dbl>
## 1 2.33 3.42 0 0
```

you could actually use `list(mean = mean, median = median)` too, but i think the `funs()` case is clearer, since that’s what `summarise_at()`’s `.funs` argument says it takes.
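a version note: `funs()` was deprecated in dplyr 0.8.0, so on a newer install the `list()` form is the one that won’t warn. here’s a small sketch of it, assuming a made-up `toy` tibble and the same `summary_by_var()` shape as above.

```r
library(dplyr)

# made-up stand-in for the delays data (an assumption, not the real dataset)
toy <- tibble(
  min_delay = c(0, 2, 10),
  min_gap = c(0, 4, 14)
)

summary_by_var <- function(df, summary_vars, summary_funs) {
  df %>%
    summarise_at(summary_vars, summary_funs)
}

# a named list of functions; the names become suffixes on the output columns
list_out <- toy %>%
  summary_by_var(vars(min_delay, min_gap),
                 list(mean = mean, median = median))
```

the output columns should follow the same `variable_function` naming as the `funs()` version, e.g. `min_delay_mean` and `min_gap_median`.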

# putting it all together

now we can put everything together: group by multiple things and summarise multiple variables using multiple functions.

```
grouped_summary <- function(df, group_vars, summary_vars, summary_funs){
  df %>%
    group_by(!!!group_vars) %>%
    summarise_at(summary_vars, summary_funs)
}

delays %>%
  grouped_summary(group_vars = vars(day, line),
                  summary_vars = vars(min_delay, min_gap),
                  summary_funs = funs(mean = mean,
                                      median = median))
```

```
## # A tibble: 63 x 6
## # Groups: day [?]
## day line min_delay_mean min_gap_mean min_delay_median min_gap_median
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Friday 16 M… 0 0 0 0
## 2 Friday 704 … 0 0 0 0
## 3 Friday BD 2.01 2.91 0 0
## 4 Friday SHP 2.04 3.56 0 0
## 5 Friday SRT 5.98 9.31 4 10
## 6 Friday YU 2.49 3.66 0 0
## 7 Friday YU /… 0 0 0 0
## 8 Friday YU/ … 0 0 0 0
## 9 Friday YU/BD 0 0 0 0
## 10 Friday YUS 0 0 0 0
## # … with 53 more rows
```

wowwww.

# bye

there is a lot i don’t know still, and i am happy to be pointed in the direction of other information, or told if i’m spreading misinformation! i especially would like to know how to be able to pass `vars()` OR just a single variable without `vars()`. do i write an `if` statement and then use `enquo()` and `!!`? i don’t know! do you?

thanks to everyone that talks to me on twitter about tidyeval. there’s a lot out there 🌏

here is a collection of tidyeval resources that probably explain why things work the way they do: