Strategies for working with new data
(This post is basically a blog-replicate of a talk I gave at the R-Ladies Toronto + GTA R User Group kickoff, called “Opinionated Strategies for Uncharted Territories” – if you saw that, this is old news 💅)
A few weeks ago the TTC launched a new campaign around not going onto the tracks to retrieve dropped items – “it’s not worth your life.” Pretty reasonable, right?
This came along with some press saying that unauthorized people at the track level caused 26 hours of delays on the TTC last year (this includes people going onto tracks to get their belongings, people sneaking into tunnels, etc. Probably not when people drive into tunnels).
I was curious to see how this varied from line to line, which stations are the most common for people going onto the tracks (!), and what the other causes of delay were.
Luckily, Toronto Open Data releases TTC delays and has all of the 2018 delays. My goal is to answer a few of these questions about TTC delays, and to showcase some strategies for working with a new data set.
Let’s look at the data.
(A small disclaimer, and my apologies to the TTC and Toronto Open Data – I messed with this data. Just a bit. Some of the errors are yours, some are mine. You can see the messing around I did here).
delays
## # A tibble: 20,735 x 8
## date time station code description min_delay bound line
## <date> <chr> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 2018-04-01 00:27 ST GEORG… MUSAN Unsanitary Vehic… 8 W BD
## 2 2018-04-01 07:56 FINCH ST… TUSC Operator Overspe… 0 S YU
## 3 2018-04-01 08:00 YONGE UN… MUO Miscellaneous Ot… 0 <NA> YU
## 4 2018-04-01 09:50 KIPLING … TUSC Operator Overspe… 0 W BD
## 5 2018-04-01 10:18 VICTORIA… MUSC Miscellaneous Sp… 0 W BD
## 6 2018-04-01 10:22 KENNEDY … EUNT Equipment - No T… 3 W BD
## 7 2018-04-01 10:30 WILSON S… PUMEL Escalator/Elevat… 0 <NA> YU
## 8 2018-04-01 11:20 WILSON S… TUSC Operator Overspe… 0 S YU
## 9 2018-04-01 11:32 FINCH WE… MUIRS Injured or ill C… 0 <NA> YU
## 10 2018-04-01 12:00 JANE STA… TUSC Operator Overspe… 0 W BD
## # … with 20,725 more rows
Looks good!
Now, I want to look at the top 5 causes for delay along each line (YU, BD, SHP, and SRT), and see the total delays they caused in 2018. min_delay
shows the delay time, in minutes, so we can use that.
library(dplyr)
library(ggplot2)
delays %>%
group_by(line, code) %>%
summarise(delays = sum(min_delay)) %>%
arrange(-delays) %>%
slice(1:5) %>%
ggplot(aes(x = code,
y = delays)) +
geom_col() +
facet_wrap(vars(line),
scales = "free_y",
nrow = 4) +
coord_flip()
So, this is definitely not what I expected, nor what the dataset’s documentation says to expect (that the line should be one of YU, BD, SHP, and SRT).
There are a bunch of “other” “lines” with varying amounts of data and varying degrees of “line-ness” in general – 16 MCCOWAN is not a subway line, and “YU/BD” appears in various spellings (probably for delays at Yonge, St. George, and Spadina stations?).
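If you want to see exactly what values ended up in line, a quick count works well (a minimal sketch using dplyr on the same delays data frame):
delays %>%
  count(line, sort = TRUE)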
So let’s just focus on the actual lines.
delays %>%
filter(line %in% c("BD", "YU",
"SHP", "SRT")) %>%
group_by(line, description) %>%
summarise(delays = sum(min_delay)) %>%
arrange(-delays) %>%
slice(1:5) %>%
ggplot(aes(x = description,
y = delays)) +
geom_col() +
facet_wrap(vars(line),
scales = "free_x",
nrow = 1) +
coord_flip()
This is now, apparently, the top 5 causes for delay on each line. We see some causes here that are more common than I’d have thought – “Bomb Threat”, “Suspicious Package”, and “Force Majeure” (i.e., a force of nature, absolving the TTC of responsibility) all rank pretty high.
But nowhere is “Unauthorized at Track Level”! So let’s focus on that.
delays %>%
filter(description == "Unauthorized at Track Level") %>%
group_by(line) %>%
summarise(delays = sum(min_delay))
## # A tibble: 4 x 2
## line delays
## <chr> <dbl>
## 1 BD NA
## 2 SHP 7
## 3 SRT 0
## 4 YU NA
You might be familiar with what’s happening here – there are NA values in the data, which causes the sum to be NA, but it has an easy fix!
delays %>%
filter(description == "Unauthorized at Track Level") %>%
group_by(line) %>%
summarise(delays = sum(min_delay, na.rm = TRUE))
## # A tibble: 4 x 2
## line delays
## <chr> <dbl>
## 1 BD 646
## 2 SHP 7
## 3 SRT 0
## 4 YU 829
This also affects our first couple of plots, which showed the top 5 delays for each line – any delay code that has an NA value therefore has a total delay of NA, and wouldn’t have made the cut when we ranked the lines by delay; so even if it was the top delay, it would be missing.
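(Just to make the mechanics concrete – this is base R behaviour, nothing specific to this data set:)
sum(c(8, 3, NA))               # NA: a single missing value makes the whole total NA
sum(c(8, 3, NA), na.rm = TRUE) # 11: missing values are dropped before summing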
So, I want to talk about some strategies for avoiding this. Sure, it’s not that bad to have to go back and add in the na.rm = TRUE, or to filter your data and ensure consistent coding, but it would be really nice to know that you have to do this up front. It’d also be nice to find out right away, and not rely on happening to “do something with that variable” to notice the mistake.
Visual Summaries
The first strategy I want to talk about is simple, but really powerful, and that’s getting a visual summary of your data. I want to be able to learn as much about my data, its structure and its attributes, as fast as possible.
visdat package
A great way to get a visual overview of your data structure is through the visdat
package, created by Nick Tierney. This package allows you to “get a look at the data” by visualizing the data frame and any missingness.
library(visdat)
vis_dat(delays)
By doing this, we learn a bit about our data that wasn’t apparent just from looking at the first 10 lines. It shows us the variable types, and any missingness.
- Sometimes the code doesn’t have a corresponding description
- Sometimes bound and line are missing
- Sometimes min_delay is missing!
And if we want to get a better handle on what percent of our data is missing, we can use vis_miss().
vis_miss(delays)
This actually reports what percent of data is missing, both for each variable and overall. For example, the code’s description is missing about 2% of the time, the actual delay time (min_delay) is missing 5% of the time, and bound is missing almost 25% of the time!
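If you want those percentages as plain numbers rather than a plot, a quick base R check gives the same information (a sketch):
round(colMeans(is.na(delays)) * 100, 1)  # percent missing, per column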
Just doing this can spark some questions that we might not have had before:
- When is description missing? Are there specific codes that just don’t have a corresponding description? Was the code just entered incorrectly, and should it be recoded to match an existing code?
- Why is min_delay missing?
- What does it mean for bound to be missing? Is it that the delay was in both directions? Do certain types of delays not have a bound?
And by following up on these questions, either by exploring the data or asking people for the answers, we can learn a ton about TTC delays and about this data set.
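For example, the first question is easy to start chipping away at (again, just a sketch with dplyr on the delays data frame):
# Which codes show up without a corresponding description?
delays %>%
  filter(is.na(description)) %>%
  count(code, sort = TRUE)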
Doing this also gives us the knowledge that sometimes min_delay
is missing, and we will need to handle that properly in any calculations with it!
Variable summaries
The next thing I’d probably want to do is understand the different attributes of each variable, the values it can take on, etc.
One of the most classic ways to do this is by using the summary()
function, e.g.
summary(delays[["min_delay"]])
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 2.328 3.000 515.000 1036
which tells us a bit about the distribution of the data, and how many NAs there are,
or,
summary(delays[["line"]])
## Length Class Mode
## 20735 character character
which tells you absolutely nothing.
summary(), in all its generality-glory, works on a ton of different classes of objects. Unfortunately, it’s of varying utility – for a numeric vector, you get a lot of information; for a character vector, you get nothing useful.
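For a character vector, a plain count of values is usually closer to what you actually want – for example, with base table() (a sketch; useNA = "ifany" makes the missing values show up too):
table(delays[["bound"]], useNA = "ifany")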
It’s also annoying to go variable-by-variable. Sure, you can use summary()
on a whole data frame,
summary(delays)
## date time station
## Min. :2018-01-01 Length:20735 Length:20735
## 1st Qu.:2018-04-05 Class :character Class :character
## Median :2018-07-05 Mode :character Mode :character
## Mean :2018-07-01
## 3rd Qu.:2018-09-27
## Max. :2018-12-31
##
## code description min_delay
## Length:20735 Length:20735 Min. : 0.000
## Class :character Class :character 1st Qu.: 0.000
## Mode :character Mode :character Median : 0.000
## Mean : 2.328
## 3rd Qu.: 3.000
## Max. :515.000
## NA's :1036
## bound line
## Length:20735 Length:20735
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
but this isn’t a great way to take in the information – the usefulness varies from column to column, and it’s all jumbled together.
skimr package
The skimr
package, maintained by Elin Waring and Michael Quinn, is a great solution to this. It’s designed to show summary statistics that you can quickly skim to understand your data.
skimr
also conforms to something called the “Principle of Least Surprise”, which I thought was really interesting. It’s a concept from UI and software design that basically says the behaviour of something shouldn’t surprise a user, and that the design (of the UI or software) should match the user’s experience, expectations, and mental models.
I (and obviously the creators of skimr) think this applies to data so nicely. The data should match your experience, expectations, and mental models. And if it doesn’t, you should find out quickly!
The skim()
function shows summary statistics for each variable, separated by variable class (e.g. all character variables together, all integers together, etc).
library(skimr)
delays %>%
skim()
## Skim summary statistics
## n obs: 20735
## n variables: 8
##
## ── Variable type:character ──────────────────────────────────────────────────
## variable missing complete n min max empty n_unique
## bound 4698 16037 20735 1 1 0 6
## code 0 20735 20735 3 5 0 184
## description 462 20273 20735 4 58 0 132
## line 101 20634 20735 2 11 0 13
## station 0 20735 20735 8 22 0 201
## time 0 20735 20735 5 5 0 1388
##
## ── Variable type:Date ───────────────────────────────────────────────────────
## variable missing complete n min max median n_unique
## date 0 20735 20735 2018-01-01 2018-12-31 2018-07-05 365
##
## ── Variable type:numeric ────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## min_delay 1036 19699 20735 2.33 8.54 0 0 0 3 515 ▇▁▁▁▁▁▁▁
For all variables, it tells you how many values are missing, how many are complete (i.e., not missing), and the total count, n.
For character variables, it shows the min and max length of non-empty strings in that variable, the number of empty strings (i.e., "" – this is great if you might have some things that should be NA masquerading as empty strings!), and the number of unique strings.
For numeric variables, it shows the mean, sd, some percentiles, and a tiny in-line histogram(!).
For date variables, it shows the min, max, median, and number of unique dates.
Again, we learn a ton here that we didn’t get from the first few rows of data, or from an initial summary():
- There are 6 unique values for bound (I would have expected 4: N, S, E, W).
- There are 184 different codes but only 132 descriptions (Do some codes share descriptions? Were some codes mistyped?)
- There are 13(!) lines!!
- There are 201 stations (I googled – apparently there are actually only 75?)
- min_delay goes from 0 (good) to over 500
While I’m pretty surprised by the results, I think this is a much friendlier way to find some of these things out than by trying to do an analysis and getting weird results and then having to track down how things work (and adjust accordingly).
I also think this kind of visual summary has an added bonus: if you don’t have any expectations or mental models, it’s a fast way to get one.
skimr provides a really good jumping-off point for other things to explore to get to know the data. For example, I want to know about the different values for bound, and how those vary by line.
skimr works really nicely with grouped data frames, showing a summary for each group. It also has a much nicer display for factors than for characters (i.e., it shows the top factor levels), and it can skim only a selected variable.
So, we can see summary statistics only for the bound variable, but within each value of line. Let’s focus on the “main” lines.
delays %>%
filter(line %in% c("BD", "YU", "SHP", "SRT")) %>%
mutate(bound = as.factor(bound)) %>%
group_by(line) %>%
skim(bound)
## Skim summary statistics
## n obs: 20254
## n variables: 8
## group variables: line
##
## ── Variable type:factor ─────────────────────────────────────────────────────
## line variable missing complete n n_unique
## BD bound 1902 7217 9119 5
## SHP bound 107 405 512 3
## SRT bound 173 544 717 5
## YU bound 2055 7851 9906 6
## top_counts ordered
## W: 3657, E: 3543, NA: 1902, N: 10 FALSE
## W: 219, E: 185, NA: 107, N: 1 FALSE
## S: 271, N: 261, NA: 173, E: 5 FALSE
## S: 4064, N: 3763, NA: 2055, E: 14 FALSE
Some irregularities already pop up – for example, BD is an East/West running line, but there are a total of 10 delays on this line recorded as “North” bound. Similarly for YU – it runs North/South, but there are 14 delays recorded as “East” bound. Maybe bound was just coded incorrectly, but I’d be curious to see which stations these delays are at. For example, if they’re at Bloor/Yonge station, then maybe it was the line that was coded incorrectly.
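That’s easy enough to check, too (a sketch – the filtering here is mine, just to illustrate the follow-up):
# Which stations are the "North" bound delays on the BD line recorded at?
delays %>%
  filter(line == "BD", bound == "N") %>%
  count(station, sort = TRUE)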
These are little inconsistencies that you can totally go down the rabbit-hole on, but they’re useful to at least know about in case they cause problems in an analysis later on.
So, overall, visual summaries are a really good way to get a feel for your data, verify or build a mental model, and identify areas for further investigation.
Opinionated Strategies
Visual summaries have a pretty important downfall: they rely on you, there, looking at the data. They rely on you knowing what you’re looking for, remembering what you’re looking for (or writing lots of documentation and comments), and they rely on interactive coding.
In addition to (or instead of) visual summaries, I’d suggest codifying your assumptions. This is where the “opinionated” bit comes in. Codifying your assumptions requires that your assumptions be explicitly spelled out, and ideally, that your code fails spectacularly and loudly if they’re not met.
I think that this is better than the alternative case where they’re not met, but everything still looks “ok”, and bad results can go forward if they don’t raise any red flags.
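Even without any packages, the most minimal version of this is a stopifnot() near the top of a script (a sketch – and with this data it would, correctly, fail right away because of the missing delay times):
stopifnot(
  !is.na(delays$min_delay),  # no missing delay times
  delays$min_delay >= 0      # no negative delay times
)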
assertr package
The assertr
package, created by Tony Fischetti, is designed to help verify assumptions about data early on in a data pipeline. It forces you to explicitly outline any assumptions about your data, and then alerts you of any deviations from those assumptions.
The assert()
function is used to assert assumptions about columns of a data set. By default, it throws an error if the assumptions are not met. If they are met, then the original data frame is returned.
For example, let’s say we want to assert that the column line
has to be one of BD, YU, SHP, SRT.
library(assertr)
delays %>%
assert(in_set("YU", "BD", "SHP", "SRT"), line)
## Column 'line' violates assertion 'in_set("YU", "BD", "SHP", "SRT")' 380 times
## verb redux_fn predicate column index value
## 1 assert NA in_set("YU", "BD", "SHP", "SRT") line 27 YU/BD
## 2 assert NA in_set("YU", "BD", "SHP", "SRT") line 71 YU/BD
## 3 assert NA in_set("YU", "BD", "SHP", "SRT") line 121 YU/BD
## 4 assert NA in_set("YU", "BD", "SHP", "SRT") line 160 YU/BD
## 5 assert NA in_set("YU", "BD", "SHP", "SRT") line 180 YU/BD
## [omitted 375 rows]
## Error: assertr stopped execution
Because this assumption isn’t met, assert
returns an error that the assertion was violated 380 times, and shows the first few lines where this is true (with index
as the row number), as well as the value that violates that assumption (in the examples it gives, the value “YU/BD” is outside of the set we named).
assert works using functions called predicates. A predicate (apparently a kindergarten-level grammar concept that I’ve forgotten) says something about the properties or actions of a subject. In this case, the predicate in_set("YU", "BD", "SHP", "SRT") says that the subject, line, takes on one of the values of YU, BD, SHP, and SRT.
There are some other useful predicates built into assertr. For example, within_bounds() says that a numeric variable can only take on values within the specified bounds.
If I want to assert the assumption that min_delay
has to be a non-negative number, then I would codify this as
delays %>%
assert(within_bounds(0, Inf), min_delay)
## # A tibble: 20,735 x 8
## date time station code description min_delay bound line
## <date> <chr> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 2018-04-01 00:27 ST GEORG… MUSAN Unsanitary Vehic… 8 W BD
## 2 2018-04-01 07:56 FINCH ST… TUSC Operator Overspe… 0 S YU
## 3 2018-04-01 08:00 YONGE UN… MUO Miscellaneous Ot… 0 <NA> YU
## 4 2018-04-01 09:50 KIPLING … TUSC Operator Overspe… 0 W BD
## 5 2018-04-01 10:18 VICTORIA… MUSC Miscellaneous Sp… 0 W BD
## 6 2018-04-01 10:22 KENNEDY … EUNT Equipment - No T… 3 W BD
## 7 2018-04-01 10:30 WILSON S… PUMEL Escalator/Elevat… 0 <NA> YU
## 8 2018-04-01 11:20 WILSON S… TUSC Operator Overspe… 0 S YU
## 9 2018-04-01 11:32 FINCH WE… MUIRS Injured or ill C… 0 <NA> YU
## 10 2018-04-01 12:00 JANE STA… TUSC Operator Overspe… 0 W BD
## # … with 20,725 more rows
And this is actually true, so what we get is the original data set! By default, within_bounds() allows NA values, so I should be careful and explicit that my assumption is that min_delay is a non-negative number with no missing values. To do that, I can set allow.na = FALSE in within_bounds().
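Spelled out, that stricter assumption would look something like this (a sketch – and given the missing values we just saw, it would fail loudly):
delays %>%
  assert(within_bounds(0, Inf, allow.na = FALSE), min_delay)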
Or, I can use the not_na() predicate, which specifically checks that each element is not NA!
delays %>%
assert(not_na, min_delay)
## Column 'min_delay' violates assertion 'not_na' 1036 times
## verb redux_fn predicate column index value
## 1 assert NA not_na min_delay 34 NA
## 2 assert NA not_na min_delay 57 NA
## 3 assert NA not_na min_delay 75 NA
## 4 assert NA not_na min_delay 80 NA
## 5 assert NA not_na min_delay 154 NA
## [omitted 1031 rows]
## Error: assertr stopped execution
And now this fails, because there are NA
values.
The other built-in predicate is is_uniq(), which checks whether each element of a variable appears only once (this can be useful for e.g. IDs in a data set). You can also include your own predicate, by writing a function that returns TRUE/FALSE.
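As a sketch, a custom predicate could look like this – the rule and the helper name are made up purely for illustration, not something this data set promises:
# Hypothetical check: stations in this data should end in STATION or YARD
looks_like_station <- function(x) grepl("STATION|YARD", x)

delays %>%
  assert(looks_like_station, station)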
The assertr
functions are designed to be piped together (i.e., all of your assumptions in a single pipe), but require some modification. The default behaviour is for the pipe to break after the first error, e.g.
delays %>%
assert(in_set("YU", "BD", "SHP", "SRT"), line) %>%
assert(within_bounds(0, Inf), min_delay) %>%
assert(not_na, min_delay)
## Column 'line' violates assertion 'in_set("YU", "BD", "SHP", "SRT")' 380 times
## verb redux_fn predicate column index value
## 1 assert NA in_set("YU", "BD", "SHP", "SRT") line 27 YU/BD
## 2 assert NA in_set("YU", "BD", "SHP", "SRT") line 71 YU/BD
## 3 assert NA in_set("YU", "BD", "SHP", "SRT") line 121 YU/BD
## 4 assert NA in_set("YU", "BD", "SHP", "SRT") line 160 YU/BD
## 5 assert NA in_set("YU", "BD", "SHP", "SRT") line 180 YU/BD
## [omitted 375 rows]
## Error: assertr stopped execution
only tells us that line
violated the in_set()
assumption, and nothing about min_delay
– even though we know the not_na
assumption isn’t met.
To remedy this, I use chain_start() at the beginning of the chain of assumptions, and chain_end() at the end. This overrides the default behaviour of assert (to stop after the first failure) and instead shows information about all of the failed assertions. I’m also passing error_fun = error_stop to chain_end() here, otherwise it literally prints all of the errors and that makes this post very, very long.
delays %>%
chain_start() %>%
assert(in_set("YU", "BD", "SHP", "SRT"), line) %>%
assert(within_bounds(0, Inf), min_delay) %>%
assert(not_na, min_delay) %>%
chain_end(error_fun = error_stop)
## Column 'line' violates assertion 'in_set("YU", "BD", "SHP", "SRT")' 380 times
## verb redux_fn predicate column index value
## 1 assert NA in_set("YU", "BD", "SHP", "SRT") line 27 YU/BD
## 2 assert NA in_set("YU", "BD", "SHP", "SRT") line 71 YU/BD
## 3 assert NA in_set("YU", "BD", "SHP", "SRT") line 121 YU/BD
## 4 assert NA in_set("YU", "BD", "SHP", "SRT") line 160 YU/BD
## 5 assert NA in_set("YU", "BD", "SHP", "SRT") line 180 YU/BD
## [omitted 375 rows]
##
##
## Column 'min_delay' violates assertion 'not_na' 1036 times
## verb redux_fn predicate column index value
## 1 assert NA not_na min_delay 34 NA
## 2 assert NA not_na min_delay 57 NA
## 3 assert NA not_na min_delay 75 NA
## 4 assert NA not_na min_delay 80 NA
## 5 assert NA not_na min_delay 154 NA
## [omitted 1031 rows]
## Error: assertr stopped execution
There are definitely more functions in assertr – assert only allows you to make assumptions about columns. It may be useful to make assertions about the overall data structure (using verify()), about values of a variable in relation to its mean (using insist()), or e.g. about attributes of a row overall (using assert_rows()).
In the end, though, a huge plus of assertr
functions is that when all of the assumptions are satisfied, the result is just the original data set. This allows you to check assumptions and do calculations all in one go.
Let’s say we have a cleaned version of delays, where all NA values of min_delay have been dealt with and any “non-stations” are removed.
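(The cleaning itself isn’t the focus here, so this is only a rough sketch of what delays_clean could look like – the exact rules, especially what counts as a “non-station”, are assumptions on my part:)
delays_clean <- delays %>%
  filter(line %in% c("YU", "BD", "SHP", "SRT")) %>%  # keep only the real lines
  filter(!is.na(min_delay)) %>%                      # drop rows with a missing delay time
  filter(grepl("STATION|YARD", station))             # stand-in rule for "real" stations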
Because all of the assumptions are met, the result from the chain of assumptions is just the original data set, and we can aggregate and plot as we wish!
delays_clean %>%
chain_start() %>%
assert(in_set("YU", "BD", "SHP", "SRT"), line) %>%
assert(within_bounds(0, Inf), min_delay) %>%
assert(not_na, min_delay) %>%
chain_end() %>%
group_by(line, description) %>%
summarise(delays = sum(min_delay)) %>%
arrange(-delays) %>%
slice(1:5) %>%
ggplot(aes(x = description,
y = delays)) +
geom_col() +
facet_wrap(vars(line),
scales = "fixed",
nrow = 1) +
coord_flip()
So now, for real, we see the top 5 causes for delays on the TTC subway and SRT in 2018 – “Force Majeure” is less common, and we see some rough, but true causes of TTC delays – disorderly patron, ill customers, the ATC project, etc.
Interestingly enough, “Unauthorized at Track Level” still doesn’t crack the top 5 🤔
The end!
I hope I’ve managed to instill some good approaches around working with a new data set, specifically:
- Be wary of diving right into results
- Efficiently visualize your data to verify or build mental models
- Be liberal with rabbit holes, investigations, and edge cases
- Codify your assumptions so that they, not you, fail spectacularly
Bye bye!