Tidying the TTC
Hello and welcome to another episode of tidy that data!
I enjoyed my last data tidying post so much and I’ve tweeted a couple of times about the immense satisfaction I get from tidying data sets – so this is long overdue.
I’m returning to the City of Toronto Open Data catalogue (FYI – they just launched their new open data portal, which includes a data blog, better catalogue navigation, a developer API, in-portal data exploration, and more! congratulations to the team 🎉), this time looking at Toronto Transit Commission (TTC), aka public transit, data.
This specific data set is called “TTC Ridership Analysis”. As the source says, the “TTC Ridership Analysis data set measures the first point of payment when boarding at the start of a journey using the TTC. Data includes ticket types as well as the type of vehicle.” Since it is an analysis data set, some aggregation and analysis has already been done to get it in its current format.
Unfortunately, this is not the format I want 😈
This data set actually contains three-in-one – who, where, and when.
Within the “who” data set, there are main headings (adult, senior/student, children) and subheadings (tokens, tickets, monthly pass, etc) as well as a additional section that doesn’t contain a main heading at all (day/vist./other, blind/war amps, gta pass, etc). These headings are all in the same column. There are sub-totals for each “who” section and a “system total” at the end. The “where” and “when” sections follow a similar pattern, with main and sub-headings and totals.
There is also some preamble and “postamble” discussing the data set. Because of the preamble, I expect that we won’t get nice column headings when importing the data set.
That’s okay, this is what R was made for 💪
(this too)
Let’s load the data in.
library(readxl)
library(knitr)
ttc <- read_excel(here::here("content", "post",
"2018-05-15-tidying-the-ttc",
"1985-2017 Analysis of ridership.xlsx"))
TORONTO TRANSIT COMMISSION | …2 | …3 | …4 | …5 | …6 | …7 | …8 | …9 | …10 |
---|---|---|---|---|---|---|---|---|---|
ANALYSIS OF RIDERSHIP | NA | NA | NA | NA | NA | NA | NA | NA | NA |
1985 TO 2017 ACTUALS (000’S) | NA | NA | NA | NA | NA | NA | NA | NA | NA |
NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
NA | FARE MEDIA | 2017 | 2016 | 2015 * | 2014 | 2013 | 2012 | 2011 | 2010 |
WHO | ADULT | NA | NA | NA | NA | NA | NA | NA | NA |
NA | TOKENS | 76106 | 102073 | 110945 | 111157 | 112360 | 117962 | 124748 | 120366 |
NA | TICKETS | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 1298 |
NA | TWO-FARE | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
NA | PRESTO - SINGLE RIDE | 67829 | 27397 | 13323 | 9862 | 8194 | 4399 | 1139 | 0 |
NA | PRESTO - SRVM TOKEN RIDE | 1271 | 1157 | N/A | N/A | N/A | N/A | N/A | N/A |
As expected, the first few rows are the preamble, and the first column gives the who/where/when specification. The second column of data includes both the main headings (adult) as well as sub headings (tokens, tickets, etc). All of the columns are characters, and the missing values (“N/A”) read in as strings instead of missing values.
We’ll get started by replacing those “N/A” strings with literal NA (missing values), and converting everything to lowercase so it doesn’t feel like the data set is yelling at us 📣. For the NA replacement, I’m using Nicholas Tierney’s naniar
package, designed to deal with missing data. This vignette specifically is a great walk through of how to replace values with NA.
library(dplyr)
library(naniar)
ttc <- ttc %>%
replace_with_na_all(condition = ~.x == "N/A") %>%
mutate_all(.funs = tolower)
TORONTO TRANSIT COMMISSION | …2 | …3 | …4 | …5 | …6 | …7 | …8 | …9 | …10 |
---|---|---|---|---|---|---|---|---|---|
analysis of ridership | NA | NA | NA | NA | NA | NA | NA | NA | NA |
1985 to 2017 actuals (000’s) | NA | NA | NA | NA | NA | NA | NA | NA | NA |
NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
NA | fare media | 2017 | 2016 | 2015 * | 2014 | 2013 | 2012 | 2011 | 2010 |
who | adult | NA | NA | NA | NA | NA | NA | NA | NA |
NA | tokens | 76106 | 102073 | 110945 | 111157 | 112360 | 117962 | 124748 | 120366 |
NA | tickets | NA | NA | NA | NA | NA | NA | NA | 1298 |
NA | two-fare | NA | NA | NA | NA | NA | NA | NA | NA |
NA | presto - single ride | 67829 | 27397 | 13323 | 9862 | 8194 | 4399 | 1139 | 0 |
NA | presto - srvm token ride | 1271 | 1157 | NA | NA | NA | NA | NA | NA |
Next, I’m going to remove those first few rows of description and any row that describes a total (it’s just aggregating the other rows).
I’m also cleaning up the column names using Sam Firke’s janitor
package, since it feels like they’re yelling too.
library(stringr)
library(janitor)
ttc <- ttc[-c(1:3),] %>%
clean_names() %>%
filter(!str_detect(x2, "total")) %>%
mutate(id = row_number())
id | toronto_transit_commission | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 |
---|---|---|---|---|---|---|---|---|---|
1 | NA | fare media | 2017 | 2016 | 2015 * | 2014 | 2013 | 2012 | 2011 |
2 | who | adult | NA | NA | NA | NA | NA | NA | NA |
3 | NA | tokens | 76106 | 102073 | 110945 | 111157 | 112360 | 117962 | 124748 |
4 | NA | tickets | NA | NA | NA | NA | NA | NA | NA |
5 | NA | two-fare | NA | NA | NA | NA | NA | NA | NA |
6 | NA | presto - single ride | 67829 | 27397 | 13323 | 9862 | 8194 | 4399 | 1139 |
At this point, I’ve also added in an id
column that’s just the row number. I’ll use this to figure out how to cutoff and split the various who/where/when data sets. To do this, I’m getting the id
(i.e., row) where the “who”, “where”, and “when” sections start by seeing, for example, where toronto_transit_commission == "who"
.
who_start <- ttc %>%
filter(toronto_transit_commission == "who") %>%
pull(id)
where_start <- ttc %>%
filter(toronto_transit_commission == "where") %>%
pull(id)
when_start <- ttc %>%
filter(toronto_transit_commission == "when") %>%
pull(id)
From this, we can see that row 2 is where the “who” data set begins, and rows 34 and 41 for the “where” and “when” data sets, respectively.
ttc_who <- ttc[who_start:(where_start - 1), -1]
x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | x11 |
---|---|---|---|---|---|---|---|---|---|
adult | NA | NA | NA | NA | NA | NA | NA | NA | NA |
tokens | 76106 | 102073 | 110945 | 111157 | 112360 | 117962 | 124748 | 120366 | 114686 |
tickets | NA | NA | NA | NA | NA | NA | NA | 1298 | 8807 |
two-fare | NA | NA | NA | NA | NA | NA | NA | NA | NA |
presto - single ride | 67829 | 27397 | 13323 | 9862 | 8194 | 4399 | 1139 | 0 | 0 |
presto - srvm token ride | 1271 | 1157 | NA | NA | NA | NA | NA | NA | NA |
Since this contains a few different sections (adult, senior/student, etc), I’m repeating a similar process to understand where each of those sections starts. If you know of a more clever way to do this, let me know!
ttc_who <- ttc_who %>%
mutate(id = row_number())
adult_start <- ttc_who %>%
filter(x2 == "adult") %>%
pull(id)
senior_student_start <- ttc_who %>%
filter(x2 == "senior/student") %>%
pull(id)
children_start <- ttc_who %>%
filter(x2 == "children") %>%
pull(id)
remaining_start <- ttc_who %>%
filter(x2 == "day/vist./other") %>%
pull(id)
I’ve recomputed the id
field so that it uses the new row numbers from ttc_who
, and not from the original ttc
sedate. Again, we can see that the adult, senior/student, children, and remaining sections start at rows 1, 14, 23, and 28. Now let’s split them into individual data sets, adding a “who” column that describes the main heading so that data’s not lost.
adult <- ttc_who[(adult_start + 1):(senior_student_start - 1), ] %>%
mutate(who = "adult")
senior_student <- ttc_who[(senior_student_start + 1):(children_start - 1), ] %>%
mutate(who = "senior/student")
children <- ttc_who[(children_start + 1):(remaining_start - 1), ] %>%
mutate(who = "children")
who | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 |
---|---|---|---|---|---|---|---|---|---|
adult | tokens | 76106 | 102073 | 110945 | 111157 | 112360 | 117962 | 124748 | 120366 |
adult | tickets | NA | NA | NA | NA | NA | NA | NA | 1298 |
adult | two-fare | NA | NA | NA | NA | NA | NA | NA | NA |
adult | presto - single ride | 67829 | 27397 | 13323 | 9862 | 8194 | 4399 | 1139 | 0 |
adult | presto - srvm token ride | 1271 | 1157 | NA | NA | NA | NA | NA | NA |
adult | presto - srvm cash ride | 821 | 582 | NA | NA | NA | NA | NA | NA |
Starting to look a lot better! The “remaining” section is a bit of an anomaly in that its first column actually describes the “who” and not the fare type, so it needs some additional massaging. I’m doing this weird renaming and rearranging so that we can easily stick this together with the other sections.
remaining <- ttc_who[-c(1:(remaining_start -1)), ] %>%
mutate(who = x2,
x2 = NA)
who | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 |
---|---|---|---|---|---|---|---|---|---|
day/vist./other | NA | 6728 | 9130 | 8561 | 10033 | 11428 | 11929 | 10642 | 10605 |
blind/war amps | NA | 1086 | 1088 | 1086 | 1119 | 1109 | 1086 | 1060 | 1073 |
premium express | NA | 448 | 474 | 490 | 451 | 401 | 372 | 344 | 322 |
postal carriers | NA | NA | NA | NA | NA | NA | NA | NA | NA |
gta pass | NA | 4283 | 4855 | 5471 | 6087 | 5784 | 5388 | 5642 | 5667 |
Now, we can put the “who” back together and start rearranging, first by giving appropriate names to the columns.
the_who <- adult %>%
bind_rows(senior_student) %>%
bind_rows(children) %>%
bind_rows(remaining) %>%
select(who, everything(), -id)
colnames(the_who) <- c("who", "fare_type", 2017:1985)
who | fare_type | 2017 | 2016 | 2015 | 2014 | 2013 | 2012 | 2011 | 2010 |
---|---|---|---|---|---|---|---|---|---|
adult | tokens | 76106 | 102073 | 110945 | 111157 | 112360 | 117962 | 124748 | 120366 |
adult | tickets | NA | NA | NA | NA | NA | NA | NA | 1298 |
adult | two-fare | NA | NA | NA | NA | NA | NA | NA | NA |
adult | presto - single ride | 67829 | 27397 | 13323 | 9862 | 8194 | 4399 | 1139 | 0 |
adult | presto - srvm token ride | 1271 | 1157 | NA | NA | NA | NA | NA | NA |
adult | presto - srvm cash ride | 821 | 582 | NA | NA | NA | NA | NA | NA |
We’re getting close 🙏. This data is in a good place, but it’s in a wide format – each year’s data is in a different column. This will make it difficult to, for example, plot rides paid with tokens by year. The last step is to convert the data to a long format, so that we have one row for every observation (i.e., one row for every who/fare_type/year combination).
library(tidyr)
the_who <- the_who %>%
gather(key = year, value = n, -who, -fare_type) %>%
mutate(year = as.numeric(year),
n = as.numeric(n)) %>%
select(year, who, fare_type, n)
year | who | fare_type | n |
---|---|---|---|
2017 | adult | tokens | 76106 |
2017 | adult | tickets | NA |
2017 | adult | two-fare | NA |
2017 | adult | presto - single ride | 67829 |
2017 | adult | presto - srvm token ride | 1271 |
2017 | adult | presto - srvm cash ride | 821 |
Now we’re done with the “who”! The “where” and “when” follow very similar patterns, so I’ll go through them quickly.
ttc_where <- ttc[(where_start + 1):(when_start - 1), -1] %>%
filter(x2 != "rail") %>%
mutate(where = if_else(x2 == "bus", "bus", "rail")) %>%
select(where, x2, everything())
colnames(ttc_where) <- c("where", "method", 2017:1985)
ttc_where <- ttc_where %>%
gather(key = year, value = n, -where, -method) %>%
mutate(year = as.numeric(year),
n = as.numeric(n)) %>%
select(year, where, method, n)
year | where | method | n |
---|---|---|---|
2017 | bus | bus | 261113 |
2017 | rail | subway | 213012 |
2017 | rail | s.r.t. | 3177 |
2017 | rail | trolley coach | 0 |
2017 | rail | streetcar | 55914 |
2016 | bus | bus | 252899 |
ttc_when <- ttc[when_start:nrow(ttc), -1] %>%
select(-id)
colnames(ttc_when) = c("when", 2017:1985)
ttc_when <- ttc_when %>%
gather(key = year, value = n, -when) %>%
mutate(year = as.numeric(year),
n = as.numeric(n)) %>%
select(year, when, n)
year | when | n |
---|---|---|
2017 | weekday | 424155 |
2017 | weekend/holiday | 109061 |
2016 | weekday | 424117 |
2016 | weekend/holiday | 113962 |
2015 | weekday | 423808 |
2015 | weekend/holiday | 110197 |
Now that we have all three datasets in a tidy, analyzable format, we can explore.
Let’s start by looking at TTC trips over time, just for adults.
library(plotly)
the_who %>%
filter(who == "adult") %>%
group_by(year) %>%
summarise(trips = sum(n/1000, na.rm = TRUE)) %>%
plot_ly(x = ~year, y = ~trips) %>%
add_lines() %>%
layout(title = "Annual adult TTC trips",
yaxis = list(title = "TTC trips (millions)",
rangemode = "tozero"))
We can see immediately that there is a huge drop in adult TTC trips in 1990, dropping from a high of 367 million annual trips to 280 million in 1995. It wasn’t until 2007 that the adult trips reached what they’d been in 1990. I’m not from Toronto, so I did some digging.
I found the TTC’s 1991 annual report which explained that the decline was due to the recession at the time. This may seem obvious if you were around and cognizant then, but I was, ahem, born during this decline. Former premier Mike Harris was also elected in ’95 and cut provincial funding to the TTC. The increase in ridership during the late 90s seems mostly due to the controversial formation of the “Megacity”, the amalgamation of six former municipalities into one “City of Toronto”.
We can also look at what percentage of TTC trips start on the bus, subway, etc. The dataset measures the first point of payment, so we don’t actually see, for example, how many trips were on the bus since you can make a direct transfer a streetcar or subway to the bus.
ttc_where %>%
group_by(year) %>%
mutate(prop = n/sum(n)) %>%
ungroup() %>%
plot_ly(x = ~year, y = ~prop) %>%
add_lines(color = ~method) %>%
layout(title = "Percent of trips started by transit type",
yaxis = list(title = "", tickformat = ".1%"),
legend = list(orientation = "h"))
The S.R.T is an extension of the subway line in the east (in Scarborough) – it is a light metro system, and not counted in with the rest of the subway. This line is quite short (6 instead of the 69 stations on the main subway) and contains the least used station, which has no direct bus connection.
I’d be especially interested to see how this data changes for 2018 – in late 2017 we saw the subway extension of Line 1 into Vaughan, with 6 new stations including a stop at York Univesity, which is expected to carry 27,000 passengers a day by 2020 – previously, the primary route to the university was an express bus. The city is also experimenting with the King Street Pilot, intended to make travelling by streetcar less painful (i.e., less stuck in traffic with cars).
There’s a lot more exploration that can be done with this data; this is not intended to be exhaustive, or to shed new light on the operations of the TTC or the history of Toronto. I did learn a fair bit about both in the process (including that the TTC had trolley coaches from 1922 until 1993!) but this is perhaps an indication that I have a lot to learn about my new city and the transit that serves it!
This is also not intended to roast the way this data set has been put together by the TTC – the original data set is in a tabular format that is verbose, quite detailed, and intended for human reading. It is a great overview and analysis of the TTC over the last 32 years. My goal is to demonstrate how we can take that tabular data, and though it looks daunting, transform it into a machine readable format to allow for additional exploration.
👋