opendatatoronto 0.1.0 is on CRAN!
I’m beyond excited to announce that opendatatoronto is now released on CRAN! opendatatoronto is a package for searching and accessing data from the City of Toronto’s Open Data portal.
Toronto’s Open Data team is all about increasing access to civic data, both by releasing more and more data and by enabling people to easily retrieve it. They’ve partnered with folks to make plugins for Google Sheets and QGIS, and I worked with them to create opendatatoronto. Happy to say that this team is putting their money where their mouth is and investing 🤑 into open source.
The main goal of the package is to make it easy to get data. It can be a pain to find a data set on the portal, manually download it, move it to your analysis folder, figure out what R package to use to read it in (or how to read in multiple files efficiently!), etc. opendatatoronto enables you to skip a lot of those steps and read the data directly into R.
There’s a package website with a lot of details, but I’ll demonstrate some usage here, too!
Package usage
First, you can install the package directly from CRAN (!!!):
install.packages("opendatatoronto")
Let’s say I’m interested in the Apartment Buildings Registration data set (yes, I’m giving the TTC data a break).
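If I didn’t already know where this data set lives on the portal, searching by title would be one way to find it. Here’s a minimal sketch using search_packages() from opendatatoronto; the search term is just an illustration, and the call queries the portal, so it needs an internet connection and the results will depend on what’s published at the time:

```r
library(opendatatoronto)

# Search package titles for a keyword.
# "Apartment" is an illustrative search term, not a required value.
apartment_search <- search_packages("Apartment")

# Matching packages come back as a tibble, one row per package
apartment_search
```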
I can take the URL directly and show information about this data set… which, by the way, is something called a “package”. I know 😋. We can see information about the package, including topics it covers, an excerpt of information, how many resources (the actual data sets!) there are (along with their formats), how often it is refreshed, etc.
library(opendatatoronto)
library(dplyr)
apartment_building_package <- show_package("https://open.toronto.ca/dataset/apartment-building-registration/")
apartment_building_package %>%
  glimpse()
## Observations: 1
## Variables: 10
## $ title <chr> "Apartment Building Registration"
## $ id <chr> "2b98b3f3-4f3a-42a4-a4e9-b44d3026595a"
## $ topics <chr> "Business,City government,Development and infra…
## $ civic_issues <chr> NA
## $ excerpt <chr> "This dataset contains building information for…
## $ dataset_category <chr> "Table"
## $ num_resources <int> 1
## $ formats <chr> "CSV,JSON,XML"
## $ refresh_rate <chr> "Monthly"
## $ last_refreshed <date> 2019-10-07
To show the resources available in this package, we can take the package ID (we could keep using the URL, but this shows a different usage!) and list the resources:
apartment_building_package_id <- apartment_building_package[["id"]]
apartment_building_package_resources <- list_package_resources(apartment_building_package_id)
apartment_building_package_resources
## # A tibble: 1 x 4
## name id format last_modified
## <chr> <chr> <chr> <date>
## 1 Apartment Building Regist… 3ad76a8c-0518-4df2-b94e-… CSV NA
There’s only one resource available, and we can read it directly into R using get_resource:
apartment_buildings <- apartment_building_package_resources %>%
  get_resource()
apartment_buildings %>%
  glimpse()
## Observations: 3,450
## Variables: 70
## $ `_id` <int> 1, 2, 3, 4, 5, 6, 14, 7, 8…
## $ AIR_CONDITIONING_TYPE <chr> "NONE", "NONE", "NONE", "N…
## $ AMENITIES_AVAILABLE <chr> "Indoor recreation room , …
## $ BALCONIES <chr> "YES", "YES", "YES", "YES"…
## $ BARRIER_FREE_ACCESSIBILTY_ENTR <chr> "YES", "YES", "YES", "YES"…
## $ BIKE_PARKING <chr> "Not Available", "Not Avai…
## $ EXTERIOR_FIRE_ESCAPE <chr> "NO", "NO", "NO", "NO", "N…
## $ `FACILITIES_AVAILABLE?` <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ FIRE_ALARM <chr> "YES", "YES", "YES", "YES"…
## $ GARBAGE_CHUTES <chr> "YES", "YES", "YES", "YES"…
## $ HEATING_TYPE <chr> "HOT WATER", "HOT WATER", …
## $ INTERCOM <chr> "YES", "YES", "YES", "YES"…
## $ `IS_THERE_A_COOLING_ROOM?` <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ `IS_THERE_EMERGENCY_POWER?` <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ LAUNDRY_ROOM <chr> "YES", "YES", "YES", "YES"…
## $ LOCKER_OR_STORAGE_ROOM <chr> "YES", "YES", "YES", "YES"…
## $ NO_BARRIERFREE_ACCESSBLE_UNITS <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ NO_OF_ACCESSIBLEPARKING_SPACES <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ NO_OF_ELEVATORS <chr> "2", "2", "4", "2", "0", "…
## $ NO_OF_STOREYS <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ NO_OF_UNITS <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ `NON-SMOKING_BUILDING` <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ PARKING_TYPE <chr> "Underground Garage , Gara…
## $ PETS_ALLOWED <chr> "YES", "YES", "YES", "YES"…
## $ PROP_MANAGEMENT_COMPANY_NAME <chr> NA, NA, "GREENWIN", "TCH",…
## $ PROPERTY_TYPE <chr> "TCHC", "TCHC", "TCHC", "T…
## $ RSN <int> 4155584, 4155588, 4155594,…
## $ SEPARATE_GAS_METERS_EACH_UNIT <chr> "NO", "NO", "NO", "NO", "N…
## $ SEPARATE_HYDRO_METER_EACH_UNIT <chr> "NO", "NO", "NO", "NO", "Y…
## $ SEPARATE_WATER_METERS_EA_UNIT <chr> "NO", "NO", "NO", "NO", "N…
## $ SITE_ADDRESS <chr> "365 BAY MILLS BLVD ", "8…
## $ SPRINKLER_SYSTEM <chr> "YES", "YES", "YES", "YES"…
## $ VISITOR_PARKING <chr> "BOTH", "BOTH", "BOTH", "P…
## $ WARD <chr> "22", "04", "11", "12", "1…
## $ WINDOW_TYPE <chr> "SINGLE PANE", "DOUBLE PAN…
## $ YEAR_BUILT <chr> "1970", "1974", "1968", "1…
## $ YEAR_REGISTERED <chr> "2017", "2017", "2017", "2…
## $ HEATING_EQUIPMENT_STATUS <chr> "ORIGINAL", "ORIGINAL", "O…
## $ DATE_OF_LAST_INSPECTION_BY_TSSA <chr> "DEC 5, 2016", "AUG 18, 20…
## $ NON_SMOKING_BUILDING <chr> "YES", "YES", "YES", "YES"…
## $ FACILITIES_AVAILABLE <chr> "Green Bin / Organics", "G…
## $ PCODE <chr> "M1T", "M6K", "M5R", "M4S"…
## $ SPRINKLER_SYSTEM_TEST_RECORD <chr> "YES", "YES", "YES", "YES"…
## $ IS_THERE_A_COOLING_ROOM <chr> "NO", "NO", "NO", "NO", "N…
## $ ELEVATOR_STATUS <chr> "ORIGINAL", "REPLACED", "R…
## $ ELEVATOR_PARTS_REPLACED <chr> NA, NA, NA, "FULL REPLACEM…
## $ DESCRIPTION_OF_INDOOR_EXERCISE_ROOM <chr> NA, NA, NA, NA, NA, NA, NA…
## $ SPRINKLER_SYSTEM_YEAR_INSTALLED <int> 1970, 1969, 1968, 1977, NA…
## $ CONFIRMED_UNITS <int> 184, 57, 460, 156, 32, 175…
## $ YEAR_OF_REPLACEMENT <chr> NA, "2003", "2017", "2009"…
## $ OUTDOOR_GARBAGE_STORAGE_AREA <chr> "YES", "YES", "YES", "YES"…
## $ NO_OF_LAUNDRY_ROOM_MACHINES <int> 16, 16, 28, 13, 8, 18, 2, …
## $ ANNUAL_FIRE_PUMP_FLOW_TEST_RECORDS <chr> "YES", "YES", "YES", "YES"…
## $ IS_THERE_EMERGENCY_POWER <chr> "YES", "YES", "YES", "YES"…
## $ ANNUAL_FIRE_ALARM_TEST_RECORDS <chr> "YES", "YES", "YES", "YES"…
## $ HEATING_EQUIPMENT_YEAR_INSTALLED <int> 1970, 1974, 1968, 1977, 20…
## $ APPROVED_FIRE_SAFETY_PLAN <chr> "YES", "YES", "YES", "YES"…
## $ CONFIRMED_STOREYS <int> 13, 7, 25, 16, 4, 14, 4, 7…
## $ LAUNDRY_ROOM_LOCATION <chr> "TBD", "TBD", "TBD", "MAIN…
## $ LAUNDRY_ROOM_HOURS_OF_OPERATION <chr> "8:00AM TO 8:00PM", "8:00A…
## $ EMERG_POWER_SUPPLY_TEST_RECORDS <chr> "YES", "YES", "YES", "YES"…
## $ TSSA_TEST_RECORDS <chr> "YES", "YES", "YES", "YES"…
## $ DESCRIPTION_OF_OUTDOOR_REC_FACILITIES <chr> NA, NA, NA, NA, NA, NA, NA…
## $ GREEN_BIN_LOCATION <chr> "REAR OF BUILDING", "REAR …
## $ PET_RESTRICTIONS <chr> NA, NA, NA, NA, NA, NA, NA…
## $ DESCRIPTION_OF_CHILD_PLAY_AREA <chr> NA, NA, NA, NA, NA, NA, NA…
## $ INDOOR_GARBAGE_STORAGE_AREA <chr> "YES", "YES", "YES", "YES"…
## $ RECYCLING_BINS_LOCATION <chr> "REAR OF BUILDING", "REAR …
## $ NO_OF_ACCESSIBLE_PARKING_SPACES <int> 4, 4, 4, 4, 0, 2, 0, 2, 2,…
## $ NO_BARRIER_FREE_ACCESSBLE_UNITS <int> 1, 1, 1, 4, 0, 0, 0, 0, 0,…
Let’s say I want to see what amenities Toronto apartments have. I know, for instance, that my apartment has a big ol’ NA for amenities in this data set (and for a lot of the other characteristics 😬).
First, let’s split up the amenities_available variable, which is in the format “amenity_1 , amenity_2 , …”, i.e., with amenities separated by " , ". I’m also counting the number of buildings and carrying it along for the ride so that I can calculate a percentage later on:
library(janitor)
library(tidyr)
amenities <- apartment_buildings %>%
  clean_names() %>%
  select(id, amenities_available) %>%
  mutate(n_buildings = n_distinct(id)) %>%
  separate_rows(amenities_available, sep = " , ")
amenities
## # A tibble: 4,202 x 3
## id amenities_available n_buildings
## <int> <chr> <int>
## 1 1 Indoor recreation room 3450
## 2 1 Outdoor rec facilities 3450
## 3 1 Child play area 3450
## 4 2 Indoor recreation room 3450
## 5 3 Indoor recreation room 3450
## 6 4 <NA> 3450
## 7 5 <NA> 3450
## 8 6 <NA> 3450
## 9 14 <NA> 3450
## 10 7 Indoor exercise room 3450
## # … with 4,192 more rows
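To make the splitting step a bit more concrete, here’s a minimal sketch on a made-up three-building tibble. The toy_buildings data is hypothetical (not from the portal); it just mimics the " , " separated format and includes one building with NA amenities:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data mimicking the "amenity_1 , amenity_2" format
toy_buildings <- tibble(
  id = 1:3,
  amenities_available = c("Indoor pool , Sauna", "Sauna", NA)
)

toy_amenities <- toy_buildings %>%
  # Count buildings *before* splitting, while there is still one row each
  mutate(n_buildings = n_distinct(id)) %>%
  # One row per building-amenity pair; the NA building keeps a single NA row
  separate_rows(amenities_available, sep = " , ")

toy_amenities
```

Building 1 turns into two rows (one per amenity), while the building with NA amenities stays as a single row, which is why the real result above has more rows than there are buildings.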
Next, I count how many times each amenity appears (i.e., how many buildings have that amenity), replace NAs with “None” (for more clarity + no disappearance in plotting down the road), and calculate what percent of buildings have each amenity (or what percent have none 😅):
amenities <- amenities %>%
  count(amenities_available, n_buildings) %>%
  mutate(
    amenities_available = ifelse(is.na(amenities_available), "None", amenities_available),
    prop = n / n_buildings
  )
amenities
## # A tibble: 8 x 4
## amenities_available n_buildings n prop
## <chr> <int> <int> <dbl>
## 1 Child play area 3450 260 0.0754
## 2 Indoor exercise room 3450 240 0.0696
## 3 Indoor pool 3450 109 0.0316
## 4 Indoor recreation room 3450 522 0.151
## 5 Outdoor pool 3450 200 0.0580
## 6 Outdoor rec facilities 3450 230 0.0667
## 7 Sauna 3450 126 0.0365
## 8 None 3450 2515 0.729
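One thing worth noticing in that table: the proportions add up to more than 1, because a single building can list several amenities. Here’s a small sketch of the same count-and-divide step on made-up data (toy_amenities is hypothetical). It also swaps in tidyr’s replace_na() for the ifelse() call, which is just an alternative way to do the same replacement, not what the code above uses:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: 3 buildings, already split to one row per amenity
toy_amenities <- tibble(
  amenities_available = c("Indoor pool", "Sauna", "Sauna", NA),
  n_buildings = 3L
)

toy_summary <- toy_amenities %>%
  # n = number of buildings listing each amenity (NA = none listed)
  count(amenities_available, n_buildings) %>%
  mutate(
    # Alternative to ifelse(is.na(...), "None", ...)
    amenities_available = replace_na(amenities_available, "None"),
    prop = n / n_buildings
  )

toy_summary

# Proportions can sum to more than 1 when buildings have multiple amenities
sum(toy_summary$prop)
```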
And finally, I visualize the results:
library(ggplot2)
amenities %>%
  ggplot(aes(x = reorder(amenities_available, -prop), y = prop)) +
  geom_col() +
  scale_y_continuous(labels = scales::percent) +
  labs(
    x = "",
    y = "",
    title = "Amenities Available in Toronto Apartment Buildings",
    subtitle = "Percent of rental buildings with three or more storeys and ten or more units",
    caption = "Source: Apartment Building Registration (Toronto Open Data)"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title.position = "plot"
  )
Seems I’m not alone having no amenities!
There’s a lot of data available in the portal, and I’d encourage you to check it out if you’re interested! We don’t have squirrels, and maybe we’ll end up with some raccoons, but there’s a lot on there. If you encounter any issues using the package, please let me know.
On creating my first package/sentimental biznass
This is my first “real” R package and I learned a ton in the process. I’ve been talking (complaining?) a lot about things like Travis and test coverage over the last few months. I’m not going to give any of these topics a full treatment right now, but here are some resources:
- usethis workflow for package development
- A Beginner’s guide to Travis-CI for R
- R packages book
- R Package Primer
I will say that packages like usethis, devtools, and testthat will be your best friends in this process. There’s been a lot of hard work put into those packages to make creating your own package as smooth and seamless as possible, from inception all the way to CRAN submission. As I mentioned in my post on the RStudio Conference Diversity Scholarship, taking part in the Tidy Tools workshop there was also really instrumental in getting me to think about how to write a package, from creating a clear, consistent API to obsessively writing tests.
I’d be remiss not to mention that the heavy lifting in opendatatoronto is done by the ckanr package, created by Scott Chamberlain. A very cool part of creating opendatatoronto was that I could identify things I wanted to change or update in ckanr and I was able to do it and contribute back to that package, too! Thanks to Scott for having me play a big part in the ckanr 0.4.0 release. Now I’m an author on two CRAN packages 😭
Thanks to everyone who helped me along the way, whether it be through sharing their own package code on GitHub so I could shamelessly borrow (copy) their approaches or by answering questions I asked directly. This is my first go, but it won’t be my last 💪