comparing two data frames: one #rstats, many ways!
today i tweeted asking about packages to compare two data frames (specifically, their variable names and types):
anyone aware of an #rstats package that will compare two data frames' names/types and output nice, descriptive errors if they don't match? do i have to build this myself?
— Sharla Gelfand (@sharlagelfand) February 17, 2020
judging by the ratio (18 replies to 38 likes!) at the time of writing, yes there are definitely packages to do this! in this post, i want to explore the (many) different and excellent methods for comparing two data frames in #rstats.
the data
first, what are we talking about? imagine i have some data coming in, and i know what it should look like. it should look like iris
! 🌹
iris
## # A tibble: 150 x 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## # … with 145 more rows
so, iris
has 5 columns: Sepal.Length, Sepal.Width, Petal.Length, and Species. the column types are: numeric/double (times 4!) and factor.
the data i’m bringing in could have a number of problems.
it could be missing a column:
library(dplyr)
iris_missing_cols <- iris %>%
select(-Sepal.Length)
iris_missing_cols
## # A tibble: 150 x 4
## Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <fct>
## 1 3.5 1.4 0.2 setosa
## 2 3 1.4 0.2 setosa
## 3 3.2 1.3 0.2 setosa
## 4 3.1 1.5 0.2 setosa
## 5 3.6 1.4 0.2 setosa
## # … with 145 more rows
it could have an extra column:
iris_extra_cols <- iris %>%
mutate(Specieso = Species)
iris_extra_cols
## # A tibble: 150 x 6
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Specieso
## <dbl> <dbl> <dbl> <dbl> <fct> <fct>
## 1 5.1 3.5 1.4 0.2 setosa setosa
## 2 4.9 3 1.4 0.2 setosa setosa
## 3 4.7 3.2 1.3 0.2 setosa setosa
## 4 4.6 3.1 1.5 0.2 setosa setosa
## 5 5 3.6 1.4 0.2 setosa setosa
## # … with 145 more rows
it could be missing a column and have an extra column:
iris_missing_extra_cols <- iris %>%
select(-Sepal.Length) %>%
mutate(Specieso = Species)
iris_missing_extra_cols
## # A tibble: 150 x 5
## Sepal.Width Petal.Length Petal.Width Species Specieso
## <dbl> <dbl> <dbl> <fct> <fct>
## 1 3.5 1.4 0.2 setosa setosa
## 2 3 1.4 0.2 setosa setosa
## 3 3.2 1.3 0.2 setosa setosa
## 4 3.1 1.5 0.2 setosa setosa
## 5 3.6 1.4 0.2 setosa setosa
## # … with 145 more rows
it could have a column of the wrong class:
iris_wrong_class <- iris %>%
mutate(Species = as.character(Species))
iris_wrong_class
## # A tibble: 150 x 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## # … with 145 more rows
or it could have all three!
iris_missing_extra_cols_wrong_class <- iris_missing_extra_cols %>%
mutate(Species = as.character(Species))
iris_missing_extra_cols_wrong_class
## # A tibble: 150 x 5
## Sepal.Width Petal.Length Petal.Width Species Specieso
## <dbl> <dbl> <dbl> <chr> <fct>
## 1 3.5 1.4 0.2 setosa setosa
## 2 3 1.4 0.2 setosa setosa
## 3 3.2 1.3 0.2 setosa setosa
## 4 3.1 1.5 0.2 setosa setosa
## 5 3.6 1.4 0.2 setosa setosa
## # … with 145 more rows
one last thing is not a problem, but i’m curious to see if the packages i explore will think it is. this is when the column order is different:
iris_diff_order <- iris %>%
select(Species, everything())
iris_diff_order
## # A tibble: 150 x 5
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 setosa 5.1 3.5 1.4 0.2
## 2 setosa 4.9 3 1.4 0.2
## 3 setosa 4.7 3.2 1.3 0.2
## 4 setosa 4.6 3.1 1.5 0.2
## 5 setosa 5 3.6 1.4 0.2
## # … with 145 more rows
what i dream of ☁️
in the tweet i mentioned i’m looking for packages that will provide me with nice, descriptive errors (or robust data to create my own) describing what’s different between the data sets.
what does this look like? two main things.
saying what columns are missing (and what’s optionally, what’s extra).
for
iris
vsiris_missing_cols
, i want to know thatiris_missing_cols
is missing a column called Sepal.Length.for
iris
vsiris_extra_cols
, i want to know thatiris_extra_cols
has an extra column called Specieso.for
iris
vsiris_missing_extra_cols
, i want to know thatiris_missing_extra_cols
is missing Sepal.Length and has an extra column Specieso.saying what classes are different.
for
iris
vsiris_wrong_class
, i want to know that Species should be a factor, but it’s a character.
ok, putting them together, three things.
saying what columns are missing, what’s extra, and what classes are different.
for
iris
vsiris_missing_extra_cols_wrong_class
, i want to know thatiris_missing_extra_cols_wrong_class
is missing Sepal.Length, has an extra column Specieso, and that Species should be a factor but instead is a character.
finally, if the data frames are the same, it would be nice to have a function that returns TRUE or just the original data.
so! what can i use to get this?
the functions
dplyr::all_equal()
let’s check out some comparisons using dplyr::all_equal(), suggested by Dilsher Dhillon!
first, what happens if the data frames are the same?
all_equal(iris, iris)
## [1] TRUE
nice
if they are the same, but a different order?
all_equal(iris, iris_diff_order)
## [1] TRUE
wena
if there are missing and extra columns?
all_equal(iris, iris_missing_extra_cols)
## [1] "Cols in y but not x: `Specieso`. "
## [2] "Cols in x but not y: `Sepal.Length`. "
if there’s an incorrect variable class?
all_equal(iris, iris_wrong_class)
## [1] "Incompatible type for column `Species`: x factor, y character"
what about wrong columns and the wrong class?
all_equal(iris, iris_missing_extra_cols_wrong_class)
## [1] "Cols in y but not x: `Specieso`. "
## [2] "Cols in x but not y: `Sepal.Length`. "
looks like however dplyr::all_equal()
works, reporting on the missing/extra columns is prioritized over reporting on the incorrect variable class. this is not great but not a huge deal to me - the function could report on the wrong columns, they get fixed, and then it reports on the wrong class.
overall, i do like the nice printing, e.g. “cols in y but not in x”, and these could probably be str_replace-ified in a function to supply the actual data frame names (e.g. iris vs iris_wrong_class).
i’m running the version of dplyr on CRAN - i had a dev version installed earlier and some of the messages looked different. if these will be changing (that’s ok!) i wouldn’t want to rely on altering the string output for creating my own errors, since the strings could change, and i don’t necessarily want to require dplyr 1.0.0 in a package just for all_equal strings :~)
janitor::compare_df_cols()
i consider myself a big fan of janitor but there’s always new functions in there that surprise me! thank you to Rebecca Janis for mentioning this (and sharing her own function for comparing column classes)!
let’s go in the same order. what happens if the data frames are the same?
library(janitor)
compare_df_cols(iris, iris)
## column_name iris.x iris.y
## 1 Petal.Length numeric numeric
## 2 Petal.Width numeric numeric
## 3 Sepal.Length numeric numeric
## 4 Sepal.Width numeric numeric
## 5 Species factor factor
it just returns a comparison of the columns () (what’s in both data frames, and their classes in each).
you can set an option only to return things that don’t match (or things that do):
compare_df_cols(iris, iris, return = "mismatch")
## [1] column_name iris.x iris.y
## <0 rows> (or 0-length row.names)
(in this case, nothing).
i assume that this will be robust to different order of columns, but let’s double check:
compare_df_cols(iris, iris_diff_order)
## column_name iris iris_diff_order
## 1 Petal.Length numeric numeric
## 2 Petal.Width numeric numeric
## 3 Sepal.Length numeric numeric
## 4 Sepal.Width numeric numeric
## 5 Species factor factor
cool
what about missing or extra columns?
compare_df_cols(iris, iris_missing_extra_cols)
## column_name iris iris_missing_extra_cols
## 1 Petal.Length numeric numeric
## 2 Petal.Width numeric numeric
## 3 Sepal.Length numeric <NA>
## 4 Sepal.Width numeric numeric
## 5 Species factor factor
## 6 Specieso <NA> factor
i like this because it shows all columns in both, and what their types are, by data frame (NA if the column isn’t in that df)
strangely enough, a “mismatch” here is a mismatch in class, and NA does not count as a mismatch:
compare_df_cols(iris, iris_missing_extra_cols, return = "mismatch")
## [1] column_name iris iris_missing_extra_cols
## <0 rows> (or 0-length row.names)
and does count as a match 🙃
compare_df_cols(iris, iris_missing_extra_cols, return = "match")
## column_name iris iris_missing_extra_cols
## 1 Petal.Length numeric numeric
## 2 Petal.Width numeric numeric
## 3 Sepal.Length numeric <NA>
## 4 Sepal.Width numeric numeric
## 5 Species factor factor
## 6 Specieso <NA> factor
what about if the variable classes don’t match?
compare_df_cols(iris, iris_wrong_class, return = "mismatch")
## column_name iris iris_wrong_class
## 1 Species factor character
i like this because you can easily construct your own message from this structure, using glue:
library(glue)
compare_df_cols(iris, iris_wrong_class, return = "mismatch") %>%
mutate(message = glue("Column `{column_name}`: expected class {iris}, got class {iris_wrong_class}")) %>%
pull(message)
## Column `Species`: expected class factor, got class character
and the resulting data frame can be filtered manually when the filters from “mismatch” aren’t what i want, to see all differences:
compare_df_cols(iris, iris_missing_extra_cols_wrong_class) %>%
filter(is.na(iris) | is.na(iris_missing_extra_cols_wrong_class) |
iris != iris_missing_extra_cols_wrong_class)
## column_name iris iris_missing_extra_cols_wrong_class
## 1 Sepal.Length numeric <NA>
## 2 Species factor character
## 3 Specieso <NA> factor
vetr::alike()
next up is vetr::alike()
! this is a new package to me, excited to see what it can do 🔍! thanks to Brodie G for suggesting (and creating!) it.
again, let’s start off checking on data frames that match:
library(vetr)
alike(iris, iris)
## [1] TRUE
cool, what about different order of columns?
alike(iris, iris_diff_order)
## [1] "`names(iris_diff_order)[1]` should be \"Sepal.Length\" (is \"Species\")"
uh oh
what about a data frame with missing and extra columns?
alike(iris, iris_missing_extra_cols)
## [1] "`names(iris_missing_extra_cols)[1]` should be \"Sepal.Length\" (is \"Sepal.Width\")"
this only tells me that one of the column names is wrong (the first one), and not what’s extra.
it can identify when there’s extra columns:
alike(iris, iris_extra_cols)
## [1] "`iris_extra_cols` should have 5 columns (has 6)"
but doesn’t say what they are. similarly to dplyr::all_equal()
, it seems like it prioritizes one kind of reporting (an incorrect column name) and doesn’t indicate other issues.
let’s look at an incorrect class:
alike(iris, iris_wrong_class)
## [1] "`iris_wrong_class$Species` should be class \"factor\" (is \"character\")"
i like this format - says what it should be, and what it actually is.
testthat::expect_equal()
this one is my idea 😇 because i love testthat
when the data frames are the same, expect_equal()
returns:
library(testthat)
expect_equal(iris, iris)
nothing. well actually, it returns the data frame invisibly:
x <- expect_equal(iris, iris)
x
## # A tibble: 150 x 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## # … with 145 more rows
but doesn’t print it automatically, probably since it’s usually used within a test_that()
call.
when there are missing/extra columns:
expect_equal(iris, iris_missing_extra_cols)
## Error: `iris` not equal to `iris_missing_extra_cols`.
## Cols in y but not x: `Specieso`.
## Cols in x but not y: `Sepal.Length`.
the reporting is very similar (the same?) as from dplyr::all_equal()
🤔 are these the same functions?
expect_equal(iris, iris_wrong_class)
## Error: `iris` not equal to `iris_wrong_class`.
## Incompatible type for column `Species`: x factor, y character
ok yes, the same as dplyr::all_equal()
but returning an error instead of a string. it probably just calls all_equal()
. maybe i shouldn’t have bragged that this one was my idea 😂
diffdf::diffdf()
Marianna Foos swept in while i was writing this post and suggested diffdf!
library(diffdf)
diffdf(iris, iris)
## No issues were found!
thank you, diffdf!
diffdf(iris, iris_diff_order)
## No issues were found!
i agree!
diffdf(iris, iris_missing_extra_cols)
## Warning in diffdf(iris, iris_missing_extra_cols):
## There are columns in BASE that are not in COMPARE !!
## There are columns in COMPARE that are not in BASE !!
## Differences found between the objects!
##
## A summary is given below.
##
## There are columns in BASE that are not in COMPARE !!
## All rows are shown in table below
##
## ==============
## COLUMNS
## --------------
## Sepal.Length
## --------------
##
## There are columns in COMPARE that are not in BASE !!
## All rows are shown in table below
##
## ==========
## COLUMNS
## ----------
## Specieso
## ----------
shit ok! we got DIFFERENCES !!
diffdf(iris, iris_wrong_class)
## Warning in diffdf(iris, iris_wrong_class):
## There are columns in BASE and COMPARE with different modes !!
## There are columns in BASE and COMPARE with different classes !!
## Differences found between the objects!
##
## A summary is given below.
##
## There are columns in BASE and COMPARE with different modes !!
## All rows are shown in table below
##
## ================================
## VARIABLE MODE.BASE MODE.COMP
## --------------------------------
## Species numeric character
## --------------------------------
##
## There are columns in BASE and COMPARE with different classes !!
## All rows are shown in table below
##
## ==================================
## VARIABLE CLASS.BASE CLASS.COMP
## ----------------------------------
## Species factor character
## ----------------------------------
i don’t know what a mode is (something related to factors, i guess!) but i like that it reports the different classes in Species.
this function also returns a list of data frames with issues invisibly (i suppressed warnings because it was yelling ok)
x <- diffdf(iris, iris_missing_extra_cols_wrong_class, suppress_warnings = TRUE)
x[["VarClassDiffs"]]
## VARIABLE CLASS.BASE CLASS.COMP
## 6 Species factor character
x[["ExtColsBase"]]
## # A tibble: 1 x 1
## COLUMNS
## * <chr>
## 1 Sepal.Length
x[["ExtColsComp"]]
## # A tibble: 1 x 1
## COLUMNS
## * <chr>
## 1 Specieso
this is similar to the kind of information you get from janitor::compare_df_cols()
, and i like it!
comparing the functions
i have a bunch of functions to compare, on a bunch of criteria!
- iris == iris 😋
- iris in a different order is still iris
- detects missing columns
- detects extra columns
- detects missing and extra columns
- detects difference in class
- detects different columns and difference in class
- nice strings to use for messages
- returns data (that i can use to BYOmessages)
i’m not going to include testthat::test_equal()
in the comparison, since it’s so similar (read: basically the same) to dplyr::all_equal()
dplyr::all_equal() | janitor::compare_df_cols() | vetr::alike() | diffdf::diffdf() | |
---|---|---|---|---|
iris is iris | ✅ | ✅ | ✅ | ✅ |
column swapped iris is iris | ✅ | ✅ | ❌ | ✅ |
missing columns | ✅ | ✅ | ✅ | ✅ |
extra columns | ✅ | ✅ | ✅ | ✅ |
missing and extra columns | ✅ | ✅ | ❌ | ✅ |
difference in class | ✅ | ✅ | ✅ | ✅ |
different columns and classes | ❌ | ✅ | ❌ | ✅ |
nice strings to use for messages | ✅ | ❌ | ✅❌ | ❌ |
returns data on differences | ❌ | ✅ | ❌ | ✅ |
what to do what to do
ok, so what did i end up doing? combining things, of course!
i loved learning about new packages and seeing what’s available in packages i already know. i am inclined to use stuff from dplyr and janitor because they’re often already dependencies in my work.
so, i decided to cobble something together using the data available from janitor::compare_df_cols()
, and the basic check of dplyr::all_equal()
as a jumping off point, and i borrowed some of the message/error/etc wording from the packages i explored. i also have my function return the original data if all checks pass so that it can easily be used in a pipeline!
i won’t belabour you with the source code, but here’s how it works:
sharla_diff(iris, iris)
## # A tibble: 150 x 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## # … with 145 more rows
sharla_diff(iris, iris_diff_order)
## # A tibble: 150 x 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## # … with 145 more rows
sharla_diff(iris, iris_missing_extra_cols)
## Warning: `iris` contains extra column(s): Sepal.Length.
## Error: `iris` is missing column(s): Specieso.
sharla_diff(iris, iris_wrong_class)
## Error: `iris` column types are not as expected. These columns have issues:
## Species should be class character, is factor
sharla_diff(iris, iris_missing_extra_cols_wrong_class)
## Warning: `iris` contains extra column(s): Sepal.Length.
## Error: `iris` is missing column(s): Specieso.
sharla_diff()
errors when it’s missing columns, so it doesn’t have the chance to look for the wrong class in iris_missing_extra_cols_wrong_class
(like dplyr::all_equal()
!). as i mentioned, that’s not a huge deal for me.
i’m sure this function isn’t perfect and will be changed as i see issues with it. that’s it for now! please see the honourable mentions appendix if you think i snubbed your package ok!!!
honourable mentions
packages for comparing values
some people suggested packages for comparing the actual values within a data frame instead of the data frame name/classes. that’s not what i was looking for, but helpful anyways!
if you want to do that:
visdat
in general is such an awesome package for visually understanding your data. Nick Tierney is its creator (and suggester!), specifically suggesting visdat::vis_compare() for visually comparing values.
dataCompareR::rCompare() can be used for comparing values very robustly. thank you to Robert Campbell for the suggestion!
a little unrelated, but diffobj looks very cool for visualizing diffs of objects (duh, hence the name) within your console. ty to Tokhir Dadaev and Brodie Gaslam for mentioning (and Brodie for creating - a double for both in this post!).
daff::diff_data() and arsenal::comparedf()
ok, thank you and no offence to everyone (Erin Grand, Antoine Fabri, Eric Polley, Benjamin Gowan) that suggested these!!! daff::diff_data() doesn’t render in rmarkdown and arsenal::comparedf() is a little too robust for what i’m looking for.
either way, both methods provide more information than i want and don’t really lend themselves to easy creation of errors or warnings. if you are looking for a full report (tm) of what is different between two data frames, these might be for you!
vctrs::vec_assert()
according to Hadley, vctrs::vec_assert()
might one day create the kind of messages i want about classes (e.g. expected Date, got character). i look forward to it (and to learning vctrs someday)!
ok that’s really it! bye!