comparing two data frames: one #rstats, many ways!

today i tweeted asking about packages to compare two data frames (specifically, their variable names and types):

judging by the ratio (18 replies to 38 likes!) at the time of writing, yes there are definitely packages to do this! in this post, i want to explore the (many) different and excellent methods for comparing two data frames in #rstats.

the data

first, what are we talking about? imagine i have some data coming in, and i know what it should look like. it should look like iris! 🌹

iris
## # A tibble: 150 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
## 1          5.1         3.5          1.4         0.2 setosa 
## 2          4.9         3            1.4         0.2 setosa 
## 3          4.7         3.2          1.3         0.2 setosa 
## 4          4.6         3.1          1.5         0.2 setosa 
## 5          5           3.6          1.4         0.2 setosa 
## # … with 145 more rows

so, iris has 5 columns: Sepal.Length, Sepal.Width, Petal.Length, and Species. the column types are: numeric/double (times 4!) and factor.

the data i’m bringing in could have a number of problems.

it could be missing a column:

library(dplyr)

iris_missing_cols <- iris %>%
  select(-Sepal.Length)

iris_missing_cols
## # A tibble: 150 x 4
##   Sepal.Width Petal.Length Petal.Width Species
##         <dbl>        <dbl>       <dbl> <fct>  
## 1         3.5          1.4         0.2 setosa 
## 2         3            1.4         0.2 setosa 
## 3         3.2          1.3         0.2 setosa 
## 4         3.1          1.5         0.2 setosa 
## 5         3.6          1.4         0.2 setosa 
## # … with 145 more rows

it could have an extra column:

iris_extra_cols <- iris %>%
  mutate(Specieso = Species)

iris_extra_cols
## # A tibble: 150 x 6
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Specieso
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>   <fct>   
## 1          5.1         3.5          1.4         0.2 setosa  setosa  
## 2          4.9         3            1.4         0.2 setosa  setosa  
## 3          4.7         3.2          1.3         0.2 setosa  setosa  
## 4          4.6         3.1          1.5         0.2 setosa  setosa  
## 5          5           3.6          1.4         0.2 setosa  setosa  
## # … with 145 more rows

it could be missing a column and have an extra column:

iris_missing_extra_cols <- iris %>%
  select(-Sepal.Length) %>%
  mutate(Specieso = Species)

iris_missing_extra_cols
## # A tibble: 150 x 5
##   Sepal.Width Petal.Length Petal.Width Species Specieso
##         <dbl>        <dbl>       <dbl> <fct>   <fct>   
## 1         3.5          1.4         0.2 setosa  setosa  
## 2         3            1.4         0.2 setosa  setosa  
## 3         3.2          1.3         0.2 setosa  setosa  
## 4         3.1          1.5         0.2 setosa  setosa  
## 5         3.6          1.4         0.2 setosa  setosa  
## # … with 145 more rows

it could have a column of the wrong class:

iris_wrong_class <- iris %>%
  mutate(Species = as.character(Species))

iris_wrong_class
## # A tibble: 150 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
## 1          5.1         3.5          1.4         0.2 setosa 
## 2          4.9         3            1.4         0.2 setosa 
## 3          4.7         3.2          1.3         0.2 setosa 
## 4          4.6         3.1          1.5         0.2 setosa 
## 5          5           3.6          1.4         0.2 setosa 
## # … with 145 more rows

or it could have all three!

iris_missing_extra_cols_wrong_class <- iris_missing_extra_cols %>%
  mutate(Species = as.character(Species))

iris_missing_extra_cols_wrong_class
## # A tibble: 150 x 5
##   Sepal.Width Petal.Length Petal.Width Species Specieso
##         <dbl>        <dbl>       <dbl> <chr>   <fct>   
## 1         3.5          1.4         0.2 setosa  setosa  
## 2         3            1.4         0.2 setosa  setosa  
## 3         3.2          1.3         0.2 setosa  setosa  
## 4         3.1          1.5         0.2 setosa  setosa  
## 5         3.6          1.4         0.2 setosa  setosa  
## # … with 145 more rows

one last thing is not a problem, but i’m curious to see if the packages i explore will think it is. this is when the column order is different:

iris_diff_order <- iris %>%
  select(Species, everything())

iris_diff_order
## # A tibble: 150 x 5
##   Species Sepal.Length Sepal.Width Petal.Length Petal.Width
##   <fct>          <dbl>       <dbl>        <dbl>       <dbl>
## 1 setosa           5.1         3.5          1.4         0.2
## 2 setosa           4.9         3            1.4         0.2
## 3 setosa           4.7         3.2          1.3         0.2
## 4 setosa           4.6         3.1          1.5         0.2
## 5 setosa           5           3.6          1.4         0.2
## # … with 145 more rows

what i dream of ☁️

in the tweet i mentioned i’m looking for packages that will provide me with nice, descriptive errors (or robust data to create my own) describing what’s different between the data sets.

what does this look like? two main things.

  1. saying what columns are missing (and what’s optionally, what’s extra).

    for iris vs iris_missing_cols, i want to know that iris_missing_cols is missing a column called Sepal.Length.

    for iris vs iris_extra_cols, i want to know that iris_extra_cols has an extra column called Specieso.

    for iris vs iris_missing_extra_cols, i want to know that iris_missing_extra_cols is missing Sepal.Length and has an extra column Specieso.

  2. saying what classes are different.

    for iris vs iris_wrong_class, i want to know that Species should be a factor, but it’s a character.

ok, putting them together, three things.

  1. saying what columns are missing, what’s extra, and what classes are different.

    for iris vs iris_missing_extra_cols_wrong_class, i want to know that iris_missing_extra_cols_wrong_class is missing Sepal.Length, has an extra column Specieso, and that Species should be a factor but instead is a character.

finally, if the data frames are the same, it would be nice to have a function that returns TRUE or just the original data.

so! what can i use to get this?

the functions

dplyr::all_equal()

let’s check out some comparisons using dplyr::all_equal(), suggested by Dilsher Dhillon!

first, what happens if the data frames are the same?

all_equal(iris, iris)
## [1] TRUE

nice

if they are the same, but a different order?

all_equal(iris, iris_diff_order)
## [1] TRUE

wena

if there are missing and extra columns?

all_equal(iris, iris_missing_extra_cols)
## [1] "Cols in y but not x: `Specieso`. "    
## [2] "Cols in x but not y: `Sepal.Length`. "

if there’s an incorrect variable class?

all_equal(iris, iris_wrong_class)
## [1] "Incompatible type for column `Species`: x factor, y character"

what about wrong columns and the wrong class?

all_equal(iris, iris_missing_extra_cols_wrong_class)
## [1] "Cols in y but not x: `Specieso`. "    
## [2] "Cols in x but not y: `Sepal.Length`. "

looks like however dplyr::all_equal() works, reporting on the missing/extra columns is prioritized over reporting on the incorrect variable class. this is not great but not a huge deal to me - the function could report on the wrong columns, they get fixed, and then it reports on the wrong class.

overall, i do like the nice printing, e.g. “cols in y but not in x”, and these could probably be str_replace-ified in a function to supply the actual data frame names (e.g. iris vs iris_wrong_class).

i’m running the version of dplyr on CRAN - i had a dev version installed earlier and some of the messages looked different. if these will be changing (that’s ok!) i wouldn’t want to rely on altering the string output for creating my own errors, since the strings could change, and i don’t necessarily want to require dplyr 1.0.0 in a package just for all_equal strings :~)

janitor::compare_df_cols()

i consider myself a big fan of janitor but there’s always new functions in there that surprise me! thank you to Rebecca Janis for mentioning this (and sharing her own function for comparing column classes)!

let’s go in the same order. what happens if the data frames are the same?

library(janitor)

compare_df_cols(iris, iris)
##    column_name  iris.x  iris.y
## 1 Petal.Length numeric numeric
## 2  Petal.Width numeric numeric
## 3 Sepal.Length numeric numeric
## 4  Sepal.Width numeric numeric
## 5      Species  factor  factor

it just returns a comparison of the columns () (what’s in both data frames, and their classes in each).

you can set an option only to return things that don’t match (or things that do):

compare_df_cols(iris, iris, return = "mismatch")
## [1] column_name iris.x      iris.y     
## <0 rows> (or 0-length row.names)

(in this case, nothing).

i assume that this will be robust to different order of columns, but let’s double check:

compare_df_cols(iris, iris_diff_order)
##    column_name    iris iris_diff_order
## 1 Petal.Length numeric         numeric
## 2  Petal.Width numeric         numeric
## 3 Sepal.Length numeric         numeric
## 4  Sepal.Width numeric         numeric
## 5      Species  factor          factor

cool

what about missing or extra columns?

compare_df_cols(iris, iris_missing_extra_cols)
##    column_name    iris iris_missing_extra_cols
## 1 Petal.Length numeric                 numeric
## 2  Petal.Width numeric                 numeric
## 3 Sepal.Length numeric                    <NA>
## 4  Sepal.Width numeric                 numeric
## 5      Species  factor                  factor
## 6     Specieso    <NA>                  factor

i like this because it shows all columns in both, and what their types are, by data frame (NA if the column isn’t in that df)

strangely enough, a “mismatch” here is a mismatch in class, and NA does not count as a mismatch:

compare_df_cols(iris, iris_missing_extra_cols, return = "mismatch")
## [1] column_name             iris                    iris_missing_extra_cols
## <0 rows> (or 0-length row.names)

and does count as a match 🙃

compare_df_cols(iris, iris_missing_extra_cols, return = "match")
##    column_name    iris iris_missing_extra_cols
## 1 Petal.Length numeric                 numeric
## 2  Petal.Width numeric                 numeric
## 3 Sepal.Length numeric                    <NA>
## 4  Sepal.Width numeric                 numeric
## 5      Species  factor                  factor
## 6     Specieso    <NA>                  factor

what about if the variable classes don’t match?

compare_df_cols(iris, iris_wrong_class, return = "mismatch")
##   column_name   iris iris_wrong_class
## 1     Species factor        character

i like this because you can easily construct your own message from this structure, using glue:

library(glue)

compare_df_cols(iris, iris_wrong_class, return = "mismatch") %>%
  mutate(message = glue("Column `{column_name}`: expected class {iris}, got class {iris_wrong_class}")) %>%
  pull(message)
## Column `Species`: expected class factor, got class character

and the resulting data frame can be filtered manually when the filters from “mismatch” aren’t what i want, to see all differences:

compare_df_cols(iris, iris_missing_extra_cols_wrong_class) %>%
  filter(is.na(iris) | is.na(iris_missing_extra_cols_wrong_class) |
    iris != iris_missing_extra_cols_wrong_class)
##    column_name    iris iris_missing_extra_cols_wrong_class
## 1 Sepal.Length numeric                                <NA>
## 2      Species  factor                           character
## 3     Specieso    <NA>                              factor

vetr::alike()

next up is vetr::alike()! this is a new package to me, excited to see what it can do 🔍! thanks to Brodie G for suggesting (and creating!) it.

again, let’s start off checking on data frames that match:

library(vetr)

alike(iris, iris)
## [1] TRUE

cool, what about different order of columns?

alike(iris, iris_diff_order)
## [1] "`names(iris_diff_order)[1]` should be \"Sepal.Length\" (is \"Species\")"

uh oh

what about a data frame with missing and extra columns?

alike(iris, iris_missing_extra_cols)
## [1] "`names(iris_missing_extra_cols)[1]` should be \"Sepal.Length\" (is \"Sepal.Width\")"

this only tells me that one of the column names is wrong (the first one), and not what’s extra.

it can identify when there’s extra columns:

alike(iris, iris_extra_cols)
## [1] "`iris_extra_cols` should have 5 columns (has 6)"

but doesn’t say what they are. similarly to dplyr::all_equal(), it seems like it prioritizes one kind of reporting (an incorrect column name) and doesn’t indicate other issues.

let’s look at an incorrect class:

alike(iris, iris_wrong_class)
## [1] "`iris_wrong_class$Species` should be class \"factor\" (is \"character\")"

i like this format - says what it should be, and what it actually is.

testthat::expect_equal()

this one is my idea 😇 because i love testthat

when the data frames are the same, expect_equal() returns:

library(testthat)

expect_equal(iris, iris)

nothing. well actually, it returns the data frame invisibly:

x <- expect_equal(iris, iris)

x
## # A tibble: 150 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
## 1          5.1         3.5          1.4         0.2 setosa 
## 2          4.9         3            1.4         0.2 setosa 
## 3          4.7         3.2          1.3         0.2 setosa 
## 4          4.6         3.1          1.5         0.2 setosa 
## 5          5           3.6          1.4         0.2 setosa 
## # … with 145 more rows

but doesn’t print it automatically, probably since it’s usually used within a test_that() call.

when there are missing/extra columns:

expect_equal(iris, iris_missing_extra_cols)
## Error: `iris` not equal to `iris_missing_extra_cols`.
## Cols in y but not x: `Specieso`. 
## Cols in x but not y: `Sepal.Length`.

the reporting is very similar (the same?) as from dplyr::all_equal() 🤔 are these the same functions?

expect_equal(iris, iris_wrong_class)
## Error: `iris` not equal to `iris_wrong_class`.
## Incompatible type for column `Species`: x factor, y character

ok yes, the same as dplyr::all_equal() but returning an error instead of a string. it probably just calls all_equal(). maybe i shouldn’t have bragged that this one was my idea 😂

diffdf::diffdf()

Marianna Foos swept in while i was writing this post and suggested diffdf!

library(diffdf)

diffdf(iris, iris)
## No issues were found!

thank you, diffdf!

diffdf(iris, iris_diff_order)
## No issues were found!

i agree!

diffdf(iris, iris_missing_extra_cols)
## Warning in diffdf(iris, iris_missing_extra_cols): 
## There are columns in BASE that are not in COMPARE !!
## There are columns in COMPARE that are not in BASE !!
## Differences found between the objects!
## 
## A summary is given below.
## 
## There are columns in BASE that are not in COMPARE !!
## All rows are shown in table below
## 
##   ==============
##      COLUMNS    
##   --------------
##    Sepal.Length 
##   --------------
## 
## There are columns in COMPARE that are not in BASE !!
## All rows are shown in table below
## 
##   ==========
##    COLUMNS  
##   ----------
##    Specieso 
##   ----------

shit ok! we got DIFFERENCES !!

diffdf(iris, iris_wrong_class)
## Warning in diffdf(iris, iris_wrong_class): 
## There are columns in BASE and COMPARE with different modes !!
## There are columns in BASE and COMPARE with different classes !!
## Differences found between the objects!
## 
## A summary is given below.
## 
## There are columns in BASE and COMPARE with different modes !!
## All rows are shown in table below
## 
##   ================================
##    VARIABLE  MODE.BASE  MODE.COMP 
##   --------------------------------
##    Species    numeric   character 
##   --------------------------------
## 
## There are columns in BASE and COMPARE with different classes !!
## All rows are shown in table below
## 
##   ==================================
##    VARIABLE  CLASS.BASE  CLASS.COMP 
##   ----------------------------------
##    Species     factor    character  
##   ----------------------------------

i don’t know what a mode is (something related to factors, i guess!) but i like that it reports the different classes in Species.

this function also returns a list of data frames with issues invisibly (i suppressed warnings because it was yelling ok)

x <- diffdf(iris, iris_missing_extra_cols_wrong_class, suppress_warnings = TRUE)

x[["VarClassDiffs"]]
##   VARIABLE CLASS.BASE CLASS.COMP
## 6  Species     factor  character
x[["ExtColsBase"]]
## # A tibble: 1 x 1
##   COLUMNS     
## * <chr>       
## 1 Sepal.Length
x[["ExtColsComp"]]
## # A tibble: 1 x 1
##   COLUMNS 
## * <chr>   
## 1 Specieso

this is similar to the kind of information you get from janitor::compare_df_cols(), and i like it!

comparing the functions

i have a bunch of functions to compare, on a bunch of criteria!

  1. iris == iris 😋
  2. iris in a different order is still iris
  3. detects missing columns
  4. detects extra columns
  5. detects missing and extra columns
  6. detects difference in class
  7. detects different columns and difference in class
  8. nice strings to use for messages
  9. returns data (that i can use to BYOmessages)

i’m not going to include testthat::test_equal() in the comparison, since it’s so similar (read: basically the same) to dplyr::all_equal()

dplyr::all_equal() janitor::compare_df_cols() vetr::alike() diffdf::diffdf()
iris is iris
column swapped iris is iris
missing columns
extra columns
missing and extra columns
difference in class
different columns and classes
nice strings to use for messages ✅❌
returns data on differences

what to do what to do

ok, so what did i end up doing? combining things, of course!

i loved learning about new packages and seeing what’s available in packages i already know. i am inclined to use stuff from dplyr and janitor because they’re often already dependencies in my work.

so, i decided to cobble something together using the data available from janitor::compare_df_cols(), and the basic check of dplyr::all_equal() as a jumping off point, and i borrowed some of the message/error/etc wording from the packages i explored. i also have my function return the original data if all checks pass so that it can easily be used in a pipeline!

i won’t belabour you with the source code, but here’s how it works:

sharla_diff(iris, iris)
## # A tibble: 150 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
## 1          5.1         3.5          1.4         0.2 setosa 
## 2          4.9         3            1.4         0.2 setosa 
## 3          4.7         3.2          1.3         0.2 setosa 
## 4          4.6         3.1          1.5         0.2 setosa 
## 5          5           3.6          1.4         0.2 setosa 
## # … with 145 more rows
sharla_diff(iris, iris_diff_order)
## # A tibble: 150 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
## 1          5.1         3.5          1.4         0.2 setosa 
## 2          4.9         3            1.4         0.2 setosa 
## 3          4.7         3.2          1.3         0.2 setosa 
## 4          4.6         3.1          1.5         0.2 setosa 
## 5          5           3.6          1.4         0.2 setosa 
## # … with 145 more rows
sharla_diff(iris, iris_missing_extra_cols)
## Warning: `iris` contains extra column(s): Sepal.Length.
## Error: `iris` is missing column(s): Specieso.
sharla_diff(iris, iris_wrong_class)
## Error: `iris` column types are not as expected. These columns have issues:
## Species should be class character, is factor
sharla_diff(iris, iris_missing_extra_cols_wrong_class)
## Warning: `iris` contains extra column(s): Sepal.Length.
## Error: `iris` is missing column(s): Specieso.

sharla_diff() errors when it’s missing columns, so it doesn’t have the chance to look for the wrong class in iris_missing_extra_cols_wrong_class (like dplyr::all_equal()!). as i mentioned, that’s not a huge deal for me.

i’m sure this function isn’t perfect and will be changed as i see issues with it. that’s it for now! please see the honourable mentions appendix if you think i snubbed your package ok!!!

honourable mentions

packages for comparing values

some people suggested packages for comparing the actual values within a data frame instead of the data frame name/classes. that’s not what i was looking for, but helpful anyways!

if you want to do that:

visdat in general is such an awesome package for visually understanding your data. Nick Tierney is its creator (and suggester!), specifically suggesting visdat::vis_compare() for visually comparing values.

dataCompareR::rCompare() can be used for comparing values very robustly. thank you to Robert Campbell for the suggestion!

a little unrelated, but diffobj looks very cool for visualizing diffs of objects (duh, hence the name) within your console. ty to Tokhir Dadaev and Brodie Gaslam for mentioning (and Brodie for creating - a double for both in this post!).

daff::diff_data() and arsenal::comparedf()

ok, thank you and no offence to everyone (Erin Grand, Antoine Fabri, Eric Polley, Benjamin Gowan) that suggested these!!! daff::diff_data() doesn’t render in rmarkdown and arsenal::comparedf() is a little too robust for what i’m looking for.

either way, both methods provide more information than i want and don’t really lend themselves to easy creation of errors or warnings. if you are looking for a full report (tm) of what is different between two data frames, these might be for you!

vctrs::vec_assert()

according to Hadley, vctrs::vec_assert() might one day create the kind of messages i want about classes (e.g. expected Date, got character). i look forward to it (and to learning vctrs someday)!

ok that’s really it! bye!

Avatar
Sharla Gelfand
Freelance R and Shiny Developer