crying @ sephora
A few days ago an amazing tweet made the rounds:
Connie scraped and shared a data set of 100ish reviews from Sephora that involved the word “crying” - the whole repo is here. This was done for a class called “Data Gardens” taught by Everest Pipkin at Carnegie Mellon. From the class syllabus, “Data Gardens is a studio class in creative code and software practices, with an emphasis on data as medium.” It sounds very cool and instead of crying that I never chose a program that involved classes like that, I’m analyzing the crying data 💁
Let’s read in the data. It contains information on the product rated as well as the actual title + text of the review, the number of stars rated, the date, and a user id.
library(jsonlite)
library(dplyr)

crying <- fromJSON(
  "https://raw.githubusercontent.com/everestpipkin/datagardens/master/students/khanniie/5_newDataSet/crying_dataset.json",
  simplifyDataFrame = TRUE
)
crying <- as_tibble(crying[["reviews"]])
crying
## # A tibble: 105 x 6
##    date  product_info$br… $name $type $url  review_body review_title stars
##    <chr> <chr>            <chr> <chr> <chr> <chr>       <chr>        <chr>
##  1 29 M… Too Faced        Bett… Masc… http… Now I can … AWESOME      5 st…
##  2 29 S… Too Faced        Bett… Masc… http… This holds… if you're s… 5 st…
##  3 23 M… Too Faced        Bett… Masc… http… I just bou… Hate it      1 st…
##  4 15 A… Too Faced        Bett… Masc… http… To start o… Nearly perf… 5 st…
##  5 21 S… Too Faced        Bett… Masc… http… This masca… Amazing!!    5 st…
##  6 30 M… Too Faced        Bett… Masc… http… "Let's tal… Tricky but … 5 st…
##  7 3 Ap… Too Faced        Bett… Masc… http… I really w… nothing lik… 1 st…
##  8 6 Ma… Too Faced        Bett… Masc… http… I bought t… absolute be… 5 st…
##  9 7 Se… Too Faced        Bett… Masc… http… I have ext… Color: Stan… 5 st…
## 10 27 F… Too Faced        Bett… Masc… http… My 6$ drug… Didn't like… 1 st…
## # … with 95 more rows, and 1 more variable: userid <dbl>
I want to look at the number of stars rated. It looks like this variable actually contains the phrase star/stars:
crying %>% count(stars)
## # A tibble: 5 x 2
##   stars       n
##   <chr>   <int>
## 1 1 star      6
## 2 2 stars     2
## 3 3 stars     4
## 4 4 stars    14
## 5 5 stars    79
That’s not that useful for analysis, so I’ll actually pull out just the star rating:
library(tidyr)

crying <- crying %>%
  separate(stars, into = "stars", convert = TRUE)

crying %>%
  count(stars)
## # A tibble: 5 x 2
##   stars     n
##   <int> <int>
## 1     1     6
## 2     2     2
## 3     3     4
## 4     4    14
## 5     5    79
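To see what `separate()` is doing here, a minimal sketch on a made-up three-row tibble (`toy` is hypothetical, not the Sephora data). The default separator splits on any non-alphanumeric run, so "5 stars" becomes "5" and "stars"; naming a single output column keeps only the first piece, and `convert = TRUE` turns it into an integer. The `extra = "drop"` argument is my addition, on the assumption that without it tidyr warns about the discarded second piece.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data mimicking the "n star(s)" format
toy <- tibble(stars = c("1 star", "5 stars", "4 stars"))

toy_clean <- toy %>%
  # Split on the space, keep only the first piece, quietly drop the rest,
  # and convert the remaining piece from character to integer
  separate(stars, into = "stars", extra = "drop", convert = TRUE)

toy_clean
```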
We can see that even though these reviews are about crying, the ratings are overwhelmingly good - about 75% 5 star ratings.
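The 75% figure comes straight from the counts above. As a quick check in base R (the counts are copied by hand from the table, and `star_counts` and `star_share` are just names I picked for this sketch):

```r
# Counts copied from the count(stars) output above
star_counts <- c(`1` = 6, `2` = 2, `3` = 4, `4` = 14, `5` = 79)

# Share of reviews at each star rating
star_share <- star_counts / sum(star_counts)

# As percentages, rounded to one decimal place
round(100 * star_share, 1)
```

With 79 of 105 reviews at 5 stars, the 5 star share works out to roughly 75.2%.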
library(janitor)
library(ggplot2)

crying %>%
  tabyl(stars) %>%
  ggplot(aes(x = stars, y = percent)) +
  geom_col() +
  scale_x_continuous("Star Rating") +
  scale_y_continuous("Percent of reviews", labels = scales::percent, limits = c(0, 1)) +
  ggtitle("Sephora crying review star ratings")
There’s no way that I could let all of that juicy review data go without doing some text analysis. I’ve still only read like the first three chapters of the tidy text mining book (sorry 🇨🇦) so we’re just going to count some words.
I’m pasting the review title + text together, then separating that text out into words (adding a review id along with it so we can keep track of distinct reviews, and carrying the stars along for later!):
library(tidytext)

crying_tokens <- crying %>%
  mutate(review_id = row_number()) %>%
  mutate(review = paste(review_title, review_body)) %>%
  select(review_id, stars, review) %>%
  unnest_tokens(word, review)

crying_tokens
## # A tibble: 12,150 x 3
##    review_id stars word
##        <int> <int> <chr>
##  1         1     5 awesome
##  2         1     5 now
##  3         1     5 i
##  4         1     5 can
##  5         1     5 cry
##  6         1     5 all
##  7         1     5 i
##  8         1     5 want
##  9         1     5 without
## 10         1     5 having
## # … with 12,140 more rows
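If the tokenizing step feels a bit magic, here is a small sketch on two made-up reviews (`toy_reviews` is hypothetical). By default `unnest_tokens()` splits the text into one word per row, lowercases it, and strips punctuation, while carrying the other columns along.

```r
library(dplyr)
library(tidytext)

# Two hypothetical reviews, title and body already pasted together
toy_reviews <- tibble(
  review_id = 1:2,
  review = c("AWESOME Now I can cry!", "Hate it")
)

# One row per word; review_id is repeated for each word in that review
toy_tokens <- toy_reviews %>%
  unnest_tokens(word, review)

toy_tokens
```

The first review contributes five rows and the second two, which is the same shape as the `crying_tokens` output above.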
Next, I’ll get rid of stop words, and with the remaining words, count how many times they appear:
crying_tokens <- crying_tokens %>%
  anti_join(stop_words, by = "word")

# Stop words are already gone at this point, so just count
crying_tokens_frequency <- crying_tokens %>%
  count(word, sort = TRUE)

crying_tokens_frequency %>%
  head(10) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col() +
  labs(x = "", y = "Number of appearances", title = "Top 10 used words in Sephora crying reviews") +
  coord_flip()
The top words are unsurprising given the context + products chosen!
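For anyone new to `anti_join()`, a toy version of the stop word step. Both tibbles are made up for illustration, and `toy_stop_words` is a tiny stand-in for the real `stop_words` data set, not a copy of it. `anti_join()` keeps only the rows of the first table that have no match in the second, and `count(sort = TRUE)` puts the most frequent word first.

```r
library(dplyr)

# Hypothetical tokens from two reviews
toy_tokens <- tibble(
  review_id = c(1, 1, 1, 1, 1, 2, 2, 2),
  word = c("awesome", "now", "i", "can", "cry", "hate", "it", "cry")
)

# Hypothetical mini stop word list (stand-in for tidytext's stop_words)
toy_stop_words <- tibble(word = c("now", "i", "can", "it"))

toy_frequency <- toy_tokens %>%
  # Drop every token that appears in the stop word list
  anti_join(toy_stop_words, by = "word") %>%
  # Count what is left, most frequent first
  count(word, sort = TRUE)

toy_frequency
```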
I also want to look at the sentiment of each review compared to the star rating. I’m guessing they won’t really match up because this context is… hard to capture. But let’s see. I’m using the NRC sentiment lexicon because it has the most words in common with my data set. The net sentiment is the number of words with positive sentiment minus the number with negative sentiment.
crying_sentiment <- crying_tokens %>%
  inner_join(
    get_sentiments("nrc") %>%
      filter(sentiment %in% c("positive", "negative")),
    by = "word"
  ) %>%
  count(review_id, stars, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(net_sentiment = positive - negative)

ggplot(
  crying_sentiment,
  aes(x = stars, y = net_sentiment)
) +
  geom_jitter(width = 0.1, height = 0.1, alpha = 0.5) +
  labs(
    x = "Star Rating",
    y = "Net Sentiment"
  )
The 5 star reviews are a little all over the place, but it does seem like maybe there’s some sort of trend for the 1-3 star reviews? Is n = 12 statistically significant? Is someone going to revoke my statistics degrees?
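To make the net sentiment calculation concrete, a small sketch with a made-up lexicon (`toy_lexicon` is hypothetical and is not the NRC lexicon; the word labels are my own guesses for illustration). It follows the same join, count, and subtract steps as the real thing, except that it uses `pivot_wider()`, which is the newer tidyr equivalent of `spread()`.

```r
library(dplyr)
library(tidyr)

# Hypothetical tokens from two reviews
toy_tokens <- tibble(
  review_id = c(1, 1, 1, 2, 2),
  word = c("love", "amazing", "cry", "hate", "cry")
)

# Hypothetical lexicon; sentiment labels are assumptions for this example
toy_lexicon <- tibble(
  word = c("love", "amazing", "hate", "cry"),
  sentiment = c("positive", "positive", "negative", "negative")
)

toy_sentiment <- toy_tokens %>%
  # Keep only words that appear in the lexicon
  inner_join(toy_lexicon, by = "word") %>%
  # Count positive and negative words per review
  count(review_id, sentiment) %>%
  # One column per sentiment, filling missing combinations with 0
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0L) %>%
  # Net sentiment = positive words minus negative words
  mutate(net_sentiment = positive - negative)

toy_sentiment
```

Review 1 has two positive words and one negative, so its net sentiment is 1. Review 2 has no positive words and two negative ones, so it lands at -2.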
And because I have to…
library(ggpubr)
library(jpeg)

kim <- readJPEG(here::here("content", "post", "2019-11-08-crying-sephora", "kim.jpg"))

ggplot(
  crying_sentiment,
  aes(x = stars, y = net_sentiment)
) +
  background_image(kim) +
  geom_jitter(width = 0.1, height = 0.1, alpha = 0.5, size = 3) +
  labs(
    x = "Star Rating",
    y = "Net Sentiment",
    title = "Star rating versus net sentiment",
    subtitle = "Sephora product reviews involving the word 'crying'",
    caption = "Data Source: @crabbage_\nAnalysis: @sharlagelfand"
  )