Star Trek: The Next Generation catchphrases

It was only a matter of time before I reached “The Sopranos Point” of the pandemic and decided to pick up Star Trek. I thought I might not ~get it~, but it turns out that it’s just a soap opera for nerds (🙋)… so it was also only a matter of time before I decided to analyze data from it. I posted a plot on Twitter (only click if you want spoilers!), but left the code a mystery (ok, no one asked for it). Here it is now!

At the time I embarked on this little project, we had just wrapped up season 1 of Star Trek: The Next Generation. Now we’re on season 3, but… let’s just look at season 1 here.

The way that the characters speak is so wonderful and distinct, and I wanted to see if text analysis could capture that. Luckily, I didn’t have to work very hard, either to get the data or to analyze it, thanks to wonderful packages like {rtrek}, which contains datasets related to Star Trek, including transcripts! And, of course, where would I be without {tidytext} for text analysis?

First, I’ll grab the TNG transcripts via {rtrek} and filter only for the first season:

library(rtrek)
library(dplyr)

transcripts <- st_transcripts() %>%
  filter(series == "TNG", season == 1)

transcripts
## # A tibble: 25 x 10
##    format  series season number title   production airdate url     url2    text 
##    <chr>   <chr>   <int>  <int> <chr>        <int> <chr>   <chr>   <chr>   <lis>
##  1 episode TNG         1      1 Encoun…        101 1987-0… https:… http:/… <tib…
##  2 episode TNG         1      3 The Na…        103 1987-1… https:… http:/… <tib…
##  3 episode TNG         1      4 Code o…        104 1987-1… https:… http:/… <tib…
##  4 episode TNG         1      5 Haven          105 1987-1… https:… http:/… <tib…
##  5 episode TNG         1      6 Where …        106 1987-1… https:… http:/… <tib…
##  6 episode TNG         1      7 The La…        107 1987-1… https:… http:/… <tib…
##  7 episode TNG         1      8 Lonely…        108 1987-1… https:… http:/… <tib…
##  8 episode TNG         1      9 Justice        109 1987-1… https:… http:/… <tib…
##  9 episode TNG         1     10 The Ba…        110 1987-1… https:… http:/… <tib…
## 10 episode TNG         1     11 Hide a…        111 1987-1… https:… http:/… <tib…
## # … with 15 more rows

The text here is nested in the text column, and that’s all we need, so I’ll just hold on to that:

library(tidyr)

transcripts <- transcripts %>%
  select(text) %>%
  unnest(cols = text)

transcripts
## # A tibble: 11,248 x 6
##    line_number perspective    setting         description character line        
##          <int> <chr>          <chr>           <chr>       <chr>     <chr>       
##  1          NA <NA>           "Fade in"       <NA>        <NA>      <NA>        
##  2           1 Ext. Space - … "The u.s.s. En… <NA>        Picard V… Captain's l…
##  3           2 Other introdu… "On the gigant… <NA>        Picard V… My orders a…
##  4           3 Int. Engine r… "Huge, with a … Continuing  Picard V… ... I am be…
##  5           4 Closer on ves… "Showing the d… <NA>        Picard V… I am still …
##  6          NA Int. Lounge d… "With its huge… <NA>        <NA>      <NA>        
##  7           5 Continued       <NA>           Continuing  Picard V… ... my crew…
##  8           6 Int. Bridge -… "Picard, troi,… Continuing  Picard V… ... a first…
##  9           7 Angle emphasi… "As picard tur… <NA>        Picard    You will ag…
## 10           8 Angle emphasi…  <NA>           <NA>        Data      Difficult .…
## # … with 11,238 more rows

Let’s just focus on the character and line:

transcripts <- transcripts %>%
  select(character, line) %>%
  filter(!is.na(line))

transcripts
## # A tibble: 10,448 x 2
##    character   line                                                             
##    <chr>       <chr>                                                            
##  1 Picard V.o. Captain's log, stardate 42353.7. Our destination is planet Cygnu…
##  2 Picard V.o. My orders are to examine Farpoint, a starbase built there by the…
##  3 Picard V.o. ... I am becoming better acquainted with my new command, this Ga…
##  4 Picard V.o. I am still somewhat in awe of its size and complexity.           
##  5 Picard V.o. ... my crew we are short in several key positions, most notably …
##  6 Picard V.o. ... a first officer, but I am informed that a highly experienced…
##  7 Picard      You will agree, Data, that Starfleet's instructions are difficul…
##  8 Data        Difficult ... how so? Simply solve the mystery of Farpoint Stati…
##  9 Picard      As simple as that.                                               
## 10 Troi        Farpoint Station. Even the name sounds mysterious.               
## # … with 10,438 more rows

We can see already that the characters need some ~data cleaning~ (and thanks to the helpful {rtrek} vignette I knew this would be coming). “Picard V.o.” is a voiceover by Jean-Luc Picard, the Captain of this series’ ship, The Enterprise. It may be a voiceover, but it’s just Picard nonetheless.

I’ll strip off that excess " V.o." text, and some other variants that I found from peeking at the values of character. Then, we just see “Picard”:

library(stringr)

transcripts <- transcripts %>%
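  # Escape literal dots and parentheses so only the intended suffixes are removed;
  # "\\(?" and "\\)?" let " O.s." match with or without parentheses
  mutate(character = str_remove_all(character, " \\(V\\.o\\.\\)|'s Com Voice| V\\.o\\.|  \\(Cont'd\\)| \\(?O\\.s\\.\\)?"))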

transcripts
## # A tibble: 10,448 x 2
##    character line                                                               
##    <chr>     <chr>                                                              
##  1 Picard    Captain's log, stardate 42353.7. Our destination is planet Cygnus …
##  2 Picard    My orders are to examine Farpoint, a starbase built there by the i…
##  3 Picard    ... I am becoming better acquainted with my new command, this Gala…
##  4 Picard    I am still somewhat in awe of its size and complexity.             
##  5 Picard    ... my crew we are short in several key positions, most notably ...
##  6 Picard    ... a first officer, but I am informed that a highly experienced m…
##  7 Picard    You will agree, Data, that Starfleet's instructions are difficult? 
##  8 Data      Difficult ... how so? Simply solve the mystery of Farpoint Station.
##  9 Picard    As simple as that.                                                 
## 10 Troi      Farpoint Station. Even the name sounds mysterious.                 
## # … with 10,438 more rows
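
A small aside on patterns like the one above: in a regular expression, an unescaped `.` matches any character and unescaped parentheses form a group, so literal suffixes such as "(O.s.)" need escaping to match as written. Here is a minimal sketch of the difference. The speaker labels are made up for illustration (not copied from the transcripts), and base R's gsub() stands in for str_remove_all() so that it runs without any packages:

```r
# Hypothetical speaker labels for illustration; not copied from the transcripts
labels <- c("Picard V.o.", "Picard (O.s.)", "Riker's Com Voice", "Data")

# Unescaped, "(O.s.)" is a group: "O", any character, "s", any character.
# It never matches the literal parentheses in the label.
unescaped_hit <- grepl(" (O.s.)", "Picard (O.s.)")

# Escaped pattern; "\\(?" and "\\)?" make the parentheses optional
pattern <- " \\(?V\\.o\\.\\)?| \\(?O\\.s\\.\\)?|'s Com Voice"
cleaned <- gsub(pattern, "", labels)

unescaped_hit
cleaned
```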

My plan is to look at “catchphrases” for each major character in season 1, starting with bigrams - two words appearing together. If you look at the previous sentence, the bigrams are “my plan”, “plan is”, “is to”, “to look”, etc etc.
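
To make "bigram" concrete, here is a tiny base R sketch that pairs each word with the one after it. It is only an illustration of the idea, not how {tidytext} tokenizes under the hood:

```r
sentence <- "My plan is to look at catchphrases"

# Lowercase and split on spaces
words <- strsplit(tolower(sentence), " ", fixed = TRUE)[[1]]

# Pair every word with the word that follows it
bigrams <- paste(head(words, -1), tail(words, -1))
bigrams
```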

From already posting this on Twitter and getting roasted for it, I know that two main characters’ names will be interpreted as multiple words (because, I guess, they are): La Forge, of Geordi La Forge, and Jean-Luc, of the previously mentioned Picard. “La” is not meaningful without “Forge”, and “Jean” not without “Luc”, so I’m going to cheat a bit here and glue those together for the analysis:

transcripts <- transcripts %>%
  mutate(
    line = str_replace_all(line, "La Forge", "LaForge"),
    line = str_replace_all(line, "Jean-Luc", "JeanLuc")
  )
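
Why bother gluing? A rough sketch, using a made-up line and a deliberately crude base R tokenizer (lowercase, drop punctuation, split on spaces) as a stand-in for a real one, shows how the name gets scattered across bigrams unless it is a single token:

```r
# Crude stand-in for a word tokenizer; toy_bigrams() is a hypothetical helper
toy_bigrams <- function(line) {
  cleaned <- gsub("[^a-z' ]", "", tolower(line))
  words <- strsplit(cleaned, " +")[[1]]
  paste(head(words, -1), tail(words, -1))
}

line <- "Mister La Forge, report" # made-up line, not from the transcripts

split_name <- toy_bigrams(line)
glued_name <- toy_bigrams(gsub("La Forge", "LaForge", line, fixed = TRUE))

split_name
glued_name
```

With the name split, "la" ends up in two different bigrams; once glued, "laforge" travels as one word.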

Ok ok, I’m getting ahead of myself. Let’s focus on the characters I care about: Picard (the captain), Data (Android :3), Riker (Commander), Beverly (the ship’s doctor), Tasha (head of security - spoiler, but RIP), Geordi (at the helm in season 1; he becomes Chief Engineer in season 2), Troi (the ship’s counselor), Worf (a Klingon commander?), and Wesley (a literal child but ok). No offense to everyone else, no offense to Q. Nine characters fit on a single plot really well!

transcripts <- transcripts %>%
  filter(character %in% c(
    "Picard", "Data", "Riker", "Beverly",
    "Tasha", "Geordi", "Troi", "Worf", "Wesley"
  ))

Now, I’ll split every line into bigrams and count how many times the bigrams appear, by character:

library(tidytext)

bigrams <- transcripts %>%
  unnest_tokens(output = bigram, input = line, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%
  count(character, bigram, sort = TRUE)

bigrams
## # A tibble: 50,958 x 3
##    character bigram            n
##    <chr>     <chr>         <int>
##  1 Picard    number one      138
##  2 Picard    this is          94
##  3 Picard    of the           86
##  4 Data      it is            67
##  5 Picard    to the           66
##  6 Picard    we are           56
##  7 Picard    in the           54
##  8 Picard    captain's log    53
##  9 Picard    on the           52
## 10 Picard    do you           49
## # … with 50,948 more rows

Amazingly enough, the top bigram is actually meaningful - “number one”, by Picard, refers to the ship’s commander, William Riker. But most of the rest of the top 10 bigrams are pretty boring - “this is”, “of the”, “it is”, “to the”. Some of these bigrams consist of two stopwords together.

Stopwords do not add much meaning to a sentence, and we can usually safely remove them. Since we have bigrams (two words together) rather than single words, it’s not a matter of just anti-joining on a stopwords data set. Instead, I’ll take the approach of removing a bigram only if both of its words are stopwords.
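
As a toy illustration of that rule (with a made-up four-word stopword list, not the real lexicon), a bigram survives as long as at least one of its words is not a stopword:

```r
# Hypothetical mini stopword list, for illustration only
toy_stopwords <- c("this", "is", "of", "the")

toy <- data.frame(
  word1 = c("number", "this", "the", "of"),
  word2 = c("one", "is", "enterprise", "the"),
  stringsAsFactors = FALSE
)

# TRUE only when BOTH words are stopwords
both_stop <- toy$word1 %in% toy_stopwords & toy$word2 %in% toy_stopwords

kept <- toy[!both_stop, ]
kept_bigrams <- paste(kept$word1, kept$word2)
kept_bigrams
```

"this is" and "of the" go away, while "the enterprise" stays even though "the" is a stopword.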

From the {tidytext} package, I’m using the “snowball” stopword lexicon - this will allow for more conservative stopword removal, since it contains way fewer words than the other two lexicons:

stop_words %>%
  count(lexicon)
## # A tibble: 3 x 2
##   lexicon      n
##   <chr>    <int>
## 1 onix       404
## 2 SMART      571
## 3 snowball   174

tng_stopwords <- stop_words %>%
  filter(lexicon == "snowball") %>%
  pull(word)

To see if both words in a bigram are stopwords, I’ll split the bigram into two columns, and test if each is a stopword (yes I know how to use across() wowww):

bigrams <- bigrams %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  mutate(across(c(word1, word2), .fns = list(is_stopword = ~ .x %in% tng_stopwords)))

bigrams
## # A tibble: 50,958 x 6
##    character word1     word2     n word1_is_stopword word2_is_stopword
##    <chr>     <chr>     <chr> <int> <lgl>             <lgl>            
##  1 Picard    number    one     138 FALSE             FALSE            
##  2 Picard    this      is       94 TRUE              TRUE             
##  3 Picard    of        the      86 TRUE              TRUE             
##  4 Data      it        is       67 TRUE              TRUE             
##  5 Picard    to        the      66 TRUE              TRUE             
##  6 Picard    we        are      56 TRUE              TRUE             
##  7 Picard    in        the      54 TRUE              TRUE             
##  8 Picard    captain's log      53 FALSE             FALSE            
##  9 Picard    on        the      52 TRUE              TRUE             
## 10 Picard    do        you      49 TRUE              TRUE             
## # … with 50,948 more rows

Filter only for cases where they’re not both stopwords (but one may be, e.g. “the” below):

bigrams <- bigrams %>%
  filter(!(word1_is_stopword & word2_is_stopword))

head(bigrams)
## # A tibble: 6 x 6
##   character word1     word2          n word1_is_stopword word2_is_stopword
##   <chr>     <chr>     <chr>      <int> <lgl>             <lgl>            
## 1 Picard    number    one          138 FALSE             FALSE            
## 2 Picard    captain's log           53 FALSE             FALSE            
## 3 Picard    mister    data          38 FALSE             FALSE            
## 4 Picard    the       enterprise    38 TRUE              FALSE            
## 5 Picard    the       bridge        37 TRUE              FALSE            
## 6 Picard    log       stardate      32 FALSE             FALSE

Then reunite the word columns into a happy bigram again:

bigrams <- bigrams %>%
  unite(bigram, c(word1, word2), sep = " ") %>%
  select(-contains("is_stopword"))

bigrams
## # A tibble: 43,904 x 3
##    character bigram             n
##    <chr>     <chr>          <int>
##  1 Picard    number one       138
##  2 Picard    captain's log     53
##  3 Picard    mister data       38
##  4 Picard    the enterprise    38
##  5 Picard    the bridge        37
##  6 Picard    log stardate      32
##  7 Geordi    aye sir           30
##  8 Picard    away team         30
##  9 Picard    can you           30
## 10 Picard    lieutenant yar    29
## # … with 43,894 more rows

We could just look at each character’s top bigrams, but I want to see what makes them them; what makes them unique - what if all of the characters have the same top bigrams? Boring!

Because I’m only human, I’ll go ahead and just copy Julia Silge’s excellent blog post on {tidylo} and use weighted log odds, a method of comparing features (bigrams) across some set (characters).

Since this is ~tidy text mining~, it’s super easy to get the top 5 bigrams for each character, by the weighted log odds (no I don’t know what it is but it compares them, OK? I’ll remove the x-axis labels anyways)

library(tidylo)

tng_tidylo <- bigrams %>%
  bind_log_odds(character, bigram, n) %>%
  group_by(character) %>%
  top_n(5, log_odds_weighted) %>%
  ungroup()

tng_tidylo
## # A tibble: 46 x 4
##    character bigram               n log_odds_weighted
##    <chr>     <chr>            <int>             <dbl>
##  1 Picard    number one         138              4.89
##  2 Picard    captain's log       53              6.91
##  3 Picard    log stardate        32              5.18
##  4 Geordi    aye sir             30              8.13
##  5 Picard    mister laforge      22              4.20
##  6 Data      i believe           18              3.87
##  7 Wesley    yes sir             17              7.56
##  8 Tasha     frequencies open    14              6.74
##  9 Riker     riker to            13              6.54
## 10 Tasha     open sir            12              6.23
## # … with 36 more rows

Very good!
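
For the curious, here is a rough sketch of the general idea behind a weighted log odds score, in the style of the "Fightin' Words" method (Monroe, Colaresi and Quinn) that {tidylo} is based on: compare a feature's odds in one group against its odds everywhere else, then divide by an estimate of the uncertainty so that rare features don't dominate. To be clear about assumptions: the counts are made up, the flat prior (`alpha = 1`) is my simplification, and weighted_log_odds() is a hypothetical helper, so the numbers will not match what bind_log_odds() returns.

```r
# Hypothetical helper: z-scored log odds ratio with a flat Dirichlet prior.
# y_i    = counts of each feature for one character
# y_rest = counts of the same features for everyone else
weighted_log_odds <- function(y_i, y_rest, alpha = 1) {
  n_i <- sum(y_i)
  n_rest <- sum(y_rest)
  alpha_0 <- alpha * length(y_i) # total prior mass across features

  odds_i <- (y_i + alpha) / (n_i + alpha_0 - y_i - alpha)
  odds_rest <- (y_rest + alpha) / (n_rest + alpha_0 - y_rest - alpha)

  delta <- log(odds_i) - log(odds_rest)               # log odds ratio
  sigma2 <- 1 / (y_i + alpha) + 1 / (y_rest + alpha)  # approximate variance
  delta / sqrt(sigma2)
}

# Made-up counts for illustration; not results from the transcripts
picard <- c("number one" = 10, "the enterprise" = 5, "aye sir" = 0)
others <- c("number one" = 0, "the enterprise" = 5, "aye sir" = 10)

z <- weighted_log_odds(picard, others)
z
```

In this toy data, "number one" scores positive for Picard, "aye sir" scores negative, and "the enterprise", which everyone says equally often, lands at zero.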

In order to visualize, I want to clean up a bit more. Let’s give everyone the dignity of their full names, and reorder them according to… rank? I don’t know.

AND, before I forget… “laforge” -> “la forge” and “jeanluc” -> “jean luc”, at least.

library(forcats)

tng_tidylo <- tng_tidylo %>%
  mutate(
    character = recode(character,
      "Picard" = "Jean-Luc Picard",
      "Riker" = "William Riker",
      "Troi" = "Deanna Troi",
      "Tasha" = "Tasha Yar",
      "Geordi" = "Geordi La Forge",
      "Beverly" = "Dr. Beverly Crusher",
      "Wesley" = "Wesley Crusher"
    ),
    character = fct_relevel(
      character, "Jean-Luc Picard", "William Riker", "Data", "Deanna Troi",
      "Tasha Yar", "Worf", "Geordi La Forge", "Dr. Beverly Crusher", "Wesley Crusher"
    ),
    bigram = str_replace_all(bigram, "laforge", "la forge"),
    bigram = str_replace_all(bigram, "jeanluc", "jean luc")
  )

Now, finally, we can visualize. Of course, there are R packages for making visualizations look Star Trek-y. In particular, I’m using {trekcolors} and {trekfont} to supply the… colours and font. By the way, all of these Star Trek packages were created by Matt Leonawicz. Thank you for the packages!

library(ggplot2)
library(trekcolors)
library(trekfont)

tng_tidylo %>%
  ggplot(aes(
    x = log_odds_weighted,
    y = reorder_within(bigram, log_odds_weighted, character),
    fill = character
  )) +
  geom_col(show.legend = FALSE) +
  facet_wrap(vars(character), scales = "free_y") +
  labs(
    x = NULL, y = NULL,
    title = "Star Trek: The Next Generation catchphrases",
    subtitle = "(Season 1 bigrams compared via tidy log odds)",
    caption = "Data: {rtrek} package"
  ) +
  scale_y_reordered() +
  scale_fill_trek("starfleet") +
  theme_dark(base_family = "Khan") +
  theme(
    plot.title.position = "plot",
    axis.ticks.length = unit(0, "pt"),
    axis.text.x = element_blank(),
    plot.background = element_rect(fill = "black", colour = "black"),
    plot.title = element_text(colour = "#327CCB", family = "StarNext", hjust = 0.5, size = 20),
    plot.subtitle = element_text(colour = "white", hjust = 0.5),
    plot.caption = element_text(colour = "white"),
    axis.text = element_text(colour = "white", family = "Federation")
  )

I absolutely love how much the different characters’ personalities (and yes, catchphrases!) shine through, even in such a simple analysis like this. Jean-Luc Picard starts every episode in season 1 with the Captain’s log, and often calls Riker “Number one”. Data is an Android who cannot use contractions, and is a wealth of, well, data - “most interesting” and “it appears” make total sense. Deanna Troi, the counselor, focuses on understanding people and other life forms’ feelings - “I sense” and “I believe” are so appropriate. And it goes on!

Of course, as folks on Twitter quickly realized (and I knew too, okay!), some of these are in fact trigrams (three words together), not bigrams. Captain Picard starts every episode with “Captain’s log, stardate…” and then the date. So it’s no surprise that “Captain’s log” and “log stardate” would both appear. Would it be more accurate to look at trigrams? …quadgrams? What else are we missing? OH WELL.

If anyone knows of some way to look at multiple n-grams (e.g. bigrams, trigrams, …quadgrams?) all together, and exclude any overlap (like, keeping “captain’s log stardate” but not “captain’s log”), I’m all ears. I did try some stuff out but it felt too much like sketchy science to include here!
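
For what it's worth, one naive way to think about the overlap problem, offered as a sketch rather than an answer: treat a shorter n-gram as redundant when some longer n-gram that contains it accounts for most of its occurrences. Everything here is an assumption: the counts are made up, the 80% threshold is arbitrary, and is_redundant() is a hypothetical helper.

```r
# Made-up counts for illustration; not results from the transcripts
counts <- data.frame(
  ngram = c("captain's log stardate", "captain's log", "log stardate", "number one"),
  n = c(30, 32, 30, 138),
  stringsAsFactors = FALSE
)
n_words <- lengths(strsplit(counts$ngram, " ", fixed = TRUE))

# Hypothetical rule: n-gram i is redundant if a longer n-gram contains it
# and covers at least `share` of its occurrences
is_redundant <- function(i, share = 0.8) {
  for (j in which(n_words > n_words[i])) {
    # Pad with spaces so matching respects word boundaries
    contained <- grepl(
      paste0(" ", counts$ngram[i], " "),
      paste0(" ", counts$ngram[j], " "),
      fixed = TRUE
    )
    if (contained && counts$n[j] >= share * counts$n[i]) {
      return(TRUE)
    }
  }
  FALSE
}

redundant <- vapply(seq_len(nrow(counts)), is_redundant, logical(1))
kept <- counts$ngram[!redundant]
kept
```

On this toy input, "captain's log" and "log stardate" are folded into the trigram, while "number one" is left alone. Whether a rule like this holds up on the real transcripts is a separate question.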

I could go on and on (maybe I have already) but I’d rather go watch more Star Trek. Bye!

Sharla Gelfand
Freelance R and Shiny Developer