In this session, we covered how to use regular expressions (also known as regex) to operate on character data (also called strings) using the {tidyverse} package. We use the friends dataset provided by the {friends} package as an example. So first, let’s load up our packages!

library(tidyverse)
library(friends)
friends

# A tibble: 67,373 × 6
   text                                   speaker season episode scene utterance
   <chr>                                  <chr>    <int>   <int> <int>     <int>
 1 There's nothing to tell! He's just so… Monica…      1       1     1         1
 2 C'mon, you're going out with the guy!… Joey T…      1       1     1         2
 3 All right Joey, be nice. So does he h… Chandl…      1       1     1         3
 4 Wait, does he eat chalk?               Phoebe…      1       1     1         4
 5 (They all stare, bemused.)             Scene …      1       1     1         5
 6 Just, 'cause, I don't want her to go … Phoebe…      1       1     1         6
 7 Okay, everybody relax. This is not ev… Monica…      1       1     1         7
 8 Sounds like a date to me.              Chandl…      1       1     1         8
 9 [Time Lapse]                           Scene …      1       1     1         9
10 Alright, so I'm back in high school, … Chandl…      1       1     1        10
# ℹ 67,363 more rows

The friends dataset has six columns. The first column, text, holds every line spoken on the show, plus scene directions such as “[Time Lapse]”.

Detecting patterns with `str_detect()`

Finding specific characters and repeats with `()` and `{}`

If we want to find a specific set of characters that is repeated multiple times, we can combine the regex special characters () and {}.

Let’s start by finding all cases where an actor said “I-”, which, when repeated, indicates stuttering from embarrassment or confusion. We’ll keep all rows containing that string with the filter() function from {dplyr}, using str_detect() from the {stringr} package as the condition (TRUE if the pattern exists in the string, FALSE otherwise).

friends %>%
  filter(str_detect(text, "I-"))

# A tibble: 770 × 6
   text                                   speaker season episode scene utterance
   <chr>                                  <chr>    <int>   <int> <int>     <int>
 1 That's one way! Me, I- I went for the… Paul t…      1       1     5         6
 2 You actually broke her watch? Wow! Th… Monica…      1       1     5         7
 3 What? I-I said you had a-              Monica…      1       1    15         3
 4 I- I think they're great! I, I really… Ross G…      1       2     4         4
 5 Oh, not much. I-I got a job.           Rachel…      1       2     9         2
 6 Hi, Mindy. Hi, it-it's Rachel. Yeah, … Rachel…      1       2    11        15
 7 Oh, no! I- I can't drink this now!     Phoebe…      1       3     3        36
 8 Well, I-I could live without it.       Rachel…      1       3     9        16
 9 You know.. you let your guard down, y… Phoebe…      1       3    12        32
10 Look, I- I could go on pretending-     Monica…      1       3    12        33
# ℹ 760 more rows
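To see what filter() and str_detect() are each doing, here is a minimal sketch on a tiny tibble. The three lines in toy are made up for illustration and are not rows from the friends dataset.

```r
library(dplyr)
library(stringr)

# Made-up lines for illustration (not rows from the friends dataset)
toy <- tibble(text = c("I- I went home.",
                       "Sounds like a date to me.",
                       "What? I-I said no."))

# str_detect() returns one TRUE/FALSE per string
has_stutter <- str_detect(toy$text, "I-")
has_stutter  # TRUE FALSE TRUE

# filter() keeps only the rows where the condition is TRUE
kept <- toy %>% filter(str_detect(text, "I-"))
kept
```

The second line is dropped because it contains no “I-”, so its condition is FALSE.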

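Wrapping the I- in () doesn’t change the result here, but we need it when checking for repeats. The () groups the characters inside it into a single unit, so that whatever follows the group (like a repeat count) applies to the whole group rather than only to the last character.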

friends %>%
  filter(str_detect(text, "(I-)"))

# A tibble: 770 × 6
   text                                   speaker season episode scene utterance
   <chr>                                  <chr>    <int>   <int> <int>     <int>
 1 That's one way! Me, I- I went for the… Paul t…      1       1     5         6
 2 You actually broke her watch? Wow! Th… Monica…      1       1     5         7
 3 What? I-I said you had a-              Monica…      1       1    15         3
 4 I- I think they're great! I, I really… Ross G…      1       2     4         4
 5 Oh, not much. I-I got a job.           Rachel…      1       2     9         2
 6 Hi, Mindy. Hi, it-it's Rachel. Yeah, … Rachel…      1       2    11        15
 7 Oh, no! I- I can't drink this now!     Phoebe…      1       3     3        36
 8 Well, I-I could live without it.       Rachel…      1       3     9        16
 9 You know.. you let your guard down, y… Phoebe…      1       3    12        32
10 Look, I- I could go on pretending-     Monica…      1       3    12        33
# ℹ 760 more rows

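If we want to find cases where actors say “I-” multiple times in a row, let’s say 2 times, we put the number of repeats inside curly brackets. Strictly, {2} matches exactly two consecutive repeats, but because str_detect() only checks whether the pattern occurs anywhere in the string, lines with three or more repeats are kept too. To explicitly ask for “2 or more”, we could write {2,}.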

friends %>%
  filter(str_detect(text, "(I-){2}"))

# A tibble: 152 × 6
   text                                   speaker season episode scene utterance
   <chr>                                  <chr>    <int>   <int> <int>     <int>
 1 Okay, but Pheebs, Pheebs, Jack gave u… Rachel…      1       4     7         8
 2 Hey, sorry I'm late. But my, uh, gran… Bobby …      1      10     7        36
 3 I-I-I don't know.                      Rachel…      1      24     3        17
 4 I forgot... I am supposed to pick up … Rachel…      1      24    10        30
 5 I-I-I gotta go, I gotta go. Okay, not… Janice…      3       1     3        10
 6 Lips moving, still talking. I mean it… Monica…      3       3     8        12
 7 I called the sperm bank today, they h… Joey T…      3       3    12         2
 8 Well, it was just, it was all so craz… Phoebe…      3       4     2        12
 9 Oh, no, no, no, don't go! I've scared… Chandl…      3       4     5        17
10 Well I-I-I, that kind of thing requir… Ross G…      3       5     3        21
# ℹ 142 more rows
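Why the () matters for repeats can be seen by comparing the pattern with and without the group. This is a small sketch; the strings in lines are made up for illustration.

```r
library(stringr)

# Made-up lines for illustration (not from the friends dataset)
lines <- c("I-I-I don't know.", "I-- what?", "I- I think so.")

# With the group, {2} repeats the whole "I-" unit: matches "I-I-"
grouped <- str_detect(lines, "(I-){2}")
grouped    # TRUE FALSE FALSE

# Without the group, {2} only repeats the "-" before it: matches "I--"
ungrouped <- str_detect(lines, "I-{2}")
ungrouped  # FALSE TRUE FALSE
```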

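If we want to match the string between 2 and 4 times, we use a , to separate the lower and upper bounds of the range. The row count stays at 152 because any line with at least two consecutive repeats already satisfies both patterns: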

friends %>%
  filter(str_detect(text, "(I-){2,4}"))

# A tibble: 152 × 6
   text                                   speaker season episode scene utterance
   <chr>                                  <chr>    <int>   <int> <int>     <int>
 1 Okay, but Pheebs, Pheebs, Jack gave u… Rachel…      1       4     7         8
 2 Hey, sorry I'm late. But my, uh, gran… Bobby …      1      10     7        36
 3 I-I-I don't know.                      Rachel…      1      24     3        17
 4 I forgot... I am supposed to pick up … Rachel…      1      24    10        30
 5 I-I-I gotta go, I gotta go. Okay, not… Janice…      3       1     3        10
 6 Lips moving, still talking. I mean it… Monica…      3       3     8        12
 7 I called the sperm bank today, they h… Joey T…      3       3    12         2
 8 Well, it was just, it was all so craz… Phoebe…      3       4     2        12
 9 Oh, no, no, no, don't go! I've scared… Chandl…      3       4     5        17
10 Well I-I-I, that kind of thing requir… Ross G…      3       5     3        21
# ℹ 142 more rows
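The different repeat counts select the same rows with str_detect(), but they do match different amounts of text. One way to see this is with str_extract(), which returns the matched text instead of TRUE/FALSE. The example string below is made up for illustration.

```r
library(stringr)

# Made-up line: "I-" five times, followed by "I don't know"
line <- "I-I-I-I-I-I don't know"

exactly_two <- str_extract(line, "(I-){2}")    # "I-I-"
two_to_four <- str_extract(line, "(I-){2,4}")  # "I-I-I-I-"
two_or_more <- str_extract(line, "(I-){2,}")   # "I-I-I-I-I-"

# str_detect() is TRUE for all three patterns, since each one
# finds at least one match somewhere in the string
all_detected <- str_detect(line, c("(I-){2}", "(I-){2,4}", "(I-){2,}"))
```

Ranges are greedy by default, so {2,4} takes as many repeats as it can, up to four.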

If we only want cases where an actor starts their line with the repeated “I-”, we use the regex special character ^, which anchors the pattern to the beginning of the string.

friends %>%
  filter(str_detect(text, "^(I-){2,4}"))

# A tibble: 22 × 6
   text                                   speaker season episode scene utterance
   <chr>                                  <chr>    <int>   <int> <int>     <int>
 1 I-I-I don't know.                      Rachel…      1      24     3        17
 2 I-I-I gotta go, I gotta go. Okay, not… Janice…      3       1     3        10
 3 I-I-I don't think so.                  Eric (…      3       6     7        19
 4 I-I-I-I-I don't know, I don't know wh… Phoebe…      3      13    12        17
 5 I-I-I...                               Ross G…      4       5     6        13
 6 I-I-I of course, I have more responsi… Rachel…      4       9     5         9
 7 I-I-I think you look great.            Monica…      4      14     3        52
 8 I-I-I don't care if I said some other… Ross G…      5       4     1        18
 9 I-I-I don't know if this falls under … Joey T…      5       5     8        20
10 I-I-I found it interesting.            Phoebe…      5      19     3       106
# ℹ 12 more rows

Conversely, to find a pattern at the end of a string, we use $. No line in the dataset ends with a repeated “I-”, so the result is empty.

friends %>%
  filter(str_detect(text, "(I-){2,4}$"))

# A tibble: 0 × 6
# ℹ 6 variables: text <chr>, speaker <chr>, season <int>, episode <int>,
#   scene <int>, utterance <int>
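The two anchors can be compared side by side on a few short strings. This is only a sketch; the lines are made up, and the third one is written to end in a stutter so that $ has something to match.

```r
library(stringr)

# Made-up lines for illustration (not from the friends dataset)
lines <- c("I-I-I don't know.", "Well I-I-I, maybe.", "But I-I-")

# ^ requires the match to begin at the start of the string
starts <- str_detect(lines, "^(I-){2,4}")
starts  # TRUE FALSE FALSE

# $ requires the match to finish at the end of the string
ends <- str_detect(lines, "(I-){2,4}$")
ends    # FALSE FALSE TRUE
```

The first line ends with a period, so the $ pattern does not match it even though it contains the stutter.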

Looking for different characters with `|` and `[]`

If we want to look for alternatives (“this” OR “that”), we use either | or [], depending on whether the alternatives are whole sequences of characters or a single character at one position in the string. For instance, if we want to find lines that start with either I will or I'll, we combine () with |. The | is a regex special character that means OR.

friends %>%
  filter(str_detect(text, "^(I will|I'll)"))

# A tibble: 211 × 6
   text                                   speaker season episode scene utterance
   <chr>                                  <chr>    <int>   <int> <int>     <int>
 1 I'll be fine, alright? Really, everyo… Ross G…      1       1     1        33
 2 I'll have whatever Christine is havin… Ross G…      1       1     2         3
 3 I will not take this abuse.            Joey T…      1       1    11        31
 4 I'll meet somone else. There'll be ot… Monica…      1       3    12        40
 5 I will, I will. See, I'm waiting for … Ross G…      1       7     8        50
 6 I'll never forget you.                 David        1      10     7        63
 7 I'll have you know that Gloria Tribbi… Joey T…      1      11     6        22
 8 I will go for that drink.              Janice…      1      14     6        27
 9 I'll be right back.                    Monica…      1      14     8        14
10 I'll even let you ring the bell.       Firema…      1      14    16         3
# ℹ 201 more rows
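The () is doing real work in that pattern: without it, the ^ only belongs to the first alternative. A small sketch with made-up lines shows the difference.

```r
library(stringr)

# Made-up lines for illustration (not from the friends dataset)
lines <- c("I'll be fine.", "I will not.", "Maybe I'll go.")

# Grouped: ^ applies to both alternatives
grouped <- str_detect(lines, "^(I will|I'll)")
grouped    # TRUE TRUE FALSE

# Ungrouped: reads as "^I will" OR "I'll" anywhere in the string
ungrouped <- str_detect(lines, "^I will|I'll")
ungrouped  # TRUE TRUE TRUE
```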

If instead we want to find cases where a single character can differ at a specific position in the string, we use []. For instance, looking for Huh or huh (notice the capitalization difference):

friends %>%
  filter(str_detect(text, "[hH]uh"))

# A tibble: 807 × 6
   text                                   speaker season episode scene utterance
   <chr>                                  <chr>    <int>   <int> <int>     <int>
 1 "Kinda puts that whole pillow thing i… Chandl…      1       2     3        46
 2 "Your folks are really that bad, huh?" Joey T…      1       2     5         1
 3 "Huh?!"                                Robbie       1       2     7         5
 4 "Okay. \"So. Whaddya want from me, Da… Chandl…      1       3     2         5
 5 "Huh!"                                 Phoebe…      1       3     8         5
 6 "Really? He does? Hey, buddy, what's … Chandl…      1       3    11         5
 7 "You had a rough day, huh.. c'mere."   Ross G…      1       3    14         9
 8 "C'mon, Ross! You, me, Joey, ice, guy… Chandl…      1       4     2        29
 9 "Uh-huh."                              Phoebe…      1       4    14         2
10 "No plans, huh?"                       Joey T…      1       5     2         2
# ℹ 797 more rows
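Note that [hH] only relaxes the first letter. If every letter should be allowed in either case, one option (an alternative to the approach above, not something the session used) is the regex() helper from {stringr} with ignore_case = TRUE. The words below are made up for illustration.

```r
library(stringr)

# Made-up strings for illustration
words <- c("Huh?!", "Uh-huh.", "HUH", "hut")

# [hH] allows either case at the first position only
first_letter <- str_detect(words, "[hH]uh")
first_letter  # TRUE TRUE FALSE FALSE

# regex(ignore_case = TRUE) ignores case at every position
any_case <- str_detect(words, regex("huh", ignore_case = TRUE))
any_case      # TRUE TRUE TRUE FALSE
```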

If we wanted to look for times when actors use huh as a question or to seek feedback (like “that is pretty cool, huh?”), we could try to use [hH]uh?, but in regex, ? is a special character. It tells regex that the character before the ? is optional: it may or may not be there. For instance, if we want to find all times when kind or kinda shows up, we do:

friends %>%
  filter(str_detect(text, "[kK]inda?"))

# A tibble: 708 × 6
   text                                   speaker season episode scene utterance
   <chr>                                  <chr>    <int>   <int> <int>     <int>
 1 Oh God... well, it started about a ha… Rachel…      1       1     1        56
 2 Ooh, I was kinda hoping that wouldn't… Rachel…      1       1     1        58
 3 Well, I was kinda supposed to be head… Rachel…      1       1     2        56
 4 Well actually thanks, but I think I'm… Rachel…      1       1     2        59
 5 No, it's, it's more of a fifth date k… Paul t…      1       1     8         3
 6 Y'know what? I think we can just leav… Rachel…      1       1    14        16
 7 Oh. Listen, do you think- and try not… Ross G…      1       1    14        41
 8 Marsha, these are cave people. Okay? … Ross G…      1       2     2         5
 9 Oh, I think this is the episode of Th… Chandl…      1       2     3         1
10 Kinda puts that whole pillow thing in… Chandl…      1       2     3        46
# ℹ 698 more rows
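A quick sketch of what the optional ? actually matches, using str_extract() to show the matched text. The strings are made up for illustration.

```r
library(stringr)

# Made-up strings for illustration
words <- c("kind", "kinda", "Kinda", "kin")

# The trailing "a" is optional, so both "kind" and "kinda" match
matched <- str_extract(words, "[kK]inda?")
matched  # "kind" "kinda" "Kinda" NA

# This is why "[hH]uh?" does not look for a question mark:
# the "?" makes the final "h" optional, so even "hut" matches (on "hu")
unescaped <- str_detect(c("huh?", "hut"), "[hH]uh?")
unescaped  # TRUE TRUE
```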

But if we want to find a literal ? in the string, we have to “escape” it with \\. Regex itself only needs a single \, but \ is also the escape character in R strings, so we have to write it twice. So to find cases where huh? is followed by a space (before another sentence) or comes at the end of the string (using ( |$)), we do:

friends %>%
  filter(str_detect(text, "[hH]uh\\?( |$)"))

# A tibble: 402 × 6
   text                                   speaker season episode scene utterance
   <chr>                                  <chr>    <int>   <int> <int>     <int>
 1 Your folks are really that bad, huh?   Joey T…      1       2     5         1
 2 C'mon, Ross! You, me, Joey, ice, guys… Chandl…      1       4     2        29
 3 No plans, huh?                         Joey T…      1       5     2         2
 4 So, you and Angela, huh?               Joey T…      1       5     9         1
 5 Oh, you got the whole night, huh?      Monica…      1       6     5        16
 6 Now, see, I don't believe any of that… Joey T…      1       8     4        18
 7 Wow, those pills really worked, huh?   Monica…      1       8    11        14
 8 Huh?                                   Guy          1       9     2         2
 9 Well, you were right. How can they do… Ross G…      1       9     2        29
10 Nice shoes, huh?                       Joey T…      1      10     3        21
# ℹ 392 more rows
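There are a couple of other ways to write the same escaped pattern, sketched below on made-up strings. The raw string syntax r"(...)" assumes R 4.0 or later; putting the ? inside [] is another common way to make it literal.

```r
library(stringr)

# Made-up strings for illustration
lines <- c("Nice shoes, huh?", "Uh-huh.", "Huh? What?")

# Double backslash: R turns "\\?" into the two characters \? for regex
escaped <- str_detect(lines, "[hH]uh\\?( |$)")
escaped  # TRUE FALSE TRUE

# Raw string (R >= 4.0): backslashes are taken literally, so one is enough
raw_pattern <- r"([hH]uh\?( |$))"
same_pattern <- identical(raw_pattern, "[hH]uh\\?( |$)")

# Character class: inside [], ? is an ordinary character
bracketed <- str_detect(lines, "[hH]uh[?]( |$)")
```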

Replacing a pattern with another string using `str_replace_all()`

A common use of regex is to find and replace text. There are two functions for that: str_replace(), which replaces only the first match of the pattern in each string, and str_replace_all(), which replaces every match. For instance, let’s replace all cases of huh? (a very common US word) with eh? (used the same way as huh?, but in Canada).

friends %>%
  filter(str_detect(text, "[hH]uh\\?")) %>%
  mutate(text = str_replace_all(text, "[hH]uh\\?", "eh?"))

# A tibble: 415 × 6
   text                                   speaker season episode scene utterance
   <chr>                                  <chr>    <int>   <int> <int>     <int>
 1 "Your folks are really that bad, eh?"  Joey T…      1       2     5         1
 2 "eh?!"                                 Robbie       1       2     7         5
 3 "Okay. \"So. Whaddya want from me, Da… Chandl…      1       3     2         5
 4 "C'mon, Ross! You, me, Joey, ice, guy… Chandl…      1       4     2        29
 5 "No plans, eh?"                        Joey T…      1       5     2         2
 6 "So, you and Angela, eh?"              Joey T…      1       5     9         1
 7 "Oh, you got the whole night, eh?"     Monica…      1       6     5        16
 8 "Now, see, I don't believe any of tha… Joey T…      1       8     4        18
 9 "Wow, those pills really worked, eh?"  Monica…      1       8    11        14
10 "eh?"                                  Guy          1       9     2         2
# ℹ 405 more rows
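The difference between the two functions shows up when a string contains more than one match. The sketch below uses a made-up line. The last call is an optional variation, not part of the session: passing a named vector to str_replace_all() applies several replacements, which here keeps the capital letter.

```r
library(stringr)

# Made-up line with two matches
line <- "Huh? You like that, huh?"

# str_replace() only changes the first match
first_only <- str_replace(line, "[hH]uh\\?", "eh?")
first_only   # "eh? You like that, huh?"

# str_replace_all() changes every match
every_match <- str_replace_all(line, "[hH]uh\\?", "eh?")
every_match  # "eh? You like that, eh?"

# Named vector: each name is a pattern, each value its replacement
keep_case <- str_replace_all(line, c("Huh\\?" = "Eh?", "huh\\?" = "eh?"))
keep_case    # "Eh? You like that, eh?"
```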

Done!
