Using regular expressions to help process character data

Wrangling and processing character data using ‘regular expressions’, a tool for matching patterns in character data.
character data
regular expressions
Authors

Yingxiao Yan

Luke W. Johnston

Published

January 8, 2024

Modified

May 13, 2024

In this session, we cover how to use regular expressions (also known as regex) to operate on character data (also called strings) using the {tidyverse} package. We use the friends dataset provided by the {friends} package as an example. So first, let’s load up our packages!

library(tidyverse)
library(friends)
friends
# A tibble: 67,373 × 6
   text                                   speaker season episode scene utterance
   <chr>                                  <chr>    <int>   <int> <int>     <int>
 1 There's nothing to tell! He's just so… Monica…      1       1     1         1
 2 C'mon, you're going out with the guy!… Joey T…      1       1     1         2
 3 All right Joey, be nice. So does he h… Chandl…      1       1     1         3
 4 Wait, does he eat chalk?               Phoebe…      1       1     1         4
 5 (They all stare, bemused.)             Scene …      1       1     1         5
 6 Just, 'cause, I don't want her to go … Phoebe…      1       1     1         6
 7 Okay, everybody relax. This is not ev… Monica…      1       1     1         7
 8 Sounds like a date to me.              Chandl…      1       1     1         8
 9 [Time Lapse]                           Scene …      1       1     1         9
10 Alright, so I'm back in high school, … Chandl…      1       1     1        10
# ℹ 67,363 more rows

The friends dataset has six columns. The first column, text, holds everything said by the actors during the show, plus a few scene notes such as rows 5 and 9 above.

Detecting patterns with str_detect()

Finding specific characters and repeats with () and {}

If we want to find a specific set of characters that is repeated multiple times, we can combine the regex special characters () and {}.

Let’s first find all cases where an actor said “I-” at least once, which often indicates stuttering from embarrassment or confusion. We’ll do this by keeping all rows with that string using the filter() function from {dplyr}, with str_detect() from the {stringr} package as the condition (if the pattern exists in the string, it is TRUE, otherwise FALSE).

friends %>%
  filter(str_detect(text, "I-"))
# A tibble: 770 × 6
   text                                   speaker season episode scene utterance
   <chr>                                  <chr>    <int>   <int> <int>     <int>
 1 That's one way! Me, I- I went for the… Paul t…      1       1     5         6
 2 You actually broke her watch? Wow! Th… Monica…      1       1     5         7
 3 What? I-I said you had a-              Monica…      1       1    15         3
 4 I- I think they're great! I, I really… Ross G…      1       2     4         4
 5 Oh, not much. I-I got a job.           Rachel…      1       2     9         2
 6 Hi, Mindy. Hi, it-it's Rachel. Yeah, … Rachel…      1       2    11        15
 7 Oh, no! I- I can't drink this now!     Phoebe…      1       3     3        36
 8 Well, I-I could live without it.       Rachel…      1       3     9        16
 9 You know.. you let your guard down, y… Phoebe…      1       3    12        32
10 Look, I- I could go on pretending-     Monica…      1       3    12        33
# ℹ 760 more rows
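
To see what str_detect() hands to filter(), it can help to run it on a few strings directly. The sketch below uses a small made-up vector (toy_lines is hypothetical, not taken from the friends dataset) and only assumes that {stringr} is installed.

```r
library(stringr)

# Hypothetical lines, invented for illustration.
toy_lines <- c("I- I went home", "It was fine", "What? I-I said no")

# str_detect() returns one TRUE/FALSE per string.
has_i_dash <- str_detect(toy_lines, "I-")
has_i_dash
# TRUE FALSE TRUE

# filter() keeps the rows where the condition is TRUE;
# plain subsetting does the same thing for a vector.
kept_lines <- toy_lines[has_i_dash]
kept_lines
```

The second string is dropped because it contains an “I” but no “I-”.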

Wrapping the I- in () doesn’t change the result, but we do need to do it when checking for repeats. Using () groups the characters between the () so that regex treats them as a single unit.

friends %>%
  filter(str_detect(text, "(I-)"))
# A tibble: 770 × 6
   text                                   speaker season episode scene utterance
   <chr>                                  <chr>    <int>   <int> <int>     <int>
 1 That's one way! Me, I- I went for the… Paul t…      1       1     5         6
 2 You actually broke her watch? Wow! Th… Monica…      1       1     5         7
 3 What? I-I said you had a-              Monica…      1       1    15         3
 4 I- I think they're great! I, I really… Ross G…      1       2     4         4
 5 Oh, not much. I-I got a job.           Rachel…      1       2     9         2
 6 Hi, Mindy. Hi, it-it's Rachel. Yeah, … Rachel…      1       2    11        15
 7 Oh, no! I- I can't drink this now!     Phoebe…      1       3     3        36
 8 Well, I-I could live without it.       Rachel…      1       3     9        16
 9 You know.. you let your guard down, y… Phoebe…      1       3    12        32
10 Look, I- I could go on pretending-     Monica…      1       3    12        33
# ℹ 760 more rows

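If we want to find cases where actors say “I-” multiple times in a row, let’s say 2 times, we put the number of times inside curly brackets. So (I-){2} matches “I-” exactly 2 times in a row (“I-I-”). Because str_detect() only checks whether the pattern occurs somewhere in the string, lines with 3 or more repeats are kept as well.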

friends %>%
  filter(str_detect(text, "(I-){2}"))
# A tibble: 152 × 6
   text                                   speaker season episode scene utterance
   <chr>                                  <chr>    <int>   <int> <int>     <int>
 1 Okay, but Pheebs, Pheebs, Jack gave u… Rachel…      1       4     7         8
 2 Hey, sorry I'm late. But my, uh, gran… Bobby …      1      10     7        36
 3 I-I-I don't know.                      Rachel…      1      24     3        17
 4 I forgot... I am supposed to pick up … Rachel…      1      24    10        30
 5 I-I-I gotta go, I gotta go. Okay, not… Janice…      3       1     3        10
 6 Lips moving, still talking. I mean it… Monica…      3       3     8        12
 7 I called the sperm bank today, they h… Joey T…      3       3    12         2
 8 Well, it was just, it was all so craz… Phoebe…      3       4     2        12
 9 Oh, no, no, no, don't go! I've scared… Chandl…      3       4     5        17
10 Well I-I-I, that kind of thing requir… Ross G…      3       5     3        21
# ℹ 142 more rows
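
A small sketch of why the () matters when counting repeats. The strings in toy_repeats are made up for illustration. Without the (), the {2} applies only to the character right before it (the -), so the pattern looks for “I--” instead of “I-I-”.

```r
library(stringr)

# Hypothetical lines, invented for illustration.
toy_repeats <- c("I- think so", "I-I- don't know", "I-I-I- gotta go", "Wait I-- what")

# With (): the whole group "I-" must occur twice in a row.
with_group <- str_detect(toy_repeats, "(I-){2}")
with_group
# FALSE TRUE TRUE FALSE

# Without (): only the "-" is repeated, so this looks for "I--".
without_group <- str_detect(toy_repeats, "I-{2}")
without_group
# FALSE FALSE FALSE TRUE
```

The third string has three repeats and is still detected by (I-){2}, since it contains two repeats in a row.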

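If we want to match the string between 2 and 4 times in a row, we use a , to separate the range. The rows kept are the same as with {2}, since any line with 2 to 4 repeats already contains 2 repeats: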

friends %>%
  filter(str_detect(text, "(I-){2,4}"))
# A tibble: 152 × 6
   text                                   speaker season episode scene utterance
   <chr>                                  <chr>    <int>   <int> <int>     <int>
 1 Okay, but Pheebs, Pheebs, Jack gave u… Rachel…      1       4     7         8
 2 Hey, sorry I'm late. But my, uh, gran… Bobby …      1      10     7        36
 3 I-I-I don't know.                      Rachel…      1      24     3        17
 4 I forgot... I am supposed to pick up … Rachel…      1      24    10        30
 5 I-I-I gotta go, I gotta go. Okay, not… Janice…      3       1     3        10
 6 Lips moving, still talking. I mean it… Monica…      3       3     8        12
 7 I called the sperm bank today, they h… Joey T…      3       3    12         2
 8 Well, it was just, it was all so craz… Phoebe…      3       4     2        12
 9 Oh, no, no, no, don't go! I've scared… Chandl…      3       4     5        17
10 Well I-I-I, that kind of thing requir… Ross G…      3       5     3        21
# ℹ 142 more rows
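
The difference between {2} and {2,4} only shows up when we look at what was matched, not at whether something matched. As a rough sketch, str_extract() from {stringr} pulls out the first match in each string. The toy_range vector is made up for illustration.

```r
library(stringr)

# Hypothetical lines, invented for illustration.
toy_range <- c("I- no", "I-I-I yes", "I-I-I-I-I-I many")

# Detection is identical for both patterns.
detect_two <- str_detect(toy_range, "(I-){2}")
detect_range <- str_detect(toy_range, "(I-){2,4}")

# Extraction differs: {2} stops after 2 repeats,
# {2,4} takes as many as it can, up to 4.
extract_two <- str_extract(toy_range, "(I-){2}")
extract_two
# NA "I-I-" "I-I-"

extract_range <- str_extract(toy_range, "(I-){2,4}")
extract_range
# NA "I-I-" "I-I-I-I-"
```

So the range matters for functions that extract or replace the match, even when it makes no difference for str_detect().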

If we want to only find cases where an actor starts their line with a repeated “I-”, we use the regex special character ^, which tells regex to only match the pattern at the beginning of the string.

friends %>%
  filter(str_detect(text, "^(I-){2,4}"))
# A tibble: 22 × 6
   text                                   speaker season episode scene utterance
   <chr>                                  <chr>    <int>   <int> <int>     <int>
 1 I-I-I don't know.                      Rachel…      1      24     3        17
 2 I-I-I gotta go, I gotta go. Okay, not… Janice…      3       1     3        10
 3 I-I-I don't think so.                  Eric (…      3       6     7        19
 4 I-I-I-I-I don't know, I don't know wh… Phoebe…      3      13    12        17
 5 I-I-I...                               Ross G…      4       5     6        13
 6 I-I-I of course, I have more responsi… Rachel…      4       9     5         9
 7 I-I-I think you look great.            Monica…      4      14     3        52
 8 I-I-I don't care if I said some other… Ross G…      5       4     1        18
 9 I-I-I don't know if this falls under … Joey T…      5       5     8        20
10 I-I-I found it interesting.            Phoebe…      5      19     3       106
# ℹ 12 more rows

Conversely, if we want to find a pattern that is at the end of a string, we use $. In this case, no line ends with a repeated “I-”, so no rows are returned.

friends %>%
  filter(str_detect(text, "(I-){2,4}$"))
# A tibble: 0 × 6
# ℹ 6 variables: text <chr>, speaker <chr>, season <int>, episode <int>,
#   scene <int>, utterance <int>
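
Since the dataset has no lines ending in a repeated “I-”, here is a small sketch with made-up strings (toy_anchor is hypothetical) that shows both anchors finding something.

```r
library(stringr)

# Hypothetical lines, invented for illustration.
toy_anchor <- c("I-I- don't know", "Well, I-I- guess", "But I-I-")

# ^ : the repeats must be at the very start of the string.
at_start <- str_detect(toy_anchor, "^(I-){2,4}")
at_start
# TRUE FALSE FALSE

# $ : the repeats must be at the very end of the string.
at_end <- str_detect(toy_anchor, "(I-){2,4}$")
at_end
# FALSE FALSE TRUE
```

The second string contains the pattern, but in the middle, so neither anchor matches it.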

Looking for different characters with | and []

If we want to look for different characters (“this” OR “that”), we use either | or [], depending on whether the alternatives span several characters or differ at a single position in the string. For instance, if we want to find patterns that are either I will or I'll at the beginning of a string, we combine () with |. The | is a regex special character that means OR.

friends %>%
  filter(str_detect(text, "^(I will|I'll)"))
# A tibble: 211 × 6
   text                                   speaker season episode scene utterance
   <chr>                                  <chr>    <int>   <int> <int>     <int>
 1 I'll be fine, alright? Really, everyo… Ross G…      1       1     1        33
 2 I'll have whatever Christine is havin… Ross G…      1       1     2         3
 3 I will not take this abuse.            Joey T…      1       1    11        31
 4 I'll meet somone else. There'll be ot… Monica…      1       3    12        40
 5 I will, I will. See, I'm waiting for … Ross G…      1       7     8        50
 6 I'll never forget you.                 David        1      10     7        63
 7 I'll have you know that Gloria Tribbi… Joey T…      1      11     6        22
 8 I will go for that drink.              Janice…      1      14     6        27
 9 I'll be right back.                    Monica…      1      14     8        14
10 I'll even let you ring the bell.       Firema…      1      14    16         3
# ℹ 201 more rows
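
The () around the alternatives decides what the ^ applies to. A minimal sketch with made-up strings (toy_will is hypothetical): without the (), the pattern reads as “I will at the start” OR “I'll anywhere”.

```r
library(stringr)

# Hypothetical lines, invented for illustration.
toy_will <- c("I'll be fine", "I will go", "Maybe I'll stay", "Until I will")

# With (): ^ applies to both alternatives.
grouped <- str_detect(toy_will, "^(I will|I'll)")
grouped
# TRUE TRUE FALSE FALSE

# Without (): ^ only applies to "I will"; "I'll" can be anywhere.
ungrouped <- str_detect(toy_will, "^I will|I'll")
ungrouped
# TRUE TRUE TRUE FALSE
```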

If instead we want to find cases where a character is different at a specific location in the string, we use []. The [] matches any single one of the characters listed inside it. For instance, looking for Huh or huh (notice the capitalization difference):

friends %>%
  filter(str_detect(text, "[hH]uh"))
# A tibble: 807 × 6
   text                                   speaker season episode scene utterance
   <chr>                                  <chr>    <int>   <int> <int>     <int>
 1 "Kinda puts that whole pillow thing i… Chandl…      1       2     3        46
 2 "Your folks are really that bad, huh?" Joey T…      1       2     5         1
 3 "Huh?!"                                Robbie       1       2     7         5
 4 "Okay. \"So. Whaddya want from me, Da… Chandl…      1       3     2         5
 5 "Huh!"                                 Phoebe…      1       3     8         5
 6 "Really? He does? Hey, buddy, what's … Chandl…      1       3    11         5
 7 "You had a rough day, huh.. c'mere."   Ross G…      1       3    14         9
 8 "C'mon, Ross! You, me, Joey, ice, guy… Chandl…      1       4     2        29
 9 "Uh-huh."                              Phoebe…      1       4    14         2
10 "No plans, huh?"                       Joey T…      1       5     2         2
# ℹ 797 more rows
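
As a quick check, [hH]uh should behave the same as writing out both alternatives with |. The sketch below uses made-up strings (toy_huh is hypothetical).

```r
library(stringr)

# Hypothetical lines, invented for illustration.
toy_huh <- c("Huh?", "uh-huh", "Uh oh", "HUH")

# [hH] matches exactly one character: either "h" or "H".
with_class <- str_detect(toy_huh, "[hH]uh")
with_class
# TRUE TRUE FALSE FALSE

# The same search written with |.
with_or <- str_detect(toy_huh, "Huh|huh")
with_or
# TRUE TRUE FALSE FALSE
```

“HUH” is not detected because only the first letter is allowed to vary; the u and final h must be lowercase.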

If we wanted to look for times when actors use huh as a question or to seek feedback (like “that is pretty cool, huh?”), we could try to use [hH]uh?, but in regex, a ? is a special character. When you use ?, it tells regex that the character before the ? may or may not be there, so [hH]uh? would match hu as well as huh. For instance, if we want to find all times when kinda and kind show up, we do:

friends %>%
  filter(str_detect(text, "[kK]inda?"))
# A tibble: 708 × 6
   text                                   speaker season episode scene utterance
   <chr>                                  <chr>    <int>   <int> <int>     <int>
 1 Oh God... well, it started about a ha… Rachel…      1       1     1        56
 2 Ooh, I was kinda hoping that wouldn't… Rachel…      1       1     1        58
 3 Well, I was kinda supposed to be head… Rachel…      1       1     2        56
 4 Well actually thanks, but I think I'm… Rachel…      1       1     2        59
 5 No, it's, it's more of a fifth date k… Paul t…      1       1     8         3
 6 Y'know what? I think we can just leav… Rachel…      1       1    14        16
 7 Oh. Listen, do you think- and try not… Ross G…      1       1    14        41
 8 Marsha, these are cave people. Okay? … Ross G…      1       2     2         5
 9 Oh, I think this is the episode of Th… Chandl…      1       2     3         1
10 Kinda puts that whole pillow thing in… Chandl…      1       2     3        46
# ℹ 698 more rows
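
One thing to keep in mind is that the pattern is matched anywhere inside the string, including inside longer words. A small sketch with made-up strings (toy_kind is hypothetical), using str_extract() to show what was actually matched:

```r
library(stringr)

# Hypothetical lines, invented for illustration.
toy_kind <- c("kinda weird", "Kind of", "king", "kindness")

# The "a" is optional, so both "kind" and "kinda" match.
found_kind <- str_detect(toy_kind, "[kK]inda?")
found_kind
# TRUE TRUE FALSE TRUE

# str_extract() returns the first match, or NA if there is none.
matched_kind <- str_extract(toy_kind, "[kK]inda?")
matched_kind
# "kinda" "Kind" NA "kind"
```

“kindness” is detected as well, because it starts with “kind”.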

But if we want to find a literal ? in the string, we have to “escape” it with \\. Regex itself only needs a single \, but since \ is also the escape character in R strings, we have to write it twice in R. So to find cases where huh? is asked before another sentence or at the end of the string (using ( |$)) we do:

friends %>%
  filter(str_detect(text, "[hH]uh\\?( |$)"))
# A tibble: 402 × 6
   text                                   speaker season episode scene utterance
   <chr>                                  <chr>    <int>   <int> <int>     <int>
 1 Your folks are really that bad, huh?   Joey T…      1       2     5         1
 2 C'mon, Ross! You, me, Joey, ice, guys… Chandl…      1       4     2        29
 3 No plans, huh?                         Joey T…      1       5     2         2
 4 So, you and Angela, huh?               Joey T…      1       5     9         1
 5 Oh, you got the whole night, huh?      Monica…      1       6     5        16
 6 Now, see, I don't believe any of that… Joey T…      1       8     4        18
 7 Wow, those pills really worked, huh?   Monica…      1       8    11        14
 8 Huh?                                   Guy          1       9     2         2
 9 Well, you were right. How can they do… Ross G…      1       9     2        29
10 Nice shoes, huh?                       Joey T…      1      10     3        21
# ℹ 392 more rows
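
To make the effect of escaping visible, the sketch below compares the unescaped and escaped patterns on a few made-up strings (toy_ask is hypothetical). writeLines() prints the string the way the regex engine receives it, with a single \.

```r
library(stringr)

# What R writes as "\\?" reaches the regex engine as \?
writeLines("[hH]uh\\?")
# [hH]uh\?

# Hypothetical lines, invented for illustration.
toy_ask <- c("Nice shoes, huh?", "uh-huh.", "Huh? What now", "huh?!")

# Unescaped: ? makes the last "h" optional, so "hu" is enough.
unescaped <- str_detect(toy_ask, "[hH]uh?")
unescaped
# TRUE TRUE TRUE TRUE

# Escaped: a literal "?" must follow "huh".
escaped <- str_detect(toy_ask, "[hH]uh\\?")
escaped
# TRUE FALSE TRUE TRUE

# Escaped, and followed by a space or the end of the string.
escaped_end <- str_detect(toy_ask, "[hH]uh\\?( |$)")
escaped_end
# TRUE FALSE TRUE FALSE
```

The last string, “huh?!”, has a literal ? but is followed by !, so the ( |$) condition excludes it.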

Replacing a pattern with another string using str_replace_all()

A common use of regex is to find and replace text with something else. There are two functions to do that: str_replace(), which replaces only the first match of the pattern in each string, and str_replace_all(), which replaces every match. For instance, let’s replace all cases of huh? (a very common US word) with eh? (used the same way as huh?, but in Canada).

friends %>%
  filter(str_detect(text, "[hH]uh\\?")) %>%
  mutate(text = str_replace_all(text, "[hH]uh\\?", "eh?"))
# A tibble: 415 × 6
   text                                   speaker season episode scene utterance
   <chr>                                  <chr>    <int>   <int> <int>     <int>
 1 "Your folks are really that bad, eh?"  Joey T…      1       2     5         1
 2 "eh?!"                                 Robbie       1       2     7         5
 3 "Okay. \"So. Whaddya want from me, Da… Chandl…      1       3     2         5
 4 "C'mon, Ross! You, me, Joey, ice, guy… Chandl…      1       4     2        29
 5 "No plans, eh?"                        Joey T…      1       5     2         2
 6 "So, you and Angela, eh?"              Joey T…      1       5     9         1
 7 "Oh, you got the whole night, eh?"     Monica…      1       6     5        16
 8 "Now, see, I don't believe any of tha… Joey T…      1       8     4        18
 9 "Wow, those pills really worked, eh?"  Monica…      1       8    11        14
10 "eh?"                                  Guy          1       9     2         2
# ℹ 405 more rows
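
The difference between the two functions is easiest to see on a single string that contains the pattern more than once. A minimal sketch with a made-up sentence (toy_sentence is hypothetical):

```r
library(stringr)

# Hypothetical line, invented for illustration.
toy_sentence <- "Huh? You like it, huh? Uh-huh."

# str_replace() only replaces the first match.
first_only <- str_replace(toy_sentence, "[hH]uh\\?", "eh?")
first_only
# "eh? You like it, huh? Uh-huh."

# str_replace_all() replaces every match.
all_matches <- str_replace_all(toy_sentence, "[hH]uh\\?", "eh?")
all_matches
# "eh? You like it, eh? Uh-huh."
```

Note that the ? in the replacement string does not need escaping, and that a capitalized Huh? also becomes a lowercase eh?, as in the output above.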

Done!