In this session, we covered how to use regular expressions (also known as regex) to operate on character data (also called strings) using the {tidyverse} packages. We use the friends dataset provided by the {friends} package as an example. So first, let’s load up our packages!
library(tidyverse)
library(friends)
friends
# A tibble: 67,373 × 6
text speaker season episode scene utterance
<chr> <chr> <int> <int> <int> <int>
1 There's nothing to tell! He's just so… Monica… 1 1 1 1
2 C'mon, you're going out with the guy!… Joey T… 1 1 1 2
3 All right Joey, be nice. So does he h… Chandl… 1 1 1 3
4 Wait, does he eat chalk? Phoebe… 1 1 1 4
5 (They all stare, bemused.) Scene … 1 1 1 5
6 Just, 'cause, I don't want her to go … Phoebe… 1 1 1 6
7 Okay, everybody relax. This is not ev… Monica… 1 1 1 7
8 Sounds like a date to me. Chandl… 1 1 1 8
9 [Time Lapse] Scene … 1 1 1 9
10 Alright, so I'm back in high school, … Chandl… 1 1 1 10
# ℹ 67,363 more rows
The friends dataset has six columns. The first column, text, holds every line spoken during the show, as well as scene directions such as “[Time Lapse]”.
Detecting patterns with str_detect()
Finding specific characters and repeats with () and {}
If we want to find a specific set of characters that repeated multiple times, we can combine the regex () and {}.
Let’s first find all cases where the actor said “I-” to indicate stuttering from embarrassment or confusion. We’ll do this by keeping all rows that contain that string, using the filter() function from {dplyr} with str_detect() from the {stringr} package as the condition (if the pattern exists in a row, it is TRUE, otherwise FALSE).
friends %>%
  filter(str_detect(text, "I-"))
# A tibble: 770 × 6
text speaker season episode scene utterance
<chr> <chr> <int> <int> <int> <int>
1 That's one way! Me, I- I went for the… Paul t… 1 1 5 6
2 You actually broke her watch? Wow! Th… Monica… 1 1 5 7
3 What? I-I said you had a- Monica… 1 1 15 3
4 I- I think they're great! I, I really… Ross G… 1 2 4 4
5 Oh, not much. I-I got a job. Rachel… 1 2 9 2
6 Hi, Mindy. Hi, it-it's Rachel. Yeah, … Rachel… 1 2 11 15
7 Oh, no! I- I can't drink this now! Phoebe… 1 3 3 36
8 Well, I-I could live without it. Rachel… 1 3 9 16
9 You know.. you let your guard down, y… Phoebe… 1 3 12 32
10 Look, I- I could go on pretending- Monica… 1 3 12 33
# ℹ 760 more rows
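To see what str_detect() hands to filter(), it can help to run it on a small vector first. This is only a sketch: the three lines below are made up for illustration and are not taken from the dataset.

```r
library(stringr)

# Hypothetical lines, invented for illustration (not from the friends dataset)
lines <- c("I- I think so.", "Sounds like a date to me.", "Well, I-I could.")

# str_detect() returns one TRUE/FALSE per element of the input
has_stutter <- str_detect(lines, "I-")
has_stutter
# TRUE FALSE TRUE

# filter() keeps the rows where the condition is TRUE;
# subsetting a plain vector with the same logical vector works the same way
kept <- lines[has_stutter]
kept
```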
Wrapping the I- in () doesn’t change the result here, but we do need it when checking for repeats. The () groups the characters inside it, so that whatever follows the group (such as a repeat count) applies to the whole group rather than only to the last character.
friends %>%
  filter(str_detect(text, "(I-)"))
# A tibble: 770 × 6
text speaker season episode scene utterance
<chr> <chr> <int> <int> <int> <int>
1 That's one way! Me, I- I went for the… Paul t… 1 1 5 6
2 You actually broke her watch? Wow! Th… Monica… 1 1 5 7
3 What? I-I said you had a- Monica… 1 1 15 3
4 I- I think they're great! I, I really… Ross G… 1 2 4 4
5 Oh, not much. I-I got a job. Rachel… 1 2 9 2
6 Hi, Mindy. Hi, it-it's Rachel. Yeah, … Rachel… 1 2 11 15
7 Oh, no! I- I can't drink this now! Phoebe… 1 3 3 36
8 Well, I-I could live without it. Rachel… 1 3 9 16
9 You know.. you let your guard down, y… Phoebe… 1 3 12 32
10 Look, I- I could go on pretending- Monica… 1 3 12 33
# ℹ 760 more rows
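If we want to find cases where actors say “I-” multiple times in a row, let’s say 2 times, we put the number of times inside the curly brackets. Strictly, {2} means the group must repeat exactly 2 times in a row. But because str_detect() only checks whether a match exists somewhere in the string, lines with more repeats (like “I-I-I-I-I”) are found too, since they contain two consecutive repeats.
friends %>%
  filter(str_detect(text, "(I-){2}"))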
# A tibble: 152 × 6
text speaker season episode scene utterance
<chr> <chr> <int> <int> <int> <int>
1 Okay, but Pheebs, Pheebs, Jack gave u… Rachel… 1 4 7 8
2 Hey, sorry I'm late. But my, uh, gran… Bobby … 1 10 7 36
3 I-I-I don't know. Rachel… 1 24 3 17
4 I forgot... I am supposed to pick up … Rachel… 1 24 10 30
5 I-I-I gotta go, I gotta go. Okay, not… Janice… 3 1 3 10
6 Lips moving, still talking. I mean it… Monica… 3 3 8 12
7 I called the sperm bank today, they h… Joey T… 3 3 12 2
8 Well, it was just, it was all so craz… Phoebe… 3 4 2 12
9 Oh, no, no, no, don't go! I've scared… Chandl… 3 4 5 17
10 Well I-I-I, that kind of thing requir… Ross G… 3 5 3 21
# ℹ 142 more rows
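A small sketch of why the parentheses matter for {2}, using made-up strings rather than rows from the dataset. With the group, the whole of I- has to repeat. Without it, the {2} attaches only to the character just before it, the -.

```r
library(stringr)

# Hypothetical strings, invented for illustration
x <- c("I- I went", "I-I said", "I-I-I don't know", "I-- what?")

# With the group: "I-" must occur twice in a row
grouped <- str_detect(x, "(I-){2}")
grouped
# FALSE FALSE TRUE FALSE

# Without the group: {2} applies only to "-", so this looks for "I--"
ungrouped <- str_detect(x, "I-{2}")
ungrouped
# FALSE FALSE FALSE TRUE
```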
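If we want to match the group between 2 and 4 times in a row, we use a , to separate the range. Because str_detect() only reports whether a match exists, this returns the same rows as {2}: any line with at least two consecutive repeats already matches.
friends %>%
  filter(str_detect(text, "(I-){2,4}"))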
# A tibble: 152 × 6
text speaker season episode scene utterance
<chr> <chr> <int> <int> <int> <int>
1 Okay, but Pheebs, Pheebs, Jack gave u… Rachel… 1 4 7 8
2 Hey, sorry I'm late. But my, uh, gran… Bobby … 1 10 7 36
3 I-I-I don't know. Rachel… 1 24 3 17
4 I forgot... I am supposed to pick up … Rachel… 1 24 10 30
5 I-I-I gotta go, I gotta go. Okay, not… Janice… 3 1 3 10
6 Lips moving, still talking. I mean it… Monica… 3 3 8 12
7 I called the sperm bank today, they h… Joey T… 3 3 12 2
8 Well, it was just, it was all so craz… Phoebe… 3 4 2 12
9 Oh, no, no, no, don't go! I've scared… Chandl… 3 4 5 17
10 Well I-I-I, that kind of thing requir… Ross G… 3 5 3 21
# ℹ 142 more rows
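The difference between the repeat counts shows up once we look at what is matched rather than whether something matched. A minimal sketch with str_extract() on a made-up string. The {2,} form (2 or more, no upper limit) is standard regex syntax that is not covered above and is included only for comparison.

```r
library(stringr)

# Hypothetical string in which "I-" occurs five times in a row
x <- "I-I-I-I-I-I don't know"

two      <- str_extract(x, "(I-){2}")    # exactly two repeats
two_four <- str_extract(x, "(I-){2,4}")  # as many as possible, up to four
two_plus <- str_extract(x, "(I-){2,}")   # as many as possible, no upper limit

two       # "I-I-"
two_four  # "I-I-I-I-"
two_plus  # "I-I-I-I-I-"

# str_detect() cannot tell these apart: each pattern finds a match
all_match <- str_detect(x, c("(I-){2}", "(I-){2,4}", "(I-){2,}"))
```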
If we want to only find cases where an actor starts their line with a repeated “I-”, we use the regex special character ^, which tells regex to match the pattern only at the beginning of the string.
friends %>%
  filter(str_detect(text, "^(I-){2,4}"))
# A tibble: 22 × 6
text speaker season episode scene utterance
<chr> <chr> <int> <int> <int> <int>
1 I-I-I don't know. Rachel… 1 24 3 17
2 I-I-I gotta go, I gotta go. Okay, not… Janice… 3 1 3 10
3 I-I-I don't think so. Eric (… 3 6 7 19
4 I-I-I-I-I don't know, I don't know wh… Phoebe… 3 13 12 17
5 I-I-I... Ross G… 4 5 6 13
6 I-I-I of course, I have more responsi… Rachel… 4 9 5 9
7 I-I-I think you look great. Monica… 4 14 3 52
8 I-I-I don't care if I said some other… Ross G… 5 4 1 18
9 I-I-I don't know if this falls under … Joey T… 5 5 8 20
10 I-I-I found it interesting. Phoebe… 5 19 3 106
# ℹ 12 more rows
Conversely, if we want to find a pattern only at the end of a string, we use $. No line in the dataset ends with a repeated “I-”, so the result here is empty.
friends %>%
  filter(str_detect(text, "(I-){2,4}$"))
# A tibble: 0 × 6
# ℹ 6 variables: text <chr>, speaker <chr>, season <int>, episode <int>,
# scene <int>, utterance <int>
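Since the $ version returns nothing on the real data, here is a small sketch on made-up strings showing both anchors doing their job. The third string is invented so that there is something for $ to find.

```r
library(stringr)

# Hypothetical strings, invented for illustration
x <- c("I-I-I don't know", "Well I-I-I, that kind of thing", "But I-I-")

# ^ anchors the pattern to the start of the string
at_start <- str_detect(x, "^(I-){2,4}")
at_start
# TRUE FALSE FALSE

# $ anchors the pattern to the end of the string
at_end <- str_detect(x, "(I-){2,4}$")
at_end
# FALSE FALSE TRUE
```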
Looking for different characters with | and []
If we want to look for different characters (“this” OR “that”), we use either | or [], depending on whether the alternatives are whole sequences of characters or differ at a single position in the string. For instance, if we want to find patterns that are either I will or I'll at the beginning of a string, we combine () with |. The | is a regex special character that means OR.
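The command for this step is not shown here, so the following is only a sketch of what such a pattern could look like. The pattern ^(I will|I'll) is an assumption based on the description above, and the strings are made up for illustration. A pattern like this would go inside filter(str_detect(text, ...)) in the same way as the earlier examples.

```r
library(stringr)

# Hypothetical strings, invented for illustration
x <- c("I'll be fine, alright?", "I will not take this abuse.",
       "Well, I'll see.", "I went home.")

# Assumed pattern: either alternative inside (), anchored to the start with ^
pattern <- "^(I will|I'll)"

starts_with_will <- str_detect(x, pattern)
starts_with_will
# TRUE TRUE FALSE FALSE
```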
# A tibble: 211 × 6
text speaker season episode scene utterance
<chr> <chr> <int> <int> <int> <int>
1 I'll be fine, alright? Really, everyo… Ross G… 1 1 1 33
2 I'll have whatever Christine is havin… Ross G… 1 1 2 3
3 I will not take this abuse. Joey T… 1 1 11 31
4 I'll meet somone else. There'll be ot… Monica… 1 3 12 40
5 I will, I will. See, I'm waiting for … Ross G… 1 7 8 50
6 I'll never forget you. David 1 10 7 63
7 I'll have you know that Gloria Tribbi… Joey T… 1 11 6 22
8 I will go for that drink. Janice… 1 14 6 27
9 I'll be right back. Monica… 1 14 8 14
10 I'll even let you ring the bell. Firema… 1 14 16 3
# ℹ 201 more rows
If instead we want to find cases where a character differs at one specific position in the string, we use [], which matches any single one of the characters listed inside it. For instance, looking for Huh or huh (notice the capitalization difference):
friends %>%
  filter(str_detect(text, "[hH]uh"))
# A tibble: 807 × 6
text speaker season episode scene utterance
<chr> <chr> <int> <int> <int> <int>
1 "Kinda puts that whole pillow thing i… Chandl… 1 2 3 46
2 "Your folks are really that bad, huh?" Joey T… 1 2 5 1
3 "Huh?!" Robbie 1 2 7 5
4 "Okay. \"So. Whaddya want from me, Da… Chandl… 1 3 2 5
5 "Huh!" Phoebe… 1 3 8 5
6 "Really? He does? Hey, buddy, what's … Chandl… 1 3 11 5
7 "You had a rough day, huh.. c'mere." Ross G… 1 3 14 9
8 "C'mon, Ross! You, me, Joey, ice, guy… Chandl… 1 4 2 29
9 "Uh-huh." Phoebe… 1 4 14 2
10 "No plans, huh?" Joey T… 1 5 2 2
# ℹ 797 more rows
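A quick sketch of how [hH] behaves on a few made-up strings. Only the first position is flexible, so an all-caps HUH is not matched, while huh inside Uh-huh is.

```r
library(stringr)

# Hypothetical strings, invented for illustration
x <- c("Huh?!", "Uh-huh.", "HUH", "Who?")

# [hH] matches exactly one character, either "h" or "H", followed by "uh"
has_huh <- str_detect(x, "[hH]uh")
has_huh
# TRUE TRUE FALSE FALSE
```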
If we wanted to look for times when actors use huh as a question or to seek feedback (like “that is pretty cool, huh?”), we could try [hH]uh?, but in regex, ? is a special character. It tells regex that the character before the ? may or may not be there, so [hH]uh? would match hu followed by an optional h. That behaviour is useful in other cases. For instance, to find all times when either kinda or kind shows up, we do:
friends %>%
  filter(str_detect(text, "[kK]inda?"))
# A tibble: 708 × 6
text speaker season episode scene utterance
<chr> <chr> <int> <int> <int> <int>
1 Oh God... well, it started about a ha… Rachel… 1 1 1 56
2 Ooh, I was kinda hoping that wouldn't… Rachel… 1 1 1 58
3 Well, I was kinda supposed to be head… Rachel… 1 1 2 56
4 Well actually thanks, but I think I'm… Rachel… 1 1 2 59
5 No, it's, it's more of a fifth date k… Paul t… 1 1 8 3
6 Y'know what? I think we can just leav… Rachel… 1 1 14 16
7 Oh. Listen, do you think- and try not… Ross G… 1 1 14 41
8 Marsha, these are cave people. Okay? … Ross G… 1 2 2 5
9 Oh, I think this is the episode of Th… Chandl… 1 2 3 1
10 Kinda puts that whole pillow thing in… Chandl… 1 2 3 46
# ℹ 698 more rows
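To make the optional character visible, this sketch uses str_extract() on made-up strings to show which part was matched. The last line illustrates why [hH]uh? is not a good way to find questions: the ? makes the final h optional, so even a word like human matches.

```r
library(stringr)

# Hypothetical strings, invented for illustration
x <- c("kinda hoping", "that kind of thing", "Kinda puts", "king")

# The "a" before ? is optional, so both "kind" and "kinda" match
found <- str_extract(x, "[kK]inda?")
found
# "kinda" "kind" "Kinda" NA

# With an unescaped ?, the final "h" becomes optional: "hu" alone is enough
human_matches <- str_detect("human", "[hH]uh?")
human_matches
# TRUE
```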
But if we want to find a literal ? in the string, we have to “escape” it with \\. Regex itself only needs a single \, but inside an R string the \ must itself be escaped, so we type it twice. So to find cases where huh? is followed by a space (before another sentence) or comes at the end of the string (using ( |$)), we do:
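The command for this step is not shown here. The sketch below assumes the pattern was built by combining the pieces described above, [hH]uh, an escaped \\?, and ( |$), and tries it on made-up strings. The pattern is a guess at the original, not a quote of it.

```r
library(stringr)

# Hypothetical strings, invented for illustration
x <- c("No plans, huh?", "Huh? What did you say?", "Uh-huh.", "Huh?!")

# Assumed pattern: huh/Huh, a literal "?", then a space or the end of the string
pattern <- "[hH]uh\\?( |$)"

asks_huh <- str_detect(x, pattern)
asks_huh
# TRUE TRUE FALSE FALSE
```

The last string is not matched because the ? is followed by ! rather than by a space or the end of the string.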
# A tibble: 402 × 6
text speaker season episode scene utterance
<chr> <chr> <int> <int> <int> <int>
1 Your folks are really that bad, huh? Joey T… 1 2 5 1
2 C'mon, Ross! You, me, Joey, ice, guys… Chandl… 1 4 2 29
3 No plans, huh? Joey T… 1 5 2 2
4 So, you and Angela, huh? Joey T… 1 5 9 1
5 Oh, you got the whole night, huh? Monica… 1 6 5 16
6 Now, see, I don't believe any of that… Joey T… 1 8 4 18
7 Wow, those pills really worked, huh? Monica… 1 8 11 14
8 Huh? Guy 1 9 2 2
9 Well, you were right. How can they do… Ross G… 1 9 2 29
10 Nice shoes, huh? Joey T… 1 10 3 21
# ℹ 392 more rows
Replacing a pattern with another string using str_replace_all()
A common use of regex is to find text and replace it with something else. There are two functions for that: str_replace(), which replaces only the first match of the pattern in each string, and str_replace_all(), which replaces every match. For instance, let’s replace all cases of huh? (a very common US word) with eh? (used the same way as huh?, but in Canada).
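A minimal sketch of the difference between the two functions, on a made-up string. The pattern huh\\? is an assumption: it only covers the lowercase form, and [hH]uh\\? could be used to catch Huh? as well. On the dataset, the same call would typically sit inside mutate() to overwrite the text column.

```r
library(stringr)

# Hypothetical string containing the pattern twice
x <- "So, huh? Pretty cool, huh?"

# str_replace() changes only the first match in each string
first_only <- str_replace(x, "huh\\?", "eh?")
first_only
# "So, eh? Pretty cool, huh?"

# str_replace_all() changes every match
all_of_them <- str_replace_all(x, "huh\\?", "eh?")
all_of_them
# "So, eh? Pretty cool, eh?"
```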