---
title: "R Notebook: Introduction to Strings"
output: html_notebook
author: Kyle Rasku
---
References:
Wickham, C., Jeon, T. & Cotton, R. (2021). String Manipulation with stringr in R; DataCamp https://app.datacamp.com/learn/courses/string-manipulation-with-stringr-in-r
and
Wickham, H. & Grolemund, G. (2017) R for Data Science, O'Reilly and Associates
<b>Freely accessible here:</b> https://r4ds.had.co.nz/index.html
```{r}
library(tidyverse)
```
PART 1
Creating and Joining Strings
Dealing with Quotes and Single Quotes within a String
```{r}
# You can begin and end with either single or double quotes
String1 <- "This is a string"
String2 <- 'Another "awesome" string!'
# If you choose to use double or single quotes within a string delimited by that character,
# you need to use the backslash, \, to escape the internal instance of the character
String3 <- "The title is \"Escape Quotes from Alcatraz\""
String4 <- 'He said, \'Nobody knows the trouble I\'ve seen!\''
# Just so you know, to type an actual backslash, you need 2 backslashes
String5 <- "The path to my file is C:\\Stuff\\Nerdy"
# Use the 'writeLines' function to see what things will look like
# writeLines is a base R function; it prints the string itself, without the surrounding quotes or the escape backslashes
writeLines(String1)
writeLines(String2)
writeLines(String3)
writeLines(String4)
writeLines(String5)
```
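The difference between how a string is typed and what it actually contains can be checked directly. The sketch below is a minimal illustration using only base R; the variable names and values are made up for this example, and the raw-string syntax assumes R 4.0.0 or later.

```r
# Illustrative values (not part of the notebook's data)
path <- "C:\\Stuff\\Nerdy"
quoted <- "She said \"hi\""

# print() shows the escaped form, as you would type it
print(path)
# writeLines() shows the characters the string really contains
writeLines(path)
writeLines(quoted)

# Each escape sequence is stored as a single character
nchar(path)    # 14
nchar(quoted)  # 13

# Raw strings (assumes R >= 4.0.0) take backslashes literally
raw_path <- r"(C:\Stuff\Nerdy)"
identical(path, raw_path)  # TRUE
```

Raw strings are convenient for Windows paths and regular expressions, where doubled backslashes quickly become hard to read.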
PART 2
Joining Strings with Base R's paste and paste0
```{r}
# The paste function joins its arguments into a single string
# By default, the arguments are separated by a space
# The optional sep argument sets the separator that goes between them
paste("a", "e", "i", "o", "u", "y")
paste("The English vowels are a", "e", "i", "o", "u", "and y", sep = ", ")
```
```{r}
# paste, paste0 and str_c: comparing behavior
nyc <- c("New York", "Big Apple", "Broadway")
la <- c("Los Angeles", "Tinseltown", "Hollywood")
paste(nyc, la, sep = ", ")
paste0(nyc, la)
str_c(nyc, "=", la, collapse="; ")
```
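The roles of `sep` and `collapse` are easy to mix up, so here is a minimal base R sketch of the difference. It reuses the `nyc` and `la` vectors from the chunk above; the names `pairs` and `one_line` are introduced only for this illustration.

```r
nyc <- c("New York", "Big Apple", "Broadway")
la  <- c("Los Angeles", "Tinseltown", "Hollywood")

# sep joins the arguments element by element: the result still has 3 strings
pairs <- paste(nyc, la, sep = " / ")
length(pairs)     # 3

# collapse then joins those 3 strings into a single string
one_line <- paste(nyc, la, sep = " / ", collapse = "; ")
length(one_line)  # 1
writeLines(one_line)

# paste0 behaves like paste with sep = ""
identical(paste0(nyc, la), paste(nyc, la, sep = ""))  # TRUE
```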
```{r}
# Combining paste commands to create more complex output
teachers <- c("Dr. Moody", "Dr. Barnes", "Dr. Mittal", "Dr. Jiminez")
courses <- c("Introduction to Anatomy", "Biochemistry", "Microbiology", "Fundamentals of Medicine")
locations <- c("Hewitt 2431", "Ogilvy 45", "Ogilvy 58", "Hewitt 881")
paste(teachers, "will be your instructor for", courses, "at", locations)
# blank line
writeLines("")
# or assign output vectors to a variable, to create a single output with writeLines:
schedule <- paste(teachers, "will be your instructor for", courses, "at", locations)
writeLines(schedule)
```
```{r}
action <- "go"
paste("Going to a",
paste(rep(action, 2), collapse = "-"),
collapse = " ")
nonsense_word <- "tu"
paste("Put two in your",
paste(rep(nonsense_word, 2), collapse = "-"),
collapse = " ")
```
PART 3
Counting the number of code points in a string
```{r}
# Returns the number of code points in a string
str_length("The quick brown fox jumps over the lazy dog.")
str_length(c("String One", "String Two Fish", "Red Fish, Blue Fish"))
```
PART 4
Combining Strings with str_c
```{r}
prefix <- "It was a dark and stormy "
str_c(prefix, "night.")
str_c(prefix, "boat ride.")
str_c(prefix, "drink.")
```
```{r}
# Using the sep argument to control how strings are separated
str_c("123456", "11/15/2021", "Digestive Health Exam", "92.2", "Pass", "97", sep = ";")
```
```{r}
# Turn missing values into the string "NA" with str_replace_na
record <- c(567322, "12/27/21", "Johnson, George A.", NA, "Spring 2022", "3.55")
str_replace_na(record)
```
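One reason `str_replace_na` matters is that missing values are contagious in `str_c`. The sketch below uses a shortened, made-up record to show this; the replacement text "missing" is just an example value.

```r
library(stringr)

# A shortened, made-up record with one missing field
record <- c("567322", NA, "Spring 2022")

# A single NA makes the whole collapsed result NA
joined_raw <- str_c(record, collapse = ";")
is.na(joined_raw)  # TRUE

# Converting NA to the string "NA" first keeps the other fields
joined_safe <- str_c(str_replace_na(record), collapse = ";")
joined_safe        # "567322;NA;Spring 2022"

# The replacement text can be changed
joined_label <- str_c(str_replace_na(record, replacement = "missing"), collapse = ";")
joined_label       # "567322;missing;Spring 2022"
```

Base R's `paste` behaves differently: it converts `NA` to the string "NA" on its own, which is convenient but can hide missing data.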
```{r}
# Auto-vectorization with str_c
students <- str_c("Student ", c("125542", "126781", "124872"), " Academic Year ", c("2020-21", "2020-21", "2021-22"))
writeLines(students)
```
```{r}
# Using an if statement with str_c
name <- "Dr. Amy Pace"
course <- "Introduction to Pediatrics"
course2 <- TRUE
str_c(
"Your instructor will be ", name, " for ", course,
if (course2) ", and the following clinical rotation",
"."
)
```
```{r}
# Using str_c with collapse
# NOTE: parse(text = ...) turns a string such as "name1" into an unevaluated expression, and eval()
# then evaluates that expression, returning the value of the variable with that name.
# For a simple lookup like this one, get(variable) does the same thing.
name1 <- c("Jodie", "A", "Pierce")
name2 <- c("Chantell", "B", "Richards")
name3 <- c("Eric", "J", "Johnson")
name4 <- c("Anna", "R", "Yurchenko")
for (x in 1:4) {
variable <- str_c("name", x)
print(str_c(eval(parse(text = variable)) , collapse = " "))
}
```
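The `eval(parse(...))` pattern works, but there are simpler ways to get the same result. The following is a sketch of two alternatives in base R, using only two of the name vectors; `all_names` and `full_names` are names introduced for this illustration.

```r
name1 <- c("Jodie", "A", "Pierce")
name2 <- c("Chantell", "B", "Richards")

# get() looks up a variable by its name, with no expression to build or evaluate
for (x in 1:2) {
  print(paste(get(paste0("name", x)), collapse = " "))
}

# Storing the vectors in a list avoids looking up names altogether
all_names <- list(name1, name2)
full_names <- vapply(all_names, paste, character(1), collapse = " ")
full_names
```

The list version is usually easier to extend, because adding a person means adding an element rather than inventing another numbered variable.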
PART 5
Sub-setting strings with str_sub
```{r}
# Sub-setting strings with str_sub
courses <- c("FMS501", "FMS511", "LMH513")
prefixes <- str_sub(courses, 1, 3)
writeLines(prefixes)
# negative numbers count backwards from end
str_sub(courses, -3, -1)
```
```{r}
# Using str_sub to perform transformation
scale_values <- c("Disagree", "Neutral", "Agree")
str_sub(scale_values, 1, 1) <- str_to_lower(str_sub(scale_values, 1, 1))
print(scale_values)
```
PART 6
Finding Patterns in Strings with str_starts and str_ends
```{r}
places <- c("New York City", "Jefferson City", "Charlotte Amalie", "Honolulu", "Montpelier", "Pago Pago")
str_starts(places, "[ABCDEFGHIJ]")
str_ends(places, "City")
str_ends(places, "City", negate=TRUE)
places[str_starts(places, "[ABCDEFGHIJ]")]
```
PART 7
More on Pattern Matching with Strings
```{r}
# Using str_detect to find a pattern in a string or vector of strings
string_vector <- c("alibi", "dizzy", "launch", "hazardous", "epiphany", "alliteration", "demotic", "mellifluous")
contains_z <- str_detect(string_vector, pattern = fixed("z"))
print(contains_z)
sum(contains_z)
string_vector[contains_z]
```
```{r}
# str_subset returns the strings that contain a match
# the pattern can be a fixed string, a regular expression (as below), or a rebus expression
# [aeiouy]{2} matches any character in the brackets [] exactly 2 times in a row,
# so the first call returns the words with two consecutive vowels (counting y) somewhere in the word;
# the second call, with {3}, returns the words with three consecutive vowels
# for more on using regular expressions, visit https://www.regular-expressions.info/tutorial.html
str_subset(string_vector, '[aeiouy]{2}')
str_subset(string_vector, '[aeiouy]{3}')
```
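To make the difference between a regular expression and a fixed pattern concrete, here is a small sketch. The vector `sample_words` is made up for this example and includes one string with a literal dot.

```r
library(stringr)

# Made-up example words; "a.b" contains a literal dot
sample_words <- c("launch", "dizzy", "mellifluous", "a.b")

# {2}: two vowels in a row ("au" in launch, "uo" in mellifluous)
two_vowels <- str_subset(sample_words, "[aeiouy]{2}")
two_vowels    # "launch" "mellifluous"

# {3}: three vowels in a row ("uou" in mellifluous)
three_vowels <- str_subset(sample_words, "[aeiouy]{3}")
three_vowels  # "mellifluous"

# As a regular expression, "." matches any character, so every word matches
any_char <- str_detect(sample_words, ".")

# With fixed(), "." is only a literal dot
literal_dot <- str_detect(sample_words, fixed("."))
literal_dot   # FALSE FALSE FALSE TRUE
```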
```{r}
# Using str_count with string_vector
writeLines(string_vector)
counts <- str_count(string_vector, 'z')
counts
sum(counts)
```
```{r}
# Using str_count with a pattern to create a vector of counts
num_as <- str_count(string_vector, pattern = fixed("a"))
hist(num_as)
```
```{r}
# Using a boolean expression to subset a string vector
string_vector[num_as>1]
```
PART 8
Splitting Strings
```{r}
# Splitting strings
sentence <- "The course was taught last semester by Dr. Joe Amonsoovic. He really was the greatest!"
str_split(sentence, " ", simplify = TRUE)
words_from_sentence <- str_split(sentence, " ", simplify = TRUE)
instructor <- paste(words_from_sentence[1,9:10], collapse = " ")
instructor
# Use two backslashes to escape the dot so that it is treated as a literal character
# Otherwise, the pattern is read as the 'regular expression dot', which matches any character
instructor <- str_remove(instructor, "\\.")
writeLines(instructor)
```
```{r}
# Splitting strings
chipmunks <- "Alvin & Simon & Theodore & Sabrina"
str_split(chipmunks, pattern = " & ")
str_split(chipmunks, pattern = " & ", n = 3)
```
```{r}
# Splitting a vector of strings into a matrix using the simplify = TRUE argument
names <- c("Smith, Archie", "Tuttle, Jane", "Yuskitas, Anne")
mNames <- str_split(names, fixed(", "), simplify = TRUE)
first_names <- mNames[,2]
last_names <- mNames[,1]
mNames
writeLines("\n")
first_names
last_names
```
PART 9
WORKING WITH POE LINES
```{r}
# Developing and executing a simple algorithm to find interesting words in a corpus
poe_lines <- '"Seldom we find," says Solomon Don Dunce, "Half an idea in the profoundest sonnet. Through all the flimsy things we see at once As easily as through a Naples bonnet- Trash of all trash!- how can a lady don it? Yet heavier far than your Petrarchan stuff- Owl-downy nonsense that the faintest puff Twirls into trunk-paper the while you con it."'
# remove punctuation (inside a character class [], ? and . are already literal, so the backslashes here are optional but harmless)
poe_lines <- str_replace_all(poe_lines, '[,"\\?\\.!]', '')
# obtain a lower-case vector
words <- str_to_lower(str_split(poe_lines, " ", simplify = TRUE))
# remove end-of-word dashes (the $ anchors the match to the end of the word, so only a trailing dash is removed;
# outside a character class a dash has no special regex meaning, so the backslash is optional)
words <- str_remove(words, "\\-$")
# remove duplicates
words <- words[!duplicated(words)]
# number of characters per word
word_lengths <- str_length(words)
# average word-length
avg_word_length <- mean(word_lengths)
# use the vector of lengths to create a boolean indexing vector to get all the words that are longer than average
interesting_words <- words[word_lengths>avg_word_length]
interesting_words
```
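The escaping rules used in the chunk above can be checked on small inputs. This is a sketch with made-up strings: it shows that `?` and `.` are literal inside a character class, that a dot outside a class needs escaping, and that `$` restricts a match to the end of the string.

```r
library(stringr)

# Inside a character class, ? and . are literal, so no backslashes are needed
no_punct <- str_replace_all("Who? Me.", "[?.]", "")
no_punct      # "Who Me"

# Outside a class, an unescaped dot matches every character
all_gone <- str_replace_all("a.b", ".", "")
all_gone      # ""

# Escaping the dot limits the match to a literal dot
dot_removed <- str_replace_all("a.b", "\\.", "")
dot_removed   # "ab"

# A plain dash is literal; $ ties it to the end of the string,
# so the dash inside "trunk-paper" is left alone
trimmed <- str_remove(c("bonnet-", "trunk-paper"), "-$")
trimmed       # "bonnet" "trunk-paper"
```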
PART 9
Replacing Parts of Strings
```{r}
# Replace one or more instances of a pattern with something else
str_replace(string_vector, '[aeiouy]', '?')
str_replace_all(string_vector, '[aeiouy]', '?')
```
```{r}
# Another example, replacing & with 'and'
str_replace("Tom & Jerry", pattern = " & ", replacement = " and ")
str_replace_all("Alvin & Simon & Theodore", pattern = " & ", replacement = " and ")
```
PART 10
Introducing rebus and dplyr
```{r}
# babynames is a data set of US baby names
# rebus is a library for pattern-matching
# https://www.rdocumentation.org/packages/rebus/versions/0.1-3
library("rebus")
library("babynames")
```
In this section, we introduce a little bit of dplyr to filter the data that we then operate on with stringr functions. Below, dplyr is used to *filter* the babynames data by year.
If you compare the next two R cells, you will notice that there is more than one way to call a dplyr function. You can pass the data frame as the first argument to the function:
`filter(babynames, year == 2014)`
OR you can use the **pipe operator** to 'pipe' the data frame into the filter function:
`babynames %>% filter(year == 2014)`
These look different, but they do the same thing. And, of course, you can assign the result to a variable that you can then reference:
`babies_2014 <- filter(babynames, year == 2014)`
OR
`babies_2014 <- babynames %>% filter(year == 2014)`
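The equivalence of the two calling styles can be verified on a small example. The sketch below uses a made-up data frame, `toy`, as a stand-in for babynames so that it runs without that package; the column values are invented for illustration.

```r
library(dplyr)

# A small made-up data frame standing in for babynames
toy <- data.frame(
  year = c(2013, 2014, 2014),
  name = c("Ava", "Bree", "Lee"),
  stringsAsFactors = FALSE
)

# Data frame passed as the first argument
direct <- filter(toy, year == 2014)

# Data frame piped into filter
piped <- toy %>% filter(year == 2014)

# Both approaches return the same rows
identical(direct, piped)
nrow(direct)
```

The pipe form tends to read better once several steps are chained together, since each step appears in the order it is applied.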
```{r}
babies_2014 <- filter(babynames, year==2014)
names_2014 <- babies_2014$name
# Create a vector of the last two letters of every name
last_two <- str_sub(names_2014, -2, -1)
# Create a boolean vector, telling whether or not a name ends in 'ee'
ends_in_ee <- str_detect(last_two, pattern = fixed("ee"))
# Subset data frame using boolean vector as a filter
sex <- babies_2014[ends_in_ee,]$sex
# Basic table
table(sex)
```
In 2014, many more commonly female names ended in 'ee' than commonly male names.
```{r}
# Percentage of distinct 2014 boys' names that contain a lowercase "q" followed by at least one more character
boys_2014 <- babynames %>% filter(sex=="M", year==2014)
boys_names <- boys_2014$name
with_q <- str_detect(boys_names, pattern = "q" %R% ANY_CHAR)
mean(with_q)*100
```
About 0.7% of the distinct boys' names recorded in 2014 contain a lowercase 'q' followed by at least one more character; that's not many!
Let's see the names:
```{r}
boys_2014[with_q,]$name
```
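The pattern used above is stricter than "contains the letter q". The sketch below illustrates the difference on a few made-up names. It assumes that the rebus expression `"q" %R% ANY_CHAR` corresponds to the regular expression `q.`, and it uses plain stringr so that rebus is not required.

```r
library(stringr)

# Made-up example names
sample_names <- c("Quentin", "Tariq", "Enrique", "Liam")

# "q." needs a lowercase q followed by another character:
# it misses "Tariq" (q is last) and "Quentin" (capital Q)
q_then_char <- str_detect(sample_names, "q.")
q_then_char  # FALSE FALSE TRUE FALSE

# A fixed lowercase "q" anywhere in the name
any_lower_q <- str_detect(sample_names, fixed("q"))
any_lower_q  # FALSE TRUE TRUE FALSE

# Ignoring case also catches names that start with Q
any_q <- str_detect(sample_names, regex("q", ignore_case = TRUE))
any_q        # TRUE TRUE TRUE FALSE
```

Which pattern is appropriate depends on the question being asked; the percentage reported above applies only to the first, strictest form.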
Those are some nifty names!
CAPTURE AND GROUP
```{r}
# capture is good with match functions
rx_price <- capture(digit(1, Inf) %R% DOT %R% digit(2))
rx_quantity <- capture(digit(1, Inf))
rx_all <- DOLLAR %R% rx_price %R% " for " %R% rx_quantity
str_match("The price was $123.99 for 12.", rx_price)
str_match_all("The price was $123.99 for 12.", rx_quantity)
str_match_all("The price was $123.99 for 12.", rx_all)
# group is mostly used with alternation. See ?or.
rx_spread <- group("peanut butter" %|% "jam" %|% "marmalade")
str_match_all(
"You can have peanut butter, jam, or marmalade on your toast.",
rx_spread
)
```
```{r}
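# Capturing an email address with rebus
# contact_list is not defined elsewhere in this notebook; the addresses below are made-up examples
contact_list <- c("Call Ana at ana@example.com", "Write to raj@example.org", "No address here")
email <- capture(one_or_more(WRD)) %R% "@" %R% capture(one_or_more(WRD)) %R% DOT %R% capture(one_or_more(WRD))
email_parts <- str_match(contact_list, email)
email_parts
writeLines("")
# Column 1 holds the full match; columns 2 to 4 hold the captured user, domain, and top-level domain
email_parts[,3]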
```
```{r}
# Rebus backreferences
narratives <- c("SKIING", "SURFING", "CLEANING", "BUILDING", "BOOGIEING", "DIVORCED")
pattern <- capture(one_or_more(WRD)) %R% "ING"
# Visualize the capture groups
str_view(narratives, pattern)
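# Change the phrases using the captured words
# REF1 refers back to the first capture group (the text before "ING"), so "SKIING" becomes "CARELESSLY SKI";
# "DIVORCED" does not match the pattern and is left unchanged
str_replace(narratives, pattern, str_c("CARELESSLY", REF1, sep = " "))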
```
PART 11
Formatting Numbers
```{r}
# Formatting numbers with stringr and paste
# Some vectors of numbers
percent_change <- c(4, -1.91, 3.00, -5.002)
income <- c(72.19, 1030.18, 10291.93, 1189192.18)
whole_income <- c("72", "1,030", "10,292", "1,189,192")
p_values <- c(0.12, 0.98, 0.0000191, 0.00000000002)
# Format c(0.0011, 0.011, 1) with digits = 1
format(c(0.0011, 0.011, 1), digits = 1)
# Format c(1.0011, 2.011, 1) with digits = 1
format(c(1.0011, 2.011, 1), digits = 1)
# Format percent_change with two significant digits, which for these values gives one place after the decimal point
str_c(format(percent_change, digits=2), "%")
# Format income strings with $s
paste("$", whole_income, sep = "")
# Format p_values in fixed format
format(p_values, scientific = FALSE)
# In scientific format
format(p_values, scientific = TRUE)
```
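Because the `digits` argument of `format` counts significant digits rather than decimal places, it can help to see it next to functions that fix the number of decimals directly. This is a minimal base R sketch; `sprintf` and `formatC` are offered as alternatives, not as what the original notebook uses.

```r
percent_change <- c(4, -1.91, 3.00, -5.002)

# digits is a minimum number of significant digits, not decimal places
sig_two <- format(123.456, digits = 2)
sig_two       # "123"

# sprintf fixes the number of decimal places; %% prints a literal percent sign
pct <- sprintf("%.1f%%", percent_change)
pct           # "4.0%" "-1.9%" "3.0%" "-5.0%"

# formatC fixes the decimals and adds a thousands separator
big_income <- formatC(1189192.18, format = "f", digits = 2, big.mark = ",")
big_income    # "1,189,192.18"
```

Unlike `format`, `sprintf` does not pad the results to a common width, so the strings may differ in length.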
PART 12
Creating a simple table with format and paste
```{r}
# Define the column names
column_names <- c("Year 0", "Year 1", "Year 2", "Project Lifetime")
# Create formatted income
income <- c(72.19, 1030.18, 10291.93, 1189192.18)
formatted_income <- format(income, digits = 2, big.mark = ",")
# Create dollar_income
dollar_income <- paste0("$", formatted_income)
# Format the column names
formatted_names <- format(column_names, justify = "right")
# Create rows
rows <- paste(formatted_names, dollar_income, sep = " ")
# Write rows
writeLines(rows)
```
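The same kind of table can also be built with explicit column widths. The sketch below is one possible alternative using `sprintf` and `formatC`; the widths 16 and 14 are arbitrary choices for this example, and only two of the rows are shown.

```r
labels  <- c("Year 0", "Project Lifetime")
amounts <- c(72.19, 1189192.18)

# Round to whole dollars and add thousands separators
dollars <- paste0("$", formatC(amounts, format = "f", digits = 0, big.mark = ","))

# %-16s left-aligns the label in 16 characters; %14s right-aligns the amount in 14
table_rows <- sprintf("%-16s %14s", labels, dollars)
writeLines(table_rows)

# Every row has the same width, so the columns line up
nchar(table_rows)
```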