-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathR-Studio Notebook Lesson 5 Homework.Rmd
126 lines (88 loc) · 5.02 KB
/
R-Studio Notebook Lesson 5 Homework.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
---
title: "R Notebook: Lesson 5 Homework"
output: html_notebook
author: Kyle Rasku
---
```{r}
library(tidyverse)
```
PART I
Corpus Ratios
```{r}
vanity_fair <- "“Good-bye, then,” he said, shaking his fist in a rage, and slamming the door by which he retreated. And this time he really gave his order for march: and mounted in the court-yard. Mrs. O’Dowd heard the clattering hoofs of the horses as they issued from the gate; and looking on, made many scornful remarks on poor Joseph as he rode down the street with Isidor after him in the laced cap. The horses, which had not been exercised for some days, were lively, and sprang about the street. Jos, a clumsy and timid horseman, did not look to advantage in the saddle. “Look at him, Amelia dear, driving into the parlour window. Such a bull in a china-shop I never saw.” And presently the pair of riders disappeared at a canter down the street leading in the direction of the Ghent road, Mrs. O’Dowd pursuing them with a fire of sarcasm so long as they were in sight. All that day from morning until past sunset, the cannon never ceased to roar. It was dark when the cannonading stopped all of a sudden. All of us have read of what occurred during that interval. The tale is in every Englishman’s mouth; and you and I, who were children when the great battle was won and lost, are never tired of hearing and recounting the history of that famous action. Its remembrance rankles still in the bosoms of millions of the countrymen of those brave men who lost the day. They pant for an opportunity of revenging that humiliation; and if a contest, ending in a victory on their part, should ensue, elating them in their turn, and leaving its cursed legacy of hatred and rage behind to us, there is no end to the so-called glory and shame, and to the alternations of successful and unsuccessful murder, in which two high-spirited nations might engage. Centuries hence, we Frenchmen and Englishmen might be boasting and killing each other still, carrying out bravely the Devil’s code of honour."
# remove punctuation, except periods, since the corpus needs to be split into lines.
vanity_fair <- str_replace_all(vanity_fair, '[,“”"\\?;:!]', '')
# before splitting, replace all instances of Mrs. with Mrs
vanity_fair <- str_replace_all(vanity_fair, pattern = fixed("Mrs."), "Mrs")
# obtain lines
lines <- str_trim(str_split(vanity_fair, "\\.", simplify = TRUE), side = "both")
for (x in 1:15) {
# obtain words for each line
words <- str_to_lower(str_split(lines[x], " ", simplify = TRUE))
# remove duplicates
words <- words[!duplicated(words)]
# of chars per word
word_lengths <- str_length(words)
# vectors of short and long words
short_words <- words[word_lengths<4]
long_words <- words[word_lengths>3]
# length of each vector to create the ratio of small words to large words
ratio <- length(short_words)/length(long_words)
print(ratio)
}
```
In most of the lines, there are more long words (4 characters or more), than short words (3 characters or less), only lines 5, 7 and 10 do not follow this rule. In line 7, the ratio is 1, so the number of long words and the number of short words are exactly equal.
PART II
2013 GIRLS and IP ADDRESSES
```{r}
# babynames is a data set of babynames
# rebus is a library for pattern-matching
# https://www.rdocumentation.org/packages/rebus/versions/0.1-3
library("rebus")
library("babynames")
```
```{r}
girls_2013 <- babynames %>% filter(sex=="F", year==2013)
gnames_2013 <- girls_2013$name
# Create a boolean vector, telling whether or not a name contains three consecutive vowels
contains3 <- str_detect(gnames_2013, pattern = "[aeiouy]{3}")
# Percentage of girls in 2013 whose names have three consecutive vowels
mean(contains3)*100
# Total # of names
sum(contains3)
```
In 2013, 1,794 girls had names containing three consecutive vowels; about 9%.
```{r}
L_detect <- str_starts(gnames_2013, pattern = "L")
L_girls <- gnames_2013[L_detect]
vowel_counts <- str_count(L_girls, "[aeiouy]")
hist(vowel_counts)
```
Of all the 2013 girls whose names began with "L", most had 3 vowels in their names!
```{r}
length(vowel_counts)
```
Total # of 2013 girls with a name beginning with L: 1,097.
IP ADDRESS MATCHING
IP addresses consist of four numbers between 0 and 255 separated by dots.
The first three numbers represent the subnet.
```{r}
ips <- c("180.113.128.248", "185.116.139.255", "190.119.151.6", "195.122.162.13", "200.125.173.20", "205.128.184.27",
"210.131.195.34", "215.134.206.41")
ip_element <- group(
"25" %R% char_range(0, 5) %|%
"2" %R% char_range(0, 4) %R% ascii_digit() %|%
optional(char_class("01")) %R% optional(ascii_digit()) %R% ascii_digit()
)
# Capture the subnets
ip_address_match <- BOUNDARY %R%
capture(repeated(group(ip_element %R% DOT), 3)) %R%
ip_element %R%
BOUNDARY
ip_matches <- str_match(ips, ip_address_match)
# Subset to subnet capture groups
matches <- str_match(ips, ip_matches)[9:16,1]
# Remove trailing dots
subnets <- str_remove(matches, "\\.$")
subnets
```