05-Regular-Expressions.md
+73 -22 (73 additions & 22 deletions)
@@ -30,11 +30,13 @@ In addition to the utilities we will discuss in this chapter, nearly every progr
30
30
The general idea of writing a regular expression is that you're writing a *search string*.
31
31
They may look complicated, but break them down piece by piece and you should be able to puzzle out what is going on.
32
32
33
-
There are several websites that will visually show you what a regular expression matches.
33
+
There are several websites that will visually show you what each part of a regular expression matches.
34
34
We recommend you try out examples from this chapter in one of these websites; try [https://regex101.com/](https://regex101.com/).
35
35
36
36
### Syntax
37
37
38
+
#### Character Classes
39
+
38
40
Letters, numbers, and spaces match themselves: the regex `abc` matches the string "abc".
39
41
In addition to literal character matches, there are several single-character-long patterns:
40
42
@@ -57,22 +59,24 @@ Custom character classes can include other character classes, and you can use `-
57
59
For instance, if you wanted to match a hexadecimal digit, you could write the following: `[\da-fA-F]` to match a digit (`\d`) or a hex letter, either uppercase or lowercase.
58
60
You can also negate character classes by including a `^` at the beginning. `[^a-z]` matches everything except lowercase letters.
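If you'd like to poke at these custom classes outside of a website, here is a minimal sketch using Python's `re` module. Python is an assumption on our part (this chapter's own tools are `grep` and `sed`), and the helper name `is_hex_digit` is made up for illustration.

```python
import re

# [\da-fA-F]: a digit or a hex letter in either case.
HEX_DIGIT = re.compile(r"[\da-fA-F]")
# [^a-z]: any single character except a lowercase letter.
NOT_LOWER = re.compile(r"[^a-z]")

def is_hex_digit(ch):
    """Return True if ch is exactly one hexadecimal digit (hypothetical helper)."""
    return HEX_DIGIT.fullmatch(ch) is not None

print(is_hex_digit("7"), is_hex_digit("c"), is_hex_digit("F"))  # True True True
print(is_hex_digit("g"))                                         # False
print(NOT_LOWER.findall("aB3 z!"))                               # ['B', '3', ' ', '!']
```

`fullmatch` insists that the pattern cover the whole string, which is why a single character is passed in; `findall` collects every character the negated class accepts.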
59
61
62
+
#### Repetition
63
+
60
64
Now, if you want to match names, you can use `\w\w\w\w` to match "Finn" or "Jake", but that won't work to match "Bob" or "Summer".
61
65
What you really need is a variable-length match. Fortunately there are several of these!
62
66
63
67
-`{n}`: matches *n* of the previous character class.
64
68
-`{n,m}`: matches between *n* and *m* of the previous character class (inclusive).
65
69
-`{n,}`: matches at least *n* of the previous character.
66
70
67
-
So you could write `/\w{4}/` to match four-letter words, or `\w{1,}` to match one or more word characters.
71
+
So you could write `\w{4}` to match four-letter words, or `\w{1,}` to match one or more word characters.
68
72
69
73
Because some of these patterns are so common, there's shorthand for them:
70
74
71
75
-`*`: matches 0 or more of the previous character; short for `{0,}`.
72
76
-`+`: matches 1 or more of the previous character; short for `{1,}`.
73
77
-`?`: matches 0 or 1 of the previous character; short for `{0,1}`.
74
78
75
-
So we could write our name regex as `/\w+/`.
79
+
So we could write our name regex as `\w+`.
76
80
77
81
More examples:
78
82
@@ -81,12 +85,15 @@ More examples:
81
85
-`\d{5}` : Matches any string containing five digits (a regular ZIP code).
82
86
-`\d{5}-\d{4}` : Matches any string containing 5 digits followed by a dash and 4 more digits (a ZIP+4 code).
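The repetition operators above can be tried out with a short sketch. It assumes Python's `re` module as the regex engine (not one of this chapter's tools), and the sample strings other than the four names are invented for illustration.

```python
import re

names = ["Finn", "Jake", "Bob", "Summer"]

# \w{4} is the same as \w\w\w\w: exactly four word characters.
four_letters = [n for n in names if re.fullmatch(r"\w{4}", n)]
# \w+ (short for \w{1,}) accepts any non-empty run of word characters.
any_length = [n for n in names if re.fullmatch(r"\w+", n)]

print(four_letters)  # ['Finn', 'Jake']
print(any_length)    # ['Finn', 'Jake', 'Bob', 'Summer']

# search() looks for a match anywhere in the string, so the ZIP patterns
# are found even when surrounded by other text.
zip5 = re.search(r"\d{5}", "Mail it to 12345 please").group()
zip9 = re.search(r"\d{5}-\d{4}", "ZIP+4: 12345-6789").group()
print(zip5)  # 12345
print(zip9)  # 12345-6789
```

Note that `fullmatch` requires the whole string to match, while `search` is happy with a substring; the name examples use the former so that "Summer" is not accepted by `\w{4}` just because it contains four word characters.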
83
87
88
+
#### Groups
89
+
84
90
What if you wanted to match a ZIP code either with or without the extension?
85
91
It's tempting to write `\d{5}-?\d{0,4}`, but this would also match "12345-", "12345-6", and so on, which are not valid ZIP+4 codes.
86
92
87
93
What we really need is a way to group parts of the match together.
88
94
Fortunately, you can do this with `()`s!
89
-
`\d{5}(-\d{4})?` matches any ZIP code with an optional +4 extension.
95
+
You can then apply modifiers (like `+`) to the group as a whole.
96
+
`\d{5}(-\d{4})?` matches any ZIP code with an optional +4 extension --- either it matches a 5-digit ZIP and none of `-\d{4}`, or a 5-digit ZIP and all of `-\d{4}`.
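To see the difference between the tempting regex and the grouped one, here is a small comparison. It is only a sketch, assuming Python's `re` module as the engine; the variable names are ours.

```python
import re

naive = re.compile(r"\d{5}-?\d{0,4}")    # dash and digits are independently optional
grouped = re.compile(r"\d{5}(-\d{4})?")  # the whole "-1234" part is all-or-nothing

candidates = ["12345", "12345-6789", "12345-", "12345-6"]

# Map each candidate to (accepted by naive, accepted by grouped).
verdicts = {
    c: (bool(naive.fullmatch(c)), bool(grouped.fullmatch(c)))
    for c in candidates
}

for candidate, (naive_ok, grouped_ok) in verdicts.items():
    print(candidate, naive_ok, grouped_ok)
# 12345 True True
# 12345-6789 True True
# 12345- True False
# 12345-6 True False
```

`fullmatch` is used so that each candidate has to be a ZIP code in its entirety, rather than merely contain one.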
90
97
91
98
A group can match one of several options, denoted by `|`.
92
99
For example, `[ac][bd]` matches "ab", "cd", "ad", and "cb".
@@ -95,27 +102,49 @@ To match "ab" or "cd" but not "ad" or "cb", use `(ab|cd)`.
95
102
The real power of groups is in backreferences, which come in handy both when matching expressions and doing string transformations.
96
103
You can refer to the substring matched by the first group with `\1`, the second group with `\2`, etc.
97
104
We can match "abab" or "cdcd" but not "abcd" or "cdab" with `(ab|cd)\1`.
105
+
The backreference there says "Match another of whatever `(ab|cd)` matched".
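Here is that backreference in action, as a minimal sketch that assumes Python's `re` module (which uses the same `\1` notation):

```python
import re

# \1 must repeat exactly what the first group matched.
doubled = re.compile(r"(ab|cd)\1")

results = {s: bool(doubled.fullmatch(s)) for s in ["abab", "cdcd", "abcd", "cdab"]}
print(results)
# {'abab': True, 'cdcd': True, 'abcd': False, 'cdab': False}
```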
98
106
99
107
If you have a pattern where you need to refer to both a backreference and a digit immediately afterward, use an empty group to separate the backreference and digit.
100
108
For example, let's say you want to match "110", "220", ..., "990".
101
109
If you wrote `(\d)\10`, your regex engine would be confused because `\10` looks like a backreference to the 10th group.
102
110
Instead, write `(\d)\1()0` -- the `()` matches an empty string (i.e. nothing), so it's as if it wasn't there.
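The sketch below shows one engine's reaction, assuming Python's `re` module: there, `\10` is read as a reference to a tenth group, and since no such group exists the pattern is rejected outright. Other engines may handle `\10` differently, so treat this as an illustration rather than a universal rule.

```python
import re

# Python reads \10 as "group 10", which does not exist in this pattern.
try:
    re.compile(r"(\d)\10")
    ambiguous_compiles = True
except re.error:
    ambiguous_compiles = False
print(ambiguous_compiles)  # False

# The empty group () separates the backreference from the literal 0.
separated = re.compile(r"(\d)\1()0")
matches = [s for s in ["110", "220", "990", "120", "1100"] if separated.fullmatch(s)]
print(matches)  # ['110', '220', '990']
```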
103
111
112
+
#### Anchors
113
+
104
114
By default, regular expressions match a substring anywhere in the string.
105
115
So if you have the regex `a+b+` and the string "cccaabbddddd", that will count as a match because `a+b+` matches "aabb".
106
116
To specify that a match must start at the beginning of a line, use `^`, and to specify that the match ends at the end of a line, use `$`.
107
117
So, `a+b+$` matches "cccaabb" but not "aabbcc", and `^a+b+$` matches only lines containing some "a"s followed by some "b"s.
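The anchor examples above can be checked with a short sketch, assuming Python's `re` module. One caveat of that assumption: in Python, `^` and `$` anchor to the whole string by default, so each "line" here is passed in as its own string (the `re.MULTILINE` flag makes them work per line instead). The helper `found` is a made-up name.

```python
import re

def found(pattern, text):
    """Return True if pattern matches anywhere in text (hypothetical helper)."""
    return re.search(pattern, text) is not None

print(found(r"a+b+", "cccaabbddddd"))  # True: "aabb" is a substring
print(found(r"a+b+$", "cccaabb"))      # True: the match ends at the end
print(found(r"a+b+$", "aabbcc"))       # False: "cc" follows the "b"s
print(found(r"^a+b+$", "aabb"))        # True: the whole line is a's then b's
print(found(r"^a+b+$", "cccaabb"))     # False: the line starts with "ccc"
```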
108
118
119
+
#### Greedy and Non-greedy matching
120
+
109
121
Now, it's the nature of regular expressions to be greedy and gobble up (match) as much as they can.
110
122
Usually this sort of self-interested behavior is fine, but sometimes it goes too far.[^politics]
111
-
You can use `?` on part of a regular expression to make that part polite (i.e. non-greedy), in which case it matches only as much as it needs for the whole regex to match.
123
+
You can put `?` after a repetition operator (as in `+?` or `*?`) to make that part of the regular expression polite (i.e., non-greedy),
124
+
in which case it matches only as much as it needs for the whole regex to match.
112
125
113
126
One example of this is if you are trying to match complete sentences from lines of text.
114
-
Using `(.+\.)` (i.e. match one or more things, followed by a period) is fine, as long as there is just one sentence per line.
127
+
Using `.+\.` (i.e. match one or more characters, followed by a period) is fine, as long as there is just one sentence per line, like so:
128
+
129
+
```
130
+
This is one sentence.
131
+
And this is another.
132
+
Maybe we'll be daring and include a third.
133
+
```
134
+
115
135
But if there's more than one sentence on a line, this regex will match all of them, because `.` matches "."!
116
-
If you want it to match one and only one sentence, you have to tell the `.+` to match only as much as needed, so `(.+?\.)`.
136
+
For example, if we were to run it on the following file:
137
+
138
+
```
139
+
This file has three sentences. But only two lines.
140
+
I guess technically the second sentence isn't one.
141
+
We lied to you--or did we?
142
+
```
143
+
144
+
The first match would be "This file has three sentences. But only two lines.", which contains two sentences.[^grammar]
145
+
If you want it to match one and only one sentence, you have to tell the `.+` to match only as much as needed, so `.+?\.`.
117
146
118
-
Alternatively, you could rewrite it using a custom character class: `([^\.]+\.)`-- match one or more things that aren't a period, followed by a period.
147
+
Alternatively, you could rewrite it using a custom character class: `[^\.]+\.` --- match one or more things that aren't a period, followed by a period.
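To compare the three sentence regexes side by side, here is a minimal sketch run on the first line of the example file. It assumes Python's `re` module as the engine; the variable names are ours.

```python
import re

line = "This file has three sentences. But only two lines."

greedy = re.search(r".+\.", line).group()     # grabs as much as it can
lazy = re.search(r".+?\.", line).group()      # stops at the first period
negated = re.search(r"[^.]+\.", line).group() # cannot cross a period at all

print(greedy)   # This file has three sentences. But only two lines.
print(lazy)     # This file has three sentences.
print(negated)  # This file has three sentences.

# Collecting every non-greedy match yields one entry per sentence
# (the second keeps its leading space).
sentences = re.findall(r".+?\.", line)
print(sentences)  # ['This file has three sentences.', ' But only two lines.']
```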
119
148
120
149
### `grep`
121
150
@@ -131,8 +160,8 @@ Your reverie is cut short when you suddenly remember that you have a big file th
131
160
132
161
Okay, enough imagining. There *is* a command to use that line noise to look through files: `grep`.
133
162
This interjection of a command name is short for "global regular expression print", and it does just that.
134
-
In this case, "just that" means it prints strings from files (or standard in) that match a given regular expression.
135
-
If you want to look for "bob" in "cool\_people.txt", you could do it with `grep` like so: `grep bob cool\_people.txt`.
163
+
In this case, "just that" means it prints lines from files (or standard input) that contain a match for a given regular expression.
164
+
If you want to look for "bob" in "cool_people.txt", you could do it with `grep` like so: `grep bob cool_people.txt`.
136
165
If you don't specify a filename, `grep` reads from standard input, so you can pipe stuff into it as well.
137
166
138
167
`grep` has a few handy options:
@@ -143,8 +172,9 @@ If you don't specify a filename, `grep` reads from standard input, so you can pi
143
172
-`-P`: Use Perl-style regular expressions.
144
173
-`-o`: Only print the part of the line the regex matches. Handy for figuring out what a regex is matching.
145
174
146
-
Without Perl-style regexes, `grep` requires you to escape special characters to get the special meaning.[^huh]
175
+
If you don't enable Perl-style regexes, `grep` requires you to escape some special characters (such as `+`, `?`, `|`, `()`, and `{}`) to get the special meaning.[^huh]
147
176
In other words, `a+` matches "a+", whereas `a\+` matches one or more "a"s.
177
+
If you want to use the syntax shown in the previous section, you'll want to pass the `-P` flag to `grep`.
148
178
149
179
For these examples, we'll use STDIN as our search text.
150
180
That is, grep will use the pattern (passed as an argument) to search the input received over STDIN.
@@ -159,9 +189,19 @@ banas
159
189
$ echo "banana" | grep 'b\(an\)\+as'
160
190
```
161
191
192
+
Here's a more practical example of where `grep` comes in handy:
193
+
194
+
```
195
+
$ grep -i todo big_homework_file.cpp
196
+
// TODO: reticulate the splay tree
197
+
// ToDo copy this part from stackoverflow
198
+
/* It would be nice todo some more consistent comment formatting */
199
+
/* TODO remove completed todos from code */
200
+
```
201
+
162
202
### `sed`
163
203
164
-
`grep` is great and all but it's really only for printing out matches of regular expressions.
204
+
`grep` is great and all, but it just prints out matches of regular expressions.
165
205
We can do so much more with regular expressions, though!
166
206
`sed` is a 'stream editor': it reads in a file (or standard input), makes edits, and prints the edited stream to standard output.
167
207
`sed` is noninteractive; while you *can* use it to perform any old edit, it's best for situations where you want to automate editing.
@@ -171,13 +211,13 @@ Some handy `sed` flags:
171
211
-`-r`: Use extended regular expressions. **NOTE**: even with extended regexes, `sed` is missing some character classes, such as `\d`.
172
212
-`-n`: Suppress the automatic printing of every line, so only lines you explicitly print (e.g. with `p`) are shown (handy for debugging).
173
213
174
-
`sed` has several commands that you can use in conjunction with reglar expressions to perform edits.
214
+
`sed` has several commands that you can use in conjunction with regular expressions to perform edits.
175
215
One such command is the print command, `p`. It prints every line that a particular regex matches.
176
216
`sed -n '/REGEX/ p'` works almost exactly like `grep REGEX` does.
177
217
Use this command to make sure your regexes match what you think they should.
178
218
179
219
The substitute command, `s`, substitutes the string matched by a regular expression with another string.
180
-
`sed 's/REGEX/REPLACEMENT/'` replaces the match for `REGEX` with `REPLACEMENT.
220
+
`sed 's/REGEX/REPLACEMENT/'` replaces the match for `REGEX` with `REPLACEMENT`.
181
221
This lets you perform string transformations, or edits.
182
222
183
223
For example,
@@ -217,12 +257,17 @@ grape pie
217
257
There are even more `sed` commands, and more ways to combine them together.
218
258
Fortunately for you, though, this is not a book on `sed`, so we'll leave it at that.
219
259
It's definitely worthwhile to spend a bit of time looking through the `sed` manual if you find yourself needing to do something it's good for.
260
+
Speaking of which, what is `sed` good for?
261
+
262
+
- Renaming variables, functions, etc. in code.
263
+
- Making changes to text file databases (CSV files, etc.).
264
+
- Impressing your friends with your ability to write arcane commands.
220
265
221
266
\newpage
222
267
## Questions
223
268
Name: `______________________________`
224
269
225
-
1. Suppose, for the sake of simplicity[^email], that we want to match email addresses whose addresses and domains are letters and numbers, like "[email protected]".
270
+
1. Suppose, for the sake of simplicity,[^email] that we want to match email addresses whose addresses and domains are letters and numbers, like "[email protected]".
226
271
Write a regular expression to match an email address.
227
272
\vspace{8em}
228
273
@@ -233,11 +278,14 @@ Write a regular expression to match an email address.
233
278
You're trying to reinvent the business as a hip, fancy eatery, "The Garden of Olives (And Also Peperoncinis)".
234
279
As part of this reinvention, you need to jazz up that menu by replacing "pizza" with "focaccia and fresh tomato sauce".
235
280
Suppose your menu is stored in "menu.txt". Write a command to update every instance of "pizza" and place the new, hip menu in "carte-du-jour.txt".
281
+
(**Hint**: you'll need to use some of the I/O redirection stuff you learned in Chapter 2.)
236
282
\newpage
237
283
238
284
## Quick Reference
239
285
240
-
Regex:
286
+
### Regex
287
+
288
+
Character classes:
241
289
242
290
-`.`: Matches one of any character.
243
291
-`\w`: Matches a word character (letters, numbers, and \_).
-`{n}`: matches *n* of the previous character class.
253
301
-`{n,m}`: matches between *n* and *m* of the previous character class (inclusive).
254
302
-`{n,}`: matches at least *n* of the previous character.
255
-
-`\*`: matches 0 or more of the previous character; short for `{0,}`.
303
+
-`*`: matches 0 or more of the previous character; short for `{0,}`.
256
304
-`+`: matches 1 or more of the previous character; short for `{1,}`.
257
305
-`?`: matches 0 or 1 of the previous character; short for `{0,1}`.
258
306
259
-
`grep REGEX [FILE]`: Search for `REGEX` in `FILE`, or standard input if no file is specified
307
+
### `grep <REGEX> [<FILE>]`
308
+
Search for `REGEX` in `FILE`, or standard input if no file is specified.
260
309
261
310
-`-C LINES`: Give `LINES` lines of context around the match.
262
311
-`-v`: Print every line that doesn’t match (it inverts the match).
263
312
-`-i`: Ignore case when matching.
264
313
-`-P`: Use Perl-style regular expressions.
265
314
-`-o`: Only print the part of the line the regex matches.
266
315
267
-
`sed COMMANDS [FILE]`: Perform `COMMANDS` to the contents of `FILE`, or standard input if no file is specified, and print the results to standard output
316
+
### `sed <COMMANDS> [<FILE>]`
317
+
Perform `COMMANDS` on the contents of `FILE`, or standard input if no file is specified, and print the results to standard output.
268
318
269
319
-`-r`: Use extended regular expressions.
270
320
-`-n`: Suppress automatic printing; only lines you explicitly print (e.g. with `p`) are shown.
271
321
272
-
<!---->
322
+
Common commands:
273
323
274
324
-`/REGEX/ p`: Print lines that match `REGEX`
275
-
-`s/REGEX/REPLACEMENT/`: Replace strings that match `REGEX` with `REPLACEMENT
325
+
-`s/REGEX/REPLACEMENT/`: Replace strings that match `REGEX` with `REPLACEMENT`
276
326
- `g`: Replace every match on each line, rather than just the first match
277
327
- `i`: Make matches case insensitive
278
328
@@ -295,3 +345,4 @@ Regex:
295
345
You may even feel slightly despondent as you realize that a piece of software being popular doesn't mean that it's good.
296
346
That's what you get for thinking.
297
347
[^email]: In practice, email addresses can have all sorts of things in them! Like spaces! Or quotes!