Commit fe777ba

Merge pull request #10 from LearnYouSomeComputer/ch5
Chapter 5: Regexes
2 parents 03da6d9 + 3086b69 commit fe777ba

1 file changed

+73
-22
lines changed

05-Regular-Expressions.md

Lines changed: 73 additions & 22 deletions
@@ -30,11 +30,13 @@ In addition to the utilities we will discuss in this chapter, nearly every progr
3030
The general idea of writing a regular expression is that you're writing a *search string*.
3131
They may look complicated, but break them down piece by piece and you should be able to puzzle out what is going on.
3232

33-
There are several websites that will visually show you what a regular expression matches.
33+
There are several websites that will visually show you what each part of a regular expression matches.
3434
We recommend you try out examples from this chapter in one of these websites; try [https://regex101.com/](https://regex101.com/).
3535

3636
### Syntax
3737

38+
#### Character Classes
39+
3840
Letters, numbers, and spaces match themselves: the regex `abc` matches the string "abc".
3941
In addition to literal character matches, there are several single-character-long patterns:
4042

@@ -57,22 +59,24 @@ Custom character classes can include other character classes, and you can use `-
5759
For instance, if you wanted to match a hexadecimal digit, you could write the following: `[\da-fA-F]` to match a digit (`\d`) or a hex letter, either uppercase or lowercase.
5860
You can also negate character classes by including a `^` at the beginning. `[^a-z]` matches everything except lowercase letters.
5961

62+
#### Repetition
63+
6064
Now, if you want to match names, you can use `\w\w\w\w` to match "Finn" or "Jake", but that won't work to match "Bob" or "Summer".
6165
What you really need is a variable-length match. Fortunately there are several of these!
6266

6367
- `{n}`: matches *n* of the previous character class.
6468
- `{n,m}`: matches between *n* and *m* of the previous character class (inclusive).
6569
- `{n,}`: matches at least *n* of the previous character.
6670

67-
So you could write `/\w{4}/` to match four-letter words, or `\w{1,}` to match one or more word characters.
71+
So you could write `\w{4}` to match four-letter words, or `\w{1,}` to match one or more word characters.
6872

6973
Because some of these patterns are so common, there's shorthand for them:
7074

7175
- `*`: matches 0 or more of the previous character; short for `{0,}`.
7276
- `+`: matches 1 or more of the previous character; short for `{1,}`.
7377
- `?`: matches 0 or 1 of the previous character; short for `{0,1}`.
7478

75-
So we could write our name regex as `/\w+/`.
79+
So we could write our name regex as `\w+`.
7680

7781
More examples:
7882

@@ -81,12 +85,15 @@ More examples:
8185
- `\d{5}` : Matches any string containing five digits (a regular ZIP code).
8286
- `\d{5}-\d{4}` : Matches any string containing 5 digits followed by a dash and 4 more digits (a ZIP+4 code).
8387

88+
#### Groups
89+
8490
What if you wanted to match a ZIP code either with or without the extension?
8591
It's tempting to write `\d{5}-?\d{0,4}`, but this would also match "12345-", "12345-6", and so on, which are not valid ZIP+4 codes.
8692

8793
What we really need is a way to group parts of the match together.
8894
Fortunately, you can do this with `()`s!
89-
`\d{5}(-\d{4})?` matches any ZIP code with an optional +4 extension.
95+
You can then apply modifiers (like `+`) to the group as a whole.
96+
`\d{5}(-\d{4})?` matches any ZIP code with an optional +4 extension --- either it matches a 5-digit ZIP and none of `-\d{4}` or a 5-digit ZIP and all of `-\d{4}`.
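
To see the optional group in action, here is a minimal sketch using `grep`, which is covered later in this chapter. The function name `match_zip` is made up for illustration, `[0-9]` stands in for `\d` because `grep -E` does not understand `\d`, and `-x` (an option not discussed in this chapter) makes `grep` accept a line only if the whole line matches.

```shell
# Hypothetical helper: print only the input lines that are, in their
# entirety, a 5-digit ZIP code or a ZIP+4 code.
match_zip() {
    # (-[0-9]{4})? is the optional group: all of "-1234" or none of it.
    grep -E -x '[0-9]{5}(-[0-9]{4})?'
}

printf '12345\n12345-6789\n12345-\n12345-6\n' | match_zip
# prints:
# 12345
# 12345-6789
```

The two partial extensions ("12345-" and "12345-6") are rejected because the group must match completely or not at all.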
9097

9198
A group can match one of several options, denoted by `|`.
9299
For example, `[ac][bd]` matches "ab", "cd", "ad", and "cb".
@@ -95,27 +102,49 @@ To match "ab" or "cd" but not "ad" or "cb", use `(ab|cd)`.
95102
The real power of groups is in backreferences, which come in handy both when matching expressions and doing string transformations.
96103
You can refer to the substring matched by the first group with `\1`, the second group with `\2`, etc.
97104
We can match "abab" or "cdcd" but not "abcd" or "cdab" with `(ab|cd)\1`.
105+
The backreference there says "Match another of whatever `(ab|cd)` matched".
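
A quick way to try this out is sketched below. It assumes GNU `grep`, which accepts backreferences with `-E`; the function name `match_doubled` is invented for this example, and `-x` (not covered in this chapter) requires the whole line to match.

```shell
# Hypothetical helper: keep lines made of "ab" or "cd" repeated twice.
match_doubled() {
    # \1 must repeat exactly what the first group matched.
    grep -E -x '(ab|cd)\1'
}

printf 'abab\ncdcd\nabcd\ncdab\n' | match_doubled
# prints:
# abab
# cdcd
```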
98106

99107
If you have a pattern where you need to refer to both a backreference and a digit immediately afterward, use an empty group to separate the backreference and digit.
100108
For example, let's say you want to match "110", "220", ..., "990".
101109
If you wrote `(\d)\10`, your regex engine would be confused because `\10` looks like a backreference to the 10th group.
102110
Instead, write `(\d)\1()0` -- the `()` matches an empty string (i.e. nothing), so it's as if it wasn't there.
103111

112+
#### Anchors
113+
104114
By default, regular expressions match a substring anywhere in the string.
105115
So if you have the regex `a+b+` and the string "cccaabbddddd", that will count as a match because `a+b+` matches "aabb".
106116
To specify that a match must start at the beginning of a line, use `^`, and to specify that the match ends at the end of a line, use `$`.
107117
So, `a+b+$` matches "cccaabb" but not "aabbcc", and `^a+b+$` matches only lines containing some "a"s followed by some "b"s.
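
The difference between the two anchored patterns can be checked with a small sketch like the following. The function names are hypothetical, and `grep -E` is used so that `+` does not need escaping.

```shell
# Hypothetical helpers wrapping the two regexes from the text.
ends_with_ab() {
    grep -E 'a+b+$'     # the match must end at the end of the line
}
only_ab() {
    grep -E '^a+b+$'    # the match must span the entire line
}

printf 'cccaabb\naabbcc\naabb\n' | ends_with_ab
# prints:
# cccaabb
# aabb

printf 'cccaabb\naabbcc\naabb\n' | only_ab
# prints:
# aabb
```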
108118

119+
#### Greedy and Non-greedy matching
120+
109121
Now, it's the nature of regular expressions to be greedy and gobble up (match) as much as they can.
110122
Usually this sort of self-interested behavior is fine, but sometimes it goes too far.[^politics]
111-
You can use `?` on part of a regular expression to make that part polite (i.e. non-greedy), in which case it matches only as much as it needs for the whole regex to match.
123+
You can put `?` after a repetition operator (as in `+?` or `*?`) to make that part of a regular expression polite (i.e., non-greedy),
124+
in which case it matches only as much as it needs for the whole regex to match.
112125

113126
One example of this is if you are trying to match complete sentences from lines of text.
114-
Using `(.+\.)` (i.e. match one or more things, followed by a period) is fine, as long as there is just one sentence per line.
127+
Using `.+\.` (i.e. match one or more characters, followed by a period) is fine, as long as there is just one sentence per line, like so:
128+
129+
```
130+
This is one sentence.
131+
And this is another.
132+
Maybe we'll be daring and include a third.
133+
```
134+
115135
But if there's more than one sentence on a line, this regex will match all of them, because `.` matches "."!
116-
If you want it to match one and only one sentence, you have to tell the `.+` to match only as much as needed, so `(.+?\.)`.
136+
For example, if we were to run it on the following file:
137+
138+
```
139+
This file has three sentences. But only two lines.
140+
I guess technically the second sentence isn't one.
141+
We lied to you--or did we?
142+
```
143+
144+
The first match would be "This file has three sentences. But only two lines.", which contains two sentences.[^grammar]
145+
If you want it to match one and only one sentence, you have to tell the `.+` to match only as much as needed, so `.+?\.`.
117146

118-
Alternatively, you could rewrite it using a custom character class: `([^\.]+\.)` -- match one or more things that aren't a period, followed by a period.
147+
Alternatively, you could rewrite it using a custom character class: `[^\.]+\.` --- match one or more things that aren't a period, followed by a period.
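
The three variants can be compared side by side with a sketch like this one. It relies on `grep -o` (print only the matched text), and the non-greedy version assumes a `grep` built with Perl-regex support (`-P`); the variable names are just for illustration.

```shell
line='This file has three sentences. But only two lines.'

# Greedy: .+ grabs as much as it can, so both sentences come out as one match.
greedy=$(printf '%s\n' "$line" | grep -oE '.+\.')

# Negated character class: stop at the first period.
negated=$(printf '%s\n' "$line" | grep -oE '[^.]+\.')

# Non-greedy: needs Perl-style regexes (assumes grep supports -P).
lazy=$(printf '%s\n' "$line" | grep -oP '.+?\.')

printf 'greedy:\n%s\n\nnegated:\n%s\n\nlazy:\n%s\n' "$greedy" "$negated" "$lazy"
```

With the greedy regex there is a single match containing the whole line. The other two each produce two matches, one per sentence (the second one keeps its leading space).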
119148

120149
### `grep`
121150

@@ -131,8 +160,8 @@ Your reverie is cut short when you suddenly remember that you have a big file th
131160

132161
Okay, enough imagining. There *is* a command to use that line noise to look through files: `grep`.
133162
This interjection of a command name is short for "global regular expression print", and it does just that.
134-
In this case, "just that" means it prints strings from files (or standard in) that match a given regular expression.
135-
If you want to look for "bob" in "cool\_people.txt", you could do it with `grep` like so: `grep bob cool\_people.txt`.
163+
In this case, "just that" means it prints lines from files (or standard input) that match a given regular expression.
164+
If you want to look for "bob" in "cool_people.txt", you could do it with `grep` like so: `grep bob cool_people.txt`.
136165
If you don't specify a filename, `grep` reads from standard input, so you can pipe stuff into it as well.
137166

138167
`grep` has a few handy options:
@@ -143,8 +172,9 @@ If you don't specify a filename, `grep` reads from standard input, so you can pi
143172
- `-P`: Use Perl-style regular expressions.
144173
- `-o`: Only print the part of the line the regex matches. Handy for figuring out what a regex is matching.
145174

146-
Without Perl-style regexes, `grep` requires you to escape special characters to get the special meaning.[^huh]
175+
If you don't enable Perl-style regexes, `grep` requires you to escape certain special characters (such as `+`, `?`, `|`, `(`, `)`, `{`, and `}`) to get the special meaning.[^huh]
147176
In other words, `a+` matches "a+", whereas `a\+` matches one or more "a"s.
177+
If you want to use the syntax shown in the previous section, you'll want to pass the `-P` flag to `grep`.
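
Here is a small sketch of that escaping rule, assuming GNU `grep` (the `\+` form is a GNU extension). The function names are made up for this example.

```shell
# Without -P (or -E), an unescaped + is just a literal plus sign.
literal_plus() {
    grep -o 'a+'
}
# Escaping it gives + its "one or more" meaning.
one_or_more() {
    grep -o 'a\+'
}

printf 'a+\naaa\n' | literal_plus
# prints:
# a+

printf 'a+\naaa\n' | one_or_more
# prints:
# a
# aaa
```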
148178

149179
For these examples, we'll use STDIN as our search text.
150180
That is, grep will use the pattern (passed as an argument) to search the input received over STDIN.
@@ -159,9 +189,19 @@ banas
159189
$ echo "banana" | grep 'b\(an\)\+as'
160190
```
161191

192+
Here's a more practical example of where `grep` comes in handy:
193+
194+
```
195+
$ grep -i todo big_homework_file.cpp
196+
// TODO: reticulate the splay tree
197+
// ToDo copy this part from stackoverflow
198+
/* It would be nice todo some more consistent comment formatting */
199+
/* TODO remove completed todos from code */
200+
```
201+
162202
### `sed`
163203

164-
`grep` is great and all but it's really only for printing out matches of regular expressions.
204+
`grep` is great and all but it just prints out matches of regular expressions.
165205
We can do so much more with regular expressions, though!
166206
`sed` is a 'stream editor': it reads in a file (or standard in), makes edits, and prints the edited stream to standard out.
167207
`sed` is noninteractive; while you *can* use it to perform any old edit, it's best for situations where you want to automate editing.
@@ -171,13 +211,13 @@ Some handy `sed` flags:
171211
- `-r`: Use extended regular expressions. **NOTE**: even with extended regexes, `sed` is missing some character classes, such as `\d`.
172212
- `-n`: Don't print every line automatically; only lines you explicitly print (e.g., with the `p` command) are shown (handy for debugging).
173213

174-
`sed` has several commands that you can use in conjunction with reglar expressions to perform edits.
214+
`sed` has several commands that you can use in conjunction with regular expressions to perform edits.
175215
One such command is the print command, `p`. It prints every line that a particular regex matches.
176216
`sed -n '/REGEX/ p'` works almost exactly like `grep REGEX` does.
177217
Use this command to make sure your regexes match what you think they should.
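
As a rough illustration of the equivalence, the sketch below runs the same regex through both tools. The sample input and the function names are invented for this example.

```shell
# Hypothetical helpers: both print the lines that end in "pie".
pies_with_sed() {
    sed -n '/pie$/ p'
}
pies_with_grep() {
    grep 'pie$'
}

printf 'apple pie\nbanana split\ngrape pie\n' | pies_with_sed
# prints:
# apple pie
# grape pie

printf 'apple pie\nbanana split\ngrape pie\n' | pies_with_grep
# prints the same two lines
```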
178218

179219
The substitute command, `s`, substitutes the string matched by a regular expression with another string.
180-
`sed 's/REGEX/REPLACEMENT/'` replaces the match for `REGEX` with `REPLACEMENT.
220+
`sed 's/REGEX/REPLACEMENT/'` replaces the match for `REGEX` with `REPLACEMENT`.
181221
This lets you perform string transformations, or edits.
182222

183223
For example,
@@ -217,12 +257,17 @@ grape pie
217257
There are even more `sed` commands, and more ways to combine them together.
218258
Fortunately for you, though, this is not a book on `sed`, so we'll leave it at that.
219259
It's definitely worthwhile to spend a bit of time looking through the `sed` manual if you find yourself needing to do something it's good for.
260+
Speaking of which, what is `sed` good for?
261+
262+
- Renaming variables, functions, etc. in code.
263+
- Making changes to text file databases (CSV files, etc.).
264+
- Impressing your friends with your ability to write arcane commands.
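
For instance, renaming a variable might look something like the sketch below. The variable names `cnt` and `count` and the function `rename_cnt` are hypothetical, and the input is piped in rather than read from a real source file.

```shell
# Hypothetical helper: rename every "cnt" to "count".
rename_cnt() {
    # The trailing g replaces every match on each line, not just the first.
    sed 's/cnt/count/g'
}

printf 'int cnt = 0;\ncnt = cnt + 1;\n' | rename_cnt
# prints:
# int count = 0;
# count = count + 1;
```

Note that a plain substitution like this also rewrites "cnt" inside longer names, so check the result before trusting it on real code.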
220265

221266
\newpage
222267
## Questions
223268
Name: `______________________________`
224269

225-
1. Suppose, for the sake of simplicity[^email], that we want to match email addresses whose addresses and domains are letters and numbers, like "[email protected]".
270+
1. Suppose, for the sake of simplicity,[^email] that we want to match email addresses whose addresses and domains are letters and numbers, like "[email protected]".
226271
Write a regular expression to match an email address.
227272
\vspace{8em}
228273

@@ -233,11 +278,14 @@ Write a regular expression to match an email address.
233278
You're trying to reinvent the business as a hip, fancy eatery, "The Garden of Olives (And Also Peperoncinis)".
234279
As part of this reinvention, you need to jazz up that menu by replacing "pizza" with "focaccia and fresh tomato sauce".
235280
Suppose your menu is stored in "menu.txt". Write a command to update every instance of "pizza" and place the new, hip menu in "carte-du-jour.txt".
281+
(**Hint**: you'll need to use some of the I/O redirection stuff you learned in Chapter 2.)
236282
\newpage
237283

238284
## Quick Reference
239285

240-
Regex:
286+
### Regex
287+
288+
Character classes:
241289

242290
- `.`: Matches one of any character.
243291
- `\w`: Matches a word character (letters, numbers, and \_).
@@ -247,32 +295,34 @@ Regex:
247295
- `\s`: Matches whitespace (space, tab, newline, carriage return, etc.).
248296
- `\S`: Matches non-whitespace (everything `\s` doesn’t match).
249297

250-
<!-- -->
298+
Repetition:
251299

252300
- `{n}`: matches *n* of the previous character class.
253301
- `{n,m}`: matches between *n* and *m* of the previous character class (inclusive).
254302
- `{n,}`: matches at least *n* of the previous character.
255-
- `\*`: matches 0 or more of the previous character; short for `{0,}`.
303+
- `*`: matches 0 or more of the previous character; short for `{0,}`.
256304
- `+`: matches 1 or more of the previous character; short for `{1,}`.
257305
- `?`: matches 0 or 1 of the previous character; short for `{0,1}`.
258306

259-
`grep REGEX [FILE]`: Search for `REGEX` in `FILE`, or standard input if no file is specified
307+
### `grep <REGEX> [<FILE>]`
308+
Search for `REGEX` in `FILE`, or standard input if no file is specified.
260309

261310
- `-C LINES`: Give `LINES` lines of context around the match.
262311
- `-v`: Print every line that doesn’t match (it inverts the match).
263312
- `-i`: Ignore case when matching.
264313
- `-P`: Use Perl-style regular expressions.
265314
- `-o`: Only print the part of the line the regex matches.
266315

267-
`sed COMMANDS [FILE]`: Perform `COMMANDS` to the contents of `FILE`, or standard input if no file is specified, and print the results to standard output
316+
### `sed <COMMANDS> [<FILE>]`
317+
Perform `COMMANDS` on the contents of `FILE`, or standard input if no file is specified, and print the results to standard output.
268318

269319
- `-r`: Use extended regular expressions.
270320
- `-n`: Don't print lines automatically; only explicitly printed lines (e.g., via `p`) are shown.
271321

272-
<!-- -->
322+
Common commands:
273323

274324
- `/REGEX/ p`: Print lines that match `REGEX`
275-
- `s/REGEX/REPLACEMENT/`: Replace strings that match `REGEX` with `REPLACEMENT
325+
- `s/REGEX/REPLACEMENT/`: Replace strings that match `REGEX` with `REPLACEMENT`
276326
- `g`: Replace every match on each line, rather than just the first match
277327
- `i`: Make matches case insensitive
278328

@@ -295,3 +345,4 @@ Regex:
295345
You may even feel slightly despondent as you realize that a piece of software being popular doesn't mean that it's good.
296346
That's what you get for thinking.
297347
[^email]: In practice, email addresses can have all sorts of things in them! Like spaces! Or quotes!
348+
[^grammar]: English professors, please look away.
