Merge pull request #10 from LearnYouSomeComputer/ch5

LinuxMercedes · web-flow · commit fe777ba21152 · 2018-01-15T13:37:24.000-06:00
Chapter 5: Regexes
diff --git a/05-Regular-Expressions.md b/05-Regular-Expressions.md
@@ -30,11 +30,13 @@ In addition to the utilities we will discuss in this chapter, nearly every progr
 The general idea of writing a regular expression is that you're writing a *search string*.
 They may look complicated, but break them down piece by piece and you should be able to puzzle out what is going on.
 
-There are several websites that will visually show you what a regular expression matches.
+There are several websites that will visually show you what each part of a regular expression matches.
 We recommend you try out examples from this chapter on one of these websites; try [https://regex101.com/](https://regex101.com/).
 
 ### Syntax
 
+#### Character Classes
+
 Letters, numbers, and spaces match themselves: the regex `abc` matches the string "abc".
 In addition to literal character matches, there are several single-character-long patterns:
 
@@ -57,22 +59,24 @@ Custom character classes can include other character classes, and you can use `-
 For instance, if you wanted to match a hexadecimal digit, you could write the following: `[\da-fA-F]` to match a digit (`\d`) or a hex letter, either uppercase or lowercase.
 You can also negate character classes by including a `^` at the beginning. `[^a-z]` matches everything except lowercase letters.
 
+#### Repetition
+
 Now, if you want to match names, you can use `\w\w\w\w` to match "Finn" or "Jake", but that won't work to match "Bob" or "Summer".
 What you really need is a variable-length match. Fortunately there are several of these!
 
 - `{n}`: matches *n* of the previous character class.
 - `{n,m}`: matches between *n* and *m* of the previous character class (inclusive).
 - `{n,}`: matches at least *n* of the previous character class.
 
-So you could write `/\w{4}/` to match four-letter words, or `\w{1,}` to match one or more word characters.
+So you could write `\w{4}` to match four-letter words, or `\w{1,}` to match one or more word characters.
 
 Because some of these patterns are so common, there's shorthand for them:
 
 - `*`: matches 0 or more of the previous character; short for `{0,}`.
 - `+`: matches 1 or more of the previous character; short for `{1,}`.
 - `?`: matches 0 or 1 of the previous character; short for `{0,1}`.
 
-So we could write our name regex as `/\w+/`.
+So we could write our name regex as `\w+`.
 
 More examples:
 
@@ -81,12 +85,15 @@ More examples:
 - `\d{5}` : Matches any string containing five digits (a regular ZIP code).
 - `\d{5}-\d{4}` : Matches any string containing 5 digits followed by a dash and 4 more digits (a ZIP+4 code).
 
+#### Groups
+
 What if you wanted to match a ZIP code either with or without the extension?
 It's tempting to write `\d{5}-?\d{0,4}`, but this would also match "12345-", "12345-6", and so on, which are not valid ZIP+4 codes.
 
 What we really need is a way to group parts of the match together.
 Fortunately, you can do this with `()`s!
-`\d{5}(-\d{4})?` matches any ZIP code with an optional +4 extension.
+You can then apply modifiers (like `+`) to the group as a whole.
+`\d{5}(-\d{4})?` matches any ZIP code with an optional +4 extension: either a 5-digit ZIP with none of `-\d{4}`, or a 5-digit ZIP followed by all of `-\d{4}`.
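
A minimal shell sketch of the difference, assuming a `grep` that supports `-E`. It spells `\d` as `[0-9]` so it does not depend on `-P`, and it uses the `-x` flag (not covered in this chapter) to require that the whole line match. The sample ZIP codes are made up for illustration.

```shell
# Candidate ZIP codes, one per line; only the first two are valid.
candidates=$(printf '%s\n' '12345' '12345-6789' '12345-' '12345-6')

# Tempting but too permissive: the dash and each extension digit
# are optional independently of one another, so all four lines match.
loose=$(printf '%s\n' "$candidates" | grep -xE '[0-9]{5}-?[0-9]{0,4}')

# Grouped: the "-dddd" extension is all-or-nothing,
# so only 12345 and 12345-6789 match.
strict=$(printf '%s\n' "$candidates" | grep -xE '[0-9]{5}(-[0-9]{4})?')

printf 'loose:\n%s\n\nstrict:\n%s\n' "$loose" "$strict"
```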
 
 A group can match one of several options, denoted by `|`.
 For example, `[ac][bd]` matches "ab", "cd", "ad", and "cb".
@@ -95,27 +102,49 @@ To match "ab" or "cd" but not "ad" or "cb", use `(ab|cd)`.
 The real power of groups is in backreferences, which come in handy both when matching expressions and doing string transformations.
 You can refer to the substring matched by the first group with `\1`, the second group with `\2`, etc.
 We can match "abab" or "cdcd" but not "abcd" or "cdab" with `(ab|cd)\1`.
+The backreference there says "Match another of whatever `(ab|cd)` matched".
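
As a rough illustration of `(ab|cd)\1`, the sketch below assumes GNU `grep`, where backreferences work with `-E` as an extension (other `grep` implementations may differ). It again uses the `-x` whole-line flag, which this chapter does not cover, and the variable names are only for illustration.

```shell
# Four test strings, one per line.
words=$(printf '%s\n' 'abab' 'cdcd' 'abcd' 'cdab')

# \1 must repeat whatever the group matched, so only the
# doubled strings (abab and cdcd) survive.
repeated=$(printf '%s\n' "$words" | grep -xE '(ab|cd)\1')

printf '%s\n' "$repeated"
```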
 
 If you have a pattern where you need to refer to both a backreference and a digit immediately afterward, use an empty group to separate the backreference and digit.
 For example, let's say you want to match "110", "220", ..., "990".
 If you wrote `(\d)\10`, your regex engine would be confused because `\10` looks like a backreference to the 10th group.
 Instead, write `(\d)\1()0` -- the `()` matches an empty string (i.e. nothing), so it's as if it wasn't there.
 
+#### Anchors
+
 By default, regular expressions match a substring anywhere in the string.
 So if you have the regex `a+b+` and the string "cccaabbddddd", that will count as a match because `a+b+` matches "aabb".
 To specify that a match must start at the beginning of a line, use `^`, and to specify that the match ends at the end of a line, use `$`.
 So, `a+b+$` matches "cccaabb" but not "aabbcc", and `^a+b+$` matches only lines containing some "a"s followed by some "b"s.
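
The three cases above can be checked with a short shell sketch, assuming a `grep` that supports `-E`; the input lines come from the examples in this section, plus "aabb" on its own.

```shell
# Test lines, one per line.
lines=$(printf '%s\n' 'cccaabbddddd' 'cccaabb' 'aabbcc' 'aabb')

# No anchors: any line containing some a's followed by some b's (all four).
anywhere=$(printf '%s\n' "$lines" | grep -E 'a+b+')

# $ anchor: the b's must be at the end of the line (cccaabb and aabb).
at_end=$(printf '%s\n' "$lines" | grep -E 'a+b+$')

# Both anchors: the line is nothing but a's followed by b's (aabb only).
whole=$(printf '%s\n' "$lines" | grep -E '^a+b+$')

printf 'anywhere:\n%s\n\nat end:\n%s\n\nwhole line:\n%s\n' "$anywhere" "$at_end" "$whole"
```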
 
+#### Greedy and Non-greedy matching
+
 Now, it's the nature of regular expressions to be greedy and gobble up (match) as much as they can.
 Usually this sort of self-interested behavior is fine, but sometimes it goes too far.[^politics]
-You can use `?` on part of a regular expression to make that part polite (i.e. non-greedy), in which case it matches only as much as it needs for the whole regex to match.
+You can put `?` after a repetition operator (as in `+?` or `*?`) to make that part polite (i.e., non-greedy),
+in which case it matches only as much as it needs for the whole regex to match.
 
 One example of this is if you are trying to match complete sentences from lines of text.
-Using `(.+\.)` (i.e. match one or more things, followed by a period) is fine, as long as there is just one sentence per line.
+Using `.+\.` (i.e., match one or more characters, followed by a period) is fine, as long as there is just one sentence per line, like so:
+
+```
+This is one sentence.
+And this is another.
+Maybe we'll be daring and include a third.
+```
+
 But if there's more than one sentence on a line, this regex will match all of them, because `.` matches "."!
-If you want it to match one and only one sentence, you have to tell the `.+` to match only as much as needed, so `(.+?\.)`.
+For example, if we were to run it on the following file:
+
+```
+This file has three sentences. But only two lines.
+I guess technically the second sentence isn't one.
+We lied to you--or did we?
+```
+
+The first match would be "This file has three sentences. But only two lines.", which contains two sentences.[^grammar]
+If you want it to match one and only one sentence, you have to tell the `.+` to match only as much as needed, giving `.+?\.`.
 
-Alternatively, you could rewrite it using a custom character class: `([^\.]+\.)` -- match one or more things that aren't a period, followed by a period.
+Alternatively, you could rewrite it using a custom character class: `[^\.]+\.` --- match one or more things that aren't a period, followed by a period.
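
Here is a small sketch of the greedy match next to the character-class rewrite, using `grep -o` to show exactly what each regex grabs. It assumes a `grep` with `-E` and `-o`, and sticks to `-E` because the non-greedy `.+?\.` form needs Perl-style regexes (`-P`). The class is written `[^.]` rather than `[^\.]` because a backslash inside a POSIX bracket expression is a literal backslash.

```shell
# One line holding two sentences, taken from the example file above.
line='This file has three sentences. But only two lines.'

# Greedy: .+ runs to the last period, so the whole line is one match.
greedy=$(printf '%s\n' "$line" | grep -oE '.+\.')

# Character class: each match stops at the first period it reaches,
# so -o prints two matches, one per output line.
per_sentence=$(printf '%s\n' "$line" | grep -oE '[^.]+\.')

printf 'greedy:\n%s\n\nper sentence:\n%s\n' "$greedy" "$per_sentence"
```

Note that the second match keeps the space that follows the first period, since that space is also "not a period".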
 
 ### `grep`
 
@@ -131,8 +160,8 @@ Your reverie is cut short when you suddenly remember that you have a big file th
 
 Okay, enough imagining. There *is* a command to use that line noise to look through files: `grep`.
 This interjection of a command name is short for "global regular expression print", and it does just that.
-In this case, "just that" means it prints strings from files (or standard in) that match a given regular expression.
-If you want to look for "bob" in "cool\_people.txt", you could do it with `grep` like so: `grep bob cool\_people.txt`.
+In this case, "just that" means it prints lines from files (or standard input) that match a given regular expression.
+If you want to look for "bob" in "cool_people.txt", you could do it with `grep` like so: `grep bob cool_people.txt`.
 If you don't specify a filename, `grep` reads from standard input, so you can pipe stuff into it as well.
 
 `grep` has a few handy options:
@@ -143,8 +172,9 @@ If you don't specify a filename, `grep` reads from standard input, so you can pi
 - `-P`: Use Perl-style regular expressions.
 - `-o`: Only print the part of the line the regex matches. Handy for figuring out what a regex is matching.
 
-Without Perl-style regexes, `grep` requires you to escape special characters to get the special meaning.[^huh]
+If you don't enable Perl-style regexes, `grep` requires you to escape some special characters (such as `+`, `?`, `|`, `(`, `)`, `{`, and `}`) to get their special meaning.[^huh]
 In other words, `a+` matches "a+", whereas `a\+` matches one or more "a"s.
+If you want to use the syntax shown in the previous section, you'll want to pass the `-P` flag to `grep`.
 
 For these examples, we'll use STDIN as our search text.
 That is, grep will use the pattern (passed as an argument) to search the input received over STDIN.
@@ -159,9 +189,19 @@ banas
 $ echo "banana" | grep 'b\(an\)\+as'
 ```
 
+Here's a more practical example of where `grep` comes in handy:
+
+```
+$ grep -i todo big_homework_file.cpp
+// TODO: reticulate the splay tree
+// ToDo copy this part from stackoverflow
+/* It would be nice todo some more consistent comment formatting */
+/* TODO remove completed todos from code */
+```
+
 ### `sed`
 
-`grep` is great and all but it's really only for printing out matches of regular expressions.
+`grep` is great and all, but it just prints out matches of regular expressions.
 We can do so much more with regular expressions, though!
 `sed` is a 'stream editor': it reads in a file (or standard input), makes edits, and prints the edited stream to standard output.
 `sed` is noninteractive; while you *can* use it to perform any old edit, it's best for situations where you want to automate editing.
@@ -171,13 +211,13 @@ Some handy `sed` flags:
 - `-r`: Use extended regular expressions. **NOTE**: even with extended regexes, `sed` is missing some character classes, such as `\d`.
 - `-n`: Suppress automatic printing, so only lines you explicitly print (e.g. with the `p` command) appear (handy for debugging).
 
-`sed` has several commands that you can use in conjunction with reglar expressions to perform edits.
+`sed` has several commands that you can use in conjunction with regular expressions to perform edits.
 One such command is the print command, `p`. It prints every line that a particular regex matches.
 `sed -n '/REGEX/ p'` works almost exactly like `grep REGEX` does.
 Use this command to make sure your regexes match what you think they should.
 
 The substitute command, `s`, substitutes the string matched by a regular expression with another string.
-`sed 's/REGEX/REPLACEMENT/'` replaces the match for `REGEX` with `REPLACEMENT.
+`sed 's/REGEX/REPLACEMENT/'` replaces the match for `REGEX` with `REPLACEMENT`.
 This lets you perform string transformations, or edits.
 
 For example,
@@ -217,12 +257,17 @@ grape pie
 There are even more `sed` commands, and more ways to combine them together.
 Fortunately for you, though, this is not a book on `sed`, so we'll leave it at that.
 It's definitely worthwhile to spend a bit of time looking through the `sed` manual if you find yourself needing to do something it's good for.
+Speaking of which, what is `sed` good for?
+
+- Renaming variables, functions, etc. in code.
+- Making changes to text file databases (CSV files, etc.).
+- Impressing your friends with your ability to write arcane commands.
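
As a sketch of the first use, the snippet below renames a variable with the `s` command and the `g` flag. The code fragment and the names `cnt` and `count` are hypothetical. This naive regex would also rewrite "cnt" inside longer identifiers, so treat it as a starting point rather than a safe refactoring tool.

```shell
# A made-up line of code that uses the variable name "cnt" three times.
code='int cnt = 0; cnt = cnt + 1;'

# s/cnt/count/g replaces every occurrence on the line, not just the first.
renamed=$(printf '%s\n' "$code" | sed 's/cnt/count/g')

printf '%s\n' "$renamed"
```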
 
 \newpage
 ## Questions
 Name: `______________________________`
 
-1. Suppose, for the sake of simplicity[^email], that we want to match email addresses whose addresses and domains are letters and numbers, like "abc123@xyz.wibble".
+1. Suppose, for the sake of simplicity,[^email] that we want to match email addresses whose addresses and domains are letters and numbers, like "abc123@xyz.wibble".
 Write a regular expression to match an email address.
 \vspace{8em}
 
@@ -233,11 +278,14 @@ Write a regular expression to match an email address.
 You're trying to reinvent the business as a hip, fancy eatery, "The Garden of Olives (And Also Peperoncinis)".
 As part of this reinvention, you need to jazz up that menu by replacing "pizza" with "focaccia and fresh tomato sauce".
 Suppose your menu is stored in "menu.txt". Write a command to update every instance of "pizza" and place the new, hip menu in "carte-du-jour.txt".
+(**Hint**: you'll need to use some of the I/O redirection stuff you learned in Chapter 2.)
 \newpage
 
 ## Quick Reference
 
-Regex:
+### Regex
+
+Character classes:
 
 - `.`: Matches one of any character.
 - `\w`: Matches a word character (letters, numbers, and \_).
@@ -247,32 +295,34 @@ Regex:
 - `\s`: Matches whitespace (space, tab, newline, carriage return, etc.).
 - `\S`: Matches non-whitespace (everything `\s` doesn’t match).
 
-<!-- -->
+Repetition:
 
 - `{n}`: matches *n* of the previous character class.
 - `{n,m}`: matches between *n* and *m* of the previous character class (inclusive).
 - `{n,}`: matches at least *n* of the previous character class.
-- `\*`: matches 0 or more of the previous character; short for `{0,}`.
+- `*`: matches 0 or more of the previous character; short for `{0,}`.
 - `+`: matches 1 or more of the previous character; short for `{1,}`.
 - `?`: matches 0 or 1 of the previous character; short for `{0,1}`.
 
-`grep REGEX [FILE]`: Search for `REGEX` in `FILE`, or standard input if no file is specified
+### `grep <REGEX> [<FILE>]`
+Search for `REGEX` in `FILE`, or standard input if no file is specified.
 
 - `-C LINES`: Give `LINES` lines of context around the match.
 - `-v`: Print every line that doesn’t match (it inverts the match).
 - `-i`: Ignore case when matching.
 - `-P`: Use Perl-style regular expressions.
 - `-o`: Only print the part of the line the regex matches.
 
-`sed COMMANDS [FILE]`: Perform `COMMANDS` to the contents of `FILE`, or standard input if no file is specified, and print the results to standard output
+### `sed <COMMANDS> [<FILE>]`
+Perform `COMMANDS` on the contents of `FILE`, or standard input if no file is specified, and print the results to standard output.
 
 - `-r`: Use extended regular expressions.
 - `-n`: Suppress automatic printing, so only explicitly printed lines (e.g. with `p`) appear.
 
-<!-- -->
+Common commands:
 
 - `/REGEX/ p`: Print lines that match `REGEX`
-- `s/REGEX/REPLACEMENT/`: Replace strings that match `REGEX` with `REPLACEMENT
+- `s/REGEX/REPLACEMENT/`: Replace strings that match `REGEX` with `REPLACEMENT`
 	- `g`: Replace every match on each line, rather than just the first match
 	- `i`: Make matches case insensitive
 
@@ -295,3 +345,4 @@ Regex:
 You may even feel slightly despondent as you realize that a piece of software being popular doesn't mean that it's good.
 That's what you get for thinking.
 [^email]: In practice, email addresses can have all sorts of things in them! Like spaces! Or quotes!
+[^grammar]: English professors, please look away.