Expand on examples

tomice · tomice · commit 8b8c1faf8bb7 · 2024-10-23T17:02:51.000Z
diff --git a/_posts/2024-10-23-looking-around-logs.md b/_posts/2024-10-23-looking-around-logs.md
@@ -10,12 +10,12 @@ Imagine a scenario where you're parsing a log, and you need to extract certain
 portions of data out of it. What tool do you reach for these days? Despite
 concerns about readability, most of us reach for regular expressions when faced
 with text parsing. Regular expressions have been a powerful tool in developers'
-toolboxes ever since they appeared in a rewritten version of
+toolboxes ever since they appeared in Ken Thompson's rewritten version of
 [QED](https://en.wikipedia.org/wiki/QED_(text_editor)).
 
 Most people today, including myself, probably reach for their favorite robust
 scripting language and have at it. Today's languages of choice tend to be
-things such as Python, Ruby, PHP, Go, and whatever JavaScript library happens
+things such as Python, Rust, Go, and whatever JavaScript library happens
 to be the flavor of the week. 😊
 
 The thing is, Perl has long been regarded as a language that takes regular
@@ -26,33 +26,38 @@ Perl's capabilities around regular expressions still have few rivals in 2024.
 One of Perl's secret weapons that many modern day languages like to take
 inspiration from is its support for zero-length assertions such as lookaheads
 and lookbehinds, collectively known as lookarounds. These assertions allow you
-to match based on context without actually consuming characters, making your
-regular expressions more precise and expressive.
+to match patterns based on context without actually consuming characters,
+making your regular expressions more precise and expressive.
 
 In this post, we will dive into Perl's implementation of lookaround assertions.
 We'll explore what they are, when to use them, and why Perl remains one of the
 most copied tools for these advanced regex patterns. We’ll also compare these
 to more straightforward approaches in scripting such as trying to reach for
 `sed` or `awk`, the benefits of lookarounds, potential pitfalls, and more.
+Note that this post assumes the reader already has a general understanding of
+regular expressions, but I'll try to touch on some general basics just in case.
 
 ## What is Perl?
 
 Perl is a high-level, general-purpose programming language known for its
 text-processing capabilities. Created by [Larry Wall](https://perldoc.perl.org/perlfaq1#What-is-Perl?),
 Perl quickly gained popularity because of its flexibility with regular
 expressions, making it a go-to language for parsing and manipulating text data.
-Even though newer languages like Python and JavaScript have gained traction,
-and their regular expression engines pull direct inspiration from Perl's,
+When [CGI scripting](https://en.wikipedia.org/wiki/Common_Gateway_Interface)
+became popular, giving developers the ability to generate dynamic web content,
+Perl established itself as the de facto scripting language of the Internet.
+Even though newer languages like Python, JavaScript, and Go have diminished
+Perl's dominance as the primary scripting language for the Internet,
 Perl's regex engine remains one of the most powerful out there even in 2024.
 
 ## Understanding Lookahead and Lookbehind Assertions
 
 ### Lookahead Assertions
 
-First, let's take a look at lookaheads. Lookahead assertions allow you to match
-a string only if it is followed by a certain pattern. Lookahead is
-non-consuming, meaning it checks for a condition without including that
-condition in the match.
+First, let's take a look at [lookaheads](https://perldoc.perl.org/perlre#(?=pattern)).
+Lookahead assertions allow you to match a string only if it is followed by a
+certain pattern. Lookahead is non-consuming, meaning it checks for a condition
+without including that condition in the match.
 
 #### Syntax
 
@@ -67,16 +72,17 @@ follow the current position in the string.
 Suppose you want to match a word only if it is followed by a number.
 Here's our string we'll use as an example: `apple123 banana grape456`
 
-Let's first start with what most programmer would reach for.
-We can do this with `sed` and `grep` with a bit of regex-fu:
+In this example, we want to grab the words `apple` and `grape`.
+Let's first start with what most programmers would reach for.
+We can do this with `grep` and `sed` with a bit of regex-fu:
 
 ```bash
 text="apple123 banana grape456"
 echo "$text" | grep -o '[a-zA-Z]\+[0-9]\+' | sed 's/[0-9]\+//g'
 ```
 
 But because Perl has capabilities that mimic `sed` and `grep`, we can do this
-same operaiton with a single command:
+same operation with a single command:
 
 ```perl
 my $text = "apple123 banana grape456";
@@ -87,9 +93,13 @@ while ($text =~ /([a-zA-Z]+)(?=\d)/g) {
 
 In this example, the words `apple` and `grape` are matched because they are
 followed by a number. Notice that the number itself is not captured as part
-of the match. While some of this will probably be review for those who have
-worked with regular expressions before, we'll go ahead and go through it in
-detail. The breakdown of the regex is as follows:
+of the match. That's part of the beauty of lookarounds. We are able to match
+a pattern and grab only the portion we need with little overhead.
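+
+As a quick sanity check, here is a minimal sketch showing that the digits are
+checked but never consumed. It relies on Perl's built-in `$&` (the matched
+text) and `@+` (match end offsets); the `$matched` and `$rest` names are only
+illustrative:
+
+```perl
+use strict;
+use warnings;
+
+my $text = "apple123 banana grape456";
+my ($matched, $rest);
+
+if ($text =~ /[a-zA-Z]+(?=\d)/) {
+    $matched = $&;                    # Text the regex actually consumed
+    $rest    = substr($text, $+[0]);  # Everything after the end of the match
+}
+
+print "matched: $matched\n";  # Prints "matched: apple"
+print "rest: $rest\n";        # Prints "rest: 123 banana grape456"
+```
+
+The digits show up in the leftover text rather than in the match, which is
+what "non-consuming" means in practice.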
+
+While some of this will probably be review for those familiar with regular
+expressions, we'll go through it in detail just in case. This explanation
+will also help if you were confused during the `grep` and `sed` example above.
+The breakdown of the regex is as follows:
 
 1. `([a-zA-Z]+)`:
   - `[a-zA-Z]` is a character class that matches any uppercase or
@@ -103,6 +113,9 @@ detail. The breakdown of the regex is as follows:
     position in the string is a digit (`\d`).
   - The lookahead does not consume any characters, meaning it only verifies
     the presence of a digit without including it in the match.
+  - In the previous `grep` and `sed` example, the `[0-9]\+` pattern had to
+    appear twice: once in `grep` to require the digits and once in `sed` to
+    strip them. With Perl, the single lookahead requires the digits and keeps
+    them out of the match.
 3. `/g` modifier:
   - The `g` modifier stands for "global," which allows the regex to find all
     occurrences in the string, not just the first one.
@@ -114,23 +127,37 @@ followed by a four digit sequence:
 
 ```perl
 my $text = "apple1234 banana grape456 orange7890 pear2468";
-while ($text =~ /(\w+)(?=\d{4}+)/g) {
+while ($text =~ /(\w+)(?=\d{4})/g) {
     print "$1\n"; # Matches apple, orange, and pear
 }
 ```
 
-In this example, it will output  `apple`, `orange`, and `pear`. In fact, we are
-using another nice-ism that Perl provides for us which is the `\w+` portion of
-this code. The `\w` character class matches any "word" character which includes
-letters, digits, and underscores. The `+` performs the same action as stated
-above where it will match as many word characters (including digits)
-as possible. `{4}` specifies that it must contain exactly four digits to be
-considered successful.
+In this example, it will output `apple`, `orange`, and `pear`. Let's break
+down the regex:
+
+1. `(\w+)`:
+  - The `\w` character class matches any "word" character, which includes
+    letters (both uppercase and lowercase), digits, and underscores.
+  - The `+` quantifier means "one or more occurrences" of the preceding
+    character class. So, `\w+` matches one or more word characters.
+2. `(?=\d{4})`:
+  - This is a lookahead assertion that checks whether the next four characters
+    after the current position in the string are all digits.
+  - The `{4}` is the quantifier that explicitly specifies that four digits
+    (`\d`) must follow the matched word. Without it, a single digit would be
+    enough to satisfy the lookahead.
+
+The while loop iterates through the string, capturing words that are
+immediately followed by a four-digit sequence. It successfully grabs
+`apple`, `orange`, and `pear` because these words are followed by 1234, 7890,
+and 2468, respectively.
 
 However, there are some slight drawbacks, as this isn't quite as intuitive as
 it might seem. For example, let's say we wanted to match *any* word that
-contained any number of digits after it. You might think that we would need to
-simple remove the `{4}` and keep everything else.
+contained *any* number of digits after it. You might think that we would
+simply need to swap the `{4}` for a `+` and keep everything else. After all,
+`\d` matches any digit, and the `+` quantifier will extend it to be one or
+more digits. Let's try that:
 
 ```perl
 my $text = "apple1234 banana grape456 orange7890 pear2468";
@@ -139,14 +166,34 @@ while ($text =~ /(\w+)(?=\d+)/g) {
 }
 ```
 
-If we did that, we would actually get `apple123`, `grape45`, `orange789`,
-and `pear246`. The reason is because we used the `\w` character class. In the
-previous example, it was successful because we only ever considered a string
-to be successful if it contained exactly 4 digits. The word character class
-naturally will grab any number of letters, digits, or underscores and consider
-them to be a successful match, so you need to be careful when combining certain
-character classes together, as the results might not be exactly what you would
-expect if you didn't pay close attention to the documentation.
+In this case, you might expect to get words like `apple`, `grape`, `orange`,
+and `pear`. However, we would actually get `apple123`, `grape45`, `orange789`,
+and `pear246`. Not only do the matches include digits, but every match is
+also missing its final digit. Let's break down why:
+
+1. Initial Match Attempt:
+  - The regex engine starts at the beginning of the string and matches `\w+` as
+    much as possible. For `apple1234`, `\w+` matches `apple1234`.
+2. Lookahead Check:
+  - The lookahead `(?=\d+)` checks if there are digits following the current
+    match. Since `\w+` consumed all characters including the digits,
+    the lookahead fails because there are no digits left to match.
+3. Backtracking:
+  - The regex engine backtracks one character at a time from the end of the
+    current match. It reduces the match to `apple123` and checks the
+    lookahead again. This time the next character is `4`, so the lookahead
+    succeeds on the very first backtracking step.
+
+The resulting matches end up being `apple123`, `grape45`, `orange789`, and
+`pear246`. Each match is missing a digit because the last digit is what allows
+the lookahead to succeed.
+
+To correct this, we need a more restrictive character class, one that cannot
+match digits. Using the initial example's `([a-zA-Z]+)` will allow this to
+successfully match `apple`, `grape`, `orange`, and `pear`.
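+
+As a minimal sketch of that fix, here is the same loop with only the capture
+group swapped for the letters-only class (the `@words` array is just an
+illustrative name used to collect the results):
+
+```perl
+use strict;
+use warnings;
+
+my $text = "apple1234 banana grape456 orange7890 pear2468";
+my @words;
+
+# [a-zA-Z]+ cannot consume digits, so the capture stops where the digits begin
+while ($text =~ /([a-zA-Z]+)(?=\d+)/g) {
+    push @words, $1;
+}
+
+print "$_\n" for @words; # Prints apple, grape, orange, and pear
+```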
+
+The point here is that you need to be careful when combining certain character
+classes together, as the results might not be exactly what you would expect if
+you didn't pay close attention to the documentation.
 
 ### Lookbehind Assertions
 
@@ -261,7 +308,9 @@ has hundreds of thousands of entries like this:
 
 Your goal is to extract the usernames from entries that contain the path
 `/admin/` and return a 404 status code. In this case, it would be `john_doe`
-and `alice`.
+and `alice`. Let's try this with a few popular methods. We won't be going into
+detail on all of these like before, but I have provided some links at the end
+that might help for further reading.
 
 #### Using Sed
 
@@ -308,12 +357,28 @@ while (my $line = <$fh>) {
 close $fh;
 ```
 
-In this Perl example, we use a lookbehind to ensure the line contains the
-username after the dash (`- `), and a lookahead to ensure it is followed by a
-request to the `/admin/` path that resulted in a `404` status code.
+This regex uses both positive lookbehind and positive lookahead together.
+
+1. Positive Lookbehind `(?<=- )`:
+  - This asserts that the text immediately before the current position in the
+    string is `- `. This part does not consume characters; it only checks
+    what sits directly in front of the match.
+2. Capture Group `([^ ]+)`:
+  - This captures one or more characters that are not a space
+    (`[^ ]` means "any character *except* a space"). This part effectively
+    captures the text that follows the `- ` until the next space.
+3. Positive Lookahead `(?= \[.*\] "GET \/admin\/.*" 404 )`:
+  - This asserts that what follows the captured group is a space,
+    followed by square brackets containing any characters (`\[.*\]`),
+    followed by a space, the quoted string `"GET /admin/.*"`
+    (matching any request to `/admin/`), another space, `404`, and a
+    trailing space.
+
 The regex is semi-readable, and for those who have learned about lookarounds,
 the lookahead/lookbehind mechanism makes it clear what context we are looking
-for without consuming unnecessary portions of the line.
+for without consuming unnecessary portions of the line. I say semi-readable
+because someone who hasn't studied regular expressions may struggle with this
+at first glance, but the payoff is that the pattern can be extended more
+safely than the `sed` or `awk` versions.
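+
+To see the three pieces working together without a log file, here is a small
+self-contained sketch. The sample entries below are invented for illustration
+and only assume the same general shape as the log described above; `@lines`
+and `@users` are hypothetical names:
+
+```perl
+use strict;
+use warnings;
+
+# Hypothetical sample entries, made up for this sketch
+my @lines = (
+    '192.168.1.10 - john_doe [23/Oct/2024:10:15:32 +0000] "GET /admin/settings HTTP/1.1" 404 512',
+    '192.168.1.11 - bob [23/Oct/2024:10:16:01 +0000] "GET /index.html HTTP/1.1" 200 1024',
+    '192.168.1.12 - alice [23/Oct/2024:10:17:45 +0000] "GET /admin/users HTTP/1.1" 404 256',
+    '192.168.1.13 - carol [23/Oct/2024:10:18:00 +0000] "GET /admin/login HTTP/1.1" 200 300',
+);
+
+my @users;
+for my $line (@lines) {
+    # Lookbehind anchors on "- ", the capture grabs the username, and the
+    # lookahead requires an /admin/ request that returned 404
+    if ($line =~ /(?<=- )([^ ]+)(?= \[.*\] "GET \/admin\/.*" 404 )/) {
+        push @users, $1;
+    }
+}
+
+print "$_\n" for @users; # Prints john_doe and alice
+```
+
+Note that `carol` is skipped even though she requested `/admin/`, because the
+lookahead also insists on the `404` status.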
 
 #### Using Python