Commit 8b8c1fa

committed
Expand on examples
1 parent 70cda99 commit 8b8c1fa

1 file changed

+104
-39
lines changed

_posts/2024-10-23-looking-around-logs.md

Lines changed: 104 additions & 39 deletions
Original file line number | Diff line number | Diff line change
@@ -10,12 +10,12 @@ Imagine a scenario where you're parsing a log, and you need to extract certain
1010
portions of data out of it. What tool do you reach for these days? Despite
1111
concerns about readability, most of us reach for regular expressions when faced
1212
with text parsing. Regular expressions have been a powerful tool in developers'
13-
toolboxes ever since they appeared in a rewritten version of
13+
toolboxes ever since they appeared in Ken Thompson's rewritten version of
1414
[QED](https://en.wikipedia.org/wiki/QED_(text_editor)).
1515

1616
Most people today, including myself, probably reach for their favorite robust
1717
scripting language and have at it. Today's languages of choice tend to be
18-
things such as Python, Ruby, PHP, Go, and whatever JavaScript library happens
18+
things such as Python, Rust, Go, and whatever JavaScript library happens
1919
to be the flavor of the week. 😊
2020

2121
The thing is, Perl has long been regarded as a language that takes regular
@@ -26,33 +26,38 @@ Perl's capabilities around regular expressions still have few rivals in 2024.
2626
One of Perl's secret weapons that many modern day languages like to take
2727
inspiration from is its support for zero-length assertions such as lookaheads
2828
and lookbehinds, collectively known as lookarounds. These assertions allow you
29-
to match based on context without actually consuming characters, making your
30-
regular expressions more precise and expressive.
29+
to match patterns based on context without actually consuming characters,
30+
making your regular expressions more precise and expressive.
3131

3232
In this post, we will dive into Perl's implementation of lookaround assertions.
3333
We'll explore what they are, when to use them, and why Perl remains one of the
3434
most copied tools for these advanced regex patterns. We’ll also compare these
3535
to more straightforward approaches in scripting such as trying to reach for
3636
`sed` or `awk`, the benefits of lookarounds, potential pitfalls, and more.
37+
Note that this post assumes the reader already has a general understanding of
38+
regular expressions, but I'll try to touch on some general basics just in case.
3739

3840
## What is Perl?
3941

4042
Perl is a high-level, general-purpose programming language known for its
4143
text-processing capabilities. Created by [Larry Wall](https://perldoc.perl.org/perlfaq1#What-is-Perl?),
4244
Perl quickly gained popularity because of its flexibility with regular
4345
expressions, making it a go-to language for parsing and manipulating text data.
44-
Even though newer languages like Python and JavaScript have gained traction,
45-
and their regular expression engines pull direct inspiration from Perl's,
46+
When [CGI scripting](https://en.wikipedia.org/wiki/Common_Gateway_Interface)
47+
became popular, giving developers the ability to generate dynamic web content,
48+
Perl established itself as the de facto scripting language of the Internet.
49+
Even though newer languages like Python, JavaScript, and Go have diminished
50+
Perl's dominance as the primary scripting language for the Internet,
4651
Perl's regex engine remains one of the most powerful out there even in 2024.
4752

4853
## Understanding Lookahead and Lookbehind Assertions
4954

5055
### Lookahead Assertions
5156

52-
First, let's take a look at lookaheads. Lookahead assertions allow you to match
53-
a string only if it is followed by a certain pattern. Lookahead is
54-
non-consuming, meaning it checks for a condition without including that
55-
condition in the match.
57+
First, let's take a look at [lookaheads](https://perldoc.perl.org/perlre#(?=pattern)).
58+
Lookahead assertions allow you to match a string only if it is followed by a
59+
certain pattern. Lookahead is non-consuming, meaning it checks for a condition
60+
without including that condition in the match.
5661

5762
#### Syntax
5863

@@ -67,16 +72,17 @@ follow the current position in the string.
6772
Suppose you want to match a word only if it is followed by a number.
6873
Here's our string we'll use as an example: `apple123 banana grape456`
6974

70-
Let's first start with what most programmer would reach for.
71-
We can do this with `sed` and `grep` with a bit of regex-fu:
75+
In this example, we want to grab the words `apple` and `grape`.
76+
Let's first start with what most programmers would reach for.
77+
We can do this with `grep` and `sed` with a bit of regex-fu:
7278

7379
```bash
7480
text="apple123 banana grape456"
7581
echo "$text" | grep -o '[a-zA-Z]\+[0-9]\+' | sed 's/[0-9]\+//g'
7682
```
7783

7884
But because Perl has capabilities that mimic `sed` and `grep`, we can do this
79-
same operaiton with a single command:
85+
same operation with a single command:
8086

8187
```perl
8288
my $text = "apple123 banana grape456";
@@ -87,9 +93,13 @@ while ($text =~ /([a-zA-Z]+)(?=\d)/g) {
8793

8894
In this example, the words `apple` and `grape` are matched because they are
8995
followed by a number. Notice that the number itself is not captured as part
90-
of the match. While some of this will probably be review for those who have
91-
worked with regular expressions before, we'll go ahead and go through it in
92-
detail. The breakdown of the regex is as follows:
96+
of the match. That's part of the beauty of lookarounds. We are able to match
97+
a pattern and grab only the portion we need with little overhead.
98+
99+
While some of this will probably be review for those familiar with regular
100+
expressions, we'll go through it in detail just in case. This explanation
101+
will also help if you were confused during the `grep` and `sed` example above.
102+
The breakdown of the regex is as follows:
93103

94104
1. `([a-zA-Z]+)`:
95105
- `[a-zA-Z]` is a character class that matches any uppercase or
@@ -103,6 +113,9 @@ detail. The breakdown of the regex is as follows:
103113
position in the string is a digit (`\d`).
104114
- The lookahead does not consume any characters, meaning it only verifies
105115
the presence of a digit without including it in the match.
116+
- In the previous `grep` and `sed` example, we had to use a character class
117+
that only grabbed the digits 0 through 9 and had to apply it multiple times.
118+
With Perl, this single lookahead handles both instances efficiently.
106119
3. `/g` modifier:
107120
- The `g` modifier stands for "global," which allows the regex to find all
108121
occurrences in the string, not just the first one.
@@ -114,23 +127,37 @@ followed by a four digit sequence:
114127

115128
```perl
116129
my $text = "apple1234 banana grape456 orange7890 pear2468";
117-
while ($text =~ /(\w+)(?=\d{4}+)/g) {
130+
while ($text =~ /(\w+)(?=\d{4})/g) {
118131
print "$1\n"; # Matches apple, orange, and pear
119132
}
120133
```
121134

122-
In this example, it will output `apple`, `orange`, and `pear`. In fact, we are
123-
using another nice-ism that Perl provides for us which is the `\w+` portion of
124-
this code. The `\w` character class matches any "word" character which includes
125-
letters, digits, and underscores. The `+` performs the same action as stated
126-
above where it will match as many word characters (including digits)
127-
as possible. `{4}` specifies that it must contain exactly four digits to be
128-
considered successful.
135+
In this example, it will output `apple`, `orange`, and `pear`. Let's break
136+
down the regex:
137+
138+
1. `(\w+)`:
139+
- The `\w` character class matches any "word" character, which includes
140+
letters (both uppercase and lowercase), digits, and underscores.
141+
- The `+` quantifier means "one or more occurrences" of the preceding
142+
character class. So, `\w+` matches one or more word characters.
143+
2. `(?=\d{4})`:
144+
- This is a lookahead assertion that checks if what follows the current
145+
position in the string is exactly four digits.
146+
- The `{4}` is the quantifier that explicitly specifies that there must be
147+
exactly four digits `(\d)` following the matched word. Without this, it
148+
would only check for a single digit.
149+
150+
The while loop iterates through the string, capturing words that are
151+
immediately followed by a four-digit sequence, and successfully grabs
152+
`apple`, `orange`, and `pear` because these words are followed by 1234, 7890,
153+
and 2468, respectively.
129154

130155
However, there are some slight drawbacks, as this isn't quite as intuitive as
131156
it might seem. For example, let's say we wanted to match *any* word that
132-
contained any number of digits after it. You might think that we would need to
133-
simple remove the `{4}` and keep everything else.
157+
contained *any* number of digits after it. You might think that we would need
158+
to simply remove the `{4}` and keep everything else. After all, `\d` matches
159+
all digits, and the `+` quantifier will extend it to be one or more digits.
160+
Let's try that:
134161

135162
```perl
136163
my $text = "apple1234 banana grape456 orange7890 pear2468";
@@ -139,14 +166,34 @@ while ($text =~ /(\w+)(?=\d+)/g) {
139166
}
140167
```
141168

142-
If we did that, we would actually get `apple123`, `grape45`, `orange789`,
143-
and `pear246`. The reason is because we used the `\w` character class. In the
144-
previous example, it was successful because we only ever considered a string
145-
to be successful if it contained exactly 4 digits. The word character class
146-
naturally will grab any number of letters, digits, or underscores and consider
147-
them to be a successful match, so you need to be careful when combining certain
148-
character classes together, as the results might not be exactly what you would
149-
expect if you didn't pay close attention to the documentation.
169+
In this case, you might expect to get words like `apple`, `grape`, `orange`,
170+
and `pear`. However, we would actually get `apple123`, `grape45`, `orange789`,
171+
and `pear246`. Not only do the matches include digits, but every match is
172+
missing its last digit. Let's break down why:
173+
174+
1. Initial Match Attempt:
175+
- The regex engine starts at the beginning of the string and matches `\w+` as
176+
much as possible. For `apple1234`, `\w+` matches `apple1234`.
177+
2. Lookahead Check:
178+
- The lookahead `(?=\d+)` checks if there are digits following the current
179+
match. Since `\w+` consumed all characters including the digits,
180+
the lookahead fails because there are no digits left to match.
181+
3. Backtracking:
182+
- The regex engine backtracks one character at a time from the end of the
183+
current match. It reduces the match to `apple123` and checks the
184+
lookahead again. This continues until the lookahead finds a digit to match.
185+
186+
The resulting matches end up being `apple123`, `grape45`, `orange789`, and
187+
`pear246`. Each match is missing a digit because the last digit is what allows
188+
the lookahead to succeed.
189+
190+
To correct this, we need a character class that does not match digits. Using the
191+
initial example's `([a-zA-Z]+)` will allow this to successfully match
192+
`apple`, `grape`, `orange`, and `pear`.
193+
194+
The point here is that you need to be careful when combining certain character
195+
classes together, as the results might not be exactly what you would expect if
196+
you didn't pay close attention to the documentation.
150197

151198
### Lookbehind Assertions
152199

@@ -261,7 +308,9 @@ has hundreds of thousands of entries like this:
261308

262309
Your goal is to extract the usernames from entries that contain the path
263310
`/admin/` and return a 404 status code. In this case, it would be `john_doe`
264-
and `alice`.
311+
and `alice`. Let's try this with a few popular methods. We won't be going into
312+
detail on all of these like before, but I have provided some links at the end
313+
that might help for further reading.
265314

266315
#### Using Sed
267316

@@ -308,12 +357,28 @@ while (my $line = <$fh>) {
308357
close $fh;
309358
```
310359

311-
In this Perl example, we use a lookbehind to ensure the line contains the
312-
username after the dash (`- `), and a lookahead to ensure it is followed by a
313-
request to the `/admin/` path that resulted in a `404` status code.
360+
This regex uses both positive lookbehind and positive lookahead together.
361+
362+
1. Positive Lookbehind `(?<=- )`:
363+
- This asserts that what precedes the current position in the string is `- `.
364+
This part does not consume characters; it only checks that the string
365+
immediately before the current position ends with `- `.
366+
2. Capture Group `([^ ]+)`:
367+
- This captures one or more characters that are not a space
368+
(`[^ ]` means "any character *except* a space"). This part effectively
369+
captures the text that follows the `- ` until the next space.
370+
3. Positive Lookahead `(?= \[.*\] "GET \/admin\/.*" 404 )`:
371+
- This asserts that what follows the captured group is a space,
372+
followed by square brackets containing any characters `(\[.*\])`,
373+
followed by a space, the string `"GET /admin/.*"`
374+
(matching any request to `/admin/`), and then a space, and `404`.
375+
314376
The regex is semi-readable, and for those who have learned about lookarounds,
315377
the lookahead/lookbehind mechanism makes it clear what context we are looking
316-
for without consuming unnecessary portions of the line.
378+
for without consuming unnecessary portions of the line. I say semi-readable
379+
because someone who hasn't studied regular expressions may struggle with it at
380+
first glance, but the payoff is that this regex can be extended more safely
381+
than the `sed` or `awk` versions.
317382

318383
#### Using Python
319384
