@@ -32,7 +32,8 @@ The code to find prefixes and search for prefixes is in src/literals.rs. When
  more than one literal prefix is found, we fall back to an Aho-Corasick DFA
  using the aho-corasick crate. For one literal, we use a variant of the
  Boyer-Moore algorithm. Both Aho-Corasick and Boyer-Moore use `memchr` when
- appropriate.
+ appropriate. The Boyer-Moore variant in this library also uses elementary
+ frequency analysis to choose the right byte to run `memchr` with.

  Of course, detecting prefix literals can only take us so far. Not all regular
  expressions have literal prefixes. To remedy this, we try another approach to
@@ -53,10 +54,12 @@ text results in at most one new DFA state. It is made fast by caching states.
  DFAs are susceptible to exponential state blow up (where the worst case is
  computing a new state for every input byte, regardless of what's in the state
  cache). To avoid using a lot of memory, the lazy DFA uses a bounded cache. Once
- the cache is full, it is wiped and state computation starts over again.
+ the cache is full, it is wiped and state computation starts over again. If the
+ cache is wiped too frequently, then the DFA gives up and searching falls back
+ to one of the aforementioned algorithms.

- All of the above matching engines expose precisely the matching semantics. This
- is indeed tested. (See the section below about testing.)
+ All of the above matching engines expose precisely the same matching semantics.
+ This is indeed tested. (See the section below about testing.)

  The following sub-sections describe the rest of the library and how each of the
  matching engines are actually used.
@@ -70,6 +73,9 @@ encountered. Parsing is done in a separate crate so that others may benefit
  from its existence, and because it is relatively divorced from the rest of the
  regex library.

+ The regex-syntax crate also provides sophisticated support for extracting
+ prefix and suffix literals from regular expressions.
+
  ### Compilation

  The compiler is in src/compile.rs. The input to the compiler is some abstract
@@ -162,7 +168,7 @@ knows what the caller wants. Using this information, we can determine which
  engine (or engines) to use.

  The logic for choosing which engine to execute is in src/exec.rs and is
- documented on the Exec type. Exec values collection regular expression
+ documented on the Exec type. Exec values contain regular expression
  Programs (defined in src/prog.rs), which contain all the necessary tidbits
  for actually executing a regular expression on search text.

@@ -172,6 +178,14 @@ of src/exec.rs by far is the execution of the lazy DFA, since it requires a
  forwards and backwards search, and then falls back to either the NFA algorithm
  or backtracking if the caller requested capture locations.

+ The parameterization of every search is defined in src/params.rs. Among other
+ things, search parameters provide storage for recording capture locations and
+ matches (for regex sets). The existence and nature of storage is itself a
+ configuration for how each matching engine behaves. For example, if no storage
+ for capture locations is provided, then the matching engines can give up as
+ soon as a match is witnessed (which may occur well before the leftmost-first
+ match).
+
  ### Programs

  A regular expression program is essentially a sequence of opcodes produced by
@@ -268,48 +282,46 @@ N.B. To run tests for the `regex!` macro, use:

  The benchmarking in this crate is made up of many micro-benchmarks. Currently,
  there are two primary sets of benchmarks: the benchmarks that were adopted at
- this library's inception (in `benches/bench.rs`) and a newer set of benchmarks
+ this library's inception (in `benches/src/misc.rs`) and a newer set of benchmarks
  meant to test various optimizations. Specifically, the latter set contain some
- analysis and are in `benches/bench_sherlock.rs`. Also, the latter set are all
+ analysis and are in `benches/src/sherlock.rs`. Also, the latter set are all
  executed on the same lengthy input whereas the former benchmarks are executed
  on strings of varying length.

  There is also a smattering of benchmarks for parsing and compilation.

+ Benchmarks are in a separate crate so that its dependencies can be managed
+ separately from the main regex crate.
+
  Benchmarking follows a similarly wonky setup as tests. There are multiple
  entry points:

- * `bench_native.rs` - benchmarks the `regex!` macro
- * `bench_dynamic.rs` - benchmarks `Regex::new`
- * `bench_dynamic_nfa.rs` benchmarks `Regex::new`, forced to use the NFA
- algorithm on every regex. (N.B. This can take a few minutes to run.)
+ * `bench_rust_plugin.rs` - benchmarks the `regex!` macro
+ * `bench_rust.rs` - benchmarks `Regex::new`
+ * `bench_rust_bytes.rs` benchmarks `bytes::Regex::new`
  * `bench_pcre.rs` - benchmarks PCRE
+ * `bench_onig.rs` - benchmarks Oniguruma

- The PCRE benchmarks exist as a comparison point to a mature regular expression
- library. In general, this regex library compares favorably (there are even a
- few benchmarks that PCRE simply runs too slowly on or outright can't execute at
- all). I would love to add other regular expression library benchmarks
- (especially RE2), but PCRE is the only one with reasonable bindings.
+ The PCRE and Oniguruma benchmarks exist as a comparison point to a mature
+ regular expression library. In general, this regex library compares favorably
+ (there are even a few benchmarks that PCRE simply runs too slowly on or
+ outright can't execute at all). I would love to add other regular expression
+ library benchmarks (especially RE2).

  If you're hacking on one of the matching engines and just want to see
  benchmarks, then all you need to run is:

-     $ cargo bench --bench dynamic
+     $ ./run-bench rust

  If you want to compare your results with older benchmarks, then try:

-     $ cargo bench --bench dynamic | tee old
+     $ ./run-bench rust | tee old
      $ ... make it faster
-     $ cargo bench --bench dynamic | tee new
+     $ ./run-bench rust | tee new
      $ cargo-benchcmp old new --improvements

  The `cargo-benchcmp` utility is available here:
  https://github.com/BurntSushi/cargo-benchcmp

- To run the same benchmarks on PCRE, you'll need to use the sub-crate in
- `regex-pcre-benchmark` like so:
-
-     $ cargo bench --manifest-path regex-pcre-benchmark/Cargo.toml
-
- The PCRE benchmarks are separated from the main regex crate so that its
- dependency doesn't break builds in environments without PCRE.
+ The `run-bench` utility can run benchmarks for PCRE and Oniguruma too. See
+ `./run-bench --help`.
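
The first hunk mentions using elementary frequency analysis to pick the byte handed to `memchr`. A rough sketch of that idea follows. It is illustrative only and is not the code in src/literals.rs: `rank`, `rarest_byte_offset`, and `find` are hypothetical names, the frequency values are invented for the example, and a plain iterator scan stands in for `memchr`.

```rust
/// Hypothetical frequency rank: a higher value means the byte is assumed to
/// be more common in typical text. These numbers are made up; a real table
/// would be derived from a corpus.
fn rank(b: u8) -> u8 {
    match b {
        b' ' => 255,
        b'e' | b't' | b'a' | b'o' | b'i' | b'n' => 200,
        b'a'..=b'z' => 120,
        b'A'..=b'Z' => 60,
        b'0'..=b'9' => 50,
        _ => 10,
    }
}

/// Returns the offset of the lowest-ranked (assumed rarest) byte in
/// `literal`, or `None` if the literal is empty. Ties go to the first byte.
fn rarest_byte_offset(literal: &[u8]) -> Option<usize> {
    literal
        .iter()
        .enumerate()
        .min_by_key(|&(_, &b)| rank(b))
        .map(|(i, _)| i)
}

/// Finds the first occurrence of `literal` in `haystack` by scanning for the
/// rare byte and verifying the full literal at each candidate position.
/// An empty literal yields `None` in this sketch.
fn find(haystack: &[u8], literal: &[u8]) -> Option<usize> {
    let off = rarest_byte_offset(literal)?;
    let rare = literal[off];
    // The rare byte cannot appear before `off` in any match, so start there.
    let mut start = off;
    while start < haystack.len() {
        // Stand-in for memchr: locate the next occurrence of the rare byte.
        let pos = start + haystack[start..].iter().position(|&b| b == rare)?;
        let begin = pos - off;
        if haystack[begin..].starts_with(literal) {
            return Some(begin);
        }
        start = pos + 1;
    }
    None
}

fn main() {
    println!("{:?}", find(b"the quick brown fox", b"quick"));
}
```

The point of the heuristic is that scanning for a byte that rarely occurs produces fewer false candidates, so the fast byte scan does most of the work and the full comparison runs less often.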