Commit 175761e

Add speculative Unicode word boundary support to the lazy DFA.
Hooray! The DFA will now try to interpret Unicode word boundaries as if they were ASCII word boundaries. If the DFA comes across a non-ASCII byte, then it will give up and fall back to the slower NFA simulation. Nevertheless, this prevents us from degrading to very slow matching in a large number of cases. Thanks very much to @raphlinus who had the essential idea of "speculative matching."
1 parent 8d81a54 commit 175761e
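
The speculative strategy described in the commit message can be pictured as a fast path that reports when it has to give up, plus a slower fallback. The following is a minimal sketch of that control flow only, not the crate's code: `Outcome`, `fast_count_words`, `slow_count_words` and `count_words` are hypothetical names, and `char::is_alphanumeric` only approximates the Unicode definition of a word character.

```rust
/// Result of the speculative fast path (hypothetical type).
#[derive(Debug, PartialEq)]
enum Outcome {
    /// The fast path finished and its answer can be trusted.
    Done(usize),
    /// The fast path saw input it cannot handle correctly.
    Quit,
}

fn is_ascii_word(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Fast path: counts word starts byte by byte, treating every boundary as
/// an ASCII word boundary. Gives up on the first non-ASCII byte.
fn fast_count_words(hay: &[u8]) -> Outcome {
    let mut count = 0;
    let mut prev = false;
    for &b in hay {
        if b >= 128 {
            return Outcome::Quit;
        }
        let cur = is_ascii_word(b);
        if cur && !prev {
            count += 1;
        }
        prev = cur;
    }
    Outcome::Done(count)
}

/// Slow path: decodes full characters, so non-ASCII letters count as
/// word characters (an approximation of Unicode `\w`).
fn slow_count_words(hay: &str) -> usize {
    let mut count = 0;
    let mut prev = false;
    for c in hay.chars() {
        let cur = c.is_alphanumeric() || c == '_';
        if cur && !prev {
            count += 1;
        }
        prev = cur;
    }
    count
}

/// Try the fast path first; fall back only if it quits.
fn count_words(hay: &str) -> usize {
    match fast_count_words(hay.as_bytes()) {
        Outcome::Done(n) => n,
        Outcome::Quit => slow_count_words(hay),
    }
}
```

In this sketch, pure-ASCII input never touches the slow path, and input containing a non-ASCII byte pays for the fast path's partial scan before the fallback runs from the beginning.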

4 files changed: +64 -32 lines changed

PERFORMANCE.md

Lines changed: 19 additions & 10 deletions
@@ -200,17 +200,19 @@ just by examing the first (or last) three bytes of the haystack.
 **Advice**: Literals can reduce the work that the regex engine needs to do. Use
 them if you can, especially as prefixes.
 
-## Unicode word boundaries prevent the DFA from being used
+## Unicode word boundaries may prevent the DFA from being used
 
-It's a sad state of the current implementation. It's not clear when or if
-Unicode word boundaries will be salvaged, but as it stands right now, using
-them automatically disqualifies use of the DFA, which can mean an order of
-magnitude slowdown in search time. There are two ways to ameliorate this:
+It's a sad state of the current implementation. At the moment, the DFA will try
+to interpret Unicode word boundaries as if they were ASCII word boundaries.
+If the DFA comes across any non-ASCII byte, it will quit and fall back to an
+alternative matching engine that can handle Unicode word boundaries correctly.
+The alternate matching engine is generally quite a bit slower (perhaps by an
+order of magnitude). If necessary, this can be ameliorated in two ways.
 
 The first way is to add some number of literal prefixes to your regular
-expression. Even though the DFA won't be used, specialized routines will still
-kick in to find prefix literals quickly, which limits how much work the NFA
-simulation will need to do.
+expression. Even though the DFA may not be used, specialized routines will
+still kick in to find prefix literals quickly, which limits how much work the
+NFA simulation will need to do.
 
 The second way is to give up on Unicode and use an ASCII word boundary instead.
 One can use an ASCII word boundary by disabling Unicode support. That is,
@@ -221,11 +223,18 @@ to a syntax error if the regex could match arbitrary bytes. For example, if one
 wrote `(?-u)\b.+\b`, then a syntax error would be returned because `.` matches
 any *byte* when the Unicode flag is disabled.
 
+The second way isn't appreciably different than just using a Unicode word
+boundary in the first place, since the DFA will speculatively interpret it as
+an ASCII word boundary anyway. The key difference is that if an ASCII word
+boundary is used explicitly, then the DFA won't quit in the presence of
+non-ASCII UTF-8 bytes. This results in giving up correctness in exchange for
+more consistent performance.
+
 N.B. When using `bytes::Regex`, Unicode support is disabled by default, so one
 can simply write `\b` to get an ASCII word boundary.
 
-**Advice**: Use `(?-u:\b)` instead of `\b` if you care about performance more
-than correctness.
+**Advice**: In most cases, `\b` should work well. If not, use `(?-u:\b)`
+instead of `\b` if you care about consistent performance more than correctness.
 
 ## Excessive counting can lead to exponential state blow up in the DFA
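
The correctness trade-off mentioned in the new PERFORMANCE.md text can be made concrete with a small standalone sketch. It does not use the regex crate: `boundaries`, `ascii_word` and `unicode_word` are hypothetical helpers, and `char::is_alphanumeric` is only an approximation of what Unicode `\w` covers. The sketch lists the byte offsets at which a word boundary would be reported under each definition.

```rust
/// ASCII notion of a word character: [0-9A-Za-z_].
fn ascii_word(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// Rough stand-in for the Unicode notion of a word character.
fn unicode_word(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// Returns every byte offset where "is a word character" flips between
/// the previous and the next character, including the start and end of
/// the haystack when they touch a word character.
fn boundaries(hay: &str, is_word: fn(char) -> bool) -> Vec<usize> {
    let mut out = Vec::new();
    let mut prev = false;
    for (i, c) in hay.char_indices() {
        let cur = is_word(c);
        if cur != prev {
            out.push(i);
        }
        prev = cur;
    }
    if prev {
        out.push(hay.len());
    }
    out
}
```

On ASCII-only text such as `"foo bar"` both definitions agree. On `"na\u{ef}ve"` the ASCII definition treats the two-byte `ï` as a non-word character and reports extra boundaries at offsets 2 and 4, while the Unicode definition only reports the start and end of the word.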

src/compile.rs

Lines changed: 2 additions & 0 deletions
@@ -269,10 +269,12 @@ impl Compiler {
                 self.c_empty_look(prog::EmptyLook::EndText)
             }
             WordBoundary => {
+                self.compiled.has_unicode_word_boundary = true;
                 self.byte_classes.set_word_boundary();
                 self.c_empty_look(prog::EmptyLook::WordBoundary)
             }
             NotWordBoundary => {
+                self.compiled.has_unicode_word_boundary = true;
                 self.byte_classes.set_word_boundary();
                 self.c_empty_look(prog::EmptyLook::NotWordBoundary)
             }

src/dfa.rs

Lines changed: 40 additions & 22 deletions
@@ -73,7 +73,6 @@ const CACHE_LIMIT: usize = 2 * (1<<20);
 /// of tracking multi-byte assertions in the DFA.
 pub fn can_exec(insts: &Program) -> bool {
     use prog::Inst::*;
-    use prog::EmptyLook::*;
     // If for some reason we manage to allocate a regex program with more
     // than STATE_MAX instructions, then we can't execute the DFA because we
     // use 32 bit pointers with some of the bits reserved for special use.
@@ -83,14 +82,7 @@ pub fn can_exec(insts: &Program) -> bool {
     for inst in insts {
         match *inst {
             Char(_) | Ranges(_) => return false,
-            EmptyLook(ref inst) => {
-                match inst.look {
-                    WordBoundary | NotWordBoundary => return false,
-                    WordBoundaryAscii | NotWordBoundaryAscii => {}
-                    StartLine | EndLine | StartText | EndText => {}
-                }
-            }
-            Match(_) | Save(_) | Split(_) | Bytes(_) => {}
+            EmptyLook(_) | Match(_) | Save(_) | Split(_) | Bytes(_) => {}
         }
     }
     true
@@ -296,17 +288,22 @@ const STATE_UNKNOWN: StatePtr = 1<<31;
 /// once it is entered, no match can ever occur.
 const STATE_DEAD: StatePtr = 1<<30;
 
+/// A quit state means that the DFA came across some input that it doesn't
+/// know how to process correctly. The DFA should quit and another matching
+/// engine should be run in its place.
+const STATE_QUIT: StatePtr = 1<<29;
+
 /// A start state is a state that the DFA can start in.
 ///
 /// Note that unlike unknown and dead states, start states have their lower
 /// bits set to a state pointer.
-const STATE_START: StatePtr = 1<<29;
+const STATE_START: StatePtr = 1<<28;
 
 /// A match state means that the regex has successfully matched.
 ///
 /// Note that unlike unknown and dead states, match states have their lower
 /// bits set to a state pointer.
-const STATE_MATCH: StatePtr = 1<<28;
+const STATE_MATCH: StatePtr = 1<<27;
 
 /// The maximum state pointer.
 const STATE_MAX: StatePtr = STATE_MATCH - 1;
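
The constants in this hunk pack special markers into the high bits of a 32-bit state pointer. As a rough illustration of how such tagged pointers can be told apart, here is a self-contained sketch that reuses the constant values from the diff. The `Kind` enum and the `classify` function are hypothetical and do not appear in the commit, and `StatePtr` is assumed to be `u32` based on the "32 bit pointers" comment in `can_exec`.

```rust
type StatePtr = u32; // assumption: the diff only says "32 bit pointers"

const STATE_UNKNOWN: StatePtr = 1 << 31;
const STATE_DEAD: StatePtr = 1 << 30;
const STATE_QUIT: StatePtr = 1 << 29;
const STATE_START: StatePtr = 1 << 28;
const STATE_MATCH: StatePtr = 1 << 27;
const STATE_MAX: StatePtr = STATE_MATCH - 1;

/// Hypothetical classification of a tagged state pointer.
#[derive(Debug, PartialEq)]
enum Kind {
    Unknown,
    Dead,
    Quit,
    /// A real state index with the start/match flag bits stripped.
    Ordinary(StatePtr),
}

fn classify(si: StatePtr) -> Kind {
    // One comparison covers the three sentinel values, because they
    // occupy the three highest bits.
    if si >= STATE_QUIT {
        if (si & STATE_QUIT) > 0 {
            Kind::Quit
        } else if si == STATE_DEAD {
            Kind::Dead
        } else {
            Kind::Unknown
        }
    } else {
        // Start and match states keep an index in their lower bits.
        Kind::Ordinary(si & STATE_MAX)
    }
}
```

One consequence visible in the constants: inserting `STATE_QUIT` at bit 29 moves `STATE_START` and `STATE_MATCH` down by one bit each, so `STATE_MAX` drops from `(1<<28) - 1` to `(1<<27) - 1`.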
@@ -591,7 +588,10 @@ impl<'a> Fsm<'a> {
                         None => return Result::NoMatch,
                         Some(i) => i,
                     };
-                } else if next_si >= STATE_DEAD {
+                } else if next_si >= STATE_QUIT {
+                    if next_si & STATE_QUIT > 0 {
+                        return Result::Quit;
+                    }
                     // Finally, this corresponds to the case where the transition
                     // entered a state that can never lead to a match or a state
                     // that hasn't been computed yet. The latter being the "slow"
@@ -697,7 +697,10 @@ impl<'a> Fsm<'a> {
                     if self.at < cur {
                         result = Result::Match(self.at + 2);
                     }
-                } else if next_si >= STATE_DEAD {
+                } else if next_si >= STATE_QUIT {
+                    if next_si & STATE_QUIT > 0 {
+                        return Result::Quit;
+                    }
                     let byte = Byte::byte(text[self.at]);
                     prev_si &= STATE_MAX;
                     next_si = match self.next_state(qcur, qnext, prev_si, byte) {
@@ -986,10 +989,15 @@ impl<'a> Fsm<'a> {
                         NotWordBoundaryAscii if flags.not_word_boundary => {
                             self.cache.stack.push(inst.goto as InstPtr);
                         }
+                        WordBoundary if flags.word_boundary => {
+                            self.cache.stack.push(inst.goto as InstPtr);
+                        }
+                        NotWordBoundary if flags.not_word_boundary => {
+                            self.cache.stack.push(inst.goto as InstPtr);
+                        }
                         StartLine | EndLine | StartText | EndText => {}
                         WordBoundaryAscii | NotWordBoundaryAscii => {}
-                        // The DFA doesn't support Unicode word boundaries. :-(
-                        WordBoundary | NotWordBoundary => unreachable!(),
+                        WordBoundary | NotWordBoundary => {}
                     }
                 }
                 Save(ref inst) => self.cache.stack.push(inst.goto as InstPtr),
@@ -1057,7 +1065,12 @@ impl<'a> Fsm<'a> {
 
         // OK, now there's enough room to push our new state.
         // We do this even if the cache size is set to 0!
-        let trans = Transitions::new(self.num_byte_classes());
+        let mut trans = Transitions::new(self.num_byte_classes());
+        if self.prog.has_unicode_word_boundary {
+            for b in 128..256 {
+                trans[self.byte_class(Byte::byte(b as u8))] = STATE_QUIT;
+            }
+        }
         let si = usize_to_u32(self.cache.states.len());
         self.cache.states.push(State {
             insts: key.insts.clone(),
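
The hunk above pre-fills the transitions of each new state so that every non-ASCII byte leads to the quit state. The idea can be sketched in isolation as follows. This is a simplified model, not the crate's `Transitions` type: the `byte_class` mapping (one class per ASCII byte, a single shared class for all bytes `>= 128`), `NUM_CLASSES`, `new_row` and this `next_state` are hypothetical, and the real lookup computes unknown transitions lazily instead of returning them.

```rust
type StatePtr = u32; // assumption, as in the "32 bit pointers" comment

const STATE_UNKNOWN: StatePtr = 1 << 31;
const STATE_QUIT: StatePtr = 1 << 29;

/// Hypothetical byte-class map: ASCII bytes keep their own class and
/// all non-ASCII bytes share class 128.
const NUM_CLASSES: usize = 129;

fn byte_class(b: u8) -> usize {
    if b < 128 { b as usize } else { 128 }
}

/// Builds the transition row for a freshly created state. Every entry
/// starts as "not computed yet"; if the program contains a Unicode word
/// boundary, all non-ASCII bytes are wired to the quit state up front.
fn new_row(has_unicode_word_boundary: bool) -> Vec<StatePtr> {
    let mut row = vec![STATE_UNKNOWN; NUM_CLASSES];
    if has_unicode_word_boundary {
        for b in 128..256usize {
            row[byte_class(b as u8)] = STATE_QUIT;
        }
    }
    row
}

/// Simplified lookup: `None` tells the caller to stop and hand the
/// search to another engine.
fn next_state(row: &[StatePtr], b: u8) -> Option<StatePtr> {
    match row[byte_class(b)] {
        STATE_QUIT => None,
        si => Some(si),
    }
}
```

Because the quit entries are written when the state is created, the per-byte search loop in this sketch needs no extra "is this byte ASCII?" test. It only has to notice the quit marker when it follows a transition.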
@@ -1120,15 +1133,14 @@ impl<'a> Fsm<'a> {
                             state_flags.set_empty();
                             insts.push(ip);
                         }
-                        WordBoundaryAscii => {
+                        WordBoundary | WordBoundaryAscii => {
                             state_flags.set_empty();
                             insts.push(ip);
                         }
-                        NotWordBoundaryAscii => {
+                        NotWordBoundary | NotWordBoundaryAscii => {
                             state_flags.set_empty();
                             insts.push(ip);
                         }
-                        WordBoundary | NotWordBoundary => unreachable!(),
                     }
                 }
                 Match(_) => {
@@ -1226,7 +1238,12 @@ impl<'a> Fsm<'a> {
             return si;
         }
         let si = usize_to_u32(self.cache.states.len());
-        let trans = Transitions::new(self.num_byte_classes());
+        let mut trans = Transitions::new(self.num_byte_classes());
+        if self.prog.has_unicode_word_boundary {
+            for b in 128..256 {
+                trans[self.byte_class(Byte::byte(b as u8))] = STATE_QUIT;
+            }
+        }
         self.cache.states.push(state);
         self.cache.trans.push(trans);
         self.cache.compiled.insert(key, si);
@@ -1257,8 +1274,9 @@ impl<'a> Fsm<'a> {
         }
         match self.cache.trans[si as usize][self.byte_class(b)] {
             STATE_UNKNOWN => self.exec_byte(qcur, qnext, si, b),
-            STATE_DEAD => return Some(STATE_DEAD),
-            nsi => return Some(nsi),
+            STATE_QUIT => None,
+            STATE_DEAD => Some(STATE_DEAD),
+            nsi => Some(nsi),
         }
     }

src/prog.rs

Lines changed: 3 additions & 0 deletions
@@ -52,6 +52,8 @@ pub struct Program {
     pub is_anchored_start: bool,
     /// Whether the regex must match at the end of the input.
     pub is_anchored_end: bool,
+    /// Whether this program contains a Unicode word boundary instruction.
+    pub has_unicode_word_boundary: bool,
     /// A possibly empty machine for very quickly matching prefix literals.
     pub prefixes: LiteralSearcher,
 }
@@ -73,6 +75,7 @@ impl Program {
             is_reverse: false,
             is_anchored_start: false,
             is_anchored_end: false,
+            has_unicode_word_boundary: false,
             prefixes: LiteralSearcher::empty(),
         }
     }
