Add speculative Unicode word boundary support to the lazy DFA.

BurntSushi · BurntSushi · commit 175761e34d7d · 2016-04-13T21:04:06.000-04:00
Hooray! The DFA will now try to interpret Unicode word boundaries as if they were ASCII word boundaries. If the DFA comes across a non-ASCII byte, then it will give up and fall back to the slower NFA simulation. Nevertheless, this prevents us from degrading to very slow matching in a large number of cases. Thanks very much to @raphlinus who had the essential idea of "speculative matching."
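
The speculative idea can be illustrated outside the regex engine. Below is a minimal, standalone sketch, not the crate's implementation: a fast byte-oriented pass computes word boundary offsets as if the input were ASCII, quits as soon as it sees a byte >= 0x80, and the caller then reruns a slower char-aware pass. All names here (`Speculative`, `ascii_boundaries`, `unicode_boundaries`, `word_boundaries`) are hypothetical, and `char::is_alphanumeric` is only a rough stand-in for the regex crate's Unicode `\w` class.

```rust
/// Outcome of the speculative fast path.
#[derive(Debug, PartialEq)]
enum Speculative {
    /// The input was pure ASCII, so the ASCII answer is also the Unicode answer.
    Done(Vec<usize>),
    /// A non-ASCII byte was found at this offset; the caller must fall back.
    Quit(usize),
}

/// ASCII notion of a word byte: [0-9A-Za-z_].
fn is_ascii_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Fast path: byte-at-a-time scan that assumes ASCII word boundaries.
fn ascii_boundaries(text: &[u8]) -> Speculative {
    let mut out = Vec::new();
    let mut prev_is_word = false;
    for (i, &b) in text.iter().enumerate() {
        if b >= 0x80 {
            // Speculation failed: we cannot classify this byte correctly.
            return Speculative::Quit(i);
        }
        let cur_is_word = is_ascii_word_byte(b);
        if cur_is_word != prev_is_word {
            out.push(i);
        }
        prev_is_word = cur_is_word;
    }
    if prev_is_word {
        out.push(text.len());
    }
    Speculative::Done(out)
}

/// Slow path: char-aware scan (assumption: alphanumeric-or-underscore
/// approximates Unicode `\w`).
fn unicode_boundaries(text: &str) -> Vec<usize> {
    let mut out = Vec::new();
    let mut prev_is_word = false;
    for (i, c) in text.char_indices() {
        let cur_is_word = c.is_alphanumeric() || c == '_';
        if cur_is_word != prev_is_word {
            out.push(i);
        }
        prev_is_word = cur_is_word;
    }
    if prev_is_word {
        out.push(text.len());
    }
    out
}

/// Try the fast path first; rerun the slow path from scratch if it quits.
fn word_boundaries(text: &str) -> Vec<usize> {
    match ascii_boundaries(text.as_bytes()) {
        Speculative::Done(offsets) => offsets,
        Speculative::Quit(_) => unicode_boundaries(text),
    }
}
```

For pure ASCII input such as `"ab cd"`, only the fast path runs. For `"héllo"`, the fast path quits at byte offset 1 and the slow path produces the final answer, which mirrors the DFA-then-NFA fallback described above.
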
diff --git a/PERFORMANCE.md b/PERFORMANCE.md
@@ -200,17 +200,19 @@ just by examing the first (or last) three bytes of the haystack.
 **Advice**: Literals can reduce the work that the regex engine needs to do. Use
 them if you can, especially as prefixes.
 
-## Unicode word boundaries prevent the DFA from being used
+## Unicode word boundaries may prevent the DFA from being used
 
-It's a sad state of the current implementation. It's not clear when or if
-Unicode word boundaries will be salvaged, but as it stands right now, using
-them automatically disqualifies use of the DFA, which can mean an order of
-magnitude slowdown in search time. There are two ways to ameliorate this:
+It's a sad state of the current implementation. At the moment, the DFA will try
+to interpret Unicode word boundaries as if they were ASCII word boundaries.
+If the DFA comes across any non-ASCII byte, it will quit and fall back to an
+alternative matching engine that can handle Unicode word boundaries correctly.
+The alternate matching engine is generally quite a bit slower (perhaps by an
+order of magnitude). If necessary, this can be ameliorated in two ways.
 
 The first way is to add some number of literal prefixes to your regular
-expression. Even though the DFA won't be used, specialized routines will still
-kick in to find prefix literals quickly, which limits how much work the NFA
-simulation will need to do.
+expression. Even though the DFA may not be used, specialized routines will
+still kick in to find prefix literals quickly, which limits how much work the
+NFA simulation will need to do.
 
 The second way is to give up on Unicode and use an ASCII word boundary instead.
 One can use an ASCII word boundary by disabling Unicode support. That is,
@@ -221,11 +223,18 @@ to a syntax error if the regex could match arbitrary bytes. For example, if one
 wrote `(?-u)\b.+\b`, then a syntax error would be returned because `.` matches
 any *byte* when the Unicode flag is disabled.
 
+The second way isn't appreciably different than just using a Unicode word
+boundary in the first place, since the DFA will speculatively interpret it as
+an ASCII word boundary anyway. The key difference is that if an ASCII word
+boundary is used explicitly, then the DFA won't quit in the presence of
+non-ASCII UTF-8 bytes. This results in giving up correctness in exchange for
+more consistent performance.
+
 N.B. When using `bytes::Regex`, Unicode support is disabled by default, so one
 can simply write `\b` to get an ASCII word boundary.
 
-**Advice**: Use `(?-u:\b)` instead of `\b` if you care about performance more
-than correctness.
+**Advice**: In most cases, `\b` should work well. If not, use `(?-u:\b)`
+instead of `\b` if you care about consistent performance more than correctness.
 
 ## Excessive counting can lead to exponential state blow up in the DFA
 
diff --git a/src/compile.rs b/src/compile.rs
@@ -269,10 +269,12 @@ impl Compiler {
                 self.c_empty_look(prog::EmptyLook::EndText)
             }
             WordBoundary => {
+                self.compiled.has_unicode_word_boundary = true;
                 self.byte_classes.set_word_boundary();
                 self.c_empty_look(prog::EmptyLook::WordBoundary)
             }
             NotWordBoundary => {
+                self.compiled.has_unicode_word_boundary = true;
                 self.byte_classes.set_word_boundary();
                 self.c_empty_look(prog::EmptyLook::NotWordBoundary)
             }
diff --git a/src/dfa.rs b/src/dfa.rs
@@ -73,7 +73,6 @@ const CACHE_LIMIT: usize = 2 * (1<<20);
 /// of tracking multi-byte assertions in the DFA.
 pub fn can_exec(insts: &Program) -> bool {
     use prog::Inst::*;
-    use prog::EmptyLook::*;
     // If for some reason we manage to allocate a regex program with more
     // than STATE_MAX instructions, then we can't execute the DFA because we
     // use 32 bit pointers with some of the bits reserved for special use.
@@ -83,14 +82,7 @@ pub fn can_exec(insts: &Program) -> bool {
     for inst in insts {
         match *inst {
             Char(_) | Ranges(_) => return false,
-            EmptyLook(ref inst) => {
-                match inst.look {
-                    WordBoundary | NotWordBoundary => return false,
-                    WordBoundaryAscii | NotWordBoundaryAscii => {}
-                    StartLine | EndLine | StartText | EndText => {}
-                }
-            }
-            Match(_) | Save(_) | Split(_) | Bytes(_) => {}
+            EmptyLook(_) | Match(_) | Save(_) | Split(_) | Bytes(_) => {}
         }
     }
     true
@@ -296,17 +288,22 @@ const STATE_UNKNOWN: StatePtr = 1<<31;
 /// once it is entered, no match can ever occur.
 const STATE_DEAD: StatePtr = 1<<30;
 
+/// A quit state means that the DFA came across some input that it doesn't
+/// know how to process correctly. The DFA should quit and another matching
+/// engine should be run in its place.
+const STATE_QUIT: StatePtr = 1<<29;
+
 /// A start state is a state that the DFA can start in.
 ///
 /// Note that unlike unknown and dead states, start states have their lower
 /// bits set to a state pointer.
-const STATE_START: StatePtr = 1<<29;
+const STATE_START: StatePtr = 1<<28;
 
 /// A match state means that the regex has successfully matched.
 ///
 /// Note that unlike unknown and dead states, match states have their lower
 /// bits set to a state pointer.
-const STATE_MATCH: StatePtr = 1<<28;
+const STATE_MATCH: StatePtr = 1<<27;
 
 /// The maximum state pointer.
 const STATE_MAX: StatePtr = STATE_MATCH - 1;
@@ -591,7 +588,10 @@ impl<'a> Fsm<'a> {
                     None => return Result::NoMatch,
                     Some(i) => i,
                 };
-            } else if next_si >= STATE_DEAD {
+            } else if next_si >= STATE_QUIT {
+                if next_si & STATE_QUIT > 0 {
+                    return Result::Quit;
+                }
                 // Finally, this corresponds to the case where the transition
                 // entered a state that can never lead to a match or a state
                 // that hasn't been computed yet. The latter being the "slow"
@@ -697,7 +697,10 @@ impl<'a> Fsm<'a> {
                 if self.at < cur {
                     result = Result::Match(self.at + 2);
                 }
-            } else if next_si >= STATE_DEAD {
+            } else if next_si >= STATE_QUIT {
+                if next_si & STATE_QUIT > 0 {
+                    return Result::Quit;
+                }
                 let byte = Byte::byte(text[self.at]);
                 prev_si &= STATE_MAX;
                 next_si = match self.next_state(qcur, qnext, prev_si, byte) {
@@ -986,10 +989,15 @@ impl<'a> Fsm<'a> {
                         NotWordBoundaryAscii if flags.not_word_boundary => {
                             self.cache.stack.push(inst.goto as InstPtr);
                         }
+                        WordBoundary if flags.word_boundary => {
+                            self.cache.stack.push(inst.goto as InstPtr);
+                        }
+                        NotWordBoundary if flags.not_word_boundary => {
+                            self.cache.stack.push(inst.goto as InstPtr);
+                        }
                         StartLine | EndLine | StartText | EndText => {}
                         WordBoundaryAscii | NotWordBoundaryAscii => {}
-                        // The DFA doesn't support Unicode word boundaries. :-(
-                        WordBoundary | NotWordBoundary => unreachable!(),
+                        WordBoundary | NotWordBoundary => {}
                     }
                 }
                 Save(ref inst) => self.cache.stack.push(inst.goto as InstPtr),
@@ -1057,7 +1065,12 @@ impl<'a> Fsm<'a> {
 
         // OK, now there's enough room to push our new state.
         // We do this even if the cache size is set to 0!
-        let trans = Transitions::new(self.num_byte_classes());
+        let mut trans = Transitions::new(self.num_byte_classes());
+        if self.prog.has_unicode_word_boundary {
+            for b in 128..256 {
+                trans[self.byte_class(Byte::byte(b as u8))] = STATE_QUIT;
+            }
+        }
         let si = usize_to_u32(self.cache.states.len());
         self.cache.states.push(State {
             insts: key.insts.clone(),
@@ -1120,15 +1133,14 @@ impl<'a> Fsm<'a> {
                             state_flags.set_empty();
                             insts.push(ip);
                         }
-                        WordBoundaryAscii => {
+                        WordBoundary | WordBoundaryAscii => {
                             state_flags.set_empty();
                             insts.push(ip);
                         }
-                        NotWordBoundaryAscii => {
+                        NotWordBoundary | NotWordBoundaryAscii => {
                             state_flags.set_empty();
                             insts.push(ip);
                         }
-                        WordBoundary | NotWordBoundary => unreachable!(),
                     }
                 }
                 Match(_) => {
@@ -1226,7 +1238,12 @@ impl<'a> Fsm<'a> {
             return si;
         }
         let si = usize_to_u32(self.cache.states.len());
-        let trans = Transitions::new(self.num_byte_classes());
+        let mut trans = Transitions::new(self.num_byte_classes());
+        if self.prog.has_unicode_word_boundary {
+            for b in 128..256 {
+                trans[self.byte_class(Byte::byte(b as u8))] = STATE_QUIT;
+            }
+        }
         self.cache.states.push(state);
         self.cache.trans.push(trans);
         self.cache.compiled.insert(key, si);
@@ -1257,8 +1274,9 @@ impl<'a> Fsm<'a> {
         }
         match self.cache.trans[si as usize][self.byte_class(b)] {
             STATE_UNKNOWN => self.exec_byte(qcur, qnext, si, b),
-            STATE_DEAD => return Some(STATE_DEAD),
-            nsi => return Some(nsi),
+            STATE_QUIT => None,
+            STATE_DEAD => Some(STATE_DEAD),
+            nsi => Some(nsi),
         }
     }
 
diff --git a/src/prog.rs b/src/prog.rs
@@ -52,6 +52,8 @@ pub struct Program {
     pub is_anchored_start: bool,
     /// Whether the regex must match at the end of the input.
     pub is_anchored_end: bool,
+    /// Whether this program contains a Unicode word boundary instruction.
+    pub has_unicode_word_boundary: bool,
     /// A possibly empty machine for very quickly matching prefix literals.
     pub prefixes: LiteralSearcher,
 }
@@ -73,6 +75,7 @@ impl Program {
             is_reverse: false,
             is_anchored_start: false,
             is_anchored_end: false,
+            has_unicode_word_boundary: false,
             prefixes: LiteralSearcher::empty(),
         }
     }
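
The state pointer layout in `src/dfa.rs` after this change can be sketched in isolation. The constants below are copied from the diff; the `Kind` enum and the `classify` function are hypothetical helpers written only for illustration. The point is that the three special states (unknown, dead, quit) occupy the highest bits, so a single `>= STATE_QUIT` comparison separates them from ordinary states, which may still carry the start or match flag in lower bits.

```rust
type StatePtr = u32;

// Values as they appear in the diff.
const STATE_UNKNOWN: StatePtr = 1 << 31;
const STATE_DEAD: StatePtr = 1 << 30;
const STATE_QUIT: StatePtr = 1 << 29;
const STATE_START: StatePtr = 1 << 28;
const STATE_MATCH: StatePtr = 1 << 27;
const STATE_MAX: StatePtr = STATE_MATCH - 1;

/// Hypothetical classification of a transition target.
#[derive(Debug, PartialEq)]
enum Kind {
    Unknown,
    Dead,
    Quit,
    /// An ordinary state; the payload is the pointer with flag bits removed.
    Ordinary(StatePtr),
}

/// Hypothetical helper: one comparison filters out the common case,
/// then the rarer special states are told apart by their bits.
fn classify(si: StatePtr) -> Kind {
    if si < STATE_QUIT {
        // Start and match flags live below STATE_QUIT; mask them off.
        return Kind::Ordinary(si & STATE_MAX);
    }
    if si & STATE_QUIT != 0 {
        Kind::Quit
    } else if si & STATE_UNKNOWN != 0 {
        Kind::Unknown
    } else {
        Kind::Dead
    }
}
```

Inserting `STATE_QUIT` at bit 29 is why `STATE_START` and `STATE_MATCH` each shift down by one bit, which in turn halves `STATE_MAX`.
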
