Description
What version of regex are you using?
1.10.2 (latest)
Describe the bug at a high level.
When a regex pattern contains combining characters (grapheme clusters made of several code points) ahead of a syntax error, the error message highlights the wrong character. This can make the error impossible to track down from the error output alone.
What are the steps to reproduce the behavior?
Consider this program, which creates a Regex from a string containing a grapheme cluster followed by an error (in this case, an invalid hex digit in the Unicode escape).
use regex;
fn main() {
let str = r"क्\u12G4";
regex::Regex::new(str).unwrap();
}
The error is:
called `Result::unwrap()` on an `Err` value: Syntax(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
regex parse error:
क्\u12G4
^
error: invalid hexadecimal digit
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
)
As you can see, the 4 is highlighted, while the actual error occurs one grapheme earlier, at the G. This is relatively benign, but note that grapheme clusters can cause the error arrow to move arbitrarily far in the stream, for example to an otherwise correct occurrence of a syntax element:
use regex;
fn main() {
let str = r"षकषषक्षक्षक्षक्षक्षक्षक्षक्ष्\u12G4 some text \u1234";
regex::Regex::new(str).unwrap();
}
called `Result::unwrap()` on an `Err` value: Syntax(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
regex parse error:
षकषषक्षक्षक्षक्षक्षक्षक्षक्ष्\u12G4 some text \u1234
^
error: invalid hexadecimal digit
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
)
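To illustrate the drift, here is a minimal sketch that assumes the caret is padded with one space per code point before the error position. That is my reading of the output above, not something I have confirmed against the regex source, and `code_points_before` is a hypothetical helper, not part of the crate.

```rust
/// Hypothetical helper: counts the code points that precede `byte_offset`.
/// Panics if `byte_offset` is not on a char boundary.
fn code_points_before(pattern: &str, byte_offset: usize) -> usize {
    pattern[..byte_offset].chars().count()
}

fn main() {
    // U+0915 (KA) + U+094D (VIRAMA), followed by the literal text \u12G4.
    // This is the same pattern as in the first example, written with escapes.
    let pattern = "\u{0915}\u{094D}\\u12G4";

    // Byte offset of the invalid hex digit.
    let byte_offset = pattern.find('G').expect("pattern contains a G");

    // Assumed padding rule: one space per code point before the error.
    let padding = code_points_before(pattern, byte_offset);

    println!("{}", pattern);
    println!("{}^", " ".repeat(padding));
}
```

The prefix before the G is 6 code points long, but the first two of them are drawn as a single cluster, so a caret padded by code points lands one visible position too far to the right. Every additional combining mark earlier in the pattern adds one more position of drift.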
What is the expected behavior?
An ideal solution would be to have the error arrow be aware of grapheme clusters and show up underneath the correct one for the user.
In the absence of that, it would help to include the index of the offending character in the error text. For the first example it could say "character 7" or at least "byte index 10".
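A rough sketch of what such a textual position could look like, using only the standard library. `describe_position` is a hypothetical name for illustration; it assumes the parser already knows the byte offset of the error, and the character number is 1-based and counts code points.

```rust
/// Hypothetical helper: formats an error position as text.
/// Returns `None` if `byte_index` is out of range or falls inside a
/// multi-byte character.
fn describe_position(pattern: &str, byte_index: usize) -> Option<String> {
    if byte_index >= pattern.len() || !pattern.is_char_boundary(byte_index) {
        return None;
    }
    // 1-based character number, counted in code points.
    let char_number = pattern[..byte_index].chars().count() + 1;
    Some(format!("character {} (byte index {})", char_number, byte_index))
}

fn main() {
    // Same pattern as the first example: KA + VIRAMA, then \u12G4.
    let pattern = "\u{0915}\u{094D}\\u12G4";
    let byte_index = pattern.find('G').expect("pattern contains a G");

    match describe_position(pattern, byte_index) {
        Some(position) => println!("error: invalid hexadecimal digit at {}", position),
        None => println!("error: invalid hexadecimal digit"),
    }
}
```

Unlike the arrow, this text does not depend on how the terminal or font renders combining characters, so it stays accurate even when the visual alignment is off.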