-
Hi, here are some quick answers to your questions:

- TOKEN_DOT_PROPERTY and spaces: You are correct that the strict spec doesn't allow spaces. The Lexer's regex pattern for this token is written to explicitly disallow spaces between the dot and the identifier; the two are tokenized together as a single unit.
- TOKEN_PROPERTY and TOKEN_BARE_PROPERTY: The distinction is kept in the Lexer to preserve the original syntax from the input string. This is useful for providing more precise error messages or for future extensions. Downstream, the Parser and AST can treat them as the same semantic unit (PropertySelector), as you observed.
- Other semantic processing: You've correctly identified the main example. The Lexer's primary job is to tokenize, but it often does minor semantic work like unescaping strings or handling numeric literals. In this case, the key is ensuring your Java Lexer produces exactly the same token stream, with the same values, for the Parser to handle.
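For illustration, "minor semantic work" in a lexer often looks something like this (a simplified sketch, not this project's actual escape handling; the full RFC 9535 escape set also covers \uXXXX sequences and more):

```java
// Unescape a quoted string literal before emitting its token, so the
// parser sees the decoded value. Simplified: handles only a few common
// escapes, not the full JSONPath escape set.
static String unescape(String quoted) {
    String body = quoted.substring(1, quoted.length() - 1); // strip quotes
    StringBuilder out = new StringBuilder(body.length());
    for (int i = 0; i < body.length(); i++) {
        char c = body.charAt(i);
        if (c == '\\' && i + 1 < body.length()) {
            char next = body.charAt(++i);
            switch (next) {
                case 'n' -> out.append('\n');
                case 't' -> out.append('\t');
                case '\\' -> out.append('\\');
                case '\'' -> out.append('\'');
                case '"' -> out.append('"');
                default -> out.append(next); // pass through, simplified
            }
        } else {
            out.append(c);
        }
    }
    return out.toString();
}
```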
-
Hi Rob,

The decision to blur the lines between tokenizing and parsing was motivated by a performance assumption rather than good design. With this regex approach in Python, I expected fewer tokens mapping to larger string slices to perform better than lots of smaller tokens. (I don't recall whether I actually tested that assumption in this project, but it has proved to be the case in the past.) So while allowing whitespace in shorthand name selectors was a deliberate choice, it was not the deciding factor behind how some of the "rules" and tokens are defined, and it could definitely be implemented more cleanly.

python-jsonpath-rfc9535 learns from all the mistakes I made in this project. From a design perspective it is better in every way, albeit without JSON Pointer, JSON Patch and non-standard syntax. For a cleaner design and implementation of JSONPath, JSON Pointer, JSON Patch and some well-defined extensions to JSONPath, see json-p3 (TypeScript). jg-rp/json-p3#11 contains a useful discussion about JSONPath internal representations.
-
I was just thinking that after I get the Java Lexer to pass all the existing unit tests, I could create a subclass that has strict compliance with RFC 9535, although that would be just to have the ability to check off that box in the feature set; the core RFC is a bit limiting compared to other implementations of JSONPath that are out there. But your Token/Env/Lexer approach makes this easier to achieve. For example (I'm using Enums instead of string constants for my TokenKinds), your idea of scanning one token kind but emitting a different one lets me do something like:

```java
AND(false, "&&"),
...
AND_EXT(true, "(?:and\\b)", AND),
```

The second Enum constant uses the constructor that takes a TokenKind: the emitKind parameter. When scanned, either of these lexemes will produce the AND token. Then in my "strict" Lexer, I just don't support AND_EXT. I can keep them separate, but related.
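A minimal compilable sketch of what I have in mind (field and constructor names here are my own placeholders, still subject to change):

```java
// Scan one token kind, emit another: extension lexemes carry a reference
// to the core kind they should be emitted as. Names are illustrative.
public enum TokenKind {
    AND(false, "&&"),
    AND_EXT(true, "(?:and\\b)", AND); // non-standard lexeme, emitted as AND

    final boolean isExtension;    // true for non-RFC-9535 syntax
    final String pattern;         // regex fragment used by the scanner
    private final TokenKind emit; // null means "emit this kind itself"

    TokenKind(boolean isExtension, String pattern) {
        this(isExtension, pattern, null);
    }

    TokenKind(boolean isExtension, String pattern, TokenKind emit) {
        this.isExtension = isExtension;
        this.pattern = pattern;
        this.emit = emit;
    }

    // The kind the lexer attaches to the token it emits.
    public TokenKind emitKind() {
        return emit == null ? this : emit;
    }
}
```

A strict Lexer can then simply exclude any constant where isExtension is true when assembling its rules.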
-
Also, regarding your strategy of one huge regex for all the token kinds: when I first saw it I thought, "Wow, that's a huge regex that will be slow to process." But I did a little research (AI-assisted research, so it could be inaccurate), and it turns out this might be a more efficient way of scanning and matching the input text than the traditional scanner you wrote for the python-jsonpath-rfc9535 package. Python's re module is implemented in C, and compiling one big alternation effectively builds the kind of scanner you wrote explicitly in the RFC package, but as a highly optimized internal state machine. Since that runs in C rather than as hand-written Python scanning code, it should be much faster. So not only is your Lexer code in this project more elegant and simple, it's also potentially much faster than the Lexer in python-jsonpath-rfc9535. Unless you have benchmarks that show otherwise?
-
I am rather proud of the Lexer I wrote in my Python implementation. I'm only using five regexes for the "heavy lifting".

All other lexemes are put into two sets, one-char lexemes and two-char lexemes, each in a map of lexeme to TokenType. I first take the next two characters in the stream and look them up in the two-char lexeme map; if found, the map value is the TokenType to emit. This is an O(1) operation. If that lexeme isn't found, I look up the current scanner char in the one-char lexeme map: again, an O(1) lookup. I also use "first sets" of characters (although I didn't formally name them as such in my Python code) to decide whether I should even attempt one of the regex matches, which is a simple set lookup, another O(1) operation. I'm going to duplicate this in my Java port, as sketched below.
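In Java, that dispatch might look roughly like this (token names, types, and the exact lexeme sets are placeholders, not my actual tables):

```java
import java.util.Map;
import java.util.Set;

// Sketch of the two-char/one-char map dispatch with a "first set" guard.
final class PunctuationScanner {

    // Longest match first: two-character lexemes take priority.
    private static final Map<String, String> TWO_CHAR = Map.of(
            "&&", "AND", "||", "OR", "==", "EQ", "!=", "NE",
            "<=", "LE", ">=", "GE", "..", "DOUBLE_DOT");

    private static final Map<String, String> ONE_CHAR = Map.of(
            "$", "ROOT", "@", "CURRENT", "[", "LBRACKET", "]", "RBRACKET",
            "*", "WILDCARD", ".", "DOT", ",", "COMMA", "<", "LT", ">", "GT");

    // "First set" guard: only attempt the number regex when the current
    // character could possibly start a number.
    private static final Set<Character> NUMBER_FIRST = Set.of(
            '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9');

    /** Returns the token type at pos, or null to defer to the regex rules. */
    static String match(String input, int pos) {
        if (pos + 2 <= input.length()) {
            String kind = TWO_CHAR.get(input.substring(pos, pos + 2)); // O(1)
            if (kind != null) return kind;
        }
        String kind = ONE_CHAR.get(input.substring(pos, pos + 1)); // O(1)
        if (kind != null) return kind;
        if (NUMBER_FIRST.contains(input.charAt(pos))) { // O(1) first-set check
            // only now is it worth attempting the number regex (not shown)
        }
        return null;
    }
}
```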
-
Your Lexer's tokenization is rather elegant. Every token kind that can be scanned has a regex pattern associated with it, and you assemble them all into a single regex, with each token kind's pattern as an alternative (separated by a pipe '|') under a group name matching the token kind. When you find a match, you get the name of the group that matched and convert that into a token kind.
Unfortunately, the standard Java regex library doesn't have the same API: there's no method to find out which group name matched, nor even which group number. I would have to iterate over all 57 alternatives in the `rules` pattern on every iteration of the outer loop to find the group that did not have a null match value, which would lead to O(N^2) complexity, as in the sketch below. So I will need to implement the tokenize() method differently in Java, probably closer to what I implemented in my own Python RFC 9535 project. As long as I generate the same list of tokens for the parser that you do, I shouldn't have any further problems with tokens further down the processing pipeline.
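Here's roughly what I mean, reduced to three rules (rule names and patterns are placeholders, not your actual token kinds):

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// One alternation with a named group per token kind. Since Java offers no
// "which group matched?" query, we probe each named group until one is
// non-null: O(rules) per token, O(N^2) over the whole input.
final class CombinedRegexLexer {
    record Rule(String name, String pattern) {}

    // Order matters: the first matching alternative wins.
    private final List<Rule> rules = List.of(
            new Rule("INT", "-?\\d+"),
            new Rule("NAME", "[A-Za-z_][A-Za-z0-9_]*"),
            new Rule("LBRACKET", "\\["));

    private final Pattern combined = Pattern.compile(
            rules.stream()
                 .map(r -> "(?<" + r.name() + ">" + r.pattern() + ")")
                 .reduce((a, b) -> a + "|" + b)
                 .orElseThrow());

    void tokenize(String input) {
        Matcher m = combined.matcher(input);
        while (m.find()) {
            // Linear probe over all alternatives for every token found.
            for (Rule rule : rules) {
                if (m.group(rule.name()) != null) {
                    System.out.println(rule.name() + ": " + m.group());
                    break;
                }
            }
        }
    }
}
```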
I do want to ask about some semantic processing you're doing in your Lexer. Typically a Lexer would just match input characters to well-known tokens in the grammar, and let the parser or compiler deal with semantic processing. But I notice you have things like TOKEN_DOT_PROPERTY, which scans a member-name-shorthand (dot plus identifier string) but eats the dot and just creates a TOKEN_PROPERTY token for it. I think this means that your implementation supports spaces in the member-name-shorthand, whereas this is not allowed in the strict spec. Is this accurate?
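To make the question concrete, here is a sketch of the difference as I understand it (these are my guesses, not your actual patterns, and the identifier rule is ASCII-only for brevity; RFC 9535 allows a wider range of name characters):

```java
import java.util.regex.Pattern;

// Illustrative only: a lenient shorthand rule that eats the dot (and any
// whitespace) and captures just the name, versus a strict RFC 9535 rule
// that rejects whitespace between the dot and the identifier.
final class ShorthandPatterns {
    // Lenient: ".  name" matches; the token value is just "name".
    static final Pattern LENIENT_DOT_PROPERTY =
            Pattern.compile("\\.\\s*(?<name>[A-Za-z_][A-Za-z0-9_]*)");

    // Strict: no whitespace allowed after the dot.
    static final Pattern STRICT_DOT_PROPERTY =
            Pattern.compile("\\.(?<name>[A-Za-z_][A-Za-z0-9_]*)");
}
```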
And I see that you also use two different token kinds, TOKEN_PROPERTY and TOKEN_BARE_PROPERTY, that both yield a PropertySelector AST class. This allows you to "remember" whether the original jsonpath string used a member-name-shorthand or a bracketed name-selector, while your Parser can treat both token kinds as the same semantic unit. Do you use this distinction anywhere downstream? (I've only looked at Token, Lexer, JSONPathEnvironment, and JSONPointer, and am just starting to look at the JSONPath classes.)
Is there any other semantic processing you do in the Lexer like the above? I just want to be aware of the landscape as I start to implement the Lexer in Java.
Thanks!