-
Hi, here are some quick answers to your questions:

- TOKEN_DOT_PROPERTY and spaces: You are correct that the strict spec doesn't allow spaces. The Lexer's regex pattern for this token is written to explicitly disallow spaces between the dot and the identifier; the two are tokenized together as a single unit.
- TOKEN_PROPERTY and TOKEN_BARE_PROPERTY: The distinction is kept in the Lexer to preserve the original syntax from the input string. This is useful for providing more precise error messages or for future extensions. Downstream, the Parser and AST can treat them as the same semantic unit (PropertySelector), as you observed.
- Other semantic processing: You've correctly identified the main example. The Lexer's primary job is to tokenize, but it often does minor semantic work like unescaping strings or handling numeric literals. In this case, the key is ensuring your Java Lexer produces exactly the same token stream, with the same values, for the Parser to handle.
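For illustration, "minor semantic work" in a lexer often looks something like this (a simplified sketch, not this project's actual escape handling; the full RFC 9535 escape set also covers \uXXXX sequences and more):

```java
// Unescape a quoted string literal before emitting its token, so the
// parser sees the decoded value. Simplified: handles only a few common
// escapes, not the full JSONPath escape set.
static String unescape(String quoted) {
    String body = quoted.substring(1, quoted.length() - 1); // strip quotes
    StringBuilder out = new StringBuilder(body.length());
    for (int i = 0; i < body.length(); i++) {
        char c = body.charAt(i);
        if (c == '\\' && i + 1 < body.length()) {
            char next = body.charAt(++i);
            switch (next) {
                case 'n' -> out.append('\n');
                case 't' -> out.append('\t');
                case '\\' -> out.append('\\');
                case '\'' -> out.append('\'');
                case '"' -> out.append('"');
                default -> out.append(next); // pass through, simplified
            }
        } else {
            out.append(c);
        }
    }
    return out.toString();
}
```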
-
Hi Rob,

The decision to blur the lines between tokenizing and parsing was motivated by a performance assumption rather than good design. With this regex approach in Python, I expected fewer tokens mapping to larger string slices to perform better than lots of smaller tokens. (I don't recall whether I actually tested that assumption in this project, but it has proved to be the case in the past.) So while allowing whitespace in shorthand name selectors was a deliberate choice, it was not the deciding factor behind how some of the "rules" and tokens are defined, and it could definitely be implemented more cleanly.

python-jsonpath-rfc9535 learns from all the mistakes I made in this project. From a design perspective it is better in every way, albeit without JSON Pointer, JSON Patch and non-standard syntax. For a cleaner design and implementation of JSONPath, JSON Pointer, JSON Patch and some well-defined extensions to JSONPath, see json-p3 (TypeScript). jg-rp/json-p3#11 contains a useful discussion about JSONPath internal representations.
-
I was just thinking that after I get the Java Lexer to pass all the existing unit tests, I could create a subclass that has strict compliance with RFC 9535, although that would be just to have the ability to check off that box in the feature set; the core RFC is a bit limiting compared to other implementations of JSONPath that are out there. But your Token/Env/Lexer approach makes this easier to achieve. For example (I'm using Enums instead of string constants for my TokenKinds), your idea of scanning one token kind but emitting a different one lets me do something like:

```java
AND(false, "&&"),
...
AND_EXT(true, "(?:and\\b)", AND),
```

The second Enum constant uses the constructor that takes a TokenKind: the emitKind parameter. When scanned, either of these lexemes will produce the AND token. Then in my "strict" Lexer, I just don't support AND_EXT. I can keep them separate, but related.
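A minimal compilable sketch of what I have in mind (field and constructor names here are my own placeholders, still subject to change):

```java
// Scan one token kind, emit another: extension lexemes carry a reference
// to the core kind they should be emitted as. Names are illustrative.
public enum TokenKind {
    AND(false, "&&"),
    AND_EXT(true, "(?:and\\b)", AND); // non-standard lexeme, emitted as AND

    final boolean isExtension;    // true for non-RFC-9535 syntax
    final String pattern;         // regex fragment used by the scanner
    private final TokenKind emit; // null means "emit this kind itself"

    TokenKind(boolean isExtension, String pattern) {
        this(isExtension, pattern, null);
    }

    TokenKind(boolean isExtension, String pattern, TokenKind emit) {
        this.isExtension = isExtension;
        this.pattern = pattern;
        this.emit = emit;
    }

    // The kind the lexer attaches to the token it emits.
    public TokenKind emitKind() {
        return emit == null ? this : emit;
    }
}
```

A strict Lexer can then simply exclude any constant where isExtension is true when assembling its rules.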
-
Also, regarding your strategy of one huge regex for all the token kinds: when I first saw it I thought, "Wow, that's a huge regex that will be slow to process." But I did a little research (AI-assisted research, so it could be inaccurate), and it turns out this might be a more efficient way of scanning and matching the input text than the traditional scanner you wrote for the python-jsonpath-rfc9535 package. Python's re module is implemented in C, and compiling one big alternation effectively builds the kind of scanner you wrote explicitly in the RFC package, but as a highly optimized internal state machine. Since that runs in C rather than as hand-written Python scanning code, it should be much faster. So not only is your Lexer code in this project more elegant and simple, it's also potentially much faster than the Lexer in python-jsonpath-rfc9535. Unless you have benchmarks that show otherwise?
-
I am rather proud of the Lexer I wrote in my Python implementation. I'm only using five regexes for the "heavy lifting".

All other lexemes are put into two sets, one-char lexemes and two-char lexemes, each in a map of lexeme to TokenType. I first take the next two characters in the stream and look them up in the two-char lexeme map; if found, the map value is the TokenType to emit. This is an O(1) operation. If that lexeme isn't found, I look up the current scanner char in the one-char lexeme map: again, an O(1) lookup. I also use "first sets" of characters (although I didn't formally name them as such in my Python code) to decide whether I should even attempt one of the regex matches, which is a simple set lookup, another O(1) operation. I'm going to duplicate this in my Java port, as sketched below.
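In Java, that dispatch might look roughly like this (token names, types, and the exact lexeme sets are placeholders, not my actual tables):

```java
import java.util.Map;
import java.util.Set;

// Sketch of the two-char/one-char map dispatch with a "first set" guard.
final class PunctuationScanner {

    // Longest match first: two-character lexemes take priority.
    private static final Map<String, String> TWO_CHAR = Map.of(
            "&&", "AND", "||", "OR", "==", "EQ", "!=", "NE",
            "<=", "LE", ">=", "GE", "..", "DOUBLE_DOT");

    private static final Map<String, String> ONE_CHAR = Map.of(
            "$", "ROOT", "@", "CURRENT", "[", "LBRACKET", "]", "RBRACKET",
            "*", "WILDCARD", ".", "DOT", ",", "COMMA", "<", "LT", ">", "GT");

    // "First set" guard: only attempt the number regex when the current
    // character could possibly start a number.
    private static final Set<Character> NUMBER_FIRST = Set.of(
            '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9');

    /** Returns the token type at pos, or null to defer to the regex rules. */
    static String match(String input, int pos) {
        if (pos + 2 <= input.length()) {
            String kind = TWO_CHAR.get(input.substring(pos, pos + 2)); // O(1)
            if (kind != null) return kind;
        }
        String kind = ONE_CHAR.get(input.substring(pos, pos + 1)); // O(1)
        if (kind != null) return kind;
        if (NUMBER_FIRST.contains(input.charAt(pos))) { // O(1) first-set check
            // only now is it worth attempting the number regex (not shown)
        }
        return null;
    }
}
```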
-
Your Lexer's tokenization is rather elegant. Every token kind that can be scanned has a regex pattern associated with it, and you assemble them all into a single regex, with each token kind's pattern as an alternative (separated by a pipe '|') under a group name matching the token kind. When you find a match, you get the name of the group that matched and convert that into a token kind.
Unfortunately, the standard Java regex library doesn't have the same API: there's no method to find out which group name matched, nor even which group number. I would have to iterate over all 57 alternatives in the `rules` pattern on every iteration of the outer loop to find the group that did not have a null match value, which would lead to O(N^2) complexity, as in the sketch below. So I will need to implement the tokenize() method differently in Java, probably closer to what I implemented in my own Python RFC 9535 project. As long as I generate the same list of tokens for the parser that you do, I shouldn't have any further problems with tokens further down the processing pipeline.
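Here's roughly what I mean, reduced to three rules (rule names and patterns are placeholders, not your actual token kinds):

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// One alternation with a named group per token kind. Since Java offers no
// "which group matched?" query, we probe each named group until one is
// non-null: O(rules) per token, O(N^2) over the whole input.
final class CombinedRegexLexer {
    record Rule(String name, String pattern) {}

    // Order matters: the first matching alternative wins.
    private final List<Rule> rules = List.of(
            new Rule("INT", "-?\\d+"),
            new Rule("NAME", "[A-Za-z_][A-Za-z0-9_]*"),
            new Rule("LBRACKET", "\\["));

    private final Pattern combined = Pattern.compile(
            rules.stream()
                 .map(r -> "(?<" + r.name() + ">" + r.pattern() + ")")
                 .reduce((a, b) -> a + "|" + b)
                 .orElseThrow());

    void tokenize(String input) {
        Matcher m = combined.matcher(input);
        while (m.find()) {
            // Linear probe over all alternatives for every token found.
            for (Rule rule : rules) {
                if (m.group(rule.name()) != null) {
                    System.out.println(rule.name() + ": " + m.group());
                    break;
                }
            }
        }
    }
}
```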
I do want to ask about some semantic processing you're doing in your Lexer. Typically a Lexer would just match input characters to well-known tokens in the grammar, and let the parser or compiler deal with semantic processing. But I notice you have things like TOKEN_DOT_PROPERTY, which scans a member-name-shorthand (dot plus identifier string) but eats the dot and just creates a TOKEN_PROPERTY token for it. I think this means that your implementation supports spaces in the member-name-shorthand, whereas this is not allowed in the strict spec. Is this accurate?
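To make the question concrete, here is a sketch of the difference as I understand it (these are my guesses, not your actual patterns, and the identifier rule is ASCII-only for brevity; RFC 9535 allows a wider range of name characters):

```java
import java.util.regex.Pattern;

// Illustrative only: a lenient shorthand rule that eats the dot (and any
// whitespace) and captures just the name, versus a strict RFC 9535 rule
// that rejects whitespace between the dot and the identifier.
final class ShorthandPatterns {
    // Lenient: ".  name" matches; the token value is just "name".
    static final Pattern LENIENT_DOT_PROPERTY =
            Pattern.compile("\\.\\s*(?<name>[A-Za-z_][A-Za-z0-9_]*)");

    // Strict: no whitespace allowed after the dot.
    static final Pattern STRICT_DOT_PROPERTY =
            Pattern.compile("\\.(?<name>[A-Za-z_][A-Za-z0-9_]*)");
}
```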
And I see that you also use two different token kinds, TOKEN_PROPERTY and TOKEN_BARE_PROPERTY, that both yield a PropertySelector AST class. This allows you to "remember" whether the original jsonpath string used a member-name-shorthand or a bracketed name-selector, while your Parser can treat both token kinds as the same semantic unit. Do you use this distinction anywhere downstream? (I've only looked at Token, Lexer, JSONPathEnvironment, and JSONPointer, and am just starting to look at the JSONPath classes.)
Is there any other semantic processing you do in the Lexer like the above? I just want to be aware of the landscape as I start to implement the Lexer in Java.
Thanks!