| 1 | +# Lexmatch expression (longest match strategy)
| 2 | + |
| 3 | +- Proposal: |
| 4 | + [ME-0011](https://github.com/moonbitlang/moonbit-evolution/blob/0011-lexmatch-expression/proposals/0011-lexmatch-expression-longest.mbt.md) |
| 5 | +- Author: [Wen Yuxiang](https://github.com/hackwaly) |
| 6 | +- Status: Experimental |
| 7 | +- Review and discussion: [GitHub pull
| 8 | + request](https://github.com/moonbitlang/moonbit-evolution/pull/15)
| 9 | + |
| 10 | +## Introduction |
| 11 | + |
| 12 | +MoonBit aims to be perfect for data processing tasks. Currently, the `match`
| 13 | +expression is the primary way to destructure and analyze data. However, it is
| 14 | +not as powerful as regular expressions for string processing. This proposal
| 15 | +introduces a new expression called `lexmatch`, which combines the capabilities
| 16 | +of `match` and regular expressions to provide a more flexible and powerful way
| 17 | +to analyze and destructure strings.
| 18 | + |
| 19 | +## Examples |
| 20 | + |
| 21 | +### Word count |
| 22 | + |
| 23 | +The following function counts the number of lines, words, and characters in a |
| 24 | +given input string. It uses the `lexmatch` expression to match different |
| 25 | +patterns in the input string. |
| 26 | + |
| 27 | +```moonbit |
| 28 | +///| |
| 29 | +pub fn wordcount( |
| 30 | + input : BytesView, |
| 31 | + lines : Int, |
| 32 | + words : Int, |
| 33 | + chars : Int, |
| 34 | +) -> (Int, Int, Int) { |
| 35 | + lexmatch input with longest { |
| 36 | + ("\n", rest) => wordcount(rest, lines + 1, words, chars) |
| 37 | + ("[^ \t\r\n]+" as word, rest) => |
| 38 | + wordcount(rest, lines, words + 1, chars + word.length()) |
| 39 | + (".", rest) => wordcount(rest, lines, words, chars + 1) |
| 40 | + "" => (lines, words, chars) |
| 41 | + _ => panic() |
| 42 | + } |
| 43 | +} |
| 44 | +``` |
| 45 | + |
| 46 | +## Explanation |
| 47 | + |
| 48 | +### Terminology |
| 49 | + |
| 50 | +- **Target**: The `StringView` or `BytesView` to be `lexmatch`ed. |
| 51 | +- **Match Strategy**: The strategy used to match patterns. It can be either |
| 52 | + `longest` or `first` (default). |
| 53 | + |
| 54 | + In this proposal, we only focus on the `longest` match strategy. |
| 55 | + |
| 56 | +- **Catch-all case**: A case with a variable or wildcard `_` as its left-hand
| 57 | + side, which matches any target. It must be placed at the end of the
| 58 | + `lexmatch` arms/cases to handle unmatched situations.
| 59 | +- **Lex Pattern**: The pattern part (as distinct from the guard part) of the
| 60 | + left-hand side of a `lexmatch` arm/case (before `=>`).
| 61 | + |
| 62 | + E.g. `("[^ \t\r\n]+" as word, rest)` |
| 63 | + |
| 64 | + A lex pattern can be one of the following: |
| 65 | + |
| 66 | + - Bare regex pattern: the regex pattern will match against the whole target. |
| 67 | + |
| 68 | + E.g. `""` |
| 69 | + |
| 70 | + In this case, the regex pattern is `""`, which matches an empty `StringView` |
| 71 | + or `BytesView`. |
| 72 | + |
| 73 | + - Regex pattern followed by a comma and a rest variable: the regex pattern |
| 74 | + will match against the prefix of the target, and the rest variable will bind |
| 75 | + to the remaining suffix. |
| 76 | + |
| 77 | + The rest variable can be either a variable or a wildcard `_`. |
| 78 | + |
| 79 | + In this form, the parentheses are required to improve readability. |
| 80 | + |
| 81 | + E.g. `("\n", rest)` |
| 82 | + |
| 83 | + In this case, the regex pattern is `"\n"`, which matches a newline character |
| 84 | + at the beginning of the target. The `rest` variable will bind to the |
| 85 | + remaining suffix of the target after removing the matched prefix. |
| 86 | + |
| 87 | +- **Regex Pattern**: Regex patterns have three forms: |
| 88 | + |
| 89 | + - **Regex Literal**: A string literal representing a regex pattern. |
| 90 | + |
| 91 | + E.g. `"[^ \t\r\n]+"` |
| 92 | + |
| 93 | + - **Capture**: A regex pattern followed by `as` and a variable name to capture |
| 94 | + the matched substring. |
| 95 | + |
| 96 | + E.g. `"[^ \t\r\n]+" as word` |
| 97 | + |
| 98 | + If the lex pattern is a bare regex pattern of this form, the parentheses are |
| 99 | + required. |
| 100 | + |
| 101 | + - **Sequence**: A sequence of regex patterns separated by whitespace. |
| 102 | + |
| 103 | + E.g. `"//" ("[^\r\n]*" as comment)` |
| 104 | + |
| 105 | + If the lex pattern is a bare regex pattern of this form, the parentheses are |
| 106 | + required. |
| 107 | + |
| 108 | + |
| 109 | + Regex patterns can be nested to form more complex patterns. |
| 110 | + |
| 111 | +### Semantics |
| 112 | + |
| 113 | +The `lexmatch` expression works similarly to the `match` expression, with the |
| 114 | +following differences: |
| 115 | + |
| 116 | +1. The target of a `lexmatch` expression must be a `StringView` or `BytesView`. |
| 117 | +2. Each arm/case except the catch-all case of a `lexmatch` expression must have |
| 118 | + a lex pattern as its left-hand side. |
| 119 | +3. The match strategy can be specified after the `with` keyword. If not
| 120 | + specified, the default strategy is `first`. The `first` strategy is not
| 121 | + available at the moment.
| 122 | +4. The regex patterns in lex patterns are matched against the target using the |
| 123 | + specified match strategy. |
| 124 | +5. If a regex pattern matches the target, any capture variables in the pattern |
| 125 | + will be bound to the corresponding matched substrings. |
| 126 | +6. If a regex pattern followed by a comma and a rest variable matches the |
| 127 | + target, the regex pattern will match the prefix of the target, and the rest |
| 128 | + variable will bind to the remaining suffix. |
| 129 | +7. If no lex pattern matches the target, the catch-all case will be executed. |
| 130 | + |
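The effect of the strategy on arm selection can be sketched with a small, hypothetical function. The name `classify_number` and its arms are illustrative and not part of the proposal; only pattern forms described above are used. The sketch assumes that `longest` selects the arm whose regex pattern consumes the longest prefix of the target, regardless of arm order.

```moonbit
///|
/// Hypothetical example: classify the numeric prefix of `input`.
fn classify_number(input : StringView) -> String {
  lexmatch input with longest {
    // Matches a run of digits at the start of the target.
    ("[0-9]+", _) => "integer"
    // Matches digits, a literal dot, then more digits. Whenever this arm
    // matches, its match is strictly longer than the integer arm's match.
    ("[0-9]+[.][0-9]+", _) => "decimal"
    // Catch-all case, required at the end.
    _ => "not a number"
  }
}
```

Under that assumption, a target such as `3.14` would be expected to reach the `"decimal"` arm even though the `"integer"` arm is listed first, because the second pattern matches four characters and the first only one. The arms are chosen so that two patterns never match the same length, since tie-breaking is not described in this document.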
| 131 | +### Subtleties |
| 132 | + |
| 133 | +- When capturing a single character, the matched substring is a `Char` or `Byte`,
| 134 | + instead of a `StringView` or `BytesView`. E.g. `("[+-]" as sign)`
| 135 | + |
| 136 | +- The `"(abc)"` regex pattern does not introduce a capture group. To capture the |
| 137 | + matched substring, you need to use the `as` syntax. E.g. `"abc" as group` |
| 138 | + instead of `"(abc)"`. |
| 139 | + |
| 140 | +- The `"$"` regex pattern matches the end of the target `StringView` or |
| 141 | + `BytesView`. The `"^"` regex pattern matches the start of the target (not |
| 142 | + implemented for now). |
| 143 | + |
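The first subtlety can be illustrated with a minimal sketch. The helper `split_sign` is hypothetical, and its `(Char, StringView)` return type relies on the rule above that a single-character capture on a `StringView` target is bound as a `Char`.

```moonbit
///|
/// Hypothetical helper: split an optional leading sign from `input`.
fn split_sign(input : StringView) -> (Char, StringView) {
  lexmatch input with longest {
    // `sign` is expected to be a `Char` here (not a `StringView`),
    // because the captured pattern matches a single character.
    ("[+-]" as sign, rest) => (sign, rest)
    // No explicit sign: default to '+' and leave the input untouched.
    _ => ('+', input)
  }
}
```

If the capture were bound as a `StringView` instead, the tuple type in the signature would not type-check, which is the practical consequence of this rule.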
| 144 | +### Recipes |
| 145 | + |
| 146 | +#### Search for a marker in a string
| 147 | + |
| 148 | +```moonbit |
| 149 | +pub fn search_marker(str: StringView) -> StringView? { |
| 150 | + for curr = str { |
| 151 | + lexmatch curr with longest { |
| 152 | + "" => return None |
| 153 | + ("MARKER", right) => return Some(right) |
| 154 | + (".", rest) => continue rest |
| 155 | + _ => panic() |
| 156 | + } |
| 157 | + } |
| 158 | +} |
| 159 | +``` |
| 160 | + |
| 161 | +### FAQ |
| 162 | + |
| 163 | +- Why not use the `match` expression with regex patterns directly? |
| 164 | + |
| 165 | + The `match` expression is designed for structural pattern matching, while the |
| 166 | + `lexmatch` expression is designed for lexical analysis. Mixing the two |
| 167 | + concepts may lead to confusion and complexity. By introducing a separate |
| 168 | + expression for lexical analysis, we can keep the semantics clear and focused. |
| 169 | + |
| 170 | +- Which syntax and features can be used in regex literals?
| 171 | +
| 172 | + Basically, the syntax is aligned with JavaScript regex literals (with the `v`
| 173 | + flag enabled), except that the following are not supported for now:
| 174 | + |
| 175 | + - Regex flags (e.g. `i`, `g`, `m`, etc.) |
| 176 | + - Lookahead and lookbehind assertions (e.g. `(?=...)`, `(?!...)`, etc.) |
| 177 | + - Backreferences (e.g. `\1`, `\2`, etc.) |
| 178 | + - Named capture groups (e.g. `(?<name>...)`) |
| 179 | + - Unicode property escapes (e.g. `\p{...}`, `\P{...}`) |
| 180 | + - Scoped modifiers (e.g. `(?i:...)`, `(?-i:...)`) |