Commit da55b5c

initial lexmatch longest
1 parent 532eabe commit da55b5c

Lines changed: 180 additions & 0 deletions
@@ -0,0 +1,180 @@
# Lexmatch expression (longest match strategy)

- Proposal:
  [ME-0011](https://github.com/moonbitlang/moonbit-evolution/blob/0011-lexmatch-expression/proposals/0011-lexmatch-expression-longest.mbt.md)
- Author: [Wen Yuxiang](https://github.com/hackwaly)
- Status: Experimental
- Review and discussion: [GitHub
  pull request](https://github.com/moonbitlang/moonbit-evolution/pull/15)

## Introduction

MoonBit aims to be well suited to data processing tasks. Currently, the `match`
expression is the primary way to destructure and analyze data. However, it is
less expressive than regular expressions for string processing. This proposal
introduces a new expression called `lexmatch`, which combines the capabilities
of `match` and regular expressions to provide a more flexible and powerful way
to analyze and destructure strings.

## Examples

### Word count

The following function counts the number of lines, words, and characters in a
given input string. It uses the `lexmatch` expression to match different
patterns in the input string.

```moonbit
///|
/// Counts lines, words, and characters in `input`, threading the running
/// totals through `lines`, `words`, and `chars`.
pub fn wordcount(
  input : BytesView,
  lines : Int,
  words : Int,
  chars : Int,
) -> (Int, Int, Int) {
  lexmatch input with longest {
    // A newline ends the current line.
    ("\n", rest) => wordcount(rest, lines + 1, words, chars)
    // A run of non-whitespace bytes is one word.
    ("[^ \t\r\n]+" as word, rest) =>
      wordcount(rest, lines, words + 1, chars + word.length())
    // Any other single byte matched by `.` counts as one character.
    (".", rest) => wordcount(rest, lines, words, chars + 1)
    // Empty input: return the totals.
    "" => (lines, words, chars)
    _ => panic()
  }
}
```

## Explanation

### Terminology

- **Target**: The `StringView` or `BytesView` to be `lexmatch`ed.
- **Match Strategy**: The strategy used to match patterns. It can be either
  `longest` or `first` (default).

  In this proposal, we only focus on the `longest` match strategy.

- **Catch-all case**: A case with a variable or wildcard `_` as its left-hand
  side, which matches any target. It must be placed at the end of the
  `lexmatch` arms/cases to handle unmatched situations.
- **Lex Pattern**: The pattern part (as opposed to the guard part) on the
  left-hand side of a `lexmatch` arm/case (before `=>`).

  E.g. `("[^ \t\r\n]+" as word, rest)`

  A lex pattern can be one of the following:

  - Bare regex pattern: the regex pattern is matched against the whole target.

    E.g. `""`

    In this case, the regex pattern is `""`, which matches an empty `StringView`
    or `BytesView`.

  - Regex pattern followed by a comma and a rest variable: the regex pattern
    is matched against a prefix of the target, and the rest variable is bound
    to the remaining suffix.

    The rest variable can be either a variable or a wildcard `_`.

    In this form, the parentheses are required to improve readability.

    E.g. `("\n", rest)`

    In this case, the regex pattern is `"\n"`, which matches a newline character
    at the beginning of the target. The `rest` variable is bound to the
    remaining suffix of the target after the matched prefix is removed.

- **Regex Pattern**: Regex patterns have three forms:

  - **Regex Literal**: A string literal representing a regex pattern.

    E.g. `"[^ \t\r\n]+"`

  - **Capture**: A regex pattern followed by `as` and a variable name to capture
    the matched substring.

    E.g. `"[^ \t\r\n]+" as word`

    If the lex pattern is a bare regex pattern of this form, the parentheses are
    required.

  - **Sequence**: A sequence of regex patterns separated by whitespace.

    E.g. `"//" ("[^\r\n]*" as comment)`

    If the lex pattern is a bare regex pattern of this form, the parentheses are
    required.

  Regex patterns can be nested to form more complex patterns.
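
As a minimal sketch of how the three regex pattern forms compose, the helper below uses a sequence whose second element is a capture, placed inside the prefix-and-rest form of a lex pattern. The function name `line_comment` is hypothetical and not part of the proposal, and the sketch assumes the experimental `lexmatch` syntax described above is available in the toolchain.

```moonbit
///|
/// Hypothetical helper: returns the text of a leading `//` comment, if any.
pub fn line_comment(input : StringView) -> StringView? {
  lexmatch input with longest {
    // Sequence: the literal "//" followed by a captured run of characters
    // up to (but not including) the end of the line.
    ("//" ("[^\r\n]*" as comment), _) => Some(comment)
    // Catch-all case: the input does not start with a line comment.
    _ => None
  }
}
```

On the reading that extra parentheses are required only for bare patterns, the sequence needs none of its own here because it already sits inside the `(regex, rest)` form; only the nested capture is parenthesized, as in the `"//" ("[^\r\n]*" as comment)` example above.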

### Semantics

The `lexmatch` expression works similarly to the `match` expression, with the
following differences:

1. The target of a `lexmatch` expression must be a `StringView` or `BytesView`.
2. Each arm/case of a `lexmatch` expression, except the catch-all case, must
   have a lex pattern as its left-hand side.
3. The match strategy can be specified after the `with` keyword. If not
   specified, the default strategy is `first`. The `first` strategy is not
   available at the moment.
4. The regex patterns in lex patterns are matched against the target using the
   specified match strategy.
5. If a regex pattern matches the target, any capture variables in the pattern
   are bound to the corresponding matched substrings.
6. In a lex pattern with a comma and a rest variable, the regex pattern is
   matched against a prefix of the target, and the rest variable is bound to
   the remaining suffix.
7. If no lex pattern matches the target, the catch-all case is executed.
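
The interaction between arm order and the `longest` strategy can be illustrated with a keyword-versus-identifier sketch. The function `classify` and its result strings are hypothetical, and the comments assume that `longest` follows the conventional maximal-munch rule (the arm that consumes the most input wins), which this proposal does not spell out in detail.

```moonbit
///|
/// Hypothetical helper: classifies the token at the start of `input`.
pub fn classify(input : StringView) -> String {
  lexmatch input with longest {
    // Matches exactly the two characters "if" at the start of the target.
    ("if", _) => "keyword"
    // Matches a run of lowercase letters. Under the assumed maximal-munch
    // rule, this arm is preferred whenever its match is longer.
    ("[a-z]+", _) => "identifier"
    // Catch-all case, required at the end of the arms.
    _ => "unknown"
  }
}
```

Under that assumption, an input such as `"iffy"` would be classified as an identifier even though the keyword arm comes first, because the second arm consumes four characters and the first only two. How ties between arms with equally long matches are resolved is not stated here, so the sketch should not be read as relying on any particular tie-breaking rule.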

### Subtleties

- When capturing a single character, the captured value is a `Char` or `Byte`
  instead of a `StringView` or `BytesView`. E.g. `("[+-]" as sign)`

- The `"(abc)"` regex pattern does not introduce a capture group. To capture the
  matched substring, you need to use the `as` syntax. E.g. `"abc" as group`
  instead of `"(abc)"`.

- The `"$"` regex pattern matches the end of the target `StringView` or
  `BytesView`. The `"^"` regex pattern matches the start of the target (not
  implemented for now).
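
The first subtlety affects how a captured value is used. The sketch below, with the hypothetical helper name `split_sign`, captures a single-character class and compares the result against a `Char` literal; it assumes the target is a `StringView`, so that the capture is a `Char` as described above.

```moonbit
///|
/// Hypothetical helper: splits an optional leading sign from the rest of
/// the input and returns it as a factor of 1 or -1.
pub fn split_sign(input : StringView) -> (Int, StringView) {
  lexmatch input with longest {
    ("[+-]" as sign, rest) => {
      // `sign` is a single-character capture, so it is compared with a
      // `Char` literal rather than with a string.
      let factor = if sign == '-' { -1 } else { 1 }
      (factor, rest)
    }
    // Catch-all case: no sign present, so the whole input is returned.
    _ => (1, input)
  }
}
```

If the target were a `BytesView` instead, the same capture would be a `Byte`, and the comparison would need a byte literal.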

### Recipes

### Search for a marker in a string

```moonbit
pub fn search_marker(str: StringView) -> StringView? {
  for curr = str {
    lexmatch curr with longest {
      // End of input reached without finding the marker.
      "" => return None
      // Marker found: return everything after it.
      ("MARKER", right) => return Some(right)
      // Otherwise skip one character and keep searching.
      (".", rest) => continue rest
      _ => panic()
    }
  }
}
```

### FAQ

- Why not use the `match` expression with regex patterns directly?

  The `match` expression is designed for structural pattern matching, while the
  `lexmatch` expression is designed for lexical analysis. Mixing the two
  concepts may lead to confusion and complexity. By introducing a separate
  expression for lexical analysis, we can keep the semantics clear and focused.

- Which syntax/features can be used in regex literals?

  Basically, the syntax is aligned with JavaScript regex literals (with the `v`
  flag enabled), except that the following are not supported for now:

  - Regex flags (e.g. `i`, `g`, `m`, etc.)
  - Lookahead and lookbehind assertions (e.g. `(?=...)`, `(?!...)`, etc.)
  - Backreferences (e.g. `\1`, `\2`, etc.)
  - Named capture groups (e.g. `(?<name>...)`)
  - Unicode property escapes (e.g. `\p{...}`, `\P{...}`)
  - Scoped modifiers (e.g. `(?i:...)`, `(?-i:...)`)
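
Since neither the `i` flag nor scoped modifiers are available, one possible workaround for case-insensitive matching is to spell out each letter as a character class. The sketch below shows this with a hypothetical helper, `starts_with_select`; it relies only on character classes, which the examples in this proposal already use.

```moonbit
///|
/// Hypothetical helper: checks for a leading "select" in any letter case.
pub fn starts_with_select(input : StringView) -> Bool {
  lexmatch input with longest {
    // Each class accepts the upper- or lowercase form of one letter,
    // standing in for the unsupported `i` flag.
    ("[Ss][Ee][Ll][Ee][Cc][Tt]", _) => true
    // Catch-all case: the input starts with something else.
    _ => false
  }
}
```

This only covers ASCII letters written out by hand; it is a stopgap for short keywords, not a general replacement for case-insensitive flags.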
