-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathregex.fdoc
303 lines (264 loc) · 11 KB
/
regex.fdoc
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
@tangler re1.flx = examples/regex/re1.flx
@title Regular Expression DSSL
@h1 Regular Definitions
Felix has a DSSL (Domain Specific SubLanguage) for regular expressions, or more
precisely, <em>regular definitions</em>. We'll first show example of using it,
and then show how he DSSL is constructed.
@h2 Example Definitions
Here is a simple example defining C identifiers:
@tangle re1.flx
regdef digit = charset "01234356789";
regdef lower = charset "abcdefghijklmnopqrstuvwxyz";
regdef upper = charset "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
regdef letter = lower | upper;
regdef underscore = "-";
// C identifier
regdef cidlead = underscore | letter;
regdef cidtrail = underscore | letter | digit;
regdef cid = cidlead cidtrail*;
@
The @{regdef} binder defines an ordinary @{var} containing a representation
of the specified regular expression. But it does a lot more than that:
it enables a <em>regular definition grammar</em> based on EBNF.
Regular definitions are <em>vastly superior</em> to regular expressions
because they allow the expression to be factored into components;
in other words, they provide <em>modularity</em>.
In addition, the use of program language syntax is a killer advantage
over the usual regular expression string forms because the distinction
between operators and data is not only clear .. the syntax is checked
by the parser at compile time.
We'll now define something a bit more complicated:
@tangle re1.flx
// C int
regdef cint = digit+;
// C decimal fixed point float
regdef ffixed = digit* "." digit+ | digit+ "." digit*;
// Scifloat
regdef expon = ("E" | "e") ("+"|"-")? digit+;
regdef sci = (cint | ffixed) expon;
regdef ffloat = ffixed | sci;
@
These definitions should be quite clear as written: for fixed point float,
a sole dot is not a floating point number, but leading or trailing digists can be left out
provcided there is a dot, but not both.
For a escientific float, we need either a plain integer of fixed point
float, followed by an exponent, which allows but does not require a sign.
@h2Simple Usage
A variable defined by a @{regdef} has type @{regex}. However we are using Google RE2
as our regular expression engine, and that requires a compiled regular expression or type @{RE2}
to work. The RE2 compiler compiles the usual kinds of strings, so we need first to convert
a @{regex} to a string.
So here is our first test case:
@tangle re1.flx
var float_s : string = ffloat.Regdef::render; // convert regex to string
var float_r : RE2 = float_s.RE2; // compile string to RE2 regex
println$ "123.4e-7" in float_r; // if it is a float?
println$ float_r;
@
Now we printed out the actually rendered regular expression encoded as a string
and here it is, with some newlines added so it fits on our page:
@pre
(?:[01234356789])*\x2E(?:[01234356789])+|(?:[01234356789])+
\x2E(?:[01234356789])*|(?:(?:[01234356789])+|(?:[01234356789])*
\x2E(?:[01234356789])+|(?:[01234356789])+\x2E(?:[01234356789])*)
(?:E|e)(?:\x2B|\x2D)?(?:[01234356789])+
@
The chance of getting this very simple regex correct as a string
does not seem very high.
An anecdote: there was once a site that performance tested various languages,
and one of the tests was a simple phone number regex. I think everyone copied
the regex used by the first contribution, except Felix which used regular
definitions. <em>Every single language except Felix got it wrong.</em>
@h2 Captures
Ofen we want to capture data from a regex. Although it made a mess and
got the maths wrong, Perl had the ability to perform captures.
In light of some mathematical analysis new techniques were developed,
but in the end Google RE used a simple capture engine and threw out
balancing captures to ensure it worked.
Here's s simple example
@tangle re1.flx
regdef assignment = " "* group(cid) " "* "=" " "* (group(cint) | group(ffloat)) " "* ";";
var a_s = assignment.Regdef::render;
var a_r = a_s.RE2;
var result = Match (a_r, "x = 123.45;");
match result with
| None => println$ "No match";
| Some v =>
print$ "Variable is " + v.1; // group 1
if v.2 == "" do // if group 2 didn't match
println$ " Initialiser is float " + v.3; // group 3 matched
else
println$ " Initialiser is int " + v.2; // group 2 matched
done
endmatch;
@
The @{Match} function returns an option type which is @{None} is there is no match,
otherwise @{Some v} if there is. The @{v} is a Felix @{varray[string]} with slot 0
containing the whole matchingh pattern (which is always the whole input)
and then one slot for each catpture group. The capture groups are numbered by a left
to right depth first recursive descent of the implicit tree structure.
If a group doesn't match anything it is set to the empty string.
If a group is inside a repetition only the last matching substring is retained.
@h2 The Sub bit of the Domain Specific Sub Language
So far, we have shown what is effectively a Domain Specific Language
or DSL. Yes, it is nested in Felix and interacts with it, but to truly
be a <em>sub</em> language we need to more completely integrate it
with Felix code.
The @{Match} function always matches the whole string. To work around this
we can define this:
@tangle re1.flx
var anys = ".*";
regdef anystring = perl(anys);
@
and stick it at the end. Alternatively we can do this:
@tangle re1.flx
var prefix_assignment = a_s + anys;
@
The important point here is that we can <em>lift a Felix string</em>
which is a traditional Perl regex string into the computation,
and the string can be denoted by <em>any</em> Felix expression of
type @{string}. You can write any kind of functional rendering of
repex string generating code you like.
@h2 The term tree
There is a second way to generate a regex by using a combinator tree
or type @{regex} directly. In fact, the Regdef DSSL grammar just provides a convenient
syntax for generating these trees. Here is the important part of the
definition from the library:
@felix
class Regdef {
variant regex =
| Alts of list[regex]
| Seqs of list[regex]
| Rpt of regex * int * int
| Charset of string
| String of string
| Group of regex
| Perl of string
;
...
@
Although the grammar provides a convenient way to construct some of these
trees, it cannot construct all of them. For example suppose you want to
syntax colour a programming language by highlighting keywords.
You would of course like to do this:
@felix
regdef kw = "proc" | "fun" | "variant";
@
but alas, this grammar only works with a fixed list of known keywords
which have to be cited literally as shown. What if you wanted to load
the keyword list from a file?
You mean, like this?
@tangle re1.flx
var data = "proc fun variant";
var lines = split(data," ");
var kws = unbox$ map (fun (x:string) => Regdef::String x) lines;
var kws_r = kws.Regdef::Alts;
println$ kws_r.str;
@
@h2 Inline regex
It is possible to build regex inline like this:
@tangle re1.flx
var cid2 = regexp ( cidlead cidtrail*);
@
which in this case is precisely equivalent to saying
@felix
regdef cide = cidlead | cidtrail*;
@
The parser switched to the regex syntax inside the parens.
Remember, a @{regex} is just a simple variant that builds a term tree.
The grammar that does all this is defined in the library; that is,
in user space.
@h1 Pattern matching
It is possible to pattern match with regexps. The ability to do
this is quite general built in to the way the compiler works,
rather than a feature of the regex DSSL: we enable user defined
pattern matches using a feature of Felix pattern matching
known as <em>higher order pattern matching</em>.
The way pattern matches work in Felix requires two functions:
<ol>
<li>The <em>match checker</em> tests to see if a pattern matches the match argument</li>
<li>The <em>extractor</em> fetches the argument of the variant constructor matched</li>
</ol>
The way this works with Felix data types is built in to the compiler,
but there is a way to write your own checker and extractor functions:
@tangle re1.flx
// the match checker
fun _match_ctor_re (r:RE2) (x:string) => x in r;
// the extractor
fun _ctor_arg_re (r:RE2) (x:string) =>
match Match (r,x) with
| Some y => y
// None case shouldn't happen!
endmatch
;
// test case
match "Hello" with
| re "H(ell)o".RE2 y => println$ y.1;
endmatch;
@
In this code we invent a new pretend constructor @{re} which takes two arguments.
The first is the pattern we want to consider, and the second is the data we're
checking against. In our case the pattern is an @{RE2} and the data is a string;
but the technique is fully general.
The first function has a magic name used for checking is the string is in
the regexp, the second provides the way to get data out of it. Notice the
extractor is called <em>if, and only if</em> the match checker returns true.
Therefore the word @{re} followed by a term of type @{RE2} will be treated
as a pattern with one argument.
Although the higher order pattern matching feature is not specific to
regular expressions .. it was in fact added to the compiler specifically
to support that use case.
@h1 Iteration
In Felix we have a concept of an iterator: it corresponds with
the C++ notation of an input iterator. The library contains
an iterator allowing regexps to find matching substrings of a string.
Unlike @{Match} our iterator scans for the first match, and returns.
If the iterator is called again, it finds the next match.
Here's how to use it:
@tangle re1.flx
var kw = kws_r.Regdef::render.RE2;
for xxv in iterator (kw, "proc blah fun proc") do
println$ xxv.0;
done
@
Here's the implementation
@felix
gen iterator (r:RE2, var target:string) () : opt[varray[string]] = {
var emptystring = "";
var l = len target;
var s = StringPiece target;
var p1 = s.data;
var p = 0;
var n = NumberOfCapturingGroups(r)+1;
var v1 = varray[StringPiece] (n.size,StringPiece emptystring);
var v2 = varray[string] (n.size,"");
again:>
var result = Match(r, s, p, UNANCHORED,v1.stl_begin, n);
if not result goto endoff;
for var i in 0 upto n - 1 do set(v2, i.size, string(v1.i)); done
var p2 = v1.0.data;
assert(v1.0.len.int > 0); // prevent infinite loop
p = (p2 - p1).int+v1.0.len.int;
yield Some v2;
goto again;
endoff:>
return None[varray[string]];
}
@
This uses a whole lot of features of the Google RE2 system
which are lifted into Felix, such as the data type @{StringPiece}
which is roughly a view of a string, and which I am not going
to explain.
Rather the take away is that we can implement high level DSSL
features based on a C/C++ library, by lifting the library API
into Felix, more or less secretly, and then use them to implement
the operations we actually want, with the syntax we actually want.
In many case the core system deliberately has features to support
this kind of modelling technology. In the code above, the key
piece of magic is the @{yield} operation, which returns a value,
but also saves the current location in the code, so that re-invoking
the generator will restart the code with the same context and at the
same point it previously left off.
<em>Yielding Generators</em> are a common feature of many
languages in including Python and Rust. They are in fact a weak
form of coroutine.