Commit ac7342b

More regex
1 parent 1aeb679 commit ac7342b

3 files changed: +279 -53 lines changed
Diff for: examples/regex/re1.flx

+29
@@ -46,3 +46,32 @@ var anys = ".*";
 regdef anystring = perl(anys);
 #line 133 "regex.fdoc"
 var prefix_assignment = a_s + anys;
+#line 173 "regex.fdoc"
+var data = "proc fun variant";
+var lines = split(data," ");
+var kws = unbox$ map (fun (x:string) => Regdef::String x) lines;
+var kws_r = kws.Regdef::Alts;
+println$ kws_r.str;
+#line 183 "regex.fdoc"
+var cid2 = regexp ( cidlead cidtrail*);
+#line 210 "regex.fdoc"
+// the match checker
+fun _match_ctor_re (r:RE2) (x:string) => x in r;
+
+// the extractor
+fun _ctor_arg_re (r:RE2) (x:string) =>
+match Match (r,x) with
+| Some y => y
+// None case shouldn't happen!
+endmatch
+;
+
+// test case
+match "Hello" with
+| re "H(ell)o".RE2 y => println$ y.1;
+endmatch;
+#line 252 "regex.fdoc"
+var kw = kws_r.Regdef::render.RE2;
+for xxv in iterator (kw, "proc blah fun proc") do
+println$ xxv.0;
+done
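
Note on the checker/extractor pair added to re1.flx: the two-function protocol can be sketched outside Felix. The following is a rough Python analogue, not Felix code and not a description of how the Felix compiler dispatches patterns. The names `re_arm` and `run_match` are hypothetical, and whole-string matching is assumed because the document says `Match` always matches the whole string.

```python
import re
from typing import Callable, List, Optional, Tuple

# One "match arm" is a (checker, extractor, handler) triple.
Arm = Tuple[
    Callable[[str], bool],        # match checker: does the pattern apply?
    Callable[[str], List[str]],   # extractor: pull the argument out
    Callable[[List[str]], str],   # handler: body of the arm
]


def re_arm(pattern: str, handler: Callable[[List[str]], str]) -> Arm:
    """Build an arm whose checker and extractor are driven by a regex."""
    rx = re.compile(pattern)

    def check(x: str) -> bool:
        return rx.fullmatch(x) is not None

    def extract(x: str) -> List[str]:
        m = rx.fullmatch(x)
        # Only called after check(x) succeeded, so the None case should not happen.
        assert m is not None
        return [m.group(0), *m.groups("")]

    return check, extract, handler


def run_match(x: str, arms: List[Arm]) -> Optional[str]:
    """Try each arm in order; call the extractor only if the checker passed."""
    for check, extract, handler in arms:
        if check(x):
            return handler(extract(x))
    return None


arms = [re_arm("H(ell)o", lambda y: y[1])]
print(run_match("Hello", arms))
```

As in the Felix version, the extractor relies on the checker having succeeded first, which is why its failure branch is only an assertion.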

Diff for: regex.fdoc

+123 -27
@@ -123,7 +123,7 @@ be a <em>sub</em> language we need to more completely integrate it
 with Felix code.
 
 The @{Match} function always matches the whole string. To work around this
-we can define do this:
+we can define this:
 @tangle re1.flx
 var anys = ".*";
 regdef anystring = perl(anys);
@@ -140,7 +140,7 @@ repex string generating code you like.
 
 @h2 The term tree
 There is a second way to generate a regex by using a combinator tree
-or type @[regex} directly. In fact, the Regdef DSSL grammar just provides a convenient
+or type @{regex} directly. In fact, the Regdef DSSL grammar just provides a convenient
 syntax for generating these trees. Here is the important part of the
 definition from the library:
 @felix
@@ -169,39 +169,135 @@ which have to be cited literally as shown. What if you wanted to load
 the keyword list from a file?
 
 You mean, like this?
-@felix
-var kws : Regdef::regex = Regdef::Alts(load("keywords.txt").split("\n"));
-regdef appl = " "* felix (kws) " "+ ffloat;
+@tangle re1.flx
+var data = "proc fun variant";
+var lines = split(data," ");
+var kws = unbox$ map (fun (x:string) => Regdef::String x) lines;
+var kws_r = kws.Regdef::Alts;
+println$ kws_r.str;
 @
-Here we constructed the alternatives term directly from a list of strings
-loaded from a file, and then lifted it into the grammar. As you can guess
-since the parser is building a @{regex} tree anyhow, the @{felix} term is
-a kind of escape of quotation which has no semantics, it just avoid
-translating the quoted term.
-
-And as you can see you can put Perl strings directly in there too
-using the @{Perl} constructor of the variant. Which of course
-is exactly what the @{perl} syntax element of the grammar does!
-
-SO you basically have a three level language system: a simple
-DSSL, the combinatorial form which is properly type checked,
-and a super lame <em>do not use except in emergency</em> form
-using strings with Perl encoded regexps.
-
-The primary point to be demonsrated here is the <em>sub</em> part
-of the DSSL concept. We have a domain specific language, yes, but
-it integrates completely with Felix. This ensure all the power
-of the base language is available in the sub language, whilst
-the sub language grammar eliminates error prone and hard to read boilerplate
-although it might not fully cover all capabilities.
 
+@h2 Inline regex
+It is possible to build regex inline like this:
+@tangle re1.flx
+var cid2 = regexp ( cidlead cidtrail*);
+@
+which in this case is precisely equivalent to saying
+@felix
+regdef cide = cidlead cidtrail*;
+@
+The parser switched to the regex syntax inside the parens.
+Remember, a @{regex} is just a simple variant that builds a term tree.
+The grammar that does all this is defined in the library; that is,
+in user space.
+
+@h1 Pattern matching
+It is possible to pattern match with regexps. The ability to do
+this is quite general and built in to the way the compiler works,
+rather than a feature of the regex DSSL: we enable user defined
+pattern matches using a feature of Felix pattern matching
+known as <em>higher order pattern matching</em>.
+
+The way pattern matches work in Felix requires two functions:
+<ol>
+<li>The <em>match checker</em> tests to see if a pattern matches the match argument</li>
+<li>The <em>extractor</em> fetches the argument of the variant constructor matched</li>
+</ol>
+
+The way this works with Felix data types is built in to the compiler,
+but there is a way to write your own checker and extractor functions:
+@tangle re1.flx
+// the match checker
+fun _match_ctor_re (r:RE2) (x:string) => x in r;
+
+// the extractor
+fun _ctor_arg_re (r:RE2) (x:string) =>
+match Match (r,x) with
+| Some y => y
+// None case shouldn't happen!
+endmatch
+;
+
+// test case
+match "Hello" with
+| re "H(ell)o".RE2 y => println$ y.1;
+endmatch;
+@
 
+In this code we invent a new pretend constructor @{re} which takes two arguments.
+The first is the pattern we want to consider, and the second is the data we're
+checking against. In our case the pattern is an @{RE2} and the data is a string;
+but the technique is fully general.
 
+The first function has a magic name used for checking if the string is in
+the regexp, the second provides the way to get data out of it. Notice the
+extractor is called <em>if, and only if</em> the match checker returns true.
 
+Therefore the word @{re} followed by a term of type @{RE2} will be treated
+as a pattern with one argument.
 
+Although the higher order pattern matching feature is not specific to
+regular expressions, it was in fact added to the compiler specifically
+to support that use case.
 
+@h1 Iteration
+In Felix we have a concept of an iterator: it corresponds with
+the C++ notion of an input iterator. The library contains
+an iterator allowing regexps to find matching substrings of a string.
+Unlike @{Match} our iterator scans for the first match, and returns.
+If the iterator is called again, it finds the next match.
 
+Here's how to use it:
+@tangle re1.flx
+var kw = kws_r.Regdef::render.RE2;
+for xxv in iterator (kw, "proc blah fun proc") do
+println$ xxv.0;
+done
+@
 
-
+Here's the implementation
+@felix
+gen iterator (r:RE2, var target:string) () : opt[varray[string]] = {
+var emptystring = "";
+var l = len target;
+var s = StringPiece target;
+var p1 = s.data;
+var p = 0;
+var n = NumberOfCapturingGroups(r)+1;
+var v1 = varray[StringPiece] (n.size,StringPiece emptystring);
+var v2 = varray[string] (n.size,"");
+again:>
+var result = Match(r, s, p, UNANCHORED,v1.stl_begin, n);
+if not result goto endoff;
+for var i in 0 upto n - 1 do set(v2, i.size, string(v1.i)); done
+var p2 = v1.0.data;
+assert(v1.0.len.int > 0); // prevent infinite loop
+p = (p2 - p1).int+v1.0.len.int;
+yield Some v2;
+goto again;
+endoff:>
+return None[varray[string]];
+}
+@
 
+This uses a whole lot of features of the Google RE2 system
+which are lifted into Felix, such as the data type @{StringPiece}
+which is roughly a view of a string, and which I am not going
+to explain.
+
+Rather the takeaway is that we can implement high level DSSL
+features based on a C/C++ library, by lifting the library API
+into Felix, more or less secretly, and then use them to implement
+the operations we actually want, with the syntax we actually want.
+
+In many cases the core system deliberately has features to support
+this kind of modelling technology. In the code above, the key
+piece of magic is the @{yield} operation, which returns a value,
+but also saves the current location in the code, so that re-invoking
+the generator will restart the code with the same context and at the
+same point it previously left off.
+
+<em>Yielding Generators</em> are a common feature of many
+languages including Python and Rust. They are in fact a weak
+form of coroutine.
 
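
Note on the iterator added above: since the text names Python as a language with yielding generators, the same scan-and-resume idea can be sketched there. This is a minimal illustration using Python's standard `re` module rather than RE2, and `iter_matches` is a hypothetical name, not part of the Felix library. Like the Felix version, it yields the whole match followed by the capture groups, and it asserts that a match is non-empty so the scan position always advances.

```python
import re
from typing import Iterator, List


def iter_matches(pattern: re.Pattern, target: str) -> Iterator[List[str]]:
    """Yield [whole match, group 1, ...] for each successive match in target."""
    pos = 0
    while True:
        # Unanchored scan starting at pos, like Match(..., UNANCHORED, ...) above.
        m = pattern.search(target, pos)
        if m is None:
            return  # plays the role of returning None in the Felix generator
        # An empty match would leave pos unchanged and loop forever.
        assert m.end() > m.start(), "empty match would loop forever"
        pos = m.end()
        # yield hands back a value and remembers where to resume.
        yield [m.group(0), *m.groups("")]


# Build an alternation from a keyword list, mirroring kws_r in re1.flx.
kw = re.compile("|".join(re.escape(w) for w in "proc fun variant".split(" ")))
found = [groups[0] for groups in iter_matches(kw, "proc blah fun proc")]
print(found)  # ['proc', 'fun', 'proc']
```

Each call to the generator resumes just after the `yield`, with `pos` preserved, which is the behaviour the text attributes to `yield` in Felix.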
