Skip to content

Conversation

zazedd
Copy link

@zazedd zazedd commented Aug 1, 2025

Related to #12

This PR adds support for pattern guards. The scope for the guards contains the variables bound by the regexes.

Details

This feature changes the generated code the most.

Before: Each pattern case generated inline matching code:

match Re.exec_opt (fst _ppx_regexp_1) _ppx_regexp_v with
  | None -> default_rhs
  | Some _g ->
      if Re.Mark.test _g marks.(0) then
        let x = Re.Group.get _g 1 in
        ...
      else if Re.Mark.test _g marks.(1) then
        ...
      else
        assert false

After: Pattern cases are compiled into handler functions that return option values:

match Re.exec_opt (fst _ppx_regexp_1) _ppx_regexp_v with
  | None -> default_rhs
  | Some _g ->
      let _case_0 _g =
        (* RE bindings are put here *)
        let x = Re.Group.get _g 1 in
        (* potential guard checks here, for example *)
        if int_of_string x > 100 then Some (`Large x)
        else if int_of_string x > 10 then Some (`Medium x)
        else Some (`Small x)
      in
      let _case_1 _g = ... in
      ...
      let _case_n _g = ... in
      match
        if Re.Mark.test _g marks.(0) then _case_0 _g
        else if Re.Mark.test _g marks.(1) then _case_1 _g
        ...
        else if Re.Mark.test _g marks.(1) then _case_n _g
        else None
      with
      | Some result -> result
      | None -> default_rhs

Cases with the same pattern but different guards are grouped together.

Note

This updates the opam package to use ppxlib <= "0.35.0" only, as ppxlib.0.36.0 has breaking changes in the Parsetree AST.

@zazedd zazedd force-pushed the upstream/pattern-guards branch from 30496fa to eca58e3 Compare August 1, 2025 12:45
@zazedd zazedd marked this pull request as ready for review August 1, 2025 17:40
@Drup
Copy link
Collaborator

Drup commented Aug 5, 2025

How does this work precisely ? In particular, how do you compute cases on which to "fall through" in case of guard failure ?

@zazedd
Copy link
Author

zazedd commented Aug 7, 2025

This snippet is what controls the fall-through semantics:

let rec mk_guard_chains = function
  | [] -> [%expr None]
  | (_, bs, rhs, guard) :: rest ->
    let bs = List.rev bs in
    begin
      match guard with
      | None -> [%expr Some [%e wrap_group_bindings ~captured_acc:[] ~loc rhs offG bs]]
      | Some guard_expr ->
        let guarded = [%expr if [%e guard_expr] then Some [%e rhs] else [%e mk_guard_chains rest]] in
        wrap_group_bindings ~captured_acc:[] ~loc guarded offG bs
    end

If there is no guard, then simply return the rhs with the bindings
If there is a guard, then generate: if guard then Some result else (* try next case *)

Effectively, if one writes:

match str with
| {|regex1|} when x > 0 -> rhs1
| {|regex1|} when x < 0 -> rhs2
| {|regex1|} -> rhs3
| _ -> ...

It will generate:

if Re.Mark.test _g regex1_mark then
  let x = Re.Group.get _g 1 in
  if x > 0 then Some rhs1
  else if x < 0 then Some rhs2
  else Some rhs3
else

And, if one writes a similar match pattern but where there is no case without a guard

match str with
| {|regex1|} when x > 0 -> rhs1
| {|regex1|} when x < 0 -> rhs2
| _ -> ...

Then it generates:

if Re.Mark.test _g regex1_mark then
  let x = Re.Group.get _g 1 in
  if x > 0 then Some rhs1
  else if x < 0 then Some rhs2
  else None
else

Afterwards, the last match ... with | Some result -> result | None -> default_rhs takes care of providing the correct result.

@Drup
Copy link
Collaborator

Drup commented Aug 7, 2025

What if you write

match str with
| {|regex1|} when x > 0 -> rhs1
| {|regex2|} -> rhs2
| _ -> ...

Such that regex1 and regex2 intersect ?

@zazedd
Copy link
Author

zazedd commented Aug 7, 2025

Before the last match, there is a sequence of if Re.Mark.test _g marks.(n) then _case_n _g else .... The first one that matches (and has its pattern guard pass) is the one that takes precedence.
In your example, if str matches regex1 and x > 0 then rhs1 would execute, else if str matches regex2 then rhs2 would execute, else the catch all case would.

@Drup
Copy link
Collaborator

Drup commented Aug 7, 2025

The question is : what if str matches both regex1 and regex2, but doesn't validate x > 0. It seems to me the proposed code would enter case 1, return None (since it doesn't validate the guard), but never visit case 2.

Additionally, I implemented this kind of dispatch for Tyre, and remember marks being quite a bit tricky to use across different branches. I'll have to do some tests with the generate code.

I would definitely want some additional test cases here.

(Some benchmarks also wouldn't hurt, but that's for the whole library, really ...)

@zazedd
Copy link
Author

zazedd commented Aug 8, 2025

You're totally right! Big oversight on my part.
I've thought about it for a bit and I like this solution, essentially a search for the first Some result in all branches for which Re.Mark.test _g ... passes:

open struct
 ...
 let rec __ppx_regexp_dispatch marks handlers _g i =
  if i >= Array.length marks then None
  else if Re.Mark.test _g marks.(i) then
    match handlers.(i) _g with
    | Some result -> Some result
    | None -> __ppx_regexp_dispatch (i + 1)
  else __ppx_regexp_dispatch (i + 1)
  ...
end

...
  let handlers = [| _case_0; _case_1; ... |] in
  match __ppx_regexp_dispatch marks handlers _g 0 with
  | Some result -> result
  | None -> default_rhs

(basically a List.find_map, but because the lib is for OCaml 4.03+ and that function is only from 4.10+ we can't use it, and because we would potentially need to allocate a very big list which I don't like)

But there's still the problem of the Marks, it refuses to mark until the end, only until it finds the first match... And I'm guessing it isn't a trivial feature to add to the Marking machinery of ocaml-re

@zazedd
Copy link
Author

zazedd commented Sep 9, 2025

I was able to get around this, unfortunately resorting to not compiling to a single group.
The ahrefs fork currently builds various compilation groups.

If the regexes in the current group:

  • do not have pattern guards, then if the current regex:
    • is pattern guardless as well, it can belong to the same group
    • has a pattern guard, it starts a new group
  • have pattern guards, then if the current regex
    • has the same RE and flags, it can belong to the same group
    • doesn't have the same RE and flags, then start new group

Unfortunately the cases where a new group is started for guarded cases is the most common, but for this not to be the case we would need to have a way to understand if the RE's are intersecting or not, which isn't easy with ocaml-re due to it using a DFA.

I will update this PR to use this system as well, unless of course there's better ideas.

@paurkedal
Copy link
Owner

It seems reasonable to split up into multiple groups to handle guards. The only alternatives I can think of have other inefficiencies. I think the grouping can be slightly simplified/optimised by allowing a pattern with a guard to be the final case of each group, i.e.:

  • Merge consecutive cases having identical patterns. There will typically be one or more guarded patterns, possibly followed by an unguarded pattern.
    • If at least one case is unguarded, then the merged cases will be unguarded, thus optimising the next step.
    • Otherwise, the effective guard will be a disjunction, but it will be implied by chain of if statements terminating in the continuation where we proceed to the next group of cases.
  • Split cases into the largest groups each containing consecutive unguarded cases followed by a single guarded case, which is optional for a final unguarded group.
  • Consider each group in order until a match is found, in which case:
    • If the matching pattern is not a final guarded case, it's a match.
    • If the matching pattern is a final guarded case:
      • If the guard evaluates to true, it's a match.
      • Otherwise, proceed with the next group.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants