Support custom extraction with cost functions #241
I think we would also need a way to get all parent nodes from a node in the serialized format for doing custom cost traversal... For example you might set some kind of |
For your reference, I wrote a PoC custom extraction with custom cost model at sklam/prototypes_2025@983b81d The example expands
This is currently hard to do and therefore omitted in my PoC. I would want to know that it is a For short term, is there a way to associate node in the serialized json back to the egglog-python
I'm very interested in the |
Is using an extractor like an ILP extractor currently supported in upstream egglog? Or support would be needed both there and in the python bindings? |
In #265 (released as 9.0.0) the ability to automatically extract a builtin using a global egraph context was added. This PR removes that feature, requiring all builtins to be in a normalized form. I realized that for #241 we want facts to compare structural equality when converting to a boolean, instead of using the e-graph. Looking at that previous PR, it seems like a mistake to add this implicit context, making things more confusing and opaque with minimal UX improvements.
If it is of any interest, I built a small and (hopefully) extensible Python library that supports custom extraction and custom cost models. You can implement a cost model or extractor independently, and I plan to port some of the extraction-gym extractors for my needs. Then, using some ugly deserialization, I convert the result back to an egglog object for further processing or code generation. It is based on @sklam's POC, i.e. parsing the serialized JSON and using networkx. It does not support the matching logic you described in the issue without serializing. I believe the proposed matching might be too "specific", in the sense that I don't always want to match against a full expression; sometimes I want a partial one (e.g. I raise to a power and there is an addition in the exponent). In addition, how does it support cycles in the graph? It is not fully in the spirit of the library, where everything is strongly typed, but it was necessary for me to support at least a custom cost model that can "peek" at the children of nodes.
Oh cool! Yeah, definitely of interest, feel free to post a link. This is what I would like to work on next, so in particular, if there are test cases or use cases that use the logic you wrote, those would be super helpful to look at to see if I can get something upstreamed here that covers them... |
I have been making progress with my POC. Here's a long notebook that goes through compiling a Python function, encoding it into an e-graph, custom extraction, and MLIR codegen: https://numba.github.io/sealir/demo_geglu_approx_mlir.html#extracting-an-optimized-representation I have changed the extraction logic since the POC. The previous code mishandles cycles. The new algorithm approximates the solution using an iterative relaxation, and it naturally deals with cycles without much complexity: https://github.com/sklam/sealir/blob/05e85a1ed2c7ff4c6f5ec5f7dd901833aa3516ee/sealir/eqsat/rvsdg_extract.py#L145-L202 . For the cost modeling, my next step is to look into the Pareto curve for time-cost vs power-cost (or vs precision). The new algorithm is easier to reason about (for me) in that kind of scenario. So, I'll want the cost to be an N-D vector soon.
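To make the idea of relaxation-based extraction concrete, here is a minimal sketch in plain Python. It is not the sealir code linked above: the toy `EGRAPH` layout and the `relax_costs` and `build_term` helpers are assumptions made up for illustration. Every e-class starts at infinite cost, and costs are lowered round by round until nothing changes, so a node that refers back to its own e-class simply never wins.

```python
import math

# Hypothetical toy e-graph: e-class id -> list of e-nodes.
# Each e-node is (op, own_cost, child e-class ids).
# Class "a" contains a node that points back at "a", so the graph is cyclic.
EGRAPH = {
    "a": [("mul_one", 1.0, ("a",)), ("add", 2.0, ("x", "y"))],
    "x": [("var_x", 1.0, ())],
    "y": [("var_y", 1.0, ())],
}


def relax_costs(egraph, max_rounds=100):
    """Return (cost, choice): best known cost and chosen e-node per e-class."""
    cost = {cid: math.inf for cid in egraph}
    choice = {}
    for _ in range(max_rounds):
        changed = False
        for cid, nodes in egraph.items():
            for node in nodes:
                _op, own, children = node
                # Tree-style cost: a shared child is counted once per use.
                total = own + sum(cost[c] for c in children)
                if total < cost[cid]:
                    cost[cid] = total
                    choice[cid] = node
                    changed = True
        if not changed:  # fixed point reached
            break
    return cost, choice


def build_term(choice, cid):
    """Rebuild a nested-tuple term from the chosen e-nodes.

    With strictly positive node costs the chosen nodes cannot form a cycle.
    """
    op, _own, children = choice[cid]
    return (op, *[build_term(choice, c) for c in children])


cost, choice = relax_costs(EGRAPH)
best = build_term(choice, "a")
```

In this example the self-referential `mul_one` node is never selected, because its cost always exceeds the current cost of its own e-class. Replacing the scalar cost with a vector would need a different comparison than `<`, which this sketch does not attempt.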
Currently, the extraction in egglog is rather limited. It does a tree-based extraction (meaning that if a node shows up twice, it will be counted twice) and requires static costs per function.
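As a small illustration of what "counted twice" means, the sketch below compares a tree cost with a DAG cost on a term that shares a subexpression. The nested-tuple term encoding and the helper names `tree_cost` and `dag_cost` are assumptions for this example only, and every operator is given a static cost of 1.

```python
# Terms are nested tuples: (op, child, child, ...). Every op costs 1 here.

def tree_cost(term):
    """Count every occurrence of a subterm, as a tree-based extractor would."""
    return 1 + sum(tree_cost(child) for child in term[1:])


def dag_cost(term):
    """Count each distinct subterm once, as a DAG-aware extractor would."""
    seen = set()

    def visit(t):
        if t in seen:
            return
        seen.add(t)
        for child in t[1:]:
            visit(child)

    visit(term)
    return len(seen)


shared = ("mul", ("x",), ("y",))
expr = ("add", shared, shared)  # the same product is used twice

tree_total = tree_cost(expr)
dag_total = dag_cost(expr)
```

Here the tree cost is 7 while the DAG cost is 4, because the shared `mul` subterm and its two leaves are paid for only once in the DAG view.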
The first issue, the type of extractor, could be alleviated by using some extractors from extraction gym. The second, having some custom costs per item could be addressed upstream in egglog (egraphs-good/egglog#294) but is not on the immediate roadmap.
Either way, it would also be nice to have fully custom extraction. Being able to iterate through the e-graph and do what you will...
Currently, it's "possible" by serializing the e-graph to JSON. But this is not ideal: you have to work with JSON that has random keys, those keys might not map to your Python function names, and nothing is type safe. It's a real pain!
So I think it would make sense to add an interface that allows:
Possible Design
Here is a possible API design for the extractors
Using this interface, you could use the default costs from egglog and use a custom extractor, as shown in the helper extract method. However, you could also set custom costs before serializing, overriding any from egglog:
How would you be able to traverse an expression at runtime and see its children? I think with three small additions, we should be able to do this with our current API:
int(i64(0))
bool(eq(x).to(y)), which will resolve to whether the two sides are exactly syntactically equal.
fn_matches(x, f) would return a boolean to say whether the function matches, and then fn_args(x, f) would return a list of the args. They could be typed like this:
Alternatively, how would you create a custom extractor? We would want to add one more way to traverse expressions, this time not caring about what particular expression they are, just their args and a way to re-create them with different args. Using that, we could write a simple tree-based extractor:
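A minimal sketch of that idea on a toy term type, not the egglog API: `Term`, `with_args`, `CLASSES`, `OP_COST`, and `extract` are all hypothetical names invented here. The extractor never inspects which expression it is looking at; it only reads the args and rebuilds the node with the extracted children. It assumes an acyclic e-graph.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """Toy expression: an operator name plus its args."""
    op: str
    args: tuple = ()

    def with_args(self, new_args):
        # Re-create the same node with different args.
        return Term(self.op, tuple(new_args))


# Hypothetical toy e-graph: e-class id -> alternative nodes.
# Inside the e-graph, a node's args are e-class ids (strings).
CLASSES = {
    "root": [Term("pow", ("x", "two")), Term("mul", ("x", "x"))],
    "x": [Term("x")],
    "two": [Term("lit2")],
}

# Static per-operator costs, assumed for this example.
OP_COST = {"pow": 10, "mul": 2, "x": 1, "lit2": 1}


def extract(class_id):
    """Return (cost, term) for the cheapest tree rooted at an e-class."""
    best = None
    for node in CLASSES[class_id]:
        children = [extract(arg) for arg in node.args]
        total = OP_COST[node.op] + sum(cost for cost, _ in children)
        if best is None or total < best[0]:
            best = (total, node.with_args(term for _, term in children))
    return best


best_cost, best_term = extract("root")
```

With these costs the `mul` alternative is chosen over `pow`, and the returned term has concrete `Term` children instead of e-class ids.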