Add support for pickle or some serialization/deserialization #41
Comments
What's the motivation for this? It would be useful to understand that since that helps inform the design approach (or whether it's even necessary at all, maybe I can suggest an alternative). As far as whether it's possible... two approaches:
Either way there would be some code added to this repo as well. |
Sadly this is not a small patch if we do it the way I want to do it. And as you point out, it's a compatibility hazard. (I do agree that a naive patch adding Serde support would probably be smallish in isolation.) I do want some kind of serialization support some day, like what is in |
(twitches in big-endian) Not that anyone runs on big-endian platforms these days? I guess you could use some zero-copy serialization format like capnproto or something to represent the data structures? Regardless, yeah, that'd be a lot of work. And I guess it's possible naive serde deserialization wouldn't even be any better than reparsing the needles, especially if the Python wrapper switches to the faster new NFA thing by default. |
Naive serde deserialization will probably be noticeably faster than actually building the automaton itself. That kind of deserialization would do a lot of bulk copying, which is going to be much faster than the byte-by-byte construction process of Aho-Corasick (and then the breadth-first search of the entire trie to fill in the failure transitions). As far as big-endian goes, indeed, for I'd be surprised if something like capnproto could work for this, but I don't know of anyone who has tried it. |
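For readers unfamiliar with the two construction phases mentioned above, here is a minimal pure-Python sketch: first the needles are inserted into a trie one character at a time, then a breadth-first pass fills in the failure transitions. This is a textbook-style illustration, not the code used by this project or by the underlying Rust crate; the names `build_automaton` and `find_all` are made up for the example.

```python
from collections import deque


def build_automaton(needles):
    """Build goto/fail/output tables for a toy Aho-Corasick automaton."""
    goto = [{}]   # goto[state] maps a character to the next state
    fail = [0]    # fail[state] is the failure transition of that state
    out = [[]]    # out[state] lists indices of needles that end at that state

    # Phase 1: insert every needle into the trie, one character at a time.
    for idx, needle in enumerate(needles):
        state = 0
        for ch in needle:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = nxt
            state = nxt
        out[state].append(idx)

    # Phase 2: breadth-first search over the whole trie to fill in failure
    # transitions. Depth-1 states keep fail == 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
            queue.append(nxt)
    return goto, fail, out


def find_all(automaton, haystack):
    """Return (needle_index, end_position) pairs, overlapping matches included."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for pos, ch in enumerate(haystack):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for idx in out[state]:
            matches.append((idx, pos))
    return matches
```

Both phases touch every state individually, which is why loading prebuilt tables in bulk could plausibly be cheaper than repeating the build.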
Anyway we'll see what the use case is. Pickling is often used in e.g. multiprocessing support in Python, for example, and for that use case aggressively invalidating old pickles with versioning isn't a problem since it's always the same version of the library in parent and child process. |
For me, the use case would be caching the automaton. Currently, we build it from the (cached) terms in each request where it is needed, but that takes considerable time and I think deserializing would be much faster. For compatibility, it would be enough if it refused to load incompatible serialized data; we can then fall back to rebuilding it from the terms. |
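The "refuse incompatible data, then fall back to rebuilding" idea can be sketched roughly as follows. Everything here is hypothetical: `FORMAT_VERSION`, `build_tables`, `save`, and `load_or_rebuild` are illustrative names, the nested-dict trie is only a stand-in for the real automaton, and this project does not currently expose any such API. The sketch also assumes the cache file is trusted, since unpickling untrusted data is unsafe.

```python
import pickle
import tempfile
from pathlib import Path

FORMAT_VERSION = 1  # hypothetical; bump whenever the stored layout changes


def build_tables(terms):
    """Stand-in for the expensive automaton build: a plain nested-dict trie."""
    root = {}
    for term in terms:
        node = root
        for ch in term:
            node = node.setdefault(ch, {})
        node["$end"] = term
    return root


def save(path, terms, tables):
    """Write the tables together with a version tag and the source terms."""
    payload = {"version": FORMAT_VERSION, "terms": list(terms), "tables": tables}
    Path(path).write_bytes(pickle.dumps(payload))


def load_or_rebuild(path, terms):
    """Return (tables, from_cache); rebuild when the cache is unusable."""
    terms = list(terms)
    try:
        payload = pickle.loads(Path(path).read_bytes())
        if payload["version"] == FORMAT_VERSION and payload["terms"] == terms:
            return payload["tables"], True
    except Exception:
        # Missing, truncated, or otherwise unreadable cache: rebuild below.
        pass
    tables = build_tables(terms)
    save(path, terms, tables)
    return tables, False
```

A request handler would call `load_or_rebuild(cache_path, terms)` and ignore the second return value; a version bump or a changed term list simply costs one rebuild.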
Another possible way would be to share the automaton between processes. |
+1 to this feature |
Hi,
Is there any plan to support pickling or some other form of serialization/deserialization?