Add support for pickle or some serialization/deserialization #41

Open
pombredanne opened this issue Jan 19, 2023 · 8 comments

@pombredanne

Hi:
is there any plan to support pickling or some other form of serialization/deserialization?

```python
>>> import ahocorasick_rs
>>> patterns = ["hello", "world", "fish"]
>>> ac = ahocorasick_rs.AhoCorasick(patterns)
>>> import pickle
>>> q = pickle.dumps(ac)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: cannot pickle 'builtins.AhoCorasick' object
```
@itamarst
Collaborator

itamarst commented Jan 26, 2023

What's the motivation for this? It would be useful to understand, since that helps inform the design approach (or whether it's even necessary at all; maybe I can suggest an alternative).

As far as whether it's possible... two approaches:

  1. Rely on the underlying Rust library and serialize the state machine it creates. At the moment it does not support serialization (i.e. it doesn't integrate serde support). This would be a pretty small patch for them, so plausibly they'd be happy to do it. It would probably also want some sort of version tag, since the underlying object can change.
  2. Serialize just the underlying needles (search terms), and recreate the actual underlying search object during unpickling (see the sketch below). This means more expensive unpickling, but also means pickles would usually be backwards- and forwards-compatible.

Either way there would be some code added to this repo as well.
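
A minimal sketch of what option 2 could look like from the Python side, assuming a hypothetical wrapper class (`PicklableAhoCorasick` is made up for illustration; only `AhoCorasick` and `find_matches_as_strings` are real ahocorasick_rs API):

```python
import pickle

import ahocorasick_rs


class PicklableAhoCorasick:
    """Hypothetical wrapper: pickle only the needles, rebuild on unpickle."""

    def __init__(self, patterns):
        self.patterns = list(patterns)
        self._ac = ahocorasick_rs.AhoCorasick(self.patterns)

    def __reduce__(self):
        # Only the patterns travel through the pickle; the automaton is
        # rebuilt from scratch in __init__ during unpickling.
        return (PicklableAhoCorasick, (self.patterns,))

    def find_matches_as_strings(self, haystack):
        return self._ac.find_matches_as_strings(haystack)


ac = PicklableAhoCorasick(["hello", "world", "fish"])
restored = pickle.loads(pickle.dumps(ac))  # pays the full build cost again
```

That's exactly the trade-off described above: cheap, version-proof pickles, at the cost of a full rebuild on every unpickle.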

@BurntSushi

BurntSushi commented Jan 26, 2023

  1. This would be a pretty small patch for them, so plausibly they'd be happy to do it.

Sadly this is not a small patch if we do it the way I want to do it. And as you point out, it's a compatibility hazard. (I do agree that a naive patch adding Serde support would probably be smallish in isolation.)

I do want some kind of serialization support some day, like what is in regex-automata for DFAs, but not with Serde. With Serde, you have to go through a costly (de)serialization roundtrip. That might work for your pickle use case, but what I want is for the actual raw underlying byte representation in memory to also be the serialization format. (Again, like DFAs are in regex-automata.) That way, deserialization is done in constant time by simply re-interpreting the bytes. This unlocks lots of cool use cases like, "I have a huge Aho-Corasick automaton that I want to generate offline and embed into my binary so I don't have to generate it at runtime." But... it is a lot of work and tends to introduce a lot of unsafe where there is exactly none today.
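
As a toy illustration of the zero-copy idea in Python terms (purely conceptual; this is not how ahocorasick_rs or the Rust crate works today, and the file name is made up):

```python
# With a zero-copy format, the bytes on disk ARE the in-memory
# representation, so "deserialization" is reinterpreting a buffer
# rather than parsing it.
with open("automaton.le.bin", "rb") as f:  # hypothetical pre-built file
    blob = f.read()

# O(1) reinterpretation: view the raw bytes as a table of 32-bit ints
# (assumes the length is a multiple of 4 and the endianness matches).
transitions = memoryview(blob).cast("I")
```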

@itamarst
Collaborator

itamarst commented Jan 26, 2023

*twitches in big-endian* Not that anyone runs on big-endian platforms these days? I guess you could use some zero-copy serialization format like capnproto or something to represent the data structures? Regardless, yeah, that'd be a lot of work.

And I guess it's possible naive serde deserialization wouldn't even be any better than reparsing the needles, especially if the Python wrapper switches to the faster new NFA thing by default.

@BurntSushi

Probably naive serde deserialization will be noticeably faster than actually building the automaton itself. That kind of deserialization would do a lot of bulk copying which is going to be much faster than the byte-by-byte construction process of Aho-Corasick (and then the breadth first search of the entire trie to fill in the failure transitions).

As far as big-endian goes: indeed, for regex-automata you have to build both little-endian and big-endian serialization files for each DFA. Assuming you're embedding them into your binary, only one gets compiled in. But yeah, that is a downside of the zero-copy approach.

I'd be surprised if something like capnproto could work for this, but I don't know of anyone who has tried it.

@itamarst
Collaborator

Anyway, we'll see what the use case is. Pickling is often used in multiprocessing support in Python, for example, and for that use case aggressively invalidating old pickles with versioning isn't a problem, since it's always the same version of the library in the parent and child processes.

@nijel
Contributor

nijel commented Sep 14, 2023

For me, the use case would be caching the automaton. Currently, we build it from the (cached) terms in each request where it is needed, but that takes considerable time, and I think deserializing would be much faster. For compatibility, it would be enough if it refused to load incompatible serialized data; we can fall back to rebuilding it from the terms.
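
A sketch of that fallback, assuming hypothetical `serialize`/`deserialize` methods that reject incompatible data (ahocorasick_rs has no such API today):

```python
import ahocorasick_rs


def load_automaton(cache, terms):
    """Use the cached serialized automaton; rebuild from terms on mismatch."""
    blob = cache.get("automaton")
    if blob is not None:
        try:
            # Hypothetical API: raises if the blob comes from an
            # incompatible library version.
            return ahocorasick_rs.AhoCorasick.deserialize(blob)
        except Exception:
            pass  # incompatible or corrupt cache entry; fall through
    ac = ahocorasick_rs.AhoCorasick(terms)
    cache.set("automaton", ac.serialize())  # hypothetical API
    return ac
```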

@prorealize

Another possible use case would be sharing the automaton between processes.

@Azzonith

Azzonith commented Jun 6, 2024

+1 to this feature.
Same use case: e.g. 100 items to process with the automaton, a ProcessPoolExecutor created with 10 processes, and each process requires its own automaton instance.
Creating a new automaton instance each time is rather inefficient.
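
A sketch of a workaround under the current API: build the automaton once per worker process via the pool initializer, so it is constructed 10 times rather than 100 (all names below are illustrative):

```python
from concurrent.futures import ProcessPoolExecutor

import ahocorasick_rs

_AC = None  # per-process automaton, set once in each worker


def _init_worker(patterns):
    global _AC
    _AC = ahocorasick_rs.AhoCorasick(patterns)  # built once per process


def _search(haystack):
    return _AC.find_matches_as_strings(haystack)


if __name__ == "__main__":
    patterns = ["hello", "world", "fish"]
    items = ["hello world"] * 100  # e.g. 100 items to process
    with ProcessPoolExecutor(
        max_workers=10, initializer=_init_worker, initargs=(patterns,)
    ) as pool:
        results = list(pool.map(_search, items))
```

Pickle or shared-memory support would remove even those per-worker rebuilds.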
