Add support for pickle or some serialization/deserialization #41

Open
pombredanne opened this issue Jan 19, 2023 · 8 comments

@pombredanne

Hi:
is there any plan to support pickling or some other form of serialization/deserialization?

```python
>>> import ahocorasick_rs
>>> patterns = ["hello", "world", "fish"]
>>> ac = ahocorasick_rs.AhoCorasick(patterns)
>>> import pickle
>>> q = pickle.dumps(ac)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: cannot pickle 'builtins.AhoCorasick' object
```
@itamarst
Collaborator

itamarst commented Jan 26, 2023

What's the motivation for this? It would be useful to understand, since that helps inform the design approach (or whether it's even necessary at all; maybe I can suggest an alternative).

As far as whether it's possible... two approaches:

  1. Rely on the underlying Rust library and serialize the state machine it creates. At the moment it does not support serialization (i.e. it doesn't integrate serde support). This would be a pretty small patch for them, so plausibly they'd be happy to do it. It would probably also want some sort of version tag, since the underlying object can change.
  2. Serialize just the underlying needles (search terms), and recreate the actual underlying search object during unpickling (see the sketch below). This means more expensive unpickling, but also means pickles would usually be backwards- and forwards-compatible.

Either way there would be some code added to this repo as well.
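
A minimal sketch of what option 2 could look like from the Python side, assuming a hypothetical wrapper class (`PicklableAhoCorasick` is made up for illustration; only `AhoCorasick` and `find_matches_as_strings` are real ahocorasick_rs API):

```python
import pickle

import ahocorasick_rs


class PicklableAhoCorasick:
    """Hypothetical wrapper: pickle only the needles, rebuild on unpickle."""

    def __init__(self, patterns):
        self.patterns = list(patterns)
        self._ac = ahocorasick_rs.AhoCorasick(self.patterns)

    def __reduce__(self):
        # Only the patterns travel through the pickle; the automaton is
        # rebuilt from scratch in __init__ during unpickling.
        return (PicklableAhoCorasick, (self.patterns,))

    def find_matches_as_strings(self, haystack):
        return self._ac.find_matches_as_strings(haystack)


ac = PicklableAhoCorasick(["hello", "world", "fish"])
restored = pickle.loads(pickle.dumps(ac))  # pays the full build cost again
```

That's exactly the trade-off described above: cheap, version-proof pickles, at the cost of a full rebuild on every unpickle.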

@BurntSushi

BurntSushi commented Jan 26, 2023

  1. This would be a pretty small patch for them, so plausibly they'd be happy to do it.

Sadly this is not a small patch if we do it the way I want to do it. And as you point out, it's a compatibility hazard. (I do agree that a naive patch adding Serde support would probably be smallish in isolation.)

I do want some kind of serialization support some day, like what is in regex-automata for DFAs, but not with Serde. With Serde, you have to go through a costly (de)serialization roundtrip. That might work for your pickle use case, but what I want is for the actual raw underlying byte representation in memory to also be the serialization format. (Again, like DFAs are in regex-automata.) That way, deserialization is done in constant time by simply re-interpreting the bytes. This unlocks lots of cool use cases like, "I have a huge Aho-Corasick automaton that I want to generate offline and embed into my binary so I don't have to generate it at runtime." But... it is a lot of work and tends to introduce a lot of unsafe where there is exactly none today.
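
As a toy illustration of the zero-copy idea in Python terms (purely conceptual; this is not how ahocorasick_rs or the Rust crate works today, and the file name is made up):

```python
# With a zero-copy format, the bytes on disk ARE the in-memory
# representation, so "deserialization" is reinterpreting a buffer
# rather than parsing it.
with open("automaton.le.bin", "rb") as f:  # hypothetical pre-built file
    blob = f.read()

# O(1) reinterpretation: view the raw bytes as a table of 32-bit ints
# (assumes the length is a multiple of 4 and the endianness matches).
transitions = memoryview(blob).cast("I")
```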

@itamarst
Collaborator

itamarst commented Jan 26, 2023

*twitches in big-endian* Not that anyone runs on big-endian platforms these days? I guess you could use some zero-copy serialization format like capnproto or something to represent the data structures? Regardless, yeah, that'd be a lot of work.

And I guess it's possible naive serde deserialization wouldn't even be any better than reparsing the needles, especially if the Python wrapper switches to the faster new NFA thing by default.

@BurntSushi

Probably naive serde deserialization will be noticeably faster than actually building the automaton itself. That kind of deserialization would do a lot of bulk copying which is going to be much faster than the byte-by-byte construction process of Aho-Corasick (and then the breadth first search of the entire trie to fill in the failure transitions).

As far as big-endian goes: indeed, for regex-automata you have to build both little-endian and big-endian serialization files for each DFA. Assuming you're embedding them into your binary, only one gets compiled in. But yeah, that is a downside of the zero-copy approach.

I'd be surprised if something like capnproto could work for this, but I don't know of anyone who has tried it.

@itamarst
Collaborator

Anyway, we'll see what the use case is. Pickling is often used in multiprocessing support in Python, for example, and for that use case aggressively invalidating old pickles with versioning isn't a problem, since it's always the same version of the library in the parent and child processes.

@nijel
Contributor

nijel commented Sep 14, 2023

For me, the use case would be caching the automaton. Currently, we build it from the (cached) terms in each request where it is needed, but that takes considerable time, and I think deserializing would be much faster. For compatibility, it would be enough if it refused to load incompatible serialized data; we can fall back to rebuilding it from the terms.
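
A sketch of that fallback, assuming hypothetical `serialize`/`deserialize` methods that reject incompatible data (ahocorasick_rs has no such API today):

```python
import ahocorasick_rs


def load_automaton(cache, terms):
    """Use the cached serialized automaton; rebuild from terms on mismatch."""
    blob = cache.get("automaton")
    if blob is not None:
        try:
            # Hypothetical API: raises if the blob comes from an
            # incompatible library version.
            return ahocorasick_rs.AhoCorasick.deserialize(blob)
        except Exception:
            pass  # incompatible or corrupt cache entry; fall through
    ac = ahocorasick_rs.AhoCorasick(terms)
    cache.set("automaton", ac.serialize())  # hypothetical API
    return ac
```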

@prorealize

Another possible use case would be sharing the automaton between processes.

@Azzonith

Azzonith commented Jun 6, 2024

+1 to this feature.
Same use case: e.g. 100 items to process with the automaton, a ProcessPoolExecutor created with 10 processes, and each process requires its own automaton instance.
Creating a new automaton instance each time is rather inefficient.
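
A sketch of a workaround under the current API: build the automaton once per worker process via the pool initializer, so it is constructed 10 times rather than 100 (all names below are illustrative):

```python
from concurrent.futures import ProcessPoolExecutor

import ahocorasick_rs

_AC = None  # per-process automaton, set once in each worker


def _init_worker(patterns):
    global _AC
    _AC = ahocorasick_rs.AhoCorasick(patterns)  # built once per process


def _search(haystack):
    return _AC.find_matches_as_strings(haystack)


if __name__ == "__main__":
    patterns = ["hello", "world", "fish"]
    items = ["hello world"] * 100  # e.g. 100 items to process
    with ProcessPoolExecutor(
        max_workers=10, initializer=_init_worker, initargs=(patterns,)
    ) as pool:
        results = list(pool.map(_search, items))
```

Pickle or shared-memory support would remove even those per-worker rebuilds.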
