Coarse-grained multiprocessing with bambi #400
Comments
Or maybe the |
The workaround stated above is not sufficient if the model contains priors, in which case there are other attributes that break pickling:
File "python3.9/multiprocessing/connection.py", line 211, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "python3.9/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <function <lambda> at 0x7f26918604c0>:
attribute lookup <lambda> on bambi.priors.link failed
Reproducer:
# df defined as above
m = Model(formula = 'y ~ x', data=df, priors={'x': 'superwide'})
m._design.eval_env._namespaces = None
dumps(m)
The culprit is the lambda link in Lines 68 to 75 in feab259
I believe it would be a good idea to replace these lambdas with proper functions, as this will benefit not only pickling and multiprocessing but also give nicer tracebacks should anything go wrong.
Edit: and the workaround is:
|
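To illustrate the point about lambdas versus named functions, here is a minimal sketch that does not use bambi at all. The two dictionaries are hypothetical stand-ins for a table of link functions, and `math.log` simply plays the role of "a function that can be looked up by name in an importable module".

```python
import math
import pickle

# Hypothetical link tables (not bambi's actual code): one entry is an
# anonymous lambda, the other is a function importable by name.
links_with_lambda = {"identity": lambda x: x}
links_with_named = {"log": math.log}


def can_pickle(obj):
    """Return True if pickle.dumps succeeds on obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


lambda_ok = can_pickle(links_with_lambda)  # the lambda has no importable name
named_ok = can_pickle(links_with_named)    # stored as a reference to math.log
print(lambda_ok, named_ok)
```

Pickle stores functions by reference (module plus name), which is why a lambda cannot be serialized this way while a named, importable function can.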
Hi @krassowski, thanks for opening the issue and the very detailed review and recommendations.
It would be a good idea, definitely. I'm still not sure what the best alternative is to make it work. Formulaic (a library similar to formulae) allows pickling, but I'm still not sure how it does it. I suspect it only stores names of functions and not the functions themselves. Patsy does not allow it but has interesting discussions about the topic here
Yes, makes sense. The namespace is used to have access to functions that are used in the model formula. But as you say, there's no need to keep other members that are not relevant. Also, there's no need to have a namespace if no external function is needed.
This makes a lot of sense, and it is indeed one of the most accessible things to do. I was the one who wrote those lambda functions, sorry for the inconvenience! In addition, what worries me about pickling is that I don't have an answer to the problem of generating a new design matrix from an existing model specification, which is what happens when you use
import numpy as np
model = bmb.Model("y ~ np.log(x1) + x2", data)
In a new session, you could load
On the practical side, I think that if you use some functions in a model and want to save that model for later use, you should be aware of the functions that were used when first defining the model, and it's up to you to make sure they're available when you load the model again. To sum up: this is something I would really like to have, but I'm still not sure what the best approach is. And thank you for pointing us to this problem! |
Thank you for a very quick reply! Just a minor clarification on this point:
Pickle can handle functions which are defined in modules that are stored on disk in a permanent location which is in PYTHONPATH; there is no need to actually import those. You can verify this quickly with:
import numpy as np
obj = {'func': np.log}
import pickle
pickle.dumps(obj)
And then in a new session:
obj = pickle.loads(b'\x80\x04\x952\x00\x00\x00\x00\x00\x00\x00}\x94\x8c\x04func\x94\x8c\x1cnumpy.core._multiarray_umath\x94\x8c\x03log\x94\x93\x94s.')
obj['func']
The problematic cases are:
Good point (but again, not really as relevant in the context of multiprocessing). If the signatures of the functions match, then it would be a problem indeed. One solution would be to implement custom pickle handlers via dunder methods |
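As a rough sketch of what custom pickle handlers via dunder methods could look like, the example below uses `__getstate__` and `__setstate__` to drop an unpicklable namespace at pickle time. `DesignSketch` and its attributes are hypothetical and are not the formulae or bambi classes; the only assumption carried over from the thread is that the object keeps a reference to a namespace containing things pickle rejects, such as modules.

```python
import math
import pickle


class DesignSketch:
    """Hypothetical stand-in for an object that holds an evaluation namespace."""

    def __init__(self, formula, namespace):
        self.formula = formula
        self.namespace = namespace  # may contain modules, which pickle rejects

    def __getstate__(self):
        # Copy the instance dict and replace the unpicklable namespace with None.
        state = self.__dict__.copy()
        state["namespace"] = None
        return state

    def __setstate__(self, state):
        # Restore the remaining attributes; the namespace is not recovered
        # and would have to be re-attached by the caller if needed.
        self.__dict__.update(state)


design = DesignSketch("y ~ math.log(x)", {"math": math})
restored = pickle.loads(pickle.dumps(design))
print(restored.formula, restored.namespace)
```

The original object is left untouched because `__getstate__` works on a copy, so the namespace is only dropped from the serialized form.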
Thanks! I wasn't aware of that, I'm not very familiar with pickle. And thank you for all the good ideas about this problem. Unfortunately, I don't have much time right now to start working on this, but I will try to work on it in the future. |
@krassowski I know it's been a while since you opened this issue, but I think I could have some time to work on this in the coming week. Do you still think this solution
is a good alternative? To me, it makes sense. But as I've said before, I'm not very familiar with the details of pickling. Another alternative is to keep only the names of the functions that are used within formulae, leaving it as the user's responsibility to make sure these functions match the functions that were used originally. I suspect this may be easier to implement too. |
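The "keep only the names" alternative could be sketched roughly as follows. This is an illustration only: `save_spec`, `load_spec`, and `resolve` are hypothetical helpers, not part of formulae or bambi, and `math.log` stands in for `np.log` so the example needs nothing beyond the standard library.

```python
import math
import pickle


def save_spec(formula, function_names):
    """Serialize a formula together with the dotted names of the functions it uses."""
    return pickle.dumps({"formula": formula, "functions": sorted(function_names)})


def resolve(name, namespace):
    """Look up a dotted name such as 'math.log' in a user-supplied namespace."""
    head, *rest = name.split(".")
    if head not in namespace:
        raise NameError(f"{name!r} is not available in the supplied namespace")
    obj = namespace[head]
    for attr in rest:
        obj = getattr(obj, attr)
    return obj


def load_spec(payload, namespace):
    """Load a spec and resolve every stored function name against namespace."""
    spec = pickle.loads(payload)
    functions = {name: resolve(name, namespace) for name in spec["functions"]}
    return spec["formula"], functions


payload = save_spec("y ~ math.log(x1) + x2", ["math.log"])
formula, functions = load_spec(payload, {"math": math})
print(formula, functions)
```

Only strings are serialized here, so the payload never contains function objects; a missing function surfaces as a `NameError` at load time, and whether the resolved function behaves like the original is left to the user.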
I tried to use Python's built-in multiprocessing capabilities as I have a series of models to fit (thousands of biomolecules), but while I can pass all my arguments and receive results such as model.fit() output or arviz.summary(), the model itself is not serializable if it contains a formulae. This comes down to the reference to the current global namespace, which of course cannot be pickled (or, it would be a very bad thing if it was pickled as it would lead to OOM errors quickly). Here is a reproducer:
Resulting in:
We can narrow this down to the private attribute _design:
Where type(m._design) is formulae.matrices.DesignMatrices.
We can further narrow it down to eval_env._namespaces (using the same method as above). As a workaround, it is sufficient to remove the namespaces reference with:
It does not seem that this information is super important for use cases like this. Would it be a good idea to improve formulae to be pickle-able? Could the namespaces attribute be silently removed at pickle time and a warning emitted after un-pickling if the user tries to access it? Or maybe bambi.Model should implement a custom pickling handler in addition to the discussion in #259?
For search engine indexing, the error that you are likely to get when running within a Jupyter environment is: