Coarse-grained multiprocessing with bambi #400

Open
krassowski opened this issue Aug 24, 2021 · 6 comments

Comments

@krassowski
Contributor

krassowski commented Aug 24, 2021

I tried to use Python's built-in multiprocessing capabilities because I have a series of models to fit (thousands of biomolecules). I can pass all my arguments and receive results such as the model.fit() output or arviz.summary(), but the model itself is not serializable if it contains a formulae object. This comes down to the reference to the current global namespace, which of course cannot be pickled (and it would be a very bad thing if it were pickled, as it would quickly lead to OOM errors). Here is a reproducer:

from bambi import Model
from pandas import DataFrame
import numpy as np

x = np.linspace(0, 1, 200)

df = DataFrame(dict(
    x=x,
    y=1 + 2 * x + np.random.normal(scale=0.5, size=x.size)
))

m = Model(formula = 'y ~ x', data=df)

from pickle import dumps
dumps(m)

Resulting in:

TypeError: cannot pickle 'module' object

We can narrow this down to private attribute _design:

for k, v in vars(m).items():
    try:
        dumps(v)
    except Exception:
        print(f'{k} is the culprit!')
        break

_design is the culprit!

Where type(m._design) is formulae.matrices.DesignMatrices.

We can further narrow it down to eval_env._namespaces (using the same method as above). As a workaround, it is sufficient to remove the namespaces reference with:

m._design.eval_env._namespaces = None

This information does not seem to be essential for use cases like this one. Would it be a good idea to improve formulae to be pickle-able? Could the namespaces attribute be silently removed at pickle time and a warning emitted after un-pickling if the user tries to access it? Or maybe bambi.Model should implement a custom pickling handler in addition to the discussion in #259?

For search engine indexing, the error that you are likely to get when running within a Jupyter environment is:

PicklingError("Can't pickle <function <lambda> at 0xsomeaddress>: attribute lookup <lambda> on jupyter_client.session failed",
@krassowski
Contributor Author

Or maybe the _namespaces should get filtered to only the relevant members - surely a formula does not need access to the full global namespace?

@krassowski
Contributor Author

krassowski commented Aug 24, 2021

The workaround stated above is not sufficient if the model contains priors, in which case other attributes also break pickling of the bambi.Model:

  File "python3.9/multiprocessing/connection.py", line 211, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "python3.9/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <function <lambda> at 0x7f26918604c0>:
    attribute lookup <lambda> on bambi.priors.link failed

Reproducer:

# df defined as above
m = Model(formula = 'y ~ x', data=df, priors={'x': 'superwide'})
m._design.eval_env._namespaces = None
dumps(m)

The culprit is the lambda link in m.response.family.link.link, which I guess comes from:

"identity": {
    "link": lambda mu: mu,
    "linkinv": lambda eta: eta
},
"inverse_squared": {
    "link": lambda mu: 1 / mu ** 2,
    "linkinv": lambda eta: 1 / np.sqrt(eta)
},

I believe it would be a good idea to replace these lambdas with proper functions, as this will not only benefit pickling and multiprocessing but also give nicer tracebacks should anything go wrong.
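As a rough illustration of the point about lambdas (a sketch, not bambi's actual code): pickle stores functions by reference to their module and qualified name, so a lambda, whose qualified name is `<lambda>`, cannot be looked up again on loading. The helper `is_picklable`, the function `identity`, and the two link tables below are hypothetical names made up for this example; `math.log` and `math.exp` merely stand in for importable library functions.

```python
import math
import pickle


def is_picklable(obj):
    """Return True if pickle.dumps(obj) succeeds, False otherwise."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


def identity(x):
    """Named replacement for `lambda mu: mu` (hypothetical helper)."""
    return x


# Hypothetical link tables, loosely modelled on the excerpt above.
LINKS_WITH_LAMBDAS = {"link": lambda mu: mu, "linkinv": lambda eta: eta}
LINKS_WITH_FUNCTIONS = {"link": math.log, "linkinv": math.exp}

# Lambdas cannot be found again by name, so pickling them fails.
print(is_picklable(LINKS_WITH_LAMBDAS))
# Functions importable from a module are pickled by reference.
print(is_picklable(LINKS_WITH_FUNCTIONS))
# A named function pickles as long as its defining module or script is importable.
print(is_picklable(identity))
```

A named function also shows up under its own name in tracebacks, instead of as `<lambda>`.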

Edit: and the workaround is:

m.response.family.link.link = None
m.response.family.link.linkinv = None

@tomicapretto
Collaborator

Hi @krassowski, thanks for opening the issue and the very detailed review and recommendations.

Would it be a good idea to improve formulae to be pickle-able? Could the namespaces attribute be silently removed at pickle time and a warning emitted after un-pickling if the user tries to access it? Or maybe bambi.Model should implement a custom pickling handler in addition to the discussion in #259?

It would be a good idea, definitely. I'm still not sure what the best alternative is to make it work. Formulaic (a library similar to formulae) allows pickling, but I'm still not sure how it does it. I suspect it only stores the names of functions and not the functions themselves. Patsy does not allow it but has interesting discussions about the topic here

Or maybe the _namespaces should get filtered to only the relevant members - surely a formula does not need access to the full global namespace?

Yes, makes sense. The namespace is used to have access to functions that are used in the model formula. But as you say, there's no need to keep other members that are not relevant. Also, there's no need to have a namespace if no external function is needed.
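One way the filtering could look, as a minimal sketch rather than how formulae works: split the formula on `~`, parse each side with Python's `ast` module, and keep only the namespace entries whose names actually appear. This assumes each side of the formula is a valid Python expression, which is not true of all formula syntax (for example group-specific terms). The functions `names_in_formula` and `filter_namespace` are hypothetical, and `math` stands in for `numpy` so the example has no third-party dependencies.

```python
import ast
import math


def names_in_formula(formula):
    """Collect the bare identifiers used on both sides of a 'lhs ~ rhs' formula."""
    names = set()
    for side in formula.split("~"):
        side = side.strip()
        if not side:
            continue
        # Assumption: each side parses as a plain Python expression.
        tree = ast.parse(side, mode="eval")
        names.update(node.id for node in ast.walk(tree) if isinstance(node, ast.Name))
    return names


def filter_namespace(namespace, formula, data_columns=()):
    """Keep only namespace entries that the formula references and the data does not supply."""
    wanted = names_in_formula(formula) - set(data_columns)
    return {name: namespace[name] for name in wanted if name in namespace}


# Stand-in for a large global namespace; `np` points at `math` for illustration only.
namespace = {"np": math, "unrelated": object(), "x1": 3}
kept = filter_namespace(namespace, "y ~ np.log(x1) + x2", data_columns=["y", "x1", "x2"])
print(sorted(kept))
```

Note that a module kept this way is still not picklable by value, so it would have to be stored by name and re-imported on loading.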

I believe it would be a good idea to replace these lambdas with proper functions, as this will not only benefit pickling and multiprocessing but also give nicer tracebacks should anything go wrong.

This makes a lot of sense, and it is indeed one of the most accessible things to do. I was the one who wrote those lambda functions, sorry for the inconvenience!


In addition, what worries me about pickling is that I don't have an answer to the problem of generating a new design matrix from an existing model specification, which is what happens when you use model.predict() on a new dataset in Bambi. How should it behave when you try to load a model whose formula points to a function? For example, the following model:

import bambi as bmb
import numpy as np

# `data` is a DataFrame with columns y, x1 and x2
model = bmb.Model("y ~ np.log(x1) + x2", data)

In a new session, you could load numpy before unpickling and it should work. But what if the name of the function is func, and then you do have a function called func when you unpickle the model, but it is the wrong func (one that does another thing). What should happen in that case? Is it right to leave that responsibility to the user? Can we do something to make it a better experience?

On the practical side, I think that if you use some functions in a model and you want to save that model for later use, you should be aware of the functions that were used when first defining the model, and it's up to you to make sure they're available when you load the model again.

To sum up: this is something I would really like to have, but I'm still not sure what the best approach is. And thank you for pointing us to this problem!

@krassowski
Contributor Author

Thank you for a very quick reply! Just a minor clarification on this point:

In a new session, you could load numpy before unpickling and it should work.

Pickle can handle functions which are defined in modules that are stored on disk in a permanent location which is in PYTHONPATH, there is no need to actually import those. You can verify this quickly with:

import numpy as np
obj = {'func': np.log}
import pickle
pickle.dumps(obj)

b'\x80\x04\x952\x00\x00\x00\x00\x00\x00\x00}\x94\x8c\x04func\x94\x8c\x1cnumpy.core._multiarray_umath\x94\x8c\x03log\x94\x93\x94s.'

And then in a new session:

obj = pickle.loads(b'\x80\x04\x952\x00\x00\x00\x00\x00\x00\x00}\x94\x8c\x04func\x94\x8c\x1cnumpy.core._multiarray_umath\x94\x8c\x03log\x94\x93\x94s.')
obj['func']

<ufunc 'log'>

The problematic cases are:

  • the user upgrades the package (e.g. numpy) to a version which is no longer compatible (which is not relevant for multiprocessing use case), but this is a known caveat of pickling (I would never trust to store a final model in a pickle for deployment)
  • the user refactors the structure of their scripts
  • references to lambda functions, modules and other things that cannot be pickled
  • user-defined functions in some Jupyter kernels (only problematic for re-use outside of session, not a problem for multiprocessing use case)

But what if the name of the function is func, and then you do have a function called func when you unpickle the model, but it is the wrong func (one that does another thing).

Good point (but again, not really as relevant in context of multiprocessing). If the signatures of the functions match then it would be a problem indeed.

One solution would be to implement custom pickle handlers via the dunder methods __getstate__ and __setstate__. __getstate__ could add a dictionary to the namespace under a special key, say __bambi_private_do_not_touch, which would map each function name to an MD5 hash of its full source code at the moment of pickling, if known. If no source code is available you could use the docstring, or just return None. The __setstate__ method would then check whether the functions available after unpickling indeed match what was stored, or else emit a warning to the user (the warning could also be emitted if some functions could not be verified due to lack of source code). I wonder if dill or any other package provides something like this for free already.
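A minimal sketch of that idea, under stated assumptions: `NamespaceSketch` is a hypothetical stand-in class (not part of bambi or formulae) that holds a name-to-function mapping, and `fingerprint` is a helper invented for this example. It hashes the source returned by `inspect.getsource`, falls back to the docstring, and otherwise returns None. Only the key name `__bambi_private_do_not_touch` is taken from the proposal above.

```python
import hashlib
import inspect
import math
import pickle
import textwrap
import warnings

FINGERPRINT_KEY = "__bambi_private_do_not_touch"  # key name suggested above


def fingerprint(func):
    """MD5 hex digest of the function's source, else of its docstring, else None."""
    try:
        text = inspect.getsource(func)
    except (OSError, TypeError):
        # No source available, e.g. for functions implemented in C.
        text = getattr(func, "__doc__", None)
    if text is None:
        return None
    return hashlib.md5(text.encode("utf-8")).hexdigest()


class NamespaceSketch:
    """Hypothetical holder of the functions a formula refers to."""

    def __init__(self, functions):
        self.functions = dict(functions)  # name -> callable

    def __getstate__(self):
        state = self.__dict__.copy()
        # Record a fingerprint per function next to the regular state.
        state[FINGERPRINT_KEY] = {n: fingerprint(f) for n, f in self.functions.items()}
        return state

    def __setstate__(self, state):
        state = dict(state)
        stored = state.pop(FINGERPRINT_KEY, {})
        self.__dict__.update(state)
        # Compare what is available now with what was recorded when pickling.
        for name, func in self.functions.items():
            expected = stored.get(name)
            if expected is None:
                warnings.warn(f"Could not verify function {name!r}: no fingerprint stored")
            elif fingerprint(func) != expected:
                warnings.warn(f"Function {name!r} differs from the one used when pickling")


if __name__ == "__main__":
    ns = NamespaceSketch({"log": math.log, "dedent": textwrap.dedent})
    restored = pickle.loads(pickle.dumps(ns))
    print(sorted(restored.functions))
```

MD5 is used here only as a cheap change detector, not for security. A docstring-based fingerprint will miss behavioural changes that leave the docstring untouched.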

@tomicapretto
Collaborator

Pickle can handle functions which are defined in modules that are stored on disk in a permanent location which is in PYTHONPATH, there is no need to actually import those.

Thanks! I wasn't aware of that, I'm not very familiar with pickle.

And thank you for all the good ideas about this problem. Unfortunately, I don't have much time right now to start working on this, but I will try to work on it in the future.

@tomicapretto
Collaborator

tomicapretto commented Apr 10, 2022

@krassowski I know it's been a while since you opened this issue, but I think I could have some time to work on this in the coming week.

Do you still think this solution

One solution would be to implement custom pickle handlers via the dunder methods __getstate__ and __setstate__. __getstate__ could add a dictionary to the namespace under a special key, say __bambi_private_do_not_touch, which would map each function name to an MD5 hash of its full source code at the moment of pickling, if known. If no source code is available you could use the docstring, or just return None. The __setstate__ method would then check whether the functions available after unpickling indeed match what was stored, or else emit a warning to the user (the warning could also be emitted if some functions could not be verified due to lack of source code). I wonder if dill or any other package provides something like this for free already.

is a good alternative?

To me, it makes sense. But as I've said before, I'm not very familiar with the details of pickling.

Another alternative is to keep only the names of the functions that are used within formulae, leaving it as the user's responsibility to make sure these functions match the ones that were used originally. I suspect this may be easier to implement too.
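A small sketch of what the names-only alternative might look like, assuming the functions are defined at module level in importable modules. The helpers `function_to_name` and `name_to_function` are hypothetical and not part of formulae or bambi; objects without `__module__` or `__qualname__`, lambdas, and functions defined interactively are not handled.

```python
import importlib
import math
import textwrap


def function_to_name(func):
    """Return a 'module:qualname' string identifying a module-level function."""
    return f"{func.__module__}:{func.__qualname__}"


def name_to_function(name):
    """Import the module and look the function up again by its qualified name."""
    module_name, _, qualname = name.partition(":")
    obj = importlib.import_module(module_name)
    for part in qualname.split("."):
        obj = getattr(obj, part)
    return obj


# Store only strings at save time ...
used = {"log": math.log, "dedent": textwrap.dedent}
stored = {alias: function_to_name(func) for alias, func in used.items()}

# ... and resolve them again at load time.
restored = {alias: name_to_function(name) for alias, name in stored.items()}
print(stored)
```

Nothing here verifies that the resolved function behaves like the original one, so that check stays with the user, as described above.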


2 participants