-
Notifications
You must be signed in to change notification settings - Fork 608
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: make it possible to pass a seed parameter to ibis.random
#8054
Comments
See #7139, I asked for this same thing there originally, and then we restricted the scope to Table.sample(). Are there other use cases besides sampling that you need this for? |
The use cases presented in the description cannot be implemented with
Right now I implement it as: import ibis
test_frac = 0.25
table = ibis.memtable({"a": range(30)})
t = table.mutate(_tmp_random=ibis.random())
train_split = t.filter(t._tmp_random > test_frac).drop(t._tmp_random)
test_split = t.filter(t._tmp_random <= test_frac).drop(t._tmp_random) The rows of Ideally I would like to pass a user settable seed to Note that the extra |
I think you can do t = t.mutate(_id=ibis.row_number())
test = t.sample(fraction=fraction)
train = t[~t._id.isin(rest._id)]
test = test.drop("_id")
train = train.drop("_id") ? |
@NickCrews your suggestion found a bug in the SQL compiler: I reported it as #8058 with some details. Assuming it would work, do you think databases such as DuckDB would be smart enough to run such a query as efficiently as my original solution with |
No idea re speed, I think benchmarks sound like a great idea! Perhaps once you play around there you will find a way that works with Table.sql() and then you will get the perf you need? I do agree of course that all these workarounds are ugly, and that an ibis-native way would be the most ideal in terms of beauty. |
Now that #8058 has been fixed in >>> import ibis
... import numpy as np
...
... test_frac = 0.25
... table = ibis.memtable({"a": np.random.default_rng(0).uniform(size=int(1e8))})
... t = table.mutate(_id=ibis.row_number())
... test = t.sample(fraction=test_frac, seed=0)
... t = t.mutate(is_test=t._id.isin(test._id))
... train = t.filter(~t.is_test)
... %time train.a.mean().execute(), test.a.mean().execute()
100% ▕████████████████████████████████████████████████████████████▏
CPU times: user 15 s, sys: 1.81 s, total: 16.8 s
Wall time: 7.71 s
(0.49996058048305153, 0.5000561795455769)
>>> t = table.mutate(_tmp_random=ibis.random())
... train = t.filter(t._tmp_random > test_frac).drop(t._tmp_random)
... test = t.filter(t._tmp_random <= test_frac).drop(t._tmp_random)
...
... %time train.a.mean().execute(), test.a.mean().execute()
CPU times: user 2.15 s, sys: 32.3 ms, total: 2.18 s
Wall time: 315 ms
(0.499970109856995, 0.49995665849150145) I tried to change the number of rows, and the variant based on |
And I have not tried to profile memory usage, but I am pretty sure that the |
Note that if I cache the shared intermediate result, I can make the sample/isin variant only 3x slower than the random comparison alternative. >>> t = table.mutate(_id=ibis.row_number())
... test = t.sample(fraction=test_frac, seed=0)
... t = t.mutate(is_test=t._id.isin(test._id)).cache() # <-------- changed line
... train = t.filter(~t.is_test)
... %time train.a.mean().execute(), test.a.mean().execute()
CPU times: user 1.15 s, sys: 718 ms, total: 1.87 s
Wall time: 911 ms
(0.49996058048306424, 0.5000561795455769) However, the memory (or disk IO) usage is likely to be even larger because of the cache (I assume). Also note: I don't really understand why caching the isin statement is that useful. I also tried to cache only the common ancestor to the two steps and it's not as good: >>> t = table.mutate(_id=ibis.row_number())
... test = t.sample(fraction=test_frac, seed=0).cache() # <--------- changed line
... t = t.mutate(is_test=t._id.isin(test._id))
... train = t.filter(~t.is_test)
... %time train.a.mean().execute(), test.a.mean().execute()
100% ▕████████████████████████████████████████████████████████████▏
CPU times: user 6.69 s, sys: 1.74 s, total: 8.43 s
Wall time: 5.62 s
(0.49996058048305153, 0.5000561795454677) |
Can you time the entire block of code? How long does the |
Indeed, you are right, I made a mistake and caching appears to be useless for this code: >>> import ibis
... import numpy as np
...
... test_frac = 0.25
... table = ibis.memtable({"a": np.random.default_rng(0).uniform(size=int(1e8))})
>>> %%time
... t = table.mutate(_id=ibis.row_number())
... test = t.sample(fraction=test_frac, seed=0)
... t = t.mutate(is_test=t._id.isin(test._id))
... train = t.filter(~t.is_test)
... train.a.mean().execute(), test.a.mean().execute()
...
...
100% ▕████████████████████████████████████████████████████████████▏
CPU times: user 9.36 s, sys: 1.92 s, total: 11.3 s
Wall time: 8.06 s
(0.49996058048305153, 0.5000561795455769)
>>> %%time
... t = table.mutate(_id=ibis.row_number())
... test = t.sample(fraction=test_frac, seed=0).cache()
... t = t.mutate(is_test=t._id.isin(test._id))
... train = t.filter(~t.is_test)
... train.a.mean().execute(), test.a.mean().execute()
...
...
100% ▕████████████████████████████████████████████████████████████▏
CPU times: user 15.5 s, sys: 2.12 s, total: 17.6 s
Wall time: 8.45 s
(0.49996058048305153, 0.5000561795454715)
>>> %%time
... t = table.mutate(_id=ibis.row_number())
... test = t.sample(fraction=test_frac, seed=0)
... t = t.mutate(is_test=t._id.isin(test._id))
... train = t.filter(~t.is_test).cache()
... train.a.mean().execute(), test.a.mean().execute()
...
...
100% ▕████████████████████████████████████████████████████████████▏
CPU times: user 10.6 s, sys: 2.58 s, total: 13.2 s
Wall time: 9.85 s
(0.4999605804830627, 0.5000561795455769) |
😅 It might not be entirely useless. If a particular expression ends up being slow to recompute and you're changing dependent expressions a lot then caching parts of the expression might make exploration (changing those dependent expressions) a bit snappier to iterate on. |
Thanks for the feedback. I opened a discussion for the specific case of evaluating sub expressions with a shared ancestry efficiently here: #8277 and let's keep this issue focused on the original problem of seeding |
As described in this duckdb blog post, one possible userland-based solution of "psuedorandom but reproducible shuffle" would be
|
Is your feature request related to a problem?
It would be neat to get repeatable queries that use
ibis.random
, for instance when randomly reordering results (t.order_by(ibis.random())
):n: int
arg toTable.sample()
#7689 has a particular example for instance;Describe the solution you'd like
Make it possible to pass an integer seed to
ibis.random
as is possible foribis.expr.types.relations.Table.sample
.What version of ibis are you running?
7.2.0
What backend(s) are you using, if any?
DuckDB
Code of Conduct
The text was updated successfully, but these errors were encountered: