
feat: make it possible to pass a seed parameter to ibis.random #8054

Open · 1 task done
ogrisel opened this issue Jan 21, 2024 · 13 comments
Labels
feature Features or general enhancements

Comments

@ogrisel
Contributor

ogrisel commented Jan 21, 2024

Is your feature request related to a problem?

It would be neat to get repeatable queries that use ibis.random, for instance when randomly reordering results with t.order_by(ibis.random()).

Describe the solution you'd like

Make it possible to pass an integer seed to ibis.random as is possible for ibis.expr.types.relations.Table.sample.
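
For context, a small runnable sketch of the existing precedent (the variable names here are illustrative): Table.sample already takes a seed, whereas the ibis.random-based shuffle currently does not:

import ibis

t = ibis.memtable({"a": range(10)})

# Table.sample is already repeatable thanks to its seed argument.
sampled = t.sample(fraction=0.5, seed=42)

# There is no equivalent knob here, so the ordering changes on every run.
shuffled = t.order_by(ibis.random())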

What version of ibis are you running?

7.2.0

What backend(s) are you using, if any?

DuckDB

Code of Conduct

  • I agree to follow this project's Code of Conduct
@NickCrews
Contributor

NickCrews commented Jan 21, 2024

See #7139: I originally asked for the same thing there, and we then restricted the scope to Table.sample(). Are there other use cases besides sampling that you need this for?

@ogrisel
Contributor Author

ogrisel commented Jan 21, 2024

Are there other use cases besides sampling that you need this for?

The use case presented in the description (repeatable reordering) cannot be implemented with table.sample(frac, seed=seed) as far as I know.

  • A repeatable pseudo-random train/test split for machine learning would be another use case.

Right now I implement it as:

import ibis

test_frac = 0.25
table = ibis.memtable({"a": range(30)})

# Split on a temporary random column so that the two subsets are mutually exclusive.
t = table.mutate(_tmp_random=ibis.random())
train_split = t.filter(t._tmp_random > test_frac).drop(t._tmp_random)
test_split = t.filter(t._tmp_random <= test_frac).drop(t._tmp_random)

The rows of train_split and test_split should be mutually exclusive.

Ideally, I would like to pass a user-settable seed to ibis.random() to make this repeatable.

Note that the extra .mutate / .drop of the temporary _tmp_random column is needed because of #8055.

@NickCrews
Contributor

I think you can do (reusing table and test_frac from the snippet above)

t = table.mutate(_id=ibis.row_number())
test = t.sample(fraction=test_frac)
train = t[~t._id.isin(test._id)]
test = test.drop("_id")
train = train.drop("_id")

?

@ogrisel
Contributor Author

ogrisel commented Jan 22, 2024

@NickCrews your suggestion found a bug in the SQL compiler: I reported it as #8058 with some details.

Assuming it would work, do you think databases such as DuckDB would be smart enough to run such a query as efficiently as my original solution with ibis.random? I guess the best way to know is to make it work and benchmark it.

@NickCrews
Contributor

No idea re speed; I think benchmarks sound like a great idea! Perhaps once you play around there you will find a way that works with Table.sql() and get the perf you need?

I do agree, of course, that all these workarounds are ugly, and that an ibis-native way would be the cleanest.
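
One backend-specific escape hatch worth experimenting with, sketched under the assumption that DuckDB's setseed() governs subsequent random() calls on the same connection (the table name below is made up):

import ibis

con = ibis.duckdb.connect()  # in-memory DuckDB
t = con.create_table("toy", ibis.memtable({"a": range(30)}))

# Seed DuckDB's random() generator for this connection, then evaluate the
# expression that uses ibis.random(). Repeating both steps should reproduce
# the same ordering, at least for single-threaded execution.
con.raw_sql("SELECT setseed(0.42)")
shuffled = t.order_by(ibis.random()).execute()

Whether this stays deterministic under multi-threaded execution would need checking, so treat it as an experiment rather than a recommendation.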

@ogrisel
Contributor Author

ogrisel commented Feb 8, 2024

Now that #8058 has been fixed in the-epic-split branch, I ran the following quick benchmark in IPython:

>>> import ibis
... import numpy as np
... 
... test_frac = 0.25
... table = ibis.memtable({"a": np.random.default_rng(0).uniform(size=int(1e8))})
... t = table.mutate(_id=ibis.row_number())
... test = t.sample(fraction=test_frac, seed=0)
... t = t.mutate(is_test=t._id.isin(test._id))
... train = t.filter(~t.is_test)
... %time train.a.mean().execute(), test.a.mean().execute()
100% ▕████████████████████████████████████████████████████████████▏ 
CPU times: user 15 s, sys: 1.81 s, total: 16.8 s
Wall time: 7.71 s
(0.49996058048305153, 0.5000561795455769)
>>> t = table.mutate(_tmp_random=ibis.random())
... train = t.filter(t._tmp_random > test_frac).drop(t._tmp_random)
... test = t.filter(t._tmp_random <= test_frac).drop(t._tmp_random)
... 
... %time train.a.mean().execute(), test.a.mean().execute()
CPU times: user 2.15 s, sys: 32.3 ms, total: 2.18 s
Wall time: 315 ms
(0.499970109856995, 0.49995665849150145)

I tried changing the number of rows, and the variant based on ibis.random + mutually exclusive comparisons is consistently much faster (10x or more) than the variant based on the .sample / .isin combination.

@ogrisel
Contributor Author

ogrisel commented Feb 8, 2024

I have not tried to profile memory usage, but I am pretty sure that the ibis.random + mutually exclusive comparisons variant is also more space-efficient.

@ogrisel
Contributor Author

ogrisel commented Feb 8, 2024

Note that if I cache the shared intermediate result, I can make the sample/isin variant only 3x slower than the random comparison alternative.

>>> t = table.mutate(_id=ibis.row_number())
... test = t.sample(fraction=test_frac, seed=0)
... t = t.mutate(is_test=t._id.isin(test._id)).cache()  # <-------- changed line
... train = t.filter(~t.is_test)
... %time train.a.mean().execute(), test.a.mean().execute()
CPU times: user 1.15 s, sys: 718 ms, total: 1.87 s
Wall time: 911 ms
(0.49996058048306424, 0.5000561795455769)

However, the memory (or disk I/O) usage is likely even larger because of the cache (I assume).

Also note: I don't really understand why caching the isin step helps that much. I also tried caching only the common ancestor of the two steps, and that is not as good:

>>> t = table.mutate(_id=ibis.row_number())
... test = t.sample(fraction=test_frac, seed=0).cache() # <--------- changed line
... t = t.mutate(is_test=t._id.isin(test._id))
... train = t.filter(~t.is_test)
... %time train.a.mean().execute(), test.a.mean().execute()
100% ▕████████████████████████████████████████████████████████████▏ 
CPU times: user 6.69 s, sys: 1.74 s, total: 8.43 s
Wall time: 5.62 s
(0.49996058048305153, 0.5000561795454677)

@cpcloud
Member

cpcloud commented Feb 8, 2024

Can you time the entire block of code?

How long does the cache call take? If it takes around 4-5 seconds then the timings you're showing would be consistent with that.

@ogrisel
Contributor Author

ogrisel commented Feb 9, 2024

Indeed, you are right: I made a mistake, and caching appears to be useless for this code:

>>> import ibis
... import numpy as np
... 
... test_frac = 0.25
... table = ibis.memtable({"a": np.random.default_rng(0).uniform(size=int(1e8))})

>>> %%time
... t = table.mutate(_id=ibis.row_number())
... test = t.sample(fraction=test_frac, seed=0)
... t = t.mutate(is_test=t._id.isin(test._id))
... train = t.filter(~t.is_test)
... train.a.mean().execute(), test.a.mean().execute()
... 
... 
100% ▕████████████████████████████████████████████████████████████▏ 
CPU times: user 9.36 s, sys: 1.92 s, total: 11.3 s
Wall time: 8.06 s
(0.49996058048305153, 0.5000561795455769)

>>> %%time
... t = table.mutate(_id=ibis.row_number())
... test = t.sample(fraction=test_frac, seed=0).cache()
... t = t.mutate(is_test=t._id.isin(test._id))
... train = t.filter(~t.is_test)
... train.a.mean().execute(), test.a.mean().execute()
... 
... 
100% ▕████████████████████████████████████████████████████████████▏ 
CPU times: user 15.5 s, sys: 2.12 s, total: 17.6 s
Wall time: 8.45 s
(0.49996058048305153, 0.5000561795454715)

>>> %%time
... t = table.mutate(_id=ibis.row_number())
... test = t.sample(fraction=test_frac, seed=0)
... t = t.mutate(is_test=t._id.isin(test._id))
... train = t.filter(~t.is_test).cache()
... train.a.mean().execute(), test.a.mean().execute()
... 
... 
100% ▕████████████████████████████████████████████████████████████▏ 
CPU times: user 10.6 s, sys: 2.58 s, total: 13.2 s
Wall time: 9.85 s
(0.4999605804830627, 0.5000561795455769)

@cpcloud
Member

cpcloud commented Feb 9, 2024

😅 It might not be entirely useless.

If a particular expression ends up being slow to recompute and you're changing dependent expressions a lot, then caching parts of the expression can make that exploration (changing those dependent expressions) a bit snappier to iterate on.
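
A minimal sketch of that pattern on a toy memtable (the expressions here are cheap and only stand in for genuinely expensive ones):

import ibis

t = ibis.memtable({"x": range(1_000)})

# Materialize the shared intermediate once on the backing backend...
base = t.mutate(y=t.x * 2).cache()

# ...then iterate on downstream variations without recomputing the intermediate.
print(base.filter(base.y > 100).count().execute())
print(base.y.mean().execute())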

@ogrisel
Contributor Author

ogrisel commented Feb 9, 2024

Thanks for the feedback. I opened a discussion about efficiently evaluating sub-expressions with shared ancestry in #8277; let's keep this issue focused on the original problem of seeding ibis.random.

@NickCrews
Contributor

As described in this DuckDB blog post, one possible userland solution for a "pseudorandom but reproducible shuffle" is a_physical_table.order_by((a_physical_table.rowid() + 42).hash()), where 42 is the seed of choice (see the sketch after this list). This approach has some limitations, though:

  • it requires a physical table
  • if you insert or remove some rows, most of the rowids stay fixed (I think), and thus most of the ordering stays the same
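
A sketch of that approach, assuming a physical DuckDB table (the database path and table name are placeholders, and the backend must support Table.rowid() and Value.hash()):

import ibis

con = ibis.duckdb.connect("data.ddb")  # placeholder database file
events = con.table("events")           # placeholder physical table

seed = 42
# hash(rowid + seed) gives a deterministic but pseudo-random sort key.
shuffled = events.order_by((events.rowid() + seed).hash())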
