bug: cumsum function does not work as expected in Bigquery backend due to ibis default window frame logic #10699

maxshine · 2025-01-21T23:21:05Z

What happened?

I would like to calculate the cumulative sum of a target column by rows by using ibis. Here is a sample code snippet used:

import ibis
import ibis.backends.bigquery as bigquery

con = ibis.backends.bigquery.connect(project_id="pers-decision-engine-dev", dataset_id="pde_food_sow_20250121_dktlu3ov")
tbl = con.table("pers-decision-engine-dev.pde_food_sow_20250121_dktlu3ov.ranking_stg_chosen_exploit_offers_grow")
tbl_muted = tbl.mutate(running_cost=tbl.cost.cumsum(order_by="alpha"))
print(ibis.to_sql(tbl_muted))

The generated SQL is

SELECT
  `t0`.`cust_id`,
  `t0`.`campaign_id`,
  `t0`.`cost`,
  `t0`.`model_predicted_cost`,
  `t0`.`calibration_factor`,
  `t0`.`rewards_estimate`,
  `t0`.`static_rand`,
  `t0`.`roi`,
  `t0`.`alpha`,
  SUM(`t0`.`cost`) OVER (ORDER BY `t0`.`alpha` ASC) AS `running_cost` -- problematic window function to work out cumulative sum by rows
FROM `pers-decision-engine-dev`.`pde_food_sow_20250121_dktlu3ov`.`ranking_stg_chosen_exploit_offers_grow` AS `t0`

According to the document I found through Google (linked here), BigQuery's logic for the default window frame is:

I ran an experiment to demonstrate this BigQuery behavior by computing the row-wise cumulative sum with different window frames:

From the result above, only rows_window_frame_running_rewards, which uses an explicit ROWS window frame, gives the correct running sum.

Therefore, ibis must respect an explicit ROWS window frame instead of dropping the window spec and falling back to the default window frame when the frame is BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

ibis should not rely on BigQuery's default window frame to implement cumsum. The same logic also affects cumulative_window, rows_window, and others.

What version of ibis are you using?

In limited testing, the problem exists in:

9.2.0
9.5.0

What backend(s) are you using, if any?

BigQuery

Relevant log output

(venv) ➜  ibis-debug pip list | grep ibis
ibis-framework                9.5.0
(venv) ➜  ibis-debug python -V
Python 3.10.16
(venv) ➜  ibis-debug

Code of Conduct

I agree to follow this project's Code of Conduct

maxshine · 2025-01-21T23:21:33Z

/take

cpcloud · 2025-01-24T10:43:07Z

Do you have a simple reproducible failure case where Ibis is producing incorrect results, that doesn't depend on anything in your bigquery project? For example, using ibis.memtable?

The following code behaves correctly for me on main and the assert passes:

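import ibis

# Assumption: `con` is a BigQuery connection and credentials are already configured.
con = ibis.bigquery.connect()

t = ibis.memtable({"ranking": [1, 2, 3, 4], "rewards": [10, 20, 30, 40]})

expr = t.rewards.cumsum(order_by="ranking")
result = con.to_pyarrow(expr)

expected = [10, 30, 60, 100]
assert result.to_pylist() == expected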

maxshine · 2025-01-24T11:48:17Z

Hi @cpcloud

As described above, this issue is specific to the BigQuery backend. To reproduce the problem, I modified the data slightly, as follows:

My test snippet is:

con = ibis.backends.bigquery.connect(project_id="pers-decision-engine-dev", dataset_id="ygao")
tbl = con.table("pers-decision-engine-dev.ygao.ibis_cumsum_debug_table")
tbl_muted = tbl.mutate(running_cost=tbl.rewards.cumsum(order_by="ranking"))
ret = tbl_muted.execute()
print(ret)

The last statement printed the result:

   ranking  rewards  running_cost
0        1       10            10
1        2       20            30
2        3       30           100
3        3       40           100

To conclude, the last two records have an incorrect cumsum result because the BigQuery compiler threw away the window frame created by the cumsum function. I tried to fix this in PR #10700.
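The incorrect values can be reproduced without a BigQuery project. The sketch below is a pure-Python model, not ibis or BigQuery code, and both function names are made up for illustration. It assumes the standard SQL semantics in which a RANGE frame ending at CURRENT ROW includes every row that ties with the current row on the ORDER BY key.

```python
def running_sum_rows(keys, values):
    """Model of ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW."""
    # Visit rows in ORDER BY order; each physical row adds only its own value.
    order = sorted(range(len(keys)), key=lambda i: keys[i])
    out = [0] * len(keys)
    total = 0
    for i in order:
        total += values[i]
        out[i] = total
    return out


def running_sum_range(keys, values):
    """Model of RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW."""
    # The frame holds every row whose key is <= the current key, ties included.
    return [sum(v for k, v in zip(keys, values) if k <= key) for key in keys]


ranking = [1, 2, 3, 3]
rewards = [10, 20, 30, 40]

print(running_sum_rows(ranking, rewards))   # [10, 30, 60, 100]
print(running_sum_range(ranking, rewards))  # [10, 30, 100, 100]
```

The RANGE model gives the same running_cost column as the printed result above. In a real database, the order of tied rows under ROWS is not guaranteed; the model simply keeps the input order.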

I am using the corporate resources to debug this issue as it occurred in our company projects, so can't create a public accessible table for you to try. Sorry about that.

maxshine · 2025-01-24T11:55:19Z

The root cause is that when ibis drops the BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW window frame, BigQuery falls back to its default frame for the generated statement, which depends on whether the window function has an ORDER BY clause. Unfortunately, that default does not produce a correct cumulative sum.

Furthermore, I don't think it is correct for the BigQuery backend to simply drop any user-defined window frame that is specified as BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW; I don't see a good reason to do so. Moreover, because the frame is dropped in the core compilation step of the BigQuery backend, the impact spreads to other functions, such as rows_window.
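One way to picture the alternative is to keep the frame for aggregates and omit it only for functions that cannot take one. The sketch below is a toy SQL renderer, not the ibis compiler: `render_window`, `FRAMELESS_FUNCTIONS`, and `CUMULATIVE_ROWS_FRAME` are hypothetical names, and the set of functions is a partial list based on the assumption that BigQuery rejects a window frame on numbering functions and on LAG/LEAD.

```python
# Hypothetical names throughout; this is not the ibis compiler API.
# Assumed partial list of BigQuery functions that do not accept a window frame.
FRAMELESS_FUNCTIONS = {"ROW_NUMBER", "RANK", "DENSE_RANK", "NTILE", "LAG", "LEAD"}

CUMULATIVE_ROWS_FRAME = "ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW"


def render_window(func_name, arg, order_by, frame=CUMULATIVE_ROWS_FRAME):
    """Render `func(arg) OVER (...)`, omitting the frame only when the function cannot take one."""
    over = f"ORDER BY {order_by}"
    if func_name.upper() not in FRAMELESS_FUNCTIONS:
        # Aggregates such as SUM keep the explicit frame, so ties are not merged.
        over += f" {frame}"
    return f"{func_name}({arg}) OVER ({over})"


print(render_window("SUM", "cost", "alpha"))
print(render_window("RANK", "", "alpha"))
```

With this rule the cumulative sum keeps its explicit ROWS frame, while RANK is still emitted without one.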

cpcloud · 2025-01-24T12:26:17Z

so can't create a public accessible table for you to try. Sorry about that.

You don't need to do that. Just provide a dataframe with literal values that reproduces the problem, right here on GitHub and we can work from there.

maxshine · 2025-01-24T12:44:03Z

GitHub does not accept CSV attachments, so I packaged the exported CSV file in a zip file. I hope this is helpful.

ibis_cumsum_debug_table.csv.zip

maxshine · 2025-01-28T01:02:20Z

Hi @cpcloud

Have you had a chance to reproduce this issue with the data I provided? Keen to hear back from you. :)

cpcloud · 2025-01-28T12:35:13Z

Yes, I can reproduce this.

Again, no need to upload files, you can copy paste runnable Ibis code using ibis.memtable directly to a GitHub comment on this issue.

For example,

In [2]: from ibis.interactive import *
   ...: con = ibis.bigquery.connect()
   ...: ibis.set_backend(con)
   ...:
   ...: t = ibis.memtable({"ranking": [1, 2, 3, 3], "rewards": [10, 20, 30, 40]})
   ...:
   ...: expr = t.rewards.cumsum(order_by="ranking")
   ...: result = expr.to_pyarrow()
   ...:
   ...: expected = [10, 30, 60, 100]
   ...: assert result.to_pylist() == expected

cpcloud · 2025-01-28T12:37:30Z

The important part that wasn't really mentioned here is the duplicate threes. RANGE (default BigQuery behavior when unspecified, which becomes the case because we override _minimize_spec in the BigQuery compiler) and ROWS (what we generate in the base compiler class by default) behave differently with duplicate values in the window. ROWS treats each row as a separate entry, whereas RANGE treats rows with equal ORDER BY values as peers and includes all of them in the frame, so tied rows get the same cumulative value (which rows end up tied depends on the partitioning).
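The role of the duplicates can be checked with a small pure-Python model. This is a sketch, not ibis or BigQuery code: `range_running_sum` is a made-up name, and it assumes that under a RANGE frame two rows are peers only when they are equal on every ORDER BY key, so a unique tie-breaker leaves no peers.

```python
def range_running_sum(keys, values):
    """Model of RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW."""
    # Every row whose key is <= the current key is in the frame, peers included.
    return [sum(v for k, v in zip(keys, values) if k <= key) for key in keys]


ranking = [1, 2, 3, 3]
rewards = [10, 20, 30, 40]

# Ordering by ranking alone: the two rows with ranking 3 are peers.
tied = range_running_sum(ranking, rewards)

# Ordering by (ranking, row position): no two keys are equal, so no peers remain.
unique_keys = [(r, i) for i, r in enumerate(ranking)]
untied = range_running_sum(unique_keys, rewards)

print(tied)    # [10, 30, 100, 100]
print(untied)  # [10, 30, 60, 100]
```

Once the keys are unique, the RANGE model matches the row-by-row running sum. Whether a tie-breaker is acceptable in practice depends on whether an arbitrary order among tied rows is acceptable.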

maxshine · 2025-01-28T13:02:35Z

Thanks, @cpcloud, for confirming that this is a valid issue.

I created a simple PR, #10700, that tries to fix it specifically for BigQuery. My reasoning was that the BigQuery override of _minimize_spec does not need to drop the window frame when it is BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

If you don't mind, could you please review it again and let me know whether you are happy with that fix?

It was mentioned that the pull request broke some BigQuery test cases, but when I checked the CI results, the BigQuery cases were all passing. Could you please advise what I can do to improve #10700?

Thank you.

cpcloud · 2025-01-28T13:03:51Z

I'm going to put up a PR that fixes the issue. The BigQuery tests do not run on PRs, only on a merge to main, because they require credentials.

maxshine · 2025-01-28T13:04:01Z

Thank you again for showing me how to share test data effectively. Good to learn that. 👍

maxshine added the bug Incorrect behavior inside of ibis label Jan 21, 2025

github-project-automation bot added this to Ibis planning and roadmap Jan 21, 2025

github-project-automation bot moved this to backlog in Ibis planning and roadmap Jan 21, 2025

github-actions bot assigned maxshine Jan 21, 2025

ygao-wiq marked this as a duplicate of #10698 Jan 21, 2025

maxshine mentioned this issue Jan 21, 2025

fix(bigquery): let bigquery backend respect window frame set by users #10700

Closed

cpcloud added this to the 10.0 milestone Jan 28, 2025

cpcloud added the bigquery The BigQuery backend label Jan 28, 2025

cpcloud mentioned this issue Jan 28, 2025

fix(backends): ensure that analytic functions do not receive a window frame #10739

Merged

cpcloud closed this as completed in #10739 Jan 29, 2025

github-project-automation bot moved this from backlog to done in Ibis planning and roadmap Jan 29, 2025
