Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: dt.highest_precedence([dt.uint64, dt.int32]) gives uint64, not int64 #7331

Closed
1 task done
Tracked by #9638 ...
NickCrews opened this issue Oct 11, 2023 · 6 comments
Closed
1 task done
Tracked by #9638 ...
Labels
bug Incorrect behavior inside of ibis

Comments

@NickCrews
Copy link
Contributor

NickCrews commented Oct 11, 2023

What happened?

Came from exploring around exposing highest_precedence() publicly per #5707

from ibis.expr import datatypes as dt

dt.highest_precedence([dt.int32, dt.uint64])

EDIT: this is a bit wrong, see cpclouds comment below.
gives uint64. I would expect it to be int64, because if you had a negative number in the int32 column, I would want that to still be negative in the unioned dtype. Maybe I'm not understanding this correctly though.

What version of ibis are you using?

main

What backend(s) are you using, if any?

NA

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@NickCrews NickCrews added the bug Incorrect behavior inside of ibis label Oct 11, 2023
@cpcloud
Copy link
Member

cpcloud commented Oct 11, 2023

The correct answer is probably int128 (but we don't have that type and if we did we have the same problem again).

I think I would expect the current system to error if there's no type that can represent the set of values common to both types.

int64 wouldn't be correct here because it cannot represent many uint64 values.

@gforsyth
Copy link
Member

I think this means we would remove the ability to cast from signed to unsigned integers in all cases, which is more restrictive than the backends actually are -- although I recognize this is a bit intractable without knowing the bounds of the values in a given column.

@NickCrews
Copy link
Contributor Author

NickCrews commented Oct 11, 2023

int64 wouldn't be correct

oh I see, my bad.

I think this means we would remove the ability to cast from signed to unsigned integers in all cases, which is more restrictive than the backends actually are -- although I recognize this is a bit intractable without knowing the bounds of the values in a given column.

All uint32s are representable by int64s, so casting in that case should always be allowed. Are you talking about the uintX to intX case? I think casting there should be allowed because backends support it.

I think there are two different functionalities here:

"is x castable to y?"

What needs to happen during an explicit cast() call. Is much more lenient, because the user is doing an explicit cast.

  • castable(uint64, int64) == true
  • castable(list, float) == false

"find a dtype that can represent both dtypes safely"

What happens during an implicit cast. Currently we are relying on castable() to do this, but this is wrong. For instance, with #7332 applied we get the wrong behavior (grep through the codebase for usage of highest_precedence to look for vulnerabilities):

uint8s = ibis.array([255]).cast("array<uint8>")
int8s = ibis.array([127, -128]).cast("array<int8>")
x = uint8s.concat(int8s)
print(x.type())
print(x)
# array<uint8>
# [255, 127, -128]

Really this should return array<int16>, or it should error, not sure which. In the case of array<uint64> and array<int64>, this should definitely error because there is no safe implicit casting option.

I would expect these properties:

If z = find_common(x, y), then

  1. z might be x or y, but not necessarily.
  2. castable(x, z) to be true
  3. castable(y, z) to be true

If castable(a, b) is true, then find_common(a, b) might error or might find something: (uint64, int64) is castable but has no common.

If castable(c, d) is false, then find_common(c, d) must error: (float, list) aren't safely nor unsafely castable.

@gforsyth
Copy link
Member

Are you talking about the uintX to intX case?

No, those are tractable by going up by a power of 2 in bit-width, but int to uint requires knowledge of the underlying value to be able to say whether the cast is valid or not.

I think the split you've identified is good! So if a user says "just cast it" we can allow it (within reason) and then let the backend fail or succeed.

Implicit casting should find a dtype that can encompass both dtypes (or the implicit cast should fail, if it cannot).

@NickCrews
Copy link
Contributor Author

Are you talking about the uintX to intX case?

Ah sorry I wasn't clear, I think we're on the same page I meant X was the same width for both eg uint32 to int32

@gforsyth
Copy link
Member

There are valid issues here, but it's not currently blocking any of our work and fixing it is liable to be an insane 🐰 🕳️ -- if someone is interested in tackling this work without breaking the entire project, we're game to talk it through, but this isn't going to get picked up any time soon.

@gforsyth gforsyth closed this as not planned Won't fix, can't repro, duplicate, stale Jul 31, 2024
@github-project-automation github-project-automation bot moved this from backlog to done in Ibis planning and roadmap Jul 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior inside of ibis
Projects
Archived in project
Development

No branches or pull requests

3 participants