Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: upcast dtype on union of two schemas #5707

Closed
1 task done
NickCrews opened this issue Mar 8, 2023 · 9 comments
Closed
1 task done

feat: upcast dtype on union of two schemas #5707

NickCrews opened this issue Mar 8, 2023 · 9 comments
Labels
feature Features or general enhancements

Comments

@NickCrews
Copy link
Contributor

Is your feature request related to a problem?

If I have two schemas that are the same, except the dtype of column X in is int32 vs int64, then if I union these two schemas I get an error.

Really, I would like to get a schema with X of type int64, the upcasted version of those two dtypes.

Describe the solution you'd like

As I say above.

Behavior is still poorly defined for & and - operators, so I would still say error there.

What version of ibis are you running?

head

What backend(s) are you using, if any?

all

Code of Conduct

  • I agree to follow this project's Code of Conduct
@NickCrews NickCrews added the feature Features or general enhancements label Mar 8, 2023
@cpcloud
Copy link
Member

cpcloud commented Mar 9, 2023

How about a schema.upcast(other_schema) method instead of baking in complicated casting rules into union and also instead of a keyword argument to union. This behavior seems specific enough that it's worth an isolated method to do just what you're asking for.

@NickCrews
Copy link
Contributor Author

I think that makes sense to have it separate.

Would the order matter?

  • schema_with_int32.upcast(schema_with_int64) seems like this should work
  • schema_with_int64.upcast(schema_with_int32) seems like this should error
    If order matters, then this is a problem, because what if you have
    A = {x: int64, y: int32} and B = {x: int32, y: int64}. I want {x: int64, y: int64} as the result

PErhaps just a different name from upcast?

@cpcloud
Copy link
Member

cpcloud commented Mar 9, 2023

We already have code to upcast types, and that code isn't order dependent.

@NickCrews
Copy link
Contributor Author

OK, sounds good. Is this the behavior that you are thinking of, where it acts like a union if a column is only present in one input schema:

def upcast(self, other):
    all_columns = set(self.names) | set(other.names)
    d = {}
    uncastables = []
    for c in all_columns:
        if c not in self:
            d[c] = other[c]
        elif c not in other:
            d[c] = self[c]
        else:
            try:
                d[c] = self[c].upcast(other[c])
            except CastError:
                 uncastables.append(c)
    if uncastables:
        raise CastError("some columns aren't compatible: ", uncastables)
    return Schema(d)

@cpcloud
Copy link
Member

cpcloud commented Oct 11, 2023

@NickCrews Does this upcast function suffice?

In [13]: def upcast(left, right):
    ...:     import ibis
    ...:     import ibis.expr.datatypes as dt
    ...:
    ...:     return ibis.schema(
    ...:         dict(zip(left.names, map(dt.higher_precedence, left.types, right.types)))
    ...:     )

In [14]: t = ibis.schema({"a": "int32"})

In [15]: s = ibis.schema({"a": "int64"})

In [16]: upcast(t, s)
Out[16]:
ibis.Schema {
  a  int64
}

@NickCrews
Copy link
Contributor Author

hmm, not quite, because the result will have exactly left's columns, but I was looking for the union of left and right's columns. eg upcast({"a": "uint32"}, {"a": "int64", "b": "str"}) should be {"a": "int64", "b": "str"}.

I think it might be best if we just exposed a public ibis.highest_precedence(*dtypes: str | dt.Type, **maybe_add_kwargs_later?) -> dt.Type, and then I could do whatever implementation I need. I worry that others might want slightly different edge case behavior, and it will be hard to satisfy them all.

@cpcloud
Copy link
Member

cpcloud commented Oct 23, 2023

I think we should overhaul this part of the codebase a bit and instead of checking for castability, we have a promote API that accepts 1 or more dtypes and returns a type that can store the values in the arguments without information loss, or it errors.

@NickCrews
Copy link
Contributor Author

NickCrews commented Oct 23, 2023

@cpcloud I agree, I think I had this same proposal in #7331 (comment)

@gforsyth gforsyth closed this as not planned Won't fix, can't repro, duplicate, stale Jul 31, 2024
@github-project-automation github-project-automation bot moved this from backlog to done in Ibis planning and roadmap Jul 31, 2024
@NickCrews
Copy link
Contributor Author

I think we should overhaul this part of the codebase a bit and instead of checking for castability, we have a promote API that accepts 1 or more dtypes and returns a type that can store the values in the arguments without information loss, or it errors.

I believe the existing from ibis.expr.datatypes.highest_precedence function does exactly this? So I think officially exposing it as in #9335 would solve this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Features or general enhancements
Projects
Archived in project
Development

No branches or pull requests

3 participants