Add arrow cast #962

Merged
merged 33 commits into from
Jan 7, 2025

Contributor

@kosiew kosiew commented Dec 3, 2024

Which issue does this PR close?

Completes a task in #463

Rationale for this change

This PR adds an arrow_cast function to datafusion-python, mirroring the arrow_cast function of the same name in DataFusion.

What changes are included in this PR?

Functionality:

Implements the arrow_cast function

Tests:

Adds a test case to validate the functionality of arrow_cast.

Are there any user-facing changes?

Yes, this PR adds the arrow_cast function to the API, enabling users to cast columns to a specific data type.

Contributor

@timsaucer timsaucer left a comment

Thank you for the additions. Do you think this is needed, or could we just use Expr.cast?

@kosiew
Contributor Author

kosiew commented Dec 12, 2024

Thank you for the additions. Do you think this is needed, or could we just use Expr.cast?

@timsaucer
Contributor

I did a little testing to see whether this is necessary:

import datetime

import pyarrow as pa
from datafusion import SessionContext, col

ctx = SessionContext()

# One-row frame with an int64 column and a microsecond-precision timestamp.
df = ctx.from_pydict({
    "a": pa.array([1], type=pa.int64()),
    "date": pa.array([datetime.datetime.today()], type=pa.timestamp("us")),
})

df.show()
print(df.schema())

# Cast with the existing Expr.cast method:
# int64 -> int8 and timestamp[us] -> timestamp[ms].
df = (
    df
    .with_column("b", col("a").cast(pa.int8()))
    .with_column("ms_date", col("date").cast(pa.timestamp("ms")))
)

df.show()
print(df.schema())

Produces:

DataFrame()
+---+----------------------------+
| a | date                       |
+---+----------------------------+
| 1 | 2024-12-12T07:28:24.431099 |
+---+----------------------------+
a: int64
date: timestamp[us]
DataFrame()
+---+----------------------------+---+-------------------------+
| a | date                       | b | ms_date                 |
+---+----------------------------+---+-------------------------+
| 1 | 2024-12-12T07:28:24.431099 | 1 | 2024-12-12T07:28:24.431 |
+---+----------------------------+---+-------------------------+
a: int64
date: timestamp[us]
b: int8
ms_date: timestamp[ms]

So maybe we can use the existing method?

@kosiew
Contributor Author

kosiew commented Dec 16, 2024

@timsaucer,

Thanks for the detailed example.

The other reason for this PR is to add arrow_cast so that the full set of DataFusion Rust scalar functions is available in datafusion-python (#463).

@timsaucer timsaucer merged commit e36e8ab into apache:main Jan 7, 2025
15 checks passed