[SPARK-52837][CONNECT][PYTHON] Support TimeType in pyspark #51515
base: master
Conversation
Hello @zhengruifeng @MaxGekk @peter-toth, please take a look at this; we should also support time literals in Connect.
@dengziming Please open a new ticket specifically for the Python client.
@MaxGekk I have created a separate ticket.
python/pyspark/sql/types.py (outdated)

    @@ -384,6 +385,33 @@ def fromInternal(self, v: int) -> datetime.date:
            return datetime.date.fromordinal(v + self.EPOCH_ORDINAL)


    class TimeType(AtomicType):
This PR actually adds a new datatype to pyspark, not just to the Python client; it should work with both the Python Connect client and pyspark classic. We'd better also add tests for pyspark classic.
And I just noticed that the class hierarchy is not consistent with the JVM side:

    case class TimeType(precision: Int) extends AnyTimeType

We'd better make them the same by also introducing AnyTimeType in pyspark. If we need to touch other existing date/time types, we can do that in a separate PR.
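A minimal sketch of what a matching hierarchy could look like on the Python side, inside python/pyspark/sql/types.py where AtomicType is already defined (the class bodies and the default precision are assumptions for illustration, not the merged code):

    # Illustrative sketch mirroring the JVM hierarchy; not the merged code.
    class DatetimeType(AtomicType):
        """Super class of all datetime data types."""


    class AnyTimeType(DatetimeType):
        """Super class of all TIME types, mirroring the JVM's AnyTimeType."""


    class TimeType(AnyTimeType):
        """Time (datetime.time) data type, mirroring the JVM's
        TimeType(precision: Int)."""

        def __init__(self, precision: int = 6):  # default of 6 is an assumption
            self.precision = precision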
I'm not very experienced with PySpark, so this will take some time; I have changed it to WIP for now.
Force-pushed from 0942af7 to 84d55ae.
@zhengruifeng Fortunately, it's not at all complicated to make TimeType and DatetimeType consistent with the JVM-side definitions. I have finished these changes; please take a look.
Hello @zhengruifeng @HyukjinKwon, please take a look at this PR when you are free; I have made a thorough check of the code.
python/pyspark/sql/types.py (outdated)

    EPOCH_ORDINAL = datetime.datetime(1970, 1, 1).toordinal()


    class DatetimeType(AtomicType):
        """Super class of all datetime data type."""

        def needConversion(self) -> bool:
DatetimeType should not have needConversion
        def needConversion(self) -> bool:
            return True


    class DateType(DatetimeType, metaclass=DataTypeSingleton):
DateType's needConversion is missing.
    @@ -384,11 +390,47 @@ def fromInternal(self, v: int) -> datetime.date:
            return datetime.date.fromordinal(v + self.EPOCH_ORDINAL)


    class TimestampType(AtomicType, metaclass=DataTypeSingleton):
        """Timestamp (datetime.datetime) data type."""

    class AnyTimeType(DatetimeType):
Let's keep the needConversion in each child class.
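A minimal sketch of that layout, using the class names from the diffs above (AnyTimeType as sketched earlier; the bodies are illustrative, not the merged code):

    # Sketch: needConversion stays on each concrete child class rather than
    # on the abstract DatetimeType parent.
    class DatetimeType(AtomicType):
        """Super class of all datetime data types; defines no needConversion."""


    class DateType(DatetimeType, metaclass=DataTypeSingleton):
        """Date (datetime.date) data type."""

        def needConversion(self) -> bool:
            # Dates are stored internally as ordinal integers, so values
            # need conversion to/from datetime.date.
            return True


    class TimeType(AnyTimeType):
        """Time (datetime.time) data type."""

        def needConversion(self) -> bool:
            return True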
    @@ -1508,6 +1508,16 @@ def condition():

            eventually(catch_assertions=True)(condition)()

        def test_time_lit(self) -> None:
This test seems redundant; tests in test_column and test_functions will be reused in connect mode.
This test file is mainly for directly comparing behavior between connect and classic.
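For reference, a minimal sketch of what such a connect-vs-classic parity check for time literals might look like (the test name comes from the diff above; the session fixtures self.connect / self.spark and the assertion style are assumptions):

    import datetime

    from pyspark.sql import functions as SF
    from pyspark.sql.connect import functions as CF
    from pyspark.sql.types import TimeType  # the new type from this PR


    def test_time_lit(self) -> None:
        # Build the same time literal through the connect and classic sessions
        # and check that schemas and results agree (sketch, not the merged test).
        t = datetime.time(12, 13, 14)
        cdf = self.connect.range(1).select(CF.lit(t).alias("t"))
        sdf = self.spark.range(1).select(SF.lit(t).alias("t"))
        self.assertEqual(cdf.schema, sdf.schema)
        self.assertIsInstance(cdf.schema["t"].dataType, TimeType)
        self.assertEqual(cdf.collect(), sdf.collect())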
In general LGTM, left a few comments
Thank you for the suggestions, @zhengruifeng. I fixed them; PTAL.
What changes were proposed in this pull request?
This is a follow-up of #51462 to support TimeType literals in pyspark connect.
Why are the changes needed?
To align the Python Connect client with the Java/Scala Connect client.
Does this PR introduce any user-facing change?
Yes, we can now use TimeType literals in several ways, for example:

    PySparkSession.sql("SELECT TIME '12:13:14'")

and

    pyspark.sql.connect.functions.lit(datetime.time(12, 13, 14))
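A minimal end-to-end sketch of both forms (assuming a running Spark Connect server; the remote URL is a placeholder):

    import datetime

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import lit

    # Placeholder address; point this at your Spark Connect server.
    spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

    # 1) TIME literal via SQL.
    spark.sql("SELECT TIME '12:13:14'").show()

    # 2) TIME literal via lit() on a datetime.time value.
    spark.range(1).select(lit(datetime.time(12, 13, 14)).alias("t")).show()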
How was this patch tested?
Was this patch authored or co-authored using generative AI tooling?
No