[SPARK-42826][3.4][FOLLOWUP][PS][DOCS] Update migration notes for pandas API on Spark

### What changes were proposed in this pull request?

This is a follow-up to apache#40459 to fix incorrect information and to elaborate on the changes in more detail.
- We do not yet fully support pandas 2.0.0, so the statement "Pandas API on Spark follows for the pandas 2.0" is incorrect.
- We should list all the APIs that no longer support the `inplace` parameter.

### Why are the changes needed?

Correctness for migration notes.

### Does this PR introduce _any_ user-facing change?

No, only updating migration notes.

### How was this patch tested?

The existing CI should pass.

Closes apache#41207 from itholic/migration_guide_followup.

Authored-by: itholic <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
itholic authored and catalinii committed Oct 10, 2023
1 parent 9f559de commit a700cf7
Showing 1 changed file with 1 addition and 1 deletion.
python/docs/source/migration_guide/pyspark_upgrade.rst

@@ -33,8 +33,8 @@ Upgrading from PySpark 3.3 to 3.4
* In Spark 3.4, the ``Series.concat`` sort parameter will be respected to follow pandas 1.4 behaviors.
* In Spark 3.4, the ``DataFrame.__setitem__`` will make a copy and replace pre-existing arrays, which will NOT be over-written to follow pandas 1.4 behaviors.
* In Spark 3.4, the ``SparkSession.sql`` and the Pandas on Spark API ``sql`` have got new parameter ``args`` which provides binding of named parameters to their SQL literals.
- * In Spark 3.4, Pandas API on Spark follows for the pandas 2.0, and some APIs were deprecated or removed in Spark 3.4 according to the changes made in pandas 2.0. Please refer to the [release notes of pandas](https://pandas.pydata.org/docs/dev/whatsnew/) for more details.
* In Spark 3.4, the custom monkey-patch of ``collections.namedtuple`` was removed, and ``cloudpickle`` was used by default. To restore the previous behavior for any relevant pickling issue of ``collections.namedtuple``, set ``PYSPARK_ENABLE_NAMEDTUPLE_PATCH`` environment variable to ``1``.
+ * In Spark 3.4, the ``inplace`` parameter is no longer supported for Pandas API on Spark API ``add_categories``, ``remove_categories``, ``remove_unused_categories``, ``rename_categories``, ``reorder_categories``, ``set_categories`` to follow pandas 2.0.0 behaviors.


Upgrading from PySpark 3.2 to 3.3
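For readers hitting this change, below is a minimal sketch of migrating off the removed `inplace` parameter by reassigning the returned object instead. It assumes a running SparkSession; the data and category names are illustrative only, not taken from the commit.

```python
# Minimal sketch of migrating off the removed ``inplace`` parameter
# of the categorical accessor APIs in Spark 3.4.
import pyspark.pandas as ps

psser = ps.Series(["a", "b", "a"], dtype="category")

# Before Spark 3.4 this could mutate the Series in place:
#   psser.cat.add_categories("c", inplace=True)

# In Spark 3.4, following pandas 2.0.0 behaviors, the accessor
# returns a new Series instead, so reassign the result:
psser = psser.cat.add_categories("c")

print(psser.cat.categories)  # expected: Index(['a', 'b', 'c'], dtype='object')
```

The same reassignment pattern applies to the other methods listed in the note, such as `rename_categories` and `set_categories`.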
