ENH(string dtype): Make str.decode return str dtype #60709

rhshadrach · 2025-01-12T21:16:51Z

closes #xxxx (Replace xxxx with the GitHub issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

rhshadrach · 2025-01-12T21:40:15Z

pandas/tests/strings/test_strings.py

    ser = Series(["a", "b", "a\xe4"], dtype=any_string_dtype).str.encode("utf-8")
    result = ser.str.decode("utf-8")
-    expected = ser.map(lambda x: x.decode("utf-8")).astype(object)
+    expected = Series(["a", "b", "a\xe4"], dtype="str")


The change from ser.map to using Series is just to make this test a bit more explicit. Using ser.map(...).astype("str") also passes.
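For anyone following along, here is a minimal sketch of the round trip the test exercises. It uses only the public `Series.str.encode` and `Series.str.decode` accessors. The resulting dtype is printed rather than asserted, since it depends on the pandas version and on the `future.infer_string` option.

```python
import pandas as pd

# Encode to bytes, then decode back, mirroring the test in the diff above.
original = ["a", "b", "a\xe4"]
encoded = pd.Series(original).str.encode("utf-8")
decoded = encoded.str.decode("utf-8")

# The values round-trip regardless of pandas version.
print(decoded.tolist())

# The dtype is what this PR changes; it depends on the pandas version and
# on the "future.infer_string" option, so it is printed, not assumed.
print(decoded.dtype)
```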

jorisvandenbossche

Looks good!

jorisvandenbossche · 2025-01-13T09:35:08Z

pandas/io/pytables.py

+            if get_option("future.infer_string"):
+                data = ser.to_numpy()
+            else:
+                data = ser._values


You can probably simplify this and always do .to_numpy()? (or np.asarray(..))
In the case of object dtype in the else branch, that will return the same (and be as cheap) as _values, I think.

Confirmed - thanks.
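A rough way to check the claim for an object-dtype Series is sketched below. It compares the public `to_numpy()` result with the private `_values` attribute (used here only for comparison) and prints whether they share memory and whether the result is writeable. The outputs are printed rather than stated, since they may differ across pandas versions.

```python
import numpy as np
import pandas as pd

ser = pd.Series(["a", None, "c"], dtype=object)

public = ser.to_numpy()
private = ser._values  # private attribute, shown only for comparison

# Both should be object-dtype ndarrays holding the same elements.
print(public.dtype, private.dtype)

# If no copy was made, the two arrays share the same buffer.
print(np.shares_memory(public, private))

# Whether the public result is writeable can vary by pandas version.
print(public.flags.writeable)
```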

jorisvandenbossche · 2025-01-24T20:13:37Z

@rhshadrach can you update this?

rhshadrach · 2025-01-25T12:32:57Z

@jorisvandenbossche - the issue with .to_numpy on NumPy-backed Series is that we set the underlying data to read-only. In pytables, we switch out NA values in libwriters.string_array_replace_from_nan_rep, which is causing the tests to fail.

Perhaps there could be a way (e.g. Series._to_numpy) to always get a corresponding NumPy array that isn't read-only? Barring this, it seems to me we could either always (and unnecessarily) make a copy, or use my original branching logic. Open to other ideas too.

jorisvandenbossche · 2025-01-26T11:40:09Z

Hmm, good point. Ideally we would be able to solve this without using private APIs, I think, because it is a good case study for what other people (external code) could also run into.

So I think what we have said before is that downstream users could do data.flags.writeable = True on the result of to_numpy() if they know what they are doing (and in this case we know that we indeed own the memory, because we are reading a file and created that data and not yet returned it to the user).

But this also makes me wonder if we should re-discuss whether to add some keyword to to_numpy() to get this (e.g. something like writeable=True).
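A minimal sketch of the `flags.writeable` approach described above, assuming the caller owns the data. The helper name `to_writeable_numpy` is hypothetical and is not a pandas API. It falls back to a copy if NumPy refuses to flip the flag.

```python
import numpy as np
import pandas as pd


def to_writeable_numpy(ser: pd.Series) -> np.ndarray:
    """Hypothetical helper: return a writeable array for data we own."""
    arr = ser.to_numpy()
    if not arr.flags.writeable:
        try:
            # Only safe when we own the underlying memory: writes through
            # this array can then also change the Series' data.
            arr.flags.writeable = True
        except ValueError:
            # NumPy refuses if the underlying buffer is itself read-only.
            arr = arr.copy()
    return arr


ser = pd.Series(["a", None, "c"], dtype=object)
arr = to_writeable_numpy(ser)
arr[1] = "nan"  # e.g. swapping a missing value for a placeholder string
print(arr.tolist())
```

Always copying would avoid mutating data the Series still references, at the cost of an extra allocation that is unnecessary when the data was just created.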

rhshadrach · 2025-01-29T21:47:47Z

@jorisvandenbossche - ran into a couple more test changes.

mroeschke · 2025-01-29T22:07:10Z

Thanks @rhshadrach

lumberbot-app · 2025-01-29T22:07:32Z

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict, please backport manually. Here are approximate instructions:

Checkout backport branch and update it.

git checkout 2.3.x
git pull

Cherry pick the first parent branch of this PR on top of the older branch:

git cherry-pick -x -m1 c36da3f6ded4141add4b3b16c252cedf4641e5ea

You will likely have some merge/cherry-pick conflict here, fix them and commit:

git commit -am 'Backport PR #60709: ENH(string dtype): Make str.decode return str dtype'

Push to a named branch:

git push YOURFORK 2.3.x:auto-backport-of-pr-60709-on-2.3.x

Create a PR against branch 2.3.x, I would have named this PR:

"Backport PR #60709 on branch 2.3.x (ENH(string dtype): Make str.decode return str dtype)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

rhshadrach · 2025-01-30T21:06:52Z

Backport PR: #60821

TST(string dtype): Make str.decode return str dtype

60a8eee

rhshadrach added Enhancement Strings String extension data type and string data labels Jan 12, 2025

rhshadrach marked this pull request as draft January 12, 2025 21:17

rhshadrach changed the title ~~TST(string dtype): Make str.decode return str dtype~~ ENH(string dtype): Make str.decode return str dtype Jan 12, 2025

Test fixups

513e3c3

rhshadrach commented Jan 12, 2025

View reviewed changes

pytables fixup

c1d9e6d

jorisvandenbossche approved these changes Jan 13, 2025

View reviewed changes

jorisvandenbossche added this to the 2.3 milestone Jan 22, 2025

rhshadrach added 3 commits January 24, 2025 20:27

Simplify

9a6a231

Merge branch 'main' of https://github.com/pandas-dev/pandas into str_…

7afd274

…decode

whatsnew

45aa4ae

rhshadrach marked this pull request as ready for review January 25, 2025 01:32

rhshadrach marked this pull request as draft January 25, 2025 02:31

rhshadrach added 2 commits January 28, 2025 20:17

Merge branch 'main' of https://github.com/pandas-dev/pandas into str_…

a287424

…decode

fix implementation

9c29f82

rhshadrach marked this pull request as ready for review January 29, 2025 21:46

rhshadrach requested a review from jorisvandenbossche January 29, 2025 21:47

jorisvandenbossche approved these changes Jan 29, 2025

View reviewed changes

mroeschke approved these changes Jan 29, 2025

View reviewed changes

mroeschke merged commit c36da3f into pandas-dev:main Jan 29, 2025
42 checks passed

lumberbot-app bot added the Still Needs Manual Backport label Jan 29, 2025

rhshadrach deleted the str_decode branch January 30, 2025 21:02

rhshadrach added a commit to rhshadrach/pandas that referenced this pull request Jan 30, 2025

Backport PR pandas-dev#60709: ENH(string dtype): Make str.decode retu…

fa7224f

…rn str dtype

jorisvandenbossche mentioned this pull request Feb 5, 2025

Backport PR #60709: ENH(string dtype): Make str.decode return str dtype #60821

Merged

jorisvandenbossche removed the Still Needs Manual Backport label Feb 5, 2025

jorisvandenbossche pushed a commit that referenced this pull request Feb 5, 2025

Backport PR #60709: ENH(string dtype): Make str.decode return str dty…

97a06de

…pe (#60821)
