Skip to content

Add takestring!(x) to create a string from the content of x, emptying it #54372

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

jakobnissen
Copy link
Member

@jakobnissen jakobnissen commented May 6, 2024

This PR adds the takestring function. It currently has two methods:

  • The generic takestring!(x)::String creates a string from the content of x, emptying x. It's currently only defined for Vector{UInt8}.
  • takestring(::GenericIOBuffer)::String creates a string from the content of the buffer, and resets the buffer to the initial state.

For the motivation for this function, and discussion, read the comments on this PR further down.

@jakobnissen jakobnissen marked this pull request as ready for review May 6, 2024 09:28
@nhz2
Copy link
Contributor

nhz2 commented May 6, 2024

Also, check for uses of StringMemory for example:

slug = Base.StringMemory(len)
chars = b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
for i = 1:len
slug[i] = chars[(Libc.rand() % length(chars)) + 1]
end
return String(slug)

#53962

@oscardssmith oscardssmith requested a review from vtjnash May 6, 2024 13:11
@KristofferC KristofferC requested a review from JeffBezanson May 6, 2024 13:27
@stevengj
Copy link
Member

stevengj commented May 6, 2024

The name makes me think that take_string!(iobuffer) should be a replacement/synonym for String(take!(iobuffer))?

@oscardssmith oscardssmith added the triage This should be discussed on a triage call label May 7, 2024
@oscardssmith
Copy link
Member

I would be in favor of that. Every time I have used an IOBuffer I have had to go to the docs to figure out how to get the data out of it correctly.

@jariji
Copy link
Contributor

jariji commented May 7, 2024

So what's the specification of String(_): does it truncate or not? Or does it depend on the specific argument type? An "it depends" is not a very nice property for generic programming.

@jakobnissen
Copy link
Member Author

It does not truncate - except when the argument is a Vector{UInt8}. Yes, that's bad and in my opinion it shouldn't have been like that. But it's documented, and therefore can't change. I made this PR to make sure more exceptions aren't added.

@stevengj
Copy link
Member

stevengj commented May 9, 2024

Do we need an underscore here? takestring! is pretty readable

@oscardssmith
Copy link
Member

oscardssmith commented May 9, 2024

Triage had a long talk about this, #54369, #54156 and #54273. The conclusions we came to are

  1. String(Memory) should copy
  2. String(Array) should truncate the array but not the Memory
  3. triage was right last time wrt view and reshape of Memory
  4. using the an Array aliased with another Array that is turned into a String is UB.
  5. take_string! is a much better API for IOBuffer et al.

@oscardssmith oscardssmith added arrays [a, r, r, a, y, s] strings "Strings!" and removed triage This should be discussed on a triage call status: waiting for PR reviewer labels May 9, 2024
@nhz2 nhz2 mentioned this pull request May 9, 2024
@stevengj
Copy link
Member

stevengj commented May 9, 2024

The style guide says not to use underscores "when readable" without, which I think applies here.

@jakobnissen jakobnissen changed the title Add take_string! and make String(::Memory) not truncate Add takestring! and make String(::Memory) not truncate May 10, 2024
@topolarity
Copy link
Member

topolarity commented May 10, 2024

  1. using the an Array aliased with another Array that is turned into a String is UB.

Can this be refined to "mutating an Array after its alias has been turned into a String is UB"?

edit: For the record , I don't think it's right for us to put this contract on an API that's not marked unsafe. I expect that this 'contract' is broken much more often than the UB is actually triggered though, which is probably what's saving us in practice.

@oscardssmith
Copy link
Member

Yes. observing it is not great but not UB. the only UB is if you mutate it.

@nhz2
Copy link
Contributor

nhz2 commented May 10, 2024

takestring! can mutate the bytes in the underlying memory of a Vector input, it is not just making a String view.

@jakobnissen
Copy link
Member Author

jakobnissen commented May 10, 2024

After triage, this PR now contains the following changes:

  • Add takestring!. This function can be used with IOBuffer and Vector{UInt8}, returning a String. It resets the input to its initial state. This means it empties vectors, and empties IOBuffer if and only if the buffer is writable. This behaviour of not empying the input if the input is not writable is slightly weird, but matches take!
  • Add the internal methods Base.unsafe_takestring! and Base.unsafe_takestring, which are similar to takestring!, except they leave the argument (IOBuffer or Memory) in an unusable corrupt state. They are useful for the common pattern when code creates a buffer, takes a string from the buffer and then returns the string without touching the buffer again.
  • String(::Memory) now no longer truncates the memory
  • String(::Vector{UInt8}) now no longer truncates the underlying memory of the string. It still empties the vector, and assigns a new (empty) memory to the vector just like before.
  • The many different calls to String(take!(::IOBuffer)) and String(_unsafe_take!(::IOBuffer)) has been replaced with takestring! and the two internal methods Base.unsafe_takestring! and Base.unsafe_takestring). Not all calls to String(take!(...)) has been replaced, as there are many, many of such calls in the test suite and no need to change them.
  • Replace most uses of String(take!(::IOBuffer)) in Base and the stdlibs
  • Add tests

@jakobnissen jakobnissen added merge me PR is reviewed. Merge when all tests are passing and removed status: waiting for PR author labels Mar 30, 2025
@jakobnissen
Copy link
Member Author

Thanks for the thorough review, @nhz2. I addressed your points and rebased on master. As soon as all tests pass, this PR is good to go.

Copy link
Contributor

@nhz2 nhz2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from the compat version number, this looks good.

@jakobnissen
Copy link
Member Author

CI failures are unrelated.

@StefanKarpinski StefanKarpinski added the triage This should be discussed on a triage call label Apr 17, 2025
@DilumAluthge
Copy link
Member

Looks like this is going to go through another round of triage discussion.

@DilumAluthge DilumAluthge removed the merge me PR is reviewed. Merge when all tests are passing label Apr 18, 2025
@jakobnissen jakobnissen changed the title Add takestring! and make String(::Memory) not truncate Add takestring!(x) to create a string from the content of x, emptying it Apr 23, 2025
@gbaraldi
Copy link
Member

Triage likes this.

@oscardssmith
Copy link
Member

@jakobnissen thanks so much for all your work here! The one request from triage was a rebase to 1 commit that implements the new functions and a 2nd that switches uses/docs over.

This function creates a string from an IOBuffer or a Vector, emptying the input
object and reusing the memory where possible.
This commit switches the usage of the pattern `String(take!(::IOBuffer))` to use
the new `takestring!` or `unsafe_takestring!` functions across Base, and in doc-
strings.
Not all occurrences in e.g. tests are switched over, as this would consistute a
lot of code churn for no real purpose.
@jakobnissen jakobnissen added merge me PR is reviewed. Merge when all tests are passing and removed triage This should be discussed on a triage call labels Apr 25, 2025
io.size = 0
io.offset_or_compacted = 0
s
else
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is jl_genericmemory_to_string (the no-copy path) only reachable if the io is writable?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, takestring! is mainly used for IOBuffers that are used as string builders, so they are already writable.
Also, the last time I checked, calling jl_genericmemory_to_string can make the reading from the memory invalid: #54372 (comment) , though this might have changed since then.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When IOBuffer is not writable, the IOBuffer doesn't take ownership of the memory, and allows it to be shared, e.g. between multiple IOBuffers (and that also makes IOBuffer more safe in general since the user can keep a reference to the memory they wrapped the IOBuffer around). Therefore, we need to copy it out.
Luckily, as Nathan said, that's usually not a performance problem for takestring! where the iobuffer is mutable.

@IanButterworth IanButterworth added the don't squash Don't squash merge label Apr 28, 2025
@IanButterworth
Copy link
Member

Assuming don't squash from the triage request

@DilumAluthge
Copy link
Member

Removing merge me label because there still seems to be a review conversation ongoing: #54372 (comment)

@DilumAluthge DilumAluthge removed the merge me PR is reviewed. Merge when all tests are passing label Apr 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrays [a, r, r, a, y, s] don't squash Don't squash merge feature Indicates new feature / enhancement requests strings "Strings!"
Projects
None yet
Development

Successfully merging this pull request may close these issues.