Tracking Issue for utf8_chunks #99543

Description

@dylni
Contributor

Feature gate: #![feature(utf8_chunks)]

This is a tracking issue for an improved API for str::from_utf8.

Public API

// core::str

pub struct Utf8Chunks<'a> { ... }

impl<'a> Utf8Chunks<'a> {
    pub fn new(bytes: &'a [u8]) -> Self;
}

impl<'a> Iterator for Utf8Chunks<'a> {
    type Item = Utf8Chunk<'a>;
}

impl<'a> Clone for Utf8Chunks<'a>;
impl<'a> Debug for Utf8Chunks<'a>;
impl<'a> FusedIterator for Utf8Chunks<'a>;


pub struct Utf8Chunk<'a> { ... }

impl<'a> Utf8Chunk<'a> {
    pub fn valid(&self) -> &'a str;
    pub fn invalid(&self) -> &'a [u8];
}

impl<'a> Clone for Utf8Chunk<'a>;
impl<'a> Debug for Utf8Chunk<'a>;
impl<'a> PartialEq for Utf8Chunk<'a>;
impl<'a> Eq for Utf8Chunk<'a>;
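
As a minimal sketch of how the iterator behaves: each item pairs a run of valid UTF-8 with the invalid sequence (possibly empty) that follows it. The sketch assumes a toolchain where the constructor is spelled `<[u8]>::utf8_chunks`, one of the two spellings raised under Unresolved Questions; the helper name `describe` is hypothetical.

```rust
/// Hypothetical helper: collects every chunk as an owned (valid, invalid) pair.
/// Assumption: `<[u8]>::utf8_chunks` is available (stable since Rust 1.79).
fn describe(bytes: &[u8]) -> Vec<(String, Vec<u8>)> {
    bytes
        .utf8_chunks()
        .map(|chunk| (chunk.valid().to_owned(), chunk.invalid().to_vec()))
        .collect()
}
```

For the input `b"foo\xF1\x80bar\xFFbaz"` this yields three chunks: `"foo"` followed by the bytes `F1 80`, `"bar"` followed by `FF`, and `"baz"` followed by an empty invalid slice.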

Steps / History

Unresolved Questions

  • Should the constructor be Utf8Chunks::new or <[u8]>::utf8_chunks?
  • Should Utf8Chunks::debug or a similar method be exposed?

Footnotes

  1. https://std-dev-guide.rust-lang.org/feature-lifecycle/stabilization.html

Activity

added labels C-tracking-issue (Category: An issue tracking the progress of sth. like the implementation of an RFC) and T-libs-api (Relevant to the library API team, which will review and decide on the PR/issue.) on Jul 21, 2022
added a commit that references this issue on Aug 20, 2022: d499065

dtolnay commented on Nov 24, 2023

Member

I'd be interested in using this to implement Display and Debug for CxxString in the cxx crate. Here is the current implementation without Utf8Chunks:

Here is what it looks like using Utf8Chunks as currently exists in nightly:

Are there other known use cases so far that we could look at before an FCP? One thing I am interested in is how the current Utf8Chunks API compares with this alternative one, not based on Iterator, with just 1 type:

pub struct Utf8Chunks<'a>;

impl<'a> Utf8Chunks<'a> {
    pub fn next_valid(&mut self) -> &'a str;
    pub fn next_invalid(&mut self) -> &'a [u8];
}

dylni commented on Dec 16, 2023

Contributor, Author

@dtolnay I am currently waiting on this ACP for stabilization.

> Are there other known use cases so far that we could look at before an FCP?

I am aware of the following use cases.

  • Lossy conversion (String::from_utf8_lossy)
  • Debug formatting (as you mentioned)

I was originally going to use this feature in os_str_bytes for Debug formatting, but invalid returning individual "sequences" made this usage cumbersome. OsStr cannot be assumed to have the same invalid sequences. However, the individual sequences are required for lossy conversion to work with Utf8Chunks in its current form within libstd.
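
To make the lossy-conversion use case concrete, here is a rough sketch of how one U+FFFD per invalid sequence falls out of the chunk iterator. It assumes the `<[u8]>::utf8_chunks` spelling of the constructor; the function name `lossy` is hypothetical and this is not the libstd implementation.

```rust
/// Hypothetical helper: replaces each invalid sequence with U+FFFD.
/// Assumption: `<[u8]>::utf8_chunks` is available (stable since Rust 1.79).
fn lossy(bytes: &[u8]) -> String {
    let mut out = String::with_capacity(bytes.len());
    for chunk in bytes.utf8_chunks() {
        // Valid text is copied through unchanged.
        out.push_str(chunk.valid());
        // A non-empty `invalid()` is exactly one invalid sequence,
        // so it maps to exactly one replacement character.
        if !chunk.invalid().is_empty() {
            out.push(char::REPLACEMENT_CHARACTER);
        }
    }
    out
}
```

Because `invalid()` never merges adjacent bad sequences, the output should agree with `String::from_utf8_lossy` on the same input, which is why this use case depends on per-sequence granularity.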

> One thing I am interested in is how the current Utf8Chunks API compares with this alternative one, not based on Iterator, with just 1 type:

My concern is that the alternate API is easier to misuse (e.g., calling next_valid twice for two valid chunks). It also requires parsing each invalid sequence twice.
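
The misuse concern can be illustrated with a hypothetical stand-in for the single-type API, layered on the stable `str::from_utf8`. The type `AltChunks`, its `new` constructor, and the choice to return an empty slice when nothing matches are all assumptions made for this sketch; none of them comes from the proposal.

```rust
use std::str;

/// Hypothetical stand-in for the single-type alternative; not a real std type.
pub struct AltChunks<'a> {
    rest: &'a [u8],
}

impl<'a> AltChunks<'a> {
    pub fn new(bytes: &'a [u8]) -> Self {
        Self { rest: bytes }
    }

    /// Returns the longest valid UTF-8 prefix of the remaining input (possibly empty).
    pub fn next_valid(&mut self) -> &'a str {
        let bytes: &'a [u8] = self.rest;
        let len = match str::from_utf8(bytes) {
            Ok(s) => s.len(),
            Err(e) => e.valid_up_to(),
        };
        let (valid, rest) = bytes.split_at(len);
        self.rest = rest;
        // Re-validating keeps the sketch free of `unsafe`; it cannot fail here.
        str::from_utf8(valid).expect("prefix was already validated")
    }

    /// Returns one invalid sequence if the remaining input starts with one,
    /// otherwise an empty slice.
    pub fn next_invalid(&mut self) -> &'a [u8] {
        let bytes: &'a [u8] = self.rest;
        let len = match str::from_utf8(bytes) {
            // `error_len() == None` means the input ended mid-sequence.
            Err(e) if e.valid_up_to() == 0 => e.error_len().unwrap_or(bytes.len()),
            _ => 0,
        };
        let (invalid, rest) = bytes.split_at(len);
        self.rest = rest;
        invalid
    }
}
```

With the input `b"ab\xFFcd"`, a second consecutive call to `next_valid` returns `""` rather than `"cd"`, so a caller who forgets to interleave `next_invalid` silently stalls. In this sketch the invalid sequence is also inspected twice: once by `next_valid` to find where the valid prefix ends, and again by `next_invalid` to measure it.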

Dylan-DPC commented on Mar 6, 2024

Member

@dylni Generally, you don't need an ACP for this to stabilise (unless the team explicitly asked for one, which I don't think happened in this case).
The next step is an FCP: you can submit a stabilisation PR that links this issue, preferably including the report you shared here, and the team will then run an FCP either in the PR or in the issue.

dylni commented on Mar 9, 2024

Contributor, Author

@Dylan-DPC Right, but the problem is that the ACP would change the API. Stabilizing at this point would prevent the API change from landing.

dtolnay commented on Apr 11, 2024

Member

@rust-lang/libs-api:
@rfcbot fcp merge

I propose stabilizing core::str::Utf8Chunks and core::str::Utf8Chunk ❗with the minor modification described in rust-lang/libs-team#190 ❗.

I recently wanted this API for C++ string Debug impls in cxx as described in #99543 (comment), and also in libproc_macro for synthesizing C-string literals in #123769.
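
For readers unfamiliar with the Debug use case, a rough sketch follows. The wrapper type `LossyBytes` is hypothetical (it is not the cxx type), the `\xNN` escape format is an arbitrary choice for illustration, and the code assumes the `<[u8]>::utf8_chunks` spelling of the constructor.

```rust
use std::fmt::{self, Write};

/// Hypothetical wrapper around possibly-invalid UTF-8 bytes; not the cxx type.
/// Assumption: `<[u8]>::utf8_chunks` is available (stable since Rust 1.79).
pub struct LossyBytes<'a>(pub &'a [u8]);

impl fmt::Debug for LossyBytes<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_char('"')?;
        for chunk in self.0.utf8_chunks() {
            // Valid text is escaped with `str::escape_debug`.
            write!(f, "{}", chunk.valid().escape_debug())?;
            // Each invalid byte is shown as a hex escape.
            for byte in chunk.invalid() {
                write!(f, "\\x{byte:02X}")?;
            }
        }
        f.write_char('"')
    }
}
```

Here the per-sequence granularity of `invalid()` does not matter, since every invalid byte is printed individually.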

rfcbot commented on Apr 11, 2024

Collaborator

Team member @dtolnay has proposed to merge this. The next step is review by the rest of the tagged team members:

No concerns currently listed.

Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

Labels

  • C-tracking-issue: Category: An issue tracking the progress of sth. like the implementation of an RFC
  • T-libs-api: Relevant to the library API team, which will review and decide on the PR/issue.
  • disposition-merge: This issue / PR is in PFCP or FCP with a disposition to merge it.
  • finished-final-comment-period: The final comment period is finished for this PR / Issue.