Skip to content

Add new iterators: core::str::{chars_uppercase,chars_lowercase} #58

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
markokr opened this issue Jun 25, 2022 · 12 comments
Closed

Add new iterators: core::str::{chars_uppercase,chars_lowercase} #58

markokr opened this issue Jun 25, 2022 · 12 comments
Labels
ACP-accepted API Change Proposal is accepted (seconded with no objections) api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api

Comments

@markokr
Copy link

markokr commented Jun 25, 2022

Proposal

Problem statement

There is no way to itererate over lowercase/uppercase chars of a &str without creating full copy of it with str::to_lowercase
or str::to_uppercase.

But there exists a bad way: str::chars with char::to_lowercase. The problem is that the latter API cannot match output of str::to_lowercase or str::to_uppercase as some Unicode conversions require context and that API cannot provide it.

Motivation, use-cases

Reason for avoiding call to str::to_lowercase() and str::to_uppercase:

  • no_std
  • performance
  • lazy operation - only few chars are needed.

Reasons to avoid char::to_uppercase and char::to_lowercase:

  • corretctness.

Solution sketches

Solution A:

  • str::chars_lowercase() -> CharsLowercase
  • str::chars_uppercase() -> CharsUppercase

Solution B:

  • char::to_lowercase_with_context(&str, usize) -> ToLowercase
  • char::to_uppercase_with_context(&str, usize) -> ToUppercase

Links and related work

PR for solution A: rust-lang/rust#98490

@markokr markokr added api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api labels Jun 25, 2022
@thomcc
Copy link
Member

thomcc commented Jun 25, 2022

Essentially, this is useful because the naïve solution here -- something like "foo".chars().flat_map(|c| c.to_lowercase()) -- will incorrectly lowercase some text. Notably, Greek requires sigma at the end of a word to be lowercased as ς rather than σ.

We handle this in str::to_lowercase, but this requires liballoc, and is less efficient in some cases than using an iterator. This proposal (which I'm in favor of) adds iterators over the lowercased chars of a str which are able to correctly handle this.

It also adds an iterator over the uppercased chars. This is probably mostly for symmetry, since currently, there is no situation where this differs from "foo".chars().flat_map(|c| c.to_uppercase()). That said, (I believe) the non-existence of this is not guaranteed by Unicode, and in principal a future version may have reason for this interface to exist.

@lopopolo
Copy link

lopopolo commented Jul 1, 2022

Hi! I stumbled across this issue and I think a crate I maintain is relevant here: https://crates.io/crates/roe. Roe is a no std crate that does case mapping for ascii/full/turkic modes to upper/lower/title case.

I find it hard to grasp what std does in its string lowercase operation. It mentions the Unicode lowercase property and maybe some special cases for Greek.

Case mapping and case folding are Unicode transforms that the "to lowercase" operation in std differs from.

I'd love for case mapping to be better supported in core, but would prefer that it be via a Unicode Case Mapping API.

@lopopolo
Copy link

lopopolo commented Jul 1, 2022

It also adds an iterator over the uppercased chars. This is probably mostly for symmetry, since currently, there is no situation where this differs from "foo".chars().flat_map(|c| c.to_uppercase())

I think this assumes "full" case mapping mode (or something like it). Turkic mapping mode would map i to upper dotted I.

@markokr
Copy link
Author

markokr commented Jul 1, 2022

This issue is about adding new iterators that keep existing behavior from str::to_lowercase and str::to_uppercase. They use locale-independent mapping.

You seem interested in locale-specific mapping, that requires designing new APIs for that. This is right repo for that discussion, you can create an issue with problem description and API suggestions.

@thomcc
Copy link
Member

thomcc commented Jul 1, 2022

Yeah, we shouldn't add any locale-dependent mappings. (But this is a good argument that yes, it would be completely legal for Unicode to add a locale-independent uppercase mapping to SpecialCasing.txt in some future version).

@yaahc
Copy link
Member

yaahc commented Jul 14, 2022

This seems very well justified, I don't have specific feedback on the locale-dependent mappings issue but would like to see this move forward. Thank you for addressing this issue @markokr!

Seconded

@rustbot label +initial-comment-period

@rustbot rustbot added the initial-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substational objections are raised. label Jul 14, 2022
@pitaj
Copy link

pitaj commented Jan 30, 2023

I realize this is quite old, but it doesn't appear to be implemented yet.

What about adding to_uppercase() and to_lowercase() (not final fn names) associated functions to the Chars struct? They would take it by value and return CharsUppercase and CharsLowercase respectively.

@thomcc
Copy link
Member

thomcc commented Jan 30, 2023

I'm planning on getting to this soon (I already have a patch, I'll finish it up this weekend). I think the approach that got accepted in the ACP is fine but don't feel that strongly.

@pitaj
Copy link

pitaj commented Jan 30, 2023

I just worry about the ballooning number of str methods, especially for less common use cases like this.

@thomcc
Copy link
Member

thomcc commented Jan 30, 2023

Eh, I've seen a few people use str.chars().flat_map(|c| c.to_lowercase()) not realizing that it's subtly incorrect. I don't think this is that niche, and it's worth having it be more discoverable IMO.

@pitaj
Copy link

pitaj commented Jan 30, 2023

That's a good point, but I think we can just add a clippy lint for that pattern and clearly document .chars().to_uppercase() in both the docs for .chars() and char::to_uppercase

A clippy lint would be valuable regardless of approach, actually.

@m-ou-se
Copy link
Member

m-ou-se commented May 17, 2023

Closing this as this was already seconded.

@m-ou-se m-ou-se closed this as completed May 17, 2023
@dtolnay dtolnay added ACP-accepted API Change Proposal is accepted (seconded with no objections) and removed initial-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substational objections are raised. labels Nov 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ACP-accepted API Change Proposal is accepted (seconded with no objections) api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api
Projects
None yet
Development

No branches or pull requests

8 participants