Skip to content

Import unicode-normalization or re-write from scratch? #40

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
sffc opened this issue Apr 18, 2020 · 16 comments
Closed

Import unicode-normalization or re-write from scratch? #40

sffc opened this issue Apr 18, 2020 · 16 comments
Labels
A-scope Area: Project scope, feature coverage C-unicode Component: Props, sets, tries help wanted Issue needs an assignee T-core Type: Required functionality

Comments

@sffc
Copy link
Member

sffc commented Apr 18, 2020

@markusicu has done a great deal of work on ICU4C's normalizer. It depends on low-level and highly optimized data structures such as UCPTrie.

Writing normalization code from a clean room would allow us to:

  1. Use the same core algorithms as ICU4C, allowing better interop of code, data, and clients
  2. Build in proper string handling (String encodings (UTF-8/UTF-16) #14)
  3. Integrate it with ICU4X's locale data pipeline (including UCD data)
@sffc sffc added the question Unresolved questions; type unclear label Apr 18, 2020
@sffc
Copy link
Member Author

sffc commented Apr 18, 2020

@macchiati
Copy link
Member

macchiati commented Apr 18, 2020 via email

@Manishearth
Copy link
Member

I'm okay with cleanroom. We can perhaps use the same API as unicode-normalization.

We could also import it and gradually optimize. I don't have opinions on this.

@hsivonen
Copy link
Member

Use the same core algorithms as ICU4C, allowing better interop of code, data, and clients

How do ICU4C and unicode-normalization compare in performance?

We can perhaps use the same API as unicode-normalization.

unicode-normalization uses an iterator over char-based API, which isn't FFI-friendly. I expect we'll end up with that API for Rust callers only and will also a slice-based API for FFI concerns (also for Rust callers that have a slice).

@sffc sffc added C-process Component: Team processes C-meta Component: Relating to ICU4X as a whole A-scope Area: Project scope, feature coverage and removed C-process Component: Team processes labels May 7, 2020
@sffc
Copy link
Member Author

sffc commented May 13, 2020

@hsivonen @zbraniecki @Manishearth Have you had an opportunity to do performance testing of ICU4C Normalizer versus the existing Rust Normalizer?

@sffc sffc assigned hsivonen and unassigned markusicu May 13, 2020
@Manishearth
Copy link
Member

Not I, maybe Zibi?

@zbraniecki
Copy link
Member

I have not. I can perform if someone more experience with normalization gives me the test samples and calls to test against.

@macchiati
Copy link
Member

macchiati commented May 14, 2020 via email

@hsivonen hsivonen assigned zbraniecki and unassigned hsivonen May 14, 2020
@hsivonen
Copy link
Member

IIRC, we assigned this to Zibi and not me on last week's call.

@sffc
Copy link
Member Author

sffc commented May 14, 2020

See Henri's comment in #66 (comment)

@sffc sffc added T-core Type: Required functionality C-unicode Component: Props, sets, tries and removed C-meta Component: Relating to ICU4X as a whole question Unresolved questions; type unclear labels May 15, 2020
@nciric
Copy link
Contributor

nciric commented May 28, 2020

#93 gives us not so clear messaging on what to do in this case (except that current implementation lags behind ICU4C metrics).

If we need to implement data provider/loading in addition to optimization steps, it may be easier to do full rewrite. If anybody needs normalization now they can either use existing crate, or even go with rust_icu that wraps ICU4C.

@sffc
Copy link
Member Author

sffc commented Jun 2, 2020

Re-writing normalizer in ICU4X still has the benefits discussed in the OP. The concern from @hsivonen was that if ICU's normalizer is super slow, then maybe we shouldn't go that route. However, now that we've put that concern to rest, I think we can revisit the advantages of an ICU-based normalizer implementation.

@sffc sffc added the discuss Discuss at a future ICU4X-SC meeting label Jun 2, 2020
@sffc sffc added backlog help wanted Issue needs an assignee and removed discuss Discuss at a future ICU4X-SC meeting labels Jun 4, 2020
@sffc
Copy link
Member Author

sffc commented Jun 4, 2020

Putting in backlog. Comments from team:

  • Manish: The hard part is data loading. It's probably the same amount of work to re-write from scratch than retrofit the existing crate.
  • Steven: Having an existing implementation to compare against is good.
  • Cira: Having prior art is useful for us to implement.

@sffc sffc closed this as completed Jun 4, 2020
@sffc sffc mentioned this issue Jun 4, 2020
@filmil
Copy link
Contributor

filmil commented Jun 4, 2020

It's about 2 hours to make unorm.h avaiable in rust. Would this help move things along? For example, it seems like it would enable work on Segmenter (#109) since you won't need to wait until rust native normalization is ready.

@sffc
Copy link
Member Author

sffc commented Jun 4, 2020

It's about 2 hours to make unorm.h avaiable in rust. Would this help move things along? For example, it seems like it would enable work on Segmenter (#109) since you won't need to wait until rust native normalization is ready.

We don't have anyone to work on either Normalizer or Segmenter, so I don't think this is a priority at the current time, although it might be nice to have ready to go.

@sffc sffc reopened this Sep 4, 2020
@sffc sffc added this to the Backlog milestone Dec 22, 2022
@sffc sffc removed the backlog label Dec 22, 2022
@hsivonen
Copy link
Member

ICU4X has had a normalizer since 1.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-scope Area: Project scope, feature coverage C-unicode Component: Props, sets, tries help wanted Issue needs an assignee T-core Type: Required functionality
Projects
None yet
Development

No branches or pull requests

8 participants