Import unicode-normalization or re-write from scratch? #40

sffc · 2020-04-18T00:25:59Z

@markusicu has done a great deal of work on ICU4C's normalizer. It depends on low-level and highly optimized data structures such as UCPTrie.

Writing normalization code from a clean room would allow us to:

Use the same core algorithms as ICU4C, allowing better interop of code, data, and clients
Build in proper string handling (String encodings (UTF-8/UTF-16) #14)
Integrate it with ICU4X's locale data pipeline (including UCD data)

sffc · 2020-04-18T00:26:14Z

@Manishearth @hsivonen @zbraniecki

macchiati · 2020-04-18T04:46:19Z

+1


Manishearth · 2020-04-20T18:32:53Z

I'm okay with cleanroom. We can perhaps use the same API as unicode-normalization.

We could also import it and gradually optimize. I don't have opinions on this.

hsivonen · 2020-04-21T07:22:00Z

> Use the same core algorithms as ICU4C, allowing better interop of code, data, and clients

How do ICU4C and unicode-normalization compare in performance?

> We can perhaps use the same API as unicode-normalization.

unicode-normalization uses an API based on an iterator over char, which isn't FFI-friendly. I expect we'll end up with that API for Rust callers only and will also need a slice-based API for FFI concerns (and for Rust callers that already have a slice).
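
To make the two API shapes concrete, here is a minimal sketch in Rust. All names in it (`ToyCompose`, `toy_compose_iter`, `toy_compose_utf8`) are hypothetical and belong to neither unicode-normalization nor ICU4X. The "normalizer" is a toy that only composes `e` followed by U+0301 into U+00E9; it is an assumption made purely for illustration and is not Unicode normalization.

```rust
use std::iter::Peekable;

/// Toy stand-in for a real normalizer (hypothetical): it only composes
/// `e` + U+0301 (COMBINING ACUTE ACCENT) into U+00E9. It exists only so
/// that the two API shapes below have something to do.
pub struct ToyCompose<I: Iterator<Item = char>> {
    inner: Peekable<I>,
}

impl<I: Iterator<Item = char>> Iterator for ToyCompose<I> {
    type Item = char;

    fn next(&mut self) -> Option<char> {
        let c = self.inner.next()?;
        if c == 'e' && self.inner.peek() == Some(&'\u{0301}') {
            self.inner.next(); // consume the combining mark
            return Some('\u{00E9}');
        }
        Some(c)
    }
}

/// Shape 1: iterator over `char` in, iterator over `char` out.
/// Convenient for Rust callers, but generic types cannot cross a C ABI.
pub fn toy_compose_iter<I: Iterator<Item = char>>(input: I) -> ToyCompose<I> {
    ToyCompose { inner: input.peekable() }
}

/// Shape 2: slice in, caller-provided buffer out. There are no generics or
/// iterators in the signature, so a thin `extern "C"` wrapper taking
/// pointer/length pairs could sit on top of it.
/// Returns the number of bytes written, or `None` if `out` is too small.
pub fn toy_compose_utf8(input: &str, out: &mut [u8]) -> Option<usize> {
    let mut written = 0;
    for c in toy_compose_iter(input.chars()) {
        let len = c.len_utf8();
        if written + len > out.len() {
            return None;
        }
        c.encode_utf8(&mut out[written..written + len]);
        written += len;
    }
    Some(written)
}
```

A Rust caller would typically write `toy_compose_iter(s.chars()).collect::<String>()`, while an FFI caller would pass a UTF-8 buffer and an output buffer to something shaped like `toy_compose_utf8`. In this sketch the slice-based function is layered on the iterator for brevity; a real implementation might instead work on the slice directly to avoid per-char overhead.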

sffc · 2020-05-13T23:29:58Z

@hsivonen @zbraniecki @Manishearth Have you had an opportunity to do performance testing of ICU4C Normalizer versus the existing Rust Normalizer?

Manishearth · 2020-05-13T23:32:59Z

Not I, maybe Zibi?

zbraniecki · 2020-05-13T23:54:54Z

I have not. I can run the tests if someone with more normalization experience gives me the test samples and the calls to test against.

macchiati · 2020-05-14T00:25:09Z

I think we have some test data. We also test against the Unicode test files, but those are not representative text for performance purposes. The best input is a mixture of text from different languages, with each language's share proportional to the frequency of text in that language on the web.

Mark


hsivonen · 2020-05-14T06:40:56Z

IIRC, we assigned this to Zibi and not me on last week's call.

sffc · 2020-05-14T18:22:53Z

See Henri's comment in #66 (comment)

nciric · 2020-05-28T23:12:47Z

#93 does not give us a clear message on what to do in this case (except that the current implementation lags behind ICU4C's metrics).

If we need to implement data provider/loading in addition to the optimization steps, it may be easier to do a full rewrite. If anybody needs normalization now, they can either use the existing crate or go with rust_icu, which wraps ICU4C.

sffc · 2020-06-02T03:50:48Z

Re-writing the normalizer in ICU4X still has the benefits discussed in the OP. The concern from @hsivonen was that if ICU's normalizer is super slow, then maybe we shouldn't go that route. However, now that we've put that concern to rest, I think we can revisit the advantages of an ICU-based normalizer implementation.

sffc · 2020-06-04T18:06:27Z

Putting in backlog. Comments from team:

Manish: The hard part is data loading. It's probably the same amount of work to re-write from scratch as to retrofit the existing crate.
Steven: Having an existing implementation to compare against is good.
Cira: Having prior art is useful for us to implement.

filmil · 2020-06-04T18:32:18Z

It's about 2 hours to make unorm.h available in Rust. Would this help move things along? For example, it seems like it would enable work on Segmenter (#109), since you won't need to wait until Rust-native normalization is ready.

sffc · 2020-06-04T18:59:27Z

> It's about 2 hours to make unorm.h available in Rust. Would this help move things along? For example, it seems like it would enable work on Segmenter (#109), since you won't need to wait until Rust-native normalization is ready.

We don't have anyone to work on either Normalizer or Segmenter, so I don't think this is a priority at the current time, although it might be nice to have ready to go.

hsivonen · 2023-02-14T15:13:11Z

ICU4X has had a normalizer since 1.0.

sffc added the question Unresolved questions; type unclear label Apr 18, 2020

sffc assigned markusicu Apr 18, 2020

sffc mentioned this issue Apr 18, 2020

Adding Action column to ecosystem.md #41

Merged

echeran mentioned this issue Apr 30, 2020

Performance testing against ICU4C #66

Open

sffc added C-process Component: Team processes C-meta Component: Relating to ICU4X as a whole A-scope Area: Project scope, feature coverage and removed C-process Component: Team processes labels May 7, 2020

sffc assigned hsivonen and unassigned markusicu May 13, 2020

hsivonen assigned zbraniecki and unassigned hsivonen May 14, 2020

zbraniecki mentioned this issue May 15, 2020

Perform normalization performance evaluation between Rust and ICU #93

Closed

sffc added T-core Type: Required functionality C-unicode Component: Props, sets, tries and removed C-meta Component: Relating to ICU4X as a whole question Unresolved questions; type unclear labels May 15, 2020

sffc added the discuss Discuss at a future ICU4X-SC meeting label Jun 2, 2020

sffc unassigned zbraniecki Jun 4, 2020

sffc added backlog help wanted Issue needs an assignee and removed discuss Discuss at a future ICU4X-SC meeting labels Jun 4, 2020

sffc closed this as completed Jun 4, 2020

sffc mentioned this issue Jun 4, 2020

Segmenter #109

Closed

sffc reopened this Sep 4, 2020

sffc added this to the Backlog milestone Dec 22, 2022

sffc removed the backlog label Dec 22, 2022

hsivonen closed this as completed Feb 14, 2023
