Import unicode-normalization or re-write from scratch? #40
Comments
+1
…On Fri, Apr 17, 2020, 17:26 Shane F. Carr ***@***.***> wrote:
@Manishearth <https://github.com/Manishearth> @hsivonen
<https://github.com/hsivonen> @zbraniecki <https://github.com/zbraniecki>
|
I'm okay with cleanroom. We can perhaps use the same API as unicode-normalization. We could also import it and gradually optimize. I don't have opinions on this. |
How do ICU4C and
|
@hsivonen @zbraniecki @Manishearth Have you had an opportunity to do performance testing of ICU4C Normalizer versus the existing Rust Normalizer? |
Not I, maybe Zibi? |
I have not. I can run it if someone more experienced with normalization gives me the test samples and the calls to test against. |
I think we have some test data. We also test against the Unicode test files, but those are not representative text for performance purposes. The best input is a mixture of text from different languages, with each language's share matching the frequency of text in that language on the web.
Mark
…On Wed, May 13, 2020 at 4:55 PM Zibi Braniecki ***@***.***> wrote:
I have not. I can perform if someone more experience with normalization
gives me the test samples and calls to test against.
|
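A rough sketch of how the language mix suggested above could be turned into a benchmark corpus: split a fixed number of samples across languages in proportion to a weight per language. The function name `allocate_samples` and the weights in the usage note are hypothetical, made up for illustration; real weights would have to come from actual web-frequency data.

```rust
/// Hypothetical helper: split `total` samples across languages in proportion
/// to integer weights (for example, an estimated share of web text).
/// Largest-remainder rounding keeps the counts summing exactly to `total`.
/// Note: `w * total` can overflow for very large inputs; fine for a sketch.
pub fn allocate_samples(weights: &[(&str, u64)], total: u64) -> Vec<(String, u64)> {
    let sum: u64 = weights.iter().map(|&(_, w)| w).sum();
    if sum == 0 {
        // No usable weights: every language gets zero samples.
        return weights.iter().map(|&(lang, _)| (lang.to_string(), 0)).collect();
    }

    // Floor of each exact share, plus the remainder used for rounding.
    let mut counts: Vec<u64> = Vec::with_capacity(weights.len());
    let mut remainders: Vec<u64> = Vec::with_capacity(weights.len());
    for &(_, w) in weights {
        let scaled = w * total;
        counts.push(scaled / sum);
        remainders.push(scaled % sum);
    }

    // Hand the leftover samples to the largest remainders first;
    // ties go to the language listed earlier.
    let assigned: u64 = counts.iter().sum();
    let leftover = (total - assigned) as usize;
    let mut order: Vec<usize> = (0..weights.len()).collect();
    order.sort_by(|&a, &b| remainders[b].cmp(&remainders[a]).then(a.cmp(&b)));
    for &i in order.iter().take(leftover) {
        counts[i] += 1;
    }

    weights
        .iter()
        .zip(counts)
        .map(|(&(lang, _), n)| (lang.to_string(), n))
        .collect()
}
```

With the made-up weights `[("en", 50), ("ru", 30), ("ja", 20)]` and `total = 7`, this yields 4, 2 and 1 samples respectively. The same counts could then drive how many paragraphs of each language are fed to both normalizers.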
IIRC, we assigned this to Zibi and not me on last week's call. |
See Henri's comment in #66 (comment) |
#93 does not give a clear message on what to do in this case (except that the current implementation lags behind ICU4C's metrics). If we need to implement data provider/loading in addition to the optimization steps, it may be easier to do a full rewrite. If anybody needs normalization now, they can either use the existing crate or go with rust_icu, which wraps ICU4C. |
Rewriting the normalizer in ICU4X still has the benefits discussed in the OP. The concern from @hsivonen was that if ICU's normalizer is super slow, then maybe we shouldn't go that route. However, now that we've put that concern to rest, I think we can revisit the advantages of an ICU-based normalizer implementation. |
Putting in backlog. Comments from team:
|
It's about 2 hours to make |
We don't have anyone to work on either Normalizer or Segmenter, so I don't think this is a priority at the current time, although it might be nice to have ready to go. |
ICU4X has had a normalizer since 1.0. |
@markusicu has done a great deal of work on ICU4C's normalizer. It depends on low-level and highly optimized data structures such as UCPTrie.
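For readers unfamiliar with this kind of structure, here is a minimal two-stage lookup table over code points. It is only a sketch of the general idea (an index of blocks pointing into deduplicated data) and is not UCPTrie's actual layout; the type `TwoStageTable`, its methods, and the block size are all assumptions made for illustration.

```rust
use std::collections::HashMap;

const SHIFT: u32 = 6; // assumed block size: 64 code points
const BLOCK_LEN: u32 = 1 << SHIFT;
const MASK: u32 = BLOCK_LEN - 1;
const LIMIT: u32 = 0x11_0000; // one past the last code point, U+10FFFF

/// Hypothetical two-stage table mapping a code point to a `u8` property.
pub struct TwoStageTable {
    index: Vec<u16>, // one entry per block: which stored block to use
    data: Vec<u8>,   // deduplicated blocks, concatenated
}

impl TwoStageTable {
    /// Build the table by evaluating `value_of` for every code point and
    /// storing each distinct block of values only once.
    pub fn build(value_of: impl Fn(u32) -> u8) -> Self {
        let mut index = Vec::with_capacity((LIMIT >> SHIFT) as usize);
        let mut data: Vec<u8> = Vec::new();
        let mut seen: HashMap<Vec<u8>, u16> = HashMap::new();
        for start in (0..LIMIT).step_by(BLOCK_LEN as usize) {
            let block: Vec<u8> = (start..start + BLOCK_LEN).map(|cp| value_of(cp)).collect();
            let next_id = seen.len() as u16;
            let id = *seen.entry(block.clone()).or_insert_with(|| {
                // First time this block content appears: append it to `data`.
                data.extend_from_slice(&block);
                next_id
            });
            index.push(id);
        }
        TwoStageTable { index, data }
    }

    /// Look up the value for `cp`: two array reads, no branching on ranges.
    /// Values outside the Unicode range return 0.
    pub fn get(&self, cp: u32) -> u8 {
        if cp >= LIMIT {
            return 0;
        }
        let block = self.index[(cp >> SHIFT) as usize] as usize;
        self.data[(block << SHIFT) + (cp & MASK) as usize]
    }

    /// Number of bytes kept in the deduplicated data array.
    pub fn data_len(&self) -> usize {
        self.data.len()
    }
}
```

A toy use would be a table of canonical combining classes, for example `TwoStageTable::build(|cp| match cp { 0x0301 => 230, 0x0323 => 220, _ => 0 })`, followed by `table.get(0x0301)`. Because most blocks are identical, the data array stays small even though the index covers every code point.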
Writing normalization code from a clean room would allow us to: