[ICU4X] Migrate text analysis and shaping to use ICU4X #436
- condense all byte indexes to char indexes in a single loop
- track a minimal set of LineSegmenters (per LineBreakWordOption), and create as needed
- clean up tests
- add tests for multi-character graphemes
- group all word boundary logic together
- fix incorrect start truncation for multi-style strings which aren't multi-wb style + test for this
- test naming/grouping
- compute `force_normalize`
- simplify ClusterInfo to just `is_emoji`
- more clean-up
… as fontique::Language throughout parley_*
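The "condense all byte indexes to char indexes in a single loop" item can be illustrated with a minimal sketch. This is not the code from this PR: the function name `byte_to_char_indices` is hypothetical, and it assumes the byte offsets are sorted ascending and each one falls on a char boundary (or equals the text length).

```rust
/// Hypothetical helper (not from the PR): maps sorted byte offsets to char
/// indices with a single pass over `text`.
///
/// Assumptions: `byte_offsets` is sorted ascending, and every offset is a
/// char boundary of `text` or equals `text.len()`.
fn byte_to_char_indices(text: &str, byte_offsets: &[usize]) -> Vec<usize> {
    let mut out = Vec::with_capacity(byte_offsets.len());
    let mut pending = byte_offsets.iter().copied().peekable();
    let mut char_count = 0;

    for (char_idx, (byte_idx, _)) in text.char_indices().enumerate() {
        // Emit a char index for every pending offset that starts at this char.
        while pending.peek() == Some(&byte_idx) {
            out.push(char_idx);
            pending.next();
        }
        char_count = char_idx + 1;
    }

    // Offsets equal to `text.len()` map to the total number of chars.
    while pending.peek() == Some(&text.len()) {
        out.push(char_count);
        pending.next();
    }
    out
}
```

Because both the text and the offsets are walked once, the cost is linear in `text.len() + byte_offsets.len()`, rather than one `chars().count()` scan per offset.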
    };

    needs_bidi_resolution |= BidiResolver::needs_bidi_resolution(bidi_class);
    let bracket = lcx.analysis_data_sources.brackets().get(ch);
Could you please add a TODO here to consider making CompositeProps a u64 and baking BidiMirroringGlyph into it to avoid this lookup?
Ah good point. Done.
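For readers unfamiliar with the idea in the TODO, here is a rough sketch of baking a mirrored code point into the upper bits of a `u64`. Everything in it is an assumption for illustration: `PackedProps`, the 32-bit `base` payload, and the bit layout are hypothetical and do not reflect the actual `CompositeProps` layout, and the real `BidiMirroringGlyph` may carry more than a mirrored code point.

```rust
/// Hypothetical packed property word (not the PR's `CompositeProps`).
/// Assumed layout: bits 0..32 hold the existing packed properties,
/// bits 32..53 hold the mirrored code point, bit 53 flags its presence.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct PackedProps(u64);

const MIRROR_SHIFT: u32 = 32;
const MIRROR_MASK: u64 = 0x1F_FFFF; // 21 bits cover every Unicode scalar value
const HAS_MIRROR_BIT: u64 = 1 << 53;

impl PackedProps {
    fn new(base: u32, mirror: Option<char>) -> Self {
        let mut bits = u64::from(base);
        if let Some(m) = mirror {
            bits |= (u64::from(u32::from(m)) << MIRROR_SHIFT) | HAS_MIRROR_BIT;
        }
        Self(bits)
    }

    /// The original 32-bit property payload.
    fn base(self) -> u32 {
        // Truncation to the low 32 bits is intentional.
        self.0 as u32
    }

    /// The mirrored code point, read from the same word without a second lookup.
    fn mirror(self) -> Option<char> {
        if self.0 & HAS_MIRROR_BIT == 0 {
            return None;
        }
        char::from_u32(((self.0 >> MIRROR_SHIFT) & MIRROR_MASK) as u32)
    }
}
```

The trade-off is a wider trie value (8 bytes instead of 4 in this sketch) in exchange for one data-source lookup per character instead of two.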
taj-p left a comment
LGTM! YAY! 🎉 !!! Let's goooooooooooooo! So happy to have ICU4X support in Parley 🥳 🏆 🏆 🏆
    impl Whitespace {
        /// Returns true for space or no break space.
        pub(crate) fn is_space_or_nbsp(self) -> bool {
Suggested change:

    - pub(crate) fn is_space_or_nbsp(self) -> bool {
    + #[inline(always)]
    + pub(crate) fn is_space_or_nbsp(self) -> bool {
Migration of text analysis from Swash → ICU4X
Overview
ICU4X enables text analysis and internationalisation. For Parley, this includes locale and language recognition,
bidirectional text evaluation, text segmentation, emoji recognition, NFC/NFD normalisation and other Unicode character information.
ICU4X is developed and maintained by a trusted authority in the space of text internationalisation: the ICU4X Technical Committee (ICU4X-TC) in the Unicode Consortium. It is targeted at resource-constrained environments. For Parley, this means:
Notable changes
- `select_font` emoji detection improvements: flag emoji (e.g. "🇺🇸") and keycap sequences (e.g. 0️⃣ through 9️⃣) are now supported in cluster detection. Swash did not support these.
- … Scripts).
Performance/binary size
- `vello_editor` is ~100kB larger (9720kB vs 9620kB).
- There is a performance regression (~7%), but with optimisations (composite trie data sources, further minimisation of allocations/iterations), the regression is much less significant than it was originally (~55%).
As noted in #436 (comment), I think that we're getting close to maximising the efficiency of the current APIs offered by ICU. This can be seen by inspecting the text analysis profile:
Further optimisation of text analysis may require delving into ICU/unicode-bidi internals to, for example:
Other details
- `Language` parsing is more tolerant, e.g. it permits extra, invalid subtags (as in `"en-Latn-US-a-b-c-d"`).
- … `vello_editor` compilation testing). In order to potentially support correct word breaking across all languages without a huge increase in compilation size, we would need a way for users to attach only the locale data they need at runtime. This locale data could be generated (with `icu4x-datagen`) and attached (using `DataProvider`s) at runtime in the future.
Future Work
- … `en` but then need to shape some `ar`, we could inform the consumer that they need to provide `ar` property data.