
Conversation


@conor-93 conor-93 commented Oct 20, 2025

Migration of text analysis from Swash → ICU4X

Overview

ICU4X enables text analysis and internationalisation. For Parley, this includes locale and language recognition,
bidirectional text evaluation, text segmentation, emoji recognition, NFC/NFD normalisation and other Unicode character information.

ICU4X is developed and maintained by a trusted authority in the space of text internationalisation: the ICU4X Technical Committee (ICU4X-TC) in the Unicode Consortium. It is targeted at resource-constrained environments. For Parley, this means:

  • The potential for full locale support for complex line breaking cases (not supported by Swash).
  • Reliable and up-to-date Unicode data.
  • Reasonable performance and memory footprint (with the possibility of future improvements).
  • Full decoupling from Swash (following decoupling for shaping behaviour earlier this year); a significant offloading of maintenance effort.

Notable changes

  • Removal of first-party bidi embed level resolution logic.
  • select_font emoji detection improvements: flag emoji ("🇺🇸") and keycap sequences (e.g. 0️⃣ through 9️⃣) are now supported in cluster detection; Swash did not support these.
  • Slightly more up-to-date Unicode data than Swash (e.g. a few more Scripts).
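The two newly supported cluster shapes can be recognised with plain character checks. The following is a standalone sketch of the relevant Unicode rules (regional-indicator pairs and keycap sequences), not Parley's actual detection code:

```rust
/// Returns true if `c` is a Regional Indicator symbol (U+1F1E6..=U+1F1FF);
/// two of these in a row form a flag emoji such as "🇺🇸".
fn is_regional_indicator(c: char) -> bool {
    ('\u{1F1E6}'..='\u{1F1FF}').contains(&c)
}

/// Returns true if `s` starts with a keycap sequence: a base character
/// ('0'..='9', '#' or '*'), an optional U+FE0F (emoji presentation
/// selector), then U+20E3 (COMBINING ENCLOSING KEYCAP).
fn starts_with_keycap(s: &str) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        Some(c) if c.is_ascii_digit() || c == '#' || c == '*' => {}
        _ => return false,
    }
    match chars.next() {
        Some('\u{FE0F}') => chars.next() == Some('\u{20E3}'),
        Some('\u{20E3}') => true,
        _ => false,
    }
}

fn main() {
    let flag = "🇺🇸";
    assert!(flag.chars().all(is_regional_indicator));
    assert!(starts_with_keycap("0\u{FE0F}\u{20E3}")); // 0️⃣
    assert!(!starts_with_keycap("abc"));
    println!("emoji checks passed");
}
```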

Performance/binary size

  • Binary size for vello_editor is ~100kB larger (9720kB vs 9620kB).
    There is a performance regression (~7%), but with optimisations (composite trie data sources, further minimisation of allocations/iterations) it is much smaller than the original ~55%:
Default Style - arabic 20 characters               [   9.9 us ...  11.0 us ]     +10.91%*
Default Style - latin 20 characters                [   4.3 us ...   4.8 us ]     +10.88%*
Default Style - japanese 20 characters             [   8.5 us ...   9.2 us ]      +8.04%*
Default Style - arabic 1 paragraph                 [  59.1 us ...  63.7 us ]      +7.70%*
Default Style - latin 1 paragraph                  [  18.8 us ...  20.8 us ]     +10.44%*
Default Style - japanese 1 paragraph               [  74.4 us ...  78.8 us ]      +5.80%*
Default Style - arabic 4 paragraph                 [ 253.6 us ... 269.6 us ]      +6.30%*
Default Style - latin 4 paragraph                  [  79.7 us ...  86.8 us ]      +8.97%*
Default Style - japanese 4 paragraph               [ 102.8 us ... 107.9 us ]      +4.98%*
Styled - arabic 20 characters                      [  11.1 us ...  12.2 us ]      +9.79%*
Styled - latin 20 characters                       [   5.6 us ...   6.2 us ]     +10.34%*
Styled - japanese 20 characters                    [   9.4 us ...  10.1 us ]      +7.41%*
Styled - arabic 1 paragraph                        [  60.1 us ...  65.0 us ]      +8.04%*
Styled - latin 1 paragraph                         [  21.8 us ...  23.9 us ]      +9.66%*
Styled - japanese 1 paragraph                      [  84.2 us ...  87.4 us ]      +3.79%*
Styled - arabic 4 paragraph                        [ 270.5 us ... 288.5 us ]      +6.66%*
Styled - latin 4 paragraph                         [  85.4 us ...  94.1 us ]     +10.17%*
Styled - japanese 4 paragraph                      [ 117.2 us ... 123.5 us ]      +5.39%*

As noted in #436 (comment), I think that we're getting close to maximising the efficiency of the current APIs offered by ICU. This can be seen by inspecting the text analysis profile:

[image: text analysis profile]

Further optimisation of text analysis may require delving into ICU/unicode-bidi internals to, for example:

  1. combine line and word boundary calculations (rather than having them run separately). Chad may have ideas on further improvement.
  2. pass in character boundary information from our composite properties Trie; ICU internally performs multiple lookups for identical characters.
  3. pass in bidi class information to unicode-bidi to prevent redundant lookups.
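The "composite properties" idea mentioned above packs several per-character Unicode properties into one integer so a single trie lookup answers every question the analysis pass asks. A rough illustration follows; the field layout, widths, and names here are invented for the sketch, not Parley's actual CompositeProps:

```rust
/// Hypothetical packed per-character properties: one lookup yields
/// everything the analysis pass needs, instead of one lookup per property.
#[derive(Clone, Copy)]
struct CompositeProps(u32);

impl CompositeProps {
    const BIDI_SHIFT: u32 = 0; // bits 0..5:  bidi class
    const WB_SHIFT: u32 = 5;   // bits 5..10: word-break property
    const LB_SHIFT: u32 = 10;  // bits 10..16: line-break property

    fn new(bidi: u32, wb: u32, lb: u32) -> Self {
        Self(bidi << Self::BIDI_SHIFT | wb << Self::WB_SHIFT | lb << Self::LB_SHIFT)
    }
    fn bidi_class(self) -> u32 {
        (self.0 >> Self::BIDI_SHIFT) & 0x1F
    }
    fn word_break(self) -> u32 {
        (self.0 >> Self::WB_SHIFT) & 0x1F
    }
    fn line_break(self) -> u32 {
        (self.0 >> Self::LB_SHIFT) & 0x3F
    }
}

fn main() {
    // One packed value stands in for three separate property lookups.
    let p = CompositeProps::new(1, 7, 33);
    assert_eq!(p.bidi_class(), 1);
    assert_eq!(p.word_break(), 7);
    assert_eq!(p.line_break(), 33);
    println!("packed lookup ok");
}
```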

Other details

  • Swash's Language parsing is more tolerant, e.g. it permits extra, invalid subtags (like in "en-Latn-US-a-b-c-d").
  • Segmenters (line, word, grapheme) are currently content-aware, and can be used without specifying a locale. However, if we plug locale data in at runtime, we can construct segmenters to target a specific locale, rather than inferring from content (which would be the most correct approach for targeting said locale).
    • The full set of locale data (even with ICU4X's deduplication) is heavy, totalling ~2.5MB (in vello_editor compilation testing). In order to potentially support correct word breaking across all languages, without seeing a huge compilation size increase, we would need a way for users to attach only the locale data they need at runtime. This locale data could be generated (with icu4x-datagen) and attached (using DataProviders) at runtime in the future.
    • Without full locale support, line and word breaking use Unicode rule-based approaches UAX #14 and #29 respectively (at parity with Swash).
  • Swash's support for alternating word break strength is maintained by breaking text into windows (which look back/forward an extra character for context) and performing segmentation on each window separately, as ICU4X doesn't natively support variable word break strength when segmenting.
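The windowing approach can be sketched as follows. The segmenter here is a trivial whitespace stand-in for ICU4X's real word segmenter, and the run ranges and helper names are invented; only the shape (extend each window one character back/forward for context, then keep the boundaries that fall inside the run) reflects the description above:

```rust
/// Stand-in segmenter: returns byte offsets of boundaries in `text`
/// (here, simply the position after each ASCII space).
fn segment(text: &str) -> Vec<usize> {
    text.char_indices()
        .filter(|&(_, c)| c == ' ')
        .map(|(i, _)| i + 1)
        .collect()
}

/// Segment `text` per style run (byte ranges), extending each window by
/// one char of lookback/lookahead so the segmenter sees context, then
/// keeping only the boundaries that fall inside the run itself.
fn segment_runs(text: &str, runs: &[std::ops::Range<usize>]) -> Vec<usize> {
    let mut boundaries = Vec::new();
    for run in runs {
        // Extend the window one char back and one char forward.
        let start = text[..run.start]
            .char_indices()
            .next_back()
            .map_or(run.start, |(i, _)| i);
        let end = text[run.end..]
            .chars()
            .next()
            .map_or(run.end, |c| run.end + c.len_utf8());
        for b in segment(&text[start..end]) {
            let abs = start + b;
            if abs > run.start && abs <= run.end {
                boundaries.push(abs);
            }
        }
    }
    boundaries
}

fn main() {
    let text = "aa bb cc";
    // Two style runs covering the whole string; the boundary after "bb "
    // is found by the second window even though the space that produces
    // it sits at the edge of the first run.
    let runs = [0..5, 5..8];
    assert_eq!(segment_runs(text, &runs), vec![3, 6]);
    println!("windowed segmentation ok");
}
```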

Future Work

  • We could also support bring-your-own-data for Unicode character information too, for users only interested in narrow character sets (e.g. basic Latin), for a small compilation size improvement (not sure how much exactly).
  • Feature flagging which locale data to bake into the binary.
  • Allow hot swapping unicode character data at runtime. For example, if you start off shaping en but then need to shape some ar, we could inform the consumer that they need to provide ar property data.

- condense all byte indexes to char indexes in a single loop
- track a minimal set of LineSegmenters (per LineBreakWordOption), and create as needed
- clean up tests
- add tests for multi-character graphemes
- fix incorrect start truncation for multi-style strings which aren't multi-wb style + test for this
- test naming/grouping
- compute `force_normalize`
- simplify ClusterInfo to just `is_emoji`
- more clean-up
@conor-93 conor-93 requested a review from taj-p December 8, 2025 05:32
};

needs_bidi_resolution |= BidiResolver::needs_bidi_resolution(bidi_class);
let bracket = lcx.analysis_data_sources.brackets().get(ch);

Could you please add a TODO here to consider making CompositeProps a u64 and baking BidiMirroringGlyph into it to avoid this lookup?
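The suggested widening could look roughly like this: keep the existing 32-bit composite properties in the low half of a u64 and bake the mirrored code point into the high half, so one trie lookup answers both questions. This is a hypothetical sketch; the real CompositeProps layout and the full BidiMirroringGlyph record (which also carries paired-bracket data) are not shown:

```rust
/// Hypothetical u64 packing: low 32 bits hold the existing composite
/// properties, high 32 bits hold the mirrored code point (0 = none).
#[derive(Clone, Copy)]
struct CompositeProps64(u64);

impl CompositeProps64 {
    fn new(props32: u32, mirror: Option<char>) -> Self {
        let hi = mirror.map_or(0, |c| c as u64) << 32;
        Self(hi | props32 as u64)
    }
    fn props32(self) -> u32 {
        self.0 as u32
    }
    fn mirror(self) -> Option<char> {
        char::from_u32((self.0 >> 32) as u32).filter(|&c| c != '\0')
    }
}

fn main() {
    // '(' mirrors to ')' in right-to-left runs.
    let open_paren = CompositeProps64::new(0xABCD, Some(')'));
    assert_eq!(open_paren.props32(), 0xABCD);
    assert_eq!(open_paren.mirror(), Some(')'));
    // A character with no mirror keeps its properties and reports None.
    let a = CompositeProps64::new(0x1234, None);
    assert_eq!(a.props32(), 0x1234);
    assert_eq!(a.mirror(), None);
    println!("u64 packing ok");
}
```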

Contributor Author
Ah good point. Done.

@taj-p taj-p left a comment
LGTM! YAY! 🎉 !!! Let's goooooooooooooo! So happy to have ICU4X support in Parley 🥳 🏆 🏆 🏆

@conor-93 conor-93 requested a review from nicoburns December 9, 2025 01:12
@nicoburns nicoburns dismissed their stale review December 9, 2025 01:15

Outdated


impl Whitespace {
/// Returns true for space or no break space.
pub(crate) fn is_space_or_nbsp(self) -> bool {

Suggested change
pub(crate) fn is_space_or_nbsp(self) -> bool {
#[inline(always)]
pub(crate) fn is_space_or_nbsp(self) -> bool {

@taj-p taj-p enabled auto-merge December 9, 2025 01:48
@taj-p taj-p added this pull request to the merge queue Dec 9, 2025
Merged via the queue into linebender:main with commit 1013c37 Dec 9, 2025
24 checks passed