
Conversation


@conor-93 conor-93 commented Oct 20, 2025

Migration of text analysis from Swash → ICU4X

Overview

ICU4X enables text analysis and internationalisation. For Parley, this includes locale and language recognition,
bidirectional text evaluation, text segmentation, emoji recognition, NFC/NFD normalisation and other Unicode character information.

ICU4X is developed and maintained by a trusted authority in the space of text internationalisation: the ICU4X Technical Committee (ICU4X-TC) in the Unicode Consortium. It is targeted at resource-constrained environments. For Parley, this means:

  • The potential for full locale support for complex line breaking cases (not supported by Swash).
  • Reliable and up-to-date Unicode data.
  • Reasonable performance and memory footprint (with the possibility of future improvements).
  • Full decoupling from Swash (following decoupling for shaping behaviour earlier this year); a significant offloading of maintenance effort.

Notable changes

  • Removal of first-party bidi embed level resolution logic.
  • select_font emoji detection improvements: flag emoji ("🇺🇸") and keycap sequences (e.g. 0️⃣ through 9️⃣) are now supported in cluster detection; Swash did not support these.
  • Slightly more up-to-date Unicode data than Swash (e.g. a few more Scripts).
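The two newly supported cluster shapes can be recognised with plain character checks. The following is a standalone sketch of the relevant Unicode rules (regional-indicator pairs and keycap sequences), not Parley's actual detection code:

```rust
/// Returns true if `c` is a Regional Indicator symbol (U+1F1E6..=U+1F1FF);
/// two of these in a row form a flag emoji such as "🇺🇸".
fn is_regional_indicator(c: char) -> bool {
    ('\u{1F1E6}'..='\u{1F1FF}').contains(&c)
}

/// Returns true if `s` starts with a keycap sequence: a base character
/// ('0'..='9', '#' or '*'), an optional U+FE0F (emoji presentation
/// selector), then U+20E3 (COMBINING ENCLOSING KEYCAP).
fn starts_with_keycap(s: &str) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        Some(c) if c.is_ascii_digit() || c == '#' || c == '*' => {}
        _ => return false,
    }
    match chars.next() {
        Some('\u{FE0F}') => chars.next() == Some('\u{20E3}'),
        Some('\u{20E3}') => true,
        _ => false,
    }
}

fn main() {
    let flag = "🇺🇸";
    assert!(flag.chars().all(is_regional_indicator));
    assert!(starts_with_keycap("0\u{FE0F}\u{20E3}")); // 0️⃣
    assert!(!starts_with_keycap("abc"));
    println!("emoji checks passed");
}
```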

Performance/binary size

  • Binary size for vello_editor is ~100kB larger (9720kB vs 9620kB).
    There is a performance regression (~7%), but with optimisations (composite trie data sources, further minimisation of allocations/iterations) it is much smaller than the original ~55%:
Default Style - arabic 20 characters               [   9.9 us ...  11.0 us ]     +10.91%*
Default Style - latin 20 characters                [   4.3 us ...   4.8 us ]     +10.88%*
Default Style - japanese 20 characters             [   8.5 us ...   9.2 us ]      +8.04%*
Default Style - arabic 1 paragraph                 [  59.1 us ...  63.7 us ]      +7.70%*
Default Style - latin 1 paragraph                  [  18.8 us ...  20.8 us ]     +10.44%*
Default Style - japanese 1 paragraph               [  74.4 us ...  78.8 us ]      +5.80%*
Default Style - arabic 4 paragraph                 [ 253.6 us ... 269.6 us ]      +6.30%*
Default Style - latin 4 paragraph                  [  79.7 us ...  86.8 us ]      +8.97%*
Default Style - japanese 4 paragraph               [ 102.8 us ... 107.9 us ]      +4.98%*
Styled - arabic 20 characters                      [  11.1 us ...  12.2 us ]      +9.79%*
Styled - latin 20 characters                       [   5.6 us ...   6.2 us ]     +10.34%*
Styled - japanese 20 characters                    [   9.4 us ...  10.1 us ]      +7.41%*
Styled - arabic 1 paragraph                        [  60.1 us ...  65.0 us ]      +8.04%*
Styled - latin 1 paragraph                         [  21.8 us ...  23.9 us ]      +9.66%*
Styled - japanese 1 paragraph                      [  84.2 us ...  87.4 us ]      +3.79%*
Styled - arabic 4 paragraph                        [ 270.5 us ... 288.5 us ]      +6.66%*
Styled - latin 4 paragraph                         [  85.4 us ...  94.1 us ]     +10.17%*
Styled - japanese 4 paragraph                      [ 117.2 us ... 123.5 us ]      +5.39%*

As noted in #436 (comment), I think that we're getting close to maximising the efficiency of the current APIs offered by ICU. This can be seen by inspecting the text analysis profile:

[image: text analysis profile]

Further optimisation of text analysis may require delving into ICU/unicode-bidi internals to, for example:

  1. combine line and word boundary calculations (rather than having them run separately). Chad may have ideas on further improvement.
  2. pass in character boundary information from our composite properties Trie; ICU internally performs multiple lookups for identical characters.
  3. pass in bidi class information to unicode-bidi to prevent redundant lookups.
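The "composite properties" idea mentioned above packs several per-character Unicode properties into one integer so a single trie lookup answers every question the analysis pass asks. A rough illustration follows; the field layout, widths, and names here are invented for the sketch, not Parley's actual CompositeProps:

```rust
/// Hypothetical packed per-character properties: one lookup yields
/// everything the analysis pass needs, instead of one lookup per property.
#[derive(Clone, Copy)]
struct CompositeProps(u32);

impl CompositeProps {
    const BIDI_SHIFT: u32 = 0; // bits 0..5:  bidi class
    const WB_SHIFT: u32 = 5;   // bits 5..10: word-break property
    const LB_SHIFT: u32 = 10;  // bits 10..16: line-break property

    fn new(bidi: u32, wb: u32, lb: u32) -> Self {
        Self(bidi << Self::BIDI_SHIFT | wb << Self::WB_SHIFT | lb << Self::LB_SHIFT)
    }
    fn bidi_class(self) -> u32 {
        (self.0 >> Self::BIDI_SHIFT) & 0x1F
    }
    fn word_break(self) -> u32 {
        (self.0 >> Self::WB_SHIFT) & 0x1F
    }
    fn line_break(self) -> u32 {
        (self.0 >> Self::LB_SHIFT) & 0x3F
    }
}

fn main() {
    // One packed value stands in for three separate property lookups.
    let p = CompositeProps::new(1, 7, 33);
    assert_eq!(p.bidi_class(), 1);
    assert_eq!(p.word_break(), 7);
    assert_eq!(p.line_break(), 33);
    println!("packed lookup ok");
}
```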

Other details

  • Swash's Language parsing is more tolerant, e.g. it permits extra, invalid subtags (like in "en-Latn-US-a-b-c-d").
  • Segmenters (line, word, grapheme) are currently content-aware, and can be used without specifying a locale. However, if we plug locale data in at runtime, we can construct segmenters to target a specific locale, rather than inferring from content (which would be the most correct approach for targeting said locale).
    • The full set of locale data (even with ICU4X's deduplication) is heavy, totalling ~2.5MB (in vello_editor compilation testing). In order to potentially support correct word breaking across all languages, without seeing a huge compilation size increase, we would need a way for users to attach only the locale data they need at runtime. This locale data could be generated (with icu4x-datagen) and attached (using DataProviders) at runtime in the future.
    • Without full locale support, line and word breaking use Unicode rule-based approaches UAX #14 and #29 respectively (at parity with Swash).
  • Swash's support for alternating word break strength is maintained by breaking text into windows (which look back/forward an extra character for context) and performing segmentation on each window separately, as ICU4X doesn't natively support variable word break strength when segmenting.
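The windowing approach can be sketched as follows. The segmenter here is a trivial whitespace stand-in for ICU4X's real word segmenter, and the run ranges and helper names are invented; only the shape (extend each window one character back/forward for context, then keep the boundaries that fall inside the run) reflects the description above:

```rust
/// Stand-in segmenter: returns byte offsets of boundaries in `text`
/// (here, simply the position after each ASCII space).
fn segment(text: &str) -> Vec<usize> {
    text.char_indices()
        .filter(|&(_, c)| c == ' ')
        .map(|(i, _)| i + 1)
        .collect()
}

/// Segment `text` per style run (byte ranges), extending each window by
/// one char of lookback/lookahead so the segmenter sees context, then
/// keeping only the boundaries that fall inside the run itself.
fn segment_runs(text: &str, runs: &[std::ops::Range<usize>]) -> Vec<usize> {
    let mut boundaries = Vec::new();
    for run in runs {
        // Extend the window one char back and one char forward.
        let start = text[..run.start]
            .char_indices()
            .next_back()
            .map_or(run.start, |(i, _)| i);
        let end = text[run.end..]
            .chars()
            .next()
            .map_or(run.end, |c| run.end + c.len_utf8());
        for b in segment(&text[start..end]) {
            let abs = start + b;
            if abs > run.start && abs <= run.end {
                boundaries.push(abs);
            }
        }
    }
    boundaries
}

fn main() {
    let text = "aa bb cc";
    // Two style runs covering the whole string; the boundary after "bb "
    // is found by the second window even though the space that produces
    // it sits at the edge of the first run.
    let runs = [0..5, 5..8];
    assert_eq!(segment_runs(text, &runs), vec![3, 6]);
    println!("windowed segmentation ok");
}
```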

Future Work

  • We could also support bring-your-own-data for Unicode character information too, for users only interested in narrow character sets (e.g. basic Latin), for a small compilation size improvement (not sure how much exactly).
  • Feature flagging which locale data to bake into the binary.
  • Allow hot swapping unicode character data at runtime. For example, if you start off shaping en but then need to shape some ar, we could inform the consumer that they need to provide ar property data.

- condense all byte indexes to char indexes in a single loop
- track a minimal set of LineSegmenters (per LineBreakWordOption), and create as needed
- clean up tests
- add tests for multi-character graphemes
- fix incorrect start truncation for multi-style strings which aren't multi-wb style + test for this
- test naming/grouping
- compute `force_normalize`
- simplify ClusterInfo to just `is_emoji`
- more clean-up
@conor-93 conor-93 requested a review from taj-p December 8, 2025 05:32
};

needs_bidi_resolution |= BidiResolver::needs_bidi_resolution(bidi_class);
let bracket = lcx.analysis_data_sources.brackets().get(ch);

Could you please add a TODO here to consider making CompositeProps a u64 and baking BidiMirroringGlyph into it to avoid this lookup?
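The suggested widening could look roughly like this: keep the existing 32-bit composite properties in the low half of a u64 and bake the mirrored code point into the high half, so one trie lookup answers both questions. This is a hypothetical sketch; the real CompositeProps layout and the full BidiMirroringGlyph record (which also carries paired-bracket data) are not shown:

```rust
/// Hypothetical u64 packing: low 32 bits hold the existing composite
/// properties, high 32 bits hold the mirrored code point (0 = none).
#[derive(Clone, Copy)]
struct CompositeProps64(u64);

impl CompositeProps64 {
    fn new(props32: u32, mirror: Option<char>) -> Self {
        let hi = mirror.map_or(0, |c| c as u64) << 32;
        Self(hi | props32 as u64)
    }
    fn props32(self) -> u32 {
        self.0 as u32
    }
    fn mirror(self) -> Option<char> {
        char::from_u32((self.0 >> 32) as u32).filter(|&c| c != '\0')
    }
}

fn main() {
    // '(' mirrors to ')' in right-to-left runs.
    let open_paren = CompositeProps64::new(0xABCD, Some(')'));
    assert_eq!(open_paren.props32(), 0xABCD);
    assert_eq!(open_paren.mirror(), Some(')'));
    // A character with no mirror keeps its properties and reports None.
    let a = CompositeProps64::new(0x1234, None);
    assert_eq!(a.props32(), 0x1234);
    assert_eq!(a.mirror(), None);
    println!("u64 packing ok");
}
```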

Contributor Author
Ah good point. Done.

@taj-p taj-p left a comment
LGTM! YAY! 🎉 !!! Let's goooooooooooooo! So happy to have ICU4X support in Parley 🥳 🏆 🏆 🏆

@conor-93 conor-93 requested a review from nicoburns December 9, 2025 01:12
@nicoburns nicoburns dismissed their stale review December 9, 2025 01:15

Outdated


impl Whitespace {
/// Returns true for space or no break space.
pub(crate) fn is_space_or_nbsp(self) -> bool {

Suggested change
pub(crate) fn is_space_or_nbsp(self) -> bool {
#[inline(always)]
pub(crate) fn is_space_or_nbsp(self) -> bool {

@taj-p taj-p enabled auto-merge December 9, 2025 01:48
@taj-p taj-p added this pull request to the merge queue Dec 9, 2025
Merged via the queue into linebender:main with commit 1013c37 Dec 9, 2025
24 checks passed