Merge pull request #136 from n8willis/arabic-early-mark-reorder

This merges in the Arabic-model "early mark reorder" change to Arabic, Syriac, N'Ko, and Mongolian; moving the AMTRA transient-reordering of adjacent MCM marks to the first stage. This provides better alignment with HarfBuzz and we think it preserves behavior with other shapers as well. The attempt is made to keep AMTRA reordering distinct from Unicode NFD/NFC normalization (although not all implementers are expected to care about that distinction themselves), while still keeping `ccmp` and `locl` working as expected. See the discussion in #122 for more detail.
n8willis · Sep 7, 2021 · a0fabe2 · a0fabe2
2 parents e118787 + 5c9a7db
commit a0fabe2
Show file tree

Hide file tree

Showing 6 changed files with 452 additions and 409 deletions.
diff --git a/opentype-shaping-arabic-general.md b/opentype-shaping-arabic-general.md
@@ -13,12 +13,12 @@ Syriac, and Mongolian.
 	  - [Mark classification](#mark-classification)
 	  - [Character tables](#character-tables)
   - [The `<arab>` shaping model](#the-arab-shaping-model)
-      - [1. Compound character composition and decomposition](#1-compound-character-composition-and-decomposition)
-      - [2. Computing letter joining states](#2-computing-letter-joining-states)
-      - [3. Applying the `stch` feature](#3-applying-the-stch-feature)
-      - [4. Applying the language-form substitution features from GSUB](#4-applying-the-language-form-substitution-features-from-gsub)
-      - [5. Applying the typographic-form substitution features from GSUB](#5-applying-the-typographic-form-substitution-features-from-gsub)
-      - [6. Mark reordering](#6-mark-reordering)
+      - [1. Transient reordering of modifier combining marks](#1-transient-reordering-of-modifier-combining-marks)
+      - [2. Compound character composition and decomposition](#2-compound-character-composition-and-decomposition)
+      - [3. Computing letter joining states](#3-computing-letter-joining-states)
+      - [4. Applying the `stch` feature](#4-applying-the-stch-feature)
+      - [5. Applying the language-form substitution features from GSUB](#5-applying-the-language-form-substitution-features-from-gsub)
+      - [6. Applying the typographic-form substitution features from GSUB](#6-applying-the-typographic-form-substitution-features-from-gsub)
       - [7. Applying the positioning features from GPOS](#7-applying-the-positioning-features-from-gpos)
 
 
@@ -165,7 +165,8 @@ The sets are:
   - Below-base (class 220) MCMs
   - Above-base (class 230) MCMs
 
-These classifications are used in the [mark-reordering stage](#6-mark-reordering).
+These classifications are used in the [mark-transient-reordering
+stage](#1-transient-reordering-of-modifier-combining-marks).
 
 Lists of the marks that belong to each MCM classes are included in the
 script-specific shaping documents for Arabic and Syriac.
@@ -184,36 +185,91 @@ characters, are available here:
 
 ## The general Arabic-based shaping model ##
 
-Processing a run of `<arab>` text involves seven top-level stages:
+Processing a run of text tagged with any of the scripts supported by
+the general Arabic shaping model involves seven top-level stages:
 
-1. Compound character composition and decomposition
-2. Computing letter joining states
-3. Applying the `stch` feature
-4. Applying the language-form substitution features from GSUB
-5. Applying the typographic-form substitution features from GSUB
-6. Mark reordering
+1. Transient reordering of modifier combining marks
+2. Compound character composition and decomposition
+3. Computing letter joining states
+4. Applying the `stch` feature
+5. Applying the language-form substitution features from GSUB
+6. Applying the typographic-form substitution features from GSUB
 7. Applying the positioning features from GPOS
 
 
-### 1. Compound character composition and decomposition ###
+### 1. Transient reordering of modifier combining marks ###
+
+<!--- http://www.unicode.org/reports/tr53/tr53-1.pdf --->
+> Note: the transient reordering of modifier combining marks is
+> necessary only for scripts that can feature the "Shadda" mark or
+> marks that belong to _Modifier Combining Marks_ (MCM) classes.
+
+Sequences of adjacent marks must be reordered so that they appear in
+the appropriate visual order before the mark-to-base and mark-to-mark
+positioning features from GPOS can be correctly applied.
+
+In particular, those marks that have strong affinity to the base
+character must be placed closest to the base.
+
+This mark-reordering operation is distinct from the standard,
+cross-script mark-reordering performed during Unicode
+normalization. The standard Unicode mark-reordering algorithm is based
+on comparing the _Canonical_Combining_Class_ (Ccc) properties of mark
+codepoints, whereas this script-specific reordering utilizes the
+_Modifier_Combining_Mark_ (`MCM`) subclasses specified in the
+character tables.
+
+The algorithm for reordering a sequence of marks is:
+
+  - First, move any "Shadda" (combining class `33`) characters to the
+    beginning of the mark sequence.
+
+  -	Second, move any subsequence of combining-class-`230` characters that begins
+       with a `230_MCM` character to the beginning of the sequence,
+       before all "Shadda" characters. The subsequence must be moved
+       as a group.
+
+  - Finally, move any subsequence of combining-class-`220` characters that begins
+       with a `220_MCM` character to the beginning of the sequence,
+       before all "Shadda" characters and before all class-`230`
+       characters. The subsequence must be moved as a group.
+
+> Note: Unicode describes this mark-reordering operation, the Arabic
+> Mark Transient Reordering Algorithm (AMTRA), in Technical Report 53,
+> which describes it in terms that are distinct from standard,
+> Ccc-based mark reordering.
+>
+> Specifically, AMTRA is designated as an operation performed during
+> text rendering only, which therefore does not impact other
+> Unicode-compliance issues such as allowable input sequences or text
+> encoding.
+>
+> However, shaping engines may choose to perform the reordering of
+> modifier combining marks in conjunction with their Unicode
+> normalization functionality for increased efficiency.
+
+### 2. Compound character composition and decomposition ###
 
 The `ccmp` feature allows a font to substitute
 
  - mark-and-base sequences with a pre-composed glyph including both
-   the mark and the base (as is done in with a ligature substitution)
- - individual compound glyphs with the equivalent sequence of
-   decomposed glyphs
+    the mark and the base (as is done in with a ligature substitution)
+
+  - individual compound glyphs with the equivalent sequence of
+    decomposed glyphs (such as decomposing a letter with ijam into a
+    separate fundamental-letter glyph followed by an ijam-only glyph,
+    to permit more precise positioning)
 
 If present, these composition and decomposition substitutions must be
 performed before applying any other GSUB or GPOS lookups, because
 those lookups may be written to match only the `ccmp`-substituted
 glyphs. 
 
 
-### 2. Computing letter joining states ###
+### 3. Computing letter joining states ###
 
 In order to correctly apply the initial, medial, and final form
-substitutions from GSUB during stage 5, the shaping engine must
+substitutions from GSUB during stage 6, the shaping engine must
 tag every letter for possible application of the appropriate feature.
 
 > Note: not all of the rules detailed below apply to every script that
@@ -307,39 +363,10 @@ next word.
 > implement the joining-state computation as a state machine, in a lookup
 > table, or by any other means desirable.
 
-<!--- HarfBuzz state table:
-
-Tag for current character:
-
-| Preceding   | NON_JOINING | LEFT | RIGHT | DUAL or JOIN_CAUSING | syrcAL | syrcDR |
-|:------------|:------------|:-----|:------|:---------------------|:-------|:-------|
-| Current     | | | | | | |
-| NON_JOINING | _none_      |      |       |                      |        |        |
-| LEFT        | `isol`      |      |       |                      |        |        |
-| RIGHT       |             |      |       |                      |        |        |
-| DUAL/CAUS   |             |      |       |                      |        |        |
-| syrcAL      |             |      |       |                      |        |        |
-| syrcDR      |             |      |       |                      |        |        |
-
-
-Updated tag for preceding character:
-
-| Preceding   | NON_JOINING | LEFT | RIGHT | DUAL or JOIN_CAUSING | syrcAL | syrcDR |
-|:------------|:------------|:-----|:------|:---------------------|:-------|:-------|
-| Current     | | | | | | |
-| NON_JOINING |             |      |       |                      |        |        |
-| LEFT        |             |      |       |                      |        |        |
-| RIGHT       |             |      |       |                      |        |        |
-| DUAL/CAUS   |             |      |       |                      |        |        |
-| syrcAL      |             |      |       |                      |        |        |
-| syrcDR      |             |      |       |                      |        |        |
-
---->
-
 At the end of this process, all letters should be tagged for possible
 substitution by one of the `isol`, `init`, `medi`, or `fina` features.
 
-### 3. Applying the `stch` feature ###
+### 4. Applying the `stch` feature ###
 
 The `stch` feature decomposes and stretches special marks that are
 meant to extend to the full width of words to which they are
@@ -375,7 +402,7 @@ Finally, the decomposed mark must be reordered as follows:
     the word.
 
 
-### 4. Applying the language-form substitution features from GSUB ###
+### 5. Applying the language-form substitution features from GSUB ###
 
 The language-substitution phase applies mandatory substitution
 features using the rules in the font's GSUB table. In preparation for
@@ -409,7 +436,7 @@ for script-specific information.
 > GSUB substitutions in the following steps.
 
 
-### 5. Applying the typographic-form substitution features from GSUB ###
+### 6. Applying the typographic-form substitution features from GSUB ###
 
 The typographic-substitution phase applies optional substitution
 features using the rules in the font's GSUB table.
@@ -426,32 +453,6 @@ See the individual script pages for further detail on each feature and
 for script-specific information.
 
 
-### 6. Mark reordering ###
-
-<!--- http://www.unicode.org/reports/tr53/tr53-1.pdf --->
-
-Sequences of adjacent marks must be reordered so that they appear in
-canonical order before the mark-to-base and mark-to-mark positioning
-features from GPOS can be correctly applied.
-
-In particular, those marks that have strong affinity to the base
-character must be placed closest to the base.
-
-The algorithm for reordering a sequence of marks is:
-
-  - First, move any "Shadda" (combining class `33`) characters to the
-    beginning of the mark sequence.
-
-  -	Second, move any subsequence of combining-class-`230` characters that begins
-       with a `230_MCM` character to the beginning of the sequence,
-       before all "Shadda" characters. The subsequence must be moved
-       as a group.
-
-  - Finally, move any subsequence of combining-class-`220` characters that begins
-       with a `220_MCM` character to the beginning of the sequence,
-       before all "Shadda" characters and before all class-`230`
-       characters. The subsequence must be moved as a group.
-
 ### 7. Applying the positioning features from GPOS ###
 
 The positioning stage adjusts the positions of mark and base