Skip to content

Commit

Permalink
Merge pull request #136 from n8willis/arabic-early-mark-reorder
Browse files Browse the repository at this point in the history
This merges in the Arabic-model "early mark reorder" change to Arabic, Syriac, N'Ko, and Mongolian; moving the AMTRA transient-reordering of adjacent MCM marks to the first stage. This provides better alignment with HarfBuzz and we think it preserves behavior with other shapers as well. The attempt is made to keep AMTRA reordering distinct from Unicode NFD/NFC normalization (although not all implementers are expected to care about that distinction themselves), while still keeping `ccmp` and `locl` working as expected. See the discussion in #122 for more detail.
  • Loading branch information
n8willis authored Sep 7, 2021
2 parents e118787 + 5c9a7db commit a0fabe2
Show file tree
Hide file tree
Showing 6 changed files with 452 additions and 409 deletions.
157 changes: 79 additions & 78 deletions opentype-shaping-arabic-general.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,12 +13,12 @@ Syriac, and Mongolian.
- [Mark classification](#mark-classification)
- [Character tables](#character-tables)
- [The `<arab>` shaping model](#the-arab-shaping-model)
- [1. Compound character composition and decomposition](#1-compound-character-composition-and-decomposition)
- [2. Computing letter joining states](#2-computing-letter-joining-states)
- [3. Applying the `stch` feature](#3-applying-the-stch-feature)
- [4. Applying the language-form substitution features from GSUB](#4-applying-the-language-form-substitution-features-from-gsub)
- [5. Applying the typographic-form substitution features from GSUB](#5-applying-the-typographic-form-substitution-features-from-gsub)
- [6. Mark reordering](#6-mark-reordering)
- [1. Transient reordering of modifier combining marks](#1-transient-reordering-of-modifier-combining-marks)
- [2. Compound character composition and decomposition](#2-compound-character-composition-and-decomposition)
- [3. Computing letter joining states](#3-computing-letter-joining-states)
- [4. Applying the `stch` feature](#4-applying-the-stch-feature)
- [5. Applying the language-form substitution features from GSUB](#5-applying-the-language-form-substitution-features-from-gsub)
- [6. Applying the typographic-form substitution features from GSUB](#6-applying-the-typographic-form-substitution-features-from-gsub)
- [7. Applying the positioning features from GPOS](#7-applying-the-positioning-features-from-gpos)


Expand Down Expand Up @@ -165,7 +165,8 @@ The sets are:
- Below-base (class 220) MCMs
- Above-base (class 230) MCMs

These classifications are used in the [mark-reordering stage](#6-mark-reordering).
These classifications are used in the [mark-transient-reordering
stage](#1-transient-reordering-of-modifier-combining-marks).

Lists of the marks that belong to each MCM classes are included in the
script-specific shaping documents for Arabic and Syriac.
Expand All @@ -184,36 +185,91 @@ characters, are available here:

## The general Arabic-based shaping model ##

Processing a run of `<arab>` text involves seven top-level stages:
Processing a run of text tagged with any of the scripts supported by
the general Arabic shaping model involves seven top-level stages:

1. Compound character composition and decomposition
2. Computing letter joining states
3. Applying the `stch` feature
4. Applying the language-form substitution features from GSUB
5. Applying the typographic-form substitution features from GSUB
6. Mark reordering
1. Transient reordering of modifier combining marks
2. Compound character composition and decomposition
3. Computing letter joining states
4. Applying the `stch` feature
5. Applying the language-form substitution features from GSUB
6. Applying the typographic-form substitution features from GSUB
7. Applying the positioning features from GPOS


### 1. Compound character composition and decomposition ###
### 1. Transient reordering of modifier combining marks ###

<!--- http://www.unicode.org/reports/tr53/tr53-1.pdf --->
> Note: the transient reordering of modifier combining marks is
> necessary only for scripts that can feature the "Shadda" mark or
> marks that belong to _Modifier Combining Marks_ (MCM) classes.
Sequences of adjacent marks must be reordered so that they appear in
the appropriate visual order before the mark-to-base and mark-to-mark
positioning features from GPOS can be correctly applied.

In particular, those marks that have strong affinity to the base
character must be placed closest to the base.

This mark-reordering operation is distinct from the standard,
cross-script mark-reordering performed during Unicode
normalization. The standard Unicode mark-reordering algorithm is based
on comparing the _Canonical_Combining_Class_ (Ccc) properties of mark
codepoints, whereas this script-specific reordering utilizes the
_Modifier_Combining_Mark_ (`MCM`) subclasses specified in the
character tables.

The algorithm for reordering a sequence of marks is:

- First, move any "Shadda" (combining class `33`) characters to the
beginning of the mark sequence.

- Second, move any subsequence of combining-class-`230` characters that begins
with a `230_MCM` character to the beginning of the sequence,
before all "Shadda" characters. The subsequence must be moved
as a group.

- Finally, move any subsequence of combining-class-`220` characters that begins
with a `220_MCM` character to the beginning of the sequence,
before all "Shadda" characters and before all class-`230`
characters. The subsequence must be moved as a group.

> Note: Unicode describes this mark-reordering operation, the Arabic
> Mark Transient Reordering Algorithm (AMTRA), in Technical Report 53,
> which describes it in terms that are distinct from standard,
> Ccc-based mark reordering.
>
> Specifically, AMTRA is designated as an operation performed during
> text rendering only, which therefore does not impact other
> Unicode-compliance issues such as allowable input sequences or text
> encoding.
>
> However, shaping engines may choose to perform the reordering of
> modifier combining marks in conjunction with their Unicode
> normalization functionality for increased efficiency.
### 2. Compound character composition and decomposition ###

The `ccmp` feature allows a font to substitute

- mark-and-base sequences with a pre-composed glyph including both
the mark and the base (as is done in with a ligature substitution)
- individual compound glyphs with the equivalent sequence of
decomposed glyphs
the mark and the base (as is done in with a ligature substitution)

- individual compound glyphs with the equivalent sequence of
decomposed glyphs (such as decomposing a letter with ijam into a
separate fundamental-letter glyph followed by an ijam-only glyph,
to permit more precise positioning)

If present, these composition and decomposition substitutions must be
performed before applying any other GSUB or GPOS lookups, because
those lookups may be written to match only the `ccmp`-substituted
glyphs.


### 2. Computing letter joining states ###
### 3. Computing letter joining states ###

In order to correctly apply the initial, medial, and final form
substitutions from GSUB during stage 5, the shaping engine must
substitutions from GSUB during stage 6, the shaping engine must
tag every letter for possible application of the appropriate feature.

> Note: not all of the rules detailed below apply to every script that
Expand Down Expand Up @@ -307,39 +363,10 @@ next word.
> implement the joining-state computation as a state machine, in a lookup
> table, or by any other means desirable.
<!--- HarfBuzz state table:
Tag for current character:
| Preceding | NON_JOINING | LEFT | RIGHT | DUAL or JOIN_CAUSING | syrcAL | syrcDR |
|:------------|:------------|:-----|:------|:---------------------|:-------|:-------|
| Current | | | | | | |
| NON_JOINING | _none_ | | | | | |
| LEFT | `isol` | | | | | |
| RIGHT | | | | | | |
| DUAL/CAUS | | | | | | |
| syrcAL | | | | | | |
| syrcDR | | | | | | |
Updated tag for preceding character:
| Preceding | NON_JOINING | LEFT | RIGHT | DUAL or JOIN_CAUSING | syrcAL | syrcDR |
|:------------|:------------|:-----|:------|:---------------------|:-------|:-------|
| Current | | | | | | |
| NON_JOINING | | | | | | |
| LEFT | | | | | | |
| RIGHT | | | | | | |
| DUAL/CAUS | | | | | | |
| syrcAL | | | | | | |
| syrcDR | | | | | | |
--->

At the end of this process, all letters should be tagged for possible
substitution by one of the `isol`, `init`, `medi`, or `fina` features.

### 3. Applying the `stch` feature ###
### 4. Applying the `stch` feature ###

The `stch` feature decomposes and stretches special marks that are
meant to extend to the full width of words to which they are
Expand Down Expand Up @@ -375,7 +402,7 @@ Finally, the decomposed mark must be reordered as follows:
the word.


### 4. Applying the language-form substitution features from GSUB ###
### 5. Applying the language-form substitution features from GSUB ###

The language-substitution phase applies mandatory substitution
features using the rules in the font's GSUB table. In preparation for
Expand Down Expand Up @@ -409,7 +436,7 @@ for script-specific information.
> GSUB substitutions in the following steps.

### 5. Applying the typographic-form substitution features from GSUB ###
### 6. Applying the typographic-form substitution features from GSUB ###

The typographic-substitution phase applies optional substitution
features using the rules in the font's GSUB table.
Expand All @@ -426,32 +453,6 @@ See the individual script pages for further detail on each feature and
for script-specific information.


### 6. Mark reordering ###

<!--- http://www.unicode.org/reports/tr53/tr53-1.pdf --->

Sequences of adjacent marks must be reordered so that they appear in
canonical order before the mark-to-base and mark-to-mark positioning
features from GPOS can be correctly applied.

In particular, those marks that have strong affinity to the base
character must be placed closest to the base.

The algorithm for reordering a sequence of marks is:

- First, move any "Shadda" (combining class `33`) characters to the
beginning of the mark sequence.

- Second, move any subsequence of combining-class-`230` characters that begins
with a `230_MCM` character to the beginning of the sequence,
before all "Shadda" characters. The subsequence must be moved
as a group.

- Finally, move any subsequence of combining-class-`220` characters that begins
with a `220_MCM` character to the beginning of the sequence,
before all "Shadda" characters and before all class-`230`
characters. The subsequence must be moved as a group.

### 7. Applying the positioning features from GPOS ###

The positioning stage adjusts the positions of mark and base
Expand Down
Loading

0 comments on commit a0fabe2

Please sign in to comment.