Skip to content

Commit

Permalink
Merge pull request #134 from n8willis/arabic-special-function
Browse files Browse the repository at this point in the history
Add special-purpose codepoint explanations to Arabic-General scripts.
  • Loading branch information
n8willis authored Aug 22, 2021
2 parents 6bbd4a7 + d6add41 commit e118787
Show file tree
Hide file tree
Showing 4 changed files with 245 additions and 31 deletions.
75 changes: 64 additions & 11 deletions opentype-shaping-arabic.md
Original file line number Diff line number Diff line change
Expand Up @@ -241,7 +241,7 @@ Codepoints with no assigned meaning are
designated as _unassigned_ in the _Unicode category_ column.


<!--- Character table example and explanation --->
#### Special-function codepoints ####

Other important characters that may be encountered when shaping runs
of Arabic text include the dotted-circle placeholder (`U+25CC`), the
Expand All @@ -250,17 +250,49 @@ and zero-width non-joiner (`U+200C`), the left-to-right text marker
(`U+200E`) and right-to-left text marker (`U+200F`), and the no-break
space (`U+00A0`).

Each of these is of particular importance to shaping engines, because
these codepoints interact with the shaping engine, the text run, and
the active font, either to mediate non-default shaping behavior or to
relay information about the current shaping process.

The dotted-circle placeholder is frequently used when displaying a
vowel or diacritical mark in isolation. Real-world text documents may
also use other characters, such as hyphens or dashes, in a similar
placeholder fashion; shaping engines should cope with this situation
gracefully.
combining mark in isolation. Real-world text documents may also use
other characters, such as hyphens or dashes, in a similar placeholder
fashion; shaping engines should cope with this situation gracefully.

Dotted-circle placeholder characters (like any Unicode codepoint) can
appear anywhere in text input sequences and should be rendered
normally. GPOS positioning lookups should attach mark glyphs to dotted
circles as they would to other non-mark characters. As visible glyphs,
dotted circles can also be involved in GSUB substitutions.

In addition to the default input-text handling process, shaping
engines may also insert dotted-circle placeholders into the text
sequence. Dotted-circle insertions are required when a non-spacing
mark or dependent sign is formed with no base character present.

This requirement covers:

- Dependent signs that are assigned their own individual Unicode
codepoints (such as most dependent-vowel marks or matras)

- Dependent signs that are formed only by specific sequences of
other codepoints (which is not common in Arabic but can occur in
other scripts)



The combining grapheme joiner (CGJ) is primarily used to alter the
order in which adjacent marks are positioned during the
mark-reordering stage, in order to adhere to the needs of a
non-default language orthography.
<!--- combining grapheme joiner explanation --->

By default, OpenType shaping reorders sequences of adjacent marks by
sorting the sequence on the marks' Canonical_Combining_Class (Ccc)
values. The presence of a CGJ character within a sequence of marks has
the effect of splitting the sequence into two sequences of marks and,
therefore, halting any mark-reordering that would have occurred
between the marks on either side of the CGJ.

The zero-width joiner (ZWJ) is primarily used to force the usage of the
cursive connecting form of a letter even when the context of the
Expand All @@ -271,17 +303,38 @@ as for displaying it in a table of forms), the sequence "_Letter_,ZWJ"
would be used. To show the medial form of a letter in isolation, the
sequence "ZWJ,_Letter_,ZWJ" would be used.

The zero-width non-joiner (ZWNJ) is primarily used to prevent a
cursive connection between two adjacent characters that would, under
normal circumstances, form a join.

<!--- Zero-Width Non Joiner explanation --->
The ZWJ and ZWNJ characters are, by definition, non-printing control
characters and have the _Default_Ignorable_ property in the Unicode
Character Database. In standard text-display scenarios, their function
is to signal a request from the user to the shaping engine for some
particular non-default behavior. As such, they are not rendered
visually.

The right-to-left mark (RLM) and left-to-right mark (LRM) are used by
the Unicode bidirectionality algorithm (BiDi) to indicate the points
in a text run at which the writing direction changes.
> Note: Naturally, there are special circumstances where a user or
> document might need to request that a ZWJ or ZWNJ be rendered
> visually, such as when illustrating the OpenType shaping process, or
> displaying Unicode tables.
Because the ZWJ and ZWNJ are non-printing control characters, they can
be ignored by any portion of a software text-handling stack not
involved in the shaping operations that the ZWJ and ZWNJ are designed
to interface with. For example, spell-checking or collation functions
will typically ignore ZWJ and ZWNJ.

<!--- How shaping is affected by the LTR and RTL markers explanation --->
Similarly, the ZWJ and ZWNJ should be ignored by the shaping engine
when matching sequences of codepoints against the backtrack and
lookahead sequences of a font's GSUB or GPOS lookups.


The right-to-left mark (RLM) and left-to-right mark (LRM) are used by
the Unicode bidirectionality algorithm (BiDi) to indicate the points
in a text run at which the writing direction changes. Generally
speaking RLM and LRM codepoints do not interact with shaping.

The no-break space is primarily used to display those codepoints that
are defined as non-spacing (such as vowel or diacritical marks and "Hamza") in an
isolated context, as an alternative to displaying them superimposed on
Expand Down
58 changes: 55 additions & 3 deletions opentype-shaping-mongolian.md
Original file line number Diff line number Diff line change
Expand Up @@ -234,17 +234,46 @@ Codepoints with no assigned meaning are
designated as _unassigned_ in the _Unicode category_ column.


#### Special-function codepoints ####

Other important characters that may be encountered when shaping runs
of Mongolian text include the dotted-circle placeholder (`U+25CC`),
the zero-width joiner (`U+200D`) and zero-width non-joiner (`U+200C`),
the no-break space (`U+00A0`) and the narrow no-break space(`U+202F`).

Each of these is of particular importance to shaping engines, because
these codepoints interact with the shaping engine, the text run, and
the active font, either to mediate non-default shaping behavior or to
relay information about the current shaping process.

The dotted-circle placeholder is frequently used when displaying a
diacritical mark in isolation. Real-world text documents may
combining mark in isolation. Real-world text documents may
also use other characters, such as hyphens or dashes, in a similar
placeholder fashion; shaping engines should cope with this situation
gracefully.

Dotted-circle placeholder characters (like any Unicode codepoint) can
appear anywhere in text input sequences and should be rendered
normally. GPOS positioning lookups should attach mark glyphs to dotted
circles as they would to other non-mark characters. As visible glyphs,
dotted circles can also be involved in GSUB substitutions.

In addition to the default input-text handling process, shaping
engines may also insert dotted-circle placeholders into the text
sequence. Dotted-circle insertions are required when a non-spacing
mark or dependent sign is formed with no base character present.

This requirement covers:

- Dependent signs that are assigned their own individual Unicode
codepoints (such as most dependent-vowel marks or matras)

- Dependent signs that are formed only by specific sequences of
other codepoints (which is not common in Mongolian but can occur in
other scripts)



The zero-width joiner (ZWJ) is primarily used to force the usage of the
cursive connecting form of a letter even when the context of the
adjoining letters would not trigger the connecting form.
Expand All @@ -254,8 +283,31 @@ as for displaying it in a table of forms), the sequence "_Letter_,ZWJ"
would be used. To show the medial form of a letter in isolation, the
sequence "ZWJ,_Letter_,ZWJ" would be used.


<!--- Zero-Width Non Joiner explanation --->
The zero-width non-joiner (ZWNJ) is primarily used to prevent a
cursive connection between two adjacent characters that would, under
normal circumstances, form a join.

The ZWJ and ZWNJ characters are, by definition, non-printing control
characters and have the _Default_Ignorable_ property in the Unicode
Character Database. In standard text-display scenarios, their function
is to signal a request from the user to the shaping engine for some
particular non-default behavior. As such, they are not rendered
visually.

> Note: Naturally, there are special circumstances where a user or
> document might need to request that a ZWJ or ZWNJ be rendered
> visually, such as when illustrating the OpenType shaping process, or
> displaying Unicode tables.
Because the ZWJ and ZWNJ are non-printing control characters, they can
be ignored by any portion of a software text-handling stack not
involved in the shaping operations that the ZWJ and ZWNJ are designed
to interface with. For example, spell-checking or collation functions
will typically ignore ZWJ and ZWNJ.

Similarly, the ZWJ and ZWNJ should be ignored by the shaping engine
when matching sequences of codepoints against the backtrack and
lookahead sequences of a font's GSUB or GPOS lookups.

The no-break space is primarily used to display those codepoints that
are defined as non-spacing (such as diacritical marks) in an
Expand Down
75 changes: 64 additions & 11 deletions opentype-shaping-nko.md
Original file line number Diff line number Diff line change
Expand Up @@ -170,7 +170,7 @@ Codepoints with no assigned meaning are
designated as _unassigned_ in the _Unicode category_ column.


<!--- Character table example and explanation --->
#### Special-function codepoints ####

Other important characters that may be encountered when shaping runs
of N'Ko text include the dotted-circle placeholder (`U+25CC`), the
Expand All @@ -179,17 +179,49 @@ and zero-width non-joiner (`U+200C`), the left-to-right text marker
(`U+200E`) and right-to-left text marker (`U+200F`), and the no-break
space (`U+00A0`).

Each of these is of particular importance to shaping engines, because
these codepoints interact with the shaping engine, the text run, and
the active font, either to mediate non-default shaping behavior or to
relay information about the current shaping process.

The dotted-circle placeholder is frequently used when displaying a
vowel or diacritical mark in isolation. Real-world text documents may
also use other characters, such as hyphens or dashes, in a similar
placeholder fashion; shaping engines should cope with this situation
gracefully.
combining mark in isolation. Real-world text documents may also use
other characters, such as hyphens or dashes, in a similar placeholder
fashion; shaping engines should cope with this situation gracefully.

Dotted-circle placeholder characters (like any Unicode codepoint) can
appear anywhere in text input sequences and should be rendered
normally. GPOS positioning lookups should attach mark glyphs to dotted
circles as they would to other non-mark characters. As visible glyphs,
dotted circles can also be involved in GSUB substitutions.

In addition to the default input-text handling process, shaping
engines may also insert dotted-circle placeholders into the text
sequence. Dotted-circle insertions are required when a non-spacing
mark or dependent sign is formed with no base character present.

This requirement covers:

- Dependent signs that are assigned their own individual Unicode
codepoints (such as most dependent-vowel marks or matras)

- Dependent signs that are formed only by specific sequences of
other codepoints (which is not common in N'Ko but can occur in
other scripts)



The combining grapheme joiner (CGJ) is primarily used to alter the
order in which adjacent marks are positioned during the
mark-reordering stage, in order to adhere to the needs of a
non-default language orthography.
<!--- combining grapheme joiner explanation --->

By default, OpenType shaping reorders sequences of adjacent marks by
sorting the sequence on the marks' Canonical_Combining_Class (Ccc)
values. The presence of a CGJ character within a sequence of marks has
the effect of splitting the sequence into two sequences of marks and,
therefore, halting any mark-reordering that would have occurred
between the marks on either side of the CGJ.

The zero-width joiner (ZWJ) is primarily used to force the usage of the
cursive connecting form of a letter even when the context of the
Expand All @@ -200,17 +232,38 @@ as for displaying it in a table of forms), the sequence "_Letter_,ZWJ"
would be used. To show the medial form of a letter in isolation, the
sequence "ZWJ,_Letter_,ZWJ" would be used.

The zero-width non-joiner (ZWNJ) is primarily used to prevent a
cursive connection between two adjacent characters that would, under
normal circumstances, form a join.

<!--- Zero-Width Non Joiner explanation --->
The ZWJ and ZWNJ characters are, by definition, non-printing control
characters and have the _Default_Ignorable_ property in the Unicode
Character Database. In standard text-display scenarios, their function
is to signal a request from the user to the shaping engine for some
particular non-default behavior. As such, they are not rendered
visually.

The right-to-left mark (RLM) and left-to-right mark (LRM) are used by
the Unicode bidirectionality algorithm (BiDi) to indicate the points
in a text run at which the writing direction changes.
> Note: Naturally, there are special circumstances where a user or
> document might need to request that a ZWJ or ZWNJ be rendered
> visually, such as when illustrating the OpenType shaping process, or
> displaying Unicode tables.
Because the ZWJ and ZWNJ are non-printing control characters, they can
be ignored by any portion of a software text-handling stack not
involved in the shaping operations that the ZWJ and ZWNJ are designed
to interface with. For example, spell-checking or collation functions
will typically ignore ZWJ and ZWNJ.

<!--- How shaping is affected by the LTR and RTL markers explanation --->
Similarly, the ZWJ and ZWNJ should be ignored by the shaping engine
when matching sequences of codepoints against the backtrack and
lookahead sequences of a font's GSUB or GPOS lookups.


The right-to-left mark (RLM) and left-to-right mark (LRM) are used by
the Unicode bidirectionality algorithm (BiDi) to indicate the points
in a text run at which the writing direction changes. Generally
speaking RLM and LRM codepoints do not interact with shaping.

The no-break space is primarily used to display those codepoints that
are defined as non-spacing (such as vowel or diacritical marks and "Hamza") in an
isolated context, as an alternative to displaying them superimposed on
Expand Down
Loading

0 comments on commit e118787

Please sign in to comment.