Khmer character tables

This document lists the per-character shaping information needed to shape Khmer text.

Khmer character table

Khmer codepoints should be classified as shown in the following table. Codepoints in the Khmer block that have no assigned meaning are marked as unassigned in the Unicode category column.

Assigned codepoints with a null in the Shaping class column evoke no special behavior from the shaping engine. Note that this includes some valid codepoints, such as currency marks, punctuation, and other symbols.

Note: the NUMBER and SYMBOL Shaping classes are important during syllable identification, but generally evoke no further special behavior during the rest of the shaping process.

The Mark-placement subclass column indicates mark-placement positioning for codepoints in the Mark category. Assigned, non-mark codepoints have a null in this column and evoke no special mark-placement behavior. Marks tagged with [Mn] in the Unicode category column are categorized as non-spacing; marks tagged with [Mc] are categorized as spacing-combining.

Some codepoints in the following table use a Shaping class that differs from the codepoint's Unicode General Category. The Shaping class takes precedence during OpenType shaping, as it captures more specific, script-aware behavior.
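As a rough illustration of that precedence, the Python sketch below pairs the Unicode General Category reported by the standard `unicodedata` module with a shaping class copied from a few rows of the table. The `SHAPING_CLASS` dictionary and the `describe` function are hypothetical names introduced only for this example; they are not part of any shaping engine.

```python
import unicodedata

# A handful of shaping classes copied from rows of the Khmer table.
# This is deliberately partial; it only illustrates the idea.
SHAPING_CLASS = {
    0x17C6: "NUKTA",                 # Nikahit
    0x17C7: "VISARGA",               # Reahmuk
    0x17CC: "CONSONANT_POST_REPHA",  # Robat
    0x17D2: "INVISIBLE_STACKER",     # Sign Coeng
    0x17DC: "AVAGRAHA",              # Avakrahasanya
}

def describe(cp):
    """Return (General Category, shaping class or None) for a codepoint."""
    return unicodedata.category(chr(cp)), SHAPING_CLASS.get(cp)

# Sign Coeng is a generic non-spacing mark to Unicode, but the shaping
# class is the value a shaper would consult.
print(describe(0x17D2))
```

A codepoint that is absent from the dictionary, such as Khan (U+17D4), comes back with `None` as its shaping class, which mirrors the null entries in the table.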

Codepoint Unicode category Shaping class Mark-placement subclass Glyph
U+1780 Letter CONSONANT null ក Ka
U+1781 Letter CONSONANT null ខ Kha
U+1782 Letter CONSONANT null គ Ko
U+1783 Letter CONSONANT null ឃ Kho
U+1784 Letter CONSONANT null ង Ngo
U+1785 Letter CONSONANT null ច Ca
U+1786 Letter CONSONANT null ឆ Cha
U+1787 Letter CONSONANT null ជ Co
U+1788 Letter CONSONANT null ឈ Cho
U+1789 Letter CONSONANT null ញ Nyo
U+178A Letter CONSONANT null ដ Da
U+178B Letter CONSONANT null ឋ Ttha
U+178C Letter CONSONANT null ឌ Do
U+178D Letter CONSONANT null ឍ Ttho
U+178E Letter CONSONANT null ណ Nno
U+178F Letter CONSONANT null ត Ta
U+1790 Letter CONSONANT null ថ Tha
U+1791 Letter CONSONANT null ទ To
U+1792 Letter CONSONANT null ធ Tho
U+1793 Letter CONSONANT null ន No
U+1794 Letter CONSONANT null ប Ba
U+1795 Letter CONSONANT null ផ Pha
U+1796 Letter CONSONANT null ព Po
U+1797 Letter CONSONANT null ភ Pho
U+1798 Letter CONSONANT null ម Mo
U+1799 Letter CONSONANT null យ Yo
U+179A Letter CONSONANT null រ Ro
U+179B Letter CONSONANT null ល Lo
U+179C Letter CONSONANT null វ Vo
U+179D Letter CONSONANT null ឝ Sha
U+179E Letter CONSONANT null ឞ Sso
U+179F Letter CONSONANT null ស Sa
U+17A0 Letter CONSONANT null ហ Ha
U+17A1 Letter CONSONANT null ឡ La
U+17A2 Letter CONSONANT null អ Qa
U+17A3 Letter VOWEL_INDEPENDENT null ឣ Qaq
U+17A4 Letter VOWEL_INDEPENDENT null ឤ Qaa
U+17A5 Letter VOWEL_INDEPENDENT null ឥ Qi
U+17A6 Letter VOWEL_INDEPENDENT null ឦ Qii
U+17A7 Letter VOWEL_INDEPENDENT null ឧ Qu
U+17A8 Letter VOWEL_INDEPENDENT null ឨ Quk
U+17A9 Letter VOWEL_INDEPENDENT null ឩ Quu
U+17AA Letter VOWEL_INDEPENDENT null ឪ Quuv
U+17AB Letter VOWEL_INDEPENDENT null ឫ Ry
U+17AC Letter VOWEL_INDEPENDENT null ឬ Ryy
U+17AD Letter VOWEL_INDEPENDENT null ឭ Ly
U+17AE Letter VOWEL_INDEPENDENT null ឮ Lyy
U+17AF Letter VOWEL_INDEPENDENT null ឯ Qe
U+17B0 Letter VOWEL_INDEPENDENT null ឰ Qai
U+17B1 Letter VOWEL_INDEPENDENT null ឱ Qoo Type One
U+17B2 Letter VOWEL_INDEPENDENT null ឲ Qoo Type Two
U+17B3 Letter VOWEL_INDEPENDENT null ឳ Qau
U+17B4 Mark [Mn] null null ឴ Inherent Aq
U+17B5 Mark [Mn] null null ឵ Inherent Aa
U+17B6 Mark [Mc] VOWEL_DEPENDENT RIGHT_POSITION ា Sign Aa
U+17B7 Mark [Mn] VOWEL_DEPENDENT TOP_POSITION ិ Sign I
U+17B8 Mark [Mn] VOWEL_DEPENDENT TOP_POSITION ី Sign Ii
U+17B9 Mark [Mn] VOWEL_DEPENDENT TOP_POSITION ឹ Sign Y
U+17BA Mark [Mn] VOWEL_DEPENDENT TOP_POSITION ឺ Sign Yy
U+17BB Mark [Mn] VOWEL_DEPENDENT BOTTOM_POSITION ុ Sign U
U+17BC Mark [Mn] VOWEL_DEPENDENT BOTTOM_POSITION ូ Sign Uu
U+17BD Mark [Mn] VOWEL_DEPENDENT BOTTOM_POSITION ួ Sign Ua
U+17BE Mark [Mc] VOWEL_DEPENDENT TOP_AND_LEFT_POSITION ើ Sign Oe
U+17BF Mark [Mc] VOWEL_DEPENDENT TOP_LEFT_AND_RIGHT_POSITION ឿ Sign Ya
U+17C0 Mark [Mc] VOWEL_DEPENDENT LEFT_AND_RIGHT_POSITION ៀ Sign Ie
U+17C1 Mark [Mc] VOWEL_DEPENDENT LEFT_POSITION េ Sign E
U+17C2 Mark [Mc] VOWEL_DEPENDENT LEFT_POSITION ែ Sign Ae
U+17C3 Mark [Mc] VOWEL_DEPENDENT LEFT_POSITION ៃ Sign Ai
U+17C4 Mark [Mc] VOWEL_DEPENDENT LEFT_AND_RIGHT_POSITION ោ Sign Oo
U+17C5 Mark [Mc] VOWEL_DEPENDENT LEFT_AND_RIGHT_POSITION ៅ Sign Au
U+17C6 Mark [Mn] NUKTA TOP_POSITION ំ Nikahit
U+17C7 Mark [Mc] VISARGA RIGHT_POSITION ះ Reahmuk
U+17C8 Mark [Mc] VOWEL_DEPENDENT RIGHT_POSITION ៈ Yuukaleapintu
U+17C9 Mark [Mn] REGISTER_SHIFTER TOP_POSITION ៉ Muusikatoan
U+17CA Mark [Mn] REGISTER_SHIFTER TOP_POSITION ៊ Triisap
U+17CB Mark [Mn] SYLLABLE_MODIFIER TOP_POSITION ់ Bantoc
U+17CC Mark [Mn] CONSONANT_POST_REPHA TOP_POSITION ៌ Robat
U+17CD Mark [Mn] CONSONANT_KILLER TOP_POSITION ៍ Toandakhiat
U+17CE Mark [Mn] SYLLABLE_MODIFIER TOP_POSITION ៎ Kakabat
U+17CF Mark [Mn] SYLLABLE_MODIFIER TOP_POSITION ៏ Ahsda
U+17D0 Mark [Mn] SYLLABLE_MODIFIER TOP_POSITION ័ Samyok Sannya
U+17D1 Mark [Mn] PURE_KILLER TOP_POSITION ៑ Viriam
U+17D2 Mark [Mn] INVISIBLE_STACKER null ្ Sign Coeng
U+17D3 Mark [Mn] SYLLABLE_MODIFIER TOP_POSITION ៓ Bathamasat
U+17D4 Punctuation null null ។ Khan
U+17D5 Punctuation null null ៕ Bariyoosan
U+17D6 Punctuation null null ៖ Camnuc Pii Kuuh
U+17D7 Letter null null ៗ Lek Too
U+17D8 Punctuation null null ៘ Beyyal
U+17D9 Punctuation null null ៙ Phnaek Muan
U+17DA Punctuation null null ៚ Koomuut
U+17DB Symbol SYMBOL null ៛ Riel
U+17DC Letter AVAGRAHA null ៜ Avakrahasanya
U+17DD Mark [Mn] SYLLABLE_MODIFIER TOP_POSITION ៝ Atthacan
U+17DE unassigned
U+17DF unassigned
U+17E0 Number NUMBER null ០ Digit Zero
U+17E1 Number NUMBER null ១ Digit One
U+17E2 Number NUMBER null ២ Digit Two
U+17E3 Number NUMBER null ៣ Digit Three
U+17E4 Number NUMBER null ៤ Digit Four
U+17E5 Number NUMBER null ៥ Digit Five
U+17E6 Number NUMBER null ៦ Digit Six
U+17E7 Number NUMBER null ៧ Digit Seven
U+17E8 Number NUMBER null ៨ Digit Eight
U+17E9 Number NUMBER null ៩ Digit Nine
U+17EA unassigned
U+17EB unassigned
U+17EC unassigned
U+17ED unassigned
U+17EE unassigned
U+17EF unassigned
U+17F0 Number null null ៰ Lek Attak Son
U+17F1 Number null null ៱ Lek Attak Muoy
U+17F2 Number null null ៲ Lek Attak Pii
U+17F3 Number null null ៳ Lek Attak Bei
U+17F4 Number null null ៴ Lek Attak Buon
U+17F5 Number null null ៵ Lek Attak Pram
U+17F6 Number null null ៶ Lek Attak Pram-Muoy
U+17F7 Number null null ៷ Lek Attak Pram-Pii
U+17F8 Number null null ៸ Lek Attak Pram-Bei
U+17F9 Number null null ៹ Lek Attak Pram-Buon
U+17FA unassigned
U+17FB unassigned
U+17FC unassigned
U+17FD unassigned
U+17FE unassigned
U+17FF unassigned
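
One way an implementation might encode the Shaping class column is as a list of codepoint ranges. The Python sketch below is illustrative only: `KHMER_RANGES` and `shaping_class` are hypothetical names, and the ranges are transcribed from the table above rather than taken from any particular shaping engine.

```python
# (first, last, shaping class), transcribed from the Khmer table.
KHMER_RANGES = [
    (0x1780, 0x17A2, "CONSONANT"),
    (0x17A3, 0x17B3, "VOWEL_INDEPENDENT"),
    (0x17B6, 0x17C5, "VOWEL_DEPENDENT"),
    (0x17C6, 0x17C6, "NUKTA"),
    (0x17C7, 0x17C7, "VISARGA"),
    (0x17C8, 0x17C8, "VOWEL_DEPENDENT"),
    (0x17C9, 0x17CA, "REGISTER_SHIFTER"),
    (0x17CB, 0x17CB, "SYLLABLE_MODIFIER"),
    (0x17CC, 0x17CC, "CONSONANT_POST_REPHA"),
    (0x17CD, 0x17CD, "CONSONANT_KILLER"),
    (0x17CE, 0x17D0, "SYLLABLE_MODIFIER"),
    (0x17D1, 0x17D1, "PURE_KILLER"),
    (0x17D2, 0x17D2, "INVISIBLE_STACKER"),
    (0x17D3, 0x17D3, "SYLLABLE_MODIFIER"),
    (0x17DB, 0x17DB, "SYMBOL"),
    (0x17DC, 0x17DC, "AVAGRAHA"),
    (0x17DD, 0x17DD, "SYLLABLE_MODIFIER"),
    (0x17E0, 0x17E9, "NUMBER"),
]

def shaping_class(cp):
    """Return the shaping class for a codepoint, or None for null rows,
    unassigned codepoints, and anything outside the Khmer block."""
    for first, last, cls in KHMER_RANGES:
        if first <= cp <= last:
            return cls
    return None
```

A linear scan is enough for a table this small; a larger table might call for a sorted array and binary search instead.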

Khmer Symbols character table

The Khmer Symbols block contains miscellaneous symbols used for lunar-date calendars. None evoke any special behavior from the shaping engine.

Codepoint Unicode category Shaping class Mark-placement subclass Glyph
U+19E0 Symbol null null ᧠ Pathamasat
U+19E1 Symbol null null ᧡ Muoy Koet
U+19E2 Symbol null null ᧢ Pii Koet
U+19E3 Symbol null null ᧣ Bei Koet
U+19E4 Symbol null null ᧤ Buon Koet
U+19E5 Symbol null null ᧥ Pram Koet
U+19E6 Symbol null null ᧦ Pram-Muoy Koet
U+19E7 Symbol null null ᧧ Pram-Pii Koet
U+19E8 Symbol null null ᧨ Pram-Bei Koet
U+19E9 Symbol null null ᧩ Pram-Buon Koet
U+19EA Symbol null null ᧪ Dap Koet
U+19EB Symbol null null ᧫ Dap-Muoy Koet
U+19EC Symbol null null ᧬ Dap-Pii Koet
U+19ED Symbol null null ᧭ Dap-Bei Koet
U+19EE Symbol null null ᧮ Dap-Buon Koet
U+19EF Symbol null null ᧯ Dap-Pram Koet
U+19F0 Symbol null null ᧰ Tuteyasat
U+19F1 Symbol null null ᧱ Muoy Roc
U+19F2 Symbol null null ᧲ Pii Roc
U+19F3 Symbol null null ᧳ Bei Roc
U+19F4 Symbol null null ᧴ Buon Roc
U+19F5 Symbol null null ᧵ Pram Roc
U+19F6 Symbol null null ᧶ Pram-Muoy Roc
U+19F7 Symbol null null ᧷ Pram-Pii Roc
U+19F8 Symbol null null ᧸ Pram-Bei Roc
U+19F9 Symbol null null ᧹ Pram-Buon Roc
U+19FA Symbol null null ᧺ Dap Roc
U+19FB Symbol null null ᧻ Dap-Muoy Roc
U+19FC Symbol null null ᧼ Dap-Pii Roc
U+19FD Symbol null null ᧽ Dap-Bei Roc
U+19FE Symbol null null ᧾ Dap-Buon Roc
U+19FF Symbol null null ᧿ Dap-Pram Roc

Miscellaneous character table

Other important characters that may be encountered when shaping runs of Khmer text include the dotted-circle placeholder (U+25CC), the zero-width joiner (U+200D) and zero-width non-joiner (U+200C), and the no-break space (U+00A0).

The dotted-circle placeholder is frequently used when displaying a dependent vowel (matra) or a combining mark in isolation. Real-world text syllables may also use other characters, such as hyphens or dashes, in a similar placeholder fashion; shaping engines should cope with this situation gracefully.

Codepoint Unicode category Shaping class Mark-placement subclass Glyph
U+00A0 Separator PLACEHOLDER null   No-break space
U+200C Other NON_JOINER null ‌ Zero-width non-joiner
U+200D Other JOINER null ‍ Zero-width joiner
U+2010 Punctuation PLACEHOLDER null ‐ Hyphen
U+2011 Punctuation PLACEHOLDER null ‑ No-break hyphen
U+2012 Punctuation PLACEHOLDER null ‒ Figure dash
U+2013 Punctuation PLACEHOLDER null – En dash
U+2014 Punctuation PLACEHOLDER null — Em dash
U+25CC Symbol DOTTED_CIRCLE null ◌ Dotted circle
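
The rows above could be folded into a small lookup so that dashes and the no-break space are accepted wherever the dotted circle would be. The following Python sketch assumes that reading; `MISC_CLASSES` and `can_carry_isolated_mark` are hypothetical names chosen for this example.

```python
# Shaping classes for the miscellaneous characters listed in the table.
MISC_CLASSES = {
    0x00A0: "PLACEHOLDER",    # No-break space
    0x200C: "NON_JOINER",     # Zero-width non-joiner
    0x200D: "JOINER",         # Zero-width joiner
    0x2010: "PLACEHOLDER",    # Hyphen
    0x2011: "PLACEHOLDER",    # No-break hyphen
    0x2012: "PLACEHOLDER",    # Figure dash
    0x2013: "PLACEHOLDER",    # En dash
    0x2014: "PLACEHOLDER",    # Em dash
    0x25CC: "DOTTED_CIRCLE",  # Dotted circle
}

def can_carry_isolated_mark(cp):
    """True if the codepoint may stand in as a base for an isolated mark."""
    return MISC_CLASSES.get(cp) in ("PLACEHOLDER", "DOTTED_CIRCLE")
```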

The zero-width joiner (ZWJ) is primarily used to prevent the formation of a conjunct from a "Consonant,Halant,Consonant" sequence. The sequence "Consonant,Halant,ZWJ,Consonant" blocks the formation of a conjunct between the two consonants.

Note, however, that the "Consonant,Halant" subsequence in the above example may still trigger a half-forms feature. To prevent the application of the half-forms feature in addition to preventing the conjunct, the zero-width non-joiner (ZWNJ) must be used instead. The sequence "Consonant,Halant,ZWNJ,Consonant" should produce the first consonant in its standard form, followed by an explicit "Halant".
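
The three cases described above can be summarized in a short Python sketch. It works on symbolic tokens (`"C"`, `"H"`, `"ZWJ"`, `"ZWNJ"`) rather than real codepoints, and the function name `joiner_effect` and its return shape are assumptions made for illustration, not an API of any shaping engine.

```python
def joiner_effect(tokens):
    """Report which forms remain possible for a Consonant,Halant,...,Consonant
    sequence. `tokens` is a list such as ["C", "H", "ZWJ", "C"]."""
    if len(tokens) < 3 or tokens[:2] != ["C", "H"] or tokens[-1] != "C":
        raise ValueError("expected C,H[,ZWJ|ZWNJ],C")
    middle = tokens[2:-1]
    if middle == []:
        # No joiner: the conjunct may form, with a half form as a fallback.
        return {"conjunct": True, "half_form": True}
    if middle == ["ZWJ"]:
        # ZWJ blocks the conjunct, but C,H may still trigger a half form.
        return {"conjunct": False, "half_form": True}
    if middle == ["ZWNJ"]:
        # ZWNJ blocks both, leaving the consonant plus an explicit Halant.
        return {"conjunct": False, "half_form": False}
    raise ValueError("unexpected tokens between Halant and Consonant")
```

The flags describe what is permitted, not what a given font will produce; whether a conjunct or half form actually appears depends on the lookups the font provides.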

A secondary usage of the zero-width joiner is to prevent the formation of "Reph". An initial "Ra,Halant,ZWJ" sequence should not produce a "Reph", where an initial "Ra,Halant" sequence without the zero-width joiner otherwise would.
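
A minimal Python sketch of that rule, again using symbolic tokens rather than codepoints, might look like the following. The name `initial_reph_allowed` is hypothetical.

```python
def initial_reph_allowed(tokens):
    """True if a syllable-initial "Ra,Halant" may form a Reph.

    `tokens` is a list of symbolic names such as ["Ra", "H", "C"].
    A ZWJ immediately after the Halant suppresses the Reph.
    """
    if tokens[:2] != ["Ra", "H"]:
        return False
    return not (len(tokens) > 2 and tokens[2] == "ZWJ")
```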

The no-break space (NBSP) is primarily used to display those codepoints that are defined as non-spacing (marks, dependent vowels (matras), below-base consonant forms, and post-base consonant forms) in an isolated context, as an alternative to displaying them superimposed on the dotted-circle placeholder. These sequences will match "NBSP,ZWJ,Halant,Consonant", "NBSP,mark", or "NBSP,matra".
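
The three NBSP patterns could be tested with a regular expression along the following lines. This Python sketch rests on two assumptions that the text does not state: the character classes are read off the Khmer character table in this document, and Sign Coeng (U+17D2) is treated as filling the "Halant" role. The names `ISOLATED_FORM` and `is_isolated_form` are hypothetical.

```python
import re

NBSP = "\u00A0"
ZWJ = "\u200D"

# Assumption: Sign Coeng (U+17D2) plays the "Halant" role here.
HALANT = "\u17D2"
CONSONANT = "[\u1780-\u17A2]"
MATRA = "[\u17B6-\u17C5\u17C8]"                   # VOWEL_DEPENDENT rows
MARK = "[\u17C6\u17C7\u17C9-\u17D1\u17D3\u17DD]"  # other classified marks

# NBSP,ZWJ,Halant,Consonant  or  NBSP,mark  or  NBSP,matra
ISOLATED_FORM = re.compile(
    NBSP + "(?:" + ZWJ + HALANT + CONSONANT + "|" + MARK + "|" + MATRA + ")"
)

def is_isolated_form(text):
    """True if `text` is exactly one of the three NBSP-based sequences."""
    return ISOLATED_FORM.fullmatch(text) is not None
```

For instance, `is_isolated_form("\u00A0\u17B6")` accepts an NBSP followed by Sign Aa, while an NBSP followed by a bare consonant is rejected.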

In addition to general punctuation, runs of Khmer text often use the danda (U+0964) and double danda (U+0965) punctuation marks from the Devanagari block.