This document lists the per-character shaping information needed to shape Khmer text.
Khmer glyphs should be classified as in the following table. Codepoints in the Khmer block with no assigned meaning are designated as unassigned in the Unicode category column.
Assigned codepoints with a null in the Shaping class column evoke no special behavior from the shaping engine. Note that this does include some valid codepoints, such as currency marks, punctuation, and other symbols.
Note: the NUMBER and SYMBOL shaping classes are important during syllable identification, but generally evoke no further special behavior during the rest of the shaping process.
The Mark-placement subclass column indicates mark-placement positioning for codepoints in the Mark category. Assigned, non-mark codepoints have a null in this column and evoke no special mark-placement behavior. Marks tagged with [Mn] in the Unicode category column are categorized as non-spacing; marks tagged with [Mc] are categorized as spacing-combining.
Some codepoints in the following table use a Shaping class that differs from the codepoint's Unicode General Category. The Shaping class takes precedence during OpenType shaping, as it captures more specific, script-aware behavior.
Codepoint | Unicode category | Shaping class | Mark-placement subclass | Glyph |
---|---|---|---|---|
U+1780 | Letter | CONSONANT | null | ក Ka |
U+1781 | Letter | CONSONANT | null | ខ Kha |
U+1782 | Letter | CONSONANT | null | គ Ko |
U+1783 | Letter | CONSONANT | null | ឃ Kho |
U+1784 | Letter | CONSONANT | null | ង Ngo |
U+1785 | Letter | CONSONANT | null | ច Ca |
U+1786 | Letter | CONSONANT | null | ឆ Cha |
U+1787 | Letter | CONSONANT | null | ជ Co |
U+1788 | Letter | CONSONANT | null | ឈ Cho |
U+1789 | Letter | CONSONANT | null | ញ Nyo |
U+178A | Letter | CONSONANT | null | ដ Da |
U+178B | Letter | CONSONANT | null | ឋ Ttha |
U+178C | Letter | CONSONANT | null | ឌ Do |
U+178D | Letter | CONSONANT | null | ឍ Ttho |
U+178E | Letter | CONSONANT | null | ណ Nno |
U+178F | Letter | CONSONANT | null | ត Ta |
U+1790 | Letter | CONSONANT | null | ថ Tha |
U+1791 | Letter | CONSONANT | null | ទ To |
U+1792 | Letter | CONSONANT | null | ធ Tho |
U+1793 | Letter | CONSONANT | null | ន No |
U+1794 | Letter | CONSONANT | null | ប Ba |
U+1795 | Letter | CONSONANT | null | ផ Pha |
U+1796 | Letter | CONSONANT | null | ព Po |
U+1797 | Letter | CONSONANT | null | ភ Pho |
U+1798 | Letter | CONSONANT | null | ម Mo |
U+1799 | Letter | CONSONANT | null | យ Yo |
U+179A | Letter | CONSONANT | null | រ Ro |
U+179B | Letter | CONSONANT | null | ល Lo |
U+179C | Letter | CONSONANT | null | វ Vo |
U+179D | Letter | CONSONANT | null | ឝ Sha |
U+179E | Letter | CONSONANT | null | ឞ Sso |
U+179F | Letter | CONSONANT | null | ស Sa |
U+17A0 | Letter | CONSONANT | null | ហ Ha |
U+17A1 | Letter | CONSONANT | null | ឡ La |
U+17A2 | Letter | CONSONANT | null | អ Qa |
U+17A3 | Letter | VOWEL_INDEPENDENT | null | ឣ Qaq |
U+17A4 | Letter | VOWEL_INDEPENDENT | null | ឤ Qaa |
U+17A5 | Letter | VOWEL_INDEPENDENT | null | ឥ Qi |
U+17A6 | Letter | VOWEL_INDEPENDENT | null | ឦ Qii |
U+17A7 | Letter | VOWEL_INDEPENDENT | null | ឧ Qu |
U+17A8 | Letter | VOWEL_INDEPENDENT | null | ឨ Quk |
U+17A9 | Letter | VOWEL_INDEPENDENT | null | ឩ Quu |
U+17AA | Letter | VOWEL_INDEPENDENT | null | ឪ Quuv |
U+17AB | Letter | VOWEL_INDEPENDENT | null | ឫ Ry |
U+17AC | Letter | VOWEL_INDEPENDENT | null | ឬ Ryy |
U+17AD | Letter | VOWEL_INDEPENDENT | null | ឭ Ly |
U+17AE | Letter | VOWEL_INDEPENDENT | null | ឮ Lyy |
U+17AF | Letter | VOWEL_INDEPENDENT | null | ឯ Qe |
U+17B0 | Letter | VOWEL_INDEPENDENT | null | ឰ Qai |
U+17B1 | Letter | VOWEL_INDEPENDENT | null | ឱ Qoo Type One |
U+17B2 | Letter | VOWEL_INDEPENDENT | null | ឲ Qoo Type Two |
U+17B3 | Letter | VOWEL_INDEPENDENT | null | ឳ Qau |
U+17B4 | Mark [Mn] | null | null | ឴ Inherent Aq |
U+17B5 | Mark [Mn] | null | null | ឵ Inherent Aa |
U+17B6 | Mark [Mc] | VOWEL_DEPENDENT | RIGHT_POSITION | ា Sign Aa |
U+17B7 | Mark [Mn] | VOWEL_DEPENDENT | TOP_POSITION | ិ Sign I |
U+17B8 | Mark [Mn] | VOWEL_DEPENDENT | TOP_POSITION | ី Sign Ii |
U+17B9 | Mark [Mn] | VOWEL_DEPENDENT | TOP_POSITION | ឹ Sign Y |
U+17BA | Mark [Mn] | VOWEL_DEPENDENT | TOP_POSITION | ឺ Sign Yy |
U+17BB | Mark [Mn] | VOWEL_DEPENDENT | BOTTOM_POSITION | ុ Sign U |
U+17BC | Mark [Mn] | VOWEL_DEPENDENT | BOTTOM_POSITION | ូ Sign Uu |
U+17BD | Mark [Mn] | VOWEL_DEPENDENT | BOTTOM_POSITION | ួ Sign Ua |
U+17BE | Mark [Mc] | VOWEL_DEPENDENT | TOP_AND_LEFT_POSITION | ើ Sign Oe |
U+17BF | Mark [Mc] | VOWEL_DEPENDENT | TOP_LEFT_AND_RIGHT_POSITION | ឿ Sign Ya |
U+17C0 | Mark [Mc] | VOWEL_DEPENDENT | LEFT_AND_RIGHT_POSITION | ៀ Sign Ie |
U+17C1 | Mark [Mc] | VOWEL_DEPENDENT | LEFT_POSITION | េ Sign E |
U+17C2 | Mark [Mc] | VOWEL_DEPENDENT | LEFT_POSITION | ែ Sign Ae |
U+17C3 | Mark [Mc] | VOWEL_DEPENDENT | LEFT_POSITION | ៃ Sign Ai |
U+17C4 | Mark [Mc] | VOWEL_DEPENDENT | LEFT_AND_RIGHT_POSITION | ោ Sign Oo |
U+17C5 | Mark [Mc] | VOWEL_DEPENDENT | LEFT_AND_RIGHT_POSITION | ៅ Sign Au |
U+17C6 | Mark [Mn] | NUKTA | TOP_POSITION | ំ Nikahit |
U+17C7 | Mark [Mc] | VISARGA | RIGHT_POSITION | ះ Reahmuk |
U+17C8 | Mark [Mc] | VOWEL_DEPENDENT | RIGHT_POSITION | ៈ Yuukaleapintu |
U+17C9 | Mark [Mn] | REGISTER_SHIFTER | TOP_POSITION | ៉ Muusikatoan |
U+17CA | Mark [Mn] | REGISTER_SHIFTER | TOP_POSITION | ៊ Triisap |
U+17CB | Mark [Mn] | SYLLABLE_MODIFIER | TOP_POSITION | ់ Bantoc |
U+17CC | Mark [Mn] | CONSONANT_POST_REPHA | TOP_POSITION | ៌ Robat |
U+17CD | Mark [Mn] | CONSONANT_KILLER | TOP_POSITION | ៍ Toandakhiat |
U+17CE | Mark [Mn] | SYLLABLE_MODIFIER | TOP_POSITION | ៎ Kakabat |
U+17CF | Mark [Mn] | SYLLABLE_MODIFIER | TOP_POSITION | ៏ Ahsda |
U+17D0 | Mark [Mn] | SYLLABLE_MODIFIER | TOP_POSITION | ័ Samyok Sannya |
U+17D1 | Mark [Mn] | PURE_KILLER | TOP_POSITION | ៑ Viriam |
U+17D2 | Mark [Mn] | INVISIBLE_STACKER | null | ្ Sign Coeng |
U+17D3 | Mark [Mn] | SYLLABLE_MODIFIER | TOP_POSITION | ៓ Bathamasat |
U+17D4 | Punctuation | null | null | ។ Khan |
U+17D5 | Punctuation | null | null | ៕ Bariyoosan |
U+17D6 | Punctuation | null | null | ៖ Camnuc Pii Kuuh |
U+17D7 | Letter | null | null | ៗ Lek Too |
U+17D8 | Punctuation | null | null | ៘ Beyyal |
U+17D9 | Punctuation | null | null | ៙ Phnaek Muan |
U+17DA | Punctuation | null | null | ៚ Koomuut |
U+17DB | Symbol | SYMBOL | null | ៛ Riel |
U+17DC | Letter | AVAGRAHA | null | ៜ Avakrahasanya |
U+17DD | Mark [Mn] | SYLLABLE_MODIFIER | TOP_POSITION | ៝ Atthacan |
U+17DE | unassigned | | | |
U+17DF | unassigned | | | |
U+17E0 | Number | NUMBER | null | ០ Digit Zero |
U+17E1 | Number | NUMBER | null | ១ Digit One |
U+17E2 | Number | NUMBER | null | ២ Digit Two |
U+17E3 | Number | NUMBER | null | ៣ Digit Three |
U+17E4 | Number | NUMBER | null | ៤ Digit Four |
U+17E5 | Number | NUMBER | null | ៥ Digit Five |
U+17E6 | Number | NUMBER | null | ៦ Digit Six |
U+17E7 | Number | NUMBER | null | ៧ Digit Seven |
U+17E8 | Number | NUMBER | null | ៨ Digit Eight |
U+17E9 | Number | NUMBER | null | ៩ Digit Nine |
U+17EA | unassigned | | | |
U+17EB | unassigned | | | |
U+17EC | unassigned | | | |
U+17ED | unassigned | | | |
U+17EE | unassigned | | | |
U+17EF | unassigned | | | |
U+17F0 | Number | null | null | ៰ Lek Attak Son |
U+17F1 | Number | null | null | ៱ Lek Attak Muoy |
U+17F2 | Number | null | null | ៲ Lek Attak Pii |
U+17F3 | Number | null | null | ៳ Lek Attak Bei |
U+17F4 | Number | null | null | ៴ Lek Attak Buon |
U+17F5 | Number | null | null | ៵ Lek Attak Pram |
U+17F6 | Number | null | null | ៶ Lek Attak Pram-Muoy |
U+17F7 | Number | null | null | ៷ Lek Attak Pram-Pii |
U+17F8 | Number | null | null | ៸ Lek Attak Pram-Bei |
U+17F9 | Number | null | null | ៹ Lek Attak Pram-Buon |
U+17FA | unassigned | | | |
U+17FB | unassigned | | | |
U+17FC | unassigned | | | |
U+17FD | unassigned | | | |
U+17FE | unassigned | | | |
U+17FF | unassigned | | | |
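The classification above can be captured as a small range table. The following is a minimal sketch, not a definitive implementation: the names `KHMER_CLASS_RANGES`, `shaping_class`, `is_unassigned`, and `describe` are hypothetical, and the ranges are transcribed from the table in this document. It also uses Python's standard `unicodedata` module to show the Unicode General Category next to the shaping class, since the shaping class is the value that takes precedence during shaping.

```python
import unicodedata

# (first, last, shaping class), transcribed from the Khmer block table above.
# Assigned codepoints that are not listed here have a null shaping class.
KHMER_CLASS_RANGES = [
    (0x1780, 0x17A2, "CONSONANT"),
    (0x17A3, 0x17B3, "VOWEL_INDEPENDENT"),
    (0x17B6, 0x17C5, "VOWEL_DEPENDENT"),
    (0x17C6, 0x17C6, "NUKTA"),
    (0x17C7, 0x17C7, "VISARGA"),
    (0x17C8, 0x17C8, "VOWEL_DEPENDENT"),
    (0x17C9, 0x17CA, "REGISTER_SHIFTER"),
    (0x17CB, 0x17CB, "SYLLABLE_MODIFIER"),
    (0x17CC, 0x17CC, "CONSONANT_POST_REPHA"),
    (0x17CD, 0x17CD, "CONSONANT_KILLER"),
    (0x17CE, 0x17D0, "SYLLABLE_MODIFIER"),
    (0x17D1, 0x17D1, "PURE_KILLER"),
    (0x17D2, 0x17D2, "INVISIBLE_STACKER"),
    (0x17D3, 0x17D3, "SYLLABLE_MODIFIER"),
    (0x17DB, 0x17DB, "SYMBOL"),
    (0x17DC, 0x17DC, "AVAGRAHA"),
    (0x17DD, 0x17DD, "SYLLABLE_MODIFIER"),
    (0x17E0, 0x17E9, "NUMBER"),
]

# Codepoints the table marks as unassigned.
UNASSIGNED_RANGES = [(0x17DE, 0x17DF), (0x17EA, 0x17EF), (0x17FA, 0x17FF)]


def shaping_class(cp):
    """Return the shaping class name for a codepoint, or None if it is null."""
    for first, last, name in KHMER_CLASS_RANGES:
        if first <= cp <= last:
            return name
    return None


def is_unassigned(cp):
    """Return True if the table lists this Khmer-block codepoint as unassigned."""
    return any(first <= cp <= last for first, last in UNASSIGNED_RANGES)


def describe(cp):
    """Return (Unicode General Category, shaping class) for comparison."""
    return unicodedata.category(chr(cp)), shaping_class(cp)
```

For example, `describe(0x17D2)` pairs the General Category of the Coeng sign with its INVISIBLE_STACKER shaping class, which is the value a shaping engine would act on.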
The Khmer Symbols block contains miscellaneous symbols used for lunar-date calendars. None evoke any special behavior from the shaping engine.
Codepoint | Unicode category | Shaping class | Mark-placement subclass | Glyph |
---|---|---|---|---|
U+19E0 | Symbol | null | null | ᧠ Pathamasat |
U+19E1 | Symbol | null | null | ᧡ Muoy Koet |
U+19E2 | Symbol | null | null | ᧢ Pii Koet |
U+19E3 | Symbol | null | null | ᧣ Bei Koet |
U+19E4 | Symbol | null | null | ᧤ Buon Koet |
U+19E5 | Symbol | null | null | ᧥ Pram Koet |
U+19E6 | Symbol | null | null | ᧦ Pram-Muoy Koet |
U+19E7 | Symbol | null | null | ᧧ Pram-Pii Koet |
U+19E8 | Symbol | null | null | ᧨ Pram-Bei Koet |
U+19E9 | Symbol | null | null | ᧩ Pram-Buon Koet |
U+19EA | Symbol | null | null | ᧪ Dap Koet |
U+19EB | Symbol | null | null | ᧫ Dap-Muoy Koet |
U+19EC | Symbol | null | null | ᧬ Dap-Pii Koet |
U+19ED | Symbol | null | null | ᧭ Dap-Bei Koet |
U+19EE | Symbol | null | null | ᧮ Dap-Buon Koet |
U+19EF | Symbol | null | null | ᧯ Dap-Pram Koet |
U+19F0 | Symbol | null | null | ᧰ Tuteyasat |
U+19F1 | Symbol | null | null | ᧱ Muoy Roc |
U+19F2 | Symbol | null | null | ᧲ Pii Roc |
U+19F3 | Symbol | null | null | ᧳ Bei Roc |
U+19F4 | Symbol | null | null | ᧴ Buon Roc |
U+19F5 | Symbol | null | null | ᧵ Pram Roc |
U+19F6 | Symbol | null | null | ᧶ Pram-Muoy Roc |
U+19F7 | Symbol | null | null | ᧷ Pram-Pii Roc |
U+19F8 | Symbol | null | null | ᧸ Pram-Bei Roc |
U+19F9 | Symbol | null | null | ᧹ Pram-Buon Roc |
U+19FA | Symbol | null | null | ᧺ Dap Roc |
U+19FB | Symbol | null | null | ᧻ Dap-Muoy Roc |
U+19FC | Symbol | null | null | ᧼ Dap-Pii Roc |
U+19FD | Symbol | null | null | ᧽ Dap-Bei Roc |
U+19FE | Symbol | null | null | ᧾ Dap-Buon Roc |
U+19FF | Symbol | null | null | ᧿ Dap-Pram Roc |
Other important characters that may be encountered when shaping runs of Khmer text include the dotted-circle placeholder (U+25CC), the zero-width joiner (U+200D) and zero-width non-joiner (U+200C), and the no-break space (U+00A0).
The dotted-circle placeholder is frequently used when displaying a dependent vowel (matra) or a combining mark in isolation. Real-world text syllables may also use other characters, such as hyphens or dashes, in a similar placeholder fashion; shaping engines should cope with this situation gracefully.
Codepoint | Unicode category | Shaping class | Mark-placement subclass | Glyph |
---|---|---|---|---|
U+00A0 | Separator | PLACEHOLDER | null | No-break space |
U+200C | Other | NON_JOINER | null | Zero-width non-joiner |
U+200D | Other | JOINER | null | Zero-width joiner |
U+2010 | Punctuation | PLACEHOLDER | null | ‐ Hyphen |
U+2011 | Punctuation | PLACEHOLDER | null | ‑ No-break hyphen |
U+2012 | Punctuation | PLACEHOLDER | null | ‒ Figure dash |
U+2013 | Punctuation | PLACEHOLDER | null | – En dash |
U+2014 | Punctuation | PLACEHOLDER | null | — Em dash |
U+25CC | Symbol | DOTTED_CIRCLE | null | ◌ Dotted circle |
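One way to cope with hyphens and dashes used as placeholders is to give them the same treatment as the no-break space and the dotted circle when looking for a syllable base. The sketch below is only an illustration under that assumption: `MISC_SHAPING_CLASS` and `can_stand_in_for_base` are hypothetical names, and the mapping is transcribed from the table above.

```python
# Shaping classes transcribed from the miscellaneous-characters table above.
MISC_SHAPING_CLASS = {
    0x00A0: "PLACEHOLDER",    # no-break space
    0x200C: "NON_JOINER",     # zero-width non-joiner
    0x200D: "JOINER",         # zero-width joiner
    0x2010: "PLACEHOLDER",    # hyphen
    0x2011: "PLACEHOLDER",    # no-break hyphen
    0x2012: "PLACEHOLDER",    # figure dash
    0x2013: "PLACEHOLDER",    # en dash
    0x2014: "PLACEHOLDER",    # em dash
    0x25CC: "DOTTED_CIRCLE",  # dotted circle
}


def can_stand_in_for_base(ch):
    """Return True if ch is classed as PLACEHOLDER or DOTTED_CIRCLE.

    Such characters may carry a dependent vowel or combining mark
    that is being displayed in isolation.
    """
    return MISC_SHAPING_CLASS.get(ord(ch)) in ("PLACEHOLDER", "DOTTED_CIRCLE")
```

The joiner and non-joiner are deliberately excluded: they influence shaping but do not act as a visible base.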
The zero-width joiner (ZWJ) is primarily used to prevent the formation of a conjunct from a "Consonant,Halant,Consonant" sequence. The sequence "Consonant,Halant,ZWJ,Consonant" blocks the formation of a conjunct between the two consonants.
Note, however, that the "Consonant,Halant" subsequence in the above example may still trigger a half-forms feature. To prevent the application of the half-forms feature in addition to preventing the conjunct, the zero-width non-joiner (ZWNJ) must be used instead. The sequence "Consonant,Halant,ZWNJ,Consonant" should produce the first consonant in its standard form, followed by an explicit "Halant".
A secondary usage of the zero-width joiner is to prevent the formation of "Reph". An initial "Ra,Halant,ZWJ" sequence should not produce a "Reph", where an initial "Ra,Halant" sequence without the zero-width joiner otherwise would.
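The joiner rules described above can be summarized as a small decision function. This is a sketch that works on abstract token names rather than real codepoints; the function names `permitted_forms` and `reph_permitted` and the token strings are assumptions made for illustration, not part of any shaping engine's API.

```python
def permitted_forms(tokens):
    """Report which forms remain possible for a Consonant,Halant,... run.

    tokens is a list of names such as
    ["Consonant", "Halant", "ZWJ", "Consonant"].
    """
    if len(tokens) < 3 or tokens[0] != "Consonant" or tokens[1] != "Halant":
        raise ValueError("expected a run starting with Consonant,Halant")
    joiner = tokens[2] if tokens[2] in ("ZWJ", "ZWNJ") else None
    return {
        # Either joiner blocks the conjunct between the two consonants.
        "conjunct": joiner is None,
        # Only ZWNJ also blocks the half-forms feature.
        "half_form": joiner != "ZWNJ",
    }


def reph_permitted(tokens):
    """An initial Ra,Halant may form Reph unless ZWJ follows immediately."""
    return tokens[:2] == ["Ra", "Halant"] and tokens[2:3] != ["ZWJ"]
```

Note that the flags express what is still permitted, not what a given font will actually produce; whether a half form or conjunct appears depends on the font's lookups.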
The no-break space (NBSP) is primarily used to display those codepoints that are defined as non-spacing (marks, dependent vowels (matras), below-base consonant forms, and post-base consonant forms) in an isolated context, as an alternative to displaying them superimposed on the dotted-circle placeholder. These sequences will match "NBSP,ZWJ,Halant,Consonant", "NBSP,mark", or "NBSP,matra".
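The three NBSP patterns can be checked with a regular expression over a one-letter encoding of each token. The following is a minimal sketch under that assumption; the encoding table `CODES`, the pattern, and the function name `is_nbsp_isolated_sequence` are hypothetical and chosen only for this example.

```python
import re

# One-letter codes for the token names used in the text above.
CODES = {
    "NBSP": "N",
    "ZWJ": "J",
    "Halant": "H",
    "Consonant": "C",
    "mark": "M",
    "matra": "V",
}

# Matches NBSP,ZWJ,Halant,Consonant or NBSP,mark or NBSP,matra.
NBSP_SEQUENCE = re.compile(r"N(?:JHC|M|V)")


def is_nbsp_isolated_sequence(tokens):
    """Return True if the token list is one of the three NBSP sequences."""
    # Unknown token names are encoded as "?" so that they never match.
    encoded = "".join(CODES.get(token, "?") for token in tokens)
    return NBSP_SEQUENCE.fullmatch(encoded) is not None
```

Using `fullmatch` keeps the check strict: a sequence with extra trailing tokens is not treated as one of the isolated-display forms.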
In addition to general punctuation, runs of Khmer text often use the danda (U+0964) and double danda (U+0965) punctuation marks from the Devanagari block.