^{:kindly/hide-code true ; don't render this code to the HTML document
  :clay {:title "About Transliteration"
         :quarto {:author :echeran
                  :type :post
                  :date "2025-06-08"
                  :category :clojure
                  :tags [:internationalization :i18n :transliteration :text
                         :string :transformation]}}}
(ns internationalization.transliteration
  (:require [clj-thamil.format :as fmt]))

;; Transliteration is the systematic conversion of how text encodes language (or
;; information) from one writing system (or convention or format) to another.
;;
;; We most commonly think of this for human languages, when converting the sounds
;; spoken in a language from one writing system to another (e.g., the sounds of
;; Chinese, written in ideographs, rendered as English-language sounds written in
;; the Latin script).
;;
;; The idea of transliteration can also be applied more generically, to computers
;; that need to transform text or even file formats.


(def translit-map
  "This map defines a transliteration scheme for transforming text, in this case
  from Latin script character sequences (of English words) into emojis.

  We define our transformation mappings in a map, so it looks a lot like an
  input to the Clojure `replace` function. This map is the input used to build
  the prefix tree (a.k.a. trie) data structure that performs the conversion."
  {"happy"    "🙂"
   "happier"  "😀"
   "happiest" "😄"})

(def translit-trie
  "The prefix tree (a.k.a. trie) data structure built from our transliteration
  mappings."
  (fmt/make-trie translit-map))

;; A prefix tree is also called a trie. It stores a collection of sequences
;; (e.g., strings) efficiently when many of them share overlapping prefixes.
;;
;; A dictionary for an alphabetic language is a good example of when a prefix tree is
;; efficient in space. Imagine all of the words on a single page of the dictionary.
;; It could look like "cat", "catamaran", "catamount", "category", "catenary", etc.
;; These words could instead be stored as follows, where `*` marks the end of a word:
;;
;; ```
;; c - a - t *
;;             a - m
;;                     a - r - a - n *
;;                     o - u - n - t *
;;             e
;;                 g - o - r - y *
;;                 n - a - r - y *
;; ```
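;;
;; As a rough illustration of that shape, a trie can be sketched in Clojure as nested
;; maps keyed by character. This is only a sketch and not how `clj-thamil` builds its
;; tries; the names `build-trie` and `cat-trie` and the `:end` marker (playing the role
;; of `*` above) are assumptions made for this example.
;;
;; ```clojure
;; (defn build-trie
;;   "Builds a nested-map trie from a collection of strings.
;;   A node that contains :end true is where a complete word ends."
;;   [words]
;;   (reduce (fn [trie word]
;;             ; the path is the word's characters followed by the :end marker
;;             (assoc-in trie (conj (vec word) :end) true))
;;           {}
;;           words))
;;
;; (def cat-trie
;;   (build-trie ["cat" "catamaran" "catamount" "category" "catenary"]))
;;
;; ; "cat" is a complete word, so the node reached via c, a, t is marked
;; (get-in cat-trie [\c \a \t :end])         ; => true
;;
;; ; the same node also branches toward the longer words
;; (set (keys (get-in cat-trie [\c \a \t]))) ; => #{\a \e :end}
;;
;; ; "ca" is only a prefix, not a stored word
;; (get-in cat-trie [\c \a :end])            ; => nil
;; ```
;;
;; The shared prefix "cat" is stored once, and each longer word only adds the
;; characters that follow it.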

;; Why would we use a prefix tree? Even if the source text patterns in the replacement rules
;; overlap, we could perform replacement without a tree by ordering the rules so that a
;; pattern that contains another pattern is applied earlier. However, performing this
;; ordering in a way that scales to a large ruleset would effectively require constructing a
;; prefix tree. Furthermore, a map of rules better models the notion of rules being
;; independent data that are not complected with other rules. Also, as the number of rules
;; increases, lookup in a prefix tree may perform better than attempting to apply every rule
;; in the ruleset sequentially.
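;;
;; To illustrate the ordering concern, here is a minimal sketch that applies rules one
;; after another with `clojure.string/replace` and no tree. The names `rules`,
;; `apply-rules`, and `longest-first` are hypothetical, and the two rules are chosen
;; only because one pattern contains the other.
;;
;; ```clojure
;; (require '[clojure.string :as str])
;;
;; ; rules listed in an unfavorable order: "t" is contained in "th"
;; (def rules [["t" "Y"] ["th" "X"]])
;;
;; (defn apply-rules
;;   "Applies each [pattern replacement] rule in order, one full pass per rule."
;;   [ordered-rules s]
;;   (reduce (fn [acc [pattern replacement]]
;;             (str/replace acc pattern replacement))
;;           s
;;           ordered-rules))
;;
;; (defn longest-first
;;   "Sorts rules so that longer patterns are applied before shorter ones."
;;   [rules]
;;   (sort-by (fn [[pattern _]] (- (count pattern))) rules))
;;
;; (apply-rules rules "this is a test")                 ; => "Yhis is a YesY"
;; (apply-rules (longest-first rules) "this is a test") ; => "Xis is a YesY"
;; ```
;;
;; In the first call the "th" rule never fires, because the "t" rule has already
;; consumed every "t" by the time it runs.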

;; Let's inspect our prefix tree by checking which input strings it reports as
;; being in the trie.
(fmt/in-trie? translit-trie "hap")
(fmt/in-trie? translit-trie "happy")
(fmt/in-trie? translit-trie "happier")
(fmt/in-trie? translit-trie "happiest")
(fmt/in-trie? translit-trie "happiest!")

(def s "Hello, world! Happiness is not being happiest or happier than the rest, but instead just being happy.")

(defn convert
  "Use our translit-trie to convert the input string into the output string."
  [s]
  (->> (fmt/str->elems translit-trie s)
       (apply str)))

(def converted
  "Create the converted string according to our transliteration rules."
  (convert s))

converted

;; It's worth noting that a prefix tree, when used for transliteration conversions,
;; effectively acts as the finite state machine (FSM) needed to parse and transform the input.
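;;
;; To make the state-machine view concrete, here is a minimal sketch of longest-match
;; conversion over a nested-map trie, where each node is a state and each character is a
;; transition. It is not `clj-thamil`'s implementation; the names `build-trie`,
;; `longest-match`, `trie-convert`, and `sketch-trie` and the `:value` key are assumptions
;; made for this example, and patterns are assumed to be non-empty.
;;
;; ```clojure
;; (defn build-trie
;;   "Builds a nested-map trie from a map of pattern -> replacement.
;;   The :value key at a node holds the replacement for the pattern ending there."
;;   [rules]
;;   (reduce (fn [trie [pattern replacement]]
;;             (assoc-in trie (conj (vec pattern) :value) replacement))
;;           {}
;;           rules))
;;
;; (defn longest-match
;;   "Walks the trie along s starting at index i, one character per transition.
;;   Returns [end replacement] for the longest pattern found, or nil if none."
;;   [trie s i]
;;   (loop [node trie, j i, best nil]
;;     (let [best  (if (contains? node :value) [j (:value node)] best)
;;           child (when (< j (count s)) (get node (nth s j)))]
;;       (if child
;;         (recur child (inc j) best)
;;         best))))
;;
;; (defn trie-convert
;;   "Replaces the longest matching pattern at each position; characters that
;;   start no match are copied through unchanged."
;;   [trie s]
;;   (loop [i 0, out []]
;;     (if (>= i (count s))
;;       (apply str out)
;;       (if-let [[end replacement] (longest-match trie s i)]
;;         (recur end (conj out replacement))
;;         (recur (inc i) (conj out (nth s i)))))))
;;
;; (def sketch-trie (build-trie {"th" "X", "t" "Y"}))
;;
;; (trie-convert sketch-trie "this is a test") ; => "Xis is a YesY"
;; ```
;;
;; Because the walk remembers the last node that carried a `:value`, the longer
;; pattern "th" wins over "t" without the rules having to be ordered.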
;;
;; For next time: what if we did that same conversion implicitly, by constructing a regular
;; expression (regex) that matches the input patterns? Could that be as fast as, or faster
;; than, our naive Clojure implementation? A regex might work like so:
;;
;; ```js
;; let text = "this is a test";
;; const replacementMap = { 'th': 'X', 't': 'Y' };
;;
;; // List longer alternatives first: the regex tries them left to right.
;; let result = text.replace(/th|t/g, (match) => {
;;   return replacementMap[match];
;; });
;;
;; console.log(result); // "Xis is a YesY"
;; ```
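;;
;; The same idea can be sketched in Clojure by building the alternation from the keys of
;; a rules map. This is an untimed sketch, so it says nothing yet about relative speed;
;; the names `rules`, `rules->regex`, and `regex-convert` are hypothetical.
;;
;; ```clojure
;; (require '[clojure.string :as str])
;;
;; (def rules {"th" "X", "t" "Y"})
;;
;; (defn rules->regex
;;   "Builds one alternation regex from the rule patterns, longest first,
;;   quoting each pattern so that it is matched literally."
;;   [rules]
;;   (->> (keys rules)
;;        (sort-by count >)
;;        (map #(java.util.regex.Pattern/quote %))
;;        (str/join "|")
;;        re-pattern))
;;
;; (defn regex-convert
;;   "Replaces every match of the combined regex with its mapped replacement."
;;   [rules s]
;;   (str/replace s (rules->regex rules) (fn [match] (get rules match))))
;;
;; (regex-convert rules "this is a test") ; => "Xis is a YesY"
;; ```
;;
;; Sorting the patterns longest first matters here for the same reason as in the
;; JavaScript version: alternatives are tried in the order they are written.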