Commit 0ec4d46

Merge pull request #24 from echeran/translit2
Add beginnings of post #2 on transliteration
2 parents c3cfa84 + f82e790 commit 0ec4d46

1 file changed: 155 additions & 0 deletions
@@ -0,0 +1,155 @@
^{:kindly/hide-code true ; don't render this code to the HTML document
  :clay {:title  "More on transliteration"
         :quarto {:author   :echeran
                  :type     :post
                  :date     "2025-06-22"
                  :category :clojure
                  :tags     [:internationalization :i18n :transliteration :text
                             :string :transformation :regex :icu :tree :graph :traversal]}}}
(ns internationalization.transliteration2
  (:require [clj-thamil.format :as fmt]
            [clj-thamil.format.convert :as cvt]
            [clojure.string :as str]))

;; In the last post on [transliteration](transliteration.html), I introduced the idea of transliteration
;; as implemented in programming, and pointed out that the process of transforming text is more general.
;; In that regard, an implementation that works for one use case will work for another. Now the
;; question is: what is the most efficient and appropriate implementation?
;;
;; I described a prefix tree as a convenient way to store the substrings to match on. However, my pure
;; Clojure implementation of a prefix tree, which is built from nested maps, performs
;; slowly. Very slowly! That is not a reflection on Clojure, which is a very practical language
;; that optimizes what it can. The ethos of Clojure programming follows the maxim,
;; [stemming from early Unix, of "make it work, make it right, make it fast"](https://wiki.c2.com/?MakeItWorkMakeItRightMakeItFast).
;; As such, we should think about how to make this fast.
;;
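;; To make "a prefix tree built from nested maps" concrete, here is a minimal
;; sketch. It is not the clj-thamil implementation: the names `toy-trie` and
;; `longest-match`, and the `::val` marker key, are assumptions made up for
;; this illustration.
(defn toy-trie
  "Builds a nested-map prefix tree from a map of match string -> replacement.
  Each level is keyed by one character; ::val marks the end of a match string."
  [m]
  (reduce-kv (fn [trie k v] (assoc-in trie (conj (vec k) ::val) v))
             {}
             m))

(defn longest-match
  "Walks the trie along s starting at index i. Returns
  [matched-substring replacement] for the longest key that starts at i,
  or nil if no key starts there."
  [trie s i]
  (loop [node trie, j i, best nil]
    (let [best (if (contains? node ::val)
                 [(subs s i j) (::val node)]
                 best)
          nxt  (when (< j (count s)) (get node (nth s j)))]
      (if nxt
        (recur nxt (inc j) best)
        best))))

(longest-match (toy-trie {"t" "Y", "th" "X"}) "this" 0)
;; evaluates to ["th" "X"]
;;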
;; Tim asked me why this text transformation couldn't have been implemented with a regex, since doing so
;; would certainly make it fast. For example, to transliterate Tamil-language text written in Latin script into
;; the Tamil script, a call to my existing implementation looks like:
(def s "vaNakkam. padippavarkaLukku n-anRi.")
(def expected "வணக்கம். படிப்பவர்களுக்கு நன்றி.")
(cvt/romanized->தமிழ் s)
(assert (= expected (cvt/romanized->தமிழ் s)))

;; That transliteration converts Latin script into Tamil script in a somewhat predictable and intuitive
;; way, such that: `a` -> அ, `aa` -> ஆ, ..., `k` -> க், `ng` -> ங், etc. Tim's point is that you can
;; detect the input substrings using a regex, and then feed each matching substring into
;; a replacement map to get its replacement. His earlier pseudocode in JS looked like this:
;; ```js
;; let text = "this is a test";
;; const replacementMap = { 'th': 'X', 't': 'Y' };
;;
;; let result = text.replace(/th|t/g, (match) => {
;;   return replacementMap[match];
;; });
;;
;; console.log(result);
;; ```
;; He is taking into account the caveat that some of the match strings overlap with, or contain,
;; other match strings, and therefore order matters so that the right "rule" (match + replace) is triggered.
;;
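;; For comparison, a rough Clojure equivalent of that JS snippet.
;; `clojure.string/replace` accepts a function as the replacement when the
;; match argument is a regex, and a Clojure map is itself a function of its
;; keys, so the replacement map can be passed directly. The name
;; `toy-replacements` exists only for this sketch.
(def toy-replacements {"th" "X", "t" "Y"})

(clojure.string/replace "this is a test" #"th|t" toy-replacements)
;; evaluates to "Xis is a YesY"
;;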
;; This should work. Let's try it. In the `romanized->தமிழ்` function (where the word "romanized" really
;; should be "Latin", the name of the script), the conversions are more or less defined
;; [here](https://github.com/echeran/clj-thamil/blob/78bb810b2ac73cf05d027b52528ba30118e3720e/src/clj_thamil/format/convert.cljc#L25).
;; Let's just reuse it!
cvt/romanized-தமிழ்-phoneme-map
;; Now to handle the caveat. As you can see, `"t"` is a substring of `"th"`, and both are keys in the map.
;; In general, we would have to do a topological sort or some other graph traversal based on which
;; keys are substrings of which others. In this particular case, a shortcut that is a huge hack
;; (because it is unlikely to generalize to arbitrary rules) is to sort the match strings from longest to shortest
;; en route to constructing our regex string:
(->> (keys cvt/romanized-தமிழ்-phoneme-map)
     (sort-by count)
     reverse)
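;; Why the order matters: Java regex alternation tries its branches left to
;; right and takes the first one that succeeds at the current position, not
;; the longest one. A small check with a two-key toy alternation:
(re-seq #"t|th" "this") ; evaluates to ("t"): the `t` branch wins before `th` is considered
(re-seq #"th|t" "this") ; evaluates to ("th")
;; For plain literal keys, every branch that matches at a given position is a
;; prefix of the remaining input, so listing longer keys first is enough to
;; get the longest match at each position.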
;; Our regex string will end up looking like:
(->> (keys cvt/romanized-தமிழ்-phoneme-map)
     (sort-by count)
     reverse
     (interpose \|)
     (apply str))
;; Our regex is formed by feeding that string to `re-pattern`:
(def regex (re-pattern (->> (keys cvt/romanized-தமிழ்-phoneme-map)
                            (sort-by count)
                            reverse
                            (interpose \|)
                            (apply str))))
;; We can segment the input string based on the transliteration/transformation
;; substring match keys:
(re-seq regex s)
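;; One caveat with building the pattern by string concatenation: each key is
;; treated as regex source, so a key containing a metacharacter such as `.` or
;; `|` would change the pattern's meaning. Whether any key in the phoneme map
;; contains one is not checked here. A defensive sketch that quotes every key
;; with `java.util.regex.Pattern/quote` (the helper name `literal-alternation`
;; is made up for this illustration):
(defn literal-alternation
  "Returns a regex that matches any of the literal strings in ks, longest first."
  [ks]
  (->> ks
       (sort-by count >)
       (map #(java.util.regex.Pattern/quote %))
       (clojure.string/join "|")
       re-pattern))

(re-seq (literal-alternation ["a" "a.b"]) "a.b axb")
;; evaluates to ("a.b" "a"); unquoted, the key "a.b" would also match "axb"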
;; We can't naively transform only the strings that match, however. For example, you would lose the
;; whitespace and punctuation in this example:
(->> (re-seq regex s)
     (map cvt/romanized-தமிழ்-phoneme-map)
     fmt/phonemes->str)
;; So we need to adjust our regex to have a "default branch" that
;; matches the next character if nothing else matches. We do this by appending the
;; match-any-character pattern `.` to the end of the giant alternation:
(def regex (re-pattern (str (->> (keys cvt/romanized-தமிழ்-phoneme-map)
                                 (sort-by count)
                                 reverse
                                 (interpose \|)
                                 (apply str))
                            "|.")))
;; Now the non-matching characters are kept in the output:
(->> (re-seq regex s)
     (map #(or (cvt/romanized-தமிழ்-phoneme-map %) %))
     fmt/phonemes->str)
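;; The same "default branch" idea on a toy map, where the result is easy to
;; check by eye. One thing worth flagging: in Java regexes `.` does not match
;; line terminators unless the DOTALL flag `(?s)` is set, so multi-line input
;; would lose its newlines without it. The names `fallback-demo-map` and
;; `fallback-demo` exist only for this sketch.
(def fallback-demo-map {"th" "X", "t" "Y"})

(defn fallback-demo
  "Segments s with re, replaces each segment found in fallback-demo-map,
  and passes every other segment through unchanged."
  [re s]
  (->> (re-seq re s)
       (map #(get fallback-demo-map % %))
       (apply str)))

(fallback-demo #"th|t" "this\ntest")             ; evaluates to "XYY": unmatched text is dropped
(fallback-demo #"th|t|." "this\ntest")           ; evaluates to "XisYesY": the newline is still lost
(fallback-demo #"(?s)(?:th|t|.)" "this\ntest")   ; evaluates to "Xis\nYesY"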
;; For that matter, since the `.` alternative matches a single
;; character anyway, and we always do a lookup on whatever the
;; regex returns, we can remove all 1-character strings from the regex pattern without
;; any change in functionality:
(def regex (re-pattern (str (->> (keys cvt/romanized-தமிழ்-phoneme-map)
                                 (sort-by count)
                                 reverse
                                 (remove #(= 1 (count %)))
                                 (interpose \|)
                                 (apply str))
                            "|.")))
;; Check that the output is the same:
(->> (re-seq regex s)
     (map #(or (cvt/romanized-தமிழ்-phoneme-map %) %))
     fmt/phonemes->str)
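;; A quick check of that claim on toy data (all names here are made up for
;; this sketch): dropping the one-character key "t" from the alternation
;; leaves the output unchanged, because `.` still returns "t" as a segment and
;; the map lookup still replaces it.
(let [m       {"th" "X", "t" "Y"}
      with    #"th|t|."
      without #"th|."
      convert (fn [re s] (apply str (map #(get m % %) (re-seq re s))))]
  (= (convert with "this is a test")
     (convert without "this is a test")))
;; evaluates to true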

;; Let's check whether the new regex is faster than the slightly older regex, and whether
;; both are indeed faster than the unoptimized pure Clojure prefix tree implementation.
(def regex1 (re-pattern (str (->> (keys cvt/romanized-தமிழ்-phoneme-map)
                                  (sort-by count)
                                  reverse
                                  (interpose \|)
                                  (apply str))
                             "|.")))
(def regex2 (re-pattern (str (->> (keys cvt/romanized-தமிழ்-phoneme-map)
                                  (sort-by count)
                                  reverse
                                  (remove #(= 1 (count %)))
                                  (interpose \|)
                                  (apply str))
                             "|.")))

(def NUM-REPS 100)
(time (dotimes [_ NUM-REPS]
        (cvt/romanized->தமிழ் s)))
(time (dotimes [_ NUM-REPS]
        (->> (re-seq regex1 s)
             (map #(or (cvt/romanized-தமிழ்-phoneme-map %) %))
             fmt/phonemes->str)))
(time (dotimes [_ NUM-REPS]
        (->> (re-seq regex2 s)
             (map #(or (cvt/romanized-தமிழ்-phoneme-map %) %))
             fmt/phonemes->str)))
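;; A caveat on timings like these: with only 100 repetitions on a cold JVM,
;; the numbers can be dominated by warm-up rather than steady-state speed. A
;; minimal sketch of a somewhat fairer comparison using only
;; `System/nanoTime`; the helper name `avg-micros` and the repetition counts
;; are arbitrary choices for this illustration, and a dedicated benchmarking
;; library would be more rigorous.
(defn avg-micros
  "Calls the zero-argument function f `warmup` times unmeasured, then `reps`
  times measured, and returns the mean wall-clock time per call in microseconds."
  [f warmup reps]
  (dotimes [_ warmup] (f))
  (let [start (System/nanoTime)]
    (dotimes [_ reps] (f))
    (/ (- (System/nanoTime) start) 1000.0 reps)))

;; Usage with a stand-in workload:
(avg-micros #(clojure.string/join (re-seq #"th|t|." "this is a test")) 1000 1000)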

;; Well, this is surprising. I assumed that the regex implementation would be
;; significantly faster. Let's investigate.
;;
;; Maybe the difference is smaller than expected because `fmt/phonemes->str` is
;; suspiciously inefficient (and is also based on the prefix tree code). So what if
;; we take it out of the timed expressions above?
(time (dotimes [_ NUM-REPS]
        (->> (re-seq regex2 s)
             (map #(or (cvt/romanized-தமிழ்-phoneme-map %) %))
             str/join)))
;; So `fmt/phonemes->str` is the culprit. Its implementation uses the prefix tree
;; code, which is ripe for optimization, perhaps similar to what we just did
;; here?