|
| 1 | +^{:kindly/hide-code true ; don't render this code to the HTML document |
| 2 | + :clay {:title "More on transliteration" |
| 3 | + :quarto {:author :echeran |
| 4 | + :type :post |
| 5 | + :date "2025-06-22" |
| 6 | + :category :clojure |
| 7 | + :tags [:internationalization :i18n :transliteration :text |
| 8 | + :string :transformation :regex :icu :tree :graph :traversal]}}} |
| 9 | +(ns internationalization.transliteration2 |
| 10 | + (:require [clj-thamil.format :as fmt] |
| 11 | + [clj-thamil.format.convert :as cvt] |
| 12 | + [clojure.string :as str])) |
| 13 | + |
| 14 | +;; In the last post on [transliteration](transliteration.html), I introduced the idea of transliteration |
| 15 | +;; as implemented in programming, and pointed out that the process of transforming text is more general. |
| 16 | +;; In that regard, the implementation that works for one use case will work for another. Now, the |
| 17 | +;; question is: which implementation is the most efficient and appropriate? |
| 18 | +;; |
| 19 | +;; I described a prefix tree as an easy way to store the substrings to match on. However, my pure |
| 20 | +;; Clojure implementation of a prefix tree, which is built from nested maps, is |
| 21 | +;; slow. Very slow! But that's not a reflection of Clojure, which is a language that is very practical |
| 22 | +;; and optimizes what it can. And the ethos of Clojure programming follows the maxim in programming, |
| 23 | +;; stemming |
| 24 | +;; [from early Unix, of "make it work, make it right, make it fast"](https://wiki.c2.com/?MakeItWorkMakeItRightMakeItFast). |
| 25 | +;; As such, we should think about how to make this fast. |
| 26 | +;; |
| 27 | +;; Tim asked me why this text transformation couldn't have been implemented with a regex, and doing so |
| 28 | +;; would certainly make it fast. For example, to transliterate Tamil language text in Latin script into |
| 29 | +;; the Tamil script, my existing implementation looks like this: |
| 30 | +(def s "vaNakkam. padippavarkaLukku n-anRi.") |
| 31 | +(def expected "வணக்கம். படிப்பவர்களுக்கு நன்றி.") |
| 32 | +(cvt/romanized->தமிழ் s) |
| 33 | +(assert (= expected (cvt/romanized->தமிழ் s))) |
| 34 | + |
| 35 | +;; That transliteration converts Latin script into Tamil script in a somewhat predictable and intuitive |
| 36 | +;; way, such that: `a` -> அ, `aa` -> ஆ, ..., `k` -> க், `ng` -> ங், etc. Tim's point is that you can |
| 37 | +;; detect the input substrings using a regex, and then feed each matched substring into |
| 38 | +;; a replacement map to get the transliteration. His previous pseudocode in JS looked like this: |
| 39 | +;; ```js |
| 40 | +;; let text = "this is a test"; |
| 41 | +;; const replacementMap = { 'th': 'X', 't': 'Y' }; |
| 42 | +;; |
| 43 | +;; let result = text.replace(/th|t/g, (match) => { |
| 44 | +;; return replacementMap[match]; |
| 45 | +;; }); |
| 46 | +;; |
| 47 | +;; console.log(result); |
| 48 | +;; ``` |
| 49 | +;; He is taking into account the caveat that some of the substrings will overlap with, or be superstrings of, |
| 50 | +;; other substrings, and therefore order matters so that the right "rule" (match + replace) is triggered. |
| 51 | +;; |
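For comparison, here is a minimal Clojure sketch of the same idea as the JS pseudocode. It assumes only `clojure.string/replace`, which accepts a function (and therefore a map) as the replacement when the match is a regex. The names `replacement-map`, `longest-first`, and `shortest-first` are made up for illustration. Swapping the order of the alternatives shows why order matters.

```clojure
(require '[clojure.string :as str])

;; Hypothetical replacement map, mirroring the JS pseudocode.
(def replacement-map {"th" "X", "t" "Y"})

;; Longer alternative first: "th" wins wherever both could match.
(def longest-first
  (str/replace "this is a test" #"th|t" replacement-map))

;; Shorter alternative first: "t" always wins, so the "th" rule never fires.
(def shortest-first
  (str/replace "this is a test" #"t|th" replacement-map))

longest-first  ;; => "Xis is a YesY"
shortest-first ;; => "Yhis is a YesY"
```

Java regex alternation tries the alternatives left to right at each position, which is why the pattern has to list `th` before `t`.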
| 52 | +;; This should work. Let's try it. For the "romanized->தமிழ்" function (where the word "romanized" really |
| 53 | +;; should be "Latin", the name of the script), the conversions are more or less defined |
| 54 | +;; [here](https://github.com/echeran/clj-thamil/blob/78bb810b2ac73cf05d027b52528ba30118e3720e/src/clj_thamil/format/convert.cljc#L25). |
| 55 | +;; Let's just reuse that map: |
| 56 | +cvt/romanized-தமிழ்-phoneme-map |
| 57 | +;; Now to handle the caveat. As you can see, `"t"` is a substring of `"th"`, and both are keys in the map. |
| 58 | +;; We effectively have to do a topological sort or some other graph traversal based on which |
| 59 | +;; keys are substrings of which other ones. In this particular case, a shortcut that is a huge hack |
| 60 | +;; (because it cannot possibly be generalizable) would be to sort the match strings in order of longest to shortest |
| 61 | +;; en route to constructing our regex string: |
| 62 | +(->> (keys cvt/romanized-தமிழ்-phoneme-map) |
| 63 | + (sort-by count) |
| 64 | + reverse) |
| 65 | +;; Our regex string will end up looking like: |
| 66 | +(->> (keys cvt/romanized-தமிழ்-phoneme-map) |
| 67 | + (sort-by count) |
| 68 | + reverse |
| 69 | + (interpose \|) |
| 70 | + (apply str)) |
| 71 | +;; Our regex would be formed by feeding it to `re-pattern`: |
| 72 | +(def regex (re-pattern (->> (keys cvt/romanized-தமிழ்-phoneme-map) |
| 73 | + (sort-by count) |
| 74 | + reverse |
| 75 | + (interpose \|) |
| 76 | + (apply str)))) |
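One caveat when building a pattern from map keys: any key containing a regex metacharacter would be read as regex syntax rather than literal text. The sketch below is not part of the original implementation. It uses hypothetical keys (`demo-keys`) and a hypothetical helper (`keys->regex`) to show how `java.util.regex.Pattern/quote` keeps every key literal while preserving the longest-first ordering.

```clojure
(require '[clojure.string :as str])
(import 'java.util.regex.Pattern)

;; Hypothetical keys for illustration; "." and "+" are regex metacharacters.
(def demo-keys ["a" "aa" "a." "n-" "c+"])

(defn keys->regex
  "Builds an alternation of literal keys, longest key first."
  [ks]
  (->> ks
       (sort-by count >)            ; longest first
       (map #(Pattern/quote %))     ; escape each key so it matches literally
       (str/join "|")
       re-pattern))

(def demo-matches (re-seq (keys->regex demo-keys) "aa a. n- c+ a"))

demo-matches ;; => ("aa" "a." "n-" "c+" "a")
```

Without the quoting step, the key `"a."` would also match `"aa"`, and `"c+"` would mean "one or more `c`".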
| 77 | +;; We can do segmentation on the input string based on the transliteration/transformation |
| 78 | +;; substring match keys: |
| 79 | +(re-seq regex s) |
| 80 | +;; We can't naively just transform the strings that match, however. For example, you would lose the |
| 81 | +;; whitespace and punctuation in this input: |
| 82 | +(->> (re-seq regex s) |
| 83 | + (map cvt/romanized-தமிழ்-phoneme-map) |
| 84 | + fmt/phonemes->str) |
| 85 | +;; So we need to adjust our regex to be smart enough to have a "default branch" that |
| 86 | +;; matches the next character if nothing else matches. We do this by appending the match-any-character |
| 87 | +;; wildcard `.` to the end of the giant pattern alternation: |
| 88 | +(def regex (re-pattern (str (->> (keys cvt/romanized-தமிழ்-phoneme-map) |
| 89 | + (sort-by count) |
| 90 | + reverse |
| 91 | + (interpose \|) |
| 92 | + (apply str)) |
| 93 | + "|."))) |
| 94 | +;; Now the non-matching characters are preserved in the output: |
| 95 | +(->> (re-seq regex s) |
| 96 | + (map #(or (cvt/romanized-தமிழ்-phoneme-map %) %)) |
| 97 | + fmt/phonemes->str) |
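A detail worth keeping in mind for multi-line input: in Java regexes, `.` does not match line terminators unless the DOTALL flag `(?s)` is set. The example string in this post has no newlines, so the results above are unaffected. The sketch below uses a made-up input and made-up names (`without-dotall`, `with-dotall`) to show the difference.

```clojure
;; By default `.` skips line terminators, so the "\n" silently disappears.
(def without-dotall (re-seq #"th|." "th\nt"))

;; With the DOTALL flag `(?s)`, the fallback `.` also matches "\n".
(def with-dotall (re-seq #"(?s)th|." "th\nt"))

without-dotall ;; => ("th" "t")
with-dotall    ;; => ("th" "\n" "t")
```

If the transliteration should preserve line breaks, prefixing the generated pattern string with `(?s)` is one option.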
| 98 | +;; And for that matter, since the `.` alternative matches any single |
| 99 | +;; character anyway, and we always do a lookup on whatever the regex returns, |
| 100 | +;; we can remove the single-character strings from the regex pattern without |
| 101 | +;; changing the behavior: |
| 102 | +(def regex (re-pattern (str (->> (keys cvt/romanized-தமிழ்-phoneme-map) |
| 103 | + (sort-by count) |
| 104 | + reverse |
| 105 | + (remove #(= 1 (count %))) |
| 106 | + (interpose \|) |
| 107 | + (apply str)) |
| 108 | + "|."))) |
| 109 | +;; Check that the output is the same: |
| 110 | +(->> (re-seq regex s) |
| 111 | + (map #(or (cvt/romanized-தமிழ்-phoneme-map %) %)) |
| 112 | + fmt/phonemes->str) |
| 113 | + |
| 114 | +;; Let's check whether the new regex is faster than the slightly older regex, and whether |
| 115 | +;; both are indeed faster than the unoptimized pure Clojure prefix tree implementation. |
| 116 | +(def regex1 (re-pattern (str (->> (keys cvt/romanized-தமிழ்-phoneme-map) |
| 117 | + (sort-by count) |
| 118 | + reverse |
| 119 | + (interpose \|) |
| 120 | + (apply str)) |
| 121 | + "|."))) |
| 122 | +(def regex2 (re-pattern (str (->> (keys cvt/romanized-தமிழ்-phoneme-map) |
| 123 | + (sort-by count) |
| 124 | + reverse |
| 125 | + (remove #(= 1 (count %))) |
| 126 | + (interpose \|) |
| 127 | + (apply str)) |
| 128 | + "|."))) |
| 129 | + |
| 130 | +(def NUM-REPS 100) |
| 131 | +(time (dotimes [_ NUM-REPS] |
| 132 | + (cvt/romanized->தமிழ் s))) |
| 133 | +(time (dotimes [_ NUM-REPS] |
| 134 | + (->> (re-seq regex1 s) |
| 135 | + (map #(or (cvt/romanized-தமிழ்-phoneme-map %) %)) |
| 136 | + fmt/phonemes->str))) |
| 137 | +(time (dotimes [_ NUM-REPS] |
| 138 | + (->> (re-seq regex2 s) |
| 139 | + (map #(or (cvt/romanized-தமிழ்-phoneme-map %) %)) |
| 140 | + fmt/phonemes->str))) |
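Timings taken with `time` on the JVM can be noisy, because early iterations run before the JIT compiler has warmed up. As a rough sketch, a hypothetical helper like `mean-micros` below (not part of any library) runs unmeasured warm-up calls first and then reports the mean time per call. A dedicated benchmarking library such as Criterium handles this more rigorously.

```clojure
(defn mean-micros
  "Hypothetical helper: calls f `warmup` times unmeasured, then returns the
  mean wall-clock microseconds per call over `reps` measured calls."
  [f warmup reps]
  (dotimes [_ warmup] (f))
  (let [start (System/nanoTime)]
    (dotimes [_ reps] (f))
    (/ (- (System/nanoTime) start) 1000.0 reps)))

;; Usage with a stand-in workload; substitute any of the timed expressions.
(def sample-micros (mean-micros #(reduce + (range 1000)) 1000 1000))
```

Wrapping each candidate in a zero-argument function keeps the comparison uniform across the prefix tree and regex versions.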
| 141 | + |
| 142 | +;; Well, this is surprising. I assumed that the regex implementation would be |
| 143 | +;; significantly faster. Let's try to investigate. |
| 144 | +;; |
| 145 | +;; Maybe the difference is smaller than we expected because `fmt/phonemes->str` is |
| 146 | +;; suspiciously inefficient (and also based on the prefix tree code). So what if |
| 147 | +;; we swap it for a plain `str/join` in the timed expression above? The result is only a raw concatenation of phonemes, but it isolates the cost: |
| 148 | +(time (dotimes [_ NUM-REPS] |
| 149 | + (->> (re-seq regex2 s) |
| 150 | + (map #(or (cvt/romanized-தமிழ்-phoneme-map %) %)) |
| 151 | + str/join))) |
| 152 | +;; So `fmt/phonemes->str` is the culprit. Its implementation uses the prefix tree |
| 153 | +;; code, which is ripe for optimization, perhaps along the lines of what we just explored |
| 154 | +;; here? |
| 155 | + |