Commit 237c123

Merge pull request #22 from echeran/translit
Add post on transliteration
2 parents 8358a42 + b761c10 commit 237c123

3 files changed, +114 -0 lines changed

clay.edn

Lines changed: 7 additions & 0 deletions
@@ -39,6 +39,13 @@
   :url "https://github.com/puredanger"
   :affiliation [:clojure.core]
   :links [{:icon "github" :text "GitHub" :href "https://github.com/puredanger"}]}
+ :echeran
+ {:name "Elango Cheran"
+  :image "https://www.unicode.org/consortium/img/cheran-150px.jpg"
+  :url "https://github.com/echeran"
+  :affiliation []
+  :links [{:icon "github" :text "GitHub" :href "https://github.com/echeran"}
+          {:icon "home" :text "Personal site" :href "https://www.elangocheran.com"}]}
  :seancorfield
  {:name "Sean Corfield"
   :image "https://avatars.githubusercontent.com/u/43875?v=4"

deps.edn

Lines changed: 1 addition & 0 deletions
@@ -8,6 +8,7 @@
  io.github.clojure/core.async.flow-monitor {:git/tag "v0.1.1" :git/sha "61e8d31"}
  metosin/malli {:mvn/version "0.18.0"}
  clj-fuzzy/clj-fuzzy {:mvn/version "0.4.1"}
+ clj-thamil/clj-thamil {:mvn/version "0.2.0"}
  org.scicloj/clay {:git/url "https://github.com/scicloj/clay"
                    :git/sha "d64df566e3dd0e90ac9360d86a481a7be7587eaf"}
  org.eclipse.elk/org.eclipse.elk.core {:mvn/version "0.10.0"}
Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@
^{:kindly/hide-code true ; don't render this code to the HTML document
  :clay {:title "About Transliteration"
         :quarto {:author :echeran
                  :type :post
                  :date "2025-06-08"
                  :category :clojure
                  :tags [:internationalization :i18n :transliteration :text
                         :string :transformation]}}}
(ns internationalization.transliteration
  (:require [clj-thamil.format :as fmt]))

;; Transliteration is about systematically converting the way in which text encodes
;; language (or information) from one writing system (or convention or format) to
;; another.
;;
;; We most commonly think of this for human languages, when converting the sounds
;; spoken in a language from one writing system to another (ex: Chinese language
;; sounds written as ideographs into English language sounds written in the Latin
;; script).
;;
;; The idea of transliteration can be thought of more generically for computers
;; that need to transform text or even file formats.


(def translit-map
  "This map defines a transliteration scheme for transforming text, in this case,
  from Latin script character sequences (of English words) into emojis.

  We define our transformation mappings in a map. In this way, it looks a lot like an
  input to the Clojure `replace` function. This map will be used as an input for the prefix tree
  (a.k.a. trie) data structure used to convert."
  {"happy" "🙂"
   "happier" "😀"
   "happiest" "😄"})
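;; To make the comparison with `replace` concrete, here is a minimal sketch using
;; `clojure.core/replace`, which swaps elements of a collection that appear as keys in a map.
;; The names `word-rules` and `words` are made up for illustration. Note that this only
;; substitutes whole elements (here, words that are already split apart); it does not scan
;; inside a string, which is the job the prefix tree is used for in this post.

```clojure
(def word-rules
  "Hypothetical copy of the rules, so that this sketch is self-contained."
  {"happy" "🙂", "happier" "😀", "happiest" "😄"})

(def words ["be" "happy" "not" "happiest"])

;; Elements found as keys in the map are swapped for their values;
;; everything else passes through unchanged.
(def replaced (replace word-rules words))

(println replaced)
```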

(def translit-trie
  "The prefix tree (a.k.a. trie) data structure built from the map that defines
  our transliteration mappings."
  (fmt/make-trie translit-map))

;; A prefix tree is also called a trie. A prefix tree is a way to store a collection of
;; sequences (ex: strings) efficiently when there are many overlapping prefixes among
;; the strings.
;;
;; A dictionary for an alphabetic language is a good example of when a prefix tree is
;; efficient in space. Imagine all of the words in a single page of the dictionary.
;; It could look like "cat", "catamaran", "catamount", "category", "catenary", etc.
;; It could instead be stored as:
;;
;; ```
;; c - a - t *
;;             a - m
;;                     a - r - a - n *
;;                     o - u - n - t *
;;             e
;;                 g - o - r - y *
;;                 n - a - r - y *
;; ```
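;; One way to picture that structure in code is as nested maps, with one level per character.
;; The sketch below is an assumption made for illustration (it is not necessarily how
;; `clj-thamil` represents its trie), and the names `add-word`, `word-trie`, and `word?`
;; are hypothetical.

```clojure
(defn add-word
  "Adds `word` to the nested-map `trie`, marking its last node with :end
   (the `*` in the diagram)."
  [trie word]
  (assoc-in trie (conj (vec word) :end) true))

(def word-trie
  (reduce add-word {} ["cat" "catamaran" "catamount" "category" "catenary"]))

(defn word?
  "True only when `s` is a complete stored word, not merely a prefix of one."
  [trie s]
  (true? (get-in trie (conj (vec s) :end))))

(println (word? word-trie "cat"))   ; a complete word
(println (word? word-trie "cata"))  ; only a shared prefix
```

;; The shared prefix "cat" is stored once; the node reached after `c - a - t` holds the
;; `:end` marker plus one branch for each continuation (`a` and `e`).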

;; Why would we use a prefix tree? Even if the source text patterns in the replacement rules are
;; overlapping, we could perform replacement without a tree if we order the replacement rules
;; by the source text pattern, such that a pattern that contains another pattern is applied earlier.
;; However, to perform this ordering in a globally scalable way would effectively require
;; constructing a prefix tree. Furthermore, a map of rules better models the notion of rules being
;; independent data that are not complected with other rules. Also, as the number of rules increases,
;; there may be performance benefits in terms of lookup in a prefix tree versus attempting to apply
;; all rules in the ruleset sequentially.
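;; The ordering problem can be seen in a small sketch that applies rules one after another
;; with `clojure.string/replace`. The rules and the names `overlapping-rules` and
;; `apply-in-order` are hypothetical, chosen so that one pattern ("t") is contained in
;; another ("th").

```clojure
(require '[clojure.string :as str])

(def overlapping-rules
  "Hypothetical rules in which the pattern \"t\" is contained in the pattern \"th\"."
  {"t" "Y", "th" "X"})

(defn apply-in-order
  "Applies each [pattern replacement] pair to `s`, one after another."
  [s ordered-rules]
  (reduce (fn [acc [pattern replacement]]
            (str/replace acc pattern replacement))
          s
          ordered-rules))

(def longest-first (sort-by (comp - count key) overlapping-rules))
(def shortest-first (sort-by (comp count key) overlapping-rules))

(println (apply-in-order "this is a test" longest-first))
(println (apply-in-order "this is a test" shortest-first))
```

;; With the longer pattern applied first, the result is "Xis is a YesY". With the shorter
;; pattern first, "t" consumes the start of every "th", so the "th" rule never fires and
;; the result is "Yhis is a YesY".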

;; Let's inspect our prefix tree by checking a few input strings with `in-trie?`.
(fmt/in-trie? translit-trie "hap")
(fmt/in-trie? translit-trie "happy")
(fmt/in-trie? translit-trie "happier")
(fmt/in-trie? translit-trie "happiest")
(fmt/in-trie? translit-trie "happiest!")

(def s "Hello, world! Happiness is not being happiest or happier than the rest, but instead just being happy.")

(defn convert
  "Use our translit-trie to convert the input string into the output string."
  [s]
  (->> (fmt/str->elems translit-trie s)
       (apply str)))

(def converted
  "The converted string according to our transliteration rules."
  (convert s))

converted

;; It's worth noting that a prefix tree, when used to do transliteration conversions, is
;; effectively the finite state machine (FSM) needed to parse and transform.
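;; To illustrate the FSM view, here is a minimal sketch that walks a nested-map trie one
;; character at a time and emits the output of the longest matching rule. It is written
;; from scratch under the assumption that rules are non-empty strings; it is not the
;; `clj-thamil` implementation, and `add-rule`, `longest-match`, and `transliterate` are
;; hypothetical names.

```clojure
(defn add-rule
  "Stores `to` under :out at the end of the path spelled by the characters of `from`."
  [trie [from to]]
  (assoc-in trie (conj (vec from) :out) to))

(defn longest-match
  "Walks `trie` along `chars`, one state transition per character. Returns
   [output chars-consumed] for the longest rule that matches, or nil."
  [trie chars]
  (loop [node trie, cs (seq chars), n 0, best nil]
    (let [best (if (contains? node :out) [(:out node) n] best)
          nxt  (when cs (get node (first cs)))]
      (if nxt
        (recur nxt (next cs) (inc n) best)
        best))))

(defn transliterate
  "Replaces the longest matching rule at each position; copies other characters as-is."
  [trie s]
  (loop [cs (seq s), out []]
    (if-let [[to n] (and cs (longest-match trie cs))]
      (recur (seq (drop n cs)) (conj out to))
      (if cs
        (recur (next cs) (conj out (first cs)))
        (apply str out)))))

(def emoji-trie
  (reduce add-rule {} {"happy" "🙂", "happier" "😀", "happiest" "😄"}))

(println (transliterate emoji-trie "happiest or happier, or just happy"))
```

;; Each trie node plays the role of a state, each character lookup is a transition, and a
;; node carrying `:out` is an accepting state whose output is emitted once no longer match
;; is possible.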
;;
;; For next time: What if we implicitly did that same conversion by constructing a regular
;; expression (regex) that can match on the input patterns? Could that be equally fast, or
;; faster than our naive Clojure implementation? A regex might work like so:
;;
;; ```js
;; let text = "this is a test";
;; const replacementMap = { 'th': 'X', 't': 'Y' };
;;
;; let result = text.replace(/th|t/g, (match) => {
;;   return replacementMap[match];
;; });
;;
;; console.log(result);
;; ```
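;; For comparison, the same idea can be sketched in Clojure on the JVM. This is only an
;; illustration under stated assumptions: the names `replacement-map`, `rules-regex`, and
;; `regex-convert` are hypothetical, the keys are sorted longest first so that "th" is
;; tried before "t", and nothing here measures whether it is faster than the prefix tree.

```clojure
(require '[clojure.string :as str])

(def replacement-map {"th" "X", "t" "Y"})

(def rules-regex
  "Alternation of the quoted keys, longest first."
  (->> (keys replacement-map)
       (sort-by count >)
       (map #(java.util.regex.Pattern/quote %))
       (str/join "|")
       re-pattern))

(defn regex-convert
  "Replaces every match of `rules-regex` in `s` with its value in `replacement-map`."
  [s]
  (str/replace s rules-regex #(get replacement-map %)))

(println (regex-convert "this is a test"))
```

;; Quoting each key keeps characters such as `.` or `*` in a rule from being read as regex
;; syntax.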
