TLDR: It is an extension of the regular expressions for text editing and a grep
-like command line tool.
Regular expressions is a great tool for searching patterns in text. But I always found it unnatural for text editing. The group logic works as a post-processor and can be complicated. Here I propose an extension to the regular expression language for pattern matching and text modification. I call it transductive regular expressions or simply trre
.
It introduces the :
symbol to define transformations. The simplest form is a:b
, which replaces a with b. I call this a transductive pair
or transduction
.
To demonstrate the concept, I have created a command line tool trre
. It feels similar to the grep -E
command.
To change cat
to dog
we use the following expression:
$ echo 'cat' | trre 'c:da:ot:g'
dog
A more readable version:
$ echo 'cat' | trre '(cat):(dog)'
dog
It can be used like sed
to replace all matches in a string:
$ echo 'Mary had a little lamb.' | trre '(lamb):(cat)'
Mary had a little cat.
Deletion:
$ echo 'xor' | trre '(x:)or'
or
The expression (x:)
could be interpreted as of translation of x
to an empty string.
Insertion:
$ echo 'or' | trre '(:x)or'
xor
We could think of the expression (:x)
as of translation of an empty string into x
.
As for normal regular expression we could use alternations with |
symbol:
$ echo 'cat dog' | trre 'c:bat|d:hog'
bat hog
Or use the star over trre
to repeat the transformation:
$ echo 'catcatcat' | trre '((cat):(dog))*'
dogdogdog
In the default scan
mode, star can be omitted:
$ echo 'catcatcat' | trre '(cat):(dog)'
dogdogdog
You can also use the star in the left part to "consume" a pattern infinitely:
$ echo 'catcatcat' | trre '(cat)*:(dog)'
dog
Avoid using *
or +
in the right part, as it can cause infinite loops:
$ echo '' | trre ':a*' # <- do NOT do this
...
Instead, use finite iterations:
$ echo '' | trre ':(repeat-10-times){10}'
repeat-10-timesrepeat-10-timesrepeat-10-timesrepeat-10-timesrepeat-10-timesrepeat-10-timesrepeat-10-timesrepeat-10-timesrepeat-10-timesrepeat-10-times
Transform ranges of characters:
$ echo "regular expressions" | trre "[a:A-z:Z]"
REGULAR EXPRESSIONS
As more complex example, lets create a toy cipher. Below is the Caesar cipher(1) implementation:
$ echo 'caesar cipher' | trre '[a:b-y:zz:a]'
dbftbs djqifs
And decrypt it back:
$ echo 'dbftbs djqifs' | trre '[a:zb:a-y:x]'
caesar cipher
trre
can generate multiple output strings for a single input. By default, it uses the first possible match. You can also generate all possible outputs.
Binary sequences:
$ echo '' | trre -ma ':(0|1){3}'
000
001
010
011
100
101
110
111
Subsets:
$ echo '' | trre -ma ':(0|1){,3}?'
0
00
000
001
01
010
011
1
10
100
101
11
110
111
Informally, we define a trre
as a pair pattern-to-match
:pattern-to-generate
. The pattern-to-match
can be a string or regexp. The pattern-to-generate
normally is a string. But it can be a regex
as well. Moreover, we can do normal regular expression over these pairs.
We can specify a smiplified grammar of trre
as following:
TRRE <- TRRE* TRRE|TRRE TRRE.TRRE
TRRE <- REGEX REGEX:REGEX
Where REGEX
is a usual regular expression, and .
stands for concatenation.
For now I consider the operator :
as non-associative and the expression TRRE:TRRE
as unsyntactical. There is a natural meaning for this expression as a composition of relations defined by **trre
**s. But it can make things too complex. Maybe I change this later.
Under the hood, trre
constructs a special automaton called a Finite State Transducer (FST), which is similar to a Finite State Automaton (FSA) used in normal regular expressions but handles input-output pairs instead of plain strings.
Key differences:
-
trre
defines a binary relation between two regular languages. -
It uses FSTs instead of FSAs for inference.
-
It supports on-the-fly determinization for performance (experimental and under construction!).
To justify the laguage of trunsductive regular expression we need to prove the correspondence between trre
expressions and the corresponding FSTs. There is my sketch of a the proof: theory.pdf.
There are tons of decisions:
-
Associativity of
:
. The:
symbol is non-associative, meaning a:b:c is invalid in the current version. There is a natural meaning of transducer composition but it could make things too complicated. Alternative syntaxes (e.g., > and <) could be explored. -
Precedence of
:
. The priority of : in expressions needs clarification. -
Implicit Epsilon: should there be an explicit symbol?
trre
supports two modes:
-
Scan Mode (default): Applies transformations sequentially.
-
Match Mode: Checks the entire string against the expression (use
-m
flag).
Use -a
to generate all possible outputs.
The ?
modifier makes *
, +
, and {,}
operators non-greedy:
$ echo '<cat><dog>' | trre '<(.:)*>'
<>
$ echo '<cat><dog>' | trre '<(.:)*?>'
<><>
The important part of the modern regex engines is determinization. This routine converts the non-deterministic automata to the deterministic one. Once converted it has linear time inference on the input string length. It is handy but the convertion is exponential in the worst case.
For trre
the similar approach is possible. The bad news is that not all the non-deterministic transducers (NFT) can be converted to a deterministic (DFT). In case of two "bad" cycles with same input labels the algorithm is trapped in the infinite loop of a state creation. There is a way to detect such loops but it is expensive (see more in Allauzen, Mohri, Efficient Algorithms for testing the twins property).
The NFT (non-deterministic) version is a bit slower then sed
:
$ wget https://www.gutenberg.org/cache/epub/57333/pg57333.txt -O chekhov.txt
$ time cat chekhov.txt | trre '(vodka):(VODKA)' > /dev/null
real 0m0.046s
user 0m0.043s
sys 0m0.007s
$ time cat chekhov.txt | sed 's/vodka/VODKA/' > /dev/null
real 0m0.024s
user 0m0.020s
sys 0m0.010s
For complex tasks, trre_dft
(deterministic version) can outperform sed:
$ time cat chekhov.txt | sed -e 's/\(.*\)/\U\1/' > /dev/null
real 0m0.508s
user 0m0.504s
sys 0m0.015s
$ time cat chekhov.txt | trre_dft '[a:A-z:Z]' > /dev/null
real 0m0.131s
user 0m0.127s
sys 0m0.009s
No pre-built binaries are available yet. Clone the repository and compile:
git clone [email protected]:c0stya/trre.git trre
cd trre
make && sh test.sh
Then move the binary to a directory in your $PATH
.
- Complete the ERE feature set:
- negation
^
within[]
- character classes
- '$^' anchoring symbols
- negation
- Full unicode support
- Efficient range processing
- Stable DFT version
- The approach is strongly inspired by Russ Cox article: Regular Expression Matching Can Be Simple And Fast
- The idea of transducer determinization comes from this paper: Cyril Allauzen, Mehryar Mohri, "Finitely Subsequential Transducers"
- Parsing approach comes from Double-E algorithm by Erik Eidt which is close to the classical Shunting Yard algorithm.