Commit 0f7051f

committed
More improvements to compact form.
w3c/mathml#176
1 parent aaeeaef commit 0f7051f

3 files changed

+140
-79
lines changed

index.html

Lines changed: 65 additions & 48 deletions
Original file line number | Diff line number | Diff line change
@@ -5283,7 +5283,6 @@ <h3>Operator Dictionary</h3>
52835283
</section>
52845284
<section id="operator-dictionary-compact">
52855285
<h3>Operator Dictionary (Compact)</h3>
5286-
<div class="issue" data-number="209">Remove fence/separator?</div>
52875286
<p>
52885287
The following dictionary provides a compact form for the
52895288
<a href="#operator-dictionary">operator dictionary</a>, suitable for
@@ -5316,23 +5315,58 @@ <h3>Operator Dictionary (Compact)</h3>
53165315
<code>fence</code>, <code>separator</code> to <code>false</code>.
53175316
</li>
53185317
<li>If <code>Content</code> is a single character in the
5319-
BMP Private Use Area (range U+E000–U+F8FF)
5318+
range U+0320–U+03FF
53205319
then exit with <code>NotFound</code> status.</li>
53215320
<li>
53225321
If <code>Content</code> an UTF-16 strings of lengths more than 1
53235322
(including the case of surrogate pairs) and is listed in
53245323
<a href="#operator-dictionary-compact-special-tables"><code>Operators_multichar</code></a> then
53255324
replace <code>Content</code> with the Unicode character
5326-
"U+E000 plus the index of <code>Content</code> in
5327-
<code>Operators_multichar</code>". Otherwise, exit with
5328-
<code>NotFound</code> status.
5325+
"U+0320 plus the index of <code>Content</code> in
5326+
<code>Operators_multichar</code>". If it is not listed, then
5327+
exit with <code>NotFound</code> status.
53295328
</li>
5330-
<li>If (<code>Content</code>, <code>Form</code>)
5331-
corresponds to one category of
5332-
<a href="#operator-dictionary-category-table"></a> then
5333-
set the properties according to
5334-
<a href="#operator-dictionary-categories-values"></a>.
5335-
Otherwise, exit with <code>NotFound</code> status.
5329+
<li>
5330+
During this step, the algorithm will try and find a category
5331+
corresponding to (<code>Content</code>, <code>Form</code>) from
5332+
<a href="#operator-dictionary-category-table"></a> and
5333+
either exit with <code>NotFound</code> status or move to
5334+
the next point. More precisely, this can be done as follows:
5335+
<ul>
5336+
<li>For categories that don't have an encoding in
5337+
<a href="#operator-dictionary-categories-values"></a>
5338+
(namely K, M, N) perform a few direct verifications
5339+
on (<code>Content</code>, <code>Form</code>)
5340+
according to <a href="#operator-dictionary-category-table"></a>.
5341+
If a result is found then set the properties according to
5342+
<a href="#operator-dictionary-categories-values"></a>.
5343+
Otherwise exit with <code>NotFound</code> status.
5344+
</li>
5345+
<li>For other categories, perform the following steps:
5346+
<ul>
5347+
<li>Set <code>Key</code> to <code>Content</code> if it is in
5348+
range U+0000–U+03FF ; or to <code>Content</code> − 0x1C00
5349+
if it is in range U+2000–U+2BFF. Otherwise, exit with
5350+
<code>NotFound</code> status.
5351+
<code>Key</code> is at most 0x0FFF.
5352+
</li>
5353+
<li>Add 0x0000, 0x1000, 0x2000
5354+
to <code>Key</code> according to whether <code>Form</code>
5355+
is <code>infix</code>, <code>prefix</code>,
5356+
<code>postfix</code> respectively.
5357+
<code>Key</code> is at most 0x2FFF.
5358+
</li>
5359+
<li>Search an <code>Entry</code> in table
5360+
<a href="#operator-dictionary-categories-hexa-table"></a>
5361+
such that <code>Entry</code> % 0x4000 is equal to
5362+
<code>Key</code>. Either exit with
5363+
<code>NotFound</code> status or
5364+
set the properties corresponding to the category with
5365+
encoding <code>Entry</code> / 0x1000 in
5366+
<a href="#operator-dictionary-categories-values"></a>.
5367+
</ul>
5368+
</li>
5369+
</ul>
53365370
</li>
53375371
<li>If <code>Content</code> is in
53385372
<a href="#operator-dictionary-compact-special-tables"><code>Operators_fence</code></a> then set property <code>fence</code> to true.</li>
@@ -5348,58 +5382,41 @@ <h3>Operator Dictionary (Compact)</h3>
53485382
</li>
53495383
</ol>
53505384

5385+
<div class="note">
5386+
The <code>fence</code> and <code>separator</code> properties do not
5387+
have any visible effect on the layout described in this
5388+
specification. So steps 5 and 6 as well as the corresponding tables
5389+
may be ignored.
5390+
</div>
5391+
<div class="issue" data-number="209">Remove fence/separator?</div>
5392+
53515393
<div id="operator-dictionary-entries"
53525394
data-include="tables/operator-dictionary-compact.html"></div>
53535395

53545396
<div class="note" id="operator-dictionary-compact-implementations">
53555397
<p>
5356-
After conversion to a single UTF-16 character, determining the
5357-
category of ('Content', 'Form') can be done by binary searches
5358-
on the tables corresponding to the 'Form' value
5359-
of <a href="#operator-dictionary-category-table"></a>.
5360-
For tables of ranges, the binary search can be performed on the
5361-
range start code point. Note that small tables only have a few
5362-
ranges or code points to check and so can be handled by direct
5363-
comparaisons.
5398+
When encoded as ranges, one can perform a binary search by looking
5399+
for the range start, followed by an extra check on the range length.
5400+
Since log is concave,
5401+
it is worse to do one binary search on each large subtable
5402+
of <a href="#operator-dictionary-category-table"></a> than one
5403+
binary search on the whole table of
5404+
<a href="#operator-dictionary-categories-hexa-table"></a>.
5405+
One can see that there are several contiguous Unicode blocks, so
5406+
encoding tables as ranges allows getting almost 8 bits per entry.
53645407
</p>
53655408
<p>
5366-
The possible characters 'Content' values after conversion
5367-
characters are located into the three small ranges
5368-
U+0000–U+03FF, U+2000–U+2BFF and
5369-
U+E000–U+E04F and after simple offset shift can be encoded on
5370-
12 bits. Note that all Unicode ranges from
5371-
and <a href="#operator-dictionary-category-table"></a>
5372-
contain between 1 and 32 characters. By splitting ranges into
5373-
at most two parts, each range can be encoded on 16 bits.
5374-
Due to several contiguous Unicode blocks, the tables would still be
5375-
encoded in significantly less than 16bits/entry but all the
5376-
tables are now encoded and treated the same way.
5377-
</p>
5378-
<p>
5379-
Alternatively, discarding the smallest tables as explained above,
5380-
one can consider only those having a 4bits encoding in
5381-
<a href="#operator-dictionary-categories-values"></a>.
5382-
Using the 12-bit encoding of the 'Content' described
5383-
above this means that these tables can be encoded with
5384-
16bits/entry but binary search would now be performed on a single
5385-
table.
5386-
</p>
5387-
<p>
5388-
Continuing on the previous approach, it is possible to
5409+
Alternatively, it is possible to
53895410
use a perfect hash function to implement table lookup in constant
5390-
time [[?gperf]] [[?CMPH]]. This would add 16 bits per empty entry
5411+
time [[?gperf]] [[?CMPH]]. This would instead take
5412+
16 bits per entry, plus 16 bits per extra empty entry
53915413
(for non-minimal perfect hash function) as well as extra data to
53925414
store the hash function parameters. For minimal perfect hash
53935415
function, the theorical lower bound for storing these parameters is
53945416
1.44bits/entry and existing algorithms range from close to that
53955417
limit up to 4bits/entry.
53965418
</p>
53975419
</div>
5398-
<div class="issue">
5399-
TODO give more compact form in two tables combining the two ideas above for encoding categories 0-9 + 11:
5400-
Table 1: 12bits (start code point) + 4bit (form+category)
5401-
Table 2: 1, 2 or 4bits? (number of code points in contiguous block)
5402-
</div>
54035420
</section>
54045421
<section id="stretchy-operator-axis">
54055422
<h3>Stretchy Operator Axis</h3>
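
Note on the new lookup steps: the Key computation and table search added in this commit can be sketched as below. This is only a minimal illustration under stated assumptions, not the spec's reference implementation. The function names (`compute_key`, `find_category_encoding`) and the two entries in `TOY_TABLE` are hypothetical and chosen for the example; the real entries live in the table referenced as `#operator-dictionary-categories-hexa-table`. The sketch also assumes the table is kept sorted by `Entry % 0x4000` so that a binary search applies, and that multi-character content has already been replaced by a single character in an earlier step.

```python
from bisect import bisect_left

# Offsets added to Key depending on Form, as listed in the diff above.
FORM_OFFSET = {"infix": 0x0000, "prefix": 0x1000, "postfix": 0x2000}


def compute_key(content, form):
    """Return Key for (content, form), or None to signal NotFound."""
    if len(content) != 1:
        return None
    cp = ord(content)
    if 0x0000 <= cp <= 0x03FF:
        key = cp                # at most 0x03FF
    elif 0x2000 <= cp <= 0x2BFF:
        key = cp - 0x1C00       # maps to 0x0400..0x0FFF
    else:
        return None
    return key + FORM_OFFSET[form]  # at most 0x2FFF


def find_category_encoding(table, content, form):
    """Search `table` for an Entry with Entry % 0x4000 == Key.

    Returns Entry // 0x1000 (the category encoding) or None for NotFound.
    Assumes `table` is sorted by Entry % 0x4000.
    """
    key = compute_key(content, form)
    if key is None:
        return None
    keys = [entry % 0x4000 for entry in table]
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return table[i] // 0x1000
    return None


# Hypothetical entries, NOT taken from the specification's table.
TOY_TABLE = sorted([0x402B, 0x5612], key=lambda entry: entry % 0x4000)
```

With the toy table, `find_category_encoding(TOY_TABLE, "+", "infix")` finds the entry `0x402B`, while `("+", "postfix")` yields `None` because no entry has the low 14 bits `0x202B`. In a real implementation the `keys` list would be precomputed once rather than rebuilt on every call.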
