@@ -5283,7 +5283,6 @@ <h3>Operator Dictionary</h3>
5283
5283
</ section >
5284
5284
< section id ="operator-dictionary-compact ">
5285
5285
< h3 > Operator Dictionary (Compact)</ h3 >
5286
- < div class ="issue " data-number ="209 "> Remove fence/separator?</ div >
5287
5286
< p >
5288
5287
The following dictionary provides a compact form for the
5289
5288
< a href ="#operator-dictionary "> operator dictionary</ a > , suitable for
@@ -5316,23 +5315,58 @@ <h3>Operator Dictionary (Compact)</h3>
5316
5315
< code > fence</ code > , < code > separator</ code > to < code > false</ code > .
5317
5316
</ li >
5318
5317
< li > If < code > Content</ code > is a single character in the
5319
- BMP Private Use Area ( range U+E000 –U+F8FF)
5318
+ range U+0320 –U+03FF
5320
5319
then exit with < code > NotFound</ code > status.</ li >
5321
5320
< li >
5322
5321
If < code > Content</ code > is a UTF-16 string of length greater than 1
5323
5322
(including the case of surrogate pairs) and is listed in
5324
5323
< a href ="#operator-dictionary-compact-special-tables "> < code > Operators_multichar</ code > </ a > then
5325
5324
replace < code > Content</ code > with the Unicode character
5326
- "U+E000 plus the index of < code > Content</ code > in
5327
- < code > Operators_multichar</ code > ". Otherwise, exit with
5328
- < code > NotFound</ code > status.
5325
+ "U+0320 plus the index of < code > Content</ code > in
5326
+ < code > Operators_multichar</ code > ". If it is not listed, then
5327
+ exit with < code > NotFound</ code > status.
5329
5328
</ li >
5330
- < li > If (< code > Content</ code > , < code > Form</ code > )
5331
- corresponds to one category of
5332
- < a href ="#operator-dictionary-category-table "> </ a > then
5333
- set the properties according to
5334
- < a href ="#operator-dictionary-categories-values "> </ a > .
5335
- Otherwise, exit with < code > NotFound</ code > status.
5329
+ < li >
5330
+ During this step, the algorithm will try to find a category
5331
+ corresponding to (< code > Content</ code > , < code > Form</ code > ) from
5332
+ < a href ="#operator-dictionary-category-table "> </ a > and
5333
+ either exit with < code > NotFound</ code > status or move to
5334
+ the next point. More precisely, this can be done as follows:
5335
+ < ul >
5336
+ < li > For categories that don't have an encoding in
5337
+ < a href ="#operator-dictionary-categories-values "> </ a >
5338
+ (namely K, M, N) perform a few direct verifications
5339
+ on (< code > Content</ code > , < code > Form</ code > )
5340
+ according to < a href ="#operator-dictionary-category-table "> </ a > .
5341
+ If a result is found then set the properties according to
5342
+ < a href ="#operator-dictionary-categories-values "> </ a > .
5343
+ Otherwise exit with < code > NotFound</ code > status.
5344
+ </ li >
5345
+ < li > For other categories, perform the following steps:
5346
+ < ul >
5347
+ < li > Set < code > Key</ code > to < code > Content</ code > if it is in
5348
+ range U+0000–U+03FF ; or to < code > Content</ code > − 0x1C00
5349
+ if it is in range U+2000–U+2BFF. Otherwise, exit with
5350
+ < code > NotFound</ code > status.
5351
+ < code > Key</ code > is at most 0x0FFF.
5352
+ </ li >
5353
+ < li > Add 0x0000, 0x1000, 0x2000
5354
+ to < code > Key</ code > according to whether < code > Form</ code >
5355
+ is < code > infix</ code > , < code > prefix</ code > ,
5356
+ < code > postfix</ code > respectively.
5357
+ < code > Key</ code > is at most 0x2FFF.
5358
+ </ li >
5359
+ < li > Search for an < code > Entry</ code > in table
5360
+ < a href ="#operator-dictionary-categories-hexa-table "> </ a >
5361
+ such that < code > Entry</ code > % 0x4000 is equal to
5362
+ < code > Key</ code > . Either exit with
5363
+ < code > NotFound</ code > status or
5364
+ set the properties corresponding to the category with
5365
+ encoding < code > Entry</ code > / 0x1000 in
5366
+ < a href ="#operator-dictionary-categories-values "> </ a > .
5367
+ </ ul >
5368
+ </ li >
5369
+ </ ul >
5336
5370
</ li >
5337
5371
< li > If < code > Content</ code > is in
5338
5372
< a href ="#operator-dictionary-compact-special-tables "> < code > Operators_fence</ code > </ a > then set property < code > fence</ code > to true.</ li >
@@ -5348,58 +5382,41 @@ <h3>Operator Dictionary (Compact)</h3>
5348
5382
</ li >
5349
5383
</ ol >
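The key computation and table lookup in the steps above can be sketched as follows. This is a minimal illustration, not the specification's implementation: the function names, the `FORM_OFFSET` mapping, and any table contents passed in are hypothetical, and the linear scan is used only for readability. The real `Operators_multichar` list and the hexadecimal category table are not reproduced here.

```python
from typing import Optional, Sequence

# Offsets added to Key depending on Form, as described in the steps above.
FORM_OFFSET = {"infix": 0x0000, "prefix": 0x1000, "postfix": 0x2000}


def normalize_content(content: str,
                      operators_multichar: Sequence[str]) -> Optional[int]:
    """Map Content to a single code point, or None for NotFound."""
    utf16_length = len(content.encode("utf-16-le")) // 2
    if utf16_length == 1:
        code_point = ord(content)
        # Single characters in U+0320-U+03FF exit with NotFound; the
        # algorithm uses this range for multi-character operators.
        if 0x0320 <= code_point <= 0x03FF:
            return None
        return code_point
    if content in operators_multichar:
        # "U+0320 plus the index of Content in Operators_multichar"
        return 0x0320 + operators_multichar.index(content)
    return None


def compute_key(code_point: int, form: str) -> Optional[int]:
    """Compute Key from a code point and a form, or None for NotFound."""
    if 0x0000 <= code_point <= 0x03FF:
        key = code_point
    elif 0x2000 <= code_point <= 0x2BFF:
        key = code_point - 0x1C00  # lands in 0x0400-0x0FFF
    else:
        return None
    return key + FORM_OFFSET[form]  # at most 0x2FFF


def find_category_encoding(entries: Sequence[int],
                           code_point: int, form: str) -> Optional[int]:
    """Return Entry / 0x1000 for the matching Entry, or None for NotFound."""
    key = compute_key(code_point, form)
    if key is None:
        return None
    for entry in entries:  # linear scan for clarity only
        if entry % 0x4000 == key:
            return entry // 0x1000
    return None
```

For instance, with a made-up table `[0x002B, 0x5612]`, looking up U+2212 with form `prefix` gives `Key` 0x1612, matches the entry 0x5612, and yields the encoding 0x5. The encoding value here is invented for the example and does not come from the category table.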
5350
5384
5385
+ < div class ="note ">
5386
+ The < code > fence</ code > and < code > separator</ code > properties do not
5387
+ have any visible effect on the layout described in this
5388
+ specification. So steps 5 and 6, as well as the corresponding tables,
5389
+ may be ignored.
5390
+ </ div >
5391
+ < div class ="issue " data-number ="209 "> Remove fence/separator?</ div >
5392
+
5351
5393
< div id ="operator-dictionary-entries "
5352
5394
data-include ="tables/operator-dictionary-compact.html "> </ div >
5353
5395
5354
5396
< div class ="note " id ="operator-dictionary-compact-implementations ">
5355
5397
< p >
5356
- After conversion to a single UTF-16 character, determining the
5357
- category of ('Content', 'Form') can be done by binary searches
5358
- on the tables corresponding to the 'Form' value
5359
- of < a href ="#operator-dictionary-category-table "> </ a > .
5360
- For tables of ranges, the binary search can be performed on the
5361
- range start code point. Note that small tables only have a few
5362
- ranges or code points to check and so can be handled by direct
5363
- comparaisons.
5398
+ When tables are encoded as ranges, one can perform a binary search by looking
5399
+ for the range start, followed by an extra check on the range length.
5400
+ Since log is concave,
5401
+ it is worse to do one binary search on each large subtable
5402
+ of < a href ="#operator-dictionary-category-table "> </ a > than one
5403
+ binary search on the whole table of
5404
+ < a href ="#operator-dictionary-categories-hexa-table "> </ a > .
5405
+ One can see that there are several contiguous Unicode blocks, so
5406
+ encoding tables as ranges allows getting close to 8 bits per entry.
5364
5407
</ p >
5365
5408
< p >
5366
- The possible characters 'Content' values after conversion
5367
- characters are located into the three small ranges
5368
- U+0000–U+03FF, U+2000–U+2BFF and
5369
- U+E000–U+E04F and after simple offset shift can be encoded on
5370
- 12 bits. Note that all Unicode ranges from
5371
- and < a href ="#operator-dictionary-category-table "> </ a >
5372
- contain between 1 and 32 characters. By splitting ranges into
5373
- at most two parts, each range can be encoded on 16 bits.
5374
- Due to several contiguous Unicode blocks, the tables would still be
5375
- encoded in significantly less than 16bits/entry but all the
5376
- tables are now encoded and treated the same way.
5377
- </ p >
5378
- < p >
5379
- Alternatively, discarding the smallest tables as explained above,
5380
- one can consider only those having a 4bits encoding in
5381
- < a href ="#operator-dictionary-categories-values "> </ a > .
5382
- Using the 12-bit encoding of the 'Content' described
5383
- above this means that these tables can be encoded with
5384
- 16bits/entry but binary search would now be performed on a single
5385
- table.
5386
- </ p >
5387
- < p >
5388
- Continuing on the previous approach, it is possible to
5409
+ Alternatively, it is possible to
5389
5410
use a perfect hash function to implement table lookup in constant
5390
- time [[?gperf]] [[?CMPH]]. This would add 16 bits per empty entry
5411
+ time [[?gperf]] [[?CMPH]]. This would instead take
5412
+ 16 bits per entry, plus 16 bits per extra empty entry
5391
5413
(for a non-minimal perfect hash function) as well as extra data to
5392
5414
store the hash function parameters. For a minimal perfect hash
5393
5415
function, the theoretical lower bound for storing these parameters is
5394
5416
1.44 bits/entry and existing algorithms range from close to that
5395
5417
limit up to 4 bits/entry.
5396
5418
</ p >
5397
5419
</ div >
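The range-based binary search mentioned in the note above can be sketched as follows. This is a hedged illustration under assumed conventions: the three parallel sequences (`starts`, `lengths`, `encodings`) are a hypothetical layout, not the encoding used by the specification's tables, and `starts` is assumed to be sorted in increasing order of `Key`.

```python
from bisect import bisect_right
from typing import Optional, Sequence


def find_in_ranges(starts: Sequence[int],
                   lengths: Sequence[int],
                   encodings: Sequence[int],
                   key: int) -> Optional[int]:
    """Binary search on range starts, then check the range length.

    Returns the category encoding of the range containing key,
    or None (NotFound) if no range contains it.
    """
    # Index of the last range whose start is <= key.
    index = bisect_right(starts, key) - 1
    if index < 0:
        return None
    # Extra check: key must fall within the length of that range.
    if key - starts[index] < lengths[index]:
        return encodings[index]
    return None
```

With the made-up ranges `starts = [0x002B, 0x1612]`, `lengths = [1, 3]`, `encodings = [0, 5]`, the key 0x1613 falls inside the second range while 0x1615 falls just past its end and is reported as not found.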
5398
- < div class ="issue ">
5399
- TODO give more compact form in two tables combining the two ideas above for encoding categories 0-9 + 11:
5400
- Table 1: 12bits (start code point) + 4bit (form+category)
5401
- Table 2: 1, 2 or 4bits? (number of code points in contiguous block)
5402
- </ div >
5403
5420
</ section >
5404
5421
< section id ="stretchy-operator-axis ">
5405
5422
< h3 > Stretchy Operator Axis</ h3 >