First draft of static probability tables format #8

Open
wants to merge 1 commit into
base: master
96 changes: 96 additions & 0 deletions STATIC PREDICTION TABLES.md
@@ -0,0 +1,96 @@
# About this document

This is a Work in Progress.

This document describes a format for storing Static Prediction Tables.

It is possible that some or all of this format will be usable as part of
individual compressed source files; this is still to be determined.

# About Static Prediction Tables (SPT)

A Static Prediction Table is a form of external dictionary, shipped either
with the JavaScript VM, or separately, and which may be referenced by any
number of compressed Binary AST Source Files.

Shipping Static Prediction Tables makes it possible to considerably reduce
the size of individual compressed files.

This document does not attempt to document how and when a SPT is loaded,
or how and when an individual compressed file references a SPT.

# Design guidelines

- A SPT must be usable by many compressed source files.
- A VM must be able to manage several SPTs simultaneously.
- As JavaScript is a changing language, a SPT that is complete at a given point in time cannot be expected to remain
complete forever.
- As Binary AST never reuses the same interface name for distinct purposes, a Path in the AST that is valid at a point
in time will remain valid forever.
- For upgrade purposes, a SPT may be defined as an amendment to another SPT.
- A SPT may define additional strings of various natures.

# Format

## Header

The header:
- specifies the kind of file;
- references the grammar version;
- optionally, references a SPT file it amends.

At this early stage, "delta" SPTs do not seem worth the effort of speccing. They are off the fast path anyway, and I can see them adding a lot of complexity to the spec. Let's leave the deltas until we actually feel we need them.

Contributor Author

It's not the highest priority, but let's keep an eye on the road :)


TBD

## Tables of Strings

These tables add new strings that may be referenced both in the tables of probabilities
and in the compressed files.

So far, the probability tables for strings just predict indexes into a move-to-front cache. We do not actually need to assign general probabilities to the string table itself - they will be predicted well after they are first referenced (and encoded using some varuint-encoding), and subsequently added to the MoveToFront String cache.

Contributor Author

That's true for string literals, identifier names and property keys.

On the other hand, it's not true for interface names and string enums.

I'll amend the text to clarify.
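
To make the move-to-front cache mentioned in the review discussion concrete, here is a minimal Python sketch. It is only an illustration, not the format's definition: the class name `MoveToFrontCache`, the `MISS` marker value, and the eviction policy (drop the least recently used string once the cache is full) are all assumptions.

```python
from typing import List

MISS = -1  # hypothetical marker meaning "string is not in the cache"


class MoveToFrontCache:
    """Fixed-capacity move-to-front cache of strings (illustrative only)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.items: List[str] = []

    def access(self, value: str) -> int:
        """Return the index `value` had before this access, or MISS.

        In both cases `value` ends up at the front of the cache.
        """
        try:
            index = self.items.index(value)
        except ValueError:
            index = MISS
        else:
            del self.items[index]
        self.items.insert(0, value)
        # Assumed policy: evict the least recently used strings when full.
        del self.items[self.capacity:]
        return index
```

With a cache of size N, the symbols returned by `access` are exactly the set {0, 1, ..., N-1, MISS} that a string probability table would need to cover.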

### Table of property keys
TBD

### Table of string enum constants

This table adds string enum constants. It is used when updating the JavaScript grammar.

TBD

### Table of interface names

This table adds interface names. It is used when updating the JavaScript grammar.

TBD

### Table of literal strings
TBD

### Table of identifier names
TBD


## Tables of Probabilities

We can simplify the specification of probability tables by specifying that independently. We know that each probability table will specify the probabilities for a finite and relatively "small" set of symbols.

For context-prediction of tree types, it's the set of schema-bounded types at that location. For string predictions, it's the set {0, 1, ..., N-1, MISS} where N is the size of the MoveToFront string cache, etc.

Each table can be encoded simply as a series of 32-bit integers, where the sum of all entries is guaranteed to be less than UINT32_MAX.

Contributor Author

I don't get where anything is simplified.
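
For reference, the encoding proposed in the review comment above (a series of 32-bit integers whose sum stays below UINT32_MAX) might look like the following Python sketch. The little-endian byte order and the function names `encode_counts`, `decode_counts` and `probabilities` are assumptions made for illustration; the thread does not settle any of them.

```python
import struct
from typing import List

UINT32_MAX = 0xFFFFFFFF


def encode_counts(counts: List[int]) -> bytes:
    """Pack instance counts as unsigned 32-bit integers (little-endian assumed)."""
    if any(c < 0 for c in counts) or sum(counts) >= UINT32_MAX:
        raise ValueError("counts must be non-negative and sum to less than UINT32_MAX")
    return struct.pack(f"<{len(counts)}I", *counts)


def decode_counts(data: bytes) -> List[int]:
    """Inverse of encode_counts."""
    if len(data) % 4 != 0:
        raise ValueError("length must be a multiple of 4 bytes")
    return list(struct.unpack(f"<{len(data) // 4}I", data))


def probabilities(counts: List[int]) -> List[float]:
    """Turn instance counts into probabilities for one table."""
    total = sum(counts)
    if total == 0:
        raise ValueError("empty distribution")
    return [c / total for c in counts]
```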


These tables increase or reset to 0 the number of instances of (value) at (path).

Entries with depth N look like:
- list of
  - the Path itself, as a list of exactly N entries of
    - InterfaceName (as a number, exact format to be determined)
    - Field index
  - the distribution at this Path, as a list of
    - Field index
    - Value (format to be determined)
    - Number of instances, where
      - 0 means that we remove this (Path, Field, Value) from the probability table
      - otherwise, if (Path, Field, Value) was in the probability table, we increase its previous number of instances
      - otherwise, we add (Path, Field, Value) to the probability table with `Number of instances` instances.
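
The update rule for `Number of instances` can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the in-memory table is a plain dictionary, interface names are small integers, and values are strings. The document leaves all of these formats to be determined, and the name `apply_entry` is hypothetical.

```python
from typing import Dict, Tuple

# A Path is a sequence of (interface name id, field index) pairs.
Path = Tuple[Tuple[int, int], ...]
# Key of the probability table: (Path, Field index, Value).
Key = Tuple[Path, int, str]


def apply_entry(table: Dict[Key, int], path: Path, field: int,
                value: str, instances: int) -> None:
    """Apply one (Path, Field, Value, Number of instances) entry to `table`."""
    if instances < 0:
        raise ValueError("number of instances cannot be negative")
    key = (path, field, value)
    if instances == 0:
        # 0 removes the entry, whether or not it was present.
        table.pop(key, None)
    else:
        # Increase an existing count, or add the entry with this count.
        table[key] = table.get(key, 0) + instances
```

Applying the same entry twice adds its count twice, so an amending table only needs to carry the difference from the table it amends.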

The table of probabilities contains
- One entry of depth 0, with a single field: the root.
- Entries of depth 1, for possible children of the root.
- Entries of depth 2, for possible grandchildren of the root.
- ...
- Entries of depth D, for all other nodes.
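
One possible reading of this layout is that a node deeper than D is looked up using only its D nearest ancestors. The Python sketch below illustrates that reading; it is an assumption rather than something the draft states, and the function name `lookup_path` and the root-first ordering of the path are hypothetical.

```python
from typing import Tuple

# A Path is a sequence of (interface name id, field index) pairs, root first.
Path = Tuple[Tuple[int, int], ...]


def lookup_path(ancestors: Path, max_depth: int) -> Path:
    """Return the Path under which a node's distribution would be looked up.

    Nodes at depth <= max_depth use their full path; deeper nodes use only
    their `max_depth` nearest ancestors (assumed interpretation).
    """
    if max_depth < 0:
        raise ValueError("max_depth cannot be negative")
    if len(ancestors) <= max_depth:
        return ancestors
    return ancestors[len(ancestors) - max_depth:]
```

Under this reading, the root is looked up with the empty path, which matches the single entry of depth 0.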


TBD