
Algebraic tree description #23


Merged
merged 10 commits into from
Oct 27, 2020
193 changes: 138 additions & 55 deletions spec.md
@@ -65,7 +65,37 @@ We also use the following operators and functions:
i.e. if $X = \langle X_0, \dots, X_N \rangle$ and $Y = \langle Y_0,
\dots, Y_M \rangle$ then $X \mathbin{\|} Y = \langle X_0, \dots, X_N, Y_0, \dots, Y_M
\rangle$
- $\operatorname{min}(x, y)$ denotes the minimum of $x$ and $y$.
- $\min(x, y)$ denotes the minimum of $x$ and $y$, and $\max(x, y)$ denotes the maximum.
- $\operatorname{Type}(x)$ denotes the type of $x$.

Finally, we define the “prefix” $\mathbb{P}_q(X)$
of a non-empty sequence $X$
with respect to a given predicate $q$
to be the initial subsequence $X^\prime$ of $X$
up to and including the first member that makes $q(X^\prime)$ true.
And we define the “remainder” $\mathbb{R}_q(X)$
to be everything left after removing the prefix.

Formally,
given a sequence $X = \langle X_0, \dots, X_{|X|-1} \rangle$
and a predicate $q \in \operatorname{Type}(X) \rightarrow \{\text{true},\text{false}\}$,

$\mathbb{P}_q(X) = \langle X_0, \dots, X_e \rangle$

for the smallest integer $e$ such that:

- $0 \le e < |X|$ and
- $q(\langle X_0, \dots, X_e \rangle) = \text{true}$

or $|X|-1$ if no such integer exists.
(I.e., if nothing satisfies $q$, the prefix is the whole sequence.)
And:

$\mathbb{R}_q(X) = \langle X_b, \dots, X_{|X|-1} \rangle$

where $b = |\mathbb{P}_q(\langle X_0, \dots, X_{|X|-1} \rangle)|$.

Note that when $\mathbb{P}_q(X) = X$, $\mathbb{R}_q(X) = \langle \rangle$.
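
As a rough illustration of the prefix and remainder definitions above, the following Python sketch treats a sequence as a list and a predicate as a function on an initial subsequence. The names `prefix` and `remainder` are hypothetical and chosen only for this example.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def prefix(q: Callable[[List[T]], bool], xs: List[T]) -> List[T]:
    """Return xs[0..e] for the smallest e such that q(xs[0..e]) is true.

    If no initial subsequence satisfies q, the prefix is the whole sequence.
    """
    for e in range(len(xs)):
        if q(xs[: e + 1]):
            return xs[: e + 1]
    return xs


def remainder(q: Callable[[List[T]], bool], xs: List[T]) -> List[T]:
    """Return everything left after removing the prefix."""
    return xs[len(prefix(q, xs)):]
```

For example, with a predicate that is true once the last member equals 3, the sequence `[1, 2, 3, 4, 5]` has prefix `[1, 2, 3]` and remainder `[4, 5]`.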

# Splitting

@@ -74,7 +104,7 @@ functions:

$\operatorname{SPLIT}_C \in V_8 \rightarrow V_v$

...which is parameterized by a configuration $C$, consisting of:
...which is parameterized by a _configuration_ $C$, consisting of:

- $S_{\text{min}} \in U_{32}$, the minimum split size
- $S_{\text{max}} \in U_{32}$, the maximum split size
@@ -87,24 +117,21 @@ The configuration must satisfy $S_{\text{max}} \ge S_{\text{min}} > 0$.

We define the constant $W$, which we call the "window size," to be 64.

The "split index" $I(X)$ of a sequence $X$ is either the smallest
non-negative integer $i$ satisfying:
We define the predicate $q_C(X)$
on a non-empty byte sequence $X$
with respect to a configuration $C$
to be:

- $i \le |X|$ and
- $S_{\text{max}} \ge i \ge S_{\text{min}}$ and
- $H(\langle X_{i-W}, \dots, X_{i-1} \rangle) \mod 2^T = 0$

...or $\operatorname{min}(|X|, S_{\text{max}})$, if no such $i$ exists. For the
purposes of this definition we set $X_i = 0$ for $i < 0$.

The “prefix” $P(X)$ of a non-empty sequence $X$ is $\langle X_0, \dots, X_{I(X)-1} \rangle$.

The “remainder” $R(X)$ of a non-empty sequence $X$ is $\langle X_{I(X)}, \dots, X_{|X|-1} \rangle$.
- $\text{true}$ if $|X| = S_{\text{max}}$; otherwise
- $\text{true}$ if $|X| \ge S_{\text{min}}$ and $H(\langle X_{\max(0,|X|-W)}, \dots, X_{|X|-1} \rangle) \mod 2^T = 0$
(i.e., the last $W$ bytes of $X$ hash to a value with at least $T$ trailing zeroes);
otherwise
- $\text{false}$.

We define $\operatorname{SPLIT}_C(X)$ recursively, as follows:

- If $|X| = 0$, $\operatorname{SPLIT}_C(X) = \langle \rangle$
- Otherwise, $\operatorname{SPLIT}_C(X) = \langle P(X) \rangle \mathbin{\|} \operatorname{SPLIT}_C(R(X))$
- Otherwise, $\operatorname{SPLIT}_C(X) = \langle \mathbb{P}_{q_C}(X) \rangle \mathbin{\|} \operatorname{SPLIT}_C(\mathbb{R}_{q_C}(X))$
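
The predicate $q_C$ and the recursive definition of $\operatorname{SPLIT}_C$ can be sketched in Python as follows. This is a minimal sketch under stated assumptions: `toy_hash` is a stand-in built on CRC-32 and is not the hash function $H$ of this specification, and the `Config` field names are hypothetical. The recursion is unrolled into a loop, and the hash is recomputed for every candidate prefix rather than rolled, so the sketch favors clarity over speed.

```python
import zlib
from dataclasses import dataclass
from typing import List

W = 64  # window size


@dataclass(frozen=True)
class Config:
    s_min: int  # minimum split size
    s_max: int  # maximum split size
    t: int      # threshold: required number of trailing zero bits


def toy_hash(window: bytes) -> int:
    # Assumption: a placeholder for H, NOT the spec's rolling hash.
    return zlib.crc32(window)


def q(c: Config, x: bytes) -> bool:
    """The predicate q_C on a non-empty byte sequence x."""
    if len(x) == c.s_max:
        return True
    # x[-W:] is the last W bytes, or all of x when it is shorter than W.
    return len(x) >= c.s_min and toy_hash(x[-W:]) % (1 << c.t) == 0


def split(c: Config, x: bytes) -> List[bytes]:
    """Repeatedly peel off the prefix of x with respect to q_C."""
    chunks = []
    while x:
        end = len(x)  # if q is never true, the prefix is all of x
        for e in range(len(x)):
            if q(c, x[: e + 1]):
                end = e + 1
                break
        chunks.append(x[:end])
        x = x[end:]
    return chunks
```

Whatever hash is plugged in, concatenating the chunks reproduces the input, no chunk exceeds `s_max`, and every chunk except possibly the last is at least `s_min` bytes long.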

# Tree Construction

@@ -133,63 +160,119 @@ will differ only in the subtrees in the vicinity of the differences.

## Definitions

The “hashval” $V(X)$ of a sequence $X$ is:
A “chunk” is a member of the sequence produced by $\operatorname{SPLIT}_C$.

The “hashval” $V_C(X)$ of a byte sequence $X$ is:

$H(\langle X_{\operatorname{max}(0, |X|-W)}, \dots, X_{|X|-1} \rangle)$
$H(\langle X_{\max(0, |X|-W)}, \dots, X_{|X|-1} \rangle)$

(i.e., the hash of the last $W$ bytes of $X$).

The “level” $L(X)$ of a sequence $X$ is $Q - T$,
A “node” $N_{h,i}$ in a hashsplit tree
at non-negative “height” $h$
is a sequence of children.
The children of a node at height 0 are chunks.
The children of a node at height $h+1$ are nodes at height $h$.

A “tier” of a hashsplit tree is a sequence of nodes
$N_h = \langle N_{h,0}, \dots, N_{h,k} \rangle$
at a given height $h$.

The function $\operatorname{Rightmost}(N_{h,i})$
on a node $N_{h,i} = \langle S_0, \dots, S_e \rangle$
produces the “rightmost leaf chunk”
defined recursively as follows:

- If $h = 0$, $\operatorname{Rightmost}(N_{h,i}) = S_e$
- If $h > 0$, $\operatorname{Rightmost}(N_{h,i}) = \operatorname{Rightmost}(S_e)$

The “level” $L_C(X)$ of a given chunk $X$
is $\max(0, Q - T)$,
where $Q$ is the largest integer such that

- $Q \le 32$ and
- $V(P(X)) \mod 2^Q = 0$

(i.e., the level is the number of trailing zeroes in the rolling checksum in excess of the threshold needed to produce the prefix chunk $P(X)$).

(Note:
When $|R(X)| > 0$,
$L(X)$ is non-negative,
because $P(X)$ is defined in terms of a hash with $T$ trailing zeroes.
But when $|R(X)| = 0$,
that hash may have fewer than $T$ trailing zeroes,
and so $L(X)$ may be negative.
This makes no difference to the algorithm below, however.)

A “node” in a hashsplit tree
is a pair $(D, C)$
where $D$ is the node’s “depth”
and $C$ is a sequence of children.
The children of a node at depth 0 are chunks
(i.e., subsequences of the input).
The children of a node at depth $D > 0$ are nodes at depth $D - 1$.

The function $\operatorname{Children}(N)$ on a node $N = (D, C)$ produces $C$
(the sequence of children).
- $V_C(\mathbb{P}_{q_C}(X)) \mod 2^Q = 0$

(i.e., the level is the number of trailing zeroes in the hashval
in excess of the threshold needed
to produce the prefix chunk $\mathbb{P}_{q_C}(X)$).

The level $L_C(N)$ of a given _node_ $N$
is the level of its rightmost leaf chunk:
$L_C(N) = L_C(\operatorname{Rightmost}(N))$

The predicate $z_{C,h}(K)$
on a sequence $K = \langle K_0, \dots, K_e \rangle$
of chunks or of nodes
with respect to a height $h$
is defined as:

- $\text{true}$ if $L_C(K_e) > h$; otherwise
- $\text{false}$.

For conciseness, define

- $P_C(X) = \mathbb{P}_{z_{C,0}}(\operatorname{SPLIT}_C(X))$ and
- $R_C(X) = \mathbb{R}_{z_{C,0}}(\operatorname{SPLIT}_C(X))$
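
To make the level and $\operatorname{Rightmost}$ definitions concrete, here is a small Python sketch. It assumes the hashval has already been computed and is passed in as an integer, and it represents a node as a nested list whose leaves are `bytes` chunks. The function names are hypothetical.

```python
def trailing_zeros_capped(v: int, cap: int = 32) -> int:
    """Largest Q <= cap such that v mod 2^Q == 0."""
    q = 0
    while q < cap and v % (1 << (q + 1)) == 0:
        q += 1
    return q


def level_from_hashval(hashval: int, t: int) -> int:
    """max(0, Q - T): trailing zero bits in excess of the threshold T."""
    return max(0, trailing_zeros_capped(hashval) - t)


def rightmost(node: list, height: int) -> bytes:
    """Rightmost leaf chunk of a node at the given height."""
    last = node[-1]
    return last if height == 0 else rightmost(last, height - 1)
```

For instance, a hashval of `0b101000` has three trailing zero bits, so with a threshold of 2 the level is 1. A hashval of 0 is treated as having 32 trailing zeroes because of the cap $Q \le 32$.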

## Algorithm

To compute a hashsplit tree from sequence $X$,
This section contains two descriptions of hashsplit trees:
an algebraic description for formal reasoning,
and a procedural description for practical construction.

### Algebraic description

The tier $N_0$
of hashsplit tree nodes
for a given byte sequence $X$
is equal to

$\langle P_C(X) \rangle \mathbin{\|} R_C(X)$

The tier $N_{h+1}$
of hashsplit tree nodes
for a given byte sequence $X$
is equal to

$\langle \mathbb{P}_{z_{C,h+1}}(N_h) \rangle \mathbin{\|} \mathbb{R}_{z_{C,h+1}}(N_h)$

(I.e., each node in the tree has as its children
a sequence of chunks or lower-tier nodes,
as appropriate,
up to and including the first one
whose “level” is greater than the node’s height.)

The root of the hashsplit tree is $N_{h^\prime,0}$
for the smallest value of $h^\prime$
such that $|N_{h^\prime}| = 1$.

### Procedural description

For this description we use $N_h$ to denote a single node at height $h$.
The algorithm must keep track of the “rightmost” such node for each tier in the tree.

To compute a hashsplit tree from a byte sequence $X$,
compute its “root node” as follows.

1. Let $N_0$ be $(0, \langle\rangle)$ (i.e., a node at depth 0 with no children).
1. Let $N_0$ be $\langle\rangle$ (i.e., a node at height 0 with no children).
2. If $|X| = 0$, then:
   a. Let $d$ be the largest depth such that $N_d$ exists.
   b. If $|\operatorname{Children}(N_0)| > 0$, then:
      i. For each integer $i$ in $[0 .. d]$, “close” $N_i$.
      ii. Set $d \leftarrow d+1$.
   c. [pruning] While $d > 0$ and $|\operatorname{Children}(N_d)| = 1$, set $d \leftarrow d-1$ (i.e., traverse from the prospective tree root downward until there is a node with more than one child).
   d. **Terminate** with $N_d$ as the root node.
3. Otherwise, set $N_0 \leftarrow (0, \operatorname{Children}(N_0) \mathbin{\|} \langle P(X) \rangle)$ (i.e., add $P(X)$ to the list of children in $N_0$).
4. For each integer $i$ in $[0 .. L(X))$, “close” the node $N_i$ (see below).
5. Set $X \leftarrow R(X)$.
   a. Let $h$ be the largest height such that $N_h$ exists.
   b. If $|N_0| > 0$, then:
      i. For each integer $i$ in $[0 .. h]$, “close” $N_i$ (see below).
      ii. Set $h \leftarrow h+1$.
   c. [pruning] While $h > 0$ and $|N_h| = 1$, set $h \leftarrow h-1$ (i.e., traverse from the prospective tree root downward until there is a node with more than one child).
   d. **Terminate** with $N_h$ as the root node.
3. Otherwise, set $N_0 \leftarrow N_0 \mathbin{\|} \langle P_C(X) \rangle$ (i.e., add $P_C(X)$ to the list of children in $N_0$).
4. For each integer $i$ in $[0 .. L_C(X))$, “close” the node $N_i$ (see below).
5. Set $X \leftarrow R_C(X)$.
6. Go to step 2.

To “close” a node $N_i$:

1. If no $N_{i+1}$ exists yet, let $N_{i+1}$ be $(i+1, \langle\rangle)$ (i.e., a node at depth ${i + 1}$ with no children).
2. Set $N_{i+1} \leftarrow (i+1, \operatorname{Children}(N_{i+1}) \mathbin{\|} \langle N_i \rangle)$ (i.e., add $N_i$ as a child to $N_{i+1}$).
3. Let $N_i$ be $(i, \langle\rangle)$ (i.e., new node at depth $i$ with no children).
1. If no $N_{i+1}$ exists yet, let $N_{i+1}$ be $\langle\rangle$ (i.e., a node at height ${i + 1}$ with no children).
2. Set $N_{i+1} \leftarrow N_{i+1} \mathbin{\|} \langle N_i \rangle$ (i.e., add $N_i$ as a child to $N_{i+1}$).
3. Let $N_i$ be $\langle\rangle$ (i.e., a new node at height $i$ with no children).
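
The procedural steps and the “close” operation can be sketched in Python roughly as follows. The sketch assumes the input has already been split, so it consumes a list of `(chunk, level)` pairs instead of calling $P_C$, $L_C$, and $R_C$ on a byte sequence. Two choices are interpretations rather than something the steps state: at end of input it closes every non-empty tier below the top (not only when $N_0$ is non-empty), and it reads the pruning step as descending into the only child of a single-child root. The name `build_tree` is hypothetical.

```python
from typing import List, Tuple


def build_tree(chunks_with_levels: List[Tuple[bytes, int]]) -> Tuple[list, int]:
    """Return (root node, root height); nodes are nested lists of chunks."""
    tiers: List[list] = [[]]  # tiers[i] is the open (rightmost) node N_i

    def close(i: int) -> None:
        if i + 1 == len(tiers):        # no N_{i+1} yet: create it
            tiers.append([])
        tiers[i + 1].append(tiers[i])  # add N_i as a child of N_{i+1}
        tiers[i] = []                  # start a new, empty N_i

    for chunk, level in chunks_with_levels:
        tiers[0].append(chunk)         # step 3: add the chunk to N_0
        for i in range(level):         # step 4: close N_0 .. N_{level-1}
            close(i)

    # End of input (interpretation): attach every non-empty open node
    # below the top tier to its parent.
    for i in range(len(tiers) - 1):
        if tiers[i]:
            close(i)

    # Pruning (interpretation): skip past single-child nodes at the top.
    height = len(tiers) - 1
    root = tiers[height]
    while height > 0 and len(root) == 1:
        root = root[0]
        height -= 1
    return root, height
```

With chunks `a`, `b`, `c` at levels 0, 1, 0, the sketch yields a root at height 1 whose children are the nodes `[a, b]` and `[c]`. With only level-0 chunks the root is the single height-0 node that holds them all.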

# Rolling Hash Functions
