Skip to content
193 changes: 138 additions & 55 deletions spec.md
Original file line number Diff line number Diff line change
Expand Up @@ -65,7 +65,37 @@ We also use the following operators and functions:
i.e. if $X = \langle X_0, \dots, X_N \rangle$ and $Y = \langle Y_0,
\dots, Y_M \rangle$ then $X \mathbin{\|} Y = \langle X_0, \dots, X_N, Y_0, \dots, Y_M
\rangle$
- $\operatorname{min}(x, y)$ denotes the minimum of $x$ and $y$.
- $\min(x, y)$ denotes the minimum of $x$ and $y$ and $\max(x, y)$ denotes the maximum
- $\operatorname{Type}(x)$ denotes the type of $x$.

Finally, we define the “prefix” $\mathbb{P}_q(X)$
of a non-empty sequence $X$
with respect to a given predicate $q$
to be the initial subsequence $X^\prime$ of $X$
up to and including the first member that makes $q(X^\prime)$ true.
And we define the “remainder” $\mathbb{R}_q(X)$
to be everything left after removing the prefix.

Formally,
given a sequence $X = \langle X_0, \dots, X_{|X|-1} \rangle$
and a predicate $q \in \operatorname{Type}(X) \rightarrow \{\text{true},\text{false}\}$,

$\mathbb{P}_q(X) = \langle X_0, \dots, X_e \rangle$

for the smallest integer $e$ such that:

- $0 \le e < |X|$ and
- $q(\langle X_0, \dots, X_e \rangle) = \text{true}$

or $|X|-1$ if no such integer exists.
(I.e., if nothing satisfies $q$, the prefix is the whole sequence.)
And:

$\mathbb{R}_q(X) = \langle X_b, \dots, X_{|X|-1} \rangle$

where $b = |\mathbb{P}_q(\langle X_0, \dots, X_{|X|-1} \rangle)|$.

Note that when $\mathbb{P}_q(X) = X$, $\mathbb{R}_q(X) = \langle \rangle$.

# Splitting

Expand All @@ -74,7 +104,7 @@ functions:

$\operatorname{SPLIT}_C \in V_8 \rightarrow V_v$

...which is parameterized by a configuration $C$, consisting of:
...which is parameterized by a _configuration_ $C$, consisting of:

- $S_{\text{min}} \in U_{32}$, the minimum split size
- $S_{\text{max}} \in U_{32}$, the maximum split size
Expand All @@ -87,24 +117,21 @@ The configuration must satisfy $S_{\text{max}} \ge S_{\text{min}} > 0$.

We define the constant $W$, which we call the "window size," to be 64.

The "split index" $I(X)$ of a sequence $X$ is either the smallest
non-negative integer $i$ satisfying:
We define the predicate $q_C(X)$
on a non-empty byte sequence $X$
with respect to a configuration $C$
to be:

- $i \le |X|$ and
- $S_{\text{max}} \ge i \ge S_{\text{min}}$ and
- $H(\langle X_{i-W}, \dots, X_{i-1} \rangle) \mod 2^T = 0$

...or $\operatorname{min}(|X|, S_{\text{max}})$, if no such $i$ exists. For the
purposes of this definition we set $X_i = 0$ for $i < 0$.

The “prefix” $P(X)$ of a non-empty sequence $X$ is $\langle X_0, \dots, X_{I(X)-1} \rangle$.

The “remainder” $R(X)$ of a non-empty sequence $X$ is $\langle X_{I(X)}, \dots, X_{|X|-1} \rangle$.
- $\text{true}$ if $|X| = S_{\text{max}}$; otherwise
- $\text{true}$ if $|X| \ge S_{\text{min}}$ and $H(\langle X_{\max(0,|X|-W)}, \dots, X_{|X|-1} \rangle) \mod 2^T = 0$
(i.e., the last $W$ bytes of $X$ hash to a value with at least $T$ trailing zeroes);
otherwise
- $\text{false}$.

We define $\operatorname{SPLIT}_C(X)$ recursively, as follows:

- If $|X| = 0$, $\operatorname{SPLIT}_C(X) = \langle \rangle$
- Otherwise, $\operatorname{SPLIT}_C(X) = \langle P(X) \rangle \mathbin{\|} \operatorname{SPLIT}_C(R(X))$
- Otherwise, $\operatorname{SPLIT}_C(X) = \langle \mathbb{P}_{q_C}(X) \rangle \mathbin{\|} \operatorname{SPLIT}_C(\mathbb{R}_{q_C}(X))$

# Tree Construction

Expand Down Expand Up @@ -133,63 +160,119 @@ will differ only in the subtrees in the vicinity of the differences.

## Definitions

The “hashval” $V(X)$ of a sequence $X$ is:
A “chunk” is a member of the sequence produced by $\operatorname{SPLIT}_C$.

The “hashval” $V_C(X)$ of a byte sequence $X$ is:

$H(\langle X_{\operatorname{max}(0, |X|-W)}, \dots, X_{|X|-1} \rangle)$
$H(\langle X_{\max(0, |X|-W)}, \dots, X_{|X|-1} \rangle)$

(i.e., the hash of the last $W$ bytes of $X$).

The “level” $L(X)$ of a sequence $X$ is $Q - T$,
A “node” $N_{h,i}$ in a hashsplit tree
at non-negative “height” $h$
is a sequence of children.
The children of a node at height 0 are chunks.
The children of a node at height $h+1$ are nodes at height $h$.

A “tier” of a hashsplit tree is a sequence of nodes
$N_h = \langle N_{h,0}, \dots, N_{h,k} \rangle$
at a given height $h$.

The function $\operatorname{Rightmost}(N_{h,i})$
on a node $N_{h,i} = \langle S_0, \dots, S_e \rangle$
produces the “rightmost leaf chunk”
defined recursively as follows:

- If $h = 0$, $\operatorname{Rightmost}(N_{h,i}) = S_e$
- If $h > 0$, $\operatorname{Rightmost}(N_{h,i}) = \operatorname{Rightmost}(S_e)$

The “level” $L_C(X)$ of a given chunk $X$
is $\max(0, Q - T)$,
where $Q$ is the largest integer such that

- $Q \le 32$ and
- $V(P(X)) \mod 2^Q = 0$

(i.e., the level is the number of trailing zeroes in the rolling checksum in excess of the threshold needed to produce the prefix chunk $P(X)$).

(Note:
When $|R(X)| > 0$,
$L(X)$ is non-negative,
because $P(X)$ is defined in terms of a hash with $T$ trailing zeroes.
But when $|R(X)| = 0$,
that hash may have fewer than $T$ trailing zeroes,
and so $L(X)$ may be negative.
This makes no difference to the algorithm below, however.)

A “node” in a hashsplit tree
is a pair $(D, C)$
where $D$ is the node’s “depth”
and $C$ is a sequence of children.
The children of a node at depth 0 are chunks
(i.e., subsequences of the input).
The children of a node at depth $D > 0$ are nodes at depth $D - 1$.

The function $\operatorname{Children}(N)$ on a node $N = (D, C)$ produces $C$
(the sequence of children).
- $V_C(\mathbb{P}_{q_C}(X)) \mod 2^Q = 0$

(i.e., the level is the number of trailing zeroes in the hashval
in excess of the threshold needed
to produce the prefix chunk $\mathbb{P}_{q_C}(X)$).

The level $L_C(N)$ of a given _node_ $N$
is the level of its rightmost leaf chunk:
$L_C(N) = L_C(\operatorname{Rightmost}(N))$

The predicate $z_{C,h}(K)$
on a sequence $K = \langle K_0, \dots, K_e \rangle$
of chunks or of nodes
with respect to a height $h$
is defined as:

- $\text{true}$ if $L_C(K_e) > h$; otherwise
- $\text{false}$.

For conciseness, define

- $P_C(X) = \mathbb{P}_{z_{C,0}}(\operatorname{SPLIT}_C(X))$ and
- $R_C(X) = \mathbb{R}_{z_{C,0}}(\operatorname{SPLIT}_C(X))$

## Algorithm

To compute a hashsplit tree from sequence $X$,
This section contains two descriptions of hashsplit trees:
an algebraic description for formal reasoning,
and a procedural description for practical construction.

### Algebraic description

The tier $N_0$
of hashsplit tree nodes
for a given byte sequence $X$
is equal to

$\langle P_C(X) \rangle \mathbb{\|} R_C(X)$

The tier $N_{h+1}$
of hashsplit tree nodes
for a given byte sequence $X$
is equal to

$\langle \mathbb{P}_{z_{C,h+1}}(N_h) \rangle \mathbb{\|} \mathbb{R}_{z_{C,h+1}}(N_h)$

(I.e., each node in the tree has as its children
a sequence of chunks or lower-tier nodes,
as appropriate,
up to and including the first one
whose “level” is greater than the node’s height.)

The root of the hashsplit tree is $N_{h^\prime,0}$
for the smallest value of $h^\prime$
such that $|N_{h^\prime}| = 1$

### Procedural description

For this description we use $N_h$ to denote a single node at height $h$.
The algorithm must keep track of the “rightmost” such node for each tier in the tree.

To compute a hashsplit tree from a byte sequence $X$,
compute its “root node” as follows.

1. Let $N_0$ be $(0, \langle\rangle)$ (i.e., a node at depth 0 with no children).
1. Let $N_0$ be $\langle\rangle$ (i.e., a node at height 0 with no children).
2. If $|X| = 0$, then:
a. Let $d$ be the largest depth such that $N_d$ exists.
b. If $|\operatorname{Children}(N_0)| > 0$, then:
i. For each integer $i$ in $[0 .. d]$, “close” $N_i$.
ii. Set $d \leftarrow d+1$.
c. [pruning] While $d > 0$ and $|\operatorname{Children}(N_d)| = 1$, set $d \leftarrow d-1$ (i.e., traverse from the prospective tree root downward until there is a node with more than one child).
d. **Terminate** with $N_d$ as the root node.
3. Otherwise, set $N_0 \leftarrow (0, \operatorname{Children}(N_0) \mathbin{\|} \langle P(X) \rangle)$ (i.e., add $P(X)$ to the list of children in $N_0$).
4. For each integer $i$ in $[0 .. L(X))$, “close” the node $N_i$ (see below).
5. Set $X \leftarrow R(X)$.
a. Let $h$ be the largest height such that $N_h$ exists.
b. If $|N_0| > 0$, then:
i. For each integer $i$ in $[0 .. h]$, “close” $N_i$ (see below).
ii. Set $h \leftarrow h+1$.
c. [pruning] While $h > 0$ and $|N_h| = 1$, set $h \leftarrow h-1$ (i.e., traverse from the prospective tree root downward until there is a node with more than one child).
d. **Terminate** with $N_h$ as the root node.
3. Otherwise, set $N_0 \leftarrow N_0 \mathbin{\|} \langle P_C(X) \rangle$ (i.e., add $P_C(X)$ to the list of children in $N_0$).
4. For each integer $i$ in $[0 .. L_C(X))$, “close” the node $N_i$ (see below).
5. Set $X \leftarrow R_C(X)$.
6. Go to step 2.

To “close” a node $N_i$:

1. If no $N_{i+1}$ exists yet, let $N_{i+1}$ be $(i+1, \langle\rangle)$ (i.e., a node at depth ${i + 1}$ with no children).
2. Set $N_{i+1} \leftarrow (i+1, \operatorname{Children}(N_{i+1}) \mathbin{\|} \langle N_i \rangle)$ (i.e., add $N_i$ as a child to $N_{i+1}$).
3. Let $N_i$ be $(i, \langle\rangle)$ (i.e., new node at depth $i$ with no children).
1. If no $N_{i+1}$ exists yet, let $N_{i+1}$ be $\langle\rangle$ (i.e., a node at height ${i + 1}$ with no children).
2. Set $N_{i+1} \leftarrow N_{i+1} \mathbin{\|} \langle N_i \rangle$ (i.e., add $N_i$ as a child to $N_{i+1}$).
3. Let $N_i$ be $\langle\rangle$ (i.e., new node at height $i$ with no children).

# Rolling Hash Functions

Expand Down