diff --git a/README.md b/README.md
index 5c9ca2c..7d66817 100644
--- a/README.md
+++ b/README.md
@@ -2,10 +2,15 @@
 [![Build Status](https://drone.io/github.com/vacuumlabs/persistent/status.png)](https://drone.io/github.com/vacuumlabs/persistent/latest)
-check out [changes in 2.0 version!] (https://github.com/vacuumlabs/persistent/wiki/20changes).
+Check out [changes in 2.0 version!](changes_2_0.md)
-The project is forked from
+Learn how you can use [transients](transients.md)
+
+Want to understand the code? Want to contribute? See [technical overview](technical.md)
+
+
 ## What are persistent data structures
 *Persistent* data structure is an immutable structure; the main difference with standard data structures is how you 'write' to them: instead of mutating
@@ -69,5 +74,3 @@ or Dart2JS on Node (the numbers are quite independent of the structure size):
 Although the factors are quite big, the whole operation is still very fast and it probably won't be THE bottleneck which would slow down your app.
-Some [advanced topics](https://github.com/vacuumlabs/persistent/wiki/Advanced-topics).
-
diff --git a/changes_2_0.md b/changes_2_0.md
new file mode 100644
index 0000000..f741310
--- /dev/null
+++ b/changes_2_0.md
@@ -0,0 +1,9 @@
+- memory footprint reduced by a factor of 15 (wait, what? Was the old implementation so
+  inefficient? Or is the new one so cool? The truth is: both. Check out the benchmarks)
+
+- changes in the API, most notably PersistentMap -> PMap, PersistentVector -> PVec
+
+- more efficient == and != on PMap
+
+- deleted several classes; the whole class/interface hierarchy becomes much simpler (although a little bit dirtier; some performance-motivated compromises were introduced)
+
diff --git a/technical.md b/technical.md
new file mode 100644
index 0000000..9418012
--- /dev/null
+++ b/technical.md
@@ -0,0 +1,64 @@
+# Technical overview
+
+The implementation of Persistent Vector is very similar to the one found in
+Facebook's [immutable.js](https://github.com/facebook/immutable-js). We show almost no invention
+here. The rest of the document describes our design of Persistent Map, which is more unusual and needs
+more explanation.
+
+
+## PMap technical overview
+
+The implementation is a version of HAMT; be sure you understand the basic concepts before reading
+further. Good places to start are [Wikipedia](http://en.wikipedia.org/wiki/Hash_array_mapped_trie)
+or this
+[blog post](http://blog.higher-order.net/2009/09/08/understanding-clojures-persistenthashmap-deftwice.html).
+The following text explains issues that are specific to our implementation.
+
+The whole HAMT consists of two types of nodes: Node and Leaf (they are called _Node and _Leaf in the
+code).
+
+Node is a typical HAMT inner node. Its branching factor is set to 16 (this may change); currently this
+gets us the best results in the benchmarks. Note that Node implements the PMap interface.
+
+Leaf can hold several key-value pairs. These are stored in a simple List such as:
+[hash1, key1, value1, hash2, key2, value2, etc.]
+If the leaf grows big (currently, > 48 such h,k,v triplets), it is split up into several Nodes. Similarly,
+if a Node stores only a few k,v pairs (in all its nodes), it is compacted into one single Leaf (the threshold
+for this is currently set to < 32 triplets).
+
+A few things to note here:
+
+- In the tree, h,k,v triplets are stored in a way that guarantees the following property: if iterating
+  through one Node in order (i.e. you are recursively visiting its children from the 0th to the
+  15th), you enumerate h,k,v triplets sorted by hash value. This may look unimportant at first
+  glance, but it simplifies several things; for example, comparing a Leaf with a Node for equality, or doing an intersection
+  of a Leaf and a Node, gets easier. For this purpose, we do the following:
+
+  - In a single Leaf, h,k,v triplets are sorted by the hash. This allows us to binary-search for the
+    correct value when doing a lookup.
+
+  - In the put / lookup process, we consume the key hash from the first digits (not from the last,
+    as usual). Note that hashes of small objects (especially small ints) tend to have just zeros
+    in the leading places. To overcome this problem, we work with a mangled hash, which has enough
+    entropy also in the first digits (check out the _mangeHash function).
+
+- In the Node implementation we're not compacting the array of children. Typically, to save memory, a HAMT
+  implementation stores only non-null children. Such implementations then use a bitmask to correctly
+  determine what the proper indexes of individual (non-null) children would be (if the nulls were
+  there). Such a trick is neat, but it costs time, and moreover, we don't need it. Why? Because we
+  store up to 48 values in a single Leaf. This means that when the Leaf gets expanded to a proper Node,
+  most of its children will be non-null. (Exercise: you randomly pick 48 numbers from the
+  interval 0,15 inclusive. What is the expected count of numbers that are never picked?)
+
+- Node is a strange class. It serves two purposes (which is probably not the cleanest design):
+  it implements all PMap methods (in fact, when you construct a new PMap, what you get is a Node) and it
+  implements low-level methods for HAMT manipulation. Moreover, PMap methods (such as assoc) can be
+  called only on the root Node; on every other Node, such a call will lead to an inconsistent result.
+  Why such a bad design?
+
+  - The main purpose is to save time and memory by not creating an additional object that would encapsulate the
+    root Node (yes, it matters).
+
+  - All "bad things" happen only internally, and there is no way for the end user to get the
+    structure into an inconsistent state. So, it's not such a bad design after all.
+
diff --git a/transients.md b/transients.md
new file mode 100644
index 0000000..610aa40
--- /dev/null
+++ b/transients.md
@@ -0,0 +1,84 @@
+## Working with Transients
+
+A persistent data structure can be 'unfrozen' into a *Transient* structure (TMap, TVec, ..), which is mutable. The purpose of this is to gain some speedup while still working with Persistents. The typical workflow is as follows:
+
+  1. TMap trans = pers.asTransient();
+  2. do a lot of mutations on trans
+  3. Persistent result = trans.asPersistent();
+
+There are several notable things here:
+- When working with transients, methods like `assoc` or `delete` are no longer there. Instead of these, use `doAssoc`, `doDelete`, etc. These methods return void and mutate the structure in place.
+- Once `.asPersistent` is called, you cannot modify the transient anymore. If you try, you'll get an exception.
+- Conversion Persistent -> Transient and vice versa is O(1), which in this case means really fast.
+- Everything is safe, i.e. the basic contract 'there is no way to modify a Persistent data structure' still holds.
+- Since steps 1-3 form a repeating pattern, the `PersistentStructure.withTransient(modifier)` helper method exists.
+
+### Equality and hash
+
+Two persistent structures are equal if they carry equal data.
+This allows them to be used as map keys: in that context, the key is the data,
+not the object itself.
+
+Two transient structures are equal in the standard sense of the word: if they are the same object.
+
+The hash code is consistent with the equality operator.
+
+## Example
+
+    import 'package:persistent/persistent.dart';
+
+    main() {
+
+      // Persistency:
+
+      PMap map1 = new PMap.from({"a":1, "b":2});
+      PMap map2 = new PMap.from({"b":3, "c":4});
+
+      print(map1["a"]); // 1
+      print(map1.lookup("b")); // 2
+      print(map1.lookup("c", orElse: ()=>":(")); // :(
+
+      print(map1.insert("c", 3)); // {a: 1, b: 2, c: 3}
+      print(map1.insert("d", 4)); // {a: 1, b: 2, d: 4}
+
+      final map3 = map2.insert("c", 3, (x,y) => x+y);
+      print(map3.delete("b")); // {c: 7}
+      print(map3.delete("a", safe: true)); // {b: 3, c: 7}
+
+      print(map1); // {a: 1, b: 2}
+      print(map2); // {b: 3, c: 4}
+      print(map3); // {b: 3, c: 7}
+
+      // Transiency:
+
+      final vector1 = new PersistentVector.from(["x", "y"]);
+
+      print(vector1.push("z")); // (x, y, z)
+      print(vector1.push("q")); // (x, y, q)
+
+      var temp = vector1.asTransient();
+      temp.doPush("z");
+      temp.doPush("q");
+      temp[1] = "Y";
+      final vector2 = temp.asPersistent();
+
+      final vector3 = vector2.withTransient((TransientVector v){
+        v.doSet(2, "Z");
+        v.doPop();
+        v[0] = "X";
+      });
+
+      print(vector1); // (x, y)
+      print(vector2); // (x, Y, z, q)
+      print(vector3); // (X, Y, Z)
+
+      // Features
+
+      print(map1.toList()); // [Pair(a, 1), Pair(b, 2)]
+
+      final set1 = new PersistentSet.from(["a", "b"]);
+      final set2 = new PersistentSet.from([1, 2, 3]);
+      print((set1 * set2).toList());
+      // [Pair(a, 2), Pair(a, 1), Pair(b, 3), Pair(b, 2), Pair(b, 1), Pair(a, 3)]
+
+    }
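
Note on the exercise in technical.md (48 numbers picked from the interval 0,15): a possible worked answer, as a sketch. It assumes the 48 picks are independent and uniform over the 16 child slots, which real hash digits only approximate. The symbol $X$ (the number of slots that are never picked) is introduced here for illustration and does not appear in the patch.

```latex
\begin{align*}
P(\text{slot } i \text{ is never picked}) &= \left(\tfrac{15}{16}\right)^{48} \approx 0.045 \\
\mathbb{E}[X] &= \sum_{i=0}^{15} P(\text{slot } i \text{ is never picked})
             = 16 \cdot \left(\tfrac{15}{16}\right)^{48} \approx 0.72
\end{align*}
```

Under that assumption, fewer than one of the 16 children is expected to be null right after a full Leaf is expanded into a Node, which is consistent with the claim that a bitmask-compacted child array would save little memory here.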