This repository has been archived by the owner on Aug 27, 2020. It is now read-only.
# FastGroupBy
Faster algorithms for vector group-by operations. This package currently supports faster group-bys where the group-by vector is a `CategoricalVector` or a `Vector{T}` with `T<:Union{Integer, Bool, String}`.
## Installation
```julia; eval=false
using Pkg
# install the registered version
Pkg.add("FastGroupBy")
# install the latest version from GitHub
Pkg.add(PackageSpec(url = "https://github.com/xiaodaigh/FastGroupBy.jl.git"))
```
# `fastby` and `fastby!`
The `fastby` and `fastby!` functions perform an arbitrary computation on one vector (`valvec`) grouped by another vector (`byvec`). They return a `Tuple` whose first element contains the distinct groups and whose second element contains the results of applying the function `fn` to `valvec` within each group of `byvec`. The arguments `fn`, `byvec`, and `valvec` are explained below.
The difference between `fastby` and `fastby!` is that `fastby!` may change the input vectors `byvec` and `valvec` whereas `fastby` won't.
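The `!` suffix follows the usual Julia convention for functions that may modify their arguments. As a rough illustration of why an in-place variant can be useful, the sketch below groups by sorting: the mutating version reorders its inputs directly, while the non-mutating version works on copies. The names `sorted_groupby!` and `sorted_groupby` are hypothetical, and this is an assumption about one possible approach, not a description of how FastGroupBy is implemented.

```julia
# Hypothetical helpers illustrating the `!` convention; not FastGroupBy's code.

# Mutating variant: reorders both inputs so that equal keys become adjacent.
function sorted_groupby!(fn, byvec::Vector, valvec::Vector)
    length(byvec) == length(valvec) ||
        throw(ArgumentError("byvec and valvec must have the same length"))
    perm = sortperm(byvec)
    permute!(byvec, perm)    # byvec is now sorted
    permute!(valvec, perm)   # valvec follows the same order
    groups = eltype(byvec)[]
    results = []
    start = 1
    for i in eachindex(byvec)
        # A run ends at the last element or where the key changes.
        if i == lastindex(byvec) || byvec[i + 1] != byvec[i]
            push!(groups, byvec[i])
            push!(results, fn(view(valvec, start:i)))
            start = i + 1
        end
    end
    return (groups, results)
end

# Non-mutating variant: pay for two copies, leave the caller's vectors alone.
sorted_groupby(fn, byvec, valvec) = sorted_groupby!(fn, copy(byvec), copy(valvec))

by = [88, 888, 8, 88, 888, 88]
val = [1, 2, 3, 4, 5, 6]
sorted_groupby(sum, by, val)    # by and val are unchanged afterwards
sorted_groupby!(sum, by, val)   # by is now sorted, val reordered to match
```

The non-mutating wrapper trades extra memory for safety, which is the same trade-off the `fastby` / `fastby!` pair exposes to the caller.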
Both functions take the same three main arguments, so we illustrate using `fastby` only:
```julia; eval = false
fastby(fn, byvec, valvec)
```
* `fn` is the function to be applied to each by-group of `valvec`
* `byvec` is the vector to group-by
* `valvec` is the vector that `fn` is applied to
For example, `fastby(sum, byvec, valvec)` computes the same per-group totals as `StatsBase`'s `countmap(byvec, weights(valvec))`, although `countmap` returns a `Dict` rather than a `Tuple`. Consider the following vectors:
```julia
using FastGroupBy
byvec = [88, 888, 8, 88, 888, 88]
valvec = [1 , 2 , 3, 4 , 5 , 6]
```
To compute the sum of `valvec` within each group of `byvec`, we do
```julia
grpsum = fastby(sum, byvec, valvec)
expected_result = Dict(88 => 11, 8 => 3, 888 => 7)
Dict(zip(grpsum...)) == expected_result # true
```
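To make the return format concrete, here is a minimal reference sketch in plain Julia that produces the same kind of `(groups, results)` tuple using a dictionary of buckets. The name `naive_groupby` is hypothetical and the approach is chosen for clarity only; it is not the algorithm FastGroupBy uses and makes no attempt to be fast.

```julia
# Hypothetical reference implementation; not part of FastGroupBy.
function naive_groupby(fn, byvec::AbstractVector, valvec::AbstractVector)
    length(byvec) == length(valvec) ||
        throw(ArgumentError("byvec and valvec must have the same length"))
    # Collect the values that belong to each distinct key.
    buckets = Dict{eltype(byvec), Vector{eltype(valvec)}}()
    for (k, v) in zip(byvec, valvec)
        push!(get!(() -> eltype(valvec)[], buckets, k), v)
    end
    groups = collect(keys(buckets))
    results = [fn(buckets[g]) for g in groups]
    return (groups, results)
end

groups, sums = naive_groupby(sum, [88, 888, 8, 88, 888, 88], [1, 2, 3, 4, 5, 6])
Dict(zip(groups, sums)) == Dict(88 => 11, 8 => 3, 888 => 7)  # true
```

Because the order of the groups in such a tuple is not something to rely on here, the comparison goes through a `Dict`, just as in the example above.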
## `fastby` with an arbitrary `fn`
You can also apply an arbitrary function to each by-group, e.g. `mean`:
```julia
using Statistics: mean
@time a = fastby(mean, byvec, valvec)
```
This generalizes to arbitrary user-defined functions. For example, the following computes the `sizeof` of each element within each by-group:
```julia
byvec = [88 , 888 , 8 , 88 , 888 , 88]
valvec = ["abc", "def", "g", "hi", "jk", "lmop"]
@time a = fastby(yy -> sizeof.(yy), byvec, valvec);
```
Julia's do-notation can be used to supply `fn`:
```julia
@time a = fastby(byvec, valvec) do grouped_y
    # grouped_y holds the elements of valvec that belong to one group of byvec,
    # so arbitrary calculations on the group can go here
    grouped_y[end] * grouped_y[1]
end;
```
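The do-block form is ordinary Julia syntax: the block becomes an anonymous function passed as the first argument. The sketch below shows the equivalence on its own, using a hypothetical stand-in function `apply_to_groups` and hand-written groups so that it runs without FastGroupBy.

```julia
# `apply_to_groups` is a hypothetical stand-in for any function that takes
# a function as its first argument.
apply_to_groups(fn, groups) = [fn(g) for g in groups]

# Hand-written groups of strings, for illustration only.
groups = [["abc", "hi", "lmop"], ["def", "jk"], ["g"]]

# Passing an anonymous function explicitly ...
explicit = apply_to_groups(g -> g[end] * g[1], groups)

# ... is the same as supplying the body in a do-block.
with_do = apply_to_groups(groups) do g
    g[end] * g[1]   # `*` concatenates strings
end

explicit == with_do  # true
```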
`fastby` is also fast when grouping by a vector of `Bool`s:
```julia
using Random
Random.seed!(1)
x = rand(Bool, 100_000_000);
y = rand(100_000_000);
@time fastby(sum, x, y)
```
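One intuition for why a `Bool` key can be handled cheaply is that there are at most two groups, so a sum needs only two accumulators and a single pass. The sketch below illustrates that idea under this assumption; `sum_by_bool` is a hypothetical helper, not FastGroupBy's implementation.

```julia
# Hypothetical helper; not FastGroupBy's implementation.
function sum_by_bool(byvec::AbstractVector{Bool}, valvec::AbstractVector{<:Real})
    length(byvec) == length(valvec) ||
        throw(ArgumentError("byvec and valvec must have the same length"))
    total_false = zero(eltype(valvec))
    total_true = zero(eltype(valvec))
    # Single pass: route each value to one of the two accumulators.
    for (b, v) in zip(byvec, valvec)
        if b
            total_true += v
        else
            total_false += v
        end
    end
    return ([false, true], [total_false, total_true])
end

sum_by_bool([true, false, true, true], [1.0, 2.0, 3.0, 4.0])
# returns ([false, true], [2.0, 8.0])
```

Note that this sketch always reports both groups, even if one of them never occurs in `byvec`.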
`fastby` also works on `String` vectors, but it is still slower than `countmap` and uses much more RAM, so it is **NOT recommended (at this stage)**.
```julia
using Random
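const M = 10_000_000; const K = 100;
Random.seed!(1)
# M strings drawn from a pool of M÷K random strings of 1 to 8 printable ASCII characters
svec1 = rand([string(rand(Char.(32:126), rand(1:8))...) for k in 1:M÷K], M);
# a vector of ones, so that the per-group sum equals the per-group count
y = ones(Int, length(svec1));
@time a = fastby!(sum, svec1, y);
a_dict = Dict(zip(a...))
using StatsBase
@time b = countmap(svec1, alg = :dict);
a_dict == b # true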
```
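For reference, the comparison above relies on the fact that summing a vector of ones per group is just counting. A plain dictionary tally, written here without StatsBase so that it runs on its own, looks roughly like this; `tally` is a hypothetical name and the sketch is not meant to reproduce `countmap`'s internals.

```julia
# Hypothetical baseline; written without StatsBase so it is self-contained.
function tally(keys_vec::AbstractVector{T}) where {T}
    counts = Dict{T, Int}()
    for k in keys_vec
        # Start from 0 for unseen keys, then increment.
        counts[k] = get(counts, k, 0) + 1
    end
    return counts
end

tally(["ab", "c", "ab", "ab"]) == Dict("ab" => 3, "c" => 1)  # true
```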
## `fastby` on `DataFrames`
`fastby` can also be applied to a `DataFrame`: pass the `DataFrame` as the second argument and the column names as `Symbol`s in the third and fourth arguments (`bycol` and `valcol` respectively). For example
```julia
using DataFrames
df1 = DataFrame(grps = rand(1:100, 1_000_000), val = rand(1_000_000))
# compute the difference between the number of rows in each group and the mean of `val` in that group
res = fastby(val_grouped -> length(val_grouped) - mean(val_grouped), df1, :grps, :val)
```
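The column-by-`Symbol` pattern can be tried without DataFrames by using a `NamedTuple` of equal-length vectors as a stand-in table. In the sketch below both the table and the helper `groupby_columns` are hypothetical, and the dictionary-based grouping is chosen for brevity rather than speed.

```julia
using Statistics: mean

# Hypothetical stand-in for a table: a NamedTuple of equal-length columns.
table = (grps = [1, 2, 1, 2, 2], val = [10.0, 20.0, 30.0, 40.0, 60.0])

# Hypothetical helper: look up both columns by Symbol, then group with a Dict.
function groupby_columns(fn, tbl, bycol::Symbol, valcol::Symbol)
    byvec, valvec = getproperty(tbl, bycol), getproperty(tbl, valcol)
    buckets = Dict{eltype(byvec), Vector{eltype(valvec)}}()
    for (k, v) in zip(byvec, valvec)
        push!(get!(() -> eltype(valvec)[], buckets, k), v)
    end
    return Dict(k => fn(vals) for (k, vals) in buckets)
end

# Number of rows in each group minus the mean of `val` in that group.
res = groupby_columns(v -> length(v) - mean(v), table, :grps, :val)
res == Dict(1 => -18.0, 2 => -37.0)  # true
```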