data.cube vignette and readme
jangorecki committed Apr 10, 2016
1 parent a68c2c5 commit 9b72278
Showing 9 changed files with 245 additions and 133 deletions.
6 changes: 3 additions & 3 deletions DESCRIPTION
@@ -1,11 +1,11 @@
Package: data.cube
Type: Package
Title: OLAP cube data type
Version: 0.2.1
Date: 2016-02-16
Version: 0.3.0
Date: 2016-04-10
Author: Jan Gorecki
Maintainer: Jan Gorecki <[email protected]>
Description: Extends multidimensional array for OLAP operations backed by data.table. Optionally auto sharding over multiple R nodes and detailed process logging to database.
Description: Extends array for OLAP operations on multidimensional data powered by data.table.
Imports: data.table (>= 1.9.7), R6
Suggests: big.data.table (>= 0.3.3), RSclient, Rserve, logR (>= 2.1.4), knitr, rmarkdown
License: GPL-3
11 changes: 7 additions & 4 deletions NEWS.md
@@ -1,9 +1,12 @@
# data.cube 0.3

* new `data.cube` class
* aggregating while subsetting available in `[.data.cube`
* enhanced hierarchy metadata storage in `data.cube`
* designed to work in sharded mode across a distributed set of R nodes

# data.cube 0.2.1

* `format.data.cube`
* `[.data.cube` passes first tests
* `levels` kept in `dimension` instead of `hierarchy`.
* `fact$query` works for local and remote, `i` / `i.dt`.
* `schema` methods to produce denormalized metadata.
* added `logR` to *Suggests* and to tests; it brings a heavy non-R dependency for postgres drivers.
* added `time_week` and `time_weekday` attributes of `time_day` level in time dimension.
1 change: 1 addition & 0 deletions R/data.cube.R
@@ -247,6 +247,7 @@ head.data.cube = function(x, n, ...) {
#' @param FUN function, by default it will apply \code{fun.aggregate} defined for each measure
#' @param ... arguments passed to *FUN*
#' @description Wraps to \code{[.data.cube}.
#' @note When the \code{FUN} argument is used, a new data.cube is created with new measures.
apply.data.cube = function(X, MARGIN, FUN, ...) {
if (!is.integer(MARGIN) && is.numeric(MARGIN)) MARGIN = as.integer(MARGIN) # 1 -> 1L
if (is.integer(MARGIN)) MARGIN = X$id.vars[MARGIN] # 1L -> colnames[1L]
2 changes: 1 addition & 1 deletion R/fact.R
@@ -169,7 +169,7 @@ fact = R6Class(
if (self$local) {
setindexv(self$data, if (!drop) self$id.vars)
} else {
stop("TO DO DEV")
stop("TODO")
}
invisible(self)
}
195 changes: 75 additions & 120 deletions README.md
@@ -3,119 +3,76 @@ In-memory *OLAP cubes* R data type. Uses high performance C-implemented [data.ta

# Features and examples

- [x] scalable multidimensional `array` alternative, data modeled in *star schema*
- [x] scalable multidimensional `array` alternative
- [x] uses [data.table](https://github.com/Rdatatable/data.table) under the hood
- [x] use base R `array` query API
- [x] `[.cube` uses base R `[.array` method API for *slice* and *dice*, see [tests/tests-sub-.cube.R](tests/tests-sub-.cube.R)
- [x] `capply`/`aggregate.cube`/`rollup` uses base R `apply` function like API for *rollup*, *drilldown*, see [tests/tests-capply.R](tests/tests-capply.R) and [tests/tests-rollup.R](tests/tests-rollup.R)
- [x] for *pivot* use `format`/`as.data.table` with `dcast.data.table` API, see [tests/tests-format.R](tests/tests-format.R)
- [x] `[.data.cube` uses base R `[.array` method API for *slice* and *dice*
- [x] extended for aggregation subsetting with `[.array`
- [x] `apply.data.cube` uses a base R `apply`-like function API
- [ ] `rollup` for `data.cube`
- [x] for *pivot* use `format`/`as.data.table` with `dcast.data.table` API
- [x] the base R `array` API is extended to accept multiple attributes from dimensions and hierarchies
- [ ] new `[[.cube` method combine and optimize `[.cube` and `capply` into single call with *data.table*-like API, see [tests/tests-sub-sub-.cube.R](tests/tests-sub-sub-.cube.R)
- [ ] *i* accept same input as `...` argument of `[.cube` wrapped into `.(...)`
- [ ] *j* accept input like data.table *j* or a function to apply on all measures
- [ ] *by* acts like a `MARGIN` arg of `apply`, accept input like data.table *by*
- [x] direct access to *cube* class methods and attributes, see `ls.str(x)` on *cube* object
- [ ] logging of queries against the cube
- [x] direct access to *data.cube* class methods and attributes, see `ls.str(x)` on *data.cube* object
- [ ] logging of queries against data.cube
- [x] query optimization
- [x] uses blazingly fast data.table's *binary search* where possible
- [ ] share dimensions between cubes
- [ ] new `data.cube` available to work on top of [big.data.table](https://gitlab.com/jangorecki/big.data.table)
- [x] can use blazingly fast data.table *indexes*
- [ ] works on sharded engine using [big.data.table](https://gitlab.com/jangorecki/big.data.table)

Contributions welcome!

# Installation

```r
install.packages("data.cube", repos = paste0("https://",
c("jangorecki.github.io/data.cube","cran.rstudio.com")
))
install.packages("data.cube", repos = paste0("https://",
    c("jangorecki.gitlab.io/data.cube", "Rdatatable.github.io/data.table", "cran.rstudio.com")
))
```

# Usage

Check the following vignettes:

- `data.cube` class (not yet in readme, api in dev)
- [Basics TO DO](https://jangorecki.gitlab.io/data.cube/).
- [Distributed backend for data.cube](https://jangorecki.gitlab.io/data.cube/doc/big.data.cube.html) covers *cube* and *array* subset methods.
- `data.cube` class
- [Subset and aggregate multidimensional data with data.cube](https://jangorecki.gitlab.io/data.cube/library/data.cube/doc/sub-.data.cube.html)

- old basic `cube` class
- [Subset multidimensional data vignette](https://jangorecki.gitlab.io/data.cube/doc/sub-.cube.html) covers *cube* and *array* subset methods.
- [Subset multidimensional data](https://jangorecki.gitlab.io/data.cube/library/data.cube/doc/sub-.cube.html)

## Basics

```r
library(data.table)
library(data.cube)

# sample array
set.seed(1L)
ar.dimnames = list(color = sort(c("green","yellow","red")),
year = as.character(2011:2015),
status = sort(c("active","inactive","archived","removed")))
ar.dim = sapply(ar.dimnames, length)
ar = array(sample(c(rep(NA, 4), 4:7/2), prod(ar.dim), TRUE),
unname(ar.dim),
ar.dimnames)
print(ar)

cb = as.cube(ar)
print(cb)
str(cb)
all.equal(ar, as.array(cb))
all.equal(dim(ar), dim(cb))
all.equal(dimnames(ar), dimnames(cb))
set.seed(1)
# array
ar = array(rnorm(8, 10, 5), rep(2, 3),
           dimnames = list(color = c("green","red"),
                           year = c("2014","2015"),
                           country = c("IN","UK"))) # sorted
# cube normalized to a star schema on natural keys only
dc = as.data.cube(ar)

# slice

arr = ar["green",,]
print(arr)
r = cb["green",]
print(r)
all.equal(arr, as.array(r))

arr = ar["green",,,drop=FALSE]
print(arr)
r = cb["green",,,drop=FALSE]
print(r)
all.equal(arr, as.array(r))

arr = ar["green",,"active"]
r = cb["green",,"active"]
all.equal(arr, as.array(r))
ar["green","2015",]
dc["green","2015"]
format(dc["green","2015"])

# dice

arr = ar["green",, c("active","archived","inactive")]
r = cb["green",, c("active","archived","inactive")]
all.equal(arr, as.array(r))
as.data.table(r)
as.data.table(r, na.fill = TRUE)
# array-like print using data.table, useful because as.array doesn't scale
as.data.table(r, na.fill = TRUE, dcast = TRUE, formula = year ~ status)
print(arr)

# apply

format(aggregate(cb, c("year","status"), sum))
format(capply(cb, c("year","status"), sum))

# rollup and drilldown

# granular data with all totals
r = rollup(cb, MARGIN = c("color","year"), FUN = sum)
format(r)

# choose subtotals - drilldown to required levels of aggregates
r = rollup(cb, MARGIN = c("color","year"), INDEX = 1:2, FUN = sum)
format(r)

# pivot
r = capply(cb, c("year","status"), sum)
format(r, dcast = TRUE, formula = year ~ status)
ar[c("green","red"),c("2014","2015"),]
dc[c("green","red"),c("2014","2015")]
format(dc[c("green","red"),c("2014","2015")])

# the exact tabular representation of an array is just a formatting of the cube
ar["green",c("2014","2015"),]
format(dc["green",c("2014","2015")],
       dcast = TRUE,
       formula = year ~ country)
ar[,"2015",c("UK","IN")]
format(dc[,"2015",c("UK","IN")],
       dcast = TRUE,
       formula = color ~ country) # sorted dimension levels
```

## Extension to array
## Dimension hierarchies and attributes, aggregation

Build a data.cube from a set of tables defined with a star schema: a single fact table and multiple dimensions.

```r
library(data.table)
@@ -124,19 +81,44 @@ library(data.cube)
X = populate_star(1e5)
lapply(X, sapply, ncol)
lapply(X, sapply, nrow)
cb = as.cube(X)
str(cb)
dc = as.data.cube(X)
str(dc) # live description of the model

# slice and dice on dimension hierarchy
cb["Mazda RX4",, .(curr_type = "crypto"),, .(time_year = 2014L, time_quarter_name = c("Q1","Q2"))]
dc["Mazda RX4",, .(curr_type = "crypto"),, .(time_year = 2014L, time_quarter_name = c("Q1","Q2"))]
# same as above but more verbose
cb$dims
cb[product = "Mazda RX4",
names(dc$dimensions)
dc[product = "Mazda RX4",
   customer = .(),
   currency = .(curr_type = "crypto"),
   geography = .(),
   time = .(time_year = 2014L, time_quarter_name = c("Q1","Q2"))]

# aggregate by dropping a dimension with just the `.` symbol, group by customer and currency
dc[product = .,
   customer = .(),
   currency = .(curr_type = "crypto"),
   geography = .,
   time = .]
```

### old `cube` class with more methods

```r
library(data.table)
library(data.cube)
# sample array
set.seed(1L)
ar.dimnames = list(color = sort(c("green","yellow","red")),
                   year = as.character(2011:2015),
                   status = sort(c("active","inactive","archived","removed")))
ar.dim = sapply(ar.dimnames, length)
ar = array(sample(c(rep(NA, 4), 4:7/2), prod(ar.dim), TRUE),
           unname(ar.dim),
           ar.dimnames)
print(ar)

cb = as.cube(ar)
# apply on dimension hierarchy
format(aggregate(cb, c("time_year","geog_region_name"), sum))
format(capply(cb, c("time_year","geog_region_name"), sum))
@@ -166,26 +148,10 @@ as.data.table(r, dcast = TRUE, formula = geog_division_name ~ time_year)

# denormalize
cb$denormalize()

# out
X = as.list(cb)
dt = as.data.table(cb) # wraps to cb$denormalize
#ar = as.array(cb) # arrays scale badly, prepare task manager to kill R

# in
#as.cube(ar)
as.cube(X)
dimcolnames = cb$dapply(names)
print(dimcolnames)
as.cube(dt, fact = "sales", dims = dimcolnames)
```

## Advanced

### Normalization

Data in a *cube* are normalized into a star schema. When rolling up on attributes from the same hierarchy, the dimension is rebuilt with a new surrogate key. Use `normalize=FALSE` to return a data.table with subtotals.
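
As a sketch of what normalizing into a star schema means here (plain base R with hypothetical column names, not data.cube internals): a denormalized table splits into a fact table carrying measures plus surrogate keys, and dimension tables carrying the attributes.

```r
# hypothetical denormalized sales data
sales = data.frame(
  prod_name  = c("apple", "apple", "pear"),
  prod_group = c("fruit", "fruit", "fruit"),
  amount     = c(10, 20, 30)
)

# dimension: unique attribute combinations, each assigned a surrogate key
product = unique(sales[, c("prod_name", "prod_group")])
product$prod_id = seq_len(nrow(product))

# fact: measures plus the surrogate key; dimension attributes are dropped
fact = merge(sales, product)[, c("prod_id", "amount")]
```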

### data.table indexes

Users can utilize data.table indexes, which dramatically reduce query time.
@@ -222,21 +188,10 @@ cb["Mazda RX4",, .(curr_type = c("fiat","crypto")),, .(time_year = 2011:2012)] #
options(op)
```
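
A minimal standalone illustration of such an index (plain data.table on made-up data, independent of data.cube): `setindex` creates a secondary index without reordering the table, and later subsets on that column can be answered by binary search instead of a vector scan.

```r
library(data.table)

# made-up data, large enough that index vs. scan matters
dt = data.table(color = rep(c("green", "red", "blue"), each = 1e5),
                value = rnorm(3e5))

setindex(dt, color)   # build secondary index; row order is unchanged
indices(dt)           # lists "color"

dt[color == "green"]  # can be answered via the index (binary search)
```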

### Architecture

The design concept is simple.
A cube is an [R6](https://github.com/wch/R6) class object, which is an enhanced R environment.
The cube class keeps another plain R environment as a container to store all tables.
Tables are stored as [data.table](https://github.com/Rdatatable/data.table) class objects, which are enhanced R data.frames.
All of the cube attributes are dynamic; the only static part is the *star schema* modeled multidimensional data.
The logic of cubes can be isolated from the data, so cubes can also run as a service.
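
That design can be sketched in a few lines of R; `MiniCube` and its fields are hypothetical illustrations, not the actual data.cube implementation.

```r
library(R6)
library(data.table)

# R6 object holding a plain environment that stores data.table objects,
# mirroring the architecture described above. `MiniCube` is hypothetical.
MiniCube = R6Class("MiniCube",
  public = list(
    env = NULL,
    initialize = function(tables) {
      self$env = new.env()  # plain environment container for all tables
      for (nm in names(tables)) {
        assign(nm, as.data.table(tables[[nm]]), envir = self$env)
      }
    },
    tables = function() ls(self$env)  # computed on access, not stored
  )
)

mc = MiniCube$new(list(fact = data.frame(qty = 1:3)))
mc$tables()
```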

#### client-server
### client-server

Another package development is planned to wrap services upon data.cube.
It would allow to use `[.cube` and `[[.cube` methods via [Rserve: TCP/IP or local sockets](https://github.com/s-u/Rserve) or [httpuv: HTTP and WebSocket server](https://github.com/rstudio/httpuv).
Basic parsers of [MDX](https://en.wikipedia.org/wiki/MultiDimensional_eXpressions) queries and [XMLA](https://en.wikipedia.org/wiki/XML_for_Analysis) requests.
It could potentially utilize `Rserve` for parallel processing on distributed data partitions, see [this gist](https://gist.github.com/jangorecki/ecccfa5471a633acad17).
Services on top of data.cube could be run with [Rserve: TCP/IP or local sockets](https://github.com/s-u/Rserve), [httpuv: HTTP and WebSocket server](https://github.com/rstudio/httpuv) or [svSocket](https://github.com/SciViews/svSocket).
[MDX](https://en.wikipedia.org/wiki/MultiDimensional_eXpressions) query parser skills are welcome; parsed MDX could be wrapped into simple [XMLA](https://en.wikipedia.org/wiki/XML_for_Analysis) requests.

# Interesting reading

4 changes: 2 additions & 2 deletions tests/tests-big.data.cube.R
@@ -148,7 +148,7 @@ if(apkg){
measure.vars = c("amount","value"),
na.rm = TRUE)
r = bdt[1L]
# TO DO: add data.cube `[` queries when ready
# TODO: add data.cube `[` queries when ready
options(on)
lr = logR::logR_dump()
stopifnot(
@@ -161,7 +161,7 @@ if(apkg){
lr[7:10, expr] == "x[1L]"
)
rm(bdt)
# data.cube queries # TO DO
# data.cube queries # TODO

on = options("datatable.prettyprint.char" = 80L)
print(lr[])
3 changes: 2 additions & 1 deletion tests/tests-data.cube.R
@@ -419,7 +419,8 @@ stopifnot( # apply with new FUN
as.array(apply.data.cube(dc, 1:3, mean, na.rm=TRUE), na.fill = NaN),
apply(ar, 1:3, mean, na.rm=TRUE)
)
# rev order of MARGIN - TODO - waiting for rev order dimension subsetting
# TODO: rev order of MARGIN: waiting for rev order dimension subsetting

)

# subset with apply ----
6 changes: 4 additions & 2 deletions vignettes/sub-.cube.Rmd
@@ -1,15 +1,17 @@
---
vignette: >
%\VignetteIndexEntry{Subset multidimensional data}
%\VignetteIndexEntry{Subset multidimensional data with cube}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---

# Subset multidimensional data
# Subset multidimensional data with cube
*Jan Gorecki, 2015-11-19*

The R *cube* class is defined in the [data.cube](https://gitlab.com/jangorecki/data.cube) package.

> Please note there is a newer OOP cube class called `data.cube`. See the *Subset and aggregate multidimensional data with data.cube* vignette instead.
## characteristics of `[.cube`

### what's the same in `[.array`
