WIP: First steps in defining a common interface for Tables/Frames #16

Closed
wants to merge 1 commit into from
4 changes: 2 additions & 2 deletions docs/src/formula.md
@@ -14,7 +14,7 @@ fields with possibly heterogeneous types. One of the primary goals of
`StatsModels` is to make it simpler to transform tabular data into matrix format
suitable for statistical modeling.

-At the moment, "tabular data" means an `AbstractDataTable`. Ultimately, the
+At the moment, "tabular data" means an `Table`. Ultimately, the
Review comment (Member): "a `Table`", not "an `Table`".

Review comment (Member): ...or change `Table` to `AbstractTable`. :-)

goal is to support any tabular data format that adheres to a minimal API,
**regardless of backend**.

@@ -88,7 +88,7 @@ dropterm

The main use of `Formula`s is for fitting statistical models based on tabular
data. From the user's perspective, this is done by `fit` methods that take a
-`Formula` and a `DataTable` instead of numeric matrices.
+`Formula` and a `Table` instead of numeric matrices.

Internally, this is accomplished in three stages:

3 changes: 2 additions & 1 deletion docs/src/index.md
@@ -21,5 +21,6 @@ developers when dealing with statistical models and tabular data.
* `RegressionModel`

Much of this package was formerly part
-of [`DataTables`](https://www.github.com/JuliaStats/DataTables.jl)
+of [`DataTables`](https://www.github.com/JuliaStats/DataTables.jl)/
+[`DataFrames`](https://www.github.com/JuliaStats/DataFrames.jl)
and [`StatsBase`](https://www.github.com/JuliaStats/StatsBase.jl).
12 changes: 9 additions & 3 deletions src/StatsModels.jl
@@ -3,10 +3,12 @@ __precompile__(true)
module StatsModels

using Compat
-using DataTables
+using TableBase
+using TableBase: Table
+# using DataTables
using StatsBase
-using NullableArrays
-using CategoricalArrays
+# using NullableArrays
+# using CategoricalArrays


export @formula,
@@ -24,6 +26,10 @@ export @formula,
dropterm,
setcontrasts!

+# TEMPORARY DEFINITIONS
+const AbstractCategoricalVector = Any


map(include,
[
"contrasts.jl",
62 changes: 31 additions & 31 deletions src/modelframe.jl
@@ -1,5 +1,5 @@
"""
-Wrapper which combines Formula (Terms) and an AbstractDataTable
+Wrapper which combines Formula (Terms) and an Table

This wrapper encapsulates all the information that's required to transform data
of the same structure as the wrapped data frame into a model matrix. This goes
@@ -13,19 +13,19 @@ then creates the necessary contrasts matrices and stores the results.
# Constructors

```julia
-ModelFrame(f::Formula, df::AbstractDataTable; contrasts::Dict = Dict())
-ModelFrame(ex::Expr, d::AbstractDataTable; contrasts::Dict = Dict())
-ModelFrame(terms::Terms, df::AbstractDataTable; contrasts::Dict = Dict())
+ModelFrame(f::Formula, df::Table; contrasts::Dict = Dict())
+ModelFrame(ex::Expr, d::Table; contrasts::Dict = Dict())
+ModelFrame(terms::Terms, df::Table; contrasts::Dict = Dict())
# Inner constructors:
-ModelFrame(df::AbstractDataTable, terms::Terms, missing::BitArray)
-ModelFrame(df::AbstractDataTable, terms::Terms, missing::BitArray, contrasts::Dict{Symbol, ContrastsMatrix})
+ModelFrame(df::Table, terms::Terms, missing::BitArray)
+ModelFrame(df::Table, terms::Terms, missing::BitArray, contrasts::Dict{Symbol, ContrastsMatrix})
```

# Arguments

* `f::Formula`: Formula whose left hand side is the *response* and right hand
side are the *predictors*.
-* `df::AbstractDataTable`: The data being modeled. This is used at this stage
+* `df::Table`: The data being modeled. This is used at this stage
to determine which variables are categorical, and otherwise held for
[`ModelMatrix`](@ref).
* `contrasts::Dict`: An optional Dict of contrast codings for each categorical
@@ -41,13 +41,13 @@ ModelFrame(df::AbstractDataTable, terms::Terms, missing::BitArray, contrasts::Di
# Examples

```julia
-julia> df = DataTable(x = 1:4, y = 5:9)
+julia> df = Table(x = 1:4, y = 5:9)
julia> mf = ModelFrame(y ~ 1 + x, df)
```

"""
type ModelFrame
-df::AbstractDataTable
+df::Table
terms::Terms
msng::BitArray
## mapping from df keys to contrasts matrices
@@ -69,7 +69,7 @@ is_categorical(::AbstractArray) = true
##
## This modifies the Terms, setting `trms.is_non_redundant = true` for all non-
## redundant evaluation terms.
-function check_non_redundancy!(trms::Terms, df::AbstractDataTable)
+function check_non_redundancy!(trms::Terms, df::Table)

(n_eterms, n_terms) = size(trms.factors)

@@ -104,46 +104,46 @@ end

const DEFAULT_CONTRASTS = DummyCoding

-_unique(x::CategoricalArray) = unique(x)
-_unique(x::NullableCategoricalArray) = [get(l) for l in unique(x) if !isnull(l)]
+# _unique(x::CategoricalArray) = unique(x)
+# _unique(x::NullableCategoricalArray) = [get(l) for l in unique(x) if !isnull(l)]

-function _unique{T<:Nullable}(x::AbstractArray{T})
-levs = [get(l) for l in unique(x) if !isnull(l)]
-try; sort!(levs); end
-return levs
-end
+# function _unique{T<:Nullable}(x::AbstractArray{T})
+# levs = [get(l) for l in unique(x) if !isnull(l)]
+# try; sort!(levs); end
+# return levs
+# end

-function _unique(x::AbstractArray)
-levs = unique(x)
-try; sort!(levs); end
-return levs
-end
+# function _unique(x::AbstractArray)
+# levs = unique(x)
+# try; sort!(levs); end
+# return levs
+# end

## Set up contrasts:
## Combine actual DF columns and contrast types if necessary to compute the
## actual contrasts matrices, levels, and term names (using DummyCoding
## as the default)
-function evalcontrasts(df::AbstractDataTable, contrasts::Dict = Dict())
+function evalcontrasts(df::Table, contrasts::Dict = Dict())
evaledContrasts = Dict()
for (term, col) in eachcol(df)
is_categorical(col) || continue
evaledContrasts[term] = ContrastsMatrix(haskey(contrasts, term) ?
contrasts[term] :
DEFAULT_CONTRASTS(),
-_unique(col))
+unique(col))
end
return evaledContrasts
end

## Default NULL handler. Others can be added as keyword arguments
-function null_omit(df::DataTable)
+function null_omit(df::Table)
cc = completecases(df)
df[cc,:], cc
end

-function ModelFrame(trms::Terms, d::AbstractDataTable;
+function ModelFrame(trms::Terms, d::Table;
contrasts::Dict = Dict())
-df, msng = null_omit(DataTable(map(x -> d[x], trms.eterms)))
+df, msng = null_omit(typeof(d)(map(x -> d[x], trms.eterms)))
names!(df, convert(Vector{Symbol}, map(string, trms.eterms)))

evaledContrasts = evalcontrasts(df, contrasts)
@@ -154,17 +154,17 @@ function ModelFrame(trms::Terms, d::AbstractDataTable;
ModelFrame(df, trms, msng, evaledContrasts)
end

-ModelFrame(df::AbstractDataTable, term::Terms, msng::BitArray) = ModelFrame(df, term, msng, evalcontrasts(df))
-ModelFrame(f::Formula, d::AbstractDataTable; kwargs...) = ModelFrame(Terms(f), d; kwargs...)
-ModelFrame(ex::Expr, d::AbstractDataTable; kwargs...) = ModelFrame(Formula(ex), d; kwargs...)
+ModelFrame(df::Table, term::Terms, msng::BitArray) = ModelFrame(df, term, msng, evalcontrasts(df))
+ModelFrame(f::Formula, d::Table; kwargs...) = ModelFrame(Terms(f), d; kwargs...)
+ModelFrame(ex::Expr, d::Table; kwargs...) = ModelFrame(Formula(ex), d; kwargs...)

"""
setcontrasts!(mf::ModelFrame, new_contrasts::Dict)

Modify the contrast coding system of a ModelFrame in place.
"""
function setcontrasts!(mf::ModelFrame, new_contrasts::Dict)
-new_contrasts = Dict([ Pair(col, ContrastsMatrix(contr, _unique(mf.df[col])))
+new_contrasts = Dict([ Pair(col, ContrastsMatrix(contr, unique(mf.df[col])))
for (col, contr) in filter((k,v)->haskey(mf.df, k), new_contrasts) ])

mf.contrasts = merge(mf.contrasts, new_contrasts)
2 changes: 1 addition & 1 deletion src/modelmatrix.jl
@@ -59,7 +59,7 @@ modelmat_cols{T<:AbstractFloatMatrix}(::Type{T}, v::AbstractVector, contrast::Co


function modelmat_cols{T<:AbstractFloatMatrix}(::Type{T},
-v::Union{CategoricalVector, NullableCategoricalVector},
+v::AbstractCategoricalVector,
contrast::ContrastsMatrix)
## make sure the levels of the contrast matrix and the categorical data
## are the same by constructing a re-indexing vector. Indexing into
34 changes: 17 additions & 17 deletions src/statsmodel.jl
@@ -31,23 +31,23 @@ macro delegate(source, targets)
return result
end

-# Wrappers for DataTableStatisticalModel and DataTableRegressionModel
-immutable DataTableStatisticalModel{M,T} <: StatisticalModel
+# Wrappers for TableStatisticalModel and TableRegressionModel
+immutable TableStatisticalModel{M,T} <: StatisticalModel
model::M
mf::ModelFrame
mm::ModelMatrix{T}
end

-immutable DataTableRegressionModel{M,T} <: RegressionModel
+immutable TableRegressionModel{M,T} <: RegressionModel
model::M
mf::ModelFrame
mm::ModelMatrix{T}
end

-for (modeltype, dfmodeltype) in ((:StatisticalModel, DataTableStatisticalModel),
-(:RegressionModel, DataTableRegressionModel))
+for (modeltype, dfmodeltype) in ((:StatisticalModel, TableStatisticalModel),
+(:RegressionModel, TableRegressionModel))
@eval begin
-function StatsBase.fit{T<:$modeltype}(::Type{T}, f::Formula, df::AbstractDataTable,
+function StatsBase.fit{T<:$modeltype}(::Type{T}, f::Formula, df::Table,
args...; contrasts::Dict = Dict(), kwargs...)
mf = ModelFrame(f, df, contrasts=contrasts)
mm = ModelMatrix(mf)
@@ -58,24 +58,24 @@ for (modeltype, dfmodeltype) in ((:StatisticalModel, DataTableStatisticalModel),
end

# Delegate functions from StatsBase that use our new types
-typealias DataTableModels @compat(Union{DataTableStatisticalModel, DataTableRegressionModel})
-@delegate DataTableModels.model [StatsBase.coef, StatsBase.confint,
+typealias TableModels @compat(Union{TableStatisticalModel, TableRegressionModel})
+@delegate TableModels.model [StatsBase.coef, StatsBase.confint,
StatsBase.deviance, StatsBase.nulldeviance,
StatsBase.loglikelihood, StatsBase.nullloglikelihood,
StatsBase.dof, StatsBase.dof_residual, StatsBase.nobs,
StatsBase.stderr, StatsBase.vcov]
-@delegate DataTableRegressionModel.model [StatsBase.residuals, StatsBase.model_response,
+@delegate TableRegressionModel.model [StatsBase.residuals, StatsBase.model_response,
StatsBase.predict, StatsBase.predict!]
# Need to define these manually because of ambiguity using @delegate
-StatsBase.r2(mm::DataTableRegressionModel) = r2(mm.model)
-StatsBase.adjr2(mm::DataTableRegressionModel) = adjr2(mm.model)
-StatsBase.r2(mm::DataTableRegressionModel, variant::Symbol) = r2(mm.model, variant)
-StatsBase.adjr2(mm::DataTableRegressionModel, variant::Symbol) = adjr2(mm.model, variant)
+StatsBase.r2(mm::TableRegressionModel) = r2(mm.model)
+StatsBase.adjr2(mm::TableRegressionModel) = adjr2(mm.model)
+StatsBase.r2(mm::TableRegressionModel, variant::Symbol) = r2(mm.model, variant)
+StatsBase.adjr2(mm::TableRegressionModel, variant::Symbol) = adjr2(mm.model, variant)

# Predict function that takes data frame as predictor instead of matrix
-function StatsBase.predict(mm::DataTableRegressionModel, df::AbstractDataTable; kwargs...)
+function StatsBase.predict(mm::TableRegressionModel, df::Table; kwargs...)
# copy terms, removing outcome if present (ModelFrame will complain if a
-# term is not found in the DataTable and we don't want to remove elements with missing y)
+# term is not found in the Table and we don't want to remove elements with missing y)
newTerms = dropresponse!(mm.mf.terms)
# create new model frame/matrix
mf = ModelFrame(newTerms, df; contrasts = mm.mf.contrasts)
@@ -89,7 +89,7 @@ end


# coeftable implementation
-function StatsBase.coeftable(model::DataTableModels)
+function StatsBase.coeftable(model::TableModels)
ct = coeftable(model.model)
cfnames = coefnames(model.mf)
if length(ct.rownms) == length(cfnames)
@@ -99,7 +99,7 @@ function StatsBase.coeftable(model::DataTableModels)
end

# show function that delegates to coeftable
-function Base.show(io::IO, model::DataTableModels)
+function Base.show(io::IO, model::TableModels)
try
ct = coeftable(model)
println(io, "$(typeof(model))")