Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Preparation for Julia 0.6 #1164

Merged
merged 13 commits into from
Mar 11, 2017
Merged

RFC: Preparation for Julia 0.6 #1164

merged 13 commits into from
Mar 11, 2017

Conversation

andreasnoack
Copy link
Member

The plan is to remove that formula and modeling functionality and hopefully StatsModels will be able to handle that part. For now, I've commented out the formula and modeling related tests. To make this easier, I've loosened the requirement on the Vector input to the DataFrame constructors such that the element type doesn't have to be Any. With that change, the ModelFrame method in StatsModels works.

There are also various routine syntax updates that should be uncontroversial.

Finally, I plan to remove the dependency on Juno. It was controversial when added and has caused problems since added. E.g., right now Juno breaks the REPL and this is not captured by Juno's tests. Getting nice printing of DataFrames in Juno must be handled elsewhere.

@nalimilan
Copy link
Member

nalimilan commented Feb 15, 2017

Too bad I just made the same changes in DataTables (JuliaData/DataTables.jl#9, JuliaData/DataTables.jl#8; FWIW the latter has a few more fixes). Maintaining the two repos in parallel leads to some waste of time...

One issue is whether StatsModels is going to be able to support both DataFrames and DataTables at the same time. We likely need the three of them to use a common AbstractTable interface for that. Until then, I'd rather keep the modeling functions in DataFrames.

Is the Juno dep a problem on 0.6?

@@ -51,7 +51,7 @@ immutable SubDataFrame{T <: AbstractVector{Int}} <: AbstractDataFrame
parent::DataFrame
rows::T # maps from subdf row indexes to parent row indexes

function SubDataFrame(parent::DataFrame, rows::T)
@compat function SubDataFrame{T}(parent::DataFrame, rows::T) where T
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is T really needed here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevermind, I just found the notes in https://github.com/JuliaLang/julia/pull/20308/files.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yet, shouldn't T<:AbstractVector{Int}? Not sure what would happen without it, probably an error a bit later. In his PR Jeff added these (duplicated) constraints.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure. I just followed the deprecation warning. Shouldn't harm to make that restriction though.

@andreasnoack
Copy link
Member Author

Too bad I just made the same changes in DataTables (JuliaData/DataTables.jl#9, JuliaData/DataTables.jl#8; FWIW the latter has a few more fixes). Maintaining the two repos is parallel leads to some waste of time...

I agree. That is why I've been asking all the questions about DataArrays/NullableArrays lately. Ideally, we didn't maintain two almost identical repos in parallel but I find DataArrays so much more convenient to work with that I'm willing to contribute time for now. I've tried to find all the best arguments for NullableArrays to convince myself that moving over is a good idea but so far I'm not convinced that it is an improvement relative to DataArrays.

Until then, I'd rather keep the modeling functions in DataFrames.

I concluded that it wasn't really feasible because ~ is not a macro anymore. Fixing the formula code here would therefore require reimplementing the formula code from StatsModels. I tried to use StatsModels with DataFrames and it mostly worked. Many of the methods can dispatch on DataFrame/DataTable and do the right thing. However, I need your PRs merged into DataTables before I can test this further.

Is the Juno dep a problem on 0.6?

Yes. If you want to use the REPL 😄 (see my last comment). Juno is not sufficiently tested to capture these bugs so it is up to us to detect that using Juno breaks the REPL. But even if that was fixed I think it is the wrong dependency direction. If Juno doesn't want to depend on this package then we'd be better off with a JunoDataFrames. When we have conditional modules, that package could be integrated into Juno.

@nalimilan
Copy link
Member

I tried to use StatsModels with DataFrames and it mostly worked. Many of the methods can dispatch on DataFrame/DataTable and do the right thing. However, I need your PRs merged into DataTables before I can test this further.

That's cool, but we still need to define the functions in a separate interface package so that StatsModels can use them without DataFrames and DataTables conflicting. AbstractTables is the most logical candidate, but I'm not sure how stable is its current API.

I'm not really in favor of keeping the Juno code, I just wanted to know whether it was one of the reasons for the test failures on 0.6.

test/join.jl Outdated
left = outer[!isna(outer[:Name]), :]
inner = left[!isna(left[:Job]), :]
right = outer[.!isna(outer[:Job]), [:Name, :ID, :Job]]
left = outer[.!isna(outer[:Name]), :]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will that work on 0.5? I would try with (!).(...).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No it won't. (!).(...) is indeed necessary. Just realized that while fixing another package.

@andreasnoack andreasnoack changed the title WIP: Preparation for Julia 0.6 RFC: Preparation for Julia 0.6 Feb 22, 2017
@andreasnoack
Copy link
Member Author

Updated. I've dropped 0.4 support, deleted the statsmodels and formula code but kept the contrasts code. However, the contrasts tests rely on ModelFrames for now, so I've disabled the tests. Depending on the StatsModels solution the contrasts could be dropped from this package or we could change the tests to avoid ModelFrames.

@@ -51,14 +51,14 @@ immutable SubDataFrame{T <: AbstractVector{Int}} <: AbstractDataFrame
parent::DataFrame
rows::T # maps from subdf row indexes to parent row indexes

@compat function SubDataFrame{T}(parent::DataFrame, rows::T) where T
function (::Type{SubDataFrame{T}}){T}(parent::DataFrame, rows::T)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the new recommended syntax for inner constructors? I'm still confused about this, and the manual doesn't help.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sure it is recommended but it works without warnings across 0.5 and 0.6 so I think it the right solution until we drop 0.5 support.

```

"""
complete_cases!(df::AbstractDataFrame) = deleterows!(df, find(!complete_cases(df)))
completecases!(df::AbstractDataFrame) = deleterows!(df, find(!completecases(df)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better do find(!, completecases(df)) as in DataTables to avoid allocating a copy.

@@ -181,7 +181,7 @@ end
write(io, "</tr>")
write(io, "</thead>")
write(io, "<tbody>")
tty_rows, tty_cols = _displaysize(io)
tty_rows, tty_cols = Base.displaysize(io)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for Base. here and below.

@@ -292,7 +292,6 @@ function contrasts_matrix(C::HelmertCoding, baseind, n)
mat = mat[[baseind; 1:(baseind-1); (baseind+1):end], :]
return mat
end

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removes trailing whitespace. Note that the diff shows that this line had the whitespace removed in favor of an empty line below.

end

@render Inline df::AbstractDataFrame _render(df)
# using Juno
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just remove this altogether, that's easy to find in the git history if needed.

@nalimilan
Copy link
Member

I remove all of the code related to modeling, or keep all of it. I don't like commenting out code.

Other than that, looks good to me.

test/runtests.jl Outdated
"statsmodel.jl",
"contrasts.jl"]
"show.jl"#,
# "contrasts.jl"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this line and I would say it's good to go.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(And remove the corresponding file...)

@ararslan
Copy link
Member

ararslan commented Feb 22, 2017

Until StatsModels works flawlessly with DataFrames, I think we should just copy the definition of @formula from StatsModels here. None of the other modeling code in this package would need to change, just the definition of formulas. The StatsModels definition has the benefit of being able to work on both 0.5 and 0.6, since it doesn't care whether ~ is parsed as a macro.

@andreasnoack
Copy link
Member Author

The right solution might then be to make an update for 0.5 before these fixes where we deprecate the use of ~ and clearly write how to transition, i.e. use StatsModels. However, it requires that we make StatsModels work generically first. Hopefully soon.

@ararslan
Copy link
Member

That sounds good to me. We should start an issue to track everything that's needed to get StatsModels working generically, or at least across DataFrames and DataTables.

@deprecate DataArray(df::AbstractDataFrame, T::DataType) convert(DataArray{T}, df)

@deprecate read_rda(args...) FileIO.load(args...)

@deprecate complete_cases(df) complete_cases(df)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@deprecate complete_cases(df)  completecases(df)

@deprecate DataArray(df::AbstractDataFrame, T::DataType) convert(DataArray{T}, df)

@deprecate read_rda(args...) FileIO.load(args...)

@deprecate complete_cases(df) complete_cases(df)
@deprecate complete_cases!(df) complete_cases!(df)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@deprecate complete_cases!(df) completecases!(df)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given JuliaData/DataTables.jl#6 (comment) and linked issues could this be changed to deprecate complete_cases!(df) to dropna!(df)?

would need dropna too I guess JuliaData/DataTables.jl#6 (comment)

@nalimilan
Copy link
Member

That sounds good to me. We should start an issue to track everything that's needed to get StatsModels working generically, or at least across DataFrames and DataTables.

JuliaStats/StatsModels.jl#14

@ararslan
Copy link
Member

Heh, whoops. Missed that. Thanks!

@nalimilan nalimilan mentioned this pull request Feb 25, 2017
@davidanthoff
Copy link
Contributor

Bump, what is the status here? Would be great if we had a tagged version of DataFrames that works on julia 0.6. I'm trying to move ExcelReaders (and then Query) to 0.6, but my CI builds won't pass until there is a tagged version of DataFrames that works on julia 0.6.

@ararslan
Copy link
Member

DataArrays master works on 0.6 so once that's tagged, I think that version of DataArrays + this PR = a complete working DataFrames on 0.6.

@nalimilan
Copy link
Member

Though model formula support still isn't there AFAIK. I guess we could still tag a beta version without it to fix packages which don't need formulas (most of them).

@andreasnoack
Copy link
Member Author

I've been traveling and busy but I hope to have time work a little on this again next week.

@davidanthoff
Copy link
Contributor

I think it would be great if there was a version tagged with a REQUIRE line julia 0.6- really soon, even if that version still has some known issues. Right now a simple using DataFrames breaks on julia 0.6, and that pretty much means that a huge number of packages can't even start to work on 0.6 compatibility without going through a lot of hoops.

@ararslan
Copy link
Member

I've dealt with the formulas situation temporarily in #1170 by copying over the requisite definitions from StatsModels. That way we have something here that will work in the interim while StatsModels is made to work with DataFrames.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants