-
Notifications
You must be signed in to change notification settings - Fork 117
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Taking weighting seriously #487
base: master
Are you sure you want to change the base?
Conversation
…liaStats-master
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #487 +/- ##
==========================================
- Coverage 90.33% 86.45% -3.89%
==========================================
Files 8 8
Lines 1107 1277 +170
==========================================
+ Hits 1000 1104 +104
- Misses 107 173 +66 ☔ View full report in Codecov by Sentry. |
Hey, Would that fix the issue I am having, which is that if rows of the data contains missing values, GLM discard those rows, but does not discard the corresponding values of I think the interfacing should allow for a DataFrame input of weights, that would take care of such things (like it does for the other variables). |
not really. But it would be easy to make this a feature. But before digging further on this I would like to know whether there is consensus on the approach of this PR. |
FYI this appears to fix #420; a PR was started in #432 and the author closed for lack of time on their part to investigate CI failures. Here's the test case pulled from #432 which passes with the in #487. @testset "collinearity and weights" begin
rng = StableRNG(1234321)
x1 = randn(100)
x1_2 = 3 * x1
x2 = 10 * randn(100)
x2_2 = -2.4 * x2
y = 1 .+ randn() * x1 + randn() * x2 + 2 * randn(100)
df = DataFrame(y = y, x1 = x1, x2 = x1_2, x3 = x2, x4 = x2_2, weights = repeat([1, 0.5],50))
f = @formula(y ~ x1 + x2 + x3 + x4)
lm_model = lm(f, df, wts = df.weights)#, dropcollinear = true)
X = [ones(length(y)) x1_2 x2_2]
W = Diagonal(df.weights)
coef_naive = (X'W*X)\X'W*y
@test lm_model.model.pp.chol isa CholeskyPivoted
@test rank(lm_model.model.pp.chol) == 3
@test isapprox(filter(!=(0.0), coef(lm_model)), coef_naive)
end Can this test set be added? Is there any other feedback for @gragusa ? It would be great to get this merged if good to go. |
Sorry for the long delay, I hadn't realized you were waiting for feedback. Looks great overall, please feel free to finish it! I'll try to find the time to make more specific comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've read the code. Lots of comments, but all of these are minor. The main one is mostly stylistic: in most cases it seems that using if wts isa UnitWeights
inside a single method (like the current structure) gives simpler code than defining several methods. Otherwise the PR looks really clean!
What are you thoughts regarding testing? There are a lot of combinations to test and it's not easy to see how to integrate that into the current organization of tests. One way would be to add code for each kind of test to each @testset
that checks a given model family (or a particular case, like collinear variables). There's also the issue of testing the QR factorization, which isn't used by default.
A very nice PR. In the tests can we have some test set that compares the results of |
@nalimilan I think I addressed all issues and comments. |
Thanks and sorry for the delay. I think we're close, but I still see some comments from reviews by @bkamins and I in 2022 which still seem to apply. For example https://github.com/JuliaStats/GLM.jl/pull/487/files#r1032949805, which is an important point to decide. Also Codecov indicates that only 80% of the diff is tested, ideally it should be 100%, at least for code that was introduced by this PR. For example right below the comment I mentioned there seem to be |
|
||
r2(obj::LinearModel) = 1 - deviance(obj)/nulldeviance(obj) | ||
adjr2(obj::LinearModel) = 1 - (1 - r²(obj))*(nobs(obj)-hasintercept(obj))/dof_residual(obj) | ||
|
||
working_residuals(x::LinearModel) = residuals(x) | ||
working_weights(x::LinearModel) = x.pp.wts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Define working_weights(x::LinPred)
and call that from here for consistency.
|
||
f, (y, X) = modelframe(f, data, contrasts, LinearModel) | ||
_wts = convert_weights(wts) | ||
_wts = isempty(_wts) ? uweights(length(y)) : _wts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also print a deprecation warning when weights have a different length from y
. We don't want to continue accepting empty vectors in the future as people should use UnitWeights
instead.
src/lm.jl
Outdated
N = length(m.rr.y) | ||
n = sum(log, wts) | ||
0.5*(n - N * (log(2π * nulldeviance(m)/N) + 1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we sure this definition is OK for both analytical weights and probability weights? I think we discussed this before, but loglikelihood
throws an error for probability weights so I'm surprised that nullloglikelihood
doesn't.
Co-authored-by: Milan Bouchet-Valat <[email protected]>
Co-authored-by: Milan Bouchet-Valat <[email protected]>
Co-authored-by: Milan Bouchet-Valat <[email protected]>
Co-authored-by: Milan Bouchet-Valat <[email protected]>
@gragusa would you have time to address the review comments here? Would be good to get this one in. |
hello! I was also trying to get analytical weights to work and stumbled on to this PR. Happy to take a look at it if it would be helpful |
I fetched this PR and made some simple changes: (1) merged in the commits from main, (2) removed. extraneous files (vscode and the .tex) file, and the tests seem to run correctly. Is there a way i can push to this PR? |
Just to be clear, I dont want to step on anyones toes, I was doing this just to be able to use weights on my local |
Some help would be welcome as it doesn't seem @gragusa has too much time to finish this. You'd need him to give you permissions to push to his fork. Otherwise you'll have to make a new PR. |
That's what i thought too based on this tutorial, though I have seen repo maintainers directly push to my fork ( |
I was planning to finish the PR — or at least addressing the outstanding issues.
Sent from Outlook for iOS<https://aka.ms/o0ukef>
…________________________________
From: Milan Bouchet-Valat ***@***.***>
Sent: Tuesday, March 25, 2025 5:11:26 PM
To: JuliaStats/GLM.jl ***@***.***>
Cc: Giuseppe Ragusa ***@***.***>; Mention ***@***.***>
Subject: Re: [JuliaStats/GLM.jl] Taking weighting seriously (PR #487)
Some help would be welcome as it doesn't seem @gragusa<https://github.com/gragusa> has too much time to finish this. You'd need him to give you permissions to push to his fork. Otherwise you'll have to make a new PR.
—
Reply to this email directly, view it on GitHub<#487 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AAD5DATZANPG4LX5PUF2E7L2WFWZ5AVCNFSM6AAAAABSC6KE6KVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDONJRG44TGNBQG4>.
You are receiving this because you were mentioned.Message ID: ***@***.***>
[nalimilan]nalimilan left a comment (JuliaStats/GLM.jl#487)<#487 (comment)>
Some help would be welcome as it doesn't seem @gragusa<https://github.com/gragusa> has too much time to finish this. You'd need him to give you permissions to push to his fork. Otherwise you'll have to make a new PR.
—
Reply to this email directly, view it on GitHub<#487 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AAD5DATZANPG4LX5PUF2E7L2WFWZ5AVCNFSM6AAAAABSC6KE6KVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDONJRG44TGNBQG4>.
You are receiving this because you were mentioned.Message ID: ***@***.***>
|
@gragusa Are you waiting for a review or do you need to push more commits first? |
I have the last commit almost ready. Will let you now when I push it (should be tonight or tomorrow at the latest).
Sent from Outlook for iOS<https://aka.ms/o0ukef>
…________________________________
From: Milan Bouchet-Valat ***@***.***>
Sent: Thursday, April 10, 2025 9:59:02 PM
To: JuliaStats/GLM.jl ***@***.***>
Cc: Giuseppe Ragusa ***@***.***>; Mention ***@***.***>
Subject: Re: [JuliaStats/GLM.jl] Taking weighting seriously (PR #487)
@gragusa<https://github.com/gragusa> Are you waiting for a review or do you need to push more commits first?
—
Reply to this email directly, view it on GitHub<#487 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AAD5DAQKMYLZ3R2HBRD6JFT2Y3EQNAVCNFSM6AAAAABSC6KE6KVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDOOJVGAZDONBZGE>.
You are receiving this because you were mentioned.Message ID: ***@***.***>
[https://avatars.githubusercontent.com/u/1120448?s=20&v=4]nalimilan left a comment (JuliaStats/GLM.jl#487)<#487 (comment)>
@gragusa<https://github.com/gragusa> Are you waiting for a review or do you need to push more commits first?
—
Reply to this email directly, view it on GitHub<#487 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AAD5DAQKMYLZ3R2HBRD6JFT2Y3EQNAVCNFSM6AAAAABSC6KE6KVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDOOJVGAZDONBZGE>.
You are receiving this because you were mentioned.Message ID: ***@***.***>
|
Thanks @gragusa ! So excited to try this out! |
This PR addresses several problems with the current GLM implementation.
Current status
In master, GLM/LM only accepts weights through the keyword
wts
. These weights are implicitly frequency weights.With this PR
FrequencyWeights, AnalyticWeights, and ProbabilityWeights are possible. The API is the following
The old behavior -- passing a vector
wts=df.wts
is deprecated and for the moment, the array os coerceddf.wts
to FrequencyWeights.To allow dispatching on the weights,
CholPred
takes a parameterT<:AbstractWeights
. The unweighted LM/GLM has UnitWeights as the parameter for the type.This PR also implements
residuals(r::RegressionModel; weighted::Bool=false)
andmodelmatrix(r::RegressionModel; weighted::Bool = false)
. The new signature for these two methods is pending in StatsApi.There are many changes that I had to make to make everything work. Tests are passing, but some new feature needs new tests. Before implementing them, I wanted to ensure that the approach taken was liked.
I have also implemented
momentmatrix
, which returns the estimating function of the estimator. I arrived to the conclusion that it does not make sense to have a keyword argumentweighted
. Thus I will amend JuliaStats/StatsAPI.jl#16 to remove such a keyword from the signature.Update
I think I covered all the suggestions/comments with this exception as I have to think about it. Maybe this can be addressed later. The new standard errors (the one for
ProbabilityWeights
) also work in the rank deficient case (and so doescooksdistance
).Tests are passing and I think they cover everything that I have implemented. Also, added a section in the documentation about using
Weights
and updatedjldoc
with the new signature ofCholeskyPivoted
.To do:
Closes #186.