
Taking weighting seriously #487

Open
wants to merge 121 commits into base: master
Conversation

@gragusa gragusa commented Jul 15, 2022

This PR addresses several problems with the current GLM implementation.

Current status
In master, GLM/LM only accepts weights through the keyword wts. These weights are implicitly frequency weights.

With this PR
FrequencyWeights, AnalyticWeights, and ProbabilityWeights are now supported. The API is as follows:

## Frequency Weights
lm(@formula(y ~ x), df; wts = fweights(df.wts))
## Analytic Weights
lm(@formula(y ~ x), df; wts = aweights(df.wts))
## Probability Weights
lm(@formula(y ~ x), df; wts = pweights(df.wts))

The old behavior -- passing a plain vector via wts=df.wts -- is deprecated; for the moment, the array df.wts is coerced to FrequencyWeights.

To allow dispatching on the weights, CholPred takes a type parameter T<:AbstractWeights. The unweighted LM/GLM uses UnitWeights as this parameter.
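The type-parameter approach can be illustrated with a small sketch; the struct and function below are illustrative stand-ins, not GLM.jl internals:

```julia
using StatsBase  # provides AbstractWeights, FrequencyWeights, UnitWeights, fweights, uweights

# Hypothetical predictor parameterized by the weight type, mirroring the
# CholPred{T<:AbstractWeights} described above.
struct ToyPred{T<:AbstractWeights}
    wts::T
end

# Dispatch on the weight type: frequency weights change the effective
# sample size, unit weights leave it as the number of observations.
effective_nobs(p::ToyPred{<:FrequencyWeights}) = sum(p.wts)
effective_nobs(p::ToyPred{<:UnitWeights}) = length(p.wts)
```

For example, `effective_nobs(ToyPred(fweights([2, 3])))` gives 5, while `effective_nobs(ToyPred(uweights(4)))` gives 4.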

This PR also implements residuals(r::RegressionModel; weighted::Bool=false) and modelmatrix(r::RegressionModel; weighted::Bool=false). The new signature for these two methods is pending in StatsAPI.
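The intent of the `weighted` keyword can be sketched in plain Julia, assuming the usual convention that the weighted residuals and model matrix are the unweighted ones scaled by the square roots of the weights, so the weighted normal equations reduce to unweighted cross-products:

```julia
using LinearAlgebra, StatsBase

y = [1.0, 2.0, 3.0, 5.0]
X = [ones(4) [0.0, 1.0, 2.0, 3.0]]
w = aweights([1.0, 2.0, 1.0, 2.0])

# Weighted least squares fit computed by hand
W = Diagonal(collect(w))
β = (X' * W * X) \ (X' * W * y)

r  = y .- X * β     # residuals(m)                  (unweighted)
rw = sqrt.(w) .* r  # residuals(m; weighted=true)   -- assumed convention
Xw = sqrt.(w) .* X  # modelmatrix(m; weighted=true) -- assumed convention

# The weighted quantities satisfy the orthogonality condition Xw'rw ≈ 0,
# i.e. the unweighted cross-product recovers the weighted normal equations.
maximum(abs, Xw' * rw)
```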

Many changes were needed to make everything work. Tests are passing, but some of the new features need new tests. Before implementing those, I wanted to make sure the overall approach is acceptable.

I have also implemented momentmatrix, which returns the estimating function of the estimator. I came to the conclusion that a weighted keyword argument does not make sense here, so I will amend JuliaStats/StatsAPI.jl#16 to remove it from the signature.
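For weighted least squares, the estimating function that momentmatrix is described as returning can be sketched as the matrix whose i-th row is wᵢ·xᵢ·(yᵢ − xᵢ'β̂); its columns sum to zero at the fitted coefficients. The function name below is illustrative, not GLM.jl's implementation:

```julia
using LinearAlgebra

# Row i is w[i] * X[i, :] * (y[i] - X[i, :]'β): the score contribution of
# observation i in weighted least squares.
moment_matrix(X, y, β, w) = (w .* (y .- X * β)) .* X

y = [1.0, 2.0, 3.0, 5.0]
X = [ones(4) [0.0, 1.0, 2.0, 3.0]]
w = [1.0, 2.0, 1.0, 2.0]
β = (X' * Diagonal(w) * X) \ (X' * Diagonal(w) * y)

M = moment_matrix(X, y, β, w)
# First-order conditions: each column of M sums to ≈ 0 at the WLS solution
vec(sum(M, dims = 1))
```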

Update

I think I have covered all the suggestions/comments, with this exception, which I still have to think about; maybe it can be addressed later. The new standard errors (the ones for ProbabilityWeights) also work in the rank-deficient case (and so does cooksdistance).

Tests are passing, and I think they cover everything I have implemented. I also added a section to the documentation about using weights and updated the jldoctests with the new signature of CholeskyPivoted.

To do:

  • Deal with weighted standard errors with rank deficient designs
  • Document the new API
  • Improve testing

Closes #186.

@codecov-commenter

codecov-commenter commented Jul 16, 2022

Codecov Report

Attention: Patch coverage is 79.81073% with 64 lines in your changes missing coverage. Please review.

Project coverage is 86.45%. Comparing base (89493a4) to head (574ec69).

Files with missing lines Patch % Lines
src/glmfit.jl 78.30% 23 Missing ⚠️
src/lm.jl 75.60% 20 Missing ⚠️
src/linpred.jl 84.74% 18 Missing ⚠️
src/glmtools.jl 62.50% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #487      +/-   ##
==========================================
- Coverage   90.33%   86.45%   -3.89%     
==========================================
  Files           8        8              
  Lines        1107     1277     +170     
==========================================
+ Hits         1000     1104     +104     
- Misses        107      173      +66     


@lrnv

lrnv commented Jul 20, 2022

Hey,

Would this fix the issue I am having? When rows of the data contain missing values, GLM discards those rows but does not discard the corresponding entries of df.weights, and then complains that there are too many weights.

I think the interface should allow weights to be passed as a column of the DataFrame, which would take care of such things (as it does for the other variables).
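Until weights can be passed as a column of the table, one workaround for the mismatch described above is to drop incomplete rows before fitting, so the weight vector stays aligned with the rows actually used. A sketch, with illustrative column names:

```julia
using DataFrames

df = DataFrame(y = [1.0, missing, 3.0, 4.0],
               x = [0.5, 1.0, 1.5, 2.0],
               w = [1.0, 2.0, 1.0, 2.0])

# Drop rows with missing values in the modeled columns *before* fitting,
# so df.w keeps exactly one weight per remaining row.
complete = dropmissing(df, [:y, :x])

# then, e.g.: lm(@formula(y ~ x), complete; wts = fweights(complete.w))
nrow(complete)
```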

@gragusa
Author

gragusa commented Jul 20, 2022

Would that fix the issue I am having, which is that if rows of the data contains missing values, GLM discard those rows, but does not discard the corresponding values of df.weights and then yells that there are too many weights ?

Not really, but it would be easy to add as a feature. Before digging further into this, though, I would like to know whether there is consensus on the approach of this PR.

@alecloudenback

alecloudenback commented Aug 14, 2022

FYI, this appears to fix #420; a PR was started in #432, and the author closed it for lack of time to investigate CI failures.

Here's the test case pulled from #432, which passes with the changes in #487.

@testset "collinearity and weights" begin
    rng = StableRNG(1234321)
    x1 = randn(100)
    x1_2 = 3 * x1
    x2 = 10 * randn(100)
    x2_2 = -2.4 * x2
    y = 1 .+ randn() * x1 + randn() * x2 + 2 * randn(100)
    df = DataFrame(y = y, x1 = x1, x2 = x1_2, x3 = x2, x4 = x2_2, weights = repeat([1, 0.5],50))
    f = @formula(y ~ x1 + x2 + x3 + x4)
    lm_model = lm(f, df, wts = df.weights)#, dropcollinear = true)
    X = [ones(length(y)) x1_2 x2_2]
    W = Diagonal(df.weights)
    coef_naive = (X'W*X)\X'W*y
    @test lm_model.model.pp.chol isa CholeskyPivoted
    @test rank(lm_model.model.pp.chol) == 3
    @test isapprox(filter(!=(0.0), coef(lm_model)), coef_naive)
end

Can this test set be added?

Is there any other feedback for @gragusa? It would be great to get this merged if it's good to go.

@nalimilan
Member

Sorry for the long delay, I hadn't realized you were waiting for feedback. Looks great overall, please feel free to finish it! I'll try to find the time to make more specific comments.

@nalimilan nalimilan left a comment
Member

I've read the code. Lots of comments, but all of these are minor. The main one is mostly stylistic: in most cases it seems that using if wts isa UnitWeights inside a single method (like the current structure) gives simpler code than defining several methods. Otherwise the PR looks really clean!

What are your thoughts regarding testing? There are a lot of combinations to test, and it's not easy to see how to integrate that into the current organization of the tests. One way would be to add code for each kind of test to each @testset that checks a given model family (or a particular case, like collinear variables). There's also the issue of testing the QR factorization, which isn't used by default.

@bkamins
Contributor

bkamins commented Aug 31, 2022

A very nice PR. In the tests, can we have a test set that compares the results of aweights, fweights, and pweights on the same data (coefficients, predictions, covariance matrix of the estimates, p-values, etc.)?

@gragusa
Author

gragusa commented Dec 13, 2024

@nalimilan I think I addressed all issues and comments.

@nalimilan
Member

Thanks, and sorry for the delay. I think we're close, but I still see some comments from the 2022 reviews by @bkamins and me that seem to apply. For example https://github.com/JuliaStats/GLM.jl/pull/487/files#r1032949805, which is an important point to decide.

Also, Codecov indicates that only 80% of the diff is tested; ideally it should be 100%, at least for code introduced by this PR. For example, right below the comment I mentioned there seem to be loglik_apweights_obs methods that are not tested at all. The same goes for some isweighted, loglikelihood, and residuals methods.


r2(obj::LinearModel) = 1 - deviance(obj)/nulldeviance(obj)
adjr2(obj::LinearModel) = 1 - (1 - r²(obj))*(nobs(obj)-hasintercept(obj))/dof_residual(obj)

working_residuals(x::LinearModel) = residuals(x)
working_weights(x::LinearModel) = x.pp.wts
Member

Define working_weights(x::LinPred) and call that from here for consistency.
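The suggested refactor could look like the following sketch, with toy types standing in for GLM.jl's LinPred and LinearModel internals:

```julia
# Toy stand-ins for GLM.jl's LinPred / LinearModel types.
struct ToyLinPred
    wts::Vector{Float64}
end

struct ToyLinearModel
    pp::ToyLinPred
end

# Define the accessor at the predictor level...
working_weights(p::ToyLinPred) = p.wts
# ...and forward the model-level method to it, for consistency.
working_weights(m::ToyLinearModel) = working_weights(m.pp)
```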


f, (y, X) = modelframe(f, data, contrasts, LinearModel)
_wts = convert_weights(wts)
_wts = isempty(_wts) ? uweights(length(y)) : _wts
Member

Also print a deprecation warning when weights have a different length from y. We don't want to continue accepting empty vectors in the future as people should use UnitWeights instead.

src/lm.jl Outdated
Comment on lines 260 to 262
N = length(m.rr.y)
n = sum(log, wts)
0.5*(n - N * (log(2π * nulldeviance(m)/N) + 1))
Member

Are we sure this definition is OK for both analytic weights and probability weights? I think we discussed this before, but loglikelihood throws an error for probability weights, so I'm surprised that nullloglikelihood doesn't.

gragusa and others added 4 commits December 18, 2024 12:51
Co-authored-by: Milan Bouchet-Valat <[email protected]> (×4)
@andreasnoack
Member

@gragusa would you have time to address the review comments here? Would be good to get this one in.

@ajinkya-k
Contributor

Hello! I was also trying to get analytic weights to work and stumbled onto this PR. Happy to take a look at it if that would be helpful.

@ajinkya-k
Contributor

I fetched this PR and made some simple changes: (1) merged in the commits from main, and (2) removed extraneous files (the vscode and .tex files). The tests seem to run correctly. Is there a way I can push to this PR?

@ajinkya-k
Contributor

Just to be clear, I don't want to step on anyone's toes; I was doing this just to be able to use weights locally.

@nalimilan
Member

Some help would be welcome, as it doesn't seem @gragusa has much time to finish this. You'd need him to give you permission to push to his fork; otherwise you'll have to make a new PR.

@ajinkya-k
Contributor

ajinkya-k commented Mar 25, 2025

That's what I thought too, based on this tutorial, though I have seen repo maintainers push directly to my fork (example here -- whoops, looks like I nuked that commit with a rebase). That's probably because I checked the "let maintainers edit this PR" box somewhere, right?

@gragusa
Author

gragusa commented Mar 25, 2025 via email

@nalimilan
Member

@gragusa Are you waiting for a review or do you need to push more commits first?

@gragusa
Author

gragusa commented Apr 10, 2025 via email

@ajinkya-k
Contributor

Thanks @gragusa ! So excited to try this out!

Development

Successfully merging this pull request may close these issues.

Path towards GLMs with fweights, pweights, and aweights