-
Notifications
You must be signed in to change notification settings - Fork 94
305 feature request integrate clusters into the doublemldata class #338
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
305 feature request integrate clusters into the doublemldata class #338
Conversation
306 refactor data generators
Jan teichert kluge/issue272
… palette handling
Update DoubleML __str__ method
…ldata-class' into s-update-cross-sectional-did
S update cross sectional did
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR integrates support for cluster data by refactoring the DoubleMLData class, deprecating the old DoubleMLClusterData implementation, and updating dataset and test imports accordingly. Key changes include adding cluster_cols handling in data classes, updating references to plm datasets, and deprecating DoubleMLClusterData in favor of DoubleMLData with is_cluster_data=True.
Reviewed Changes
Copilot reviewed 137 out of 137 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
doubleml/datasets/fetch_401K.py | Added cluster support and updated error messages for polynomial features implementation. |
doubleml/data/** (multiple files) | Integrated cluster_cols handling and refactored test cases to use updated APIs. |
.github/ISSUE_TEMPLATE/bug_report.yml | Updated snippet import to reference plm.datasets. |
CONTRIBUTING.md | Updated dataset import from doubleml.datasets to doubleml.plm.datasets. |
data = raw_data.copy() | ||
|
||
if polynomial_features: | ||
raise NotImplementedError("polynomial_features os not implemented yet for fetch_401K.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a typo in the error message: 'os' should be 'is'. Consider changing it to "polynomial_features is not implemented yet for fetch_401K.".
raise NotImplementedError("polynomial_features os not implemented yet for fetch_401K.") | |
raise NotImplementedError("polynomial_features is not implemented yet for fetch_401K.") |
Copilot uses AI. Check for mistakes.
@@ -0,0 +1,323 @@ | |||
import io | |||
import pandas as pd | |||
from sklearn.utils.validation import check_array |
Check notice
Code scanning / CodeQL
Unused import Note
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 5 days ago
To fix the issue, the unused import statement for check_array
should be removed. This involves deleting the check_array
import on line 3 and ensuring that no other references to it exist in the file. Since check_array
is also redundantly imported on line 8, both instances should be removed. This change will not affect the functionality of the code, as check_array
is not used anywhere in the file.
-
Copy modified line R3 -
Copy modified line R8
@@ -2,3 +2,3 @@ | ||
import pandas as pd | ||
from sklearn.utils.validation import check_array | ||
|
||
from sklearn.utils import assert_all_finite | ||
@@ -7,3 +7,3 @@ | ||
from doubleml.utils._estimation import _assure_2d_array | ||
from sklearn.utils.validation import check_array, check_consistent_length, column_or_1d | ||
from sklearn.utils.validation import check_consistent_length, column_or_1d | ||
from sklearn.utils.multiclass import type_of_target |
from sklearn.utils import assert_all_finite | ||
|
||
from doubleml.data.base_data import DoubleMLData | ||
from doubleml.utils._estimation import _assure_2d_array |
Check notice
Code scanning / CodeQL
Unused import Note
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 5 days ago
To fix the issue, the unused import statement should be removed. Specifically, the line from doubleml.utils._estimation import _assure_2d_array
should be deleted from the file doubleml/data/did_data.py
. This will eliminate the unnecessary dependency and improve code readability.
-
Copy modified line R7
@@ -6,3 +6,3 @@ | ||
from doubleml.data.base_data import DoubleMLData | ||
from doubleml.utils._estimation import _assure_2d_array | ||
# Line removed as `_assure_2d_array` is unused. | ||
from sklearn.utils.validation import check_array, check_consistent_length, column_or_1d |
|
||
from doubleml.data.base_data import DoubleMLData | ||
from doubleml.utils._estimation import _assure_2d_array | ||
from sklearn.utils.validation import check_array, check_consistent_length, column_or_1d |
Check notice
Code scanning / CodeQL
Unused import Note
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 5 days ago
To fix the issue, the redundant import of check_array
on line 8 should be removed. This will eliminate the unused import and ensure that the code adheres to best practices for clean and minimal imports. The functionality of the code will remain unchanged, as the first import of check_array
on line 3 is sufficient for its usage in the file.
-
Copy modified line R8
@@ -7,3 +7,3 @@ | ||
from doubleml.utils._estimation import _assure_2d_array | ||
from sklearn.utils.validation import check_array, check_consistent_length, column_or_1d | ||
from sklearn.utils.validation import check_consistent_length, column_or_1d | ||
from sklearn.utils.multiclass import type_of_target |
from doubleml.data.base_data import DoubleMLData | ||
from doubleml.utils._estimation import _assure_2d_array | ||
from sklearn.utils.validation import check_array, check_consistent_length, column_or_1d | ||
from sklearn.utils.multiclass import type_of_target |
Check notice
Code scanning / CodeQL
Unused import Note
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 5 days ago
To fix the issue, we will remove the unused import statement from sklearn.utils.multiclass import type_of_target
on line 9. This will clean up the code and eliminate the unnecessary dependency on sklearn.utils.multiclass
.
-
Copy modified line R9
@@ -8,3 +8,3 @@ | ||
from sklearn.utils.validation import check_array, check_consistent_length, column_or_1d | ||
from sklearn.utils.multiclass import type_of_target | ||
|
||
|
Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21: C1-C68. | ||
doi:`10.1111/ectj.12097 <https://doi.org/10.1111/ectj.12097>`_. | ||
""" | ||
_array_alias = _get_array_alias() |
Check notice
Code scanning / CodeQL
Unused local variable Note
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 5 days ago
To fix the issue, the assignment of _array_alias
should be removed entirely, as it serves no purpose in the current implementation. The removal should be done carefully to ensure that no other functionality is affected. Since _get_array_alias()
is not used elsewhere in the provided code snippet, no additional changes are required.
@@ -42,3 +42,2 @@ | ||
""" | ||
_array_alias = _get_array_alias() | ||
_data_frame_alias = _get_data_frame_alias() |
Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21: C1-C68. | ||
doi:`10.1111/ectj.12097 <https://doi.org/10.1111/ectj.12097>`_. | ||
""" | ||
_array_alias = _get_array_alias() |
Check notice
Code scanning / CodeQL
Unused local variable Note
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 5 days ago
To fix the issue, we should remove the assignment to _array_alias
on line 45 entirely, as it serves no purpose in the current implementation. This approach avoids unnecessary clutter in the code and adheres to best practices for maintaining clean and readable code. Since _array_alias
is not used elsewhere in the function or the provided code snippet, its removal will not affect the functionality of the fetch_bonus
function.
@@ -44,3 +44,2 @@ | ||
""" | ||
_array_alias = _get_array_alias() | ||
_data_frame_alias = _get_data_frame_alias() |
|
||
_array_alias = _get_array_alias() | ||
_data_frame_alias = _get_data_frame_alias() | ||
_dml_data_alias = _get_dml_data_alias() | ||
_dml_did_data_alias = _get_dml_did_data_alias() | ||
_dml_panel_data_alias = _get_dml_panel_data_alias() |
Check notice
Code scanning / CodeQL
Unused global variable Note
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 5 days ago
To fix the issue, the unused variable _dml_panel_data_alias
should be removed from the code. This involves deleting the assignment on line 12. Care must be taken to ensure that the right-hand side of the assignment (_get_dml_panel_data_alias()
) does not have any side effects. If _get_dml_panel_data_alias()
is a pure function (as its name suggests), it can be safely removed without affecting the rest of the code.
-
Copy modified line R12
@@ -11,3 +11,3 @@ | ||
_dml_did_data_alias = _get_dml_did_data_alias() | ||
_dml_panel_data_alias = _get_dml_panel_data_alias() | ||
|
||
|
@@ -60,7 +62,7 @@ | |||
return res | |||
|
|||
|
|||
def make_did_SZ2020(n_obs=500, dgp_type=1, cross_sectional_data=False, return_type="DoubleMLData", **kwargs): | |||
def make_did_SZ2020(n_obs=500, dgp_type=1, cross_sectional_data=False, return_type="DoubleMLDIDData", **kwargs): |
Check notice
Code scanning / CodeQL
Explicit returns mixed with implicit (fall through) returns Note
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 5 days ago
To fix the issue, an explicit return statement should be added at the end of the make_did_SZ2020
function. This ensures that the function always returns a value, even if the code execution reaches the end without hitting any explicit return statements. The explicit return value should be None
, as this is the default implicit return value in Python. This change does not alter the existing functionality but makes the code more readable and consistent.
-
Copy modified line R241
@@ -240 +240,2 @@ | ||
raise ValueError("Invalid return_type.") | ||
return None |
@@ -17,7 +17,7 @@ | |||
columns=["y", "d", "score"] + ["x" + str(i) for i in range(data["X"].shape[1])], | |||
) | |||
|
|||
dml_data = DoubleMLData(df, y_col="y", d_cols="d", s_col="score") | |||
dml_data = DoubleMLRDDData(df, y_col="y", d_cols="d", s_col="score") |
Check failure
Code scanning / CodeQL
Wrong name for an argument in a class instantiation Error test
DoubleMLRDDData.__init__
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 5 days ago
To fix the issue, we need to verify the correct parameter name for the DoubleMLRDDData
class's __init__
method. If s_col
is incorrect, it should be replaced with the correct parameter name. Based on the context, s_col
likely refers to the score column, and the correct parameter name might be score_col
or something similar.
Steps:
- Identify the correct parameter name for the score column in the
DoubleMLRDDData
class. - Replace
s_col="score"
with the correct argument name and value in the instantiation ofDoubleMLRDDData
.
-
Copy modified line R20
@@ -19,3 +19,3 @@ | ||
|
||
dml_data = DoubleMLRDDData(df, y_col="y", d_cols="d", s_col="score") | ||
dml_data = DoubleMLRDDData(df, y_col="y", d_cols="d", score_col="score") | ||
|
@@ -15,7 +15,7 @@ | |||
np.column_stack((data["Y"], data["D"], data["score"], data["X"])), | |||
columns=["y", "d", "score"] + ["x" + str(i) for i in range(data["X"].shape[1])], | |||
) | |||
dml_data = dml.DoubleMLData(df, y_col="y", d_cols="d", s_col="score") | |||
dml_data = dml.DoubleMLRDDData(df, y_col="y", d_cols="d", s_col="score") |
Check failure
Code scanning / CodeQL
Wrong name for an argument in a class instantiation Error test
DoubleMLRDDData.__init__
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 5 days ago
To fix the issue, we need to replace the incorrect argument name s_col
with the correct parameter name expected by the DoubleMLRDDData
class. Based on the context, it is likely that the correct parameter name is score_col
, as it aligns with the naming convention used for other parameters like y_col
and d_cols
.
Steps to fix:
- Identify the correct parameter name for the score column in the
DoubleMLRDDData
class. This is likelyscore_col
. - Update line 18 to use the correct parameter name.
- Ensure that the rest of the code remains consistent with this change.
-
Copy modified line R18
@@ -17,3 +17,3 @@ | ||
) | ||
dml_data = dml.DoubleMLRDDData(df, y_col="y", d_cols="d", s_col="score") | ||
dml_data = dml.DoubleMLRDDData(df, y_col="y", d_cols="d", score_col="score") | ||
|
Thanks for contributing to DoubleML.
Before submitting a PR, please take a look at our contribution guidelines.
Additionally, please fill out the PR checklist below.
Description
Please describe all changes and additions.
In addition, you may want to comment on the diff in GitHub.
Reference to Issues or PRs
Add references to related issues or PRs here.
Comments
Here you can add further comments.
You can also delete this section, if it is not necessary.
PR Checklist
Please fill out this PR checklist (see our contributing guidelines for details).