Skip to content

305 feature request integrate clusters into the doublemldata class #338

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 89 commits into
base: main
Choose a base branch
from

Conversation

JanTeichertKluge
Copy link
Member

Thanks for contributing to DoubleML.
Before submitting a PR, please take a look at our contribution guidelines.
Additionally, please fill out the PR checklist below.

Description

Please describe all changes and additions.
In addition, you may want to comment on the diff in GitHub.

Reference to Issues or PRs

Add references to related issues or PRs here.

Comments

Here you can add further comments.
You can also delete this section, if it is not necessary.

PR Checklist

Please fill out this PR checklist (see our contributing guidelines for details).

  • The title of the pull request summarizes the changes made.
  • The PR contains a detailed description of all changes and additions.
  • References to related issues or PRs are added.
  • The code passes all (unit) tests.
  • Enhancements or new feature are equipped with unit tests.
  • The changes adhere to the PEP8 standards.

SvenKlaassen and others added 30 commits June 2, 2025 14:20
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR integrates support for cluster data by refactoring the DoubleMLData class, deprecating the old DoubleMLClusterData implementation, and updating dataset and test imports accordingly. Key changes include adding cluster_cols handling in data classes, updating references to plm datasets, and deprecating DoubleMLClusterData in favor of DoubleMLData with is_cluster_data=True.

Reviewed Changes

Copilot reviewed 137 out of 137 changed files in this pull request and generated 1 comment.

File Description
doubleml/datasets/fetch_401K.py Added cluster support and updated error messages for polynomial features implementation.
doubleml/data/** (multiple files) Integrated cluster_cols handling and refactored test cases to use updated APIs.
.github/ISSUE_TEMPLATE/bug_report.yml Updated snippet import to reference plm.datasets.
CONTRIBUTING.md Updated dataset import from doubleml.datasets to doubleml.plm.datasets.

data = raw_data.copy()

if polynomial_features:
raise NotImplementedError("polynomial_features os not implemented yet for fetch_401K.")
Copy link
Preview

Copilot AI Jun 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a typo in the error message: 'os' should be 'is'. Consider changing it to "polynomial_features is not implemented yet for fetch_401K.".

Suggested change
raise NotImplementedError("polynomial_features os not implemented yet for fetch_401K.")
raise NotImplementedError("polynomial_features is not implemented yet for fetch_401K.")

Copilot uses AI. Check for mistakes.

@@ -0,0 +1,323 @@
import io
import pandas as pd
from sklearn.utils.validation import check_array

Check notice

Code scanning / CodeQL

Unused import Note

Import of 'check_array' is not used.

Copilot Autofix

AI 5 days ago

To fix the issue, the unused import statement for check_array should be removed. This involves deleting the check_array import on line 3 and ensuring that no other references to it exist in the file. Since check_array is also redundantly imported on line 8, both instances should be removed. This change will not affect the functionality of the code, as check_array is not used anywhere in the file.


Suggested changeset 1
doubleml/data/did_data.py

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/doubleml/data/did_data.py b/doubleml/data/did_data.py
--- a/doubleml/data/did_data.py
+++ b/doubleml/data/did_data.py
@@ -2,3 +2,3 @@
 import pandas as pd
-from sklearn.utils.validation import check_array
+
 from sklearn.utils import assert_all_finite
@@ -7,3 +7,3 @@
 from doubleml.utils._estimation import _assure_2d_array
-from sklearn.utils.validation import check_array, check_consistent_length, column_or_1d
+from sklearn.utils.validation import check_consistent_length, column_or_1d
 from sklearn.utils.multiclass import type_of_target
EOF
@@ -2,3 +2,3 @@
import pandas as pd
from sklearn.utils.validation import check_array

from sklearn.utils import assert_all_finite
@@ -7,3 +7,3 @@
from doubleml.utils._estimation import _assure_2d_array
from sklearn.utils.validation import check_array, check_consistent_length, column_or_1d
from sklearn.utils.validation import check_consistent_length, column_or_1d
from sklearn.utils.multiclass import type_of_target
Copilot is powered by AI and may make mistakes. Always verify output.
from sklearn.utils import assert_all_finite

from doubleml.data.base_data import DoubleMLData
from doubleml.utils._estimation import _assure_2d_array

Check notice

Code scanning / CodeQL

Unused import Note

Import of '_assure_2d_array' is not used.

Copilot Autofix

AI 5 days ago

To fix the issue, the unused import statement should be removed. Specifically, the line from doubleml.utils._estimation import _assure_2d_array should be deleted from the file doubleml/data/did_data.py. This will eliminate the unnecessary dependency and improve code readability.


Suggested changeset 1
doubleml/data/did_data.py

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/doubleml/data/did_data.py b/doubleml/data/did_data.py
--- a/doubleml/data/did_data.py
+++ b/doubleml/data/did_data.py
@@ -6,3 +6,3 @@
 from doubleml.data.base_data import DoubleMLData
-from doubleml.utils._estimation import _assure_2d_array
+# Line removed as `_assure_2d_array` is unused.
 from sklearn.utils.validation import check_array, check_consistent_length, column_or_1d
EOF
@@ -6,3 +6,3 @@
from doubleml.data.base_data import DoubleMLData
from doubleml.utils._estimation import _assure_2d_array
# Line removed as `_assure_2d_array` is unused.
from sklearn.utils.validation import check_array, check_consistent_length, column_or_1d
Copilot is powered by AI and may make mistakes. Always verify output.

from doubleml.data.base_data import DoubleMLData
from doubleml.utils._estimation import _assure_2d_array
from sklearn.utils.validation import check_array, check_consistent_length, column_or_1d

Check notice

Code scanning / CodeQL

Unused import Note

Import of 'check_array' is not used.

Copilot Autofix

AI 5 days ago

To fix the issue, the redundant import of check_array on line 8 should be removed. This will eliminate the unused import and ensure that the code adheres to best practices for clean and minimal imports. The functionality of the code will remain unchanged, as the first import of check_array on line 3 is sufficient for its usage in the file.


Suggested changeset 1
doubleml/data/did_data.py

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/doubleml/data/did_data.py b/doubleml/data/did_data.py
--- a/doubleml/data/did_data.py
+++ b/doubleml/data/did_data.py
@@ -7,3 +7,3 @@
 from doubleml.utils._estimation import _assure_2d_array
-from sklearn.utils.validation import check_array, check_consistent_length, column_or_1d
+from sklearn.utils.validation import check_consistent_length, column_or_1d
 from sklearn.utils.multiclass import type_of_target
EOF
@@ -7,3 +7,3 @@
from doubleml.utils._estimation import _assure_2d_array
from sklearn.utils.validation import check_array, check_consistent_length, column_or_1d
from sklearn.utils.validation import check_consistent_length, column_or_1d
from sklearn.utils.multiclass import type_of_target
Copilot is powered by AI and may make mistakes. Always verify output.
from doubleml.data.base_data import DoubleMLData
from doubleml.utils._estimation import _assure_2d_array
from sklearn.utils.validation import check_array, check_consistent_length, column_or_1d
from sklearn.utils.multiclass import type_of_target

Check notice

Code scanning / CodeQL

Unused import Note

Import of 'type_of_target' is not used.

Copilot Autofix

AI 5 days ago

To fix the issue, we will remove the unused import statement from sklearn.utils.multiclass import type_of_target on line 9. This will clean up the code and eliminate the unnecessary dependency on sklearn.utils.multiclass.

Suggested changeset 1
doubleml/data/did_data.py

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/doubleml/data/did_data.py b/doubleml/data/did_data.py
--- a/doubleml/data/did_data.py
+++ b/doubleml/data/did_data.py
@@ -8,3 +8,3 @@
 from sklearn.utils.validation import check_array, check_consistent_length, column_or_1d
-from sklearn.utils.multiclass import type_of_target
+
 
EOF
@@ -8,3 +8,3 @@
from sklearn.utils.validation import check_array, check_consistent_length, column_or_1d
from sklearn.utils.multiclass import type_of_target


Copilot is powered by AI and may make mistakes. Always verify output.
Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21: C1-C68.
doi:`10.1111/ectj.12097 <https://doi.org/10.1111/ectj.12097>`_.
"""
_array_alias = _get_array_alias()

Check notice

Code scanning / CodeQL

Unused local variable Note

Variable _array_alias is not used.

Copilot Autofix

AI 5 days ago

To fix the issue, the assignment of _array_alias should be removed entirely, as it serves no purpose in the current implementation. The removal should be done carefully to ensure that no other functionality is affected. Since _get_array_alias() is not used elsewhere in the provided code snippet, no additional changes are required.


Suggested changeset 1
doubleml/datasets/fetch_401K.py

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/doubleml/datasets/fetch_401K.py b/doubleml/datasets/fetch_401K.py
--- a/doubleml/datasets/fetch_401K.py
+++ b/doubleml/datasets/fetch_401K.py
@@ -42,3 +42,2 @@
     """
-    _array_alias = _get_array_alias()
     _data_frame_alias = _get_data_frame_alias()
EOF
@@ -42,3 +42,2 @@
"""
_array_alias = _get_array_alias()
_data_frame_alias = _get_data_frame_alias()
Copilot is powered by AI and may make mistakes. Always verify output.
Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21: C1-C68.
doi:`10.1111/ectj.12097 <https://doi.org/10.1111/ectj.12097>`_.
"""
_array_alias = _get_array_alias()

Check notice

Code scanning / CodeQL

Unused local variable Note

Variable _array_alias is not used.

Copilot Autofix

AI 5 days ago

To fix the issue, we should remove the assignment to _array_alias on line 45 entirely, as it serves no purpose in the current implementation. This approach avoids unnecessary clutter in the code and adheres to best practices for maintaining clean and readable code. Since _array_alias is not used elsewhere in the function or the provided code snippet, its removal will not affect the functionality of the fetch_bonus function.


Suggested changeset 1
doubleml/datasets/fetch_bonus.py

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/doubleml/datasets/fetch_bonus.py b/doubleml/datasets/fetch_bonus.py
--- a/doubleml/datasets/fetch_bonus.py
+++ b/doubleml/datasets/fetch_bonus.py
@@ -44,3 +44,2 @@
     """
-    _array_alias = _get_array_alias()
     _data_frame_alias = _get_data_frame_alias()
EOF
@@ -44,3 +44,2 @@
"""
_array_alias = _get_array_alias()
_data_frame_alias = _get_data_frame_alias()
Copilot is powered by AI and may make mistakes. Always verify output.

_array_alias = _get_array_alias()
_data_frame_alias = _get_data_frame_alias()
_dml_data_alias = _get_dml_data_alias()
_dml_did_data_alias = _get_dml_did_data_alias()
_dml_panel_data_alias = _get_dml_panel_data_alias()

Check notice

Code scanning / CodeQL

Unused global variable Note

The global variable '_dml_panel_data_alias' is not used.

Copilot Autofix

AI 5 days ago

To fix the issue, the unused variable _dml_panel_data_alias should be removed from the code. This involves deleting the assignment on line 12. Care must be taken to ensure that the right-hand side of the assignment (_get_dml_panel_data_alias()) does not have any side effects. If _get_dml_panel_data_alias() is a pure function (as its name suggests), it can be safely removed without affecting the rest of the code.


Suggested changeset 1
doubleml/did/datasets/dgp_did_SZ2020.py

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/doubleml/did/datasets/dgp_did_SZ2020.py b/doubleml/did/datasets/dgp_did_SZ2020.py
--- a/doubleml/did/datasets/dgp_did_SZ2020.py
+++ b/doubleml/did/datasets/dgp_did_SZ2020.py
@@ -11,3 +11,3 @@
 _dml_did_data_alias = _get_dml_did_data_alias()
-_dml_panel_data_alias = _get_dml_panel_data_alias()
+
 
EOF
@@ -11,3 +11,3 @@
_dml_did_data_alias = _get_dml_did_data_alias()
_dml_panel_data_alias = _get_dml_panel_data_alias()


Copilot is powered by AI and may make mistakes. Always verify output.
@@ -60,7 +62,7 @@
return res


def make_did_SZ2020(n_obs=500, dgp_type=1, cross_sectional_data=False, return_type="DoubleMLData", **kwargs):
def make_did_SZ2020(n_obs=500, dgp_type=1, cross_sectional_data=False, return_type="DoubleMLDIDData", **kwargs):

Check notice

Code scanning / CodeQL

Explicit returns mixed with implicit (fall through) returns Note

Mixing implicit and explicit returns may indicate an error as implicit returns always return None.

Copilot Autofix

AI 5 days ago

To fix the issue, an explicit return statement should be added at the end of the make_did_SZ2020 function. This ensures that the function always returns a value, even if the code execution reaches the end without hitting any explicit return statements. The explicit return value should be None, as this is the default implicit return value in Python. This change does not alter the existing functionality but makes the code more readable and consistent.

Suggested changeset 1
doubleml/did/datasets/dgp_did_SZ2020.py

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/doubleml/did/datasets/dgp_did_SZ2020.py b/doubleml/did/datasets/dgp_did_SZ2020.py
--- a/doubleml/did/datasets/dgp_did_SZ2020.py
+++ b/doubleml/did/datasets/dgp_did_SZ2020.py
@@ -240 +240,2 @@
             raise ValueError("Invalid return_type.")
+    return None
EOF
@@ -240 +240,2 @@
raise ValueError("Invalid return_type.")
return None
Copilot is powered by AI and may make mistakes. Always verify output.
@@ -17,7 +17,7 @@
columns=["y", "d", "score"] + ["x" + str(i) for i in range(data["X"].shape[1])],
)

dml_data = DoubleMLData(df, y_col="y", d_cols="d", s_col="score")
dml_data = DoubleMLRDDData(df, y_col="y", d_cols="d", s_col="score")

Check failure

Code scanning / CodeQL

Wrong name for an argument in a class instantiation Error test

Keyword argument 's_col' is not a supported parameter name of
DoubleMLRDDData.__init__
.

Copilot Autofix

AI 5 days ago

To fix the issue, we need to verify the correct parameter name for the DoubleMLRDDData class's __init__ method. If s_col is incorrect, it should be replaced with the correct parameter name. Based on the context, s_col likely refers to the score column, and the correct parameter name might be score_col or something similar.

Steps:

  1. Identify the correct parameter name for the score column in the DoubleMLRDDData class.
  2. Replace s_col="score" with the correct argument name and value in the instantiation of DoubleMLRDDData.

Suggested changeset 1
doubleml/rdd/tests/test_rdd_exceptions.py

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/doubleml/rdd/tests/test_rdd_exceptions.py b/doubleml/rdd/tests/test_rdd_exceptions.py
--- a/doubleml/rdd/tests/test_rdd_exceptions.py
+++ b/doubleml/rdd/tests/test_rdd_exceptions.py
@@ -19,3 +19,3 @@
 
-dml_data = DoubleMLRDDData(df, y_col="y", d_cols="d", s_col="score")
+dml_data = DoubleMLRDDData(df, y_col="y", d_cols="d", score_col="score")
 
EOF
@@ -19,3 +19,3 @@

dml_data = DoubleMLRDDData(df, y_col="y", d_cols="d", s_col="score")
dml_data = DoubleMLRDDData(df, y_col="y", d_cols="d", score_col="score")

Copilot is powered by AI and may make mistakes. Always verify output.
@@ -15,7 +15,7 @@
np.column_stack((data["Y"], data["D"], data["score"], data["X"])),
columns=["y", "d", "score"] + ["x" + str(i) for i in range(data["X"].shape[1])],
)
dml_data = dml.DoubleMLData(df, y_col="y", d_cols="d", s_col="score")
dml_data = dml.DoubleMLRDDData(df, y_col="y", d_cols="d", s_col="score")

Check failure

Code scanning / CodeQL

Wrong name for an argument in a class instantiation Error test

Keyword argument 's_col' is not a supported parameter name of
DoubleMLRDDData.__init__
.

Copilot Autofix

AI 5 days ago

To fix the issue, we need to replace the incorrect argument name s_col with the correct parameter name expected by the DoubleMLRDDData class. Based on the context, it is likely that the correct parameter name is score_col, as it aligns with the naming convention used for other parameters like y_col and d_cols.

Steps to fix:

  1. Identify the correct parameter name for the score column in the DoubleMLRDDData class. This is likely score_col.
  2. Update line 18 to use the correct parameter name.
  3. Ensure that the rest of the code remains consistent with this change.

Suggested changeset 1
doubleml/rdd/tests/test_rdd_return_types.py

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/doubleml/rdd/tests/test_rdd_return_types.py b/doubleml/rdd/tests/test_rdd_return_types.py
--- a/doubleml/rdd/tests/test_rdd_return_types.py
+++ b/doubleml/rdd/tests/test_rdd_return_types.py
@@ -17,3 +17,3 @@
 )
-dml_data = dml.DoubleMLRDDData(df, y_col="y", d_cols="d", s_col="score")
+dml_data = dml.DoubleMLRDDData(df, y_col="y", d_cols="d", score_col="score")
 
EOF
@@ -17,3 +17,3 @@
)
dml_data = dml.DoubleMLRDDData(df, y_col="y", d_cols="d", s_col="score")
dml_data = dml.DoubleMLRDDData(df, y_col="y", d_cols="d", score_col="score")

Copilot is powered by AI and may make mistakes. Always verify output.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature Request]: Integrate Clusters into the DoubleMLData Class
2 participants