305 feature request integrate clusters into the doublemldata class #338

JanTeichertKluge · 2025-06-17T14:38:01Z

Thanks for contributing to DoubleML.
Before submitting a PR, please take a look at our contribution guidelines.
Additionally, please fill out the PR checklist below.

Description

Please describe all changes and additions.
In addition, you may want to comment on the diff in GitHub.

Reference to Issues or PRs

Add references to related issues or PRs here.

Comments

Here you can add further comments.
You can also delete this section, if it is not necessary.

PR Checklist

Please fill out this PR checklist (see our contributing guidelines for details).

The title of the pull request summarizes the changes made.
The PR contains a detailed description of all changes and additions.
References to related issues or PRs are added.
The code passes all (unit) tests.
Enhancements or new feature are equipped with unit tests.
The changes adhere to the PEP8 standards.

Copilot

Pull Request Overview

This PR integrates support for cluster data by refactoring the DoubleMLData class, deprecating the old DoubleMLClusterData implementation, and updating dataset and test imports accordingly. Key changes include adding cluster_cols handling in data classes, updating references to plm datasets, and deprecating DoubleMLClusterData in favor of DoubleMLData with is_cluster_data=True.
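The deprecation pattern described in the overview can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the DoubleML implementation: `BaseData` and `ClusterData` are hypothetical stand-ins for DoubleMLData and DoubleMLClusterData, and deriving `is_cluster_data` from the presence of `cluster_cols` is an assumption made for the sketch.

```python
import warnings


class BaseData:
    """Hypothetical stand-in for a data container with optional cluster columns."""

    def __init__(self, rows, y_col, d_cols, cluster_cols=None):
        self.rows = rows
        self.y_col = y_col
        self.d_cols = [d_cols] if isinstance(d_cols, str) else list(d_cols)
        if isinstance(cluster_cols, str):
            cluster_cols = [cluster_cols]
        self.cluster_cols = list(cluster_cols) if cluster_cols else None

    @property
    def is_cluster_data(self):
        # Assumption: cluster handling is enabled whenever cluster columns are set.
        return self.cluster_cols is not None


class ClusterData(BaseData):
    """Hypothetical deprecated alias that forwards to BaseData."""

    def __init__(self, rows, y_col, d_cols, cluster_cols):
        warnings.warn(
            "ClusterData is deprecated; use BaseData with cluster_cols instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        super().__init__(rows, y_col, d_cols, cluster_cols=cluster_cols)
```

Keeping the old class as a thin subclass means existing call sites keep working while the warning points users to the unified class.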

Reviewed Changes

Copilot reviewed 137 out of 137 changed files in this pull request and generated 1 comment.

File	Description
doubleml/datasets/fetch_401K.py	Added cluster support and updated error messages for polynomial features implementation.
doubleml/data/** (multiple files)	Integrated cluster_cols handling and refactored test cases to use updated APIs.
.github/ISSUE_TEMPLATE/bug_report.yml	Updated snippet import to reference plm.datasets.
CONTRIBUTING.md	Updated dataset import from doubleml.datasets to doubleml.plm.datasets.

Copilot · 2025-06-17T14:38:31Z

doubleml/datasets/fetch_401K.py

+    data = raw_data.copy()
+
+    if polynomial_features:
+        raise NotImplementedError("polynomial_features os not implemented yet for fetch_401K.")


There is a typo in the error message: 'os' should be 'is'. Consider changing it to "polynomial_features is not implemented yet for fetch_401K."

Suggested change

-        raise NotImplementedError("polynomial_features os not implemented yet for fetch_401K.")
+        raise NotImplementedError("polynomial_features is not implemented yet for fetch_401K.")

doubleml/data/did_data.py

@@ -0,0 +1,323 @@
+import io
+import pandas as pd
+from sklearn.utils.validation import check_array


To fix the issue, the unused import statement for check_array should be removed. This involves deleting the check_array import on line 3 and ensuring that no other references to it exist in the file. Since check_array is also redundantly imported on line 8, both instances should be removed. This change will not affect the functionality of the code, as check_array is not used anywhere in the file.

doubleml/data/did_data.py

+from sklearn.utils import assert_all_finite
+
+from doubleml.data.base_data import DoubleMLData
+from doubleml.utils._estimation import _assure_2d_array


To fix the issue, the unused import statement should be removed. Specifically, the line from doubleml.utils._estimation import _assure_2d_array should be deleted from the file doubleml/data/did_data.py. This will eliminate the unnecessary dependency and improve code readability.

doubleml/data/did_data.py

+
+from doubleml.data.base_data import DoubleMLData
+from doubleml.utils._estimation import _assure_2d_array
+from sklearn.utils.validation import check_array, check_consistent_length, column_or_1d


To fix the issue, the redundant import of check_array on line 8 should be removed. This will eliminate the unused import and ensure that the code adheres to best practices for clean and minimal imports. The functionality of the code will remain unchanged, as the first import of check_array on line 3 is sufficient for its usage in the file.

doubleml/data/did_data.py

+from doubleml.data.base_data import DoubleMLData
+from doubleml.utils._estimation import _assure_2d_array
+from sklearn.utils.validation import check_array, check_consistent_length, column_or_1d
+from sklearn.utils.multiclass import type_of_target


To fix the issue, we will remove the unused import statement from sklearn.utils.multiclass import type_of_target on line 9. This will clean up the code and eliminate the unnecessary dependency on sklearn.utils.multiclass.
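Unused-import notices like the ones above can be reproduced locally with a small check built on the standard-library `ast` module. The sketch below is only an approximation and is not the analysis the code-scanning tool runs: `find_unused_imports` is a hypothetical helper, it ignores `__all__` and string references, and it does not flag a name that is imported twice.

```python
import ast


def find_unused_imports(source):
    """Return imported names that are never referenced in `source`."""
    tree = ast.parse(source)
    imported = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # `import a.b` binds the top-level name `a`.
                name = alias.asname or alias.name.split(".")[0]
                imported[name] = node.lineno
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                imported[alias.asname or alias.name] = node.lineno
    # Every bare identifier, including the base of attribute access such as `pd.x`.
    used = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return sorted(name for name in imported if name not in used)


sample = """
import io
import pandas as pd
from sklearn.utils.validation import check_array

def load(text):
    return pd.read_csv(io.StringIO(text))
"""

print(find_unused_imports(sample))  # ['check_array']
```

The source is only parsed, never executed, so the sample does not need pandas or scikit-learn to be installed.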

doubleml/datasets/fetch_401K.py

+    Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21: C1-C68.
+    doi:`10.1111/ectj.12097 <https://doi.org/10.1111/ectj.12097>`_.
+    """
+    _array_alias = _get_array_alias()


To fix the issue, the assignment of _array_alias should be removed entirely, as it serves no purpose in the current implementation. The removal should be done carefully to ensure that no other functionality is affected. Since _get_array_alias() is not used elsewhere in the provided code snippet, no additional changes are required.

doubleml/datasets/fetch_bonus.py

+    Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21: C1-C68.
+    doi:`10.1111/ectj.12097 <https://doi.org/10.1111/ectj.12097>`_.
+    """
+    _array_alias = _get_array_alias()


To fix the issue, we should remove the assignment to _array_alias on line 45 entirely, as it serves no purpose in the current implementation. This approach avoids unnecessary clutter in the code and adheres to best practices for maintaining clean and readable code. Since _array_alias is not used elsewhere in the function or the provided code snippet, its removal will not affect the functionality of the fetch_bonus function.

doubleml/did/datasets/dgp_did_SZ2020.py


 _array_alias = _get_array_alias()
 _data_frame_alias = _get_data_frame_alias()
-_dml_data_alias = _get_dml_data_alias()
+_dml_did_data_alias = _get_dml_did_data_alias()
+_dml_panel_data_alias = _get_dml_panel_data_alias()


To fix the issue, the unused variable _dml_panel_data_alias should be removed from the code. This involves deleting the assignment on line 12. Care must be taken to ensure that the right-hand side of the assignment (_get_dml_panel_data_alias()) does not have any side effects. If _get_dml_panel_data_alias() is a pure function (as its name suggests), it can be safely removed without affecting the rest of the code.

doubleml/did/datasets/dgp_did_SZ2020.py

@@ -60,7 +62,7 @@
    return res


-def make_did_SZ2020(n_obs=500, dgp_type=1, cross_sectional_data=False, return_type="DoubleMLData", **kwargs):
+def make_did_SZ2020(n_obs=500, dgp_type=1, cross_sectional_data=False, return_type="DoubleMLDIDData", **kwargs):


To fix the issue, an explicit return statement should be added at the end of the make_did_SZ2020 function. This ensures that the function always returns a value, even if the code execution reaches the end without hitting any explicit return statements. The explicit return value should be None, as this is the default implicit return value in Python. This change does not alter the existing functionality but makes the code more readable and consistent.
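An alternative to appending `return None` is to structure the dispatch so that no path can fall off the end of the function. The sketch below illustrates that idea with a hypothetical `make_data` generator; it is not the code of make_did_SZ2020, and the `"dict"` and `"list"` return types are invented for the example.

```python
def make_data(n_obs=3, return_type="dict"):
    """Hypothetical generator that dispatches on return_type."""
    values = list(range(n_obs))
    if return_type == "dict":
        return {"y": values}
    if return_type == "list":
        return values
    # Raising here means every successful call returns a value explicitly
    # and no branch ends in an implicit `return None`.
    raise ValueError(f"Invalid return_type: {return_type!r}")
```

With guard clauses like these, an unsupported `return_type` fails loudly instead of silently producing `None`.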

doubleml/rdd/tests/test_rdd_exceptions.py

@@ -17,7 +17,7 @@
    columns=["y", "d", "score"] + ["x" + str(i) for i in range(data["X"].shape[1])],
 )

-dml_data = DoubleMLData(df, y_col="y", d_cols="d", s_col="score")
+dml_data = DoubleMLRDDData(df, y_col="y", d_cols="d", s_col="score")


To fix the issue, we need to verify the correct parameter name for the DoubleMLRDDData class's __init__ method. If s_col is incorrect, it should be replaced with the correct parameter name. Based on the context, s_col likely refers to the score column, and the correct parameter name might be score_col or something similar.

Steps:

1. Identify the correct parameter name for the score column in the DoubleMLRDDData class.
2. Replace s_col="score" with the correct argument name and value in the instantiation of DoubleMLRDDData.

doubleml/rdd/tests/test_rdd_return_types.py

@@ -15,7 +15,7 @@
    np.column_stack((data["Y"], data["D"], data["score"], data["X"])),
    columns=["y", "d", "score"] + ["x" + str(i) for i in range(data["X"].shape[1])],
 )
-dml_data = dml.DoubleMLData(df, y_col="y", d_cols="d", s_col="score")
+dml_data = dml.DoubleMLRDDData(df, y_col="y", d_cols="d", s_col="score")


To fix the issue, we need to replace the incorrect argument name s_col with the correct parameter name expected by the DoubleMLRDDData class. Based on the context, it is likely that the correct parameter name is score_col, as it aligns with the naming convention used for other parameters like y_col and d_cols.

Steps to fix:

1. Identify the correct parameter name for the score column in the DoubleMLRDDData class. This is likely score_col.
2. Update line 18 to use the correct parameter name.
3. Ensure that the rest of the code remains consistent with this change.
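The first step, confirming which keyword a constructor accepts, can be done programmatically with `inspect.signature` rather than by guessing. The sketch below uses a hypothetical `RDDDataStub` class whose `score_col` parameter is assumed from the suggestion above; the real DoubleMLRDDData signature may differ and should be checked the same way.

```python
import inspect


class RDDDataStub:
    """Hypothetical stand-in; the real DoubleMLRDDData signature may differ."""

    def __init__(self, data, y_col, d_cols, score_col):
        self.data = data
        self.y_col = y_col
        self.d_cols = d_cols
        self.score_col = score_col


def accepts_keyword(cls, name):
    """Return True if cls.__init__ accepts `name` as a keyword argument."""
    params = inspect.signature(cls.__init__).parameters
    if name in params:
        return params[name].kind is not inspect.Parameter.POSITIONAL_ONLY
    # A **kwargs parameter accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


print(accepts_keyword(RDDDataStub, "score_col"))  # True
print(accepts_keyword(RDDDataStub, "s_col"))      # False
```

Passing a keyword the constructor does not declare, such as `s_col` here, raises a `TypeError` at call time, which is the failure the check above is reporting.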

SvenKlaassen and others added 30 commits June 2, 2025 14:20

add a cross-sectional dgp

ac858cd

add simple test cases for cross sectional dgp

10e532e

reset index for in panel data

c96605d

add basic did_cs_binary version with simple tests

61dbf11

add internal atribute _score_dim to DoubleML class

ceebc6e

check prediction size based on internal n_obs

ade3b9a

update score dimensions init in the cs object

f113e61

Refactor Data Generators #306

d65edf8

update tests acc. to Refactor Data Generators #306

56d832c

update docstrings acc. to Refactor Data Generators #306

02adb24

update docstrings acc. to Refactor Data Generators #306

39d4e7e

update irm submod tests acc. to Refactor Data Generators #306

83cfe9c

update irm submod tests acc. to Refactor Data Generators #306

3ff0edb

update irm submod tests acc. to Refactor Data Generators #306

caa530e

update docstrings acc. to Refactor Data Generators #306

4cb9148

update docstrings acc. to Refactor Data Generators #306

312f601

update docstrings acc. to Refactor Data Generators #306

0d07790

update documentations acc. to Refactor Data Generators #306

8b4f4bc

update tests acc. to Refactor Data Generators #306

5c44395

Merge pull request #331 from DoubleML/306-refactor-data-generators

6fa737c

306 refactor data generators

Merge pull request #332 from DoubleML/JanTeichertKluge/issue272

cada753

Jan teichert kluge/issue272

upd

a9f4284

upd

a2566cb

update lambda and p calculation in did_cs

9ef4e53

add _score_dim property to doubleml class

e90441b

upd 305

eb19efe

update data backends

97abdd8

add _n_obs_sample_splitting property to doubleml class

9f6f5d4

some progress on refactoring the data backends.

b96a839

update check_resampling input

eb951c4

SvenKlaassen and others added 25 commits June 12, 2025 16:39

update return type tests for did cs binary

e7a9f5c

adjust unit tests for ssm

bba5160

adjust unit tests for did

96ebd03

adjust unit tests general

a1686d5

adjust unit tests general

756092c

enhance did_multi plotting with anticipation periods and update color…

6bac76e

… palette handling

update data summary to include unique IDs count in DoubleMLPanelData

77b1a6b

add flexible summary with multiple formats

e52122f

fix format

bf7e16a

Merge pull request #336 from DoubleML/s-update-summary

62a6838

Update DoubleML __str__ method

fix unit tests

6beebd8

adjust workflow in parent class DoubleML

fb421f7

update refactoring acc. to unit test results

b11c0cb

add check for correct data backend

b9bdf7c

renaming after refactoring

4f70523

adjust dummy data (is_cluster_data flag)

19eab81

adjust unit tests

c3fbbb8

adjust t_col setter for DIDData Backend

144ee60

Merge branch '305-feature-request-integrate-clusters-into-the-doublem…

2e0fa4a

…ldata-class' into s-update-cross-sectional-did

Merge pull request #337 from DoubleML/s-update-cross-sectional-did

5056151

S update cross sectional did

fix RDDData (finally...)

70d67ad

adjsut RDD Class

a322e35

adjust DID classes

0a9b3c7

Adjust unit tests for DID

37f11dc

Adjust RDD unit tests

7be2d8f

JanTeichertKluge requested review from SvenKlaassen and Copilot June 17, 2025 14:38

JanTeichertKluge linked an issue Jun 17, 2025 that may be closed by this pull request

[Feature Request]: Integrate Clusters into the DoubleMLData Class #305

Open

Copilot AI reviewed Jun 17, 2025

github-advanced-security bot found potential problems Jun 17, 2025

@@ -240 +240,2 @@
                         raise ValueError("Invalid return_type.")
+                return None

@@ -19,3 +19,3 @@
-            dml_data = DoubleMLRDDData(df, y_col="y", d_cols="d", s_col="score")
+            dml_data = DoubleMLRDDData(df, y_col="y", d_cols="d", score_col="score")

@@ -17,3 +17,3 @@
             )
-            dml_data = dml.DoubleMLRDDData(df, y_col="y", d_cols="d", s_col="score")
+            dml_data = dml.DoubleMLRDDData(df, y_col="y", d_cols="d", score_col="score")
