Here is code to reproduce the problem:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from econml.dml import CausalForestDML
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.model_selection import train_test_split
# Step 1: Create synthetic dataset
np.random.seed(0)
n = 300
X = pd.DataFrame({
    'age': np.random.normal(40, 10, size=n),
    'income': np.random.normal(50000, 15000, size=n),
    'education_years': np.random.normal(16, 2, size=n)
})
# Introduce missing values in 'income'
missing_idx = np.random.choice(n, size=30, replace=False)
X.loc[missing_idx, 'income'] = np.nan
# Binary treatment and outcome
T = np.random.binomial(1, 0.5, size=n)
Y = 1000 + 500*T + 20*X['age'] + np.random.normal(0, 1000, size=n)
# Step 2: Featurizer to handle missing data
featurizer = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])
# Step 3: CausalForestDML model
est = CausalForestDML(
    model_t=RandomForestClassifier(),
    model_y=RandomForestRegressor(),
    featurizer=featurizer,
    discrete_treatment=True,
    random_state=0
)
# Step 4: Fit the model
est.fit(Y, T, X=X)
Unfortunately, this throws an error:
ValueError: Input contains NaN.
The issue appears to be that during fit(), CausalForestDML runs the following:
X, = check_input_arrays(X, force_all_finite='allow-nan' if 'X' in self._gen_allowed_missing_vars() else True)
This isn't quite right: the NaN check should be applied to the featurized X, which is what is ultimately used for fitting, rather than to the raw X itself.
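
As a stopgap, a possible workaround (just a sketch, assuming mean imputation up front is acceptable for this use case; the names X_featurized and est_workaround are illustrative, not part of the code above) is to apply the featurizer to X manually before calling fit, so the estimator never sees the NaNs:

# Hypothetical workaround: featurize X before fitting instead of
# passing the featurizer to CausalForestDML.
X_featurized = featurizer.fit_transform(X)  # imputes 'income', then scales

est_workaround = CausalForestDML(
    model_t=RandomForestClassifier(),
    model_y=RandomForestRegressor(),
    discrete_treatment=True,
    random_state=0
)
est_workaround.fit(Y, T, X=X_featurized)

This avoids the error but is not equivalent to the intended behavior, since the featurizer is then fit on the full data outside of the estimator rather than being handled inside it.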