switch to using pingouin

iamdonovan · iamdonovan · commit 7b7ce37155d2 · 2023-10-10T00:05:11.000+01:00
diff --git a/06.regression/regression.ipynb b/06.regression/regression.ipynb
@@ -20,7 +20,7 @@
     "- [pandas](https://pandas.pydata.org/), for reading the data from a file;\n",
     "- [seaborn](https://seaborn.pydata.org/), for plotting the data;\n",
     "- [scipy.stats](https://docs.scipy.org/doc/scipy/reference/stats.html), for calculating correlation coefficients;\n",
-    "- [statsmodels.api](https://www.statsmodels.org/dev/index.html), for linear regression;\n",
+    "- [pingouin](https://pingouin-stats.org/), for linear regression;\n",
     "- [pathlib](https://docs.python.org/3/library/pathlib.html), for working with filesystem paths."
    ]
   },
@@ -34,7 +34,7 @@
     "import pandas as pd\n",
     "import seaborn as sns\n",
     "from scipy import stats\n",
-    "import statsmodels.api as sm\n",
+    "import pingouin as pg\n",
     "from pathlib import Path"
    ]
   },
@@ -73,6 +73,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
+    "rain_tmax_plot = sns.lmplot(data=station_data, x='rain', y='tmax', hue='season', markers=['o', 'x', 's', '+'])\n",
     "# your code goes here!\n",
     "rain_tmax_plot # show the plot"
    ]
@@ -204,6 +205,26 @@
     "print(f\"calculated p-value of r: {corr.pvalue}\")"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "6c4cb657-182d-4cbc-a022-45a595332a8f",
+   "metadata": {},
+   "source": [
+    "Using `pg.corr()` ([documentation](https://pingouin-stats.org/build/html/generated/pingouin.corr.html)) gives us even more information, such as the confidence interval for the correlation coefficient, as well as additional methods for calculating the correlation coefficient:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fa3e2280-bc03-4ad0-8ef7-7118775db44f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# calculate the biweight midcorrelation between rain and tmax\n",
+    "pg.corr(station_data.dropna(subset=['rain', 'tmax'])['rain'], \n",
+    "        station_data.dropna(subset=['rain', 'tmax'])['tmax'], method='bicor')"
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "318a6171-4a05-4136-93d0-196a5eff85e8",
@@ -233,9 +254,9 @@
     "\n",
     "$$ y = \\beta + \\alpha x, $$\n",
     "\n",
-    "where $\\beta$ is the intercept and $\\alpha$ is the slope of the line. To fit a linear model using ordinary least squares, we can first use `sm.OLS()`  ([documentation](https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.html)) to create an **OLS** object, then use the `.fit()` method ([documentation](https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.fit.html)) of that object.\n",
+    "where $\\beta$ is the intercept and $\\alpha$ is the slope of the line. To fit a linear model using `pingouin`, we use `pg.linear_regression()` ([documentation](https://pingouin-stats.org/build/html/generated/pingouin.linear_regression.html)). \n",
     "\n",
-    "When we create the **OLS** object, we pass the observations of the *response* (*dependent*) variable with the first argument, and the observations of the *explanatory* (*independent*) variable(s) in the second argument. Note that by default, **OLS** will not fit a constant, but we can use `sm.add_constant()` ([documentation](https://www.statsmodels.org/dev/generated/statsmodels.tools.tools.add_constant.html)) to add a column of ones to the array.\n",
+    "The main inputs to `pg.linear_regression()` are `X`, the observations of the *explanatory* (*independent*) variable(s), and `y`, the observations of the *response* (*dependent*) variable. We can also specify the significance level (`alpha`) used to calculate the confidence intervals of the fitted coefficients, as well as additional arguments. By default, `pg.linear_regression()` adds an intercept to be fitted.\n",
     "\n",
     "So, the process to fit a linear relationship between `tmax` and `rain` would look like this:"
    ]
@@ -247,102 +268,35 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "xdata = station_data.dropna(subset=['rain', 'tmax'])['rain'] # select the rain variable, after dropping NaN values\n",
-    "ydata = station_data.dropna(subset=['rain', 'tmax'])['tmax'] # select the tmax variable, after dropping NaN values\n",
+    "xdata = spring.dropna(subset=['rain', 'tmax'])['rain'] # select the rain variable, after dropping NaN values\n",
+    "ydata = spring.dropna(subset=['rain', 'tmax'])['tmax'] # select the tmax variable, after dropping NaN values\n",
     "\n",
-    "xdata = sm.add_constant(xdata) # add a constant to xdata - otherwise, we're only fitting the slope\n",
+    "lin_model = pg.linear_regression(xdata, ydata, alpha=0.01) # run the regression with alpha=0.01 (99% confidence intervals)\n",
     "\n",
-    "lin_model = sm.OLS(ydata, xdata) # initialize the OLS object\n",
-    "lm_results = lin_model.fit() # fit the model to the data"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "bda01e81-8b28-45d9-9324-0f6f45b0b395",
-   "metadata": {},
-   "source": [
-    "The `params` attribute has the estimated values for the intercept (`const`) and slope (`rain`):"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "9b18194d-37c8-4672-abd4-80661309c135",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "lm_results.params # see the regression parameters: const is the intercept, rain is the coefficient for 'rain'"
+    "lin_model.round(3) # round the output table to 3 decimal places"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "0106f89d-bf97-409d-90c3-22f21d70235e",
+   "id": "ed86c6dc-ace9-4c3a-81c3-fe8457b03e2a",
    "metadata": {},
    "source": [
-    "Other useful attributes include:\n",
+    "The output of `pg.linear_regression()` is a **DataFrame** with the following columns:\n",
     "\n",
-    "- `bse`, the estimates of the standard error for the parameters;\n",
-    "- `pvalues`, the two-tailed *p*-values for the *t*-statistics of the parameter estimates;\n",
-    "- `resid`, the model residuals;\n",
-    "- `rsquared` and `rsquared_adj`, the R-squared and adjusted R-squared values for the model.\n",
+    "- `names`: the name of each fitted parameter (`Intercept`, followed by the name of each explanatory variable);\n",
+    "- `coef`: the values of the regression coefficients;\n",
+    "- `se`: the standard error of the estimated coefficients;\n",
+    "- `T`: the *t*-statistic of the estimates;\n",
+    "- `pval`: the *p*-values of the *t*-statistics;\n",
+    "- `r2`: the coefficient of determination;\n",
+    "- `adj_r2`: the adjusted coefficient of determination;\n",
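+    "- `CI[0.5%]`: the lower bound of the confidence interval (the percentage in the column name is `100 * alpha / 2`);\n",
+    "- `CI[99.5%]`: the upper bound of the confidence interval (the percentage in the column name is `100 * (1 - alpha / 2)`);\n",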
+    "- `relimp`: the relative contribution of each predictor to the final $R^2$ (only if `relimp=True`);\n",
+    "- `relimp_perc`: the percent relative contribution of each predictor (only if `relimp=True`).\n",
     "\n",
-    "Note that each of these attributes are **pandas.Series**:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "301b2c1e-350c-4201-973e-eb26f64260c2",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "type(lm_results.bse) # show the type of lm_results.bse"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "5a6e076d-a45d-4887-bbd8-6636035c992d",
-   "metadata": {},
-   "source": [
-    "This means that we can easily combine these into a **DataFrame** using `pd.concat()`:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "bfc83cfc-3f20-4e86-9ed2-b8c0621c2430",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "res_df = pd.concat([lm_results.params, lm_results.bse, lm_results.tvalues, lm_results.pvalues, lm_results.conf_int()], axis=1) # join params, bse, tvalues, pvalues, confidence intervals along the column axis\n",
-    "res_df.columns = ['coef', 'std err', 't-value', 'p-value', 'ci_low', 'ci_up'] # set the column names\n",
+    "The output **DataFrame** also has hidden attributes such as the residuals (`lin_model.residuals_`), the degrees of freedom of the model (`lin_model.df_model_`), and the degrees of freedom of the residuals (`lin_model.df_resid_`).\n",
     "\n",
-    "res_df # show the dataframe"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "debcd1ce-377f-4690-a237-4d392f6cbbb9",
-   "metadata": {},
-   "source": [
-    "To get the full summary of the regression results, use `.summary()` ([documentation](https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.RegressionResults.summary.html)):"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "16b5879d-8162-404e-a03a-344914cde420",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "lm_results.summary() # show the summary of the results"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "ed86c6dc-ace9-4c3a-81c3-fe8457b03e2a",
-   "metadata": {},
-   "source": [
     "## multiple linear regression\n",
     "\n",
     "Now, let's try to fit a linear model of `tmax` with two variables: `rain` and `sun`. Remember that multiple linear regression tries to fit a model with the form:\n",
@@ -353,7 +307,7 @@
     "\n",
     "$$ y = \\beta + \\alpha_1 x_1 + \\alpha_2 x_2 $$\n",
     "\n",
-    "The code to fit this model using `statsmodels` looks like this:"
+    "The code to fit this model using `pingouin` looks like this:"
    ]
   },
   {
@@ -366,30 +320,9 @@
     "xdata = station_data.dropna(subset=['rain', 'tmax', 'sun'])[['rain', 'sun']] # select the rain and sun variables, after dropping NaN values\n",
     "ydata = station_data.dropna(subset=['rain', 'tmax', 'sun'])['tmax'] # select the tmax variable, after dropping NaN values\n",
     "\n",
-    "xdata = sm.add_constant(xdata) # add a constant to xdata - otherwise, we're only fitting the slope\n",
+    "ml_model = pg.linear_regression(xdata, ydata, alpha=0.01) # run the regression with alpha=0.01 (99% confidence intervals)\n",
     "\n",
-    "ml_model = sm.OLS(ydata, xdata) # initialize the OLS object\n",
-    "mlm_results = ml_model.fit() # fit the model to the data\n",
-    "\n",
-    "mlm_results.params # see the regression parameters: const is the intercept, rain is the coefficient for 'rain'"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "e43f2037-dd88-4f8d-baa6-f65cfd08e6ef",
-   "metadata": {},
-   "source": [
-    "Just as with the simple linear regression case, we can look at the summary of the regression results:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "7385af31-d0b5-42ab-b80d-0b1175f71a7b",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "mlm_results.summary() # show the summary of the results"
+    "ml_model.round(3) # round the output table to 3 decimal places"
    ]
   },
   {
@@ -419,10 +352,7 @@
     "    xdata = season_data['rain'] # select the rain variable\n",
     "    ydata = season_data['tmax'] # select the tmax variable\n",
     "    \n",
-    "    xdata = sm.add_constant(xdata) # add a constant to xdata - otherwise, we're only fitting the slope\n",
-    "    \n",
-    "    model = sm.OLS(ydata, xdata) # initialize the OLS object\n",
-    "    results[season] = model.fit() # add the result to the results dict, with season as the key"
+    "    results[season] = pg.linear_regression(xdata, ydata, alpha=0.01) # add the result to the results dict, with season as the key"
    ]
   },
   {
@@ -440,102 +370,72 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "results['spring'].summary() # view the summary for spring"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "cc0a971a-fc7d-458c-b2ea-9053ab907980",
-   "metadata": {},
-   "source": [
-    "Next, let's see how we can combine these results into a single **DataFrame**. First, we'll write a **function** to create the **DataFrame** for a single model result - as we have discussed, it is often preferable to write functions for repeated lines of code, as it can make the code more readable, it helps avoid mistakes, and also because programmers are often lazy.\n",
-    "\n",
-    "Run the next cell to define the function - the only new bits of code here are the use of `.reset_index()` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html)), which will turn the current index parameter names into a column, `index`, and the use of `.rename()` to rename this column from `index` to `parameter`. The reason for doing this will be clear in a moment."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "c4311d9a-afdf-4d14-9baf-fdefb3441e65",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "def get_results_df(res):\n",
-    "    res_df = pd.concat([res.params, res.bse, res.tvalues, res.pvalues, res.conf_int()], axis=1) # join params, bse, tvalues, pvalues, confidence intervals along the column axis\n",
-    "    res_df.columns = ['coef', 'std err', 't-value', 'p-value', 'ci_low', 'ci_up'] # set the column names\n",
-    "    res_df.reset_index(inplace=True) # unset the index in-place\n",
-    "    \n",
-    "    return res_df.rename(columns={'index': 'parameter'}) # return the dataframe with 'index' renamed to 'parameter'"
+    "results['spring'] # view the results for spring"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "95d0940c-1166-462c-add0-c207e43591ae",
+   "id": "84d78003-02f3-4a3f-8f96-967ba800647b",
    "metadata": {},
    "source": [
-    "Now, we can loop over the season names to get the parameter table for each season, then use `pd.concat()` to combine these results into a single **DataFrame**:"
+    "Now, let's see how we can combine these results into a single **DataFrame**. First, we'll add a column, `season`, to each **DataFrame**:"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "7545bfd8-7804-4ea6-a472-ce4c5d810017",
+   "id": "2cecf29d-452f-4ced-ac00-f5f88677826d",
    "metadata": {},
    "outputs": [],
    "source": [
-    "all_results = []\n",
-    "\n",
     "for season in seasons:\n",
-    "    this_df = get_results_df(results[season]) # get a dataframe for this season\n",
-    "    this_df['season'] = season\n",
-    "    all_results.append(this_df)\n",
-    "\n",
-    "all_results = pd.concat(all_results) # combine the list of dataframes into a single dataframe"
+    "    results[season]['season'] = season # add a season column"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "13cecabb-0683-48aa-9efd-4843720c78c3",
+   "id": "c0651758-b6e6-45fb-aad4-7500a86422d8",
    "metadata": {},
    "source": [
-    "Finally, we'll use `.set_index()` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html)) to set the index of the **DataFrame** using the `season` and `parameter` columns:"
+    "Next, we use `pd.concat()`, along with the `.values()` of the results **dict**, to combine the tables into a single table:"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "d45b2c26-98a2-4b4f-aa8b-cf6efd8ef29a",
+   "id": "07c39ce4-8fa8-46fd-a45f-66671a4d5510",
    "metadata": {},
    "outputs": [],
    "source": [
-    "all_results.set_index(['season', 'parameter'], inplace=True) # set a multi-level index with season and parameter values\n",
-    "all_results # show the dataframe"
+    "all_results = pd.concat(results.values()) # concatenate the results dataframes into a single dataframe"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "73f66b7d-7960-46cc-962c-cb8880c15d08",
+   "id": "f01eb88b-1d5e-456d-86fa-f793f5bfa5f5",
    "metadata": {},
    "source": [
-    "Now, in the final **DataFrame**, we can use the season name with `.loc` to get the parameter results:"
+    "Next, we'll use `.set_index()` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html)) to set the `season` and `names` columns to be the `index` of the **DataFrame**:"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "a36faa97-80a7-4a9e-9b53-2ba559aa8f8d",
+   "id": "ca84bbe4-24e3-4a23-99fb-cb2c618a8bd1",
    "metadata": {},
    "outputs": [],
    "source": [
-    "all_results.loc['spring'] # show the rows of the dataframe corresponding to spring"
+    "all_results.set_index(['season', 'names'], inplace=True) # set the index to be a multi-index with season and names\n",
+    "\n",
+    "all_results # show the updated table"
    ]
   },
   {
    "cell_type": "markdown",
    "id": "eeaffe66-ab7c-4f22-8943-6c179ca914ce",
    "metadata": {},
    "source": [
-    "and, save the table of regression parameter results to a file, using `pd.to_csv()`:"
+    "Finally, we'll save the table of regression parameter results to a file, using the **DataFrame**'s `.to_csv()` method:"
    ]
   },
   {