01-Introduction.Rmd (5 additions, 3 deletions)
@@ -6,7 +6,8 @@ A note to readers: this text is a work in progress.
We've released this initial version to get more feedback. Feedback can be given at the GitHub repo https://github.com/pbiecek/ema/issues. We are primarily interested in the organization and consistency of the content, but any comments will be welcomed.
- We'd like to thank everyone that contributed feedback, found typos, or ignited discussions while the book was being written, including GitHub contributors: [agosiewska](https://github.com/agosiewska/), Rees Morrison, [kasiapekala](https://github.com/kasiapekala/), [hbaniecki](https://github.com/hbaniecki/), [AsiaHenzel](https://github.com/AsiaHenzel/), [kozaka93](https://github.com/kozaka93/).
+ We'd like to thank everyone that contributed feedback, found typos, or ignited discussions while the book was being written, including GitHub contributors: [agosiewska](https://github.com/agosiewska/), Rees Morrison, [kasiapekala](https://github.com/kasiapekala/), [hbaniecki](https://github.com/hbaniecki/), [AsiaHenzel](https://github.com/AsiaHenzel/), [kozaka93](https://github.com/kozaka93/),
+ [agilebean](https://github.com/agilebean/).
## The aim of the book
@@ -66,8 +67,9 @@ Before embarking on the description of the methods, in Chapter
\@ref(modelDevelopmentProcess), we provide a short introduction to the process of data exploration and model assembly along with notation and definition of key concepts that are used in consecutive chapters.
In chapters \@ref(doItYourselfWithR) and \@ref(doItYourselfWithPython), we provide a short description of R and Python tools and packages that are necessary to replicate the results presented in this book. In Chapter \@ref(dataSetsIntro), we describe two datasets that are used throughout the book to illustrate the presented methods and tools.
+ (ref:UMEPpiramideCaption) Stack with model exploration methods presented in this book. The left side is focused on instance-level explanation while the right side is focused on dataset-level explanation. Consecutive layers of the stack correspond to deeper levels of model exploration. These layers are linked with the laws of model exploration introduced in Section \@ref(three-single-laws)
- ```{r UMEPpiramide, echo=FALSE, fig.cap="Stack with model exploration methods presented in this book. Left side is focused on instance-level explanation while the right side is focused on dataset-level explanation. Consecutive layers of the stack are linked with a deeper level of model exploration. These layers are linked with law's of model exploration introduced in Section \@ref(three-single-laws)", out.width = '85%', fig.align='center'}
@@ -177,7 +179,7 @@ On the other hand, **in this book, we do not focus on**
## Acknowledgements {#thanksto}
This book has been prepared using the `bookdown` package [@R-bookdown], created thanks to the amazing work of Yihui Xie.
- Figures and tables are created in R language for statistical computing [@RcoreT] with numerous libraries that support predictive modeling. Just to name few frequently used in this book `randomForest`[@randomForestRNews], `ranger`[@rangerRpackage], `rms`[@rms], `gbm`[@gbm] or `caret`[@caret]. For statistical graphics we used the `ggplot2` library [@ggplot2] and for model governance we used `archivist`[@archivist].
+ Figures and tables are created in R language for statistical computing [@RcoreT] with numerous libraries that support predictive modeling. Just to name a few frequently used in this book: `randomForest` [@randomForest], `ranger` [@rangerRpackage], `rms` [@rms], `gbm` [@gbm] or `caret` [@caret]. For statistical graphics we used the `ggplot2` library [@ggplot2] and for model governance we used `archivist` [@archivist].
Przemek's work on interpretability started during research trips within the RENOIR project (H2020 grant no. 691152) secondments to Nanyang Technological University (Singapore) and the University of California, Davis (USA). He would like to thank Prof. Janusz Holyst for the chance to take part in this project. Przemek would also like to thank Prof. Chris Drake for her hospitality. This book would never have been created without the perfect conditions that Przemek found at Chris's house in Woodland.
02-Model-Development-Process.Rmd (5 additions, 3 deletions)
@@ -28,7 +28,9 @@ In this book we use *Model Development Process* introduced in [@mdp2019]. It is
This is why MDP is built as an untangled version of Figure \@ref(fig:MDPwashmachine). The MDP process is shown in Figure \@ref(fig:mdpGeneral). Each vertical stripe is a single run of the cycle.
First iterations are usually focused on the *formulation of the problem*. Sometimes the problem is well stated, but this is a rare situation, valid maybe only for Kaggle competitions. In most real-life problems the formulation requires lots of discussions and experiments. Once the problem is defined, we can start building first prototypes, first *crisp versions of models*. These initial versions of models are needed to verify whether the problem can be solved and how far we are from the solution. Usually we gather more information and go for the next phase, the *fine tuning*. We repeat these iterations until a final version of the model is developed. Then we move to the last phase: *maintenance and* (one day) *decommissioning*.
- ```{r mdpGeneral, echo=FALSE, fig.cap="Overview of the Model Development Process. Horizontal axis show how time passes from the problem formulation to the model decommissioning. Vertical axis shows tasks are performed in a given phase. Each vertical strip is a next iteration of cycle presented in Figure \@ref(fig:MDPwashmachine)", out.width = '99%', fig.align='center'}
+ (ref:mdpGeneralCaption) Overview of the Model Development Process. The horizontal axis shows how time passes from the problem formulation to the model decommissioning. The vertical axis shows which tasks are performed in a given phase. Each vertical strip is the next iteration of the cycle presented in Figure \@ref(fig:MDPwashmachine)
@@ -98,7 +100,7 @@ In predictive modeling, we are interested in a distribution of a dependent varia
Assume that we have got model $f()$, for which $f(x_*)$ is an approximation of $E_Y(Y | x_*)$, i.e., $E_Y(Y | x_*) \approx f(x_*)$. Note that we do not assume that it is a "good" model, nor that the approximation is precise. We simply assume that we have a model that is used to estimate the conditional expected value and to form predictions of the values of the dependent variable. Our interest lies in the evaluation of the quality of the predictions. If the model offers a "good" approximation of the conditional expected value, it should be reflected in its satisfactory predictive performance.
- Usually the available data is split into two parts. One will be used for model training (estimation of model parameters), second will be used for model validation. The splitting may be repeated as in k-fold cross validation or repeated k-fold cross validation (see for example [@AppliedPredictiveModeling2013]). We leave the topic of model validation for Chapter \@ref(modelPerformance).
+ Usually the available data is split into two parts. One will be used for model training (estimation of model parameters), the second will be used for model validation. The splitting may be repeated as in k-fold cross validation or repeated k-fold cross validation (see for example [@Kuhn2013]). We leave the topic of model validation for Chapter \@ref(modelPerformance).
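The train/validation split and its k-fold repetition can be sketched in a few lines of plain Python (a toy helper of our own, not code from the book; the book itself uses R):

```python
import random

def k_fold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k disjoint folds; for each fold i,
    return (train_indices, validation_indices) where fold i is held out."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)        # shuffle before splitting
    folds = [idx[i::k] for i in range(k)]   # round-robin assignment to folds
    splits = []
    for i in range(k):
        valid = folds[i]
        train = [j for f in folds if f is not folds[i] for j in f]
        splits.append((train, valid))
    return splits

# Example: 5-fold split of 20 observations; each observation is used
# for validation exactly once across the 5 iterations.
splits = k_fold_indices(20, 5)
```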
Training procedures are different for different models, but most of them can be written as an optimization problem. Let $\Theta$ be a space of possible model parameters. Model training is a procedure of selecting a $\theta \in \Theta$ that minimizes some loss function $L(y, f_\theta(X))$. For models with large parameter spaces it is common to add an additional term $\lambda(\theta)$ that controls the model complexity.
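This penalized-optimization view can be illustrated with a one-parameter model (a toy sketch of our own, using squared-error loss with a ridge penalty $\lambda\theta^2$; for this model the minimizer has a closed form):

```python
def ridge_1d(x, y, lam):
    """Minimize L(theta) = sum_i (y_i - theta*x_i)^2 + lam*theta^2
    for a one-parameter linear model f_theta(x) = theta*x.
    Setting the derivative to zero gives the closed-form minimizer."""
    return sum(xi * yi for xi, yi in zip(x, y)) / (sum(xi * xi for xi in x) + lam)

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]                       # exact relation y = 2*x
theta_no_pen = ridge_1d(x, y, lam=0.0)    # recovers the slope 2.0
theta_pen = ridge_1d(x, y, lam=14.0)      # the penalty shrinks the slope toward 0
```

Increasing `lam` trades goodness of fit for lower model complexity, which is exactly the role of the $\lambda(\theta)$ term above.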
@@ -124,7 +126,7 @@ For linear regression, the penalty term $\lambda(\beta)$ is equal to $0$, and op
For classification, the natural choice for the distribution of $y$ is the Binomial distribution. This leads to logistic regression and the logistic loss function. For multi-label classification, a frequent choice is the cross-entropy loss function.
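For intuition, both loss functions can be written out directly (a plain-Python sketch, not the book's code; function names are ours):

```python
import math

def logistic_loss(y, p):
    """Negative log-likelihood of the Binomial model for binary labels
    y in {0, 1} and predicted probabilities p in (0, 1)."""
    return -sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                for yi, pi in zip(y, p))

def cross_entropy(y_onehot, p):
    """Cross-entropy for a single multi-class observation:
    y_onehot is a one-hot label vector, p the predicted class probabilities."""
    return -sum(yi * math.log(pi) for yi, pi in zip(y_onehot, p) if yi > 0)
```

For two classes, cross-entropy on one-hot labels reduces to the per-observation logistic loss.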
- Apart from linear models for $y$ there is a large variety of predictive models. Find a good overview of different techniques for model development in [@Venables2010] or [@AppliedPredictiveModeling2013].
+ Apart from linear models for $y$ there is a large variety of predictive models. Find a good overview of different techniques for model development in [@MASSbook] or [@Kuhn2013].
04-Data-Sets.Rmd (2 additions, 2 deletions)
@@ -165,7 +165,7 @@ Note that our prime interest is not in the assessment of model performance, but
### Random forest model {#model-titanic-rf}
- As a challenger to the logistic regression model, we consider a random forest model. Random forest is known for good predictive performance, is able to grasp low-order variable interactions, and is quite stable [@randomForestBreiman]. To fit the model, we apply the `randomForest()` function, with default settings, from the package with the same name [@randomForestRNews].
+ As a challenger to the logistic regression model, we consider a random forest model. Random forest is known for good predictive performance, is able to grasp low-order variable interactions, and is quite stable [@randomForestBreiman]. To fit the model, we apply the `randomForest()` function, with default settings, from the package with the same name [@randomForest].
In the first instance, we fit a model with the same set of explanatory variables as the logistic regression model. The results of the model are stored in model-object `titanic_rf_v6`.
@@ -428,7 +428,7 @@ anova(apartments_lm_v5)
### Random forest model {#model-Apartments-rf}
- As a challenger to linear regression, we consider a random forest model. To fit the model, we apply the `randomForest()` function, with default settings, from the package with the same name [@randomForestRNews].
+ As a challenger to linear regression, we consider a random forest model. To fit the model, we apply the `randomForest()` function, with default settings, from the package with the same name [@randomForest].
The results of the model are stored in model-object `apartments_rf_v5`.
06-Break-Down.Rmd (6 additions, 6 deletions)
@@ -14,7 +14,7 @@ The underlying idea is to calculate contribution of an explanatory variable $x^i
This idea is illustrated in Figure \@ref(fig:BDPrice4). Consider an example related to the prediction for the random-forest model `model_rf_v6` for the Titanic data (see Section \@ref(model-titanic-rf)). We are interested in the chances of survival for `johny_d`, an 8-year-old passenger from the first class. Panel A shows the distribution of model predictions for all 2207 instances from dataset $X$. The row `all data` shows the violin plot of the predictions for the entire dataset. The red dot indicates the average, which is an estimate of the expected model prediction $E_X[f(X)]$ over the distribution of all explanatory variables. In this example the average model response is 23.5%.
- To evaluate the contribution of the explanatory variables to the particular instance prediction, we trace changes in model predictions when fixing the values of consecutive variables. For instance, the row `class=1st` in Panel A of Figure \@ref(fig:BDPrice4) presents the distribution of the predictions obtained when the value of the `class` variable has been fixed to the `1st` class. Again, the red dot indicates the average of the predictions. The next row `age=8` shows the distribution and the average predictions with the value of variable `class` set to `1st` and `age` set to `8`, and so on. With this procedure after $p$ steps every row in $X$ will be filled up with variable values of `johny_d`. All predictions for these rows will be equal, so the last row in the Figure corresponds to the prediction for `model response for `johny_d`.
+ To evaluate the contribution of the explanatory variables to the particular instance prediction, we trace changes in model predictions when fixing the values of consecutive variables. For instance, the row `class=1st` in Panel A of Figure \@ref(fig:BDPrice4) presents the distribution of the predictions obtained when the value of the `class` variable has been fixed to the `1st` class. Again, the red dot indicates the average of the predictions. The next row `age=8` shows the distribution and the average predictions with the value of variable `class` set to `1st` and `age` set to `8`, and so on. With this procedure after $p$ steps every row in $X$ will be filled up with variable values of `johny_d`. All predictions for these rows will be equal, so the last row in the Figure corresponds to the prediction for `model response` for `johny_d`.
The thin black lines in Panel A show how the individual prediction for a single person changes after the value of the $j$-th variable has been replaced by the value indicated in the name of the row.
@@ -29,7 +29,7 @@ The model prediction for Johny D is 42.2 percent. It is much higher than an aver
Note that a variable's attribution depends not only on the variable itself but also on its particular value. In this example `embarked = Southampton` has a small effect on the average model prediction. This may be because the variable `embarked` is not important, or because `embarked` is important but `Southampton` has an average effect among all possible values of the `embarked` variable.
- ```{r BDPrice4, echo=FALSE, fig.cap="Break-down plots show how the contribution of individual explanatory variables change the average model prediction to the prediction for a single instance (observation). Panel A) The first row shows the distribution and the average (red dot) of model predictions for all data. The next rows show the distribution and the average of the predictions when fixing values of subsequent explanatory variables. The last row shows the prediction for a particular instance of interest. B) Red dots indicate the average predictions from Panel B. C) The green and red bars indicate, respectively, positive and negative changes in the average predictions (variable contributions). ", out.width = '70%', fig.align='center'}
+ ```{r BDPrice4, echo=FALSE, fig.cap="Break-down plots show how the contribution of individual explanatory variables change the average model prediction to the prediction for a single instance (observation). Panel A) The first row shows the distribution and the average (red dot) of model predictions for all data. The next rows show the distribution and the average of the predictions when fixing values of subsequent explanatory variables. The last row shows the prediction for a particular instance of interest. B) Red dots indicate the average predictions from Panel A. C) The green and red bars indicate, respectively, positive and negative changes in the average predictions (variable contributions). ", out.width = '70%', fig.align='center'}
- In practice, given a dataset, the expected value of $X_i$ can be estimated by the sample mean $\bar x_i$. This leads to
+ In practice, given a dataset, the expected value of $X^i$ can be estimated by the sample mean $\bar x^i$. This leads to
\begin{equation}
- v(i, x_*) = \beta_i (x_*^i - \bar x^i).
+ v(i, x_*) = \beta^i (x_*^i - \bar x^i).
\end{equation}
Note that the linear-model-based prediction may be re-expressed in the following way:
@@ -81,7 +81,7 @@ $$
(\#eq:singleBreakDownResult)
\end{equation}
- Thus, the contributions of the explanatory variables $b(i, x_*)$ sum up to the difference between the model prediction for $x_*$ and the average model prediction.
+ Thus, the contributions of the explanatory variables $v(i, x_*)$ sum up to the difference between the model prediction for $x_*$ and the average model prediction.
**NOTE for careful readers**
@@ -159,7 +159,7 @@ However, he is very young, therefore odds are higher than adult men. Explanation
Note that the effect of *the second class* is negative in explanations for scenario 1 but positive in explanations for scenario 2.
- ```{r ordering, echo=FALSE, fig.cap="An illustration of the order-dependence of the variable-contribution values. Two *Break-down* explanations for the same observation from Titanic data set. The underlying model is a random forest. Scenarios differ due to the order of variables in *Break-down* algorithm. Blue bar indicates the difference between the model's prediction for a particular observation and an average model prediction. Other bars show contributions of variables. Red color means a negative effect on the survival probability, while green color means a positive effect. Order of variables on the y-axis corresponds to their sequence used in *Break-down* algorithm.", out.width = '50%', fig.align='center'}
+ ```{r ordering, echo=FALSE, fig.cap="An illustration of the order-dependence of the variable-contribution values. Two *Break-down* explanations for the same observation from Titanic data set. The underlying model is a random forest. Scenarios differ due to the order of variables in *Break-down* algorithm. Last bar indicates the difference between the model's prediction for a particular observation and an average model prediction. Other bars show contributions of variables. Red color means a negative effect on the survival probability, while green color means a positive effect. Order of variables on the y-axis corresponds to their sequence used in *Break-down* algorithm.", out.width = '50%', fig.align='center'}