1 change: 1 addition & 0 deletions source/_quarto.yml.template
@@ -45,6 +45,7 @@ book:
- correlation_causation.Rmd
- how_big_sample.Rmd
- bayes_simulation.Rmd
- practical_considerations.Rmd
- exercise_solutions.Rmd
- acknowlegements.Rmd
- technical_note.Rmd
2 changes: 1 addition & 1 deletion source/bayes_simulation.Rmd
@@ -1201,7 +1201,7 @@ Creating the correct simulation procedure is not trivial, because Bayesian
reasoning is subtle. But it certainly is not easier to create a
correct procedure using analytic tools (except in the cookbook sense of
plug-and-pray). If one is interested in insight, a combination of theory and simulation procedure
might well be the answer[^sequentially]
might well be the answer[^sequentially].

[^sequentially]: We can use a similar procedure to illustrate an aspect of the
Bayesian procedure that Box and Tiao emphasize, its
58 changes: 58 additions & 0 deletions source/practical_considerations.Rmd
@@ -0,0 +1,58 @@
---
jupyter:
  jupytext:
    metadata_filter:
      notebook:
        additional: all
        excluded:
        - language_info
    text_representation:
      extension: .Rmd
      format_name: rmarkdown
      format_version: '1.0'
      jupytext_version: 0.8.6
  kernelspec:
    display_name: Python 3
    language: python
    name: python3
resampling_with:
    ed2_fname: 06-Chap-2
---

```{r setup, include=FALSE}
source("_common.R")
```

# Practical considerations {#sec-practical-considerations}

> Uncertainty, in the presence of vivid hopes and fears, is painful, but must
> be endured if we wish to live without the support of comforting fairy
> tales. — Bertrand Russell [-@russell1945history p. *xiv*].

<!---
Not ready for review. This will be expanded into a full chapter
-->

## Follow good software practices

This may seem an unusual topic for a book on statistics. However, this book emphasizes resampling methods, and
resampling is all about coding. You will also become much more marketable if you learn the basics of solid software
practice. We do not have the space to cover these practices here, but we do want to stress their importance. Always
keep the following in mind:

1. Use version control; we recommend git. Pushing regularly to a remote git repository protects you from
accidental loss of your code. It also makes it easy to share your code. You want other people to use your code; it
makes you so much more useful!
2. Ask someone else to critically review your code. It is even better if you work in an environment with a formal
system of code review.
3. Read other people's code. You will learn a lot.
4. Always add tests for your code. This means that you run your code on small examples for which you know the answer. Every
time you change your code, you can check that no unwanted side effects occurred.
5. Better still, start your coding exercise with a small example for which you know the answer. This is known as
test-driven development, and we do this all the time.
6. Python has several formal testing environments that help you write tests. We recommend `pytest`.
7. Always be critical of your results. You will run into bugs and make errors. This is inevitable, and you should learn how
to recognise non-obvious errors. Are the results what you expect? Are they reasonable? Think of everything you
possibly can to find fault with the output of your system.
8. Always be critical of yourself. Know that you can and will make mistakes. That is no shame; being sloppy and not following
good practices is.
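
As a minimal sketch of items 4 to 6, here is what a small test file might look like. The function `resample_mean` and
the test names are hypothetical, invented for this illustration; the point is only that each test uses a case where
the right answer is known in advance.

```python
import random


def resample_mean(values, rng):
    """Mean of one bootstrap resample (sampling with replacement)."""
    sample = [rng.choice(values) for _ in values]
    return sum(sample) / len(sample)


def test_constant_data_gives_known_mean():
    # Every resample of identical values must have that same mean.
    rng = random.Random(0)
    assert resample_mean([3, 3, 3, 3], rng) == 3


def test_mean_stays_within_data_range():
    # A resample mean can never leave the range of the observed data.
    rng = random.Random(1)
    values = [1, 2, 5, 9]
    for _ in range(100):
        assert min(values) <= resample_mean(values, rng) <= max(values)
```

If this were saved in a file whose name starts with `test_` (for example `test_resample.py`, an assumed name),
running `pytest` in that directory would find and run both test functions.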

46 changes: 37 additions & 9 deletions source/simon_refs.bib
@@ -839,15 +839,15 @@ @article{fisher1930inverse
}

@article{fisher1922mathematical,
title={On the mathematical foundations of theoretical statistics},
author={Fisher, Ronald Aylmer},
journal={Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character},
volume={222},
number={594-604},
pages={309--368},
year={1922},
publisher={The Royal Society},
address={London}
title={On the mathematical foundations of theoretical statistics},
author={Fisher, Ronald Aylmer},
journal={Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character},
volume={222},
number={594-604},
pages={309--368},
year={1922},
publisher={The Royal Society},
address={London}
}

@book{fisher1925smrw,
@@ -1601,3 +1601,31 @@ @article{gilovich1985hot
publisher={Elsevier},
url={https://www.joelvelasco.net/teaching/122/Gilo.Vallone.Tversky.pdf}
}

@article{hashemi2020retracted,
title={RETRACTED ARTICLE: Criminal tendency detection from facial images and the gender bias effect},
author={Hashemi, M. and Hall, M.},
journal={Journal of Big Data},
volume={7},
number={2},
year={2020},
doi={10.1186/s40537-019-0282-4}
}

@article{bower2006face,
title={The “Criminality from Face” Illusion},
author={Bowyer, Kevin W. and King, Michael C. and Scheirer, Walter and Vangara, Kushal},
journal={arXiv preprint arXiv:2006.03895},
year={2020},
doi={10.48550/arXiv.2006.03895}
}

The “Criminality from Face” Illusion
Kevin W. Bowyer, Fellow, IEEE, Michael C. King, Member, IEEE, and Walter Scheirer, Senior
Member, IEEE, Kushal Vangara .

116 changes: 113 additions & 3 deletions source/what_is_probability.Rmd
@@ -36,13 +36,13 @@ Not ready for review
## Introduction, what is probability?

Uncertainty is part of life. Although there is not much we can do to answer questions such as,
will I loose my job within the next year?, or what is my life expectency?, it is possible to
will I lose my job within the next year?, or what is my life expectancy?, it is possible to
answer questions such as, what are my chances of winning the lottery? (exceedingly small).

Scientists from any discipline that we can think of, also need to deal with uncertainties,
to come up with specific answers when possible, or know the limits of what it is possible to know.

You will encounter numerous examples of this kind throughout this book. In fatc, this
You will encounter numerous examples of this kind throughout this book. In fact, this
is what this book is all about.

The uncertainty stems from different sources, some or all may be present in any given application.
@@ -113,7 +113,117 @@ Does it even matter; does that knowledge alter our reasoning? Much progress has
on developing causality models and is still an exciting, active field of research.

The rest of this chapter is about the general considerations that are important for principled
reasoning in the presence of uncertainty.
reasoning in the presence of uncertainty.


## Understand your problem

In practice you will be asked to provide answers based on data. For example, you may be given data about customer
behaviour in a large bank and asked to develop a model that gives the probability that a customer of the bank will
default. This is an important problem for all financial institutions: an institution without a good credit risk model
will either lose money by being too conservative in the way it lends money, or lose money by taking on too much
risk. You can opt for applying an extremely complex "machine-learning" model with many
parameters, but you are almost guaranteed
to come to grief. First study the problem and come to terms with the many issues at stake.
We speak from experience!

Let's illustrate the idea with a simple example. The teacher asks little Annie to solve the following problem: ten
sheep are on this side of the road and one sheep crosses to the other side; how many sheep remain on this side? Annie
knows the answer, of course, and replies, correctly, "None". This quite agitates the teacher, who asks, "There were ten
sheep on this side of the road and one crossed over to the other side; how is it that you tell me none remain?" Annie
replies, "You don't know sheep. If one crosses the road, all the others follow and none remain." We can confirm this,
also from experience!

It is easy to get the arithmetic right, but just as easy to get the problem wrong if you don't understand it.

Please make sure you know what problem you have to solve. You may even run into situations where a company provides you
with lots of data and then asks you to extract meaningful information from it. Our advice is to work with the company to
first formulate a meaningful problem. Then you can direct your investigation to solving this problem.

Work with the domain experts! You may also find that your expertise is in greater demand if you have specialised domain knowledge!

## Understand your data

You will often find that you spend more time trying to understand your data than solving the problem. During this
investigation you will probably look at questions such as the following:

1. Possible correlations between the variables.
2. Is the data complete? Real-world data often has empty fields with missing values. How do you deal with this?
Do you discard these fields or do you try and estimate values for them?
3. Do you have reason to believe that the information you need to solve the problem is in the data? This is a much-neglected
problem, perhaps because it is not easy to provide an answer. There are certainly situations where some of the
variables can act as a proxy for what is needed. This simply means that the information is hidden, but present in
the data. Precisely because it is not an easy problem to answer, it is necessary to give it some thought and
most definitely keep it in mind while you are modelling.

One example that is often quoted in the literature concerns a group of researchers who wanted to build a model to
visually distinguish between criminals and non-criminals. For this they used a dataset of photographs of known criminals
and non-criminals. For their criminals they relied on mug shots, but of course they had to use other types of
photograph for the non-criminals. Even without that difference, you have to seriously ask yourself whether
you believe that one can tell whether a person is a criminal from their visual appearance;
see [@hashemi2020retracted] and "The 'Criminality from Face' Illusion" by Kevin W. Bowyer, Michael C. King,
Walter Scheirer, and Kushal Vangara.

<!---
ToDo: Move to citations
-->

4. Is your data balanced? If not, how are you going to deal with it? If you are ever asked to develop a credit risk model,
you will have a vast quantity of non-defaulting examples and few, perhaps 5%, defaulting examples. No financial
institution will survive a high percentage of defaulting customers. How will you deal with it?
5. Is your data biased? For historical reasons and because of social inequalities, your data may be biased against minorities,
or by race, gender, etc. If this is the case, your model will be seriously deficient. You may also find that the institution
you are working for has strict policies in place against the use of potentially harmful variables.

These raise serious ethical questions that the practitioner should be aware of.
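
To make the imbalance in item 4 concrete, here is a small worked example. The portfolio size of 1000 customers is an
assumption chosen for round numbers; the 5% default rate is the figure mentioned in the list. It shows that a "model"
that never predicts a default still looks very accurate, while catching none of the customers who matter.

```python
# Hypothetical portfolio: 1000 customers, 5% of whom defaulted.
n_customers = 1000
n_defaults = 50
labels = [1] * n_defaults + [0] * (n_customers - n_defaults)

# A "model" that ignores its input and always predicts "no default".
predictions = [0] * n_customers

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / n_customers

# Recall: the fraction of actual defaulters that the model catches.
caught = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = caught / n_defaults

print(f"accuracy = {accuracy:.2f}, recall on defaulters = {recall:.2f}")
# prints: accuracy = 0.95, recall on defaulters = 0.00
```

Accuracy alone is therefore a poor guide when the classes are this unbalanced; it is worth looking at how the model
performs on the rare class separately.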

Returning to the criminal detection problem mentioned above: it failed. Let's think about what the model does. Since it
is given samples of photographs of criminals and non-criminals, i.e. each photograph comes with the label `criminal`
or `non-criminal`, it adjusts its parameters to find the maximum correlation between samples belonging
to the same class, and to maximize the difference between the two classes. The model will latch on to any feature that
satisfies these requirements, including unacceptable bias.
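
The following toy sketch illustrates this latching-on behaviour. It is not the method used in the study cited above;
the data are simulated, the feature names `is_mugshot` and `face_measure` are invented for the illustration, and
"picking the feature with the largest gap between class means" is only a crude stand-in for what a fitted model does.

```python
import random

rng = random.Random(42)

# Hypothetical toy data: 200 "photographs", half labelled criminal (1).
# `is_mugshot` records how the photo was taken, not who is in it.
# `face_measure` is pure noise, unrelated to the label by construction.
photos = []
for i in range(200):
    label = 1 if i < 100 else 0
    photos.append({
        "label": label,
        "is_mugshot": label,           # mug shots only exist for one class
        "face_measure": rng.random(),  # carries no information about label
    })


def class_gap(feature):
    """Absolute difference between the two class means of a feature."""
    def mean_for(label):
        vals = [p[feature] for p in photos if p["label"] == label]
        return sum(vals) / len(vals)
    return abs(mean_for(1) - mean_for(0))


gaps = {name: class_gap(name) for name in ("is_mugshot", "face_measure")}
best = max(gaps, key=gaps.get)
print(gaps, "-> feature the 'model' would rely on:", best)
```

In this toy setting the photograph type separates the classes perfectly, so it wins, even though it says nothing
about the person in the picture.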

## How is your model going to be used?

The responsibility of the technical developer does not end with providing the model, or the analytics needed for the
purpose. It is important to know how your model is going to be used. If you are to develop a system that needs to
detect a terrorist before they board an aircraft, your thinking will be very different from when the result of an error
by your system is relatively benign.

The institution you work for may also need to be able to audit the output of your system. This brings us to the next issue.

## Involve all stakeholders

You cannot, and do not want to, take on the responsibility for all the choices outlined above, and this is by no means an
exhaustive list! Work with the domain experts within the institution, and involve all managers who have a stake in
your system. And work within a team.

Make sure you are the team member that everyone wants to work with, because of your expertise, and because you are
trustworthy and easy to work with.

## Follow good software practices

This may seem an unusual topic for a book on statistics. However, this book emphasizes resampling methods, and
resampling is all about coding. You will also become much more marketable if you learn the basics of solid software
practice. We do not have the space to cover these practices here, but we do want to stress their importance. Always
keep the following in mind:

1. Use version control; we recommend git. Pushing regularly to a remote git repository protects you from
accidental loss of your code. It also makes it easy to share your code. You want other people to use your code; it
makes you so much more useful!
2. Ask someone else to critically review your code. It is even better if you work in an environment with a formal
system of code review.
3. Read other people's code. You will learn a lot.
4. Always add tests for your code. This means that you run your code on small examples for which you know the answer. Every
time you change your code, you can check that no unwanted side effects occurred.
5. Better still, start your coding exercise with a small example for which you know the answer. This is known as
test-driven development, and we do this all the time.
6. Python has several formal testing environments that help you write tests. We recommend `pytest`.
7. Always be critical of your results. You will run into bugs and make errors. This is inevitable, and you should learn how
to recognise non-obvious errors. Are the results what you expect? Are they reasonable? Think of everything you
possibly can to find fault with the output of your system.
8. Always be critical of yourself. Know that you can and will make mistakes. That is no shame; being sloppy and not following
good practices is.
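
One way to act on item 7 is to check a simulation against a case where the answer can be worked out by hand. The
sketch below is an illustration only: the function name `estimate_two_heads`, the seed, the number of trials, and the
tolerance are all assumptions chosen for the example.

```python
import random


def estimate_two_heads(n_trials, rng):
    """Estimate P(two heads in two fair coin tosses) by simulation."""
    hits = 0
    for _ in range(n_trials):
        tosses = [rng.choice("HT") for _ in range(2)]
        if tosses == ["H", "H"]:
            hits += 1
    return hits / n_trials


rng = random.Random(2024)  # fixed seed so the check is repeatable
estimate = estimate_two_heads(10_000, rng)

# Known answer: 1/2 * 1/2 = 0.25. Allow a tolerance for sampling noise.
assert abs(estimate - 0.25) < 0.02, estimate
```

If the estimate were far from 0.25, that would point to a bug in the simulation code rather than to a surprising
fact about coins.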


<!---