Commit 6871078

committed
Commiting a bunch of minor stuff
1 parent 8a40634 commit 6871078

File tree

8 files changed

+1799
-1875
lines changed

Lines changed: 160 additions & 158 deletions
@@ -1,158 +1,160 @@
1-
---
2-
title : Introduction to statistical inference
3-
subtitle : Statistical inference
4-
author : Brian Caffo, Jeff Leek, Roger Peng
5-
job : Johns Hopkins Bloomberg School of Public Health
6-
logo : bloomberg_shield.png
7-
framework : io2012 # {io2012, html5slides, shower, dzslides, ...}
8-
highlighter : highlight.js # {highlight.js, prettify, highlight}
9-
hitheme : tomorrow #
10-
url:
11-
lib: ../../librariesNew
12-
assets: ../../assets
13-
widgets : [mathjax] # {mathjax, quiz, bootstrap}
14-
mode : selfcontained # {standalone, draft}
15-
---
16-
## Statistical inference defined
17-
18-
Statistical inference is the process of drawing formal conclusions from
19-
data.
20-
21-
In our class, we will define formal statistical inference as settings where one wants to infer facts about a population using noisy
22-
statistical data where uncertainty must be accounted for.
23-
24-
---
25-
26-
## Motivating example: who's going to win the election?
27-
28-
In every major election, pollsters would like to know, ahead of the
29-
actual election, who's going to win. Here, the target of
30-
estimation (the estimand) is clear, the percentage of people in
31-
a particular group (city, state, county, country or other electoral
32-
grouping) who will vote for each candidate.
33-
34-
We cannot poll everyone. Even if we could, some of those polled
35-
may change their vote by the time the election occurs.
36-
How do we collect a reasonable subset of data and quantify the
37-
uncertainty in the process to produce a good guess at who will win?
38-
39-
---
40-
41-
## Motivating example: is hormone replacement therapy effective?
42-
43-
A large clinical trial (the Women’s Health Initiative) published results in 2002 that contradicted prior evidence on the efficacy of hormone replacement therapy (HRT) for postmenopausal women and suggested a negative impact of HRT on several key health outcomes. **Following a statistically based protocol, the study was stopped early due to an excess number of negative events.**
44-
45-
Here there are two inferential problems.
46-
47-
1. Is HRT effective?
48-
2. How long should we continue the trial in the presence of contrary
49-
evidence?
50-
51-
See the WHI writing group paper, JAMA 2002, Vol 288:321-333, for the original results, and Steinkellner et al., Menopause 2012, Vol 19:616-621, for a discussion of the long-term impacts.
52-
53-
---
54-
55-
## Motivating example: ECMO
56-
57-
In 1985 a group at a major neonatal intensive care center published the results of a trial comparing a standard treatment and a promising new extracorporeal membrane oxygenation treatment (ECMO) for newborn infants with severe respiratory failure. **Ethical considerations led to a statistical randomization scheme whereby one infant received the control therapy, thereby opening the study to sample-size based criticisms.**
58-
59-
For a review and statistical discussion, see Royall Statistical Science 1991, Vol 6, No. 1, 52-88
60-
61-
---
62-
63-
## Summary
64-
65-
- These examples illustrate many of the difficulties of trying
66-
to use data to create general conclusions about a population.
67-
- Paramount among our concerns are:
68-
- Is the sample representative of the population that we'd like to draw inferences about?
69-
- Are there known and observed, known and unobserved or unknown and unobserved variables that contaminate our conclusions?
70-
- Is there systematic bias created by missing data or the design or conduct of the study?
71-
- What randomness exists in the data and how do we use or adjust for it? Here randomness can either be explicit via randomization
72-
or random sampling, or implicit as the aggregation of many complex unknown processes.
73-
- Are we trying to estimate an underlying mechanistic model of phenomena under study?
74-
- Statistical inference requires navigating the set of assumptions and
75-
tools and subsequently thinking about how to draw conclusions from data.
76-
77-
---
78-
## Example goals of inference
79-
80-
1. Estimate and quantify the uncertainty of an estimate of
81-
a population quantity (the proportion of people who will
82-
vote for a candidate).
83-
2. Determine whether a population quantity
84-
is a benchmark value ("is the treatment effective?").
85-
3. Infer a mechanistic relationship when quantities are measured with
86-
noise ("What is the slope for Hooke's law?")
87-
4. Determine the impact of a policy ("If we reduce pollution levels,
88-
will asthma rates decline?")
89-
90-
91-
---
92-
## Example tools of the trade
93-
94-
1. Randomization: concerned with balancing unobserved variables that may confound inferences of interest
95-
2. Random sampling: concerned with obtaining data that is representative
96-
of the population of interest
97-
3. Sampling models: concerned with creating a model for the sampling
98-
process; the most common is the so-called "iid" model.
99-
4. Hypothesis testing: concerned with decision making in the presence of uncertainty
100-
5. Confidence intervals: concerned with quantifying uncertainty in
101-
estimation
102-
6. Probability models: a formal connection between the data and a population of interest. Often probability models are assumed or are
103-
approximated.
104-
7. Study design: the process of designing an experiment to minimize biases and variability.
105-
8. Nonparametric bootstrapping: the process of using the data to,
106-
with minimal probability model assumptions, create inferences.
107-
9. Permutation, randomization and exchangeability testing: the process
108-
of using data permutations to perform inferences.
109-
110-
---
111-
## Different thinking about probability leads to different styles of inference
112-
113-
We won't spend too much time talking about this, but there are several different
114-
styles of inference. Two broad categories that get discussed a lot are:
115-
116-
1. Frequency probability: is the long run proportion of
117-
times an event occurs in independent, identically distributed
118-
repetitions.
119-
2. Frequency inference: uses frequency interpretations of probabilities
120-
to control error rates. Answers questions like "What should I decide
121-
given my data controlling the long run proportion of mistakes I make at
122-
a tolerable level."
123-
3. Bayesian probability: is the probability calculus of beliefs, given that beliefs follow certain rules.
124-
4. Bayesian inference: the use of Bayesian probability representation
125-
of beliefs to perform inference. Answers questions like "Given my subjective beliefs and the objective information from the data, what
126-
should I believe now?"
127-
128-
Data scientists tend to fall within shades of gray of these and various other schools of inference.
129-
130-
---
131-
## In this class
132-
133-
* In this class, we will primarily focus on basic sampling models,
134-
basic probability models and frequency style analyses
135-
to create standard inferences.
136-
* Being data scientists, we will also consider some inferential strategies that rely heavily on the observed data, such as permutation testing
137-
and bootstrapping.
138-
* As probability modeling will be our starting point, we first build
139-
up basic probability.
140-
141-
---
142-
## Where to learn more on the topics not covered
143-
144-
1. Explicit use of random sampling in inferences: look in references
145-
on "finite population statistics". Used heavily in polling and
146-
sample surveys.
147-
2. Explicit use of randomization in inferences: look in references
148-
on "causal inference" especially in clinical trials.
149-
3. Bayesian probability and Bayesian statistics: look for basic introductory books (there are many).
150-
4. Missing data: well covered in biostatistics and econometric
151-
references; look for references to "multiple imputation", a popular tool for
152-
addressing missing data.
153-
5. Study design: consider looking in the subject matter area that
154-
you are interested in; some examples with rich histories in design:
155-
1. The epidemiological literature is very focused on using study design to investigate public health.
156-
2. The classical development of study design in agriculture broadly covers design and design principles.
157-
3. The industrial quality control literature covers design thoroughly.
158-
1+
---
2+
title : Introduction to statistical inference
3+
subtitle : Statistical inference
4+
author : Brian Caffo, Jeff Leek, Roger Peng
5+
job : Johns Hopkins Bloomberg School of Public Health
6+
logo : bloomberg_shield.png
7+
framework : io2012 # {io2012, html5slides, shower, dzslides, ...}
8+
highlighter : highlight.js # {highlight.js, prettify, highlight}
9+
hitheme : tomorrow #
10+
url:
11+
lib: ../../librariesNew
12+
assets: ../../assets
13+
widgets : [mathjax] # {mathjax, quiz, bootstrap}
14+
mode : selfcontained # {standalone, draft}
15+
---
16+
17+
## Statistical inference defined
18+
19+
Statistical inference is the process of drawing formal conclusions from
20+
data.
21+
22+
In our class, we will define formal statistical inference as settings where one wants to infer facts about a population using noisy
23+
statistical data where uncertainty must be accounted for.
24+
25+
---
26+
27+
## Motivating example: who's going to win the election?
28+
29+
In every major election, pollsters would like to know, ahead of the
30+
actual election, who's going to win. Here, the target of
31+
estimation (the estimand) is clear, the percentage of people in
32+
a particular group (city, state, county, country or other electoral
33+
grouping) who will vote for each candidate.
34+
35+
We cannot poll everyone. Even if we could, some of those polled
36+
may change their vote by the time the election occurs.
37+
How do we collect a reasonable subset of data and quantify the
38+
uncertainty in the process to produce a good guess at who will win?
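
A minimal sketch of the polling question above, written in Python (an assumption; the slides themselves contain no code). It simulates a poll from a hypothetical electorate and reports the sample proportion with an approximate 95% Wald interval. The true support value, the sample size, and the function name `poll_estimate` are illustrative and not taken from the course.

```python
import math
import random


def poll_estimate(votes, z=1.96):
    """Return the sample proportion and an approximate Wald interval.

    votes: list of 0/1 responses (1 = supports the candidate).
    z:     normal quantile; 1.96 corresponds to roughly 95% coverage.
    """
    n = len(votes)
    p_hat = sum(votes) / n
    # Standard error of a sample proportion under simple random sampling.
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, (p_hat - z * se, p_hat + z * se)


# Hypothetical electorate: 52% support, 1000 respondents drawn at random.
rng = random.Random(1)
true_support = 0.52
sample = [1 if rng.random() < true_support else 0 for _ in range(1000)]

p_hat, (lower, upper) = poll_estimate(sample)
print(f"estimate = {p_hat:.3f}, interval = ({lower:.3f}, {upper:.3f})")
```

The interval only quantifies sampling variability. It does not account for the other concern raised above, that some respondents may change their vote before the election.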
39+
40+
---
41+
42+
## Motivating example: is hormone replacement therapy effective?
43+
44+
A large clinical trial (the Women’s Health Initiative) published results in 2002 that contradicted prior evidence on the efficacy of hormone replacement therapy (HRT) for postmenopausal women and suggested a negative impact of HRT on several key health outcomes. **Following a statistically based protocol, the study was stopped early due to an excess number of negative events.**
45+
46+
Here there are two inferential problems.
47+
48+
1. Is HRT effective?
49+
2. How long should we continue the trial in the presence of contrary
50+
evidence?
51+
52+
See the WHI writing group paper, JAMA 2002, Vol 288:321-333, for the original results, and Steinkellner et al., Menopause 2012, Vol 19:616-621, for a discussion of the long-term impacts.
53+
54+
---
55+
56+
## Motivating example
57+
### Brain activation
58+
59+
![fMRI salmon study](fig/fmri-salmon.jpg 'fMRI salmon study')
60+
http://www.wired.com/2009/09/fmrisalmon/
61+
62+
63+
---
64+
65+
## Summary
66+
67+
- These examples illustrate many of the difficulties of trying
68+
to use data to create general conclusions about a population.
69+
- Paramount among our concerns are:
70+
- Is the sample representative of the population that we'd like to draw inferences about?
71+
- Are there known and observed, known and unobserved or unknown and unobserved variables that contaminate our conclusions?
72+
- Is there systematic bias created by missing data or the design or conduct of the study?
73+
- What randomness exists in the data and how do we use or adjust for it? Here randomness can either be explicit via randomization
74+
or random sampling, or implicit as the aggregation of many complex unknown processes.
75+
- Are we trying to estimate an underlying mechanistic model of phenomena under study?
76+
- Statistical inference requires navigating the set of assumptions and
77+
tools and subsequently thinking about how to draw conclusions from data.
78+
79+
---
80+
## Example goals of inference
81+
82+
1. Estimate and quantify the uncertainty of an estimate of
83+
a population quantity (the proportion of people who will
84+
vote for a candidate).
85+
2. Determine whether a population quantity
86+
is a benchmark value ("is the treatment effective?").
87+
3. Infer a mechanistic relationship when quantities are measured with
88+
noise ("What is the slope for Hooke's law?")
89+
4. Determine the impact of a policy ("If we reduce pollution levels,
90+
will asthma rates decline?")
91+
5. Talk about the probability that something occurs.
92+
93+
---
94+
## Example tools of the trade
95+
96+
1. Randomization: concerned with balancing unobserved variables that may confound inferences of interest
97+
2. Random sampling: concerned with obtaining data that is representative
98+
of the population of interest
99+
3. Sampling models: concerned with creating a model for the sampling
100+
process; the most common is the so-called "iid" model.
101+
4. Hypothesis testing: concerned with decision making in the presence of uncertainty
102+
5. Confidence intervals: concerned with quantifying uncertainty in
103+
estimation
104+
6. Probability models: a formal connection between the data and a population of interest. Often probability models are assumed or are
105+
approximated.
106+
7. Study design: the process of designing an experiment to minimize biases and variability.
107+
8. Nonparametric bootstrapping: the process of using the data to,
108+
with minimal probability model assumptions, create inferences.
109+
9. Permutation, randomization and exchangeability testing: the process
110+
of using data permutations to perform inferences.
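
Two of the data-driven tools listed above, the nonparametric bootstrap and permutation testing, can be sketched with the Python standard library alone. This is an illustration under stated assumptions rather than the course's own implementation: the function names, the percentile form of the bootstrap interval, the number of resamples, and the toy measurements are all hypothetical.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.mean, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for stat(data)."""
    rng = random.Random(seed)
    n = len(data)
    # Resample the data with replacement and recompute the statistic each time.
    stats = sorted(stat(rng.choices(data, k=n)) for _ in range(n_boot))
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(x) - statistics.mean(y))
    pooled = list(x) + list(y)
    extreme = 0
    for _ in range(n_perm):
        # Shuffling the pooled data breaks any link between value and group.
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:len(x)]) - statistics.mean(pooled[len(x):]))
        if diff >= observed:
            extreme += 1
    # Add one to numerator and denominator so the p-value is never exactly 0.
    return (extreme + 1) / (n_perm + 1)


# Toy measurements for two hypothetical groups.
group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.4]
group_b = [4.2, 4.0, 4.8, 4.5, 4.1, 4.4]

print("bootstrap interval for mean of group_a:", bootstrap_ci(group_a))
print("permutation p-value:", permutation_pvalue(group_a, group_b))
```

Both procedures lean on the observed data instead of a fully specified probability model, which is the sense in which the list describes them as requiring minimal model assumptions.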
111+
112+
---
113+
## Different thinking about probability leads to different styles of inference
114+
115+
We won't spend too much time talking about this, but there are several different
116+
styles of inference. Two broad categories that get discussed a lot are:
117+
118+
1. Frequency probability: is the long run proportion of
119+
times an event occurs in independent, identically distributed
120+
repetitions.
121+
2. Frequency inference: uses frequency interpretations of probabilities
122+
to control error rates. Answers questions like "What should I decide
123+
given my data controlling the long run proportion of mistakes I make at
124+
a tolerable level."
125+
3. Bayesian probability: is the probability calculus of beliefs, given that beliefs follow certain rules.
126+
4. Bayesian inference: the use of Bayesian probability representation
127+
of beliefs to perform inference. Answers questions like "Given my subjective beliefs and the objective information from the data, what
128+
should I believe now?"
129+
130+
Data scientists tend to fall within shades of gray of these and various other schools of inference.
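
The two views of probability above can be contrasted in a short Python sketch. It is a hedged illustration, not course material: `running_proportion` simulates the long-run proportion of an event over iid repetitions (the frequency view), and `beta_posterior_mean` updates a Beta prior on a success probability after observing binary data (one simple Bayesian model). The function names, the uniform Beta(1, 1) prior, and the example numbers are assumptions.

```python
import random


def running_proportion(p, n, seed=0):
    """Proportion of successes after each of n iid trials with success probability p."""
    rng = random.Random(seed)
    hits = 0
    proportions = []
    for i in range(1, n + 1):
        hits += rng.random() < p
        proportions.append(hits / i)
    return proportions


def beta_posterior_mean(successes, failures, a=1.0, b=1.0):
    """Posterior mean of a success probability under a Beta(a, b) prior."""
    return (a + successes) / (a + b + successes + failures)


# Frequency view: the proportion settles near p as the repetitions accumulate.
props = running_proportion(p=0.5, n=100_000)
print("after 10 trials:     ", props[9])
print("after 100000 trials: ", props[-1])

# Bayesian view: prior belief combined with 7 successes and 3 failures.
print("posterior mean:", beta_posterior_mean(successes=7, failures=3))
```

With a uniform prior the posterior mean is pulled slightly toward 0.5 relative to the raw proportion 7/10, which is one concrete answer to "what should I believe now?".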
131+
132+
---
133+
## In this class
134+
135+
* In this class, we will primarily focus on basic sampling models,
136+
basic probability models and frequency style analyses
137+
to create standard inferences.
138+
* Being data scientists, we will also consider some inferential strategies that rely heavily on the observed data, such as permutation testing
139+
and bootstrapping.
140+
* As probability modeling will be our starting point, we first build
141+
up basic probability.
142+
143+
---
144+
## Where to learn more on the topics not covered
145+
146+
1. Explicit use of random sampling in inferences: look in references
147+
on "finite population statistics". Used heavily in polling and
148+
sample surveys.
149+
2. Explicit use of randomization in inferences: look in references
150+
on "causal inference" especially in clinical trials.
151+
3. Bayesian probability and Bayesian statistics: look for basic introductory books (there are many).
152+
4. Missing data: well covered in biostatistics and econometric
153+
references; look for references to "multiple imputation", a popular tool for
154+
addressing missing data.
155+
5. Study design: consider looking in the subject matter area that
156+
you are interested in; some examples with rich histories in design:
157+
1. The epidemiological literature is very focused on using study design to investigate public health.
158+
2. The classical development of study design in agriculture broadly covers design and design principles.
159+
3. The industrial quality control literature covers design thoroughly.
160+
