adjustments to the negative binomial tutorial - #884
Open: connor-pph wants to merge 2 commits into bambinos:main from connor-pph:main
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -18,28 +18,28 @@ | |
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"I always experience some kind of confusion when looking at the negative binomial distribution after a while of not working with it. There are so many different definitions that I usually need to read everything more than once. The definition I've first learned, and the one I like the most, says as follows: The negative binomial distribution is the distribution of a random variable that is defined as the number of independent Bernoulli trials until the k-th \"success\". In short, we repeat a Bernoulli experiment until we observe k successes and record the number of trials it required.\n", | ||
"The negative binomial distribution is flexible with multiple possible formulations. For example, it can model the number of *trials* or *failures* in a sequence of independent Bernoulli trials with probability of success (or failure) $p$ until the $k$-th \"success\". If we want to model the number of trials until the $k$-th success, the probability mass function (pmf) results:\n", | ||
"\n", | ||
"$$\n", | ||
"Y \\sim \\text{NB}(k, p)\n", | ||
"p(y | k, p)= \\binom{y - 1}{y-k}(1 -p)^{y - k}p^k\n", | ||
"$$\n", | ||
"\n", | ||
"where $0 \\le p \\le 1$ is the probability of success in each Bernoulli trial, $k > 0$, usually integer, and $y \\in \\{k, k + 1, \\cdots\\}$\n", | ||
"where $0 \\le p \\le 1$ is the probability of success in each Bernoulli trial, $k > 0$, usually integer, $y \\in \\{k, k + 1, \\cdots\\}$ and $Y$ is the number of trials until the $k$-th success.\n", | ||
"\n", | ||
"The probability mass function (pmf) is \n", | ||
"In this case, since we are modeling the number of *trials* until the $k$-th success, $y$ starts at $k$ and can be any integer greater than or equal to $k$. If instead we want to model the number of *failures* until the $k$-th success, we can use the same definition but $Y$ represents failures and starts at $0$ and there's a slightly different pmf:\n", | ||
"\n", | ||
"$$\n", | ||
"p(y | k, p)= \\binom{y - 1}{y-k}(1 -p)^{y - k}p^k\n", | ||
"p(y | k, p)= \\binom{y + k - 1}{k-1}(1 -p)^{y}p^k\n", | ||
"$$\n", | ||
"\n", | ||
"If you, like me, find it hard to remember whether $y$ starts at $0$, $1$, or $k$, try to think twice about the definition of the variable. But how? First, recall we aim to have $k$ successes. And success is one of the two possible outcomes of a trial, so the number of trials can never be smaller than the number of successes. Thus, we can be confident to say that $y \\ge k$." | ||
"In this case, $y$ starts at $0$ and can be any integer greater than or equal to $0$. When modeling failures, $y$ starts at 0, when modeling trials, $y$ starts at $k$." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"But this is not the only way of defining the negative binomial distribution, there are plenty of options! One of the most interesting, and the one you see in [PyMC3](https://docs.pymc.io/api/distributions/discrete.html#pymc3.distributions.discrete.NegativeBinomial), the library we use in Bambi for the backend, is as a continuous mixture. The negative binomial distribution describes a Poisson random variable whose rate is also a random variable (not a fixed constant!) following a gamma distribution. Or in other words, conditional on a gamma-distributed variable $\\mu$, the variable $Y$ has a Poisson distribution with mean $\\mu$.\n", | ||
"These are not the only ways of defining the negative binomial distribution, there are plenty of options! One of the most interesting, and the one you see in [PyMC](https://www.pymc.io/projects/docs/en/stable/api/distributions/generated/pymc.NegativeBinomial.html), the library we use in Bambi for the backend, is as a continuous mixture. The negative binomial distribution describes a Poisson random variable whose rate is also a random variable (not a fixed constant!) following a gamma distribution. Or in other words, conditional on a gamma-distributed variable $\\mu$, the variable $Y$ has a Poisson distribution with mean $\\mu$.\n", | ||
"\n", | ||
"Under this alternative definition, the pmf is\n", | ||
"\n", | ||
|
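The continuous-mixture definition described in the hunk above can be checked numerically. The following is a minimal sketch, not code from the notebook: it assumes the mean/shape parameterization (named `mean` and `alpha` here), with the gamma distribution having shape `alpha` and rate `alpha / mean`. The function names are hypothetical, and only the standard library is used so the snippet runs without PyMC or SciPy.

```python
import math


def nb_pmf_mean_alpha(y, mean, alpha):
    """Closed-form negative binomial pmf in the (mean, alpha) parameterization."""
    log_coef = math.lgamma(y + alpha) - math.lgamma(y + 1) - math.lgamma(alpha)
    log_prob = alpha * math.log(alpha / (mean + alpha)) + y * math.log(mean / (mean + alpha))
    return math.exp(log_coef + log_prob)


def poisson_gamma_mixture_pmf(y, mean, alpha, upper=100.0, steps=20_000):
    """Midpoint-rule integral of Poisson(y | rate) * Gamma(rate | alpha, alpha / mean) over rate."""
    beta = alpha / mean  # gamma rate parameter (assumption: shape/rate form)
    width = upper / steps
    total = 0.0
    for i in range(steps):
        rate = (i + 0.5) * width  # midpoint of the i-th slice, always > 0
        log_poisson = y * math.log(rate) - rate - math.lgamma(y + 1)
        log_gamma = (
            alpha * math.log(beta)
            + (alpha - 1) * math.log(rate)
            - beta * rate
            - math.lgamma(alpha)
        )
        total += math.exp(log_poisson + log_gamma) * width
    return total


if __name__ == "__main__":
    # Illustrative values only; they do not come from the tutorial.
    for y in range(4):
        print(y, poisson_gamma_mixture_pmf(y, 4.0, 2.0), nb_pmf_mean_alpha(y, 4.0, 2.0))
```

The two columns should agree up to the integration error, which is what "a Poisson whose rate is gamma-distributed" means in practice. The truncation point `upper` and the number of slices are arbitrary choices that work for these illustrative values.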
@@ -88,7 +88,7 @@ | |
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"In SciPy, the definition of the negative binomial distribution differs a little from the one in our introduction. They define $Y$ = Number of failures until k successes and then $y$ starts at 0. In the following plot, we have the probability of observing $y$ failures before we see $k=3$ successes. " | ||
"SciPy uses the number of *failures* until $k$ successes definition, therefore $y$ starts at 0. In the following plot, we have the probability of observing $y$ failures before we see $k=3$ successes. " | ||
] | ||
}, | ||
{ | ||
|
@@ -163,7 +163,7 @@ | |
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Finally, if one wants to show this probability mass function as if we are following the first definition of negative binomial distribution we introduced, we just need to shift the whole thing to the right by adding $k$ to the $y$ values." | ||
"To change the definition to the number of *trials* until $k$ successes, we just need to shift the whole thing to the right by adding $k$ to the $y$ values." | ||
] | ||
}, | ||
{ | ||
|
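The shift between the two definitions can be illustrated with the two pmfs from the first hunk. This is a sketch under stated assumptions: the helper names `pmf_failures` and `pmf_trials` are hypothetical, and `math.comb` stands in for the SciPy call used in the notebook so the snippet has no third-party dependency.

```python
from math import comb


def pmf_failures(y, k, p):
    """P(Y = y) when Y counts failures before the k-th success (y = 0, 1, ...)."""
    return comb(y + k - 1, k - 1) * (1 - p) ** y * p ** k


def pmf_trials(y, k, p):
    """P(Y = y) when Y counts trials until the k-th success (y = k, k + 1, ...)."""
    return comb(y - 1, y - k) * (1 - p) ** (y - k) * p ** k


if __name__ == "__main__":
    k, p = 3, 0.5  # k = 3 matches the plot described above; p is illustrative
    for failures in range(5):
        trials = failures + k  # shifting right by k turns failures into trials
        print(failures, trials, pmf_failures(failures, k, p), pmf_trials(trials, k, p))
```

Each row prints the same probability twice, once indexed by failures and once by trials, which is why adding $k$ to the $y$ values is enough to move between the two definitions.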
@@ -226,7 +226,7 @@ | |
"\n", | ||
"School administrators study the attendance behavior of high school juniors at two schools. Predictors of the **number of days of absence** include the **type of program** in which the student is enrolled and a **standardized test in math**. We have attendance data on 314 high school juniors.\n", | ||
"\n", | ||
"The variables of insterest in the dataset are\n", | ||
"The variables of interest in the dataset are\n", | ||
"\n", | ||
"* daysabs: The number of days of absence. It is our response variable.\n", | ||
"* progr: The type of program. Can be one of 'General', 'Academic', or 'Vocational'.\n", | ||
|
@@ -551,7 +551,7 @@ | |
"\n", | ||
"But then, why negative binomial? Can't we just use a Poisson likelihood?\n", | ||
"\n", | ||
"Yes, we can. However, using a Poisson likelihood implies that the mean is equal to the variance, and that is usually an unrealistic assumption. If it turns out the variance is either substantially smaller or greater than the mean, the Poisson regression model results in a poor fit. Alternatively, if we use a negative binomial likelihood, the variance is not forced to be equal to the mean, and there's more flexibility to handle a given dataset, and consequently, the fit tends to better." | ||
"Yes, we can. However, using a Poisson likelihood implies that the mean is equal to the variance, and that is usually an unrealistic assumption. If it turns out the variance is either substantially smaller or greater than the mean, the Poisson regression model results in a poor fit. Alternatively, if we use a negative binomial likelihood, the variance is not forced to be equal to the mean, and there's more flexibility to handle a given dataset, and consequently, the fit tends to be better." | ||
] | ||
}, | ||
{ | ||
|
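The mean-variance argument in this hunk can be made concrete with a small numerical check. The sketch below is an illustration rather than notebook code: it assumes the (mean, alpha) parameterization of the negative binomial, under which the variance is `mean + mean**2 / alpha`, and all function names are hypothetical. Note that under this parameterization the variance is never smaller than the mean, so the extra flexibility covers overdispersion only.

```python
import math


def poisson_pmf(y, mean):
    """Poisson pmf, computed on the log scale for numerical stability."""
    return math.exp(y * math.log(mean) - mean - math.lgamma(y + 1))


def nb_pmf(y, mean, alpha):
    """Negative binomial pmf in the (mean, alpha) parameterization (assumed here)."""
    log_coef = math.lgamma(y + alpha) - math.lgamma(y + 1) - math.lgamma(alpha)
    log_prob = alpha * math.log(alpha / (mean + alpha)) + y * math.log(mean / (mean + alpha))
    return math.exp(log_coef + log_prob)


def mean_and_variance(pmf, upper=1000):
    """Mean and variance of a count distribution, truncating the support at `upper`."""
    support = range(upper)
    first_moment = sum(y * pmf(y) for y in support)
    variance = sum((y - first_moment) ** 2 * pmf(y) for y in support)
    return first_moment, variance


if __name__ == "__main__":
    mean, alpha = 5.0, 2.0  # illustrative values, not estimates from the dataset
    print("Poisson:", mean_and_variance(lambda y: poisson_pmf(y, mean)))
    print("NegBin: ", mean_and_variance(lambda y: nb_pmf(y, mean, alpha)))
```

Both distributions share the same mean, but only the negative binomial lets the variance exceed it, with smaller `alpha` giving more dispersion.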
@@ -608,7 +608,7 @@ | |
"\\log(\\mathbb{E}[Y_i]) = \\beta_3 + \\beta_4 \\text{Math\\_std}_i\n", | ||
"$$\n", | ||
"\n", | ||
"And one last thing to note is we've decided not to inclide an intercept term, that's why you don't see any $\\beta_0$ above. This choice allows us to represent the effect of each program directly with $\\beta_1$, $\\beta_2$, and $\\beta_3$." | ||
"And one last thing to note is we've decided not to include an intercept term, that's why you don't see any $\\beta_0$ above. This choice allows us to represent the effect of each program directly with $\\beta_1$, $\\beta_2$, and $\\beta_3$." | ||
] | ||
}, | ||
{ | ||
|
I would replace "we can use the following definition:" with "the probability mass function (pmf) results:" and directly show the first pmf you list.
Then I find it a bit repetitive that you say twice, and very close, that Y starts at zero when modeling failures, but it's fine.