|
374 | 374 | "cell_type": "markdown",
|
375 | 375 | "metadata": {},
|
376 | 376 | "source": [
|
377 |     | - "With this in place, we can take a look at what the GMM model gives us for our initial data:"
    | 377 | + "With this in place, we can take a look at what the four-component GMM gives us for our initial data:"
378 | 378 | ]
|
379 | 379 | },
|
380 | 380 | {
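A minimal sketch of the four-component fit the changed cell refers to, using the modern `GaussianMixture` estimator (the successor to the older `GMM` class named in the notebook); the blob data and variable names below are illustrative stand-ins for the notebook's own `X`:

```python
# Sketch only: fit a four-component Gaussian mixture to 2-D toy data.
# The make_blobs parameters here are assumptions, not the notebook's exact setup.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=0)

gmm = GaussianMixture(n_components=4, random_state=42).fit(X)
labels = gmm.predict(X)        # hard cluster assignments
probs = gmm.predict_proba(X)   # soft assignments, shape (n_samples, 4)
```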
|
|
561 | 561 | "metadata": {},
|
562 | 562 | "source": [
|
563 | 563 | "Here the mixture of 16 Gaussians serves not to find separated clusters of data, but rather to model the overall *distribution* of the input data.\n",
|
564 |     | - "This is a generative model of the distribution, meaning that the GMM model gives us the recipe to generate new random data distributed similarly to our input.\n",
565 |     | - "For example, here are 400 new points drawn from this 16-component GMM model to our original data:"
    | 564 | + "This is a generative model of the distribution, meaning that the GMM gives us the recipe to generate new random data distributed similarly to our input.\n",
    | 565 | + "For example, here are 400 new points drawn from this 16-component GMM fit to our original data:"
566 | 566 | ]
|
567 | 567 | },
|
568 | 568 | {
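For the sampling step the cell describes, a short sketch using `GaussianMixture.sample`; the `make_moons` parameters and the `Xmoon` name are assumptions standing in for the notebook's moon dataset:

```python
# Sketch: treat a 16-component GMM as a density model and draw new samples.
from sklearn.datasets import make_moons
from sklearn.mixture import GaussianMixture

Xmoon, _ = make_moons(n_samples=200, noise=0.05, random_state=0)  # assumed stand-in data

gmm16 = GaussianMixture(n_components=16, covariance_type='full', random_state=0).fit(Xmoon)
Xnew, _ = gmm16.sample(400)   # 400 new points drawn from the fitted distribution
```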
|
|
604 | 604 | "The fact that GMM is a generative model gives us a natural means of determining the optimal number of components for a given dataset.\n",
|
605 | 605 | "A generative model is inherently a probability distribution for the dataset, and so we can simply evaluate the *likelihood* of the data under the model, using cross-validation to avoid over-fitting.\n",
|
606 | 606 | "Another means of correcting for over-fitting is to adjust the model likelihoods using some analytic criterion such as the [Akaike information criterion (AIC)](https://en.wikipedia.org/wiki/Akaike_information_criterion) or the [Bayesian information criterion (BIC)](https://en.wikipedia.org/wiki/Bayesian_information_criterion).\n",
|
607 |     | - "Scikit-Learn's ``GMM`` model actually includes built-in methods that compute both of these, and so it is very easy to operate on this approach.\n",
    | 607 | + "Scikit-Learn's ``GMM`` estimator actually includes built-in methods that compute both of these, and so it is very easy to operate on this approach.\n",
608 | 608 | "\n",
|
609 | 609 | "Let's look at the AIC and BIC as a function of the number of GMM components for our moon dataset:"
|
610 | 610 | ]
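A sketch of the AIC/BIC scan the cell sets up, using the built-in `aic` and `bic` methods of `GaussianMixture`; the component range and dataset parameters are illustrative assumptions:

```python
# Sketch: compute AIC and BIC over a range of component counts.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.mixture import GaussianMixture

Xmoon, _ = make_moons(n_samples=200, noise=0.05, random_state=0)  # assumed stand-in data

n_components = np.arange(1, 21)
models = [GaussianMixture(n, covariance_type='full', random_state=0).fit(Xmoon)
          for n in n_components]

aic = [m.aic(Xmoon) for m in models]
bic = [m.bic(Xmoon) for m in models]
best_n = n_components[np.argmin(bic)]   # component count preferred by BIC
```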
|
|
725 | 725 | "cell_type": "markdown",
|
726 | 726 | "metadata": {},
|
727 | 727 | "source": [
|
728 |     | - "We have nearly 1,800 digits in 64 dimensions, and we can build a GMM model on top of these to generate more.\n",
729 |     | - "GMM can have difficulty converging in such a high dimensional space, so we will start with an invertible dimensionality reduction algorithm on the data.\n",
    | 728 | + "We have nearly 1,800 digits in 64 dimensions, and we can build a GMM on top of these to generate more.\n",
    | 729 | + "GMMs can have difficulty converging in such a high dimensional space, so we will start with an invertible dimensionality reduction algorithm on the data.\n",
730 | 730 | "Here we will use a straightforward PCA, asking it to preserve 99% of the variance in the projected data:"
|
731 | 731 | ]
|
732 | 732 | },
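A sketch of the PCA-then-GMM pipeline this cell introduces: project the ~1,800 64-dimensional digits down while keeping 99% of the variance, fit a GMM in the reduced space, and invert the projection to get new digit-like samples. The component count and `whiten` choice below are assumptions for illustration, not the notebook's verified settings:

```python
# Sketch: invertible dimensionality reduction (PCA) before fitting a GMM on the digits.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

digits = load_digits()                      # ~1,800 samples, 64 features each

pca = PCA(n_components=0.99, whiten=True)   # keep 99% of the variance
data = pca.fit_transform(digits.data)

# Component count here is illustrative; the notebook selects it from the AIC curve.
gmm = GaussianMixture(n_components=110, covariance_type='full', random_state=0).fit(data)

data_new, _ = gmm.sample(100)                    # new points in the reduced space
digits_new = pca.inverse_transform(data_new)     # back to 64-dimensional "images"
```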
|
|