|
51 | 51 | "## Overview of this Notebook\n",
|
52 | 52 | "\n",
|
53 | 53 | "In this notebook we will build a classifier using the Oracle AutoMLx tool for the public Census Income dataset. The dataset is a binary classification dataset, and more details about the dataset can be found at https://archive.ics.uci.edu/ml/datasets/Adult.\n",
|
54 |
| - "We explore the various options provided by the Oracle AutoMLx tool, allowing the user to exercise control over the AutoML training process. We then evaluate the different models trained by AutoML. Finally we provide an overview of the possibilites that Oracle AutoMLx offers for explaining the predictions of the tuned model.\n", |
| 54 | + "We explore the various options provided by the Oracle AutoMLx tool, allowing the user to exercise control over the AutoMLx training process. We then evaluate the different models trained by AutoMLx. Finally we provide an overview of the possibilites that Oracle AutoMLx offers for explaining the predictions of the tuned model.\n", |
55 | 55 | "\n",
|
56 | 56 | "---\n",
|
57 | 57 | "## Prerequisites\n",
|
|
70 | 70 | "- Pick an appropriate model for the given dataset and prediction task at hand.\n",
|
71 | 71 | "- Tune the chosen model’s hyperparameters for the given dataset.\n",
|
72 | 72 | "\n",
|
73 |
| - "All of these steps are significantly time consuming and heavily rely on data scientist expertise. Unfortunately, to make this problem harder, the best feature subset, model, and hyperparameter choice widely varies with the dataset and the prediction task. Hence, there is no one-size-fits-all solution to achieve reasonably good model performance. Using a simple Python API, AutoML can quickly (faster) jump-start the datascience process with an accurately-tuned model and appropriate features for a given prediction task.\n", |
| 73 | + "All of these steps are significantly time consuming and heavily rely on data scientist expertise. Unfortunately, to make this problem harder, the best feature subset, model, and hyperparameter choice widely varies with the dataset and the prediction task. Hence, there is no one-size-fits-all solution to achieve reasonably good model performance. Using a simple Python API, AutoMLx can quickly (faster) jump-start the datascience process with an accurately-tuned model and appropriate features for a given prediction task.\n", |
74 | 74 | "\n",
|
75 | 75 | "## Table of Contents\n",
|
76 | 76 | "\n",
|
77 | 77 | "- <a href='#setup'>0. Setup</a>\n",
|
78 | 78 | "- <a href='#load-data'>1. Load the Census Income dataset</a>\n",
|
79 |
| - "- <a href='#AutoML'>2. AutoML</a>\n", |
| 79 | + "- <a href='#AutoMLx'>2. AutoMLx</a>\n", |
80 | 80 | " - <a href='#Engine'>2.0. Set the engine and deprecation warnings</a>\n",
|
81 | 81 | " - <a href='#provider'>2.1. Create an Instance of Oracle AutoMLx</a>\n",
|
82 |
| - " - <a href='#default'>2.2. Train a Model using AutoML</a>\n", |
83 |
| - " - <a href='#analyze'>2.3. Analyze the AutoML optimization process </a>\n", |
| 82 | + " - <a href='#default'>2.2. Train a Model using AutoMLx</a>\n", |
| 83 | + " - <a href='#analyze'>2.3. Analyze the AutoMLx optimization process </a>\n", |
84 | 84 | " - <a href='#algorithm-selection'>2.3.1. Algorithm Selection</a>\n",
|
85 | 85 | " - <a href='#adaptive-sampling'>2.3.2. Adaptive Sampling</a>\n",
|
86 | 86 | " - <a href='#feature-selection'>2.3.3. Feature Selection</a>\n",
|
87 | 87 | " - <a href='#hyperparameter-tuning'>2.3.4. Hyperparameter Tuning</a>\n",
|
88 | 88 | " - <a href='#confusion-matrix'>2.3.5. Confusion Matrix</a>\n",
|
89 |
| - " - <a href='#analyze'>2.3. Analyze the AutoML optimization process </a>\n", |
90 |
| - " - <a href='#modellist'>2.4. Provide a Specific Model List to AutoML</a>\n", |
| 89 | + " - <a href='#analyze'>2.3. Analyze the AutoMLx optimization process </a>\n", |
| 90 | + " - <a href='#modellist'>2.4. Provide a Specific Model List to AutoMLx</a>\n", |
91 | 91 | " - <a href='#nalgostuned'>2.5. Increase the number of tuned models</a>\n",
|
92 |
| - " - <a href='#scoringstr'>2.6. Specify a Different Scoring Metric to AutoML</a>\n", |
93 |
| - " - <a href='#scoringfn'>2.7. Specify a User-defined Scoring Function to AutoML</a>\n", |
94 |
| - " - <a href='#timebudget'>2.8. Specify a time budget to AutoML</a>\n", |
95 |
| - " - <a href='#minfeatures'>2.9. Specify a minimum set of features to AutoML</a>\n", |
| 92 | + " - <a href='#scoringstr'>2.6. Specify a Different Scoring Metric to AutoMLx</a>\n", |
| 93 | + " - <a href='#scoringfn'>2.7. Specify a User-defined Scoring Function to AutoMLx</a>\n", |
| 94 | + " - <a href='#timebudget'>2.8. Specify a time budget to AutoMLx</a>\n", |
| 95 | + " - <a href='#minfeatures'>2.9. Specify a minimum set of features to AutoMLx</a>\n", |
96 | 96 | "- <a href='#MLX'>3. Machine Learning Explainability (MLX)</a>\n",
|
97 | 97 | " - <a href='#MLX-initializing'> 3.1. Initialize an MLExplainer</a>\n",
|
98 | 98 | " - <a href='#MLX-global'>3.2. Model Explanations (Global Feature Importance)</a>\n",
|
|
50448 | 50448 | "id": "06fec3c6",
|
50449 | 50449 | "metadata": {},
|
50450 | 50450 | "source": [
|
50451 |
| - "We now separate the predictions (`y`) from the training data (`X`) for both the training (70%) and test (30%) datasets. The training set will be used to create a Machine Learning model using AutoML, and the test set will be used to evaluate the model's performance on unseen data." |
| 50451 | + "We now separate the predictions (`y`) from the training data (`X`) for both the training (70%) and test (30%) datasets. The training set will be used to create a Machine Learning model using AutoMLx, and the test set will be used to evaluate the model's performance on unseen data." |
50452 | 50452 | ]
|
50453 | 50453 | },
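As a rough illustration, the split described above could be reproduced with scikit-learn as follows. The download URL and column names follow the UCI page cited in the overview; the notebook's own loading code may differ.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the Census Income (Adult) data; column names follow the UCI description.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]
df = pd.read_csv(url, header=None, names=columns, skipinitialspace=True)

# Separate the target labels (y) from the features (X), then split 70/30.
y = df["income"]
X = df.drop(columns=["income"])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0
)
```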
|
50454 | 50454 | {
|
|
50488 | 50488 | "id": "e3d1e608",
|
50489 | 50489 | "metadata": {},
|
50490 | 50490 | "source": [
|
50491 |
| - "<a id='AutoML'></a>\n", |
50492 |
| - "## AutoML" |
| 50491 | + "<a id='AutoMLx'></a>\n", |
| 50492 | + "## AutoMLx" |
50493 | 50493 | ]
|
50494 | 50494 | },
|
50495 | 50495 | {
|
|
50499 | 50499 | "source": [
|
50500 | 50500 | "<a id='Engine'></a>\n",
|
50501 | 50501 | "### Setting the engine and deprecation warnings\n",
|
50502 |
| - "The AutoML pipeline offers the function `init`, which allows to initialize the parallelization engine. " |
| 50502 | + "The AutoMLx pipeline offers the function `init`, which allows to initialize the parallelization engine. " |
50503 | 50503 | ]
|
50504 | 50504 | },
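A minimal sketch of that initialization, assuming the package is imported as `automlx`; the engine name used here is an assumption, so consult the AutoMLx documentation for the engines available in your release.

```python
import automlx  # Oracle AutoMLx package

# Initialize the parallelization engine once per session; the engine name
# ("local") is an assumption -- check your release's documentation.
automlx.init(engine="local")
```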
|
50505 | 50505 | {
|
|
50530 | 50530 | "\n",
|
50531 | 50531 | "The Oracle AutoMLx solution provides a pipeline that automatically finds a tuned model given a prediction task and a training dataset. In particular it allows to find a tuned model for any supervised prediction task, e.g. classification or regression where the target can be binary, categorical or real-valued.\n",
|
50532 | 50532 | "\n",
|
50533 |
| - "AutoML consists of five main modules: \n", |
| 50533 | + "AutoMLx consists of five main modules: \n", |
50534 | 50534 | "- **Preprocessing** : Clean, impute, engineer, and normalize features.\n",
|
50535 | 50535 | "- **Algorithm Selection** : Identify the right classification algorithm -in this notebook- for a given dataset, choosing from amongst:\n",
|
50536 | 50536 | " - AdaBoostClassifier\n",
|
|
50549 | 50549 | "- **Feature Selection** : Select a subset of the data features, based on the previously selected model.\n",
|
50550 | 50550 | "- **Hyperparameter Tuning** : Find the right model parameters that maximize score for the given dataset. \n",
|
50551 | 50551 | "\n",
|
50552 |
| - "All these pieces are readily combined into a simple AutoML pipeline which automates the entire Machine Learning process with minimal user input/interaction." |
| 50552 | + "All these pieces are readily combined into a simple AutoMLx pipeline which automates the entire Machine Learning process with minimal user input/interaction." |
50553 | 50553 | ]
|
50554 | 50554 | },
|
50555 | 50555 | {
|
|
50558 | 50558 | "metadata": {},
|
50559 | 50559 | "source": [
|
50560 | 50560 | "<a id='default'></a>\n",
|
50561 |
| - "### Train a model using AutoML\n", |
| 50561 | + "### Train a model using AutoMLx\n", |
50562 | 50562 | "\n",
|
50563 |
| - "The AutoML API is quite simple to work with. We create an instance of the pipeline. Next, the training data is passed to the `fit()` function which executes the previously mentioned steps." |
| 50563 | + "The AutoMLx API is quite simple to work with. We create an instance of the pipeline. Next, the training data is passed to the `fit()` function which executes the previously mentioned steps." |
50564 | 50564 | ]
|
50565 | 50565 | },
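A hedged sketch of that flow, reusing the `X_train`/`y_train` split from earlier; the `Pipeline` constructor argument shown here follows the public AutoMLx examples and may differ across releases.

```python
import automlx

# Build an AutoMLx pipeline for a classification task and fit it on the
# training data; fit() runs preprocessing, algorithm selection, adaptive
# sampling, feature selection, and hyperparameter tuning end to end.
est = automlx.Pipeline(task="classification")
est.fit(X_train, y_train)

# Predict on the held-out test set to evaluate the tuned model.
y_pred = est.predict(X_test)
```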
|
50566 | 50566 | {
|
@@ -50646,12 +50646,12 @@
|
50646 | 50646 | "metadata": {},
|
50647 | 50647 | "source": [
|
50648 | 50648 | "<a id='analyze'></a>\n",
|
50649 |
| - "### Analyze the AutoML optimization process\n", |
| 50649 | + "### Analyze the AutoMLx optimization process\n", |
50650 | 50650 | "\n",
|
50651 |
| - "During the AutoML process, a summary of the optimization process is logged. It consists of:\n", |
| 50651 | + "During the AutoMLx process, a summary of the optimization process is logged. It consists of:\n", |
50652 | 50652 | "- Information about the training data .\n",
|
50653 |
| - "- Information about the AutoML Pipeline, such as:\n", |
50654 |
| - " - Selected features that AutoML found to be most predictive in the training data;\n", |
| 50653 | + "- Information about the AutoMLx Pipeline, such as:\n", |
| 50654 | + " - Selected features that AutoMLx found to be most predictive in the training data;\n", |
50655 | 50655 | " - Selected algorithm that was the best choice for this data;\n",
|
50656 | 50656 | " - Selected hyperparameters for the selected algorithm."
|
50657 | 50657 | ]
|
|
50661 | 50661 | "id": "551e2cd6",
|
50662 | 50662 | "metadata": {},
|
50663 | 50663 | "source": [
|
50664 |
| - "AutoML provides a `print_summary` API to output all the different trials performed." |
| 50664 | + "AutoMLx provides a `print_summary` API to output all the different trials performed." |
50665 | 50665 | ]
|
50666 | 50666 | },
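For instance, assuming `est` is the fitted pipeline from the sketch above; the attribute names below are taken from public AutoMLx example notebooks and are an assumption that may vary by release.

```python
# Tabulate every trial performed during the optimization process.
est.print_summary()

# Inspect what the pipeline settled on; these attribute names are an
# assumption based on public AutoMLx examples.
print(est.selected_model_)         # the algorithm chosen for this data
print(est.selected_model_params_)  # its tuned hyperparameters
```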
|
50667 | 50667 | {
|
|
53807 | 53807 | "id": "81fbf2ea",
|
53808 | 53808 | "metadata": {},
|
53809 | 53809 | "source": [
|
53810 |
| - "We also provide the capability to visualize the results of each stage of the AutoML pipeline. " |
| 53810 | + "We also provide the capability to visualize the results of each stage of the AutoMLx pipeline. " |
53811 | 53811 | ]
|
53812 | 53812 | },
|
53813 | 53813 | {
|
|
53993 | 53993 | "<a id='hyperparameter-tuning'></a>\n",
|
53994 | 53994 | "#### Hyperparameter Tuning\n",
|
53995 | 53995 | "\n",
|
53996 |
| - "Hyperparameter Tuning is the last stage of the AutoML pipeline, and focuses on improving the chosen algorithm's score on the reduced dataset (after Adaptive Sampling and Feature Selection). We use a novel algorithm to search across many hyperparameters dimensions, and converge automatically when optimal hyperparameters are identified. Each trial in the graph below represents a particular hyperparameters configuration for the selected model." |
| 53996 | + "Hyperparameter Tuning is the last stage of the AutoMLx pipeline, and focuses on improving the chosen algorithm's score on the reduced dataset (after Adaptive Sampling and Feature Selection). We use a novel algorithm to search across many hyperparameters dimensions, and converge automatically when optimal hyperparameters are identified. Each trial in the graph below represents a particular hyperparameters configuration for the selected model." |
53997 | 53997 | ]
|
53998 | 53998 | },
|
53999 | 53999 | {
|
|
95354 | 95354 | "<a id='ref'></a>\n",
|
95355 | 95355 | "## References\n",
|
95356 | 95356 | "* More examples and details: http://automl.oraclecorp.com/\n",
|
95357 |
| - "* Oracle AutoML http://www.vldb.org/pvldb/vol13/p3166-yakovlev.pdf\n", |
| 95357 | + "* Oracle AutoMLx http://www.vldb.org/pvldb/vol13/p3166-yakovlev.pdf\n", |
95358 | 95358 | "* scikit-learn https://scikit-learn.org/stable/\n",
|
95359 | 95359 | "* Interpretable Machine Learning https://christophm.github.io/interpretable-ml-book/\n",
|
95360 | 95360 | "* LIME https://arxiv.org/pdf/1602.04938\n",
|
|