Add the documentation for Sklearn integration in EVADB.
Jineet Desai committed Nov 26, 2023
1 parent 334c8b1 commit 020b82a
Showing 1 changed file with 40 additions and 7 deletions.
47 changes: 40 additions & 7 deletions docs/source/reference/ai/model-train-sklearn.rst
@@ -6,21 +6,54 @@ Model Training with Sklearn
1. Installation
---------------

To use the `Sklearn framework <https://scikit-learn.org/stable/>`_, install the extra sklearn dependency in your EvaDB virtual environment.
To use the `Flaml XGBoost AutoML framework <https://microsoft.github.io/FLAML/docs/Examples/Integrate%20-%20Scikit-learn%20Pipeline/>`_, we need to install the extra Flaml dependency in your EvaDB virtual environment.

.. code-block:: bash
pip install "evadb[sklearn]"
pip install "flaml[automl]"
2. Example Query
----------------

.. code-block:: sql
CREATE OR REPLACE FUNCTION PredictHouseRent FROM
CREATE FUNCTION IF NOT EXISTS PredictRent FROM
( SELECT number_of_rooms, number_of_bathrooms, days_on_market, rental_price FROM HomeRentals )
TYPE Sklearn
TYPE XGBoost
PREDICT 'rental_price';
In the above query, you are creating a new customized function by training a model from the ``HomeRentals`` table using the ``Sklearn`` framework.
The ``rental_price`` column is the target column for prediction, while the remaining columns from the ``SELECT`` query are the inputs.
In the above query, you are creating a new customized function by training a model from the ``HomeRentals`` table using the ``Flaml XGBoost`` framework.
The ``rental_price`` column is the target column for prediction, while the remaining columns from the ``SELECT`` query are the inputs.
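
As a minimal sketch of how the trained function might then be used, assuming it can be called like other EvaDB functions and that ``HomeRentals`` still holds the input columns listed above, a prediction query could look like the following. The ``LIMIT`` value is arbitrary and only keeps the output short.

.. code-block:: sql

   SELECT PredictHouseRent(*) FROM HomeRentals LIMIT 10;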

3. Model Training Parameters
----------------------------

.. list-table:: Available Parameters
   :widths: 25 75

   * - PREDICT (**required**)
     - The name of the column to predict.
   * - MODEL
     - The Sklearn models currently supported are ``Random Forest``, ``Extra Trees Regressor``, and ``KNN``.
       Use ``rf`` for Random Forest, ``extra_tree`` for Extra Trees Regressor, and ``kneighbor`` for KNN.
   * - TIME_LIMIT
     - Time limit for training the model, in seconds. Default: 120.
   * - TASK
     - The type of task to perform: ``regression`` or ``classification``.
   * - METRIC
     - The metric used to train the model. For example, ``regression`` tasks can use the ``r2`` or ``RMSE`` metrics,
       and ``classification`` tasks can use the ``accuracy`` or ``f1_score`` metrics.
       More information about the model metrics can be found `here <https://microsoft.github.io/FLAML/docs/Use-Cases/Task-Oriented-AutoML#optimization-metric>`_.

Below is an example query specifying the above parameters:

.. code-block:: sql

   CREATE OR REPLACE FUNCTION PredictHouseRentSklearn FROM
   ( SELECT number_of_rooms, number_of_bathrooms, days_on_market, rental_price FROM HomeRentals )
   TYPE Sklearn
   PREDICT 'rental_price'
   MODEL 'extra_tree'
   METRIC 'r2'
   TASK 'regression'
   TIME_LIMIT 180;
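
For comparison, a classification variant might look like the sketch below. The ``LoanApplications`` table, its columns, and the ``PredictLoanApproval`` function name are hypothetical and used only for illustration. The parameter values are taken from the table above, and the sketch assumes that ``rf`` can be combined with the ``classification`` task.

.. code-block:: sql

   -- Hypothetical table, columns, and function name, for illustration only.
   CREATE OR REPLACE FUNCTION PredictLoanApproval FROM
   ( SELECT income, credit_score, loan_amount, approved FROM LoanApplications )
   TYPE Sklearn
   PREDICT 'approved'
   MODEL 'rf'
   METRIC 'accuracy'
   TASK 'classification'
   TIME_LIMIT 60;

Only ``PREDICT`` is marked as required in the table above, so the other clauses can be omitted when the defaults are acceptable.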
