Commit f3ecf40 (1 parent: be10123)
docs: add new documentation and fix existing docs

10 files changed: +278 -24 lines changed
README.md (+2 -3)

@@ -11,7 +11,7 @@ For guided demos and basics walkthroughs, check out the following links:
 - these demos can be copied into your current working directory when using the `codeflare-sdk` by using the `codeflare_sdk.copy_demo_nbs()` function
 - Additionally, we have a [video walkthrough](https://www.youtube.com/watch?v=U76iIfd9EmE) of these basic demos from June, 2023

-Full documentation can be found [here](https://project-codeflare.github.io/codeflare-sdk/detailed-documentation)
+Full documentation can be found [here](https://project-codeflare.github.io/codeflare-sdk/index.html)

 ## Installation

@@ -32,11 +32,10 @@ It is possible to use the Release Github workflow to do the release. This is gen
 The following instructions apply when doing release manually. This may be required in instances where the automation is failing.

 - Check and update the version in "pyproject.toml" file.
-- Generate new documentation.
-  `pdoc --html -o docs src/codeflare_sdk && pushd docs && rm -rf cluster job utils && mv codeflare_sdk/* . && rm -rf codeflare_sdk && popd && find docs -type f -name "*.html" -exec bash -c "echo '' >> {}" \;` (it is possible to install **pdoc** using the following command `poetry install --with docs`)
 - Commit all the changes to the repository.
 - Create Github release (<https://docs.github.com/en/repositories/releasing-projects-on-github/managing-releases-in-a-repository#creating-a-release>).
 - Build the Python package. `poetry build`
 - If not present already, add the API token to Poetry.
   `poetry config pypi-token.pypi API_TOKEN`
 - Publish the Python package. `poetry publish`
+- Trigger the [Publish Documentation](https://github.com/project-codeflare/codeflare-sdk/actions/workflows/publish-documentation.yaml) workflow

docs/sphinx/index.rst (+3 -1)

@@ -16,14 +16,16 @@ The CodeFlare SDK is an intuitive, easy-to-use python interface for batch resour
    modules

 .. toctree::
-   :maxdepth: 2
+   :maxdepth: 1
    :caption: User Documentation:

    user-docs/authentication
    user-docs/cluster-configuration
+   user-docs/ray-cluster-interaction
    user-docs/e2e
    user-docs/s3-compatible-storage
    user-docs/setup-kueue
+   user-docs/ui-widgets

 Quick Links
 ===========

docs/sphinx/user-docs/authentication.rst (+2 -2)

@@ -39,7 +39,7 @@ a login command like ``oc login --token=<token> --server=<server>``
 their kubernetes config file should have updated. If the user has not
 specifically authenticated through the SDK by other means such as
 ``TokenAuthentication`` then the SDK will try to use their default
-Kubernetes config file located at ``"/HOME/.kube/config"``.
+Kubernetes config file located at ``"$HOME/.kube/config"``.

 Method 3 Specifying a Kubernetes Config File
 --------------------------------------------

@@ -62,5 +62,5 @@ Method 4 In-Cluster Authentication
 ----------------------------------

 If a user does not authenticate by any of the means detailed above and
-does not have a config file at ``"/HOME/.kube/config"`` the SDK will try
+does not have a config file at ``"$HOME/.kube/config"`` the SDK will try
 to authenticate with the in-cluster configuration file.
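The fallback order this hunk documents (an explicitly supplied config file, then ``$HOME/.kube/config``, then in-cluster configuration) can be sketched with a small stdlib-only helper. This is an illustrative sketch of the documented behaviour, not the SDK's actual implementation; the function name and return shape are invented for the example:

```python
import os

def resolve_kube_config(explicit_path=None, home=None):
    """Illustrative sketch of the documented lookup order (not SDK code):
    1. an explicitly specified Kubernetes config file,
    2. the default ``$HOME/.kube/config``,
    3. otherwise fall back to in-cluster configuration.
    """
    if explicit_path and os.path.isfile(explicit_path):
        return ("file", explicit_path)
    default = os.path.join(home or os.path.expanduser("~"), ".kube", "config")
    if os.path.isfile(default):
        return ("file", default)
    return ("in-cluster", None)
```

A caller would typically not need to do this by hand; the SDK performs the equivalent lookup when no ``TokenAuthentication`` has been set up.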

docs/sphinx/user-docs/cluster-configuration.rst (+117 -8)

@@ -29,24 +29,133 @@ requirements for creating the Ray Cluster.
         labels={"exampleLabel": "example", "secondLabel": "example"},
     ))

-Note: ‘quay.io/modh/ray:2.35.0-py39-cu121’ is the default image used by
-the CodeFlare SDK for creating a RayCluster resource. If you have your
-own Ray image which suits your purposes, specify it in image field to
-override the default image. If you are using ROCm compatible GPUs you
-can use ‘quay.io/modh/ray:2.35.0-py39-rocm61’. You can also find
-documentation on building a custom image
-`here <https://github.com/opendatahub-io/distributed-workloads/tree/main/images/runtime/examples>`__.
+.. note::
+   `quay.io/modh/ray:2.35.0-py39-cu121` is the default image used by
+   the CodeFlare SDK for creating a RayCluster resource. If you have your
+   own Ray image which suits your purposes, specify it in the image field to
+   override the default image. If you are using ROCm compatible GPUs you
+   can use `quay.io/modh/ray:2.35.0-py39-rocm61`. You can also find
+   documentation on building a custom image
+   `here <https://github.com/opendatahub-io/distributed-workloads/tree/main/images/runtime/examples>`__.

 The ``labels={"exampleLabel": "example"}`` parameter can be used to
 apply additional labels to the RayCluster resource.

 After creating their ``cluster``, a user can call ``cluster.up()`` and
 ``cluster.down()`` to respectively create or remove the Ray Cluster.

+Parameters of the ``ClusterConfiguration``
+------------------------------------------
+
+Below is a table explaining each of the ``ClusterConfiguration``
+parameters and their default values.
+
+.. list-table::
+   :header-rows: 1
+   :widths: auto
+
+   * - Name
+     - Type
+     - Description
+     - Default
+   * - ``name``
+     - ``str``
+     - The name of the Ray Cluster/AppWrapper
+     - Required - No default
+   * - ``namespace``
+     - ``Optional[str]``
+     - The namespace of the Ray Cluster/AppWrapper
+     - ``None``
+   * - ``head_cpu_requests``
+     - ``Union[int, str]``
+     - CPU resource requests for the Head Node
+     - ``2``
+   * - ``head_cpu_limits``
+     - ``Union[int, str]``
+     - CPU resource limits for the Head Node
+     - ``2``
+   * - ``head_memory_requests``
+     - ``Union[int, str]``
+     - Memory resource requests for the Head Node
+     - ``8``
+   * - ``head_memory_limits``
+     - ``Union[int, str]``
+     - Memory limits for the Head Node
+     - ``8``
+   * - ``head_extended_resource_requests``
+     - ``Dict[str, Union[str, int]]``
+     - Extended resource requests for the Head Node
+     - ``{}``
+   * - ``worker_cpu_requests``
+     - ``Union[int, str]``
+     - CPU resource requests for the Worker Node
+     - ``1``
+   * - ``worker_cpu_limits``
+     - ``Union[int, str]``
+     - CPU resource limits for the Worker Node
+     - ``1``
+   * - ``num_workers``
+     - ``int``
+     - Number of Worker Nodes for the Ray Cluster
+     - ``1``
+   * - ``worker_memory_requests``
+     - ``Union[int, str]``
+     - Memory resource requests for the Worker Node
+     - ``8``
+   * - ``worker_memory_limits``
+     - ``Union[int, str]``
+     - Memory resource limits for the Worker Node
+     - ``8``
+   * - ``appwrapper``
+     - ``bool``
+     - A boolean that wraps the Ray Cluster in an AppWrapper
+     - ``False``
+   * - ``envs``
+     - ``Dict[str, str]``
+     - A dictionary of environment variables to set for the Ray Cluster
+     - ``{}``
+   * - ``image``
+     - ``str``
+     - A parameter for specifying the Ray Image
+     - ``""``
+   * - ``image_pull_secrets``
+     - ``List[str]``
+     - A parameter for providing a list of Image Pull Secrets
+     - ``[]``
+   * - ``write_to_file``
+     - ``bool``
+     - A boolean for writing the Ray Cluster as a Yaml file if set to True
+     - ``False``
+   * - ``verify_tls``
+     - ``bool``
+     - A boolean indicating whether to verify TLS when connecting to the cluster
+     - ``True``
+   * - ``labels``
+     - ``Dict[str, str]``
+     - A dictionary of labels to apply to the cluster
+     - ``{}``
+   * - ``worker_extended_resource_requests``
+     - ``Dict[str, Union[str, int]]``
+     - Extended resource requests for the Worker Node
+     - ``{}``
+   * - ``extended_resource_mapping``
+     - ``Dict[str, str]``
+     - A dictionary of custom resource mappings to map extended resource requests to RayCluster resource names
+     - ``{}``
+   * - ``overwrite_default_resource_mapping``
+     - ``bool``
+     - A boolean indicating whether to overwrite the default resource mapping
+     - ``False``
+   * - ``local_queue``
+     - ``Optional[str]``
+     - A parameter for specifying the Local Queue label for the Ray Cluster
+     - ``None``
+
 Deprecating Parameters
 ----------------------

-The following parameters of the ``ClusterConfiguration`` are being deprecated.
+The following parameters of the ``ClusterConfiguration`` are being
+deprecated.

 .. list-table::
    :header-rows: 1

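As a quick reference, the defaults in the new parameter table can be captured in a plain Python dictionary. This is an illustrative snapshot for readers of this diff, not the SDK's actual ``ClusterConfiguration`` class; the ``merge_config`` helper is invented for the example:

```python
# Illustrative snapshot of the ClusterConfiguration defaults from the table
# added in this commit. A plain dict for reference only -- not SDK code.
CLUSTER_CONFIG_DEFAULTS = {
    "namespace": None,
    "head_cpu_requests": 2,
    "head_cpu_limits": 2,
    "head_memory_requests": 8,
    "head_memory_limits": 8,
    "head_extended_resource_requests": {},
    "worker_cpu_requests": 1,
    "worker_cpu_limits": 1,
    "num_workers": 1,
    "worker_memory_requests": 8,
    "worker_memory_limits": 8,
    "appwrapper": False,
    "envs": {},
    "image": "",
    "image_pull_secrets": [],
    "write_to_file": False,
    "verify_tls": True,
    "labels": {},
    "worker_extended_resource_requests": {},
    "extended_resource_mapping": {},
    "overwrite_default_resource_mapping": False,
    "local_queue": None,
}

def merge_config(overrides):
    """Merge user overrides onto the defaults; ``name`` is required
    and has no default, exactly as the table states."""
    if "name" not in overrides:
        raise ValueError("name is required - no default")
    cfg = dict(CLUSTER_CONFIG_DEFAULTS)
    cfg.update(overrides)
    return cfg
```

For example, ``merge_config({"name": "raytest", "num_workers": 2})`` yields a config that keeps every default except the worker count.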
docs/sphinx/user-docs/e2e.rst (+10 -9)

@@ -11,7 +11,7 @@ On KinD clusters

 Pre-requisite for KinD clusters: please add in your local ``/etc/hosts``
 file ``127.0.0.1 kind``. This will map your localhost IP address to the
-KinD clusters hostname. This is already performed on `GitHub
+KinD cluster's hostname. This is already performed on `GitHub
 Actions <https://github.com/project-codeflare/codeflare-common/blob/1edd775e2d4088a5a0bfddafb06ff3a773231c08/github-actions/kind/action.yml#L70-L72>`__

 If the system you run on contains NVidia GPU then you can enable the GPU

@@ -91,7 +91,7 @@ instructions <https://www.substratus.ai/blog/kind-with-gpus>`__.
      poetry install --with test,docs
      poetry run pytest -v -s ./tests/e2e/mnist_raycluster_sdk_kind_test.py

-  - If the cluster doesnt have NVidia GPU support then we need to
+  - If the cluster doesn't have NVidia GPU support then we need to
     disable NVidia GPU tests by providing proper marker:

   ::

@@ -124,8 +124,8 @@ If the system you run on contains NVidia GPU then you can enable the GPU
 support on OpenShift, this will allow you to run also GPU tests. To
 enable GPU on OpenShift follow `these
 instructions <https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/introduction.html>`__.
-Currently the SDK doesnt support tolerations, so e2e tests cant be
-executed on nodes with taint (i.e. GPU taint).
+Currently the SDK doesn't support tolerations, so e2e tests can't be
+executed on nodes with taint (i.e. GPU taint).

 - Test Phase:

@@ -203,8 +203,9 @@ On OpenShift Disconnected clusters
       AWS_STORAGE_BUCKET=<storage-bucket-name>
       AWS_STORAGE_BUCKET_MNIST_DIR=<storage-bucket-MNIST-datasets-directory>

-Note : When using the Python Minio client to connect to a minio
-storage bucket, the ``AWS_DEFAULT_ENDPOINT`` environment
-variable by default expects secure endpoint where user can use
-endpoint url with https/http prefix for autodetection of
-secure/insecure endpoint.
+.. note::
+   When using the Python Minio client to connect to a minio
+   storage bucket, the ``AWS_DEFAULT_ENDPOINT`` environment
+   variable by default expects secure endpoint where user can use
+   endpoint url with https/http prefix for autodetection of
+   secure/insecure endpoint.
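The https/http autodetection described in that note can be sketched with the standard library. This is an illustrative helper, not the Minio client's actual logic; the function name is invented for the example:

```python
from urllib.parse import urlparse

def parse_endpoint(endpoint_url):
    """Split an endpoint URL into (host_port, secure_flag).

    Illustrative of the https/http prefix autodetection the note
    describes -- not the actual Minio client implementation.
    """
    parsed = urlparse(endpoint_url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError("endpoint must start with http:// or https://")
    return parsed.netloc, parsed.scheme == "https"
```

So ``https://minio.example.com:9000`` would be treated as a secure endpoint and ``http://minio.local`` as insecure.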
docs/sphinx/user-docs/ray-cluster-interaction.rst (+90)

@@ -0,0 +1,90 @@
+Ray Cluster Interaction
+=======================
+
+The CodeFlare SDK offers multiple ways to interact with Ray Clusters
+including the below methods.
+
+get_cluster()
+-------------
+
+The ``get_cluster()`` function is used to initialise a ``Cluster``
+object from a pre-existing Ray Cluster/AppWrapper. Below is an example
+of its usage:
+
+::
+
+   from codeflare_sdk import get_cluster
+   cluster = get_cluster(cluster_name="raytest", namespace="example", is_appwrapper=False, write_to_file=False)
+   -> output: Yaml resources loaded for raytest
+   cluster.status()
+   -> output:
+                   🚀 CodeFlare Cluster Status 🚀
+   ╭─────────────────────────────────────────────────────────────────╮
+   │   Name                                                          │
+   │   raytest                                   Active ✅           │
+   │                                                                 │
+   │   URI: ray://raytest-head-svc.example.svc:10001                 │
+   │                                                                 │
+   │   Dashboard🔗                                                   │
+   │                                                                 │
+   ╰─────────────────────────────────────────────────────────────────╯
+   (<CodeFlareClusterStatus.READY: 1>, True)
+   cluster.down()
+   cluster.up() # This function will create an exact copy of the retrieved Ray Cluster only if the Ray Cluster has been previously deleted.
+
+| These are the parameters the ``get_cluster()`` function accepts:
+| ``cluster_name: str # Required`` -> The name of the Ray Cluster.
+| ``namespace: str # Default: "default"`` -> The namespace of the Ray Cluster.
+| ``is_appwrapper: bool # Default: False`` -> When set to ``True`` the function will attempt to retrieve an AppWrapper instead of a Ray Cluster.
+| ``write_to_file: bool # Default: False`` -> When set to ``True`` the Ray Cluster/AppWrapper will be written to a file similar to how it is done in ``ClusterConfiguration``.
+
+list_all_queued()
+-----------------
+
+| The ``list_all_queued()`` function returns (and prints by default) a list of all currently queued-up Ray Clusters in a given namespace.
+| It accepts the following parameters:
+| ``namespace: str # Required`` -> The namespace you want to retrieve the list from.
+| ``print_to_console: bool # Default: True`` -> Allows the user to print the list to their console.
+| ``appwrapper: bool # Default: False`` -> When set to ``True`` allows the user to list queued AppWrappers.
+
+list_all_clusters()
+-------------------
+
+| The ``list_all_clusters()`` function will return a list of detailed descriptions of Ray Clusters, printed to the console by default.
+| It accepts the following parameters:
+| ``namespace: str # Required`` -> The namespace you want to retrieve the list from.
+| ``print_to_console: bool # Default: True`` -> A boolean that allows the user to print the list to their console.
+
+.. note::
+
+   The following methods require a ``Cluster`` object to be
+   initialized. See :doc:`./cluster-configuration`
+
+cluster.up()
+------------
+
+| The ``cluster.up()`` function creates a Ray Cluster in the given namespace.
+
+cluster.down()
+--------------
+
+| The ``cluster.down()`` function deletes the Ray Cluster in the given namespace.
+
+cluster.status()
+----------------
+
+| The ``cluster.status()`` function prints out the status of the Ray Cluster's state with a link to the Ray Dashboard.
+
+cluster.details()
+-----------------
+
+| The ``cluster.details()`` function prints out a detailed description of the Ray Cluster's status, worker resources and a link to the Ray Dashboard.
+
+cluster.wait_ready()
+--------------------
+
+| The ``cluster.wait_ready()`` function waits for the requested cluster to be ready, up to an optional timeout, checking every 5 seconds.
+| It accepts the following parameters:
+| ``timeout: Optional[int] # Default: None`` -> Allows the user to define a timeout for the ``wait_ready()`` function.
+| ``dashboard_check: bool # Default: True`` -> If enabled the ``wait_ready()`` function will wait until the Ray Dashboard is ready too.
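The ``wait_ready()`` behaviour this new page documents (poll every 5 seconds until ready, with an optional timeout) can be sketched stdlib-only with an injected readiness check. This is an illustrative sketch of the described semantics, not the SDK's implementation; the standalone function and its ``is_ready`` parameter are invented for the example:

```python
import time

def wait_ready(is_ready, timeout=None, interval=5):
    """Poll ``is_ready()`` every ``interval`` seconds until it returns True.

    Raises TimeoutError once ``timeout`` seconds have elapsed; a timeout
    of None waits indefinitely. Illustrative sketch only -- the SDK's
    ``cluster.wait_ready()`` checks cluster (and dashboard) status itself.
    """
    start = time.monotonic()
    while not is_ready():
        if timeout is not None and time.monotonic() - start >= timeout:
            raise TimeoutError(f"cluster not ready after {timeout}s")
        time.sleep(interval)
```

With ``dashboard_check`` semantics, ``is_ready`` would simply combine the cluster-status and dashboard-status checks into one predicate.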

docs/sphinx/user-docs/s3-compatible-storage.rst (+1 -1)

@@ -82,5 +82,5 @@ Lastly the new ``run_config`` must be added to the Trainer:
 To find more information on creating a Minio Bucket compatible with
 RHOAI you can refer to this
 `documentation <https://ai-on-openshift.io/tools-and-applications/minio/minio/>`__.
-Note: You must have ``sf3s`` and ``pyarrow`` installed in your
+Note: You must have ``s3fs`` and ``pyarrow`` installed in your
 environment for this method.

docs/sphinx/user-docs/ui-widgets.rst (+53)

@@ -0,0 +1,53 @@
+Jupyter UI Widgets
+==================
+
+Below are some examples of the Jupyter UI Widgets that are included in
+the CodeFlare SDK.
+
+.. note::
+   To use the widgets functionality you must be using the CodeFlare SDK
+   in a Jupyter Notebook environment.
+
+Cluster Up/Down Buttons
+-----------------------
+
+The Cluster Up/Down buttons appear after successfully initialising your
+`ClusterConfiguration <cluster-configuration.md#ray-cluster-configuration>`__.
+There are two buttons and a checkbox, ``Cluster Up``, ``Cluster Down``
+and ``Wait for Cluster?``, which mimic the
+`cluster.up() <ray-cluster-interaction.md#clusterup>`__,
+`cluster.down() <ray-cluster-interaction.md#clusterdown>`__ and
+`cluster.wait_ready() <ray-cluster-interaction.md#clusterwait_ready>`__
+functionality.
+
+After initialising their ``ClusterConfiguration`` a user can select the
+``Wait for Cluster?`` checkbox then click the ``Cluster Up`` button to
+create their Ray Cluster and wait until it is ready. The cluster can be
+deleted by clicking the ``Cluster Down`` button.
+
+.. image:: images/ui-buttons.png
+   :alt: An image of the up/down ui buttons
+
+View Clusters UI Table
+----------------------
+
+The View Clusters UI Table allows a user to see a list of Ray Clusters
+with information on their configuration including number of workers, CPU
+requests and limits along with the cluster's status.
+
+.. image:: images/ui-view-clusters.png
+   :alt: An image of the view clusters ui table
+
+Above is a list of two Ray Clusters, ``raytest`` and ``raytest2``; each of
+those headings is clickable and will update the table to view the
+selected Cluster's information. There are three buttons under the table:
+
+* The ``Cluster Down`` button will delete the selected Cluster.
+* The ``View Jobs`` button will try to open the Ray Dashboard's Jobs view
+  in a Web Browser. The link will also be printed to the console.
+* The ``Open Ray Dashboard`` button will try to open the Ray Dashboard
+  view in a Web Browser. The link will also be printed to the console.
+
+The UI Table can be viewed by calling the following function.
+
+.. code:: python
+
+   from codeflare_sdk import view_clusters
+   view_clusters() # Accepts namespace parameter but will try to gather the namespace from the current context
