Finetune Controller

Finetune Controller is a system for managing the fine-tuning of machine learning models on Kubernetes, particularly on OpenShift clusters. It builds on FastAPI, MongoDB, Kueue, and the Kubeflow Training Operator to support development and deployment of fine-tuning workflows.

Features

Local Development: Get started quickly with a streamlined setup process using uv, a high-performance Python package and project manager.
OpenShift Integration: Simplify deployment and scaling with OpenShift-specific configurations and GPU support for intensive workloads.
MongoDB Backend: Seamlessly connect to a local or cluster-based MongoDB database.
Extensibility: Easily integrate with the Kubeflow Training Operator and other components for advanced workflows.

Getting Started

If the cluster is already set up, continue below. Otherwise, first follow the cluster setup instructions under "Setup OpenShift Cluster Resources".

Prereqs

We recommend using uv, an extremely fast Python package and project manager
```
pip install uv
```
A container engine such as Docker or Podman

Install

Create a virtual environment and install dependencies
```
uv sync
```

Start a local development MongoDB database (or connect to one on the cluster with port-forward)

Local

docker run -d --rm --name mongodb \
    -e MONGODB_INITDB_ROOT_USERNAME="default-user" \
    -e MONGODB_INITDB_ROOT_PASSWORD="admin123456789" \
    -e MONGODB_INITDB_DATABASE="finetune" \
    -p 27017:27017 \
    mongodb/mongodb-community-server:latest

Alternatively, if MongoDB is already running on the cluster, you can port-forward its service to your local machine

oc port-forward service/mongodb-community-server 27017:27017 -n <namespace>
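
With either option, MongoDB is reachable on localhost:27017. The sketch below shows one way a connection string could be assembled from the example credentials above. The helper name `mongo_uri` is hypothetical and not part of this project; the point is that usernames and passwords should be percent-encoded before being placed in the URI.

```python
from urllib.parse import quote_plus


def mongo_uri(user: str, password: str, host: str = "localhost", port: int = 27017) -> str:
    """Build a mongodb:// URI, percent-encoding the credentials.

    Hypothetical helper for illustration; the application may build its
    connection string differently.
    """
    return f"mongodb://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}"


# Example credentials taken from the docker run command above.
print(mongo_uri("default-user", "admin123456789"))
```

The resulting string is a candidate value for MONGODB_URL in the .env file.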

Connect to the OpenShift cluster with the CLI login command oc login. If the cluster is not already set up, follow the steps under "Setup OpenShift Cluster Resources".
Create a project level .env file (see .env.example) and update the variables.
```
cp .env.example .env
```
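
Before starting the application, it can help to check that the copied .env file actually defines the variables you expect. The following is a minimal sketch under stated assumptions: `parse_env` and `missing_keys` are hypothetical helpers, and MONGODB_URL is used as the required key only because it is the one variable this README names; check .env.example for the full list.

```python
from pathlib import Path


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding quotes, if any.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict[str, str], required: list[str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    if env_path.exists():
        # MONGODB_URL is an assumed required key; adjust to match .env.example.
        missing = missing_keys(parse_env(env_path.read_text()), ["MONGODB_URL"])
        print("missing:", missing or "none")
```
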
Make sure the virtual environment is activated and start the local finetuning controller application.
```
source .venv/bin/activate

uvicorn app.main:app --reload
```

Once these steps are complete:

MongoDB is running with the required configuration
The FastAPI server is running with auto-reload enabled
The application is available at http://localhost:8000
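
To confirm the server is reachable, a small polling check can be useful. This is a sketch using only the Python standard library; `http_status` and `wait_until_up` are hypothetical helper names, and the /docs path mentioned below is FastAPI's default interactive docs route, which is an assumption about this application's configuration.

```python
import time
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 2.0):
    """Return the HTTP status code for url, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, just not with a 2xx status.
        return err.code
    except (urllib.error.URLError, OSError):
        return None


def wait_until_up(url: str, attempts: int = 10, delay: float = 1.0, probe=http_status) -> bool:
    """Poll url until the probe gets any HTTP response or attempts run out."""
    for _ in range(attempts):
        if probe(url) is not None:
            return True
        time.sleep(delay)
    return False
```

For example, `wait_until_up("http://localhost:8000/docs")` returns True as soon as the server answers. The `probe` parameter exists so the polling logic can be exercised without a running server.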

Development and Contributing

Set up pre-commit to keep linting and code styling up to standard.

uv sync
pre-commit install

Setup OpenShift Cluster Resources

Create default project

The project name can be anything descriptive. These examples use finetune-controller.

oc new-project finetune-controller

Create Kubeflow project

oc new-project kubeflow

Install Kubeflow training operator

kubectl apply --server-side -k "github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=v1.8.1"

Install Kueue

Requires Kubernetes 1.29 or newer

Follow the latest docs

Install a released version

kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.10.1/manifests.yaml

To wait for Kueue to be fully available, run:

kubectl wait deploy/kueue-controller-manager -nkueue-system --for=condition=available --timeout=5m

Restart pods

kubectl delete pods -lcontrol-plane=controller-manager -nkueue-system

First, update the namespace of the LocalQueue object in crds/kueue/default-user-queue.yaml. The default namespace is "default".

yq e '.metadata.namespace = "finetune-controller"' -i crds/kueue/default-user-queue.yaml

Apply the default CRD config for Kueue, or update it by following the Kueue docs

kubectl apply -f crds/kueue/

Install MongoDB server

This is an example configuration only. Configure it properly for production.

oc new-app -e MONGODB_INITDB_ROOT_USERNAME="default-user" -e MONGODB_INITDB_ROOT_PASSWORD="admin123456789" -e MONGODB_INITDB_DATABASE="finetune"  mongodb/mongodb-community-server:latest --namespace finetune-controller

Add GPU nodes to ROSA cluster

Go to your cluster in the Red Hat console admin dashboard. Add a machine pool of your choosing with the following configuration:

Taints

key: nvidia.com/gpu
value: <machine pool type or other>
effect: NoSchedule

Node Labels

Key: cluster-api/accelerator
Value: <gpu type e.g. V100 or empty>
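
Because the GPU nodes carry a NoSchedule taint, a pod only lands on them if its spec tolerates that taint, and the node label can be used to select a specific accelerator type. The sketch below builds the corresponding pod spec fragment as a plain dictionary. The function name `gpu_scheduling` is hypothetical, and how Finetune Controller actually sets these fields on training jobs is not shown in this README.

```python
def gpu_scheduling(accelerator=None) -> dict:
    """Return a pod spec fragment that matches the taint and label above.

    Hypothetical helper for illustration only.
    """
    spec = {
        "tolerations": [
            {
                "key": "nvidia.com/gpu",
                # "Exists" tolerates the taint regardless of its value.
                "operator": "Exists",
                "effect": "NoSchedule",
            }
        ]
    }
    if accelerator:
        # Restrict scheduling to nodes labeled with this accelerator type.
        spec["nodeSelector"] = {"cluster-api/accelerator": accelerator}
    return spec


print(gpu_scheduling("V100"))
```

Using the Exists operator means the toleration keeps working whatever value you chose for the taint.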

Setup AWS Secret

Example AWS config

# aws_credentials.yaml
apiVersion: v1
data:
  AWS_ACCESS_KEY_ID: <base64-encoded secret>
  AWS_SECRET_ACCESS_KEY: <base64-encoded secret>
  AWS_REGION: <base64-encoded string>
kind: Secret
metadata:
  name: aws-credentials
type: Opaque
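
Values under `data` in a Kubernetes Secret must be base64-encoded. A minimal sketch of producing them with the Python standard library follows; `encode_secret_data` is a hypothetical helper and the region value is only an example.

```python
import base64


def encode_secret_data(values: dict) -> dict:
    """Base64-encode each value, as required for a Secret's data field."""
    return {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in values.items()
    }


# Example value only; substitute your own region and credentials.
encoded = encode_secret_data({"AWS_REGION": "us-east-1"})
print(encoded["AWS_REGION"])  # dXMtZWFzdC0x
```

Base64 is an encoding, not encryption, so treat the resulting YAML file as sensitive.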

Setup Pull secrets

Example Docker pull secret config

# pull_secret.yaml
apiVersion: v1
data:
  .dockerconfigjson: ...
kind: Secret
metadata:
  name: cr-pull-secret
type: kubernetes.io/dockerconfigjson
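
The `.dockerconfigjson` value is a base64-encoded Docker config JSON document. The sketch below shows how such a value could be generated; the function name, registry host, and credentials are placeholders introduced for illustration, not values from this project.

```python
import base64
import json


def dockerconfigjson(registry: str, username: str, password: str) -> str:
    """Return a base64-encoded Docker config suitable for .dockerconfigjson."""
    # The "auth" field is base64 of "username:password".
    auth = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    config = {
        "auths": {
            registry: {"username": username, "password": password, "auth": auth}
        }
    }
    return base64.b64encode(json.dumps(config).encode("utf-8")).decode("ascii")


# Placeholder registry and credentials.
print(dockerconfigjson("registry.example.com", "my-user", "my-token"))
```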

Apply these secrets

oc apply -f aws_credentials.yaml -n finetune-controller
oc apply -f pull_secret.yaml -n finetune-controller

Install Finetune Controller On OpenShift

Create a .env.production file and update the defaults. For this example, set MONGODB_URL=mongodb://mongodb-community-server.finetune-controller.svc.cluster.local:27017
```
cp .env.example .env.production
```
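
The MONGODB_URL in this example follows the standard in-cluster DNS pattern for a Kubernetes service, `<service>.<namespace>.svc.cluster.local`. If you deployed MongoDB under a different service name or namespace, the URL changes accordingly. The helper below is a hypothetical illustration of that pattern.

```python
def cluster_mongo_url(service: str, namespace: str, port: int = 27017) -> str:
    """Build a MongoDB URL from a service's in-cluster DNS name."""
    return f"mongodb://{service}.{namespace}.svc.cluster.local:{port}"


# Service name and namespace as used in the examples in this README.
print(cluster_mongo_url("mongodb-community-server", "finetune-controller"))
```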

Create the application

oc new-app --strategy=docker --binary --name finetune-controller --env-file=".env.production" --namespace finetune-controller

Expose the service, create a route, and patch the TLS config

oc expose deployment/finetune-controller --port=8000
oc expose svc/finetune-controller --port=8000
oc patch route finetune-controller --type=merge -p '{"spec":{"tls":{"termination":"edge"}}}'

Add cluster role binding permissions to the application

Start a build

oc start-build finetune-controller --from-dir=. --namespace=finetune-controller

Manually Publish Updates To Finetune Controller

Publish From current project

./scripts/publish.sh

Publish From git ~HEAD

./scripts/publish_git.sh
