-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
docs: Add README to tpch directory #79
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,120 @@ | ||
<!--- | ||
Licensed to the Apache Software Foundation (ASF) under one | ||
or more contributor license agreements. See the NOTICE file | ||
distributed with this work for additional information | ||
regarding copyright ownership. The ASF licenses this file | ||
to you under the Apache License, Version 2.0 (the | ||
"License"); you may not use this file except in compliance | ||
with the License. You may obtain a copy of the License at | ||
|
||
http://www.apache.org/licenses/LICENSE-2.0 | ||
|
||
Unless required by applicable law or agreed to in writing, | ||
software distributed under the License is distributed on an | ||
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
KIND, either express or implied. See the License for the | ||
specific language governing permissions and limitations | ||
under the License. | ||
--> | ||
|
||
# Benchmarking DataFusion Ray on Kubernetes | ||
|
||
This is a rough guide to deploying and benchmarking DataFusion Ray on Kubernetes as part of the development process. | ||
|
||
## Building Wheels | ||
|
||
Follow the instructions in the [contributor guide] to set up a development environment and then build the project | ||
using the following command. Note that the `--strip` argument is important for keeping the wheel size below 100MB | ||
|
||
[contributor guide]: ../docs/contributing.md | ||
|
||
```shell | ||
maturin build --strip | ||
``` | ||
|
||
## Create a Ray Cluster | ||
|
||
Follow the instructions in the [Ray on Kubernetes] documentation to install the KubeRay operator. | ||
|
||
[Ray on Kubernetes]: https://docs.ray.io/en/latest/cluster/kubernetes/index.html | ||
|
||
Create a `datafusion-ray.yaml` file using the following template. It is important that the Ray Docker image uses the | ||
same Python version that was used to build the wheels. This example yaml assumes that the TPC-H data files are | ||
available locally on each node in the cluster at the path `/mnt/bigdata`. If the data is stored on object storage then | ||
the `volume` and `volumeMount` sections can be removed. | ||
|
||
```yaml | ||
apiVersion: ray.io/v1alpha1 | ||
kind: RayCluster | ||
metadata: | ||
name: datafusion-ray-cluster | ||
spec: | ||
headGroupSpec: | ||
rayStartParams: | ||
num-cpus: "0" | ||
template: | ||
spec: | ||
containers: | ||
- name: ray-head | ||
image: rayproject/ray:2.43.0-py312-cpu | ||
imagePullPolicy: IfNotPresent | ||
resources: | ||
limits: | ||
cpu: 2 | ||
memory: 8Gi | ||
requests: | ||
cpu: 2 | ||
memory: 8Gi | ||
volumeMounts: | ||
- mountPath: /mnt/bigdata # Mount path inside the container | ||
name: ray-storage | ||
volumes: | ||
- name: ray-storage | ||
persistentVolumeClaim: | ||
claimName: ray-pvc | ||
workerGroupSpecs: | ||
- replicas: 2 | ||
groupName: "datafusion-ray" | ||
rayStartParams: | ||
num-cpus: "4" | ||
template: | ||
spec: | ||
containers: | ||
- name: ray-worker | ||
image: rayproject/ray:2.43.0-py312-cpu | ||
imagePullPolicy: IfNotPresent | ||
resources: | ||
limits: | ||
cpu: 5 | ||
memory: 64Gi | ||
requests: | ||
cpu: 5 | ||
memory: 64Gi | ||
volumeMounts: | ||
- mountPath: /mnt/bigdata | ||
name: ray-storage | ||
volumes: | ||
- name: ray-storage | ||
persistentVolumeClaim: | ||
claimName: ray-pvc | ||
``` | ||
|
||
Run the following command to create the cluster: | ||
|
||
```shell | ||
kubectl apply -f datafusion-ray.yaml | ||
``` | ||
|
||
Once the cluster is running, set up port forwarding on port 8265 on the head node and then run the following | ||
command to run the benchmarks. | ||
|
||
```shell | ||
ray job submit --address='http://localhost:8265' \ | ||
--runtime-env-json='{"pip":["datafusion", "tabulate", "boto3", "duckdb"], "py_modules":["/home/andy/git/apache/datafusion-ray/target/wheels/datafusion_ray-0.1.0-cp38-abi3-manylinux_2_35_x86_64.whl"], "working_dir":"./", "env_vars":{"RAY_DEDUP_LOGS":"O", "RAY_COLOR_PREFIX":"1"}}' -- \ | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The path to the .wheel here should perhaps be changed. Also, |
||
python tpcbench.py \ | ||
--data /mnt/bigdata/tpch/sf100 \ | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do you want want to include the k8s resource definition for the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I see it is mentioned above that it is assumed. Ok |
||
--concurrency 8 \ | ||
--partitions-per-processor 4 \ | ||
--worker-pool-min 30 \ | ||
--listing-tables | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add a comment here reminding users to add
--release
for a release wheel build