Add datasets and allow to choose them #259

Open: wants to merge 2 commits into base: master

6 changes: 6 additions & 0 deletions .github/workflows/actions/create-inventory/action.yaml
@@ -39,3 +39,9 @@ runs:
EOL

mv inventory.ini ansible/playbooks/inventory.ini
- name: Prepare datasets.yml
shell: bash
run: |
apk add yq
echo -e "datasets:\n" > ansible/playbooks/group_vars/datasets.yml
yq -p json -o yaml datasets/datasets.json >> ansible/playbooks/group_vars/datasets.yml
26 changes: 24 additions & 2 deletions .github/workflows/continuous-benchmark-hnsw.yaml
@@ -2,6 +2,21 @@ name: Continuous Benchmark Hnsw Indexing

on:
workflow_dispatch:
inputs:
dataset_name:
description: 'First dataset name for transform benchmark'
required: false
type: choice
options:
- 'cohere-wiki-100k-no-filters'
- 'laion-small-clip-no-filters-1'
dataset_2_name:
description: 'Second dataset name for transform benchmark'
required: false
type: choice
options:
- 'cohere-wiki-100k-no-filters-2'
- 'laion-small-clip-no-filters-2'
schedule:
# Run every day at 3am
- cron: "0 3 * * *"
@@ -30,7 +45,10 @@ jobs:
- name: Run bench
id: hnsw-indexing-update
run: |
cd ansible/playbooks && ansible-playbook playbook-hnsw-index.yml --extra-vars "bench=update"
cd ansible/playbooks && ansible-playbook playbook-hnsw-index.yml --extra-vars "
bench=update
dataset_name=dbpedia-openai-100K-1536-angular
"

runTransformHealingBenchmark:
runs-on: ubuntu-latest
@@ -50,4 +68,8 @@ jobs:
- name: Run bench
id: hnsw-indexing-transform
run: |
cd ansible/playbooks && ansible-playbook playbook-hnsw-index.yml --extra-vars "bench=transform"
cd ansible/playbooks && ansible-playbook playbook-hnsw-index.yml --extra-vars "
bench=transform
dataset_name=${{ inputs.dataset_name || 'laion-small-clip-no-filters-1' }}
dataset_2_name=${{ inputs.dataset_2_name || 'laion-small-clip-no-filters-2' }}
"
7 changes: 7 additions & 0 deletions ansible/README.md
@@ -3,6 +3,7 @@
### Prerequisites
* ssh keys (to connect to the remote machines)
* inventory.ini (to define the actual machine on which the benchmark is run)
* datasets.yml (to define the datasets used in the benchmark)

Add inventory.ini in [ansible/playbooks/](playbooks) with the following content:
```ini
@@ -13,6 +14,12 @@ benchmark-machine ansible_host=${YOUR_SERVER_IP} ansible_user=${YOUR_USER}
benchmark-db ansible_host=${YOUR_SERVER_IP} ansible_user=${YOUR_USER}
```

Convert [datasets/datasets.json](../datasets/datasets.json) into datasets.yml in [ansible/playbooks/group_vars](playbooks/group_vars).
You can use `yq` for it. Note that the yaml should start with `datasets:`. From [ansible](.) run:
```bash
yq -p json -o=yaml ../datasets/datasets.json >> playbooks/group_vars/datasets.yml
```

### Run ansible inside Docker
Ensure the ssh keys are properly mounted into the container.

3 changes: 3 additions & 0 deletions ansible/playbooks/group_vars/datasets.yml
@@ -0,0 +1,3 @@
# you can populate this file running a command from project root (yq should be installed):
# yq -p json -o=yaml datasets/datasets.json >> ansible/playbooks/group_vars/datasets.yml
datasets:
4 changes: 1 addition & 3 deletions ansible/playbooks/group_vars/hnsw-indexing-transform.yml
@@ -1,10 +1,8 @@
qdrant_python_client_version: "1.14.0"
logging_dir: "/tmp/logs"
working_dir: "/tmp/experiments"
dataset_url: "https://storage.googleapis.com/ann-filtered-benchmark/datasets/laion-small-clip-no-filters-1.tgz"
# Default dataset values (can be overridden via --extra-vars)
dataset_name: "laion-small-clip-no-filters-1"
dataset_dim: "512"
dataset_2_url: "https://storage.googleapis.com/ann-filtered-benchmark/datasets/laion-small-clip-no-filters-2.tgz"
dataset_2_name: "laion-small-clip-no-filters-2"
servers:
- name: "qdrant"
3 changes: 1 addition & 2 deletions ansible/playbooks/group_vars/hnsw-indexing-update.yml
@@ -1,9 +1,8 @@
qdrant_python_client_version: "1.14.0"
logging_dir: "/tmp/logs"
working_dir: "/tmp/experiments"
dataset_url: "https://storage.googleapis.com/ann-filtered-benchmark/datasets/dbpedia_openai_100K.tgz"
# Default dataset value (can be overridden via --extra-vars)
dataset_name: "dbpedia_openai_100K"
dataset_dim: "1536"
servers:
- name: "qdrant"
registry: "ghcr.io"
3 changes: 3 additions & 0 deletions ansible/playbooks/playbook-hnsw-index.yml
@@ -6,6 +6,9 @@
- name: Load common variables
include_vars: "group_vars/hnsw-indexing-{{ bench | default('update') }}.yml"

- name: Load datasets variables
include_vars: "group_vars/datasets.yml"

- name: Ensure necessary packages are installed
ansible.builtin.package:
name: "{{ item }}"
19 changes: 16 additions & 3 deletions ansible/playbooks/roles/run-hnsw-indexing-transform/tasks/main.yml
@@ -13,14 +13,22 @@
- "{{ working_dir }}/data/{{ dataset_name }}"
- "{{ working_dir }}/data/{{ dataset_2_name }}"

- name: Get dataset info for first dataset
ansible.builtin.set_fact:
dataset_1_info: "{{ datasets | selectattr('name', 'equalto', dataset_name) | first }}"

- name: Get dataset info for second dataset
ansible.builtin.set_fact:
dataset_2_info: "{{ datasets | selectattr('name', 'equalto', dataset_2_name) | first }}"

- name: Check if the dataset archive already exists
ansible.builtin.stat:
path: "{{ working_dir }}/data/{{ dataset_name }}.tgz"
register: archive_stat

- name: Download the archive
ansible.builtin.get_url:
url: "{{ dataset_url }}"
url: "{{ dataset_1_info.link }}"
dest: "{{ working_dir }}/data/{{ dataset_name }}.tgz"
when: not archive_stat.stat.exists

@@ -45,7 +53,7 @@

- name: Download the second archive
ansible.builtin.get_url:
url: "{{ dataset_2_url }}"
url: "{{ dataset_2_info.link }}"
dest: "{{ working_dir }}/data/{{ dataset_2_name }}.tgz"
when: not archive_2_stat.stat.exists

@@ -63,6 +71,11 @@
owner: "{{ ansible_user }}"
when: dest_2_dir_contents.matched == 0

- name: Set dataset dimensions
ansible.builtin.set_fact:
dataset_dim: "{{ dataset_1_info.vector_size }}"
dataset_2_dim: "{{ dataset_2_info.vector_size }}"

- name: Prepare and execute the benchmark
ansible.builtin.include_role:
name: run-hnsw-indexing-common
name: run-hnsw-indexing-common
10 changes: 9 additions & 1 deletion ansible/playbooks/roles/run-hnsw-indexing-update/tasks/main.yml
@@ -12,14 +12,18 @@
- "{{ working_dir }}/data"
- "{{ working_dir }}/data/{{ dataset_name }}"

- name: Get dataset info
ansible.builtin.set_fact:
dataset_info: "{{ datasets | selectattr('name', 'equalto', dataset_name) | first }}"

- name: Check if the dataset archive already exists
ansible.builtin.stat:
path: "{{ working_dir }}/data/{{ dataset_name }}.tgz"
register: archive_stat

- name: Download the archive
ansible.builtin.get_url:
url: "{{ dataset_url }}"
url: "{{ dataset_info.link }}"
dest: "{{ working_dir }}/data/{{ dataset_name }}.tgz"
when: not archive_stat.stat.exists

@@ -37,6 +41,10 @@
owner: "{{ ansible_user }}"
when: dest_dir_contents.matched == 0

- name: Set dataset dimension
ansible.builtin.set_fact:
dataset_dim: "{{ dataset_info.vector_size }}"

- name: Prepare and execute the benchmark
ansible.builtin.include_role:
name: run-hnsw-indexing-common
32 changes: 32 additions & 0 deletions datasets/datasets.json
@@ -362,5 +362,37 @@
"type": "tar",
"link": "https://storage.googleapis.com/ann-filtered-benchmark/datasets/cohere-wiki-50m-test-only.tgz",
"path": "cohere-wiki-50m/cohere_wiki_50m"
},
{
"name": "cohere-wiki-100k-no-filters",
"vector_size": 768,
"distance": "cosine",
"type": "tar",
"link": "https://storage.googleapis.com/ann-filtered-benchmark/datasets/cohere-wiki-100k-no-filters.tgz",
"path": "cohere-wiki-100k/cohere_wiki_100k_no_filters"
},
{
"name": "cohere-wiki-100k-no-filters-2",
"vector_size": 768,
"distance": "cosine",
"type": "tar",
"link": "https://storage.googleapis.com/ann-filtered-benchmark/datasets/cohere-wiki-100k-no-filters-2.tgz",
"path": "cohere-wiki-100k/cohere_wiki_100k_no_filters_2"
},
{
"name": "laion-small-clip-no-filters-1",
"vector_size": 512,
"distance": "cosine",
"type": "tar",
"path": "laion-small-clip/laion-small-clip-no-filters-1",
"link": "https://storage.googleapis.com/ann-filtered-benchmark/datasets/laion-small-clip-no-filters-1.tgz"
},
{
"name": "laion-small-clip-no-filters-2",
"vector_size": 512,
"distance": "cosine",
"type": "tar",
"path": "laion-small-clip/laion-small-clip-no-filters-2",
"link": "https://storage.googleapis.com/ann-filtered-benchmark/datasets/laion-small-clip-no-filters-2.tgz"
}
]
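
The role changes in this PR replace the hard-coded `dataset_url` and `dataset_dim` with a lookup into the `datasets` list: `datasets | selectattr('name', 'equalto', dataset_name) | first`, followed by reads of `.link` and `.vector_size`. As a rough illustration of that lookup outside Ansible, here is a minimal Python sketch. It is not part of the PR: `find_dataset` is a hypothetical helper name, and the two entries are copied from the `datasets/datasets.json` hunk above. Unlike the Jinja expression, the sketch raises an explicit error when no entry matches the requested name.

```python
import json

# Two entries copied from the datasets/datasets.json hunk in this PR.
DATASETS_JSON = """
[
  {
    "name": "cohere-wiki-100k-no-filters",
    "vector_size": 768,
    "distance": "cosine",
    "type": "tar",
    "link": "https://storage.googleapis.com/ann-filtered-benchmark/datasets/cohere-wiki-100k-no-filters.tgz",
    "path": "cohere-wiki-100k/cohere_wiki_100k_no_filters"
  },
  {
    "name": "laion-small-clip-no-filters-1",
    "vector_size": 512,
    "distance": "cosine",
    "type": "tar",
    "path": "laion-small-clip/laion-small-clip-no-filters-1",
    "link": "https://storage.googleapis.com/ann-filtered-benchmark/datasets/laion-small-clip-no-filters-1.tgz"
  }
]
"""


def find_dataset(datasets, name):
    """Hypothetical helper: return the first entry whose "name" equals `name`.

    Mirrors `selectattr('name', 'equalto', name) | first`, but raises a
    KeyError with the known names when nothing matches.
    """
    match = next((d for d in datasets if d.get("name") == name), None)
    if match is None:
        known = ", ".join(d["name"] for d in datasets)
        raise KeyError(f"dataset {name!r} not found; known datasets: {known}")
    return match


datasets = json.loads(DATASETS_JSON)
info = find_dataset(datasets, "laion-small-clip-no-filters-1")

# These correspond to the values the roles now derive from the lookup.
dataset_url = info["link"]
dataset_dim = info["vector_size"]
print(dataset_url, dataset_dim)
```

A name that is absent from the list, for example a value passed via `--extra-vars` that has no entry in `datasets.json`, makes `find_dataset` fail immediately with the list of known names, which may be a useful check to run before launching the playbook.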