Options for setting up DVC for data / pipelines (#36)
* move stray visualisation code into the new layout
* test fixtures for the whole image decollage
* resolve merge issues in conftest.py, add decollage tests
* down to one remaining test failure (underlying exiftool)
* remove coverage report step, run tests on push
* whitespace change to trigger action (hopefully)
* remove branch/path filters - why is action not running
* fix environment name in pipeline
* replace test workflow yaml with the python-template one
* move to the reusable workflow layout
* - to _
* html -> yml - probably time for a walk
* black -> ruff and add config, deps in pyproject.toml
* dev -> lint,test in pyproject.toml for consistency
* tweaks for ruff standard linting, mostly return types
* Well why doesn't ruff check warn about ruff format?
* missing deps for this style of test coverage
* catch last references to direct import of scivision
* import order! need those commit hooks to persevere with ruff
* unpin chromadb, packaging issue is fixed there
* add exiftool install to workflow runner
* add note on exiftool installation to README
* Add a link trail to the README
* replace the s3fs interface with boto3
* Initialise DVC
* Test approach of adding DVC remote tracking to git
* Uses import-url to do this
* Adds a bash script to automate it
* Documents the process
* looking at how `dvc add` roundtrips data
* Update the DVC usage examples to show data upload
* remove the last traces of intake before writing pipelines
* Add a two-stage pipeline for extracting embeddings
* More notes on pipeline stages, and a dvc lockfile
* fix issue where cluster images displayed twice
* added the image list to session_state rather than calling the render from the selectbox state change
* embeddings collection config via .env (more reusable)
* dependency is Pillow, not PIL
* Add a dvc stage to build a cluster model, but it's failing
* was missing call to main() - long week
* remove stray references to intake
* scikit-image does need to be here
* take bucket name from params, reuse for embeddings collection
* Add more notes as we see what dvc repro does
* spell out the model saving path
* convert 16 bit greyscale images to rgb
* add a single grayscale image to test fixtures
* convert output to RGB for display as well
* back out of the approach using git to store DVC tracking data
* switch between embeddings collections in the demo UI
Showing 31 changed files with 454 additions and 206 deletions.
`.dvc/.gitignore`
@@ -0,0 +1,3 @@
/config.local
/tmp
/cache
`.dvc/config`
@@ -0,0 +1,5 @@
[core]
    remote = jasmin
['remote "jasmin"']
    url = s3://metadata
    endpointurl = https://fw-plankton-o.s3-ext.jc.rl.ac.uk
`.dvcignore`
@@ -0,0 +1,3 @@
# Add patterns of files dvc should ignore, which could improve
# the performance. Learn more at
# https://dvc.org/doc/user-guide/dvcignore
`.gitignore`
@@ -5,3 +5,6 @@ vectors/
*.ipynb
*.egg-info/
venv/
data/**
/vectors
/models
`DVC.md`

@@ -0,0 +1,116 @@
# Data Version Control

We're trying DVC (Data Version Control) in this project, for versioning data and ML models.

There's little here on the DVC side as yet - links and notes in the README about following the approach used here for LLM testing and fine-tuning, and about how we might set DVC up to manage the collection "externally" (keeping the data in s3 and the metadata in source control).

Other ecosystems like [RO-crate](https://www.researchobject.org/ro-crate/) and [outpack](https://github.com/mrc-ide/outpack_server) share many of the same aims, but are more focused on research data, with possibly stronger community connections. For ML pipeline projects, though, DVC is the mature option.
## Walkthrough

Following the [DVC Getting Started](https://github.com/iterative/dvc.org/blob/main/content/docs/start/index.md) guide:

```
dvc init
git add .dvc/config
git add .dvc/.gitignore
```
Add our JASMIN object store as a DVC remote. Use an existing bucket for simplicity; this limits us to a single collection, but that's already the case.

For non-AWS stores, `endpointurl` is also needed - set it to the same object store URL defined in `.env` as `AWS_URL_ENDPOINT`:

```
dvc remote add -d jasmin s3://metadata
dvc remote modify jasmin endpointurl https://fw-plankton-o.s3-ext.jc.rl.ac.uk
```
Add an access key / secret key pair as documented in the llm_eval project. These values are also defined in `.env`, and the `--local` switch prevents accidentally committing them to git:

```
dvc remote modify --local jasmin access_key_id [our access key]
dvc remote modify --local jasmin secret_access_key [our secret key]
```
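The `--local` values are written to `.dvc/config.local`, which `dvc init` has already git-ignored for us. With the same placeholders, it ends up looking something like this:

```
['remote "jasmin"']
    access_key_id = [our access key]
    secret_access_key = [our secret key]
```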
Test that it works:

`dvc push`
### Add data

Our images are already in the same bucket as this `dvc` remote. Try [import-url](https://dvc.org/doc/command-reference/import-url) to add an image in remote storage to version control. TODO - ask someone with permissions to add a dedicated bucket.

There are two ways of doing this:

* `dvc import-url` - supports a `--no-download` option, and creates a `.dvc` tracking file per object which can be added to our git repo; see the sketch after this list.
* `dvc stage add` - writes the tracking information into `dvc.yaml`, but assumes we're downloading the data and uploading it to the remote.

We could also use the `--to-remote` option to transfer the data to remote storage in JASMIN's object store, but we already have copies of the data in s3.
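As a sketch of the first option (the object name here is hypothetical), this creates a `data/example_image.tif.dvc` tracking file without fetching the image itself:

```
dvc import-url --no-download s3://untagged-images-lana/example_image.tif data/example_image.tif
```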
`dvc add` / `dvc push` would be the pattern to use where we have data in a filesystem (in a JASMIN Group Workspace, or locally) and want to store and track canonical copies of it in an object store, and reuse those in experiment pipeline stages.

#### Example for `dvc add`

Files are on our local system. They might already be version controlled in a git repository!

This uploads an image file to our object storage, and creates a `.dvc` metadata file in our local directory, alongside the image:

```
dvc add tests/fixtures/test_images/testymctestface_113.tif --to-remote
git add tests/fixtures/test_images/testymctestface_113.tif.dvc
git commit -m "add dvc metadata"
git push origin our_branch
```
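The `.dvc` file itself is just a small YAML stub; with made-up hash and size values, it looks roughly like:

```
outs:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d
  size: 123456
  path: testymctestface_113.tif
```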
Now, in a completely separate checkout of the git repository, with the dvc remote set up to point at the same storage, we can run

`dvc pull`

and this downloads the copy of the image linked to the metadata. It's quite nice!
#### Script to `import-url`

A bash script is provided to automate this - it reads all the filenames from the CSV that intake was using as a catalog, uses `dvc import-url` to create tracking information for them in a `data` directory, and then commits all the tracking information to this git repository:

`bash scripts/intake_to_dvc.sh`
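As a rough sketch of what a script like this needs to do (the catalog filename and CSV layout are assumptions here, not the script's actual contents):

```
# Read object names from the first column of the old intake catalog,
# skipping the header row, and create a .dvc tracking file per image
# under data/ without downloading anything.
tail -n +2 catalog.csv | while IFS=, read -r filename _rest; do
  dvc import-url --no-download "s3://untagged-images-lana/${filename}" "data/${filename}"
done
git add data/*.dvc
git commit -m "Add DVC tracking information for the image collection"
```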
### Define pipeline stages

We want a [dvc.yaml](https://dvc.org/doc/user-guide/project-structure/dvcyaml-files) to keep our pipeline definition(s) in.

Each script is converted to a stage; pass the output of one as the input of the next as a directory path (the `-d` switch).

Optionally, a `params.yaml` (used with the `-p` switch) stores hyperparameters / initialisation values per stage.
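For example, a `params.yaml` consistent with what `cluster.py` reads would be:

```
collection: untagged-images-lana
cluster:
  n_clusters: 5
```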
Use `dvc` to chain the existing scripts together into a pipeline:

`cd scripts` - we write a `dvc.yaml` into this directory.

Rebuild the index of images in our s3 store:

`dvc stage add -n index python image_metadata.py`

Use that index to extract and store embeddings from the images:

`dvc stage add -n embeddings -o ../vectors python image_embeddings.py`
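Each `dvc stage add` writes an entry into `dvc.yaml`, so after these two commands the file should look roughly like:

```
stages:
  index:
    cmd: python image_metadata.py
  embeddings:
    cmd: python image_embeddings.py
    outs:
    - ../vectors
```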
Then check we can run our two-stage pipeline:

`dvc repro`

This creates a `dvc.lock` to commit to the repository, and suggests a `dvc push`, which sends some amount of experiment metadata to the remote.

When we run `dvc repro` again, the second stage detects no change and doesn't re-run; but as our first stage only wrote a file back to s3, not to the filesystem, this may not be the behaviour we want.

Now the embeddings stage's output path `../vectors` is available to use as the input to a model-building stage.

Add a script that fits a K-means model from the image embeddings and saves it (hoping it saves automatically into `../models`):

`dvc stage add -n cluster -d ../vectors -o ../models python cluster.py`

`dvc repro` at this point does want to run the image embeddings again; it's not clear why... code change?
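One way to investigate (not something we've confirmed here) is to ask DVC what it thinks has changed:

```
# Report stages whose dependencies or outputs no longer match dvc.lock
dvc status
# Compare the recorded hashes against what's committed
git diff dvc.lock
```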
`scripts/cluster.py`
@@ -0,0 +1,43 @@
"""Tiny DVC stage that trains and saves a K-means model from embeddings"""

import logging
import os
import pickle

import yaml
from sklearn.cluster import KMeans

from cyto_ml.data.vectorstore import embeddings, vector_store


def main() -> None:
    os.makedirs("../models", exist_ok=True)

    # You can supply -p params to dvc as an alternative to params.yaml
    # But this (from the example) suggests they don't get overridden?
    with open("params.yaml") as f:
        params = yaml.safe_load(f)
    collection_name = params.get("collection", "untagged-images-lana")
    try:
        n_clusters = params["cluster"]["n_clusters"]
    except KeyError:
        logging.info("No parameters for stage found - default to 5 clusters")
        n_clusters = 5

    # Fit a K-means model to the embeddings stored for this collection
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    store = vector_store(collection_name)
    X = embeddings(store)
    kmeans.fit(X)

    # We supply a -o for output directory - this doesn't ensure we write there.
    # The examples show the path hard-coded in the script, too:
    # https://dvc.org/doc/start/data-pipelines/data-pipelines
    # The output directory will be deleted at the start of the stage;
    # it's the script's responsibility to ensure it's recreated.
    with open(f"../models/kmeans-{collection_name}.pkl", "wb") as f:
        pickle.dump(kmeans, f)


if __name__ == "__main__":
    main()
`scripts/dvc.lock`
@@ -0,0 +1,8 @@
schema: '2.0'
stages:
  index:
    cmd: python image_metadata.py
  embeddings:
    cmd: python image_embeddings.py
  cluster:
    cmd: python cluster.py
`scripts/dvc.yaml`
@@ -0,0 +1,13 @@
stages:
  index:
    cmd: python image_metadata.py
  embeddings:
    cmd: python image_embeddings.py
    #outs:
    #- ../vectors
  cluster:
    cmd: python cluster.py
    #deps:
    #- ../vectors
    #outs:
    #- ../models
`scripts/image_metadata.py`
@@ -0,0 +1,20 @@
"""Create a basic index for the images in an s3 collection"""

import yaml

from cyto_ml.data.s3 import boto3_client, image_index


if __name__ == "__main__":
    # Write a minimal CSV index of images in a bucket.
    # This was originally part of an intake catalogue setup.
    with open("params.yaml") as f:
        image_bucket = yaml.safe_load(f)["collection"]

    metadata = image_index(image_bucket)
    catalog_csv = metadata.to_csv(index=False)

    s3 = boto3_client()
    s3.put_object(Bucket=image_bucket, Key="catalog.csv", Body=catalog_csv)