Options for setting up DVC for data / pipelines (#36)
* move stray visualisation code into the new layout

* test fixtures for the whole image decollage

* resolve merge issues in conftest.py, add decollage tests

* down to one remaining test failure (underlying exiftool)

* remove coverage report step, run tests on push

* whitespace change to trigger action (hopefully)

* remove branch/path filters - why is action not running

* fix environment name in pipeline

* replace test workflow yaml with the python-template one

* move to the reusable workflow layout

* - to _

* html -> yml - probably time for a walk

* black -> ruff and add config, deps in pyproject.toml

* dev -> lint,test in pyproject.toml for consistency

* tweaks for ruff standard linting, mostly return types

* Well why doesn't ruff check warn about ruff format?

* missing deps for this style of test coverage

* catch last references to direct import of scivision

* import order! need those commit hooks to persevere with ruff

* unpin chromadb, packaging issue is fixed there

* add exiftool install to workflow runner

* add note on exiftool installation to README

* Add a link trail to the README

* replace the s3fs interface with boto3

* Initialise DVC

* Test approach of adding DVC remote tracking to git

* Uses import-url to do this
* Adds a bash script to automate it
* Documents the process

* looking at how `dvc add` roundtrips data

* Update the DVC usage examples to show data upload

* remove the last traces of intake before writing pipelines

* Add a two-stage pipeline for extracting embeddings

* More notes on pipeline stages, and a dvc lockfile

* fix issue where cluster images displayed twice

* added the image list to session_state rather than calling the render
  from the selectbox state change

* embeddings collection config via .env (more reusable)

* dependency is Pillow, not PIL

* Add a dvc stage to build a cluster model, but it's failing

* was missing call to main() - long week

* remove stray references to intake

* scikit-image does need to be here

* take bucket name from params, reuse for embeddings collection

* Add more notes as we see what dvc repro does

* spell out the model saving path

* convert 16 bit greyscale images to rgb

* add a single grayscale image to test fixtures

* convert output to RGB for display as well

* back out of the approach using git to store DVC tracking data

* switch between embeddings collections in the demo UI
metazool authored Oct 3, 2024
1 parent 21e8e53 commit 8f39f39
Showing 31 changed files with 454 additions and 206 deletions.
3 changes: 3 additions & 0 deletions .dvc/.gitignore
@@ -0,0 +1,3 @@
/config.local
/tmp
/cache
5 changes: 5 additions & 0 deletions .dvc/config
@@ -0,0 +1,5 @@
[core]
remote = jasmin
['remote "jasmin"']
url = s3://metadata
endpointurl = https://fw-plankton-o.s3-ext.jc.rl.ac.uk
3 changes: 3 additions & 0 deletions .dvcignore
@@ -0,0 +1,3 @@
# Add patterns of files dvc should ignore, which could improve
# the performance. Learn more at
# https://dvc.org/doc/user-guide/dvcignore
3 changes: 3 additions & 0 deletions .gitignore
@@ -5,3 +5,6 @@ vectors/
*.ipynb
*.egg-info/
venv/
data/**
/vectors
/models
116 changes: 116 additions & 0 deletions DVC.md
@@ -0,0 +1,116 @@
# Data Version Control

We're trying DVC (Data Version Control) in this project, for versioning data and ML models.

There's little here on the DVC side as yet - links and notes in the README about following the approach already being used for LLM testing and fine-tuning, and how we might set DVC up to manage the collection "externally" (keeping the data in s3 and the metadata in source control).

Other ecosystems like [RO-crate](https://www.researchobject.org/ro-crate/) and [outpack](https://github.com/mrc-ide/outpack_server) share a lot of the same aims, but are more focused on research data and possibly have stronger community connections. For ML pipeline projects, though, DVC is the mature option.

## Walkthrough

Following the [DVC Getting Started](https://github.com/iterative/dvc.org/blob/main/content/docs/start/index.md)

```
dvc init
git add .dvc/config
git add .dvc/.gitignore
```

Add our JASMIN object store as a DVC remote. Use an existing bucket for simplicity; this limits us to a single collection, but that's already the case.

For non-AWS stores, `endpointurl` is needed - set this to the same object store URL defined in `.env` as `AWS_URL_ENDPOINT`.

```
dvc remote add -d jasmin s3://metadata
dvc remote modify jasmin endpointurl https://fw-plankton-o.s3-ext.jc.rl.ac.uk
```

Add the access key / secret key pair as documented in the llm_eval project. These values are also defined in `.env`, and the `--local` switch prevents accidentally committing them to git:

```
dvc remote modify --local jasmin access_key_id [our access key]
dvc remote modify --local jasmin secret_access_key [our secret key]
```
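
For reference, a minimal `.env` carrying these values might look like the sketch below. Only `AWS_URL_ENDPOINT` is a name the scripts actually read; the credential variable names here are illustrative.

```
# Illustrative values - never commit real credentials
AWS_URL_ENDPOINT=https://fw-plankton-o.s3-ext.jc.rl.ac.uk
AWS_ACCESS_KEY_ID=[our access key]
AWS_SECRET_ACCESS_KEY=[our secret key]
```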

Test that it works:

`dvc push`

### Add data

Our images are already in the same bucket as this `dvc` remote. Try [import-url](https://dvc.org/doc/command-reference/import-url) to add an image in remote storage to version control. TODO - ask someone with permissions to add a dedicated bucket.

There are two ways of doing this:

* `dvc import-url` - supports a `--no-download` option, and creates a `.dvc` tracking file per object which can be added to our git repo.
* `dvc stage add` - writes the tracking information into `dvc.yaml`, but assumes we're downloading the data and uploading it to the remote.

We could also use the `--to-remote` option to transfer the data to remote storage in JASMIN's object store, but we already have copies of the data in s3.

`dvc add / dvc push` would be the pattern to use where we have data in a filesystem (in a JASMIN Group Workspace, or locally) and want to store and track canonical copies of it in an object store, and reuse those in experiment pipeline stages.

#### Example for `dvc add`

Files are on our local system. They might already be version controlled in a git repository!

This uploads an image file to our object storage, and creates a `.dvc` metadata file in our local directory, alongside the image:

```
dvc add tests/fixtures/test_images/testymctestface_113.tif --to-remote
git add tests/fixtures/test_images/testymctestface_113.tif.dvc
git commit -m "add dvc metadata"
git push origin our_branch
```
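
The generated `.dvc` file is just a small YAML pointer to the data, roughly like this (the hash and size values here are made up):

```
outs:
- md5: d41d8cd98f00b204e9800998ecf8427e
  size: 123456
  path: testymctestface_113.tif
```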

Now in a completely separate checkout of the git repository, with the dvc remote set up to point to the same storage, we can

`dvc pull`

And this downloads the copy of the image linked to the metadata. It's quite nice!

#### Script to `import-url`

A bash script is provided to automate this - it reads all the filenames from the CSV that intake was using as a catalog, uses `dvc import-url` to create tracking information for them in a `data` directory, and then commits all the tracking information to this git repository:

`bash scripts/intake_to_dvc.sh`
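
A single `import-url` call of the kind the script automates might look something like this (the URL and output path are illustrative):

```
dvc import-url --no-download \
    https://fw-plankton-o.s3-ext.jc.rl.ac.uk/untagged-images-lana/some_image_001.tif \
    data/some_image_001.tif
git add data/some_image_001.tif.dvc
```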

### Define pipeline stages

We want a [dvc.yaml](https://dvc.org/doc/user-guide/project-structure/dvcyaml-files) to keep our pipeline definition(s) in.

Each script is converted to a stage; the output directory of one stage (declared with the `-o` switch) is passed as a dependency of the next (the `-d` switch).

There's also the option of a `params.yaml`, referenced with the `-p` switch, which stores hyperparameters / initialisation values per stage.
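
A `params.yaml` matching the keys the scripts further down read might look like this (the values shown are the in-code defaults):

```
collection: untagged-images-lana
cluster:
  n_clusters: 5
```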

Use `dvc` to chain the existing scripts together into a pipeline:

`cd scripts` - write a `dvc.yaml` into this directory.

Rebuild the index of images in our s3 store:

`dvc stage add -n index python image_metadata.py`

Use that index to extract and store embeddings from images:

`dvc stage add -n embeddings -o ../vectors python image_embeddings.py`

Then check we can run our two-stage pipeline:

`dvc repro`

This creates a `dvc.lock` to commit to the repository, and suggests a `dvc push` which sends some amount of experiment metadata to the remote.

When we run `dvc repro` again, the second stage detects no change and doesn't re-run; but as our first stage only wrote a file back to `s3`, not to the filesystem, this may not be the behaviour we want.

Now its output path `../vectors` is available to use as input to a model-building stage.

Add a script that fits a K-means model from the image embeddings and saves it (hoping it saves automatically into `../models`)

`dvc stage add -n cluster -d ../vectors -o ../models python cluster.py`
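
With the dependencies and outputs wired in, the generated `dvc.yaml` should look roughly like this (the version committed on this branch still has the `deps` / `outs` entries commented out):

```
stages:
  index:
    cmd: python image_metadata.py
  embeddings:
    cmd: python image_embeddings.py
    outs:
      - ../vectors
  cluster:
    cmd: python cluster.py
    deps:
      - ../vectors
    outs:
      - ../models
```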

`dvc repro` at this point does want to run the image embeddings again; it's not clear why... a code change?





2 changes: 1 addition & 1 deletion README.md
@@ -94,7 +94,7 @@ For more information see the [Jupytext docs](https://jupytext.readthedocs.io/en/
Streamlit app based off the [text embeddings for EIDC catalogue metadata](https://github.com/NERC-CEH/embeddings_app/) one

```
streamlit run cyto_ml/visualisation/visualisation_app.py
streamlit run src/cyto_ml/visualisation/app.py
```

The demo should automatically open in your browser when you run streamlit. If it does not, connect using: http://localhost:8501.
6 changes: 3 additions & 3 deletions environment.yml
@@ -9,18 +9,18 @@ dependencies:
- black
- boto3
- chromadb
- dvc[s3]
- flake8
- intake-xarray
- intake=0.7
- isort
- jupyterlab
- jupytext
- matplotlib
- pandas
- Pillow
- pytest
- python-dotenv
- scikit-image
- scikit-learn
- scikit-image
- xarray
- pip
- streamlit
10 changes: 6 additions & 4 deletions pyproject.toml
@@ -9,18 +9,20 @@ requires-python = ">=3.12"
description = "This package supports the processing and analysis of plankton sample data"
readme = "README.md"
dependencies = [
"boto3",
"boto3",
"chromadb",
"dvc[s3]",
"imagecodecs",
"intake==0.7.0",
"intake-xarray",
"pandas",
"Pillow",
"plotly",
"pyexiftool",
"python-dotenv",
"scikit-image", # secretly required by intake-xarray as default reader
"scikit-image",
"scikit-learn",
"streamlit",
"torch",
"torchvision",
"xarray",
"resnet50-cefas@git+https://github.com/jmarshrossney/resnet50-cefas",
]
43 changes: 43 additions & 0 deletions scripts/cluster.py
@@ -0,0 +1,43 @@
"""Tiny DVC stage that trains and saves a K-means model from embeddings"""

import logging
import os
import pickle

import yaml
from sklearn.cluster import KMeans

from cyto_ml.data.vectorstore import embeddings, vector_store


def main() -> None:

    os.makedirs("../models", exist_ok=True)

    # You can supply -p params to dvc as an alternative to params.yaml
    # But this (from the example) suggests they don't get overridden?
    params = yaml.safe_load(open("params.yaml"))
    collection_name = params.get("collection", "untagged-images-lana")
    try:
        stage_params = params["cluster"]
        n_clusters = stage_params["n_clusters"]
    except KeyError:
        logging.info("No parameters for stage found - default to 5 clusters")
        n_clusters = 5

    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    store = vector_store(collection_name)
    X = embeddings(store)
    kmeans.fit(X)

    # We supply a -o for output directory - this doesn't ensure we write there.
    # The examples show the path hard-coded in the script, too:
    # https://dvc.org/doc/start/data-pipelines/data-pipelines
    # The output directory will be deleted at the start of the stage;
    # it's the script's responsibility to ensure it's recreated.

    with open(f"../models/kmeans-{collection_name}.pkl", "wb") as f:
        pickle.dump(kmeans, f)


if __name__ == "__main__":
    main()
8 changes: 8 additions & 0 deletions scripts/dvc.lock
@@ -0,0 +1,8 @@
schema: '2.0'
stages:
  index:
    cmd: python image_metadata.py
  embeddings:
    cmd: python image_embeddings.py
  cluster:
    cmd: python cluster.py
13 changes: 13 additions & 0 deletions scripts/dvc.yaml
@@ -0,0 +1,13 @@
stages:
  index:
    cmd: python image_metadata.py
  embeddings:
    cmd: python image_embeddings.py
    #outs:
    #- ../vectors
  cluster:
    cmd: python cluster.py
    #deps:
    #- ../vectors
    #outs:
    #- ../models
40 changes: 18 additions & 22 deletions scripts/image_embeddings.py
@@ -1,16 +1,16 @@
"""Try to use the scivision pretrained model and tools against this collection"""
"""Extract and store image embeddings from a collection in s3,
using an off-the-shelf pre-trained model"""

import os
import logging
import yaml
from dotenv import load_dotenv
from cyto_ml.models.scivision import (
prepare_image,
flat_embeddings,
)
from cyto_ml.models.utils import flat_embeddings
from cyto_ml.data.image import load_image_from_url

from resnet50_cefas import load_model
from cyto_ml.data.vectorstore import vector_store
from intake import open_catalog
from intake_xarray import ImageSource
import pandas as pd

logging.basicConfig(level=logging.info)
load_dotenv()
@@ -19,32 +19,27 @@
if __name__ == "__main__":

# Limited to the Lancaster FlowCam dataset for now:
catalog = "untagged-images-lana/intake.yml"
dataset = open_catalog(f"{os.environ.get('ENDPOINT')}/{catalog}")
collection = vector_store("plankton")
image_bucket = yaml.safe_load(open("params.yaml"))["collection"]
catalog = f"{image_bucket}/catalog.csv"

model = load_model(strip_final_layer=True)
file_index = f"{os.environ.get('AWS_URL_ENDPOINT')}/{catalog}"
df = pd.read_csv(file_index)

plankton = (
dataset.plankton().to_dask().compute()
) # this will read a CSV with image locations as a dask dataframe
collection = vector_store(image_bucket)

# Feels like this is doing dask wrong, compute() should happen later
# If it doesn't, there are complaints about meta= return value inference
# that suggest this is wrongheaded use of `apply`: need to learn better patterns
# So this is a kludge, but we're still very much in prototype territory -
# Come back and refine this if the next parts work!
model = load_model(strip_final_layer=True)

def store_embeddings(row):
try:
image_data = ImageSource(row.Filename).to_dask()
image_data = load_image_from_url(row.Filename)
except ValueError as err:
# TODO diagnose and fix for this happening, in rare circumstances:
# (would be nice to know rather than just buffer the image and add code)
# File "python3.9/site-packages/PIL/PcdImagePlugin.py", line 34, in _open
# self.fp.seek(2048)
# File "python3.9/site-packages/fsspec/implementations/http.py", line 745, in seek
# raise ValueError("Cannot seek streaming HTTP file")
# Is this still reproducible? - JW
logging.info(err)
logging.info(row.Filename)
return
@@ -53,7 +48,7 @@ def store_embeddings(row):
logging.info(row.Filename)
return

embeddings = flat_embeddings(model(prepare_image(image_data)))
embeddings = flat_embeddings(model(image_data))

collection.add(
documents=[row.Filename],
@@ -62,4 +57,5 @@ def store_embeddings(row):
# Note - optional arg name is "metadatas" (we don't have any)
)

plankton.apply(store_embeddings, axis=1)
for _, row in df.iterrows():
store_embeddings(row)
20 changes: 20 additions & 0 deletions scripts/image_metadata.py
@@ -0,0 +1,20 @@
"""Create a basic index for the images in an s3 collection
"""

import yaml
from cyto_ml.data.s3 import boto3_client, image_index


if __name__ == "__main__":

    # Write a minimal CSV index of images in a bucket
    # Was originally part of an intake catalogue setup
    image_bucket = yaml.safe_load(open("params.yaml"))["collection"]

    metadata = image_index(image_bucket)

    s3 = boto3_client()

    catalog_csv = metadata.to_csv(index=False)

    s3.put_object(Bucket=image_bucket, Key="catalog.csv", Body=catalog_csv)