Options for setting up DVC for data / pipelines (#36)
* move stray visualisation code into the new layout

* test fixtures for the whole image decollage

* resolve merge issues in conftest.py, add decollage tests

* down to one remaining test failure (underlying exiftool)

* remove coverage report step, run tests on push

* whitespace change to trigger action (hopefully)

* remove branch/path filters - why is action not running

* fix environment name in pipeline

* replace test workflow yaml with the python-template one

* move to the reusable workflow layout

* - to _

* html -> yml - probably time for a walk

* black -> ruff and add config, deps in pyproject.toml

* dev -> lint,test in pyproject.toml for consistency

* tweaks for ruff standard linting, mostly return types

* Well why doesn't ruff check warn about ruff format?

* missing deps for this style of test coverage

* catch last references to direct import of scivision

* import order! need those commit hooks to persevere with ruff

* unpin chromadb, packaging issue is fixed there

* add exiftool install to workflow runner

* add note on exiftool installation to README

* Add a link trail to the README

* replace the s3fs interface with boto3

* Initialise DVC

* Test approach of adding DVC remote tracking to git

* Uses import-url to do this
* Adds a bash script to automate it
* Documents the process

* looking at how `dvc add` roundtrips data

* Update the DVC usage examples to show data upload

* remove the last traces of intake before writing pipelines

* Add a two-stage pipeline for extracting embeddings

* More notes on pipeline stages, and a dvc lockfile

* fix issue where cluster images displayed twice

* added the image list to session_state rather than calling the render
  from the selectbox state change

* embeddings collection config via .env (more reusable)

* dependency is Pillow, not PIL

* Add a dvc stage to build a cluster model, but it's failing

* was missing call to main() - long week

* remove stray references to intake

* scikit-image does need to be here

* take bucket name from params, reuse for embeddings collection

* Add more notes as we see what dvc repro does

* spell out the model saving path

* convert 16 bit greyscale images to rgb

* add a single grayscale image to test fixtures

* convert output to RGB for display as well

* back out of the approach using git to store DVC tracking data

* switch between embeddings collections in the demo UI
metazool authored Oct 3, 2024
1 parent 21e8e53 commit 8f39f39
Showing 31 changed files with 454 additions and 206 deletions.
3 changes: 3 additions & 0 deletions .dvc/.gitignore
@@ -0,0 +1,3 @@
/config.local
/tmp
/cache
5 changes: 5 additions & 0 deletions .dvc/config
@@ -0,0 +1,5 @@
[core]
remote = jasmin
['remote "jasmin"']
url = s3://metadata
endpointurl = https://fw-plankton-o.s3-ext.jc.rl.ac.uk
3 changes: 3 additions & 0 deletions .dvcignore
@@ -0,0 +1,3 @@
# Add patterns of files dvc should ignore, which could improve
# the performance. Learn more at
# https://dvc.org/doc/user-guide/dvcignore
3 changes: 3 additions & 0 deletions .gitignore
@@ -5,3 +5,6 @@ vectors/
*.ipynb
*.egg-info/
venv/
data/**
/vectors
/models
116 changes: 116 additions & 0 deletions DVC.md
@@ -0,0 +1,116 @@
# Data Version Control

We're trying DVC (Data Version Control) in this project, for versioning data and ML models.

There's little here on the DVC side as yet - links and notes in the README about following the approach already being used for LLM testing and fine-tuning, and how we might set DVC up to manage the collection "externally" (keeping the data in s3 and the metadata in source control).

Other ecosystems like [RO-crate](https://www.researchobject.org/ro-crate/) and [outpack](https://github.com/mrc-ide/outpack_server) share a lot of the same aims, but are more focused on research data and possibly have stronger community connections. For ML pipeline projects, though, DVC is the mature option.

## Walkthrough

Following the [DVC Getting Started](https://github.com/iterative/dvc.org/blob/main/content/docs/start/index.md)

```
dvc init
git add .dvc/config
git add .dvc/.gitignore
```

Add our JASMIN object store as a DVC remote. Use an existing bucket for simplicity; this limits us to a single collection, but that's already the case.

For non-AWS stores, `endpointurl` is needed - set this to the same object store URL defined in `.env` as `AWS_URL_ENDPOINT`.

```
dvc remote add -d jasmin s3://metadata
dvc remote modify jasmin endpointurl https://fw-plankton-o.s3-ext.jc.rl.ac.uk
```

Add the access key / secret key pair as documented in the llm_eval project. These values are also defined in `.env`, and the `--local` switch prevents accidentally committing them to git:

```
dvc remote modify --local jasmin access_key_id [our access key]
dvc remote modify --local jasmin secret_access_key [our secret key]
```
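
For reference, a minimal `.env` carrying these values might look like the sketch below. Only `AWS_URL_ENDPOINT` is a name the scripts actually read; the credential variable names here are illustrative.

```
# Illustrative values - never commit real credentials
AWS_URL_ENDPOINT=https://fw-plankton-o.s3-ext.jc.rl.ac.uk
AWS_ACCESS_KEY_ID=[our access key]
AWS_SECRET_ACCESS_KEY=[our secret key]
```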

Test that it works:

`dvc push`

### Add data

Our images are already in the same bucket as this `dvc` remote. Try [import-url](https://dvc.org/doc/command-reference/import-url) to add an image in remote storage to version control. TODO - ask someone with permissions to add a dedicated bucket.

There are two ways of doing this:

* `dvc import-url` - supports a `--no-download` option, and creates a `.dvc` tracking file per object which can be added to our git repo.
* `dvc stage add` - writes the tracking information into `dvc.yaml`, but assumes we're downloading the data and uploading it to the remote.

We could also use the `--to-remote` option to transfer the data to remote storage in JASMIN's object store, but we already have copies of the data in s3.

`dvc add / dvc push` would be the pattern to use where we have data in a filesystem (in a JASMIN Group Workspace, or locally) and want to store and track canonical copies of it in an object store, and reuse those in experiment pipeline stages.

#### Example for `dvc add`

Files are on our local system. They might already be version controlled in a git repository!

This uploads an image file to our object storage, and creates a `.dvc` metadata file in our local directory, alongside the image:

```
dvc add tests/fixtures/test_images/testymctestface_113.tif --to-remote
git add tests/fixtures/test_images/testymctestface_113.tif.dvc
git commit -m "add dvc metadata"
git push origin our_branch
```
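
The generated `.dvc` file is just a small YAML pointer to the data, roughly like this (the hash and size values here are made up):

```
outs:
- md5: d41d8cd98f00b204e9800998ecf8427e
  size: 123456
  path: testymctestface_113.tif
```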

Now in a completely separate checkout of the git repository, with the dvc remote set up to point to the same storage, we can

`dvc pull`

And this downloads the copy of the image linked to the metadata. It's quite nice!

#### Script to `import-url`

A bash script is provided to automate this - it reads all the filenames from the CSV that intake was using as a catalog, uses `dvc import-url` to create tracking information for them in a `data` directory, and then commits all the tracking information to this git repository:

`bash scripts/intake_to_dvc.sh`
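
A single `import-url` call of the kind the script automates might look something like this (the URL and output path are illustrative):

```
dvc import-url --no-download \
    https://fw-plankton-o.s3-ext.jc.rl.ac.uk/untagged-images-lana/some_image_001.tif \
    data/some_image_001.tif
git add data/some_image_001.tif.dvc
```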

### Define pipeline stages

We want a [dvc.yaml](https://dvc.org/doc/user-guide/project-structure/dvcyaml-files) to keep our pipeline definition(s) in.

Each script is converted to a stage; the output directory of one stage (declared with the `-o` switch) is passed as a dependency of the next (the `-d` switch).

There's also the option of a `params.yaml`, referenced with the `-p` switch, which stores hyperparameters / initialisation values per stage.
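
A `params.yaml` matching the keys the scripts further down read might look like this (the values shown are the in-code defaults):

```
collection: untagged-images-lana
cluster:
  n_clusters: 5
```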

Use `dvc` to chain the existing scripts together into a pipeline:

`cd scripts` - write a `dvc.yaml` into this directory.

Rebuild the index of images in our s3 store:

`dvc stage add -n index python image_metadata.py`

Use that index to extract and store embeddings from images:

`dvc stage add -n embeddings -o ../vectors python image_embeddings.py`

Then check we can run our two-stage pipeline:

`dvc repro`

This creates a `dvc.lock` to commit to the repository, and suggests a `dvc push` which sends some amount of experiment metadata to the remote.

When we run `dvc repro` again, the second stage detects no change and doesn't re-run; but as our first stage only wrote a file back to `s3`, not to the filesystem, this may not be the behaviour we want.

Now its output path `../vectors` is available to use as input to a model-building stage.

Add a script that fits a K-means model from the image embeddings and saves it (hoping it saves automatically into `../models`)

`dvc stage add -n cluster -d ../vectors -o ../models python cluster.py`
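
With the dependencies and outputs wired in, the generated `dvc.yaml` should look roughly like this (the version committed on this branch still has the `deps` / `outs` entries commented out):

```
stages:
  index:
    cmd: python image_metadata.py
  embeddings:
    cmd: python image_embeddings.py
    outs:
      - ../vectors
  cluster:
    cmd: python cluster.py
    deps:
      - ../vectors
    outs:
      - ../models
```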

`dvc repro` at this point does want to run the image embeddings again; it's not clear why... a code change?





2 changes: 1 addition & 1 deletion README.md
@@ -94,7 +94,7 @@ For more information see the [Jupytext docs](https://jupytext.readthedocs.io/en/
Streamlit app based off the [text embeddings for EIDC catalogue metadata](https://github.com/NERC-CEH/embeddings_app/) one

```
streamlit run cyto_ml/visualisation/visualisation_app.py
streamlit run src/cyto_ml/visualisation/app.py
```

The demo should automatically open in your browser when you run streamlit. If it does not, connect using: http://localhost:8501.
6 changes: 3 additions & 3 deletions environment.yml
@@ -9,18 +9,18 @@ dependencies:
- black
- boto3
- chromadb
- dvc[s3]
- flake8
- intake-xarray
- intake=0.7
- isort
- jupyterlab
- jupytext
- matplotlib
- pandas
- Pillow
- pytest
- python-dotenv
- scikit-image
- scikit-learn
- scikit-image
- xarray
- pip
- streamlit
10 changes: 6 additions & 4 deletions pyproject.toml
@@ -9,18 +9,20 @@ requires-python = ">=3.12"
description = "This package supports the processing and analysis of plankton sample data"
readme = "README.md"
dependencies = [
"boto3",
"boto3",
"chromadb",
"dvc[s3]",
"imagecodecs",
"intake==0.7.0",
"intake-xarray",
"pandas",
"Pillow",
"plotly",
"pyexiftool",
"python-dotenv",
"scikit-image", # secretly required by intake-xarray as default reader
"scikit-image",
"scikit-learn",
"streamlit",
"torch",
"torchvision",
"xarray",
"resnet50-cefas@git+https://github.com/jmarshrossney/resnet50-cefas",
]
43 changes: 43 additions & 0 deletions scripts/cluster.py
@@ -0,0 +1,43 @@
"""Tiny DVC stage that trains and saves a K-means model from embeddings"""

import logging
import os
import pickle

import yaml
from sklearn.cluster import KMeans

from cyto_ml.data.vectorstore import embeddings, vector_store


def main() -> None:

    os.makedirs("../models", exist_ok=True)

    # You can supply -p params to dvc as an alternative to params.yaml
    # But this (from the example) suggests they don't get overridden?
    params = yaml.safe_load(open("params.yaml"))
    collection_name = params.get("collection", "untagged-images-lana")
    try:
        stage_params = params["cluster"]
        n_clusters = stage_params["n_clusters"]
    except KeyError:
        logging.info("No parameters for stage found - default to 5 clusters")
        n_clusters = 5

    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    store = vector_store(collection_name)
    X = embeddings(store)
    kmeans.fit(X)

    # We supply a -o for output directory - this doesn't ensure we write there.
    # The examples show the path hard-coded in the script, too:
    # https://dvc.org/doc/start/data-pipelines/data-pipelines
    # The output directory will be deleted at the start of the stage;
    # it's the script's responsibility to ensure it's recreated.

    with open(f"../models/kmeans-{collection_name}.pkl", "wb") as f:
        pickle.dump(kmeans, f)


if __name__ == "__main__":
    main()
8 changes: 8 additions & 0 deletions scripts/dvc.lock
@@ -0,0 +1,8 @@
schema: '2.0'
stages:
  index:
    cmd: python image_metadata.py
  embeddings:
    cmd: python image_embeddings.py
  cluster:
    cmd: python cluster.py
13 changes: 13 additions & 0 deletions scripts/dvc.yaml
@@ -0,0 +1,13 @@
stages:
  index:
    cmd: python image_metadata.py
  embeddings:
    cmd: python image_embeddings.py
    #outs:
    #- ../vectors
  cluster:
    cmd: python cluster.py
    #deps:
    #- ../vectors
    #outs:
    #- ../models
40 changes: 18 additions & 22 deletions scripts/image_embeddings.py
@@ -1,16 +1,16 @@
"""Try to use the scivision pretrained model and tools against this collection"""
"""Extract and store image embeddings from a collection in s3,
using an off-the-shelf pre-trained model"""

import os
import logging
import yaml
from dotenv import load_dotenv
from cyto_ml.models.scivision import (
prepare_image,
flat_embeddings,
)
from cyto_ml.models.utils import flat_embeddings
from cyto_ml.data.image import load_image_from_url

from resnet50_cefas import load_model
from cyto_ml.data.vectorstore import vector_store
from intake import open_catalog
from intake_xarray import ImageSource
import pandas as pd

logging.basicConfig(level=logging.info)
load_dotenv()
@@ -19,32 +19,27 @@
if __name__ == "__main__":

# Limited to the Lancaster FlowCam dataset for now:
catalog = "untagged-images-lana/intake.yml"
dataset = open_catalog(f"{os.environ.get('ENDPOINT')}/{catalog}")
collection = vector_store("plankton")
image_bucket = yaml.safe_load(open("params.yaml"))["collection"]
catalog = f"{image_bucket}/catalog.csv"

model = load_model(strip_final_layer=True)
file_index = f"{os.environ.get('AWS_URL_ENDPOINT')}/{catalog}"
df = pd.read_csv(file_index)

plankton = (
dataset.plankton().to_dask().compute()
) # this will read a CSV with image locations as a dask dataframe
collection = vector_store(image_bucket)

# Feels like this is doing dask wrong, compute() should happen later
# If it doesn't, there are complaints about meta= return value inference
# that suggest this is wrongheaded use of `apply`: need to learn better patterns
# So this is a kludge, but we're still very much in prototype territory -
# Come back and refine this if the next parts work!
model = load_model(strip_final_layer=True)

def store_embeddings(row):
try:
image_data = ImageSource(row.Filename).to_dask()
image_data = load_image_from_url(row.Filename)
except ValueError as err:
# TODO diagnose and fix for this happening, in rare circumstances:
# (would be nice to know rather than just buffer the image and add code)
# File "python3.9/site-packages/PIL/PcdImagePlugin.py", line 34, in _open
# self.fp.seek(2048)
# File "python3.9/site-packages/fsspec/implementations/http.py", line 745, in seek
# raise ValueError("Cannot seek streaming HTTP file")
# Is this still reproducible? - JW
logging.info(err)
logging.info(row.Filename)
return
@@ -53,7 +48,7 @@ def store_embeddings(row):
logging.info(row.Filename)
return

embeddings = flat_embeddings(model(prepare_image(image_data)))
embeddings = flat_embeddings(model(image_data))

collection.add(
documents=[row.Filename],
@@ -62,4 +57,5 @@ def store_embeddings(row):
# Note - optional arg name is "metadatas" (we don't have any)
)

plankton.apply(store_embeddings, axis=1)
for _, row in df.iterrows():
store_embeddings(row)
20 changes: 20 additions & 0 deletions scripts/image_metadata.py
@@ -0,0 +1,20 @@
"""Create a basic index for the images in an s3 collection
"""

import yaml
from cyto_ml.data.s3 import boto3_client, image_index


if __name__ == "__main__":

    # Write a minimal CSV index of images in a bucket
    # Was originally part of an intake catalogue setup
    image_bucket = yaml.safe_load(open("params.yaml"))["collection"]

    metadata = image_index(image_bucket)

    s3 = boto3_client()

    catalog_csv = metadata.to_csv(index=False)

    s3.put_object(Bucket=image_bucket, Key="catalog.csv", Body=catalog_csv)