
Packaged way of adding a detritus classifier to image processing #32

Open
5 of 6 tasks
metazool opened this issue Sep 11, 2024 · 5 comments
Labels: demonstrate (feature that we need to be able to show)

Comments


metazool commented Sep 11, 2024

  • Streamlit demo shows k-means (or similar) clustering based on embeddings, with a reasonably clear "this is detritus" cluster
  • Sets out the extent to which these images can be discarded before they ever reach s3 storage

Workflow for generating a classifier: s3 image collection -> extract and store embeddings -> fit a clustering model -> save the resulting artifact for reuse in the annotation workflow
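The fit-and-save end of that workflow could be sketched roughly like this (function names and the artifact path are illustrative, not the repo's actual API; assumes scikit-learn for the clustering, with embeddings already extracted from the s3 collection):

```python
import pickle

import numpy as np
from sklearn.cluster import KMeans


def fit_and_save(embeddings, model_path, n_clusters=5):
    """Fit a clustering model on the extracted embeddings and save the artifact."""
    model = KMeans(n_clusters=n_clusters, random_state=0, n_init=10)
    model.fit(np.asarray(embeddings))
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    return model


def load_model(model_path):
    """Reload the saved artifact for reuse in the annotation workflow."""
    with open(model_path, "rb") as f:
        return pickle.load(f)
```

Pickling keeps the stage simple; a more portable artifact format (e.g. ONNX or joblib) could be swapped in later if the annotation workflow runs in a different environment.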

This could be a Luigi job, or an opportunity to get started with pyorderly, or to test this walkthrough of DVC and work with CML.

Outline:

  • Set up DVC for the training data as an external source
  • Use that instead of intake to drive the script that does embedding extraction
  • Try DVC pipeline stages to run the embedding script as an alternative to learning Luigi
  • Then step back, make sure the streamlit demo runs and stop worrying about deploying it in a structured way, just look at the clusters again
  • Pick a cluster label out of the air on the basis of what the demo shows, and add a pipeline stage to pickle a K-means model which can then be used to add detritus labels in "Simple automation for the image decollage process" (#31)
  • Decide whether it's worth adding more metadata to chromadb (labels, image sizes)
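The labelling stage in the outline above could look like this sketch, where `DETRITUS_CLUSTER` and the function name are hypothetical (the cluster index would be chosen by eye from the demo, as described):

```python
import pickle

import numpy as np

DETRITUS_CLUSTER = 2  # hypothetical: the cluster judged by eye to be detritus


def label_images(embeddings, model_path):
    """Assign a detritus/other label to each image via the pickled K-means model."""
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    clusters = model.predict(np.asarray(embeddings))
    return ["detritus" if c == DETRITUS_CLUSTER else "other" for c in clusters]
```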
@metazool metazool added the demonstrate feature that we need to be able to show label Sep 11, 2024
@metazool

Taking intake out involves changing a few places where intake_xarray.ImageSource was used to load images for the scivision model, but it looks worth doing; the results will be much more readable.
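For illustration, the replacement is roughly this (a sketch assuming Pillow; `load_image` is a hypothetical helper, not the repo's actual code):

```python
# Before (roughly): intake_xarray.ImageSource(urlpath).to_dask()
# After: load directly with Pillow, which is easier to read and debug.
import numpy as np
from PIL import Image


def load_image(path):
    """Load one image as an array, where intake_xarray.ImageSource was used before."""
    with Image.open(path) as img:
        return np.asarray(img.convert("RGB"))
```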


metazool commented Oct 8, 2024

This is partly completed in #36: the simplest possible DVC pipeline that fits a K-means model for an image collection and saves it for reuse, with a web interface for exploring the contents of the different clusters to judge by eye which is primarily detritus.

You can see there's still an open question about where the metadata goes. I thought about adding a tag right into the EXIF headers, or into the detailed per-image metadata that the microscope exports. It depends what is most useful to the ongoing application! And also on how this will be used: is the tagging an extra stage in a Luigi pipeline that's processing and uploading images to an object store, or is it a distinct pipeline that's indexing and analysing images once they've been uploaded?
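The EXIF option mentioned above could be sketched with Pillow like this (the tag choice and function name are assumptions, not a settled design):

```python
from PIL import Image

IMAGE_DESCRIPTION = 0x010E  # standard TIFF/EXIF ImageDescription tag


def tag_image(src, dst, label):
    """Write a classification label into the image's EXIF ImageDescription field."""
    with Image.open(src) as img:
        exif = img.getexif()
        exif[IMAGE_DESCRIPTION] = label
        img.save(dst, exif=exif)
```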

So I've left it open for now; it probably needs another use case, like the phenocam images, to show the wider picture.

cc @albags @Kzra

@metazool

> is the tagging an extra stage in a Luigi pipeline that's processing and uploading images to an object store, or is it a distinct pipeline that's indexing and analysing images once they've been uploaded?

I've been thinking that it could be very helpful to have this as an API. Rather than adding models and Python wrappers for them directly into pipelines, POST an image's contents and get back a classification, a set of embeddings, or both.

What originally prompted this was #45: these new models are promising, but they're not yet published in a way that eases programmatic reuse; to deploy them you'd need to access a Google Drive first.

Triton Server or similar would be a good way to go, but a minimal FastAPI app would be a useful proof of concept for this.


metazool commented Dec 11, 2024

#53 exists now for serving the recent Turing models (at least the lightweight ResNet18 one), even if there are barriers (Google Drive) to reproducibility. There's an endpoint for returning embeddings + classification; a cluster label could be added to it directly.

As it stands that's not added as a pipeline stage but done after the fact, by pointing at an image bucket and searching for unseen entries; that's probably fine at this point. The moral of this story is I'm going to take this out of the TODO list and prioritise a) working through the simplest possible deployment until we have more infrastructure decisions, and b) if possible, putting an entirely dissimilar image collection through the process.

@metazool

After the promising work setting up Label Studio for the plankton taxonomy, @Kzra suggested packaging this as a Label Studio ML backend. That's a nice path here.

It can be containerised, though we don't have a container registry to store that in. #52 covers most of this already.
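Schematically, a Label Studio ML backend implements a `predict()` method returning one prediction per task. The real backend would subclass `label_studio_ml.model.LabelStudioMLBase`; this standalone sketch, with a stub classifier, just shows the shape of the contract (all names here are illustrative):

```python
class DetritusBackend:
    """Standalone sketch of a Label Studio ML backend for detritus labelling."""

    def __init__(self, model=None):
        # In the real backend, load the pickled K-means / served ResNet here.
        self.model = model or (lambda image_url: "detritus")

    def predict(self, tasks):
        """Return one Label Studio-style prediction per annotation task."""
        predictions = []
        for task in tasks:
            label = self.model(task["data"]["image"])
            predictions.append({
                "result": [{
                    "from_name": "label",
                    "to_name": "image",
                    "type": "choices",
                    "value": {"choices": [label]},
                }],
                "score": 1.0,
            })
        return predictions
```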

@metazool mentioned this issue Jan 21, 2025
@metazool mentioned this issue Jan 28, 2025