
Proof of concept of similarity search with the scivision model #5

Merged
merged 17 commits into from
Jul 22, 2024

Conversation

metazool
Collaborator

@metazool metazool commented Jul 1, 2024

This builds up the demo to show a reasonable proof of concept, for a human viewer, of enough success in extracting image embeddings from the scivision plankton model to keep building experiments on it in the short term.

What's in this

  • Adds some utility functions to make working with the image embeddings simpler
  • Changes the way the intake catalogue is written to use the whole untagged image collection, not the subset of labelled ones
  • Adds a bit more test coverage where it was spotty, moves the tests into the package for the flake8 action to collect them
  • Adapts the image_embeddings.py script to run the whole collection through the model
  • Adds a notebook showing the outcome of similarity search of the embeddings, which looks plausible, and some notes on next steps / link to a related paper working through unsupervised clustering approaches
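The notebook's similarity search goes through the vector store, but the underlying idea is just nearest neighbours by cosine similarity over the embeddings. A minimal pure-Python sketch of that idea (function and variable names invented here, not taken from the codebase):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def closest(query, embeddings, k=24):
    """Indices of the k embeddings most similar to the query vector."""
    ranked = sorted(
        range(len(embeddings)),
        key=lambda i: cosine_similarity(query, embeddings[i]),
        reverse=True,
    )
    return ranked[:k]
```

In practice the vector store does this lookup (with indexing) for us; the sketch is only to show what "closest 24 samples" means.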

What isn't in this

Any kind of robust approach to spreading the workload around with Dask. Partly the collection isn't big enough to justify it yet, and partly I haven't worked with Dask enough to understand the best approach; I'd appreciate any advice on that topic, including from the DevOps folks (see the notes in comments in scripts/image_embeddings.py).

Any exploration of model explainability techniques using the prediction capabilities of the CEFAS model, as opposed to using it as a source of embeddings. That's a good place to visit next, before trying any clustering algorithms on the embeddings: step back and observe whether what the model is seeing is properly coherent, whether it aligns with the CEFAS reference data, or whether factors like image dimensions are giving a false positive impression of these initial results.

A plot of the closest 24 samples to one picked at random

To test

export PYTHONPATH=. or pip install -e .
py.test

You may need the right credentials in .env to be able to run the scripts which generate the index and subsequently the embeddings. I've only run this locally; there are 8k images. I should set it up so this can be reproduced using only the images from @Kzra that are in the test fixtures.

@metazool metazool requested review from Kzra and a team July 1, 2024 16:38

github-actions bot commented Jul 1, 2024

☂️ Python Coverage

current status: ✅

Overall Coverage

Lines Covered Coverage Threshold Status
98 83 85% 0% 🟢

New Files

File Coverage Status
cyto_ml/data/s3.py 0% 🟢
cyto_ml/tests/test_image_embeddings.py 100% 🟢
cyto_ml/tests/test_vector_store.py 100% 🟢
TOTAL 67% 🟢

Modified Files

File Coverage Status
cyto_ml/data/intake.py 0% 🟢
cyto_ml/data/vectorstore.py 81% 🟢
cyto_ml/models/scivision.py 95% 🟢
TOTAL 59% 🟢

updated for commit: b2d2aa1 by action🐍

@metazool
Collaborator Author

I'm keen to merge this PR on the understanding that it's still a rough prototype; #8 is not actionable otherwise. The same goes for the next one, #7: though I still have doubts about its validity, it was a useful exercise, and there are small refactors and improved test coverage along with the inconclusive notebook.

@jmarshrossney / @albags you've both kindly cast eyes on this experiment, prepared to approve it?

- intake # for reading scivision
- torch==1.10.0 # install before cefas_scivision; it needs this version
- scivision
- scikit-image
- setuptools==69.5.1 # because this bug https://github.com/pytorch/serve/issues/3176
- tiffile
Collaborator

@jmarshrossney jmarshrossney Jul 21, 2024


I think tiffile is deprecated and this should now be tifffile (3 fs)!

Collaborator Author


but

[~]$ pip install tiffile
Collecting tiffile
  Using cached tiffile-2018.10.18-py2.py3-none-any.whl.metadata (883 bytes)
Collecting tifffile (from tiffile)
  Downloading tifffile-2024.7.21-py3-none-any.whl.metadata (31 kB)
Collecting numpy (from tifffile->tiffile)
  Downloading numpy-2.0.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 60.9/60.9 kB 9.7 MB/s eta 0:00:00
Using cached tiffile-2018.10.18-py2.py3-none-any.whl (2.7 kB)
Downloading tifffile-2024.7.21-py3-none-any.whl (225 kB)

???

Collaborator


fair enough!

- intake # for reading scivision
- torch==1.10.0 # install before cefas_scivision; it needs this version
- scivision
- scikit-image
- setuptools==69.5.1 # because this bug https://github.com/pytorch/serve/issues/3176
- tiffile
- git+https://github.com/alan-turing-institute/plankton-cefas-scivision@main # torch version
- chromadb
Collaborator


The latest version of chromadb (0.5.4) causes test_vector_store to fail with an AttributeError because, for whatever reason, chromadb.api.models.Collection no longer has a model_fields attribute. I'm none the wiser after reading the related issue and don't really want to dig deeper, so for now we can just move forward by pinning chromadb==0.5.3.
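For now the pin can go straight into the environment file, in the same list style as the other pinned packages above (the comment wording here is mine):

```
- chromadb==0.5.3 # 0.5.4 breaks test_vector_store with an AttributeError; pin until fixed upstream
```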

Collaborator Author


This looks rooted in a transitory issue in chromadb related to a backwards-incompatible change in pydantic (data validation through type checking). Reading chroma-core/chroma#2503, that team is firmly on the case and 0.5.5 should cause this problem to solve itself (always having version-pinned dependencies would be a healthy habit I should adopt, though).

Collaborator


I don't think there's much that can be done when this sort of thing happens, unless you're not really interested in broad support and can just lock down all your dependencies - in which case it makes sense to use lockfiles (#14)! Good to know they're fixing it though.

Collaborator


test_image_embeddings fails whenever a cuda device is available because prepare_image moves tensors to the default cuda device whereas the scivision model returned by truncate_model lives on the CPU. My feeling is that truncate_model is fine and prepare_image doesn't need to mess with the device. Could we just remove lines 47-49?

Collaborator Author


Ugh, I didn't test this anywhere with a GPU available.

In past projects we spelled this out long-hand, e.g. with device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') - awkward if you've got more than one GPU, plus you keep having to remember to move the model and every tensor across.

Reading https://pytorch.org/tutorials/recipes/recipes/changing_default_device.html, in PyTorch 2 there are cleaner ways of doing it, including a context manager. But this project is pinned to an older torch 1.* because that's required by the older scivision plankton model, cf. alan-turing-institute/plankton-cefas-scivision#5
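One torch 1.*-compatible middle ground might be a tiny helper that autodetects a device but lets the caller override it - a sketch of the pattern, not what the code currently does (the function name is invented, and torch is imported lazily only so the snippet degrades gracefully when torch isn't installed):

```python
def resolve_device(requested=None):
    """Pick a device string: honour an explicit request, else autodetect.

    With torch 1.* there is no global default-device context manager, so the
    caller is still responsible for .to(device) on both the model and each
    batch of input tensors.
    """
    if requested is not None:
        return requested
    try:
        import torch
        return "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:  # lets the helper run in environments without torch
        return "cpu"
```

Callers who want everything on the CPU (e.g. to match the truncated scivision model) can just pass resolve_device("cpu") and nothing moves unexpectedly.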

We don't yet have a timeline for testing the new transformer-based model; we should follow up with Turing Inst folks about that - I'll try to find a handle.

Collaborator


This may be very specific to the ways I've used PyTorch, but I've never really found that moving things between devices generated too much boilerplate, whereas abstracting it away sometimes led to more problems than it solved.

I definitely don't think prepare_image should try to change the device since there are too many situations where you wouldn't want that - e.g. if you had a GPU but it didn't have enough memory for the full dataset.

If we wanted to abstract everything away we might consider using PyTorch Lightning, or otherwise replicating the LightningDataModule. Personally I'd prefer a more minimalist solution though.

Anyway I will have a think and make a new PR to discuss further!

Collaborator

@jmarshrossney jmarshrossney left a comment


Hi Jo.

Thanks for all of your work on this project. It's been fun and challenging (in a fun way) to figure out what's going on with the code and the jasmin object storage.

I notice that the script intake_metadata.py is broken on main now since metadata/metadata.csv no longer exists in the object store, which is a symptom of how long it's taken me to review this PR - sorry! So yeah, I agree that we should merge this, with a couple of very minor tweaks (see comments) and with a pin in the dependency specification, which can be sorted in a future PR.

@metazool
Collaborator Author

Thank you @jmarshrossney, much appreciated - and good call on proper dependency pinning; will fix with any upcoming work.

#10 (support different models and try BioCLIP for embeddings, if that's runnable without a GPU backend) is where I plan to look next, but I'm clearing a documentation backlog first.

@metazool metazool merged commit 0c34d98 into main Jul 22, 2024
4 checks passed
@Kzra
Collaborator

Kzra commented Jul 22, 2024

Just to add - I changed the object store layout and got rid of the metadata bucket! The object store layout is now:

untagged-images-lana
untagged-images-wala
tagged-images-lana
tagged-images-wala

Inside tagged-images-lana and tagged-images-wala there is a metadata.csv file and taxonomy.csv file.

lana refers to lancaster A, the flow cam images
wala refers to wallingford A, Isabelle's flow cytometer images

I've kept the tagged-images and untagged-images buckets for now as I know this repo is using them, but if you could change to using untagged-images-lana and tagged-images-lana, that would be great; once that's done I'll get rid of these outdated buckets.

The reason for doing this is to keep the images and metadata in separate workflows so that Isabelle can start to use the app to tag flow cytometer images.
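The layout above could be captured in a small lookup on the repo side - a hypothetical sketch (names taken from the bucket list above; the function and dict are invented here):

```python
# Object store layout as described: site codes are "lana" (Lancaster A, the
# FlowCam images) and "wala" (Wallingford A, the flow cytometer images).
BUCKETS = {
    ("lana", True): "tagged-images-lana",
    ("lana", False): "untagged-images-lana",
    ("wala", True): "tagged-images-wala",
    ("wala", False): "untagged-images-wala",
}

def metadata_key(site: str) -> str:
    """Each tagged bucket holds a metadata.csv alongside the images."""
    return f"{BUCKETS[(site, True)]}/metadata.csv"
```

Keeping the mapping in one place would make the eventual switch away from the outdated buckets a one-line change.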


@metazool
Collaborator Author

metazool commented Jul 22, 2024

Hi @Kzra ! Thank you for unpacking the changes - hadn't clocked that there was active work on the annotation side of the project.

Just to add - I changed the object store layout and got rid of the metadata bucket!

I notice that the script intake_metadata.py is broken on main now since metadata/metadata.csv no longer exists

I found having a writeable bucket (with the permissions I was offered) useful for storing two files generated to serve as a catalogue interface, using the intake package. I didn't document this other than in the script above, though, and should write something down longhand.
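For the record, an intake catalogue is just a small YAML file; a hypothetical sketch of a metadata source (source name and path invented here, not the actual files in the bucket) might look like:

```
sources:
  plankton_metadata:
    description: Metadata for the plankton image collection
    driver: csv
    args:
      # hypothetical path; the real one depends on the bucket layout
      urlpath: "s3://untagged-images/metadata.csv"
      storage_options:
        anon: false
```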

Can we persuade you to adopt a git branch-and-merge workflow for changes to cyto-app? The EDS RSE group doesn't currently have a set of recommendations and hints for this, but I can draft some based on the guidance used in a previous team. If you make changes on a branch and merge them into your project via pull request, even if you haven't got someone lined up to peer review, you could tag one of us to offer a heads-up on potentially breaking changes. I'll add this to my list of bootstrap documentation to (re)write.

See also #12 covering next steps for this part of the project

@Kzra
Collaborator

Kzra commented Jul 22, 2024

Never done this before, but if you write some guidance I can follow the instructions! There will be quite a few changes to the app over the coming weeks as we begin testing in earnest. The object store layout might expand, but the tagged-images-lana etc. buckets won't change or be deleted.

@metazool metazool deleted the more_data branch October 2, 2024 08:50