
Proof of concept of similarity search with the scivision model #5

Merged
merged 17 commits into from
Jul 22, 2024

Conversation

metazool
Collaborator

@metazool metazool commented Jul 1, 2024

This builds up the demo to show a reasonable proof of concept, for a human viewer, of enough success in extracting image embeddings from the scivision plankton model to keep building experiments on it in the short term.

What's in this

  • Adds some utility functions to make working with the image embeddings simpler
  • Changes the way the intake catalogue is written to use the whole untagged image collection, not the subset of labelled ones
  • Adds a bit more test coverage where it was spotty, moves the tests into the package for the flake8 action to collect them
  • Adapts the image_embeddings.py script to run the whole collection through the model
  • Adds a notebook showing the outcome of similarity search of the embeddings, which looks plausible, and some notes on next steps / link to a related paper working through unsupervised clustering approaches
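The notebook's similarity search goes through the vector store, but the underlying idea is just nearest neighbours by cosine similarity over the embeddings. A minimal pure-Python sketch of that idea (function and variable names invented here, not taken from the codebase):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def closest(query, embeddings, k=24):
    """Indices of the k embeddings most similar to the query vector."""
    ranked = sorted(
        range(len(embeddings)),
        key=lambda i: cosine_similarity(query, embeddings[i]),
        reverse=True,
    )
    return ranked[:k]
```

In practice the vector store does this lookup (with indexing) for us; the sketch is only to show what "closest 24 samples" means.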

What isn't in this

Any kind of robust approach to spreading the workload around with Dask. Partly the collection isn't big enough to justify it yet, and partly I haven't worked with Dask enough to understand the best approach; I'd appreciate any advice on that topic, including from the DevOps folks (see the notes in comments in scripts/image_embeddings.py).

Any exploration of model explainability techniques using the prediction capabilities of the CEFAS model, as opposed to using it as a source of embeddings. That's a good place to visit next, before trying any clustering algorithms on the embeddings: step back and observe whether what the model is seeing is properly coherent, whether it aligns with the CEFAS reference data, or whether factors like image dimensions are giving a false positive impression of these initial results.

A plot of the closest 24 samples to one picked at random

To test

export PYTHONPATH=. or pip install -e .
py.test

You may need the right credentials in .env to be able to run the scripts which generate the index and subsequently the embeddings. I've only run this locally; there are 8k images. I should set it up so this can be reproduced using only the images from @Kzra that are in the test fixtures.

@metazool metazool requested review from Kzra and a team July 1, 2024 16:38

github-actions bot commented Jul 1, 2024

☂️ Python Coverage

current status: ✅

Overall Coverage

Lines Covered Coverage Threshold Status
98 83 85% 0% 🟢

New Files

File Coverage Status
cyto_ml/data/s3.py 0% 🟢
cyto_ml/tests/test_image_embeddings.py 100% 🟢
cyto_ml/tests/test_vector_store.py 100% 🟢
TOTAL 67% 🟢

Modified Files

File Coverage Status
cyto_ml/data/intake.py 0% 🟢
cyto_ml/data/vectorstore.py 81% 🟢
cyto_ml/models/scivision.py 95% 🟢
TOTAL 59% 🟢

updated for commit: b2d2aa1 by action🐍

@metazool
Collaborator Author

I'm keen to merge this PR on the understanding that it's still a rough prototype; #8 is not actionable otherwise. The same goes for the next one, #7: though I still have doubts about its validity, it was a useful exercise, and there are small refactors and improved test coverage along with the inconclusive notebook.

@jmarshrossney / @albags you've both kindly cast eyes on this experiment, prepared to approve it?

- intake # for reading scivision
- torch==1.10.0 # install before cefas_scivision; it needs this version
- scivision
- scikit-image
- setuptools==69.5.1 # because this bug https://github.com/pytorch/serve/issues/3176
- tiffile
Collaborator

@jmarshrossney jmarshrossney Jul 21, 2024


I think tiffile is deprecated and this should now be tifffile (3 fs)!

Collaborator Author


but

[~]$ pip install tiffile
Collecting tiffile
  Using cached tiffile-2018.10.18-py2.py3-none-any.whl.metadata (883 bytes)
Collecting tifffile (from tiffile)
  Downloading tifffile-2024.7.21-py3-none-any.whl.metadata (31 kB)
Collecting numpy (from tifffile->tiffile)
  Downloading numpy-2.0.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 60.9/60.9 kB 9.7 MB/s eta 0:00:00
Using cached tiffile-2018.10.18-py2.py3-none-any.whl (2.7 kB)
Downloading tifffile-2024.7.21-py3-none-any.whl (225 kB)

???

Collaborator


fair enough!

- intake # for reading scivision
- torch==1.10.0 # install before cefas_scivision; it needs this version
- scivision
- scikit-image
- setuptools==69.5.1 # because this bug https://github.com/pytorch/serve/issues/3176
- tiffile
- git+https://github.com/alan-turing-institute/plankton-cefas-scivision@main # torch version
- chromadb
Collaborator


The latest version of chromadb (0.5.4) causes test_vector_store to fail with an AttributeError because, for whatever reason, chromadb.api.models.Collection no longer has a model_fields attribute. I'm none the wiser after reading the related issue and don't really want to dig deeper, so for now we can just move forward by pinning chromadb==0.5.3.
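For now the pin can go straight into the environment file, in the same list style as the other pinned packages above (the comment wording here is mine):

```
- chromadb==0.5.3 # 0.5.4 breaks test_vector_store with an AttributeError; pin until fixed upstream
```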

Collaborator Author


This looks rooted in a transitory issue in chromadb related to a backwards-incompatible change in pydantic (data validation through type checking). Reading chroma-core/chroma#2503, that team is firmly on the case and 0.5.5 should cause this problem to solve itself (always having version-pinned dependencies would be a healthy habit I should adopt, though).

Collaborator


I don't think there's much that can be done when this sort of thing happens, unless you're not really interested in broad support and can just lock down all your dependencies - in which case it makes sense to use lockfiles (#14)! Good to know they're fixing it though.

Collaborator


test_image_embeddings fails whenever a cuda device is available because prepare_image moves tensors to the default cuda device whereas the scivision model returned by truncate_model lives on the CPU. My feeling is that truncate_model is fine and prepare_image doesn't need to mess with the device. Could we just remove lines 47-49?

Collaborator Author


Ugh, I didn't test this anywhere with a GPU available.

In past projects we spelled this out long-hand, e.g. with device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') - awkward if you've got more than one GPU, plus you keep having to remember to move the model and every tensor across.

Reading https://pytorch.org/tutorials/recipes/recipes/changing_default_device.html, in PyTorch 2 there are cleaner ways of doing it, including a context manager. But this project is pinned to an older torch 1.* because that's required by the older scivision plankton model, cf. alan-turing-institute/plankton-cefas-scivision#5
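One torch 1.*-compatible middle ground might be a tiny helper that autodetects a device but lets the caller override it - a sketch of the pattern, not what the code currently does (the function name is invented, and torch is imported lazily only so the snippet degrades gracefully when torch isn't installed):

```python
def resolve_device(requested=None):
    """Pick a device string: honour an explicit request, else autodetect.

    With torch 1.* there is no global default-device context manager, so the
    caller is still responsible for .to(device) on both the model and each
    batch of input tensors.
    """
    if requested is not None:
        return requested
    try:
        import torch
        return "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:  # lets the helper run in environments without torch
        return "cpu"
```

Callers who want everything on the CPU (e.g. to match the truncated scivision model) can just pass resolve_device("cpu") and nothing moves unexpectedly.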

We don't yet have a timeline for testing the new transformer-based model; we should follow up with Turing Inst folks about that - I'll try to find a handle.

Collaborator


This may be very specific to the ways I've used PyTorch, but I've never really found that moving things between devices generated too much boilerplate, whereas abstracting it away sometimes led to more problems than it solved.

I definitely don't think prepare_image should try to change the device since there are too many situations where you wouldn't want that - e.g. if you had a GPU but it didn't have enough memory for the full dataset.

If we wanted to abstract everything away we might consider using PyTorch Lightning, or otherwise replicating the LightningDataModule. Personally I'd prefer a more minimalist solution though.

Anyway I will have a think and make a new PR to discuss further!

Collaborator

@jmarshrossney jmarshrossney left a comment


Hi Jo.

Thanks for all of your work on this project. It's been fun and challenging (in a fun way) to figure out what's going on with the code and the jasmin object storage.

I notice that the script intake_metadata.py is broken on main now since metadata/metadata.csv no longer exists in the object store, which is a symptom of how long it's taken me to review this PR - sorry! So yeah, I agree that we should merge this, with a couple of very minor tweaks (see comments) and with a pin in the dependency specification, which can be sorted in a future PR.

@metazool
Collaborator Author

Thank you @jmarshrossney, much appreciated - and good call on proper dependency pinning; will fix with any upcoming work.

#10 (support different models and try BioCLIP for embeddings, if that's runnable without a GPU backend) is where I plan to look next, but I'm clearing a documentation backlog first.

@metazool metazool merged commit 0c34d98 into main Jul 22, 2024
4 checks passed
@Kzra
Collaborator

Kzra commented Jul 22, 2024

Just to add - I changed the object store layout and got rid of the metadata bucket! The object store layout is now:

untagged-images-lana
untagged-images-wala
tagged-images-lana
tagged-images-wala

Inside tagged-images-lana and tagged-images-wala there is a metadata.csv file and taxonomy.csv file.

lana refers to lancaster A, the flow cam images
wala refers to wallingford A, Isabelle's flow cytometer images

I've kept the tagged-images and untagged-images buckets for now as I know this repo is using them, but if you could change to using untagged-images-lana and tagged-images-lana, that would be great; once that's done I'll get rid of these outdated buckets.

The reason for doing this is to keep the images and metadata in separate workflows so that Isabelle can start to use the app to tag flow cytometer images.
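The layout above could be captured in a small lookup on the repo side - a hypothetical sketch (names taken from the bucket list above; the function and dict are invented here):

```python
# Object store layout as described: site codes are "lana" (Lancaster A, the
# FlowCam images) and "wala" (Wallingford A, the flow cytometer images).
BUCKETS = {
    ("lana", True): "tagged-images-lana",
    ("lana", False): "untagged-images-lana",
    ("wala", True): "tagged-images-wala",
    ("wala", False): "untagged-images-wala",
}

def metadata_key(site: str) -> str:
    """Each tagged bucket holds a metadata.csv alongside the images."""
    return f"{BUCKETS[(site, True)]}/metadata.csv"
```

Keeping the mapping in one place would make the eventual switch away from the outdated buckets a one-line change.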


@metazool
Collaborator Author

metazool commented Jul 22, 2024

Hi @Kzra ! Thank you for unpacking the changes - hadn't clocked that there was active work on the annotation side of the project.

Just to add - I changed the object store layout and got rid of the metadata bucket!

I notice that the script intake_metadata.py is broken on main now since metadata/metadata.csv no longer exists

I found having a writeable bucket (with the permissions I was offered) useful for storing two files generated to serve as a catalogue interface, using the intake package. I didn't document this other than in the script above, though, and should write something down longhand.
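For the record, an intake catalogue is just a small YAML file; a hypothetical sketch of a metadata source (source name and path invented here, not the actual files in the bucket) might look like:

```
sources:
  plankton_metadata:
    description: Metadata for the plankton image collection
    driver: csv
    args:
      # hypothetical path; the real one depends on the bucket layout
      urlpath: "s3://untagged-images/metadata.csv"
      storage_options:
        anon: false
```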

Can we persuade you to adopt a git branch-and-merge workflow for changes to cyto-app? The EDS RSE group doesn't currently have a set of recommendations and hints for this, but I can draft some based on the guidance used in a previous team. If you make changes on a branch and merge them into your project via pull request, even if you haven't got someone lined up to peer review, you could tag one of us to offer a heads-up on potentially breaking changes. I'll add this to my list of bootstrap documentation to (re)write.

See also #12 covering next steps for this part of the project

@Kzra
Collaborator

Kzra commented Jul 22, 2024

Never done this before, but if you write some guidance I can follow the instructions! There will be quite a few changes to the app over the coming weeks as we begin testing in earnest. The object store layout might expand, but the tagged-images-lana etc. buckets won't change or be deleted.

@metazool metazool deleted the more_data branch October 2, 2024 08:50