Proof of concept of similarity search with the scivision model #5
Conversation
I'm keen to merge this PR, on the understanding that it's still a rough prototype; #8 is not actionable otherwise. The same goes for the next one, #7: though I still have doubts about its validity, it was a useful exercise, and there are small refactors and improved test coverage along with the inconclusive notebook. @jmarshrossney / @albags you've both kindly cast eyes on this experiment - are you prepared to approve it?
```yaml
- intake # for reading scivision
- torch==1.10.0 # install before cefas_scivision; it needs this version
- scivision
- scikit-image
- setuptools==69.5.1 # because this bug https://github.com/pytorch/serve/issues/3176
- tiffile
```
I think `tiffile` is deprecated and this should now be `tifffile` (3 fs)!
but:

```
[~]$ pip install tiffile
Collecting tiffile
  Using cached tiffile-2018.10.18-py2.py3-none-any.whl.metadata (883 bytes)
Collecting tifffile (from tiffile)
  Downloading tifffile-2024.7.21-py3-none-any.whl.metadata (31 kB)
Collecting numpy (from tifffile->tiffile)
  Downloading numpy-2.0.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Using cached tiffile-2018.10.18-py2.py3-none-any.whl (2.7 kB)
Downloading tifffile-2024.7.21-py3-none-any.whl (225 kB)
```

???
fair enough!
```yaml
- intake # for reading scivision
- torch==1.10.0 # install before cefas_scivision; it needs this version
- scivision
- scikit-image
- setuptools==69.5.1 # because this bug https://github.com/pytorch/serve/issues/3176
- tiffile
- git+https://github.com/alan-turing-institute/plankton-cefas-scivision@main # torch version
- chromadb
```
The latest version of `chromadb` (0.5.4) causes `test_vector_store` to fail with an `AttributeError`, because for whatever reason `chromadb.api.models.Collection` no longer has an attribute `model_fields`. I'm none the wiser after reading the related issue and don't really want to dig deeper, so for now we can just move forward by pinning `chromadb==0.5.3`.
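For context, a minimal sketch of the kind of round-trip `test_vector_store` presumably exercises (the collection name and toy vectors here are made up, not taken from the actual test); this runs fine on `chromadb==0.5.3`:

```python
import chromadb

# In-memory client; the real tests may use a persistent store instead.
client = chromadb.Client()
collection = client.create_collection(name="plankton_test")

# Store two toy embeddings, then query for the nearest neighbour.
collection.add(ids=["a", "b"], embeddings=[[0.1, 0.2, 0.3], [0.9, 0.8, 0.7]])
result = collection.query(query_embeddings=[[0.1, 0.2, 0.25]], n_results=1)
assert result["ids"] == [["a"]]
```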
This looks rooted in a transitory issue in `chromadb` related to a backwards-incompatible change in `pydantic` (data validation through type checking) - reading chroma-core/chroma#2503, that team is firmly on the case and 0.5.5 should cause this problem to solve itself (always having version-pinned dependencies would be a healthy habit I should adopt, though).
I don't think there's much that can be done when this sort of thing happens, unless you're not really interested in broad support and can just lock down all your dependencies, in which case it makes sense to use lockfiles (#14)! Good to know they're fixing it though.
`test_image_embeddings` fails whenever a CUDA device is available, because `prepare_image` moves tensors to the default CUDA device whereas the scivision model returned by `truncate_model` lives on the CPU. My feeling is that `truncate_model` is fine and `prepare_image` doesn't need to mess with the device. Could we just remove lines 47-49?
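To illustrate, a hypothetical sketch of a device-neutral `prepare_image` - only the function name is taken from the repo; the target size and transforms here are assumptions:

```python
import numpy as np
import torch
from skimage import transform

def prepare_image(image: np.ndarray) -> torch.Tensor:
    """Resize and reshape an HWC image into an NCHW float tensor.

    Deliberately does no .to(device): placement is left to the caller.
    """
    resized = transform.resize(image, (256, 256), anti_aliasing=True)
    tensor = torch.as_tensor(resized, dtype=torch.float32)
    return tensor.permute(2, 0, 1).unsqueeze(0)
```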
Ugh, I didn't test this anywhere with a GPU available.

In past projects we spelled this out long-hand, e.g. with `device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')` - awkward if you've got more than one GPU, plus you keep having to remember to move everything onto it explicitly.

https://pytorch.org/tutorials/recipes/recipes/changing_default_device.html - reading that, in PyTorch 2 there are cleaner ways of doing it, including a context manager. But this project is pinned to an older `torch` 1.* because that's required by the older `scivision` plankton model, cf. alan-turing-institute/plankton-cefas-scivision#5.

We don't yet have a timeline for testing the new transformer-based model; I should follow up with Turing Inst folks about that - I'll try to find a handle.
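A minimal sketch of both patterns (the `Linear` layer and random batch are just stand-ins):

```python
import torch

# Long-hand pattern, compatible with the torch 1.x pin in this project:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(4, 2).to(device)  # remember to move the model...
batch = torch.randn(8, 4).to(device)      # ...and every input tensor
out = model(batch)                        # both now live on the same device

# PyTorch 2 alternatives from the recipe linked above (not usable while
# pinned to torch 1.10):
#   torch.set_default_device("cuda")   # process-wide default device
#   with torch.device("cuda"):         # context manager
#       layer = torch.nn.Linear(4, 2)  # parameters created directly on the GPU
```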
This may be very specific to the ways I've used PyTorch, but I've never really found that moving things between devices generated too much boilerplate, whereas abstracting it away sometimes led to more problems than it solved.
I definitely don't think `prepare_image` should try to change the device, since there are too many situations where you wouldn't want that - e.g. if you had a GPU but it didn't have enough memory for the full dataset.
If we wanted to abstract everything away we might consider using PyTorch Lightning, or otherwise replicating the LightningDataModule. Personally I'd prefer a more minimalist solution though.
Anyway I will have a think and make a new PR to discuss further!
Hi Jo.
Thanks for all of your work on this project. It's been fun and challenging (in a fun way) to figure out what's going on with the code and the JASMIN object storage.

I notice that the script `intake_metadata.py` is now broken on `main`, since `metadata/metadata.csv` no longer exists in the object store - which is a symptom of how long it's taken me to review this PR, sorry! So yeah, I agree that we should merge this, with a couple of very minor tweaks (see comments) and with a pin in the dependency specification, which can be sorted in a future PR.
Thank you @jmarshrossney, much appreciated - and good call on proper dependency pinning, will fix with any upcoming work. #10 (support different models and try BioCLIP for embeddings, if that's runnable without a GPU backend) is where I plan to look next, but I'm clearing a documentation backlog first.
Just to add - I changed the object store layout and got rid of the metadata bucket! The object store layout now includes:

- untagged-images-lana
- tagged-images-lana
- tagged-images-wala

Inside tagged-images-lana and tagged-images-wala there is a metadata.csv file and a taxonomy.csv file. lana refers to Lancaster A, the flow cam images. I've kept the tagged-images and untagged-images buckets for now as I know this repo is using them, but if you could change to using untagged-images-lana and tagged-images-lana, that would be great; once that's done I'll get rid of these outdated buckets. The reason for doing this is to keep the images and metadata in separate workflows, so that Isabelle can start to use the app to tag flow cytometer images.
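To illustrate reading from the new layout, a hypothetical sketch - the endpoint and credential variable names are assumptions, not the repo's actual `.env` keys:

```python
import os

import pandas as pd
import s3fs

# Connect to the object store; key names and endpoint are assumptions here.
fs = s3fs.S3FileSystem(
    key=os.environ["OS_ACCESS_KEY"],
    secret=os.environ["OS_SECRET_KEY"],
    client_kwargs={"endpoint_url": os.environ["OS_ENDPOINT_URL"]},
)

# Each tagged-images-* bucket now carries its own metadata alongside images.
with fs.open("tagged-images-lana/metadata.csv") as f:
    metadata = pd.read_csv(f)
```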
Hi @Kzra! Thank you for unpacking the changes - hadn't clocked that there was active work on the annotation side of the project.
I found having a writeable bucket (with the permissions I was offered) useful to store two files that were generated to serve as a catalogue interface.

Can we persuade you to adopt a git branch and merge workflow for changes to the object store layout? See also #12 covering next steps for this part of the project.
Never done this before, but if you write some guidance I can follow the instructions! There will be quite a few changes to the app over the coming weeks as we begin testing in earnest. The object store layout might expand, but the tagged-images-lana etc. buckets won't change or be deleted.
This builds up the demo to show a reasonable proof of concept, for a human viewer, of enough success in extracting image embeddings from the `scivision` plankton model to keep building experiments on it in the short term.

What's in this

- `intake` catalogue is written to use the whole untagged image collection, not the subset of labelled ones
- `flake8` action to collect them
- `image_embeddings.py` script to run the whole collection through the model (see the embedding sketch after the next list)

What isn't in this

- Any kind of robust approach to spreading the workload around with Dask: partly the collection isn't big enough to justify it yet, partly I haven't worked with Dask enough yet to understand the best approach, and I would appreciate any advice on that topic, including from the DevOps folks (see the notes in comments in `scripts/image_embeddings.py`).
- Any exploration of model explainability techniques using the prediction capabilities of the CEFAS model, as opposed to using it as a source of embeddings. That's a good place to visit next, before trying any clustering algorithms on the embeddings: step back and observe whether what the model is seeing is properly coherent, whether it aligns with the CEFAS reference data, or whether there are factors like image dimensions giving a false-positive impression of these initial results.
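A minimal sketch of the embedding-extraction idea referenced above - the ResNet-18 stand-in and input shape are assumptions; the real model comes from the `scivision` catalogue and is cut down by `truncate_model`:

```python
import torch
import torchvision

# Stand-in classifier; the real one is the CEFAS plankton model via scivision.
model = torchvision.models.resnet18(pretrained=True)  # torch 1.x-era argument

# Drop the final classification head, keeping the pooled feature extractor.
truncated = torch.nn.Sequential(*list(model.children())[:-1])
truncated.eval()

with torch.no_grad():
    batch = torch.randn(1, 3, 224, 224)      # stand-in preprocessed image
    embedding = truncated(batch).flatten(1)  # shape (1, 512): the embedding
```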
To test

1. `export PYTHONPATH=.` or `pip install -e .`
2. `py.test`

You may need the right credentials in `.env` to be able to run the scripts which generate the index and subsequently the embeddings. I've only run this locally; there are 8k images. I should set it up to be able to reproduce this using only the images from @Kzra that are in the test fixtures.