Proof of concept of similarity search with the scivision model #5
@@ -0,0 +1,2 @@
[flake8]
max-line-length=120
@@ -1,3 +1,4 @@
.env
**/.ipynb_checkpoints/
**/__pycache__/
vectors/
@@ -0,0 +1,31 @@
"""Utilities for expressing our dataset as an intake catalog"""


def intake_yaml(
    test_url: str,
    catalog_url: str,
):
    """
    Write a minimal YAML template describing this as an intake datasource.
    Example: plankton dataset made available through scivision; metadata:
    https://raw.githubusercontent.com/alan-turing-institute/plankton-cefas-scivision/test_data_catalog/scivision.yml
    See the comments below for decisions about its structure.
    """
    template = f"""
sources:
  test_image:
    description: Single test image from the plankton collection
    origin:
    driver: intake_xarray.image.ImageSource
    args:
      urlpath: ["{test_url}"]
      exif_tags: False
  plankton:
    description: A CSV index of all the images of plankton
    origin:
    driver: intake.source.csv.CSVSource
    args:
      urlpath: ["{catalog_url}"]
"""
    # coerce_shape: [256, 256]
    return template
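For orientation, a rough sketch of how a catalog rendered from this template might be consumed - the scivision.yml filename is an assumption for illustration, not something this PR writes:

import intake

# Assumes the rendered template has been saved, e.g. as scivision.yml
cat = intake.open_catalog("scivision.yml")

index = cat.plankton.read()       # the CSV index as a pandas DataFrame
image = cat.test_image.to_dask()  # the single test image, lazily loaded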
@@ -0,0 +1,20 @@
"""Thin wrapper around the s3 object store with images and metadata"""

import s3fs
from dotenv import load_dotenv
import os

load_dotenv()


def s3_endpoint():
    """Return a reference to the object store,
    reading the credentials set in the environment.
    """
    fs = s3fs.S3FileSystem(
        anon=False,
        key=os.environ.get("FSSPEC_S3_KEY", ""),
        secret=os.environ.get("FSSPEC_S3_SECRET", ""),
        client_kwargs={"endpoint_url": os.environ["ENDPOINT"]},
    )
    return fs
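A hedged usage sketch (the bucket and object names here are hypothetical):

fs = s3_endpoint()

# s3fs exposes a filesystem-like API over the object store
print(fs.ls("plankton-images"))  # list objects in a hypothetical bucket
with fs.open("plankton-images/example.tif", "rb") as f:
    data = f.read()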
@@ -0,0 +1,13 @@
from intake_xarray import ImageSource
from torch import Tensor
from cyto_ml.models.scivision import prepare_image, flat_embeddings


def test_embeddings(scivision_model, single_image):
    features = scivision_model(prepare_image(ImageSource(single_image).to_dask()))

    assert isinstance(features, Tensor)

    embeddings = flat_embeddings(features)

    assert len(embeddings) == features.size()[1]
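prepare_image and flat_embeddings are imported from cyto_ml.models.scivision and are not part of this diff; purely as an assumption for readers, the test above is consistent with helpers of roughly this shape (the _sketch names are hypothetical):

import numpy as np
import torch

def prepare_image_sketch(image) -> torch.Tensor:
    # Hypothetical: coerce an xarray/dask image of shape (H, W, C) into a
    # float batch tensor of shape (1, C, H, W)
    array = torch.as_tensor(np.asarray(image), dtype=torch.float32)
    return array.permute(2, 0, 1).unsqueeze(0)

def flat_embeddings_sketch(features: torch.Tensor) -> list:
    # Hypothetical: flatten a (1, N) feature tensor to a plain list of N floats,
    # matching the assertion len(embeddings) == features.size()[1]
    return features.detach().squeeze(0).tolist()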
@@ -0,0 +1,21 @@
from cyto_ml.data.vectorstore import vector_store, client
import numpy as np


def test_client_no_telemetry():
    assert not client.get_settings()["anonymized_telemetry"]


def test_store():
    store = vector_store()  # default 'test_collection'
    id = "id_1"  # insists on a str
    filename = "https://example.com/filename.tif"
    store.add(
        documents=[filename],  # we use image location in s3 rather than text content
        embeddings=[list(np.random.rand(2048))],  # wants a list of lists
        ids=[id],  # wants a list of ids
    )

    record = store.get("id_1", include=["embeddings"])
    assert record
    assert len(record["embeddings"][0]) == 2048
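Since the point of this PR is similarity search, a short sketch of how the stored embeddings could be queried back through chromadb's collection API (the random query vector is purely illustrative):

query = list(np.random.rand(2048))
results = store.query(query_embeddings=[query], n_results=2)

# results is a dict of lists: the nearest ids and their distances
print(results["ids"], results["distances"])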
@@ -10,10 +10,12 @@ dependencies:
  - dask
  - pip:
    - pytest
    - imagecodecs
    - intake # for reading scivision
    - torch==1.10.0 # install before cefas_scivision; it needs this version
    - scivision
    - scikit-image
    - setuptools==69.5.1 # because of this bug: https://github.com/pytorch/serve/issues/3176
    - tiffile
    - git+https://github.com/alan-turing-institute/plankton-cefas-scivision@main # torch version
    - chromadb

Review discussion on the chromadb dependency:

"The latest version of …"

"This looks rooted in a transitory issue in …"

"I don't think there's much that can be done when this sort of thing happens, unless you're not really interested in broad support and can just lock down all your dependencies, in which case it makes sense to use lockfiles (#14)! Good to know they're fixing it though."
Review discussion on device placement in prepare_image:

"test_image_embeddings fails whenever a CUDA device is available, because prepare_image moves tensors to the default CUDA device whereas the scivision model returned by truncate_model lives on the CPU. My feeling is that truncate_model is fine and prepare_image doesn't need to mess with the device. Could we just remove lines 47-49?"

"Ugh, I didn't test this anywhere with a GPU available. In past projects we spelled this out long-hand, e.g. with device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') - awkward if you've got more than one GPU, plus you keep having to remember it. Reading https://pytorch.org/tutorials/recipes/recipes/changing_default_device.html, in PyTorch 2 there are cleaner ways of doing it, including a context manager. But this project is pinned to an older torch 1.* because that's required by the older scivision plankton model, cf. alan-turing-institute/plankton-cefas-scivision#5. We don't yet have a timeline for testing the new transformer-based model; we should follow up with Turing Inst folks about that - I'll try to find a handle."
"This may be very specific to the ways I've used PyTorch, but I've never really found that moving things between devices generated too much boilerplate, whereas abstracting it away sometimes led to more problems than it solved.

I definitely don't think prepare_image should try to change the device, since there are too many situations where you wouldn't want that - e.g. if you had a GPU but it didn't have enough memory for the full dataset.

If we wanted to abstract everything away we might consider using PyTorch Lightning, or otherwise replicating the LightningDataModule. Personally I'd prefer a more minimalist solution, though.

Anyway, I will have a think and make a new PR to discuss further!"
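A minimal sketch of the caller-controlled approach argued for here - the function name is hypothetical, not the PR's actual prepare_image:

import numpy as np
import torch

def prepare_image_device_agnostic(array) -> torch.Tensor:
    # Hypothetical: handle shape/dtype only; leave device placement
    # to the caller rather than calling .to() internally
    return torch.as_tensor(array, dtype=torch.float32).unsqueeze(0)

model = torch.nn.Identity()  # stand-in for the truncated scivision model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batch = prepare_image_device_agnostic(np.zeros((3, 256, 256))).to(device)
features = model.to(device)(batch)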