Image indexing #65

Open
1 of 4 tasks
metazool opened this issue Jan 28, 2025 · 0 comments

metazool commented Jan 28, 2025

We tried an intake catalog at one point and did not get much benefit from it (#3).

By switching from chromadb (which is itself sqlite-based) to sqlite-vec, we gained a relational database with minimal metadata, used for search purposes (#44).
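A minimal sketch of that layout, using only the stdlib `sqlite3` module (table and column names are hypothetical; the sqlite-vec virtual-table DDL is shown in a comment, as issuing it requires the extension to be loaded):

```python
import sqlite3

# In-memory database for illustration; the real index lives on disk.
db = sqlite3.connect(":memory:")

# Regular table holding the minimal metadata used in search results.
db.execute(
    """CREATE TABLE images (
        id INTEGER PRIMARY KEY,
        s3_key TEXT NOT NULL,
        captured_at TEXT
    )"""
)

# With the sqlite-vec extension loaded (sqlite_vec.load(db)), the
# embeddings would live in a vec0 virtual table sharing the same rowid:
#   CREATE VIRTUAL TABLE image_embeddings USING vec0(embedding float[512]);
# and a KNN query would join back to `images` on rowid.

db.execute(
    "INSERT INTO images (s3_key, captured_at) VALUES (?, ?)",
    ("bucket/plankton/0001.jpg", "2025-01-28T10:00:00"),
)
row = db.execute("SELECT s3_key FROM images WHERE id = 1").fetchone()
print(row[0])  # bucket/plankton/0001.jpg
```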

At one stage there was an intention to add indexing during the image processing pipeline. There isn't much benefit to doing that: it adds moving parts, and it slows things down if it involves one or more deep learning models. We also can't guarantee that it will always run in the same place with the same code, and we want the index to reflect what's in storage, not the processes that put it there.
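One way to keep the index tied to storage rather than to any particular pipeline run is to reconcile the two sets of keys periodically. A minimal sketch under that assumption (function and key names hypothetical):

```python
def reconcile(storage_keys, indexed_keys):
    """Compare what's in object storage against what's indexed.

    Returns keys to (re)index and stale index entries to prune, so the
    index converges on the contents of storage regardless of which
    process uploaded the images.
    """
    storage = set(storage_keys)
    indexed = set(indexed_keys)
    return sorted(storage - indexed), sorted(indexed - storage)


to_index, to_prune = reconcile(
    ["img/001.jpg", "img/002.jpg", "img/003.jpg"],
    ["img/001.jpg", "img/999.jpg"],
)
print(to_index)  # ['img/002.jpg', 'img/003.jpg']
print(to_prune)  # ['img/999.jpg']
```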

There's a nice-to-have future in which each image goes into a message queue or broker as it's uploaded to s3 storage, and different indexers can pick it up and annotate it, but we're nowhere near that at the moment.
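That fan-out could eventually look something like the sketch below, using a stdlib `queue.Queue` as a stand-in for a real broker (all names are hypothetical; a real deployment would use something like SQS or RabbitMQ with per-consumer delivery):

```python
import queue

# Stand-in for a message broker: each upload is published as a message.
uploads = queue.Queue()
for key in ["plankton/a.jpg", "plankton/b.jpg"]:
    uploads.put(key)


def embedding_indexer(key):
    """Hypothetical indexer that would compute an image embedding."""
    return {"key": key, "annotation": "embedding"}


def exif_indexer(key):
    """Hypothetical indexer that would extract EXIF metadata."""
    return {"key": key, "annotation": "exif"}


# Each message fans out to every registered indexer independently.
annotations = []
while not uploads.empty():
    key = uploads.get()
    for indexer in (embedding_indexer, exif_indexer):
        annotations.append(indexer(key))

print(len(annotations))  # 4
```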

We do (should?) have spatio-temporal EXIF headers corresponding to sample location, which is helpful!
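EXIF stores GPS coordinates as (degrees, minutes, seconds) rationals plus a hemisphere reference, so a small conversion helper is useful before those headers can feed a spatio-temporal index. A sketch, with a hypothetical function name and example coordinates:

```python
def dms_to_decimal(dms, ref):
    """Convert EXIF-style (degrees, minutes, seconds) and a hemisphere
    reference ('N'/'S'/'E'/'W') to signed decimal degrees."""
    degrees, minutes, seconds = dms
    value = degrees + minutes / 60 + seconds / 3600
    return -value if ref in ("S", "W") else value


# e.g. a sample location at 51°30'0"N, 0°7'39.6"W
lat = dms_to_decimal((51, 30, 0), "N")
lon = dms_to_decimal((0, 7, 39.6), "W")
print(lat, lon)
```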

  • Fix the sqlite-vec database based on what we learned from the fdri_phenocam prototype (the virtual table is really only for an index of the embeddings, and needs a regular table alongside it for other metadata)
  • Extend or repair the current DVC pipeline to index everything in each bucket, and do so
  • Packaged way of adding a detritus classifier to image processing (#32): create a clustering model based on image embeddings for each collection, and package that up as a Label Studio ML backend
  • Make some recommendations for handling larger volumes: at what point will the number of image objects in storage (hundreds of thousands? millions?) start to become painful to work with? Review s3-specific advice. This wants to be its own issue
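On the volume question, one widely-given piece of s3-specific advice is to spread object keys across many prefixes, since request-rate limits apply per prefix. A hedged sketch of hash-based key sharding (the key layout here is an assumption, not our current scheme):

```python
import hashlib


def sharded_key(collection, filename, shards=256):
    """Prefix each object key with a short, stable hash so objects
    spread evenly across `shards` prefixes instead of piling up
    under a single hot prefix."""
    digest = hashlib.sha256(filename.encode()).hexdigest()
    shard = int(digest[:8], 16) % shards
    return f"{collection}/{shard:02x}/{filename}"


key = sharded_key("plankton", "sample_0001.jpg")
print(key)
```

Because the shard is derived from the filename, the same object always maps to the same prefix, so lookups stay deterministic.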
Status: In Progress