We tried an intake catalog at one point and did not get much benefit from it (#3).
By switching from chromadb (which is itself sqlite-based) to sqlite-vec, we gained a relational database holding minimal metadata, used for search purposes (#44).
At one stage there was an intention to add indexing during the image processing pipeline. There isn't much benefit to doing that: it adds moving parts, and it slows things down if it involves one or more deep learning models. We also can't guarantee that indexing will always run in the same place, with the same code, and we want the index to reflect what's in storage, not the processes that put it there.
In a nice-to-have future, each image would go into a message queue or broker as it is uploaded to s3 storage, and different indexers could pick it up and annotate it, but we're nowhere near that at the moment.
We do (or at least should) have spatio-temporal EXIF headers corresponding to sample location, which is helpful!
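If those EXIF headers are present, the GPS fields can be turned into decimal coordinates for the index. A minimal sketch, assuming the (degrees, minutes, seconds) rationals have already been read from the image (e.g. via Pillow or exifread); the helper name and example coordinates are illustrative, not taken from the project:

```python
# Sketch: converting EXIF GPS headers to signed decimal degrees.
# Assumes the GPS fields have already been extracted from the image;
# the function name and sample values are hypothetical.

def dms_to_decimal(dms, ref):
    """Convert an EXIF (degrees, minutes, seconds) tuple plus a
    hemisphere reference ('N'/'S'/'E'/'W') to signed decimal degrees."""
    degrees, minutes, seconds = dms
    value = degrees + minutes / 60 + seconds / 3600
    # Southern and western hemispheres are negative
    return -value if ref in ("S", "W") else value

# Hypothetical sample site at 51° 36' 18" N, 0° 59' 24" W
lat = dms_to_decimal((51, 36, 18), "N")
lon = dms_to_decimal((0, 59, 24), "W")
```

Storing the decimal values alongside the capture timestamp would let the index answer spatio-temporal queries directly.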
- Fix the sqlite-vec database based on what we learned from the fdri_phenocam prototype (the virtual table is really only an index over the embeddings, and needs a companion regular table for the other metadata)
- Extend or repair the current DVC pipeline to index everything in each bucket, and run it
- Make recommendations for handling larger volumes - at what point will the number of image objects in storage (hundreds of thousands? millions?) start to become painful to work with? Review s3-specific advice. This wants to be its own issue.
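The two-table layout from the first task could look something like the sketch below: a vec0 virtual table holding only the embeddings, joined by rowid to a regular table carrying everything else. Table and column names here are illustrative, not the project's actual schema, and the vec0 DDL is shown as a string because it needs the sqlite-vec extension loaded (e.g. via `sqlite_vec.load(conn)`) before it can run:

```python
# Sketch of the two-table design: embeddings in a vec0 virtual table,
# other metadata in a regular table keyed to the same rowid.
# All names here are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")

# Regular table: one row per image object, keyed to the embedding's rowid.
conn.execute("""
    CREATE TABLE image_meta (
        id INTEGER PRIMARY KEY,     -- matches rowid in the vec0 table
        s3_key TEXT NOT NULL,
        captured_at TEXT,           -- ISO 8601 timestamp from EXIF
        lat REAL,
        lon REAL
    )
""")

# Virtual table: embeddings only. Shown as DDL for reference; executing it
# requires the sqlite-vec extension to be loaded into the connection.
VEC_DDL = "CREATE VIRTUAL TABLE image_vec USING vec0(embedding float[512])"

conn.execute(
    "INSERT INTO image_meta (id, s3_key, captured_at, lat, lon)"
    " VALUES (?, ?, ?, ?, ?)",
    (1, "bucket/site-a/2024/06/img_0001.jpg", "2024-06-01T10:30:00",
     51.6, -0.99),
)
row = conn.execute("SELECT s3_key FROM image_meta WHERE id = 1").fetchone()
```

A nearest-neighbour query against the vec0 table would then return rowids, which join back to `image_meta` for the s3 key and location.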
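On the larger-volume question, one common mitigation worth weighing is structuring object keys so listings can be partitioned by prefix, since S3's ListObjectsV2 returns at most 1,000 keys per page and walking millions of objects in one flat namespace gets slow. A sketch under the assumption of a site/year/month key layout (not the project's current scheme):

```python
# Sketch: deriving a key prefix so indexers can list one partition at a
# time instead of walking the whole bucket. The key layout is an
# assumption for illustration.
from datetime import datetime

def key_prefix(site: str, captured_at: datetime) -> str:
    """Derive a site/year/month prefix for partitioned bucket listings."""
    return f"{site}/{captured_at.year:04d}/{captured_at.month:02d}/"

prefix = key_prefix("site-a", datetime(2024, 6, 1, 10, 30))
```

Whether this (or S3 Inventory reports, or keeping the sqlite index itself as the source of listings) is the right call probably belongs in the separate issue flagged above.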