Conversation
|
Have you been able to test/benchmark this? Curious how biologically meaningful generalized universe token matches can actually be to queries
|
Waiting for Alex to run some tests on the BEDbase data... it should perform well since it just utilizes what overlaps and ignores what doesn't
|
@sanghoonio trying to do some tests now... it seems to work. I made the mistake of mixing hg19 and hg38 data, but the search results are sensible. The biggest problem with the implementation is the same as always: the BEDbase BED files contain many millions of regions in total, so the stored sparse embeddings can have >1M indices and values, which becomes a data storage problem. But it seems to work nonetheless
|
One possible solution is to just keep the top K best indices, basically keeping only where the "mass" is
|
Ah, I thought you had included the SPLADE implementation in this as well. As for sparsity management, we could use some sort of smarter activation function on the BM25 scores?
|
Re: the sparse vector storage problem with large BED files, a few ideas for activation functions that could introduce sparsity into BM25 outputs, as an alternative (or complement) to hard top-K truncation:

- **Soft thresholding:** subtract a threshold from each score and clamp at zero, so weak signals drop out of the sparse vector entirely.
- **Log-saturation (borrowing from SPLADE):** apply a log(1 + score) transform to compress the long tail of weak scores.
- **Exponential decay by rank:** sort scores descending, multiply each by a factor that decays with its rank.
- **Elbow detection:** sort scores descending, find the natural "knee" in the curve (e.g. via second derivative), and cut there. More adaptive: a file with 500 strong signals keeps 500, a file with 50 keeps 50.

Soft thresholding is probably the most practical starting point: one line of code, an interpretable parameter, and it naturally adapts to document size. It also composes cleanly if SPLADE-like sparse vectors get implemented later.
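A minimal sketch of the soft-thresholding idea; the function name and the `tau` parameter are illustrative, not part of the actual implementation:

```python
import numpy as np

def soft_threshold(scores: np.ndarray, tau: float) -> np.ndarray:
    """Shrink scores toward zero by tau and clamp negatives at zero.

    Regions whose BM25 score falls below tau drop out of the sparse
    vector entirely, so the number of stored indices adapts to signal
    strength rather than being fixed by a hard top-K cutoff.
    """
    return np.maximum(scores - tau, 0.0)

# Toy BM25 score vector: a few strong signals, many weak ones.
scores = np.array([5.2, 3.1, 0.4, 0.2, 0.1, 0.05])
sparse = soft_threshold(scores, tau=0.5)
kept = np.nonzero(sparse)[0]  # only the two strong signals survive
```

Only the surviving indices and values would then need to be stored.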
This is an implementation of the BM25 algorithm for genomic interval data as outlined in this discussion: https://github.com/databio/lab.databio.org/discussions/69. I believe that this can be used in BEDbase in conjunction with our current dense embedding search task
The BM25 algorithm is leveraged for generating sparse embeddings. Modern best practices in information retrieval recommend hybrid search, which utilizes both dense vectors and sparse vectors. BM25 lets us combine the power of Atacformer/Region2Vec/ScEmbed embeddings with sparse "key-region" embeddings.
Example usage
Here is an example usage of the BM25 embedding:
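The module's actual API isn't shown here, so the sketch below implements the standard BM25 weighting over universe-region tokens directly; all names (`bm25_sparse_embed`, `corpus_df`, etc.) are illustrative:

```python
import math
from collections import Counter

def bm25_sparse_embed(doc_tokens, corpus_df, n_docs, avgdl, k1=1.5, b=0.75):
    """Embed one BED file (a bag of universe-region token IDs) as a
    sparse {token_id: weight} dict using the BM25 weighting formula."""
    tf = Counter(doc_tokens)
    dl = len(doc_tokens)  # document length in tokens
    vec = {}
    for tok, f in tf.items():
        df = corpus_df.get(tok, 0)
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        vec[tok] = idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * dl / avgdl))
    return vec

# Toy corpus: each "document" is a BED file tokenized to universe region IDs.
docs = [[0, 1, 1, 2], [1, 2, 3], [0, 2, 2, 4]]
n_docs = len(docs)
corpus_df = Counter(tok for d in docs for tok in set(d))  # document frequency
avgdl = sum(len(d) for d in docs) / n_docs

embedding = bm25_sparse_embed(docs[0], corpus_df, n_docs, avgdl)
```

The resulting dict maps region-token indices to weights, which is exactly the index/value pair format sparse vector stores expect.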
Use with Atacformer and Qdrant
BM25 can be used with dense embedding models like Atacformer to perform hybrid search in Qdrant.
First, we need to create a Qdrant collection with both dense and sparse vector configurations:
Then we can instantiate our Atacformer and BM25 models, and insert some data into the collection:
Finally, we can perform a hybrid search using both the dense and sparse embeddings: