Conversation
|
Have you been able to test/benchmark this? Curious how biologically meaningful generalized universe token matches can actually be to queries
|
Waiting for Alex to run some tests on the BEDbase data... it should perform well since it just utilizes what overlaps and ignores what doesn't
|
@sanghoonio trying to do some tests now... it seems to work. I made the mistake of mixing hg19 and hg38 data, but the search results are sensible. The biggest problem with the implementation is the same as always: the BEDbase BED files contain many millions of regions in total, so the stored sparse embeddings can have >1M indices and values, which becomes a data storage problem. But it seems to work nonetheless
|
One possible solution is to just keep the top K best indices, basically keeping only where the "mass" is
|
Ah, I thought you had included the SPLADE implementation in this as well. As for sparsity management, we could use some sort of smarter activation function on the BM25 scores?
|
Re: the sparse vector storage problem with large BED files, a few ideas for activation functions that could introduce sparsity into BM25 outputs, as an alternative (or complement) to hard top-K truncation:

- **Soft thresholding:** subtract a threshold from each score and clamp at zero, so weak signals drop out of the sparse vector entirely.
- **Log-saturation (borrowing from SPLADE):** apply a log(1 + score) transform to compress the long tail of weak scores.
- **Exponential decay by rank:** sort scores descending, multiply each by a factor that decays with its rank.
- **Elbow detection:** sort scores descending, find the natural "knee" in the curve (e.g. via second derivative), and cut there. More adaptive: a file with 500 strong signals keeps 500, a file with 50 keeps 50.

Soft thresholding is probably the most practical starting point: one line of code, an interpretable parameter, and it naturally adapts to document size. It also composes cleanly if SPLADE-like sparse vectors get implemented later.
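A minimal sketch of the soft-thresholding idea; the function name and the `tau` parameter are illustrative, not part of the actual implementation:

```python
import numpy as np

def soft_threshold(scores: np.ndarray, tau: float) -> np.ndarray:
    """Shrink scores toward zero by tau and clamp negatives at zero.

    Regions whose BM25 score falls below tau drop out of the sparse
    vector entirely, so the number of stored indices adapts to signal
    strength rather than being fixed by a hard top-K cutoff.
    """
    return np.maximum(scores - tau, 0.0)

# Toy BM25 score vector: a few strong signals, many weak ones.
scores = np.array([5.2, 3.1, 0.4, 0.2, 0.1, 0.05])
sparse = soft_threshold(scores, tau=0.5)
kept = np.nonzero(sparse)[0]  # only the two strong signals survive
```

Only the surviving indices and values would then need to be stored.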
This is an implementation of the BM25 algorithm for genomic interval data as outlined in this discussion: https://github.com/databio/lab.databio.org/discussions/69. I believe that this can be used in BEDbase in conjunction with our current dense embedding search task
The BM25 algorithm is leveraged for generating sparse embeddings. Modern best practices in information retrieval recommend hybrid search, which utilizes both dense vectors and sparse vectors. BM25 lets us combine the power of Atacformer/Region2Vec/ScEmbed embeddings with sparse "key-region" embeddings.
Example usage
Here is an example usage of the BM25 embedding:
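The module's actual API isn't shown here, so the sketch below implements the standard BM25 weighting over universe-region tokens directly; all names (`bm25_sparse_embed`, `corpus_df`, etc.) are illustrative:

```python
import math
from collections import Counter

def bm25_sparse_embed(doc_tokens, corpus_df, n_docs, avgdl, k1=1.5, b=0.75):
    """Embed one BED file (a bag of universe-region token IDs) as a
    sparse {token_id: weight} dict using the BM25 weighting formula."""
    tf = Counter(doc_tokens)
    dl = len(doc_tokens)  # document length in tokens
    vec = {}
    for tok, f in tf.items():
        df = corpus_df.get(tok, 0)
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        vec[tok] = idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * dl / avgdl))
    return vec

# Toy corpus: each "document" is a BED file tokenized to universe region IDs.
docs = [[0, 1, 1, 2], [1, 2, 3], [0, 2, 2, 4]]
n_docs = len(docs)
corpus_df = Counter(tok for d in docs for tok in set(d))  # document frequency
avgdl = sum(len(d) for d in docs) / n_docs

embedding = bm25_sparse_embed(docs[0], corpus_df, n_docs, avgdl)
```

The resulting dict maps region-token indices to weights, which is exactly the index/value pair format sparse vector stores expect.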
Use with Atacformer and Qdrant
BM25 can be used with dense embedding models like Atacformer to perform hybrid search in Qdrant.
First, we need to create a Qdrant collection with both dense and sparse vector configurations:
Then we can instantiate our Atacformer and BM25 models, and insert some data into the collection:
Finally, we can perform a hybrid search using both the dense and sparse embeddings: