This code implements relatively fast cosine-similarity computation and kNN classification for large matrices, so I never have to worry about it again. It outperforms Gensim's vector-comparison function by making clever use of einsums to speed up computation, and the computations can be batched to reduce memory usage. Vectors are classified using a kNN majority-vote approach.
The main algorithm is implemented in `src/knn.py`.
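The approach described above can be sketched as follows. This is a minimal illustration, not the actual implementation in `src/knn.py`: the function names, shapes, and the choice of `k` and batch size are assumptions for the example.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a (m, d) and b (n, d)."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    # einsum computes all pairwise dot products without explicit transposes
    return np.einsum("id,jd->ij", a_norm, b_norm)

def knn_classify(queries, train, labels, k=5, batch_size=1024):
    """Majority-vote kNN; queries are processed in batches to bound memory."""
    preds = []
    for start in range(0, len(queries), batch_size):
        sims = cosine_similarity(queries[start:start + batch_size], train)
        # unordered indices of the k most similar training vectors per query
        top_k = np.argpartition(-sims, k - 1, axis=1)[:, :k]
        for row in top_k:
            values, counts = np.unique(labels[row], return_counts=True)
            preds.append(values[np.argmax(counts)])
    return np.array(preds)
```

Batching only the query side keeps the peak intermediate at `batch_size × n` similarities instead of `m × n`, which is what makes this workable for large matrices.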
You can test locally in a Docker container with an ES index, if you have credentials for an Elasticsearch cluster:

- Port-forward the Elasticsearch cluster to port `9200`.
- Set the Elasticsearch environment variables in `src/local.env`: `ES_USERNAME` and `ES_PASSWORD` if the cluster requires authentication, and `ES_INDEX` with the name of the index that you want to process. The index should contain documents with a field named `full_text`.
- Run `make run`. This will train the model and add a `similar_docs` field to the documents in the index. Note that this Make command limits CPU and memory usage; you can adjust this with the variables set in the `Makefile`.
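For reference, `src/local.env` might look like the following; the values are placeholders, and only the three variable names above are taken from this README:

```
ES_USERNAME=elastic
ES_PASSWORD=changeme
ES_INDEX=my-documents
```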