This repository contains a scikit-learn-compatible vectorizer that produces document embeddings via the SIF (smooth inverse frequency) algorithm described in Arora, Liang, and Ma, "A Simple but Tough-to-Beat Baseline for Sentence Embeddings" (ICLR 2017).
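At its core, SIF computes a weighted average of word vectors, scaling each word by a / (a + p(w)), and then removes the projection of the resulting document vectors onto their first principal component. The following is a minimal NumPy sketch of that computation, not this repository's implementation; `word_vec` and `word_prob` are assumed lookups analogous to the vectorizer's `word_vectorizer` / `word_freq` parameters, and the weighting parameter `a` is set to 1e-3, a value in the range the paper recommends.

```python
import numpy as np

def sif_embeddings(docs_tokens, word_vec, word_prob, a=1e-3):
    """Toy SIF: docs_tokens is a list of token lists; word_vec / word_prob map a word
    to its vector / unigram probability."""
    # Weighted average of word vectors, each word scaled by a / (a + p(w)).
    emb = np.array([
        np.mean([a / (a + word_prob(w)) * word_vec(w) for w in tokens], axis=0)
        for tokens in docs_tokens
    ])
    # Common component removal: subtract the projection onto the first principal component.
    u, _, _ = np.linalg.svd(emb.T @ emb)
    pc = u[:, :1]
    return emb - emb @ pc @ pc.T
```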
```
$ pip install -r requirements.txt
```
sts_benchmark.ipynb reproduces the results on the STSbenchmark dataset. The benchmark uses spaCy to obtain word vectors for the words in the corpus, but the vectorizer accepts word vectors from any source (e.g. GloVe). Unigram probabilities p(w) come either from enwiki_vocab_min200.txt, the vocabulary file used in the original paper, or from the wordfreq package.
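For example, the `word_vectorizer` and `word_freq` callables used in the example below could be built from spaCy and wordfreq roughly as follows (a sketch, assuming a spaCy model that ships word vectors, such as `en_core_web_md`, is installed):

```python
import spacy
from wordfreq import word_frequency

nlp = spacy.load("en_core_web_md")  # any spaCy model with word vectors

word_vectorizer = lambda word: nlp.vocab[word].vector  # vector lookup for a single word
word_freq = lambda word: word_frequency(word, "en")    # estimated unigram frequency
```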
docs_train = ["This is a training document or sentence", "Another train document or sentence"]
docs_test = ["I'm a test document"]
vectorizer = EmbeddingVectorizer(
tokenizer=lambda doc: doc.split(), # a function or lambda to tokenize the input documents / sentences
word_vectorizer=lambda word: word_vectors[word], # a function or lambda to obtain a vector for a given word
word_freq=lambda word: word_frequencies[word], # a function or lambda to obtain the frequency of a given word
weighted=True,
remove_components=1,
lowercase=True)
vectorizer.fit(docs_train)
vectorizer.transform(docs_test)
- STSbenchmark description: http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark
- STSbenchmark dataset: http://ixa2.si.ehu.es/stswiki/images/4/48/Stsbenchmark.tar.gz
- STSbenchmark paper: http://www.aclweb.org/anthology/S/S17/S17-2001.pdf
- SIF paper: https://openreview.net/pdf?id=SyK00v5xx
- SIF reference implementation (GitHub): https://github.com/PrincetonML/SIF
- enwiki_vocab_min200.txt: https://github.com/PrincetonML/SIF/tree/master/auxiliary_data