
Scikit-learn vectorizer implementing "A simple but tough-to-beat baseline for sentence embeddings." by Arora, Sanjeev, Yingyu Liang, and Tengyu Ma. (2016)


ChristophAlt/embedding_vectorizer


EmbeddingVectorizer

This repository implements a scikit-learn vectorizer that produces document embeddings using the SIF (smooth inverse frequency) algorithm described in Arora, Sanjeev, Yingyu Liang, and Tengyu Ma, "A simple but tough-to-beat baseline for sentence embeddings" (2016).
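For orientation, here is a minimal NumPy sketch of the two steps the paper describes: a weighted average of word vectors with weights a / (a + p(w)), followed by removal of the projection onto the first singular vector(s) of the sentence-embedding matrix. It is not this repository's code; the function names (`sif_average`, `fit_components`, `remove_components`), the value `a=1e-3`, and the handling of unknown words (skipped) are assumptions made for illustration.

```python
import numpy as np


def sif_average(docs, word_vectors, word_prob, a=1e-3):
    """Weighted average of word vectors per tokenized document.

    Each known word w gets weight a / (a + p(w)). Words missing from
    word_vectors are skipped (an assumption of this sketch).
    """
    dim = len(next(iter(word_vectors.values())))
    rows = []
    for tokens in docs:
        known = [t for t in tokens if t in word_vectors]
        if not known:
            rows.append(np.zeros(dim))  # no known words: zero vector
            continue
        weights = np.array([a / (a + word_prob.get(t, 0.0)) for t in known])
        vectors = np.array([word_vectors[t] for t in known], dtype=float)
        rows.append(weights @ vectors / len(known))
    return np.vstack(rows)


def fit_components(X, n_components=1):
    """Top right singular vectors of the (uncentered) embedding matrix."""
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[:n_components]


def remove_components(X, components):
    """Subtract the projection of each row onto the given components."""
    return X - X @ components.T @ components


# Toy data, purely illustrative.
toy_vectors = {"a": [1, 0, 0], "b": [0, 1, 0], "c": [0, 0, 1]}
toy_prob = {"a": 0.1, "b": 0.01, "c": 0.001}
toy_docs = [["a", "b"], ["b", "c"], ["a", "c", "zzz"]]

averaged = sif_average(toy_docs, toy_vectors, toy_prob)
pcs = fit_components(averaged, n_components=1)
emb = remove_components(averaged, pcs)
```

In this sketch the components are estimated on one set of documents and can then be removed from another, which mirrors the fit / transform split of a scikit-learn vectorizer.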

Install

$ pip install -r requirements.txt

Get started

sts_benchmark.ipynb reproduces the results on the STSbenchmark dataset. The benchmark uses spaCy to obtain word vectors for the words in the corpus, but the vectorizer accepts word vectors from any source (e.g. GloVe). Unigram probabilities p(w) are obtained either from enwiki_vocab_min200.txt, the file used in the original paper, or from the wordfreq package.
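As a sketch of how unigram probabilities could be derived from a count file, the snippet below normalizes per-word counts into p(w). It assumes each line holds a word and its count separated by whitespace, which is an assumption about the layout of enwiki_vocab_min200.txt rather than something stated here; `load_unigram_probs` is a hypothetical helper, not part of this repository.

```python
import io


def load_unigram_probs(lines):
    """Turn 'word count' lines into a dict of unigram probabilities.

    Lines that do not have exactly two whitespace-separated fields, or
    whose second field is not a number, are ignored.
    """
    counts = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue
        word, raw_count = parts
        try:
            count = float(raw_count)
        except ValueError:
            continue
        counts[word] = counts.get(word, 0.0) + count
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


# In-memory stand-in for a real file; with a file on disk one would pass
# an open file handle instead.
sample = io.StringIO("the 60\nof 30\ncat 10\nmalformed line here\n")
probs = load_unigram_probs(sample)
```

The resulting dict can back the `word_freq` callable, e.g. `lambda word: probs[word]`. With the wordfreq package, `wordfreq.word_frequency(word, "en")` would play the same role.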

Example

# word_vectors and word_frequencies are assumed to be dict-like lookups
# (word -> vector, word -> frequency) that were built beforehand.
docs_train = ["This is a training document or sentence", "Another train document or sentence"]
docs_test = ["I'm a test document"]

vectorizer = EmbeddingVectorizer(
    tokenizer=lambda doc: doc.split(),  # a function or lambda to tokenize the input documents / sentences
    word_vectorizer=lambda word: word_vectors[word],  # a function or lambda to obtain a vector for a given word
    word_freq=lambda word: word_frequencies[word],  # a function or lambda to obtain the frequency of a given word
    weighted=True,
    remove_components=1,
    lowercase=True)

# Fit on the training documents, then embed unseen documents.
vectorizer.fit(docs_train)

vectorizer.transform(docs_test)
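The lambdas in the example index plain mappings, so a word that is missing from `word_vectors` or `word_frequencies` would raise a `KeyError` inside the lambda. How the vectorizer itself treats such words is not described here, so the following is only one possible way to make the lookups tolerant. `make_lookups`, the zero-vector fallback, and the `min_freq` default are all hypothetical choices for illustration.

```python
import numpy as np


def make_lookups(word_vectors, word_frequencies, dim, min_freq=1e-9):
    """Build lookup callables that tolerate out-of-vocabulary words.

    Unknown words map to a zero vector of length `dim` and to the small
    fallback frequency `min_freq` (both are assumptions of this sketch).
    """
    zero = np.zeros(dim)

    def word_vectorizer(word):
        return np.asarray(word_vectors.get(word, zero), dtype=float)

    def word_freq(word):
        return word_frequencies.get(word, min_freq)

    return word_vectorizer, word_freq


# Toy lookups, purely illustrative.
toy_word_vectors = {"test": [0.5, -0.5], "document": [1.0, 2.0]}
toy_word_frequencies = {"test": 1e-4, "document": 2e-4}

safe_vectorizer, safe_freq = make_lookups(
    toy_word_vectors, toy_word_frequencies, dim=2)
```

The two callables could then be passed as `word_vectorizer=safe_vectorizer` and `word_freq=safe_freq` in place of the lambdas above.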

Sources

STSbenchmark

SIF
