This repository contains a scikit-learn-compatible vectorizer that produces document embeddings via the SIF (smooth inverse frequency) algorithm described in Arora, Liang, and Ma, "A Simple but Tough-to-Beat Baseline for Sentence Embeddings" (ICLR 2017).
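At its core, SIF computes a weighted average of word vectors, scaling each word by a / (a + p(w)), and then removes the projection of the resulting document vectors onto their first principal component. The following is a minimal NumPy sketch of that computation, not this repository's implementation; `word_vec` and `word_prob` are assumed lookups analogous to the vectorizer's `word_vectorizer` / `word_freq` parameters, and the weighting parameter `a` is set to 1e-3, a value in the range the paper recommends.

```python
import numpy as np

def sif_embeddings(docs_tokens, word_vec, word_prob, a=1e-3):
    """Toy SIF: docs_tokens is a list of token lists; word_vec / word_prob map a word
    to its vector / unigram probability."""
    # Weighted average of word vectors, each word scaled by a / (a + p(w)).
    emb = np.array([
        np.mean([a / (a + word_prob(w)) * word_vec(w) for w in tokens], axis=0)
        for tokens in docs_tokens
    ])
    # Common component removal: subtract the projection onto the first principal component.
    u, _, _ = np.linalg.svd(emb.T @ emb)
    pc = u[:, :1]
    return emb - emb @ pc @ pc.T
```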
```
$ pip install -r requirements.txt
```
sts_benchmark.ipynb reproduces the results on the STSbenchmark dataset. The benchmark uses spaCy to obtain word vectors for the words in the corpus, but the vectorizer accepts word vectors from any source (e.g. GloVe). Unigram probabilities p(w) come either from enwiki_vocab_min200.txt, the vocabulary file used in the original paper, or from the wordfreq package.
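For example, the `word_vectorizer` and `word_freq` callables used in the example below could be built from spaCy and wordfreq roughly as follows (a sketch, assuming a spaCy model that ships word vectors, such as `en_core_web_md`, is installed):

```python
import spacy
from wordfreq import word_frequency

nlp = spacy.load("en_core_web_md")  # any spaCy model with word vectors

word_vectorizer = lambda word: nlp.vocab[word].vector  # vector lookup for a single word
word_freq = lambda word: word_frequency(word, "en")    # estimated unigram frequency
```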
docs_train = ["This is a training document or sentence", "Another train document or sentence"]
docs_test = ["I'm a test document"]
vectorizer = EmbeddingVectorizer(
tokenizer=lambda doc: doc.split(), # a function or lambda to tokenize the input documents / sentences
word_vectorizer=lambda word: word_vectors[word], # a function or lambda to obtain a vector for a given word
word_freq=lambda word: word_frequencies[word], # a function or lambda to obtain the frequency of a given word
weighted=True,
remove_components=1,
lowercase=True)
vectorizer.fit(docs_train)
vectorizer.transform(docs_test)
- STSbenchmark description: http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark
- STSbenchmark dataset: http://ixa2.si.ehu.es/stswiki/images/4/48/Stsbenchmark.tar.gz
- STSbenchmark paper: http://www.aclweb.org/anthology/S/S17/S17-2001.pdf
- SIF paper: https://openreview.net/pdf?id=SyK00v5xx
- SIF reference implementation (GitHub): https://github.com/PrincetonML/SIF
- enwiki_vocab_min200.txt: https://github.com/PrincetonML/SIF/tree/master/auxiliary_data