lexsub: context-sensitive word substitutions using Word2Vec

Disambiguating between the possible senses of a word in the context of a sentence is a fundamental problem in NLP. However, word sense disambiguation assumes a universal, fixed set of "meanings" to choose between. A more natural and more practical task is finding a good substitute for a word in context. For example, in the sentence "She went to the bar last night", we know that bar means pub, even though the word has other meanings, such as a chocolate bar or a ban or restriction on something.

This repository uses a Word2Vec embedding trained on the Google News corpus, made available here and through the gensim library, to rank candidate word substitutions by how well they fit the context of the sentence.
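
To make the idea of ranking by context concrete, here is a minimal sketch of one plausible scoring scheme: each candidate is scored by its average cosine similarity to the target word and to the context words. This is an illustration only, not necessarily the scoring that lexsub implements. The function names `cosine` and `rank_candidates` and the 3-dimensional toy vectors are invented for this example.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def rank_candidates(target, context, candidates, vectors):
    """Rank candidates by mean cosine similarity to the target and context words.

    Hypothetical helper: `vectors` is any mapping from word to vector that
    supports `in` and `[]` lookups.
    """
    def score(cand):
        sims = [cosine(vectors[cand], vectors[target])]
        # Skip context words that have no vector
        sims += [cosine(vectors[cand], vectors[w]) for w in context if w in vectors]
        return sum(sims) / len(sims)
    return sorted(candidates, key=score, reverse=True)

# Toy 3-d vectors, made up for illustration (real Google News vectors are 300-d)
toy_vectors = {
    "bar": np.array([0.6, 0.6, 0.2]),
    "drink": np.array([0.9, 0.1, 0.0]),
    "pub": np.array([0.8, 0.3, 0.1]),
    "chocolate": np.array([0.1, 0.9, 0.1]),
    "ban": np.array([0.0, 0.1, 0.9]),
}

ranked = rank_candidates("bar", ["drink"], ["ban", "chocolate", "pub"], toy_vectors)
print(ranked)
```

With these toy vectors, the context word "drink" pulls "pub" ahead of the other candidates. A gensim KeyedVectors object should be usable in place of the dictionary, since it also supports `in` and `[]` lookups by word.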

Setup

  1. Download the Google News word vectors from here and make sure you have the gensim package installed.
  2. Make sure you have installed nltk (the Natural Language Toolkit) and downloaded the Lin thesaurus and WordNet corpora by running the following in a Python console: import nltk; nltk.download('lin_thesaurus'); nltk.download('wordnet')

Example Usage

from lexsub import LexSub
from gensim.models import KeyedVectors

# Load the pretrained Google News vectors (binary word2vec format)
word2vec_path = "/path/to/GoogleNews-vectors-negative300.bin"
vectors = KeyedVectors.load_word2vec_format(word2vec_path, binary=True)

# Generate candidate substitutes from the Lin thesaurus
ls = LexSub(vectors, candidate_generator='lin')

sentence = "She had a drink at the bar"
target = "bar.n"  # target word with a part-of-speech suffix (here, a noun)
result = ls.lex_sub(target, sentence)
print(result)
# ['bars', 'pub', 'tavern', 'nightclub', 'restaurant']
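
The target in the example above is written as "bar.n". Assuming this follows a word.pos convention (the word, a dot, then a part-of-speech tag), a target string could be split as in the sketch below. The helper `split_target` is hypothetical and is not part of the lexsub API.

```python
def split_target(target):
    """Split a target such as 'bar.n' into (word, pos).

    Assumes a 'word.pos' format; this reading is inferred from the example,
    not confirmed by the library.
    """
    # Split on the last dot so that words containing dots stay intact
    word, sep, pos = target.rpartition(".")
    if not sep or not word or not pos:
        raise ValueError(f"expected 'word.pos', got {target!r}")
    return word, pos

word, pos = split_target("bar.n")
print(word, pos)
```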