-
Notifications
You must be signed in to change notification settings - Fork 104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix stopwords loading bug #383
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! It's much appreciated!
I've marked down a few changes I'd like to make in comments.
Let me know if you do not have the time or energy to do these yourself. I'd be happy to take care of them.
One other thing I'd like to be able to test is that the stop words are actually being skipped.
I.e, I wrote something like this:
class GetEntitiesWithStopWords(unittest.TestCase):
# NB! The order in which the different CDBs are created
# is important here since the way that the stop words are
# set is class-based, it creates the side effect of having
# the same stop words the next time around
# regardless of whether or not they should've been set
@classmethod
def setUpClass(cls) -> None:
cls.cdb = CDB.load(os.path.join(os.path.dirname(os.path.realpath(__file__)), "..", "examples", "cdb.dat"))
cls.vocab = Vocab.load(os.path.join(os.path.dirname(os.path.realpath(__file__)), "..", "examples", "vocab.dat"))
cls.vocab.make_unigram_table()
cls.cdb.config.general.spacy_model = "en_core_web_md"
cls.cdb.config.ner.min_name_len = 2
cls.cdb.config.ner.upper_case_limit_len = 3
cls.cdb.config.general.spell_check = True
cls.cdb.config.linking.train_count_threshold = 10
cls.cdb.config.linking.similarity_threshold = 0.3
cls.cdb.config.linking.train = True
cls.cdb.config.linking.disamb_length_limit = 5
cls.cdb.config.general.full_unlink = True
# the regular CAT without stopwords
cls.no_stopwords = CAT(cdb=cls.cdb, config=cls.cdb.config, vocab=cls.vocab, meta_cats=[])
# this (the following two lines)
# needs to be done before initialising the CAT
# since that initialises the pipe
cls.cdb.config.preprocessing.stopwords = {"stop", "words"}
cls.cdb.config.preprocessing.skip_stopwords = True
# the CAT that skips the stopwords
cls.w_stopwords = CAT(cdb=cls.cdb, config=cls.cdb.config, vocab=cls.vocab, meta_cats=[])
def test_stopwords_are_skipped(self, text: str = "second words csv"):
# without stopwords no entities are captured
# with stopwords, the `second words csv` entity is captured
doc_no_stopwords = self.no_stopwords(text)
self.cdb.config.preprocessing.skip_stopwords = True
doc_w_stopwords = self.w_stopwords(text)
self.assertGreater(len(doc_no_stopwords), len(doc_w_stopwords))
I suppose we should be able to do that without the CAT
objects, but we'd need to do the rest of the pipe creation anyway.
Let me know if you wish to make the changes yourself or if you'd rather I do that.
@jenniferajiang If you are happy with the changes, please merge it in. Then the changes will show up here and after GHA we should be able to merge this in as well. |
Implement my recommended changes
@mart-r Thank you, I've merged the changes! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good to me.
Thanks for the help!
#382