Fix stopwords loading bug #383

jenniferjiangkells · 2023-12-18T18:09:59Z

mart-r

Thanks for the PR! It's much appreciated!

I've marked down a few changes I'd like to make in comments.
Let me know if you do not have the time or energy to do these yourself. I'd be happy to take care of them.

One other thing I'd like to be able to test is that the stop words are actually being skipped.
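I.e., I wrote something like this:

import os
import unittest

from medcat.cat import CAT
from medcat.cdb import CDB
from medcat.vocab import Vocab


class GetEntitiesWithStopWords(unittest.TestCase):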
    # NB! The order in which the different CDBs are created
    # is important here since the way that the stop words are
    # set is class-based, it creates the side effect of having
    # the same stop words the next time around
    # regardless of whether or not they should've been set

    @classmethod
    def setUpClass(cls) -> None:
        cls.cdb = CDB.load(os.path.join(os.path.dirname(os.path.realpath(__file__)), "..", "examples", "cdb.dat"))
        cls.vocab = Vocab.load(os.path.join(os.path.dirname(os.path.realpath(__file__)), "..", "examples", "vocab.dat"))
        cls.vocab.make_unigram_table()
        cls.cdb.config.general.spacy_model = "en_core_web_md"
        cls.cdb.config.ner.min_name_len = 2
        cls.cdb.config.ner.upper_case_limit_len = 3
        cls.cdb.config.general.spell_check = True
        cls.cdb.config.linking.train_count_threshold = 10
        cls.cdb.config.linking.similarity_threshold = 0.3
        cls.cdb.config.linking.train = True
        cls.cdb.config.linking.disamb_length_limit = 5
        cls.cdb.config.general.full_unlink = True
        # the regular CAT without stopwords
        cls.no_stopwords = CAT(cdb=cls.cdb, config=cls.cdb.config, vocab=cls.vocab, meta_cats=[])
        # this (the following two lines)
        # needs to be done before initialising the CAT
        # since that initialises the pipe
        cls.cdb.config.preprocessing.stopwords = {"stop", "words"}
        cls.cdb.config.preprocessing.skip_stopwords = True
        # the CAT that skips the stopwords
        cls.w_stopwords = CAT(cdb=cls.cdb, config=cls.cdb.config, vocab=cls.vocab, meta_cats=[])

    def test_stopwords_are_skipped(self, text: str = "second words csv"):
        # without stopwords no entities are captured
        # with stopwords, the `second words csv` entity is captured
        doc_no_stopwords = self.no_stopwords(text)
        self.cdb.config.preprocessing.skip_stopwords = True
        doc_w_stopwords = self.w_stopwords(text)
        self.assertGreater(len(doc_no_stopwords), len(doc_w_stopwords))

I suppose we should be able to do that without the CAT objects, but we'd need to do the rest of the pipe creation anyway.
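
To illustrate the ordering caveat from the comments at the top of that test class, here is a minimal sketch of how stop words stored on a class leak into anything built afterwards. `FakeDefaults` and `FakePipe` are hypothetical stand-ins, not MedCAT or spaCy classes, and the snapshot-at-creation behaviour is an assumption made only for the illustration.

```python
class FakeDefaults:
    # Class-level attribute: shared by everything that reads it via the class.
    stop_words: set = set()


class FakePipe:
    """Hypothetical pipeline that snapshots the stop words when it is created."""

    def __init__(self) -> None:
        # Copy whatever the class-level stop words are at construction time.
        self.stop_words = set(FakeDefaults.stop_words)

    def tokens(self, text: str) -> list:
        # Drop any whitespace-separated token that is a stop word.
        return [t for t in text.split() if t not in self.stop_words]


no_stopwords = FakePipe()                    # built before stop words are set
FakeDefaults.stop_words = {"stop", "words"}  # class-level change
w_stopwords = FakePipe()                     # built after: sees the stop words
later_pipe = FakePipe()                      # also sees them, wanted or not
```

In this sketch the pipe built first is unaffected, while every pipe built after the class-level assignment inherits the stop words, which is why the creation order in `setUpClass` matters.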

Let me know if you wish to make the changes yourself or if you'd rather I do that.

mart-r · 2024-01-02T09:26:32Z

@jenniferajiang
Hey!
I've made the changes I recommended and PR'ed it into your branch:
uclh-criu#1

If you are happy with the changes, please merge it in. Then the changes will show up here and after GHA we should be able to merge this in as well.

Implement my recommended changes

jenniferjiangkells · 2024-01-02T11:06:01Z

@mart-r Thank you, I've merged the changes!

mart-r

Looking good to me.
Thanks for the help!

jenniferajiang added 2 commits December 18, 2023 15:00

Load stopwords in Defaults before spacy model

90bf65e

Added tests

72ac8d7

mart-r requested changes Dec 19, 2023

medcat/pipe.py (outdated)

tests/test_pipe.py (outdated)

mart-r added 6 commits December 22, 2023 11:47

Remove tests of internals where possible

37a9d92

Add test for skipping of stopwords

392f80b

Avoid supporting only English for stopwords

276bcf1

Merge branch 'master' into stopwords-loading-fix

f0572ee

Remove debug output

69c2393

Make sure stopwords language getter works for file-path spacy models

80b4387

Merge pull request #1 from CogStack/stopwords-loading-fix

45fa0e2

Implement my recommended changes
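
The commits above mention no longer assuming English and handling spaCy models given as file paths. A rough sketch of what such a language getter could look like follows; `guess_lang_from_model` is a hypothetical helper written for illustration, not the code merged in this PR, and it assumes the model name starts with a lowercase language code followed by an underscore (as in `en_core_web_md`).

```python
import os
import re


def guess_lang_from_model(model: str, default: str = "en") -> str:
    """Hypothetical helper: guess a language code from a spaCy model name or path."""
    # Reduce a file path to its last component so that
    # "/models/en_core_web_md/" is treated like "en_core_web_md".
    name = os.path.basename(os.path.normpath(model))
    # Assumed naming convention: "<lang>_<rest>", e.g. "en_core_web_md".
    match = re.match(r"^([a-z]{2,3})_", name)
    return match.group(1) if match else default
```

For example, `guess_lang_from_model("en_core_web_md")` and `guess_lang_from_model("/models/en_core_web_md/")` both give `"en"`, while a name that does not follow the convention falls back to the default.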

mart-r approved these changes Jan 2, 2024

tomolopolis merged commit f0ef8cd into CogStack:master Jan 3, 2024
5 checks passed

mart-r mentioned this pull request Jan 8, 2024

Stopwords do not load properly #382

Closed
