Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix stopwords loading bug #383

Merged
merged 9 commits into from
Jan 3, 2024

Conversation

jenniferjiangkells
Copy link
Contributor

Copy link
Collaborator

@mart-r mart-r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! It's much appreciated!

I've marked down a few changes I'd like to make in comments.
Let me know if you do not have the time or energy to do these yourself. I'd be happy to take care of them.

One other thing I'd like to be able to test is that the stop words are actually being skipped.
I.e, I wrote something like this:

class GetEntitiesWithStopWords(unittest.TestCase):
    # NB! The order in which the different CDBs are created
    # is important here since the way that the stop words are
    # set is class-based, it creates the side effect of having
    # the same stop words the next time around
    # regardless of whether or not they should've been set

    @classmethod
    def setUpClass(cls) -> None:
        cls.cdb = CDB.load(os.path.join(os.path.dirname(os.path.realpath(__file__)), "..", "examples", "cdb.dat"))
        cls.vocab = Vocab.load(os.path.join(os.path.dirname(os.path.realpath(__file__)), "..", "examples", "vocab.dat"))
        cls.vocab.make_unigram_table()
        cls.cdb.config.general.spacy_model = "en_core_web_md"
        cls.cdb.config.ner.min_name_len = 2
        cls.cdb.config.ner.upper_case_limit_len = 3
        cls.cdb.config.general.spell_check = True
        cls.cdb.config.linking.train_count_threshold = 10
        cls.cdb.config.linking.similarity_threshold = 0.3
        cls.cdb.config.linking.train = True
        cls.cdb.config.linking.disamb_length_limit = 5
        cls.cdb.config.general.full_unlink = True
        # the regular CAT without stopwords
        cls.no_stopwords = CAT(cdb=cls.cdb, config=cls.cdb.config, vocab=cls.vocab, meta_cats=[])
        # this (the following two lines)
        # needs to be done before initialising the CAT
        # since that initialises the pipe
        cls.cdb.config.preprocessing.stopwords = {"stop", "words"}
        cls.cdb.config.preprocessing.skip_stopwords = True
        # the CAT that skips the stopwords
        cls.w_stopwords = CAT(cdb=cls.cdb, config=cls.cdb.config, vocab=cls.vocab, meta_cats=[])

    def test_stopwords_are_skipped(self, text: str = "second words csv"):
        # without stopwords no entities are captured
        # with stopwords, the `second words csv` entity is captured
        doc_no_stopwords = self.no_stopwords(text)
        self.cdb.config.preprocessing.skip_stopwords = True
        doc_w_stopwords = self.w_stopwords(text)
        self.assertGreater(len(doc_no_stopwords), len(doc_w_stopwords))

I suppose we should be able to do that without the CAT objects, but we'd need to do the rest of the pipe creation anyway.

Let me know if you wish to make the changes yourself or if you'd rather I do that.

medcat/pipe.py Outdated Show resolved Hide resolved
tests/test_pipe.py Outdated Show resolved Hide resolved
@mart-r
Copy link
Collaborator

mart-r commented Jan 2, 2024

@jenniferajiang
Hey!
I've made the changes I recommended and PR'ed it into your branch:
uclh-criu#1

If you are happy with the changes, please merge it in. Then the changes will show up here and after GHA we should be able to merge this in as well.

@jenniferjiangkells
Copy link
Contributor Author

@mart-r Thank you, I've merged the changes!

Copy link
Collaborator

@mart-r mart-r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good to me.
Thanks for the help!

@tomolopolis tomolopolis merged commit f0ef8cd into CogStack:master Jan 3, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants