-
Notifications
You must be signed in to change notification settings - Fork 104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add: metacat can predict on spans in arbitrary spangroups #391
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for putting in the effort! It's much appreciated!
Some changes for GHA to run.
Some changes for clarity.
Thank you for your comments! I will address them. |
@mart-r I have iterated on your PR comments. I also added a more informative warning in case the span group is not set no the doc and a test for it. |
I've done my best to address all of your changes. Let me know if you have further comments! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall.
But please address the GitHub Actions issues (linting so far):
medcat/meta_cat.py:359:1: W293 blank line contains whitespace
@mart-r Got it, I fixed the linting and flake8 passes locally. |
Hello - I opened this PR request to start a discussion about adding a new feature that allows a trained MetaCAT model to run on arbitrary spangroup. This will will allow configuring a MetaCAT model such that it can be pointed at an arbitrary spacy span group and set meta_anns on it.
Currently, it is allowed to run on the two set of
list[spacy.token.Span]
:doc.ents
ordoc._.ents
.However, there are use cases where we may store other spans as part of a pipeline under
doc.spans["my_spangroup_name"]
that are distinct from doc.ents in which we want to call predict on the meta cat model.I don't think it's yet necessary to be able to train on arbitrary spangroups since that would get into changing the input format or extract format from medcat trainer.
One workaround is to either zero out the ents object and set the spangroup to those ents and call predict however this starts to feel hacky.
Another workaround is to create a duplicate
spacy.token.Doc
object and usingdoc.ents
for the separate spangroup. However this is inefficient as it requires running parts of the spacy pipeline again and means we can't use this spangroup in other pipeline components.