This repository was archived by the owner on Jul 28, 2025. It is now read-only.

Commit ceb74b1

Merge pull request #506 from CogStack/master
v1.14.0 release PR
2 parents 34e5cde + 37a8a63

29 files changed: +1011 -933 lines

.github/workflows/main.yml

Lines changed: 3 additions & 1 deletion
@@ -12,7 +12,7 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        python-version: [ '3.8', '3.9', '3.10', '3.11' ]
+        python-version: [ '3.9', '3.10', '3.11' ]
       max-parallel: 4
 
     steps:
@@ -42,6 +42,8 @@ jobs:
           timeout 25m python -m unittest ${second_half_nl[@]}
       - name: Regression
         run: source tests/resources/regression/run_regression.sh
+      - name: Model backwards compatibility
+        run: source tests/resources/model_compatibility/check_backwards_compatibility.sh
       - name: Get the latest release version
         id: get_latest_release
         uses: actions/github-script@v6
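
The new CI step delegates to a shell script in the repo. As a rough illustration of what such a backwards-compatibility check typically boils down to (this sketch is not the repo's actual script, and the model pack path is hypothetical), it loads a pack built by an older release with the current code and runs a smoke inference:

# Hypothetical sketch of a backwards-compatibility smoke test; the model
# pack path is an assumption, not the repo's actual test fixture.
from medcat.cat import CAT


def check_backwards_compatibility(model_pack_path: str) -> None:
    # Loading an old pack with the current code is the core of the check:
    # it fails loudly if the serialization format drifted between releases.
    cat = CAT.load_model_pack(model_pack_path)
    # A smoke inference confirms the loaded pipeline still annotates text.
    result = cat.get_entities("The patient presented with kidney failure.")
    assert "entities" in result


if __name__ == "__main__":
    check_backwards_compatibility("models/medmen_wstatus_2021_oct.zip")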

docs/main.md

Lines changed: 4 additions & 4 deletions
@@ -122,12 +122,12 @@ If you have access to UMLS or SNOMED-CT, you can download the pre-built CDB and
 A basic trained model is made public. It contains ~ 35K concepts available in `MedMentions`. This was compiled from MedMentions and does not have any data from [NLM](https://www.nlm.nih.gov/research/umls/) as that data is not publicaly available.
 
 Model packs:
-- MedMentions with Status (Is Concept Affirmed or Negated/Hypothetical) [Download](https://medcat.rosalind.kcl.ac.uk/media/medmen_wstatus_2021_oct.zip)
+- MedMentions with Status (Is Concept Affirmed or Negated/Hypothetical) [Download](https://cogstack-medcat-example-models.s3.eu-west-2.amazonaws.com/medcat-example-models/medmen_wstatus_2021_oct.zip)
 
 Separate models:
-- Vocabulary [Download](https://medcat.rosalind.kcl.ac.uk/media/vocab.dat) - Built from MedMentions
-- CDB [Download](https://medcat.rosalind.kcl.ac.uk/media/cdb-medmen-v1_2.dat) - Built from MedMentions
-- MetaCAT Status [Download](https://medcat.rosalind.kcl.ac.uk/media/mc_status.zip) - Built from a sample from MIMIC-III, detects is an annotation Affirmed (Positve) or Other (Negated or Hypothetical)
+- Vocabulary [Download](https://cogstack-medcat-example-models.s3.eu-west-2.amazonaws.com/medcat-example-models/vocab.dat) - Built from MedMentions
+- CDB [Download](https://cogstack-medcat-example-models.s3.eu-west-2.amazonaws.com/medcat-example-models/cdb-medmen-v1.dat) - Built from MedMentions
+- MetaCAT Status [Download](https://cogstack-medcat-example-models.s3.eu-west-2.amazonaws.com/medcat-example-models/mc_status.zip) - Built from a sample from MIMIC-III, detects is an annotation Affirmed (Positve) or Other (Negated or Hypothetical)
 
 ## Acknowledgements
 Entity extraction was trained on [MedMentions](https://github.com/chanzuckerberg/MedMentions) In total it has ~ 35K entites from UMLS
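
For reference, the separate downloads listed above are combined into a working pipeline roughly like this (a minimal sketch against the MedCAT 1.x API; the local file names simply mirror the downloads above):

# Minimal sketch: assemble a CAT pipeline from the separately downloaded
# vocabulary and CDB files linked above (local file names assumed).
from medcat.cat import CAT
from medcat.cdb import CDB
from medcat.vocab import Vocab

cdb = CDB.load("cdb-medmen-v1.dat")  # concept database built from MedMentions
vocab = Vocab.load("vocab.dat")      # vocabulary with word embeddings
cat = CAT(cdb=cdb, config=cdb.config, vocab=vocab)

print(cat.get_entities("Patient has a history of diabetes mellitus."))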

docs/requirements.txt

Lines changed: 75 additions & 73 deletions
@@ -2,103 +2,105 @@ sphinx==6.2.1
 sphinx-rtd-theme~=1.0
 myst-parser~=0.17
 sphinx-autoapi~=3.0.0
-MarkupSafe==2.1.3
-accelerate==0.23.0
-aiofiles==23.2.1
-aiohttp==3.8.5
+MarkupSafe==2.1.5
+accelerate==0.34.2
+aiofiles==24.1.0
+aiohttp==3.10.5
 aiosignal==1.3.1
-asttokens==2.4.0
+asttokens==2.4.1
 async-timeout==4.0.3
-attrs==23.1.0
+attrs==24.2.0
 backcall==0.2.0
 blis==0.7.11
 catalogue==2.0.10
-certifi==2023.7.22
-charset-normalizer==3.3.0
+certifi==2024.8.30
+charset-normalizer==3.3.2
 click==8.1.7
-comm==0.1.4
-confection==0.1.3
+comm==0.2.2
+confection==0.1.5
 cymem==2.0.8
-datasets==2.14.5
+darglint==1.8.1
+datasets==2.21.0
 decorator==5.1.1
-dill==0.3.7
-exceptiongroup==1.1.3
-executing==2.0.0
-filelock==3.12.4
-flake8==4.0.1
-frozenlist==1.4.0
-fsspec==2023.6.0
-gensim==4.3.2
-huggingface-hub==0.17.3
-idna==3.4
-ipython==8.16.1
-ipywidgets==8.1.1
+dill==0.3.8
+exceptiongroup==1.2.2
+executing==2.1.0
+filelock==3.16.0
+flake8==7.0.0
+frozenlist==1.4.1
+fsspec==2024.6.1
+gensim==4.3.3
+huggingface-hub==0.24.7
+idna==3.10
+ipython==8.27.0
+ipywidgets==8.1.5
 jedi==0.19.1
-jinja2==3.1.2
-joblib==1.3.2
-jsonpickle==3.0.2
-jupyterlab-widgets==3.0.9
-langcodes==3.3.0
-matplotlib-inline==0.1.6
-mccabe==0.6.1
+jinja2==3.1.4
+joblib==1.4.2
+jsonpickle==3.3.0
+jupyterlab-widgets==3.0.13
+langcodes==3.4.0
+matplotlib-inline==0.1.7
+mccabe==0.7.0
 mpmath==1.3.0
-multidict==6.0.4
-multiprocess==0.70.15
+multidict==6.1.0
+multiprocess==0.70.16
 murmurhash==1.0.10
-mypy==1.0.0
-mypy-extensions==0.4.3
-networkx==3.1
+mypy==1.11.2
+mypy-extensions==1.0.0
+networkx==3.3
 numpy==1.25.2
-packaging==23.2
-pandas==2.1.1
-parso==0.8.3
-pathy==0.10.2
-pexpect==4.8.0
+packaging==24.1
+pandas==2.2.2
+parso==0.8.4
+pathy==0.11.0
+peft==0.12.0
+pexpect==4.9.0
 pickleshare==0.7.5
 preshed==3.0.9
-prompt-toolkit==3.0.39
-psutil==5.9.5
+prompt-toolkit==3.0.47
+psutil==6.0.0
 ptyprocess==0.7.0
-pure-eval==0.2.2
-pyarrow==13.0.0
-pycodestyle==2.8.0
-pydantic==1.10.13
-pyflakes==2.4.0
-pygments==2.16.1
-python-dateutil==2.8.2
-pytz==2023.3.post1
-pyyaml==6.0.1
-regex==2023.10.3
-requests==2.31.0
-safetensors==0.4.0
-scikit-learn==1.3.1
+pure-eval==0.2.3
+pyarrow==17.0.0
+pycodestyle==2.11.1
+pydantic==1.10.18
+pyflakes==3.2.0
+pygments==2.18.0
+python-dateutil==2.9.0
+pytz==2024.2
+pyyaml==6.0.2
+regex==2024.9.11
+requests==2.32.3
+safetensors==0.4.5
+scikit-learn==1.5.2
 scipy==1.9.3
 six==1.16.0
 smart-open==6.4.0
-spacy==3.4.4
+spacy==3.6.1
 spacy-legacy==3.0.12
 spacy-loggers==1.0.5
 srsly==2.4.8
 stack-data==0.6.3
-sympy==1.12
+sympy==1.13.2
 thinc==8.1.12
-threadpoolctl==3.2.0
-tokenizers==0.14.1
+threadpoolctl==3.5.0
+tokenizers==0.19.1
 tomli==2.0.1
-torch==2.1.0
-tqdm==4.66.1
-traitlets==5.11.2
-transformers==4.34.0
-triton==2.1.0
-typer==0.7.0
+torch==2.4.1
+tqdm==4.66.5
+traitlets==5.14.3
+transformers==4.44.2
+triton==3.0.0
+typer==0.9.4
 types-PyYAML==6.0.3
 types-aiofiles==0.8.3
 types-setuptools==57.4.10
-typing-extensions==4.8.0
-tzdata==2023.3
-urllib3==2.0.6
-wasabi==0.10.1
-wcwidth==0.2.8
-widgetsnbextension==4.0.9
-xxhash==3.4.1
-yarl==1.9.2
+typing-extensions==4.12.2
+tzdata==2024.1
+urllib3==2.2.3
+wasabi==1.1.3
+wcwidth==0.2.13
+widgetsnbextension==4.0.13
+xxhash==3.5.0
+yarl==1.11.1

examples/cdb_new.dat

-3.29 KB
Binary file not shown.

install_requires.txt

Lines changed: 2 additions & 2 deletions
@@ -1,7 +1,7 @@
 'numpy>=1.22.0,<1.26.0' # 1.22.0 is first to support python 3.11; post 1.26.0 there's issues with scipy
 'pandas>=1.4.2' # first to support 3.11
 'gensim>=4.3.0,<5.0.0' # 5.3.0 is first to support 3.11; avoid major version bump
-'spacy>=3.6.0,<4.0.0' # Some later model packs (e.g HPO) are made with 3.6.0 spacy model; avoid major version bump
+'spacy>=3.6.0,<3.8.0' # 3.8 only supports numpy2 which we can't use due to other dependencies
 'scipy~=1.9.2' # 1.9.2 is first to support 3.11
 'transformers>=4.34.0,<5.0.0' # avoid major version bump
 'accelerate>=0.23.0' # required by Trainer class in de-id
@@ -21,4 +21,4 @@
 'click>=8.0.4' # allow later versions, tested with 8.1.3
 'pydantic>=1.10.0,<2.0' # for spacy compatibility; avoid 2.0 due to breaking changes
 "humanfriendly~=10.0" # for human readable file / RAM sizes
-"peft>=0.8.2"
+"peft>=0.8.2"

medcat/cat.py

Lines changed: 22 additions & 1 deletion
@@ -1127,11 +1127,29 @@ def get_entities_multi_texts(self,
         self.pipe.set_error_handler(self._pipe_error_handler)
         try:
             texts_ = self._get_trimmed_texts(texts)
+            if self.config.general.usage_monitor.enabled:
+                input_lengths: List[Tuple[int, int]] = []
+                for orig_text, trimmed_text in zip(texts, texts_):
+                    if orig_text is None or trimmed_text is None:
+                        l1, l2 = 0, 0
+                    else:
+                        l1 = len(orig_text)
+                        l2 = len(trimmed_text)
+                    input_lengths.append((l1, l2))
             docs = self.pipe.batch_multi_process(texts_, n_process, batch_size)
 
-            for doc in tqdm(docs, total=len(texts_)):
+            for doc_nr, doc in tqdm(enumerate(docs), total=len(texts_)):
                 doc = None if doc.text.strip() == '' else doc
                 out.append(self._doc_to_out(doc, only_cui, addl_info, out_with_text=True))
+                if self.config.general.usage_monitor.enabled:
+                    l1, l2 = input_lengths[doc_nr]
+                    if doc is None:
+                        nents = 0
+                    elif self.config.general.show_nested_entities:
+                        nents = len(doc._.ents)  # type: ignore
+                    else:
+                        nents = len(doc.ents)  # type: ignore
+                    self.usage_monitor.log_inference(l1, l2, nents)
 
             # Currently spaCy cannot mark which pieces of texts failed within the pipe so be this workaround,
             # which also assumes texts are different from each others.
@@ -1637,6 +1655,9 @@ def _mp_cons(self, in_q: Queue, out_list: List, min_free_memory: float,
                 logger.warning("PID: %s failed one document in _mp_cons, running will continue normally. \n" +
                                "Document length in chars: %s, and ID: %s", pid, len(str(text)), i_text)
                 logger.warning(str(e))
+                if self.config.general.usage_monitor.enabled:
+                    # NOTE: This is in another process, so need to explicitly flush
+                    self.usage_monitor._flush_logs()
                 sleep(2)
 
     def _add_nested_ent(self, doc: Doc, _ents: List[Span], _ent: Union[Dict, Span]) -> None:
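
Both hunks are gated on usage monitoring being enabled in the config. From the caller's side that looks roughly like this (a sketch based only on the attributes visible in the diff; the exact usage_monitor config surface may differ between versions):

# Sketch based on the config attributes visible in the diff above; the
# exact usage_monitor options may vary between MedCAT versions.
from medcat.cat import CAT

cat = CAT.load_model_pack("medmen_wstatus_2021_oct.zip")
cat.config.general.usage_monitor.enabled = True  # switch on inference logging

texts = ["Patient denies chest pain.", "", "History of hypertension."]
# For each processed document the monitor now records the original length,
# the trimmed length and the entity count via usage_monitor.log_inference().
results = cat.get_entities_multi_texts(texts)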

medcat/meta_cat.py

Lines changed: 3 additions & 4 deletions
@@ -257,20 +257,19 @@ def train_raw(self, data_loaded: Dict, save_dir_path: Optional[str] = None, data
         category_value2id = g_config['category_value2id']
         if not category_value2id:
             # Encode the category values
-            data_undersampled, full_data, category_value2id = encode_category_values(data,
+            full_data, data_undersampled, category_value2id = encode_category_values(data,
                                                                                      category_undersample=self.config.model.category_undersample)
             g_config['category_value2id'] = category_value2id
         else:
             # We already have everything, just get the data
-            data_undersampled, full_data, category_value2id = encode_category_values(data,
+            full_data, data_undersampled, category_value2id = encode_category_values(data,
                                                                                      existing_category_value2id=category_value2id,
                                                                                      category_undersample=self.config.model.category_undersample)
             g_config['category_value2id'] = category_value2id
         # Make sure the config number of classes is the same as the one found in the data
         if len(category_value2id) != self.config.model['nclasses']:
             logger.warning(
-                "The number of classes set in the config is not the same as the one found in the data: {} vs {}".format(
-                    self.config.model['nclasses'], len(category_value2id)))
+                "The number of classes set in the config is not the same as the one found in the data: %d vs %d",self.config.model['nclasses'], len(category_value2id))
             logger.warning("Auto-setting the nclasses value in config and rebuilding the model.")
             self.config.model['nclasses'] = len(category_value2id)
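
Besides swapping the return order of `encode_category_values`, the hunk replaces eager `str.format` with logging's lazy %-style arguments, so the message is only built when the record is actually emitted:

# Why the lazy form matters: with %-style arguments, logging defers string
# formatting until the record passes the level/filter checks.
import logging

logger = logging.getLogger(__name__)
nclasses, found = 2, 3

# Eager: the string is always built, even if WARNING is filtered out.
logger.warning("config vs data: {} vs {}".format(nclasses, found))
# Lazy: formatting happens only if the record is emitted.
logger.warning("config vs data: %d vs %d", nclasses, found)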

medcat/ner/transformers_ner.py

Lines changed: 34 additions & 29 deletions
@@ -4,7 +4,7 @@
 import datasets
 from spacy.tokens import Doc
 from datetime import datetime
-from typing import Iterable, Iterator, Optional, Dict, List, cast, Union, Tuple, Callable
+from typing import Iterable, Iterator, Optional, Dict, List, cast, Union, Tuple, Callable, Type
 from spacy.tokens import Span
 import inspect
 from functools import partial
@@ -87,7 +87,13 @@ def create_eval_pipeline(self):
             # NOTE: this will fix the DeID model(s) created before medcat 1.9.3
             # though this fix may very well be unstable
             self.ner_pipe.tokenizer._in_target_context_manager = False
+        if not hasattr(self.ner_pipe.tokenizer, 'split_special_tokens'):
+            # NOTE: this will fix the DeID model(s) created with transformers before 4.42
+            # and allow them to run with later transforemrs
+            self.ner_pipe.tokenizer.split_special_tokens = False
         self.ner_pipe.device = self.model.device
+        self._consecutive_identical_failures = 0
+        self._last_exception: Optional[Tuple[str, Type[Exception]]] = None
 
     def get_hash(self) -> str:
         """A partial hash trying to catch differences between models.
@@ -390,34 +396,33 @@ def _process(self,
         #all_text_processed = self.tokenizer.encode_eval(all_text)
         # For now we will process the documents one by one, should be improved in the future to use batching
         for doc in docs:
-            try:
-                res = self.ner_pipe(doc.text, aggregation_strategy=self.config.general['ner_aggregation_strategy'])
-                doc.ents = []  # type: ignore
-                for r in res:
-                    inds = []
-                    for ind, word in enumerate(doc):
-                        end_char = word.idx + len(word.text)
-                        if end_char <= r['end'] and end_char > r['start']:
-                            inds.append(ind)
-                        # To not loop through everything
-                        if end_char > r['end']:
-                            break
-                    if inds:
-                        entity = Span(doc, min(inds), max(inds) + 1, label=r['entity_group'])
-                        entity._.cui = r['entity_group']
-                        entity._.context_similarity = r['score']
-                        entity._.detected_name = r['word']
-                        entity._.id = len(doc._.ents)
-                        entity._.confidence = r['score']
-
-                        doc._.ents.append(entity)
-                create_main_ann(self.cdb, doc)
-                if self.cdb.config.general['make_pretty_labels'] is not None:
-                    make_pretty_labels(self.cdb, doc, LabelStyle[self.cdb.config.general['make_pretty_labels']])
-                if self.cdb.config.general['map_cui_to_group'] is not None and self.cdb.addl_info.get('cui2group', {}):
-                    map_ents_to_groups(self.cdb, doc)
-            except Exception as e:
-                logger.warning(e, exc_info=True)
+            res = self.ner_pipe(doc.text, aggregation_strategy=self.config.general['ner_aggregation_strategy'])
+            doc.ents = []  # type: ignore
+            for r in res:
+                inds = []
+                for ind, word in enumerate(doc):
+                    end_char = word.idx + len(word.text)
+                    if end_char <= r['end'] and end_char > r['start']:
+                        inds.append(ind)
+                    # To not loop through everything
+                    if end_char > r['end']:
+                        break
+                if inds:
+                    entity = Span(doc, min(inds), max(inds) + 1, label=r['entity_group'])
+                    entity._.cui = r['entity_group']
+                    entity._.context_similarity = r['score']
+                    entity._.detected_name = r['word']
+                    entity._.id = len(doc._.ents)
+                    entity._.confidence = r['score']
+
+                    doc._.ents.append(entity)
+            create_main_ann(self.cdb, doc)
+            if self.cdb.config.general['make_pretty_labels'] is not None:
+                make_pretty_labels(self.cdb, doc, LabelStyle[self.cdb.config.general['make_pretty_labels']])
+            if self.cdb.config.general['map_cui_to_group'] is not None and self.cdb.addl_info.get('cui2group', {}):
+                map_ents_to_groups(self.cdb, doc)
+            self._consecutive_identical_failures = 0  # success
+            self._last_exception = None
         yield from docs
 
     # Override
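
Note that the per-document try/except that used to swallow errors is gone, and the loop now resets `_consecutive_identical_failures` and `_last_exception` after every successful document. The counting side is not shown in this hunk; a plausible shape for it (hypothetical, for illustration only, including the threshold) is:

from typing import Optional, Tuple, Type


class FailureTracker:
    """Hypothetical sketch of the counting side implied by the reset
    logic above; the real handler lives outside the shown hunk."""

    MAX_IDENTICAL = 10  # assumed threshold, not taken from the diff

    def __init__(self) -> None:
        self._consecutive_identical_failures = 0
        self._last_exception: Optional[Tuple[str, Type[Exception]]] = None

    def record(self, exc: Exception) -> None:
        key = (str(exc), type(exc))
        if key == self._last_exception:
            self._consecutive_identical_failures += 1
        else:
            self._last_exception = key
            self._consecutive_identical_failures = 1
        if self._consecutive_identical_failures >= self.MAX_IDENTICAL:
            # Repeated identical failures suggest a systematic problem,
            # so stop rather than loop forever.
            raise exc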
