CU-8699049kf MedCAT v2 support #20

Draft
wants to merge 67 commits into
base: main
67 commits
d56154d
CU-8699049kf: Bump requirement to v2 (0.3.3)
mart-r May 13, 2025
dcdd984
CU-8698up3x0: Add optional extras needed for WWC
mart-r May 13, 2025
cf3cfbc
CU-8699049kf: Update requirements to work with newer packages.
mart-r May 13, 2025
03e9814
CU-8699049kf: Add compatibility layer as a package
mart-r May 13, 2025
0e47df4
CU-8699049kf: Install compatibility layer with requirements
mart-r May 13, 2025
4001e88
CU-8699049kf: Run workflow on Ubuntu 24.04 instead of EoL 20.04
mart-r May 13, 2025
7690236
CU-8699049kf: Run workflow on 3.9<=python<=3.12
mart-r May 13, 2025
a8983cb
CU-8699049kf: Add type ignoring in compatibility package
mart-r May 13, 2025
2ea5650
CU-8699049kf: Add custom test runner
mart-r May 13, 2025
232cebd
CU-8699049kf: Update custom runner to allow for specific test locations
mart-r May 13, 2025
027fc92
CU-8699049kf: Use custom test runner in workflow
mart-r May 13, 2025
126831e
CU-8699049kf: Fix custom test runner name in workflow
mart-r May 13, 2025
2bde16f
CU-8699049kf: Move test runner to a different folder to fix issues
mart-r May 13, 2025
8edb65c
CU-8699049kf: Remove redundant code / imports
mart-r May 13, 2025
7f0380a
CU-8699049kf: Move tests to a different folder for namespaces reasons
mart-r May 14, 2025
6d138e9
CU-8699049kf: Fix test-data location
mart-r May 14, 2025
a02dbc0
CU-8699049kf: Fix test-data location (CDB creation)
mart-r May 14, 2025
f07ac28
CU-8699049kf: Add manual relocation of packages
mart-r May 14, 2025
3f9399d
CU-8699049kf: Fix test resources path (create modelpack)
mart-r May 14, 2025
eadcdc9
CU-8699049kf: Use medcat2-based modelpack load/save code
mart-r May 14, 2025
7e9a20d
CU-8699049kf: Use medcat2-based Vocab load/save code
mart-r May 14, 2025
5aa4928
CU-8699049kf: Make sure to create directory before saving in it (Vocab)
mart-r May 14, 2025
5d163d6
CU-8699049kf: Add automatic legacy conversion of CDB and Vocab to com…
mart-r May 14, 2025
e9bd3fb
CU-8699049kf: Treat saved paths as folders during tests (Vocab)
mart-r May 14, 2025
a579a0b
CU-8699049kf: Adapt CDB creation to v2 paths (config) and serialising
mart-r May 14, 2025
732163d
CU-8699049kf: Adapt UMLS CDB creation to v2 paths (config) and serial…
mart-r May 14, 2025
4a6f509
CU-8699049kf: Adapt model pack creation to v2 paths (config) and seri…
mart-r May 14, 2025
2ade2fd
CU-8699049kf: Adapt Vocab test-time serialising to v2 methods
mart-r May 14, 2025
fb0b5ac
CU-8699049kf: Adapt model pack creation tests to v2 standards (basenam…
mart-r May 14, 2025
23a19e4
CU-8699049kf: Allow overwriting existing models for test-time purposes
mart-r May 14, 2025
ef7c6f7
CU-8699049kf: Move to v2-type (de)serialising in CDB creation tests (…
mart-r May 14, 2025
a333084
CU-8699049kf: Fix test runner path in workflow
mart-r May 14, 2025
f1249fc
CU-8699049kf: Move model compare closer to v2 format
mart-r May 14, 2025
5677fb2
CU-8699049kf: Fix typing for ResultsTally model
mart-r May 14, 2025
0e6cef9
CU-8699049kf: Fix test-time mocked method for training
mart-r May 14, 2025
53ed1b4
CU-8699049kf: Fix missing keyword argument when initialising results …
mart-r May 14, 2025
b26fbef
CU-8699049kf: Fix v2-specific cui set in comparison
mart-r May 14, 2025
439d58d
CU-8699049kf: Fix v2-specific CDB stats/info in comparison
mart-r May 14, 2025
da79150
CU-8699049kf: Use v2-specific version for CDB comparison
mart-r May 14, 2025
54fa6c7
CU-8699049kf: Update dependency to v0.3.4
mart-r May 14, 2025
01c8e1a
CU-8699049kf: Add message to problematic assert call in tests
mart-r May 14, 2025
9adf59d
CU-8699049kf: Change way of asserting mock method call
mart-r May 14, 2025
38b61f2
CU-8699049kf: Patch the train method for instance as well
mart-r May 14, 2025
5fd419f
CU-8699049kf: Update unsupervised training script to v2
mart-r May 15, 2025
c4fe9b8
CU-8699049kf: Update supervised training script to v2
mart-r May 15, 2025
b4434a4
CU-8699049kf: Update MetaCAT notebook to v2
mart-r May 15, 2025
40fc838
CU-8699049kf: Update 2-phase learning MetaCAT notebook to v2 to the b…
mart-r May 15, 2025
ab45464
CU-8699049kf: Update run_model script for v2
mart-r May 15, 2025
1755848
CU-8699049kf: Update run_model script for v2 (remove unavailable keyw…
mart-r May 15, 2025
3cb57cd
CU-8699049kf: Update run_model notebook for v2 as best as possible
mart-r May 15, 2025
149f5b1
CU-8699049kf: Update (most of) mct_analysis to v2
mart-r May 15, 2025
58aaebc
CU-8699049kf: Fix a few minor typing issues
mart-r May 15, 2025
09ca315
CU-8699049kf: Update to latest v2 release (0.5.0)
mart-r Jun 9, 2025
05c8e64
CU-8699049kf: Remove compatibility layer (no longer needed)
mart-r Jun 9, 2025
bf99043
Merge branch 'main' into CU-8699049kf-mct-v2
mart-r Jun 9, 2025
2ec14ae
CU-8699049kf: Update for legacy CDB/Vocab load
mart-r Jun 9, 2025
2249cd2
CU-8699049kf: Fix imports when creating CDBs
mart-r Jun 9, 2025
29e3c9d
CU-8699049kf: Fix imports in MCT analysis
mart-r Jun 9, 2025
595f30d
CU-8699049kf: Remove accidental force-deserialisation
mart-r Jun 9, 2025
3e0b7ec
CU-8699049kf: Fix typo in Vocab load path
mart-r Jun 9, 2025
3d2269d
CU-8699049kf: Bump to latest v2 release (v0.5.1)
mart-r Jun 9, 2025
d0813e5
CU-8699049kf: Update requirements to v0.6.0
mart-r Jun 10, 2025
322de49
CU-8699049kf: Update requirements to v0.6.1
mart-r Jun 11, 2025
3b84efa
CU-8699049kf: Update dependency to 0.7.0
mart-r Jun 12, 2025
c88dbe5
CU-8699049kf: Update ents property after name change
mart-r Jun 12, 2025
de0be55
CU-8699049kf: Update for use of convenience method for CDB/Vocab loading
mart-r Jun 12, 2025
26ee9f2
CU-8699049kf: Update to v0.8.0
mart-r Jun 13, 2025
4 changes: 2 additions & 2 deletions .github/workflows/main.yml
@@ -34,8 +34,8 @@ jobs:
           python -m mypy `git ls-tree --full-tree --name-only -r HEAD | grep ".py$" | grep -v "tests/"` --explicit-package-bases --follow-imports=normal
       - name: Test
         run: |
-          python -m unittest discover
-          python -m unittest discover -s medcat/compare_models
+          python tests/runner/custom_test_runner.py
+          python tests/runner/custom_test_runner.py -s medcat/compare_models
           # TODO - in the future, we might want to add automated tests for notebooks as well
           # though it's not really possible right now since the notebooks are designed
           # in a way that assumes interaction (i.e. specifying model pack names)
17 changes: 11 additions & 6 deletions medcat/1_create_model/create_cdb/create_cdb.py
@@ -1,7 +1,8 @@
 import os
 import pandas as pd
 from medcat.config import Config
-from medcat.cdb_maker import CDBMaker
+from medcat.model_creation.cdb_maker import CDBMaker
+from medcat.storage.serialisers import serialise, AvailableSerialisers
 
 pd.options.mode.chained_assignment = None # type: ignore
 
@@ -24,6 +25,10 @@
 
 model_dir = os.path.join(BASE_PATH, "models", "cdb")
 output_cdb = os.path.join(model_dir, f"{release}_SNOMED_cdb.dat")
+os.makedirs(output_cdb, exist_ok=True)
+# NOTE: by default, new models created at the same location will not be saved
+# so here we allow overwriting
+allow_overwrite = True
 csv = pd.read_csv(csv_path)
 
 # Remove null values
@@ -50,9 +55,9 @@
 
 # Setup config
 config = Config()
-config.general['spacy_model'] = 'en_core_web_md'
-config.cdb_maker['remove_parenthesis'] = 1
-config.general['cdb_source_name'] = f'SNOMED_{release}'
+config.general.nlp.modelname = 'en_core_web_md'
+config.cdb_maker.remove_parenthesis = 1
+# config.general.cdb_source_name = f'SNOMED_{release}'
 
 maker = CDBMaker(config)
 
@@ -64,8 +69,8 @@
 
 # Add type_id pretty names to cdb
 cdb.addl_info['type_id2name'] = pd.Series(csv.description_type_ids.values, index=csv.type_ids.astype(str)).to_dict()
-cdb.config.linking['filters']['cuis'] = set(csv['cui'].tolist()) # Add all cuis to filter out legacy terms.
+cdb.config.components.linking.filters.cuis = set(csv['cui'].tolist()) # Add all cuis to filter out legacy terms.
 
 # save model
-cdb.save(output_cdb)
+serialise(AvailableSerialisers.dill, cdb, output_cdb, overwrite=allow_overwrite)
 print(f"CDB Model saved successfully as: {output_cdb}")
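The config changes in this file follow a pattern that recurs throughout the PR: dict-style keys such as `config.general['spacy_model']` become nested attributes such as `config.general.nlp.modelname`. As a rough illustration only, the sketch below applies a small old-to-new path table to a stand-in config object. The table holds just a few pairs visible in this PR's diffs. `V1_TO_V2`, `set_by_path`, `migrate` and the `SimpleNamespace` stand-in are hypothetical and not part of MedCAT.

```python
from types import SimpleNamespace

# Old (v1, dict-style) paths -> new (v2, attribute-style) paths, taken from the diffs.
V1_TO_V2 = {
    "general.spacy_model": "general.nlp.modelname",
    "linking.filters.cuis": "components.linking.filters.cuis",
    "ner.min_name_len": "components.ner.min_name_len",
}


def set_by_path(root, dotted_path, value):
    """Set a nested attribute, creating intermediate namespaces as needed."""
    *parents, leaf = dotted_path.split(".")
    node = root
    for name in parents:
        if not hasattr(node, name):
            setattr(node, name, SimpleNamespace())
        node = getattr(node, name)
    setattr(node, leaf, value)


def migrate(old_settings):
    """Build a stand-in v2-style config from {old_path: value} pairs."""
    config = SimpleNamespace()
    for old_path, value in old_settings.items():
        # Paths without an entry in the table are assumed to be unchanged.
        set_by_path(config, V1_TO_V2.get(old_path, old_path), value)
    return config


config = migrate({"general.spacy_model": "en_core_web_md", "ner.min_name_len": 2})
print(config.general.nlp.modelname)
```

Keeping the renames in one table makes it easier to review them in a single place than when they are scattered across scripts.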
17 changes: 11 additions & 6 deletions medcat/1_create_model/create_cdb/create_umls_cdb.py
@@ -1,7 +1,8 @@
 import os
 import pandas as pd
 from medcat.config import Config
-from medcat.cdb_maker import CDBMaker
+from medcat.model_creation.cdb_maker import CDBMaker
+from medcat.storage.serialisers import serialise, AvailableSerialisers
 
 pd.options.mode.chained_assignment = None # type: ignore
 
@@ -28,6 +29,10 @@
 
 model_dir = os.path.join(BASE_PATH, "models", "cdb")
 output_cdb = os.path.join(model_dir, f"{release}_UMLS_cdb.dat")
+os.makedirs(output_cdb, exist_ok=True)
+# NOTE: by default, new models created at the same location will not be saved
+# so here we allow overwriting
+allow_overwrite = True
 csv = pd.read_csv(csv_path)
 
 # Remove null values
@@ -39,9 +44,9 @@
 
 # Setup config
 config = Config()
-config.general['spacy_model'] = 'en_core_web_md'
-config.cdb_maker['remove_parenthesis'] = 1
-config.general['cdb_source_name'] = f'UMLS_{release}'
+config.general.nlp.modelname = 'en_core_web_md'
+config.cdb_maker.remove_parenthesis = 1
+# config.general.cdb_source_name = f'UMLS_{release}'
 
 maker = CDBMaker(config)
 
@@ -52,8 +57,8 @@
 cdb = maker.prepare_csvs(csv_paths, full_build=True)
 
 # Add type_id pretty names to cdb
-cdb.config.linking['filters']['cuis'] = set(csv['cui'].tolist()) # Add all cuis to filter out legacy terms.
+cdb.config.components.linking.filters.cuis = set(csv['cui'].tolist()) # Add all cuis to filter out legacy terms.
 
 # save model
-cdb.save(output_cdb)
+serialise(AvailableSerialisers.dill, cdb, output_cdb, overwrite=allow_overwrite)
 print(f"CDB Model saved successfully as: {output_cdb}")
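The NOTE in both CDB scripts says that, by default, a model will not be saved over an existing one at the same location, which is why they pass `overwrite=allow_overwrite`. The sketch below mimics that kind of guard with the standard library only. `save_to_folder`, the `obj.pkl` file name and the use of `pickle` are assumptions for illustration and say nothing about how MedCAT's `serialise` is implemented.

```python
import os
import pickle
import tempfile


def save_to_folder(obj, folder: str, overwrite: bool = False) -> str:
    """Pickle obj into folder/obj.pkl, refusing to clobber unless overwrite=True."""
    os.makedirs(folder, exist_ok=True)
    target = os.path.join(folder, "obj.pkl")
    if os.path.exists(target) and not overwrite:
        raise FileExistsError(f"{target} exists; pass overwrite=True to replace it")
    with open(target, "wb") as f:
        pickle.dump(obj, f)
    return target


with tempfile.TemporaryDirectory() as tmp:
    folder = os.path.join(tmp, "demo_cdb.dat")
    save_to_folder({"cui": "C0000001"}, folder)
    try:
        save_to_folder({"cui": "C0000002"}, folder)  # second save is refused
    except FileExistsError as err:
        print("refused:", err)
    save_to_folder({"cui": "C0000002"}, folder, overwrite=True)  # explicit opt-in
```

Making overwriting opt-in protects a trained model from being replaced by accident, while scripts that are expected to be re-run can set the flag once.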
32 changes: 21 additions & 11 deletions medcat/1_create_model/create_modelpack/create_modelpack.py
@@ -39,27 +39,37 @@ def load_cdb_and_save_modelpack(cdb_path: str,
         str: The model pack path.
     """
     # Load cdb
-    cdb = CDB.load(cdb_path)
+    cdb: CDB
+    try:
+        cdb = CDB.load(cdb_path)
+    except NotADirectoryError:
+        from medcat.utils.legacy.convert_cdb import get_cdb_from_old
+        cdb = get_cdb_from_old(cdb_path)
 
     # Set cdb configuration
     # technically we already created this during the cdb creation
-    cdb.config.ner['min_name_len'] = 2
-    cdb.config.ner['upper_case_limit_len'] = 3
-    cdb.config.general['spell_check'] = True
-    cdb.config.linking['train_count_threshold'] = 10
-    cdb.config.linking['similarity_threshold'] = 0.3
-    cdb.config.linking['train'] = True
-    cdb.config.linking['disamb_length_limit'] = 4
-    cdb.config.general['full_unlink'] = True
+    cdb.config.components.ner.min_name_len = 2
+    cdb.config.components.ner.upper_case_limit_len = 3
+    cdb.config.general.spell_check = True
+    cdb.config.components.linking.train_count_threshold = 10
+    cdb.config.components.linking.similarity_threshold = 0.3
+    cdb.config.components.linking.train = True
+    cdb.config.components.linking.disamb_length_limit = 4
+    cdb.config.general.full_unlink = True
 
     # Load vocab
-    vocab = Vocab.load(vocab_path)
+    vocab: Vocab
+    try:
+        vocab = Vocab.load(vocab_path)
+    except NotADirectoryError:
+        from medcat.utils.legacy.convert_vocab import get_vocab_from_old
+        vocab = get_vocab_from_old(vocab_path)
 
     # Initialise the model
     cat = CAT(cdb=cdb, config=cdb.config, vocab=vocab)
 
     # Create and save model pack
-    return cat.create_model_pack(save_dir_path=modelpack_path, model_pack_name=modelpack_name)
+    return cat.save_model_pack(modelpack_path, pack_name=modelpack_name)
 
 
 def load_cdb_and_save_modelpack_in_def_location(cdb_name: str,
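Both loads in `load_cdb_and_save_modelpack` use the same shape: try the v2 loader first and, if it raises `NotADirectoryError` (presumably because the path is a legacy single file rather than a v2 folder), fall back to a legacy converter. A generic version of that pattern might look like the sketch below. `load_with_legacy_fallback` and the two toy loaders are hypothetical stand-ins, not MedCAT functions.

```python
import os
import tempfile
from typing import Callable, TypeVar

T = TypeVar("T")


def load_with_legacy_fallback(path: str,
                              load_new: Callable[[str], T],
                              load_legacy: Callable[[str], T]) -> T:
    """Try the folder-based loader; fall back to the legacy one for plain files."""
    try:
        return load_new(path)
    except NotADirectoryError:
        return load_legacy(path)


def toy_load_new(path: str) -> str:
    # Stand-in for a folder-based loader: it rejects anything that is not a directory.
    if not os.path.isdir(path):
        raise NotADirectoryError(path)
    return "new-format"


def toy_load_legacy(path: str) -> str:
    # Stand-in for a legacy single-file converter.
    return "legacy"


with tempfile.TemporaryDirectory() as folder:
    legacy_file = os.path.join(folder, "old_cdb.dat")
    open(legacy_file, "w").close()
    print(load_with_legacy_fallback(legacy_file, toy_load_new, toy_load_legacy))
    print(load_with_legacy_fallback(folder, toy_load_new, toy_load_legacy))
```

Factoring the fallback into one helper would avoid repeating the `try`/`except` block for the CDB and the Vocab, at the cost of one more indirection.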
6 changes: 4 additions & 2 deletions medcat/1_create_model/create_vocab/create_vocab.py
@@ -1,4 +1,5 @@
 from medcat.vocab import Vocab
+from medcat.storage.serialisers import serialise, AvailableSerialisers
 import os
 
 vocab = Vocab()
@@ -17,5 +18,6 @@
 # embeddings of 300 dimensions is standard
 
 vocab.add_words(os.path.join(vocab_dir, 'vocab_data.txt'), replace=True)
-vocab.make_unigram_table()
-vocab.save(os.path.join(vocab_dir, "vocab.dat"))
+vocab_folder = os.path.join(vocab_dir, "vocab.dat")
+os.makedirs(vocab_folder, exist_ok=True)
+serialise(AvailableSerialisers.dill, vocab, vocab_folder)
[file header not captured: Jupyter notebook (.ipynb) diff]
@@ -55,7 +55,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "cat.cdb.print_stats()"
+    "cat.cdb.get_basic_info()"
    ]
   },
   {
@@ -88,21 +88,12 @@
    "outputs": [],
    "source": [
     "# Print statistics on the CDB before training\n",
-    "cat.cdb.print_stats()\n",
+    "cat.cdb.get_basic_info()\n",
     "\n",
     "# Run the annotation procedure over all the documents we have,\n",
     "# given that we have a large number of documents this can take quite some time.\n",
     "\n",
-    "for i, text in enumerate(data['text'].values):\n",
-    "    # This will now run the training in the background \n",
-    "    try:\n",
-    "        _ = cat(text, do_train=True)\n",
-    "    except TypeError:\n",
-    "        pass\n",
-    "    \n",
-    "    # So we know how things are moving\n",
-    "    if i % 10000 == 0:\n",
-    "        print(\"Finished {} - text blocks\".format(i))\n"
+    "cat.trainer.train_unsupervised(data.text)\n"
    ]
   },
   {
@@ -112,7 +103,7 @@
    "outputs": [],
    "source": [
     "# Print statistics on the CDB after training\n",
-    "cat.cdb.print_stats()"
+    "cat.cdb.get_basic_info()"
    ]
   },
   {
@@ -122,7 +113,8 @@
    "outputs": [],
    "source": [
     "# save modelpack\n",
-    "cat.create_model_pack(save_dir_path=model_dir, model_pack_name=output_modelpack)\n"
+    "\n",
+    "cat.save_model_pack(model_dir, pack_name=output_modelpack)\n"
    ]
   },
   {
@@ -135,7 +127,7 @@
  ],
  "metadata": {
   "kernelspec": {
-   "display_name": "medcat",
+   "display_name": "venv_v2",
    "language": "python",
    "name": "python3"
   },
@@ -149,12 +141,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.10.8 (main, Nov 24 2022, 08:08:27) [Clang 14.0.6 ]"
-  },
-  "vscode": {
-   "interpreter": {
-    "hash": "4e4ccc64ca47f932c34194843713e175cf3a19af3798844e4190152d16ba61ca"
-   }
+   "version": "3.10.13"
   }
  },
  "nbformat": 4,