Updates for MetaCAT #515

shubham-s-agarwal · 2025-01-24T12:35:56Z

Addressing the multiple zero-division-error warnings per epoch while training
Accommodating the variations in category name and class name across NHS sites; this is done by introducing 2 new config parameters: category_names_map and class_names_map that store the possible variations

- Addressing the multiple zero-division-error warnings per epoch while training - Accommodating the variations in category name and class name across NHS sites

mart-r

I think there's some naming issues there.

And (potentially) some (minor!) logic issues as well. After all, if the idea of this PR is to more elegantly catch user errors, then it makes sense to do so as much as we can.

Also, zero-division wise, does it make sense to set it to 1 if there's retrieved elements? I would think 0 would suit better, but am open to being persuaded otherwise.

mart-r · 2025-01-24T13:41:00Z

medcat/config_meta_cat.py

@@ -27,8 +27,22 @@ class General(MixingConfig, BaseModel):
    """What category is this meta_cat model predicting/training.

    NB! For these changes to take effect, the pipe would need to be recreated."""
+    category_names_map: List = []


This not a mapping. It's a list. As such, the name doesn't make a lot of sense.
Perhaps this could be named alternative_category_names or something along those lines?

mart-r · 2025-01-24T13:43:25Z

medcat/config_meta_cat.py

    category_value2id: Dict = {}
    """Map from category values to ID, if empty it will be autocalculated during training"""
+    class_names_map: List[List] = [[]]


Again, not a map, but a list. At least in its current state.
Perhaps making it a Dict[str, str] could make sense? I.e a mapping from the presented class name to the expected one.
The example would then be:

{"Hypothetical (N/A)": "Hypothetical", "Not present (False)": "False", "Present (True)": "True"}

medcat/meta_cat.py

mart-r · 2025-01-24T13:49:54Z

medcat/meta_cat.py

-                    category_name, " | ".join(list(data.keys()))))
+            category_matching = [cat for cat in category_name_options if cat in data.keys()]
+            if len(category_matching) > 0:
+                logger.warning("The category name provided in the config - '%s' is not present in the data. However, the corresponding name - '%s' from the category_name_mapping has been found. Updating the category name...",category_name,*category_matching)


I'd say this would be an info message since after this change it'll be expected behaviour.

mart-r · 2025-01-24T13:53:06Z

medcat/meta_cat.py

+                category_name = g_config['category_name']
+            else:
+                raise Exception(
+                    "The category name does not exist in this json file. You've provided '{}', while the possible options are: {}".format(


Perhaps mention the possibility of setting config.general.category_names_map for alternatives in here?

medcat/utils/meta_cat/data_utils.py

mart-r · 2025-01-24T14:17:10Z

medcat/utils/meta_cat/data_utils.py

+                        if len(found_in) != 0:
+                            class_name_matched = [label for label in found_in if label in category_values][0]
+                            updated_category_value2id[class_name_matched] = category_value2id[_class]
+                            logger.warning("Class name '%s' does not exist in the data; however a variation of it '%s' is present; updating it...",_class,class_name_matched)


Again, I feel like at this point, it'd be expected behaviour and better suited as an info message.

mart-r · 2025-01-24T14:19:59Z

medcat/utils/meta_cat/data_utils.py

+            # Else throw an exception since the labels don't match
+            else:
+                raise Exception(
+                    f"The classes set in the config are not the same as the one found in the data. The classes present in the config vs the ones found in the data - {set(category_value2id.keys())}, {category_values}")


Perhaps also mention the possibility of setting config.general.category_names_map for alternatives?

mart-r · 2025-01-24T14:29:44Z

medcat/utils/meta_cat/ml_utils.py

@@ -329,12 +329,12 @@ def initialize_model(classifier, data_, batch_size_, lr_, epochs=4):
        print_report(epoch, running_loss_test, all_logits_test, y=y_test, name='Test')

        _report = classification_report(y_test, np.argmax(np.concatenate(all_logits_test, axis=0), axis=1),
-                                        output_dict=True)
+                                        output_dict=True,zero_division=1)


Does it make sense to have it return 1? I.e if there are no retrieved elements, then precision is undetermined. But if we set it to 1 and don't show the warning, the user might think everything is working great. But in reality, nothing was extracted.
Perhaps setting it to 0 would make more sense?

Though in an ideal world, I'd love to still show a warning to the user. But maybe limit it to once per 10 minutes or something. But I don't think they have API for something like that. And I don't think it'd make sense to spend the time implementing it, either.

mart-r · 2025-01-24T14:30:18Z

medcat/utils/meta_cat/ml_utils.py

        if not winner_report or _report[config.train['metric']['base']][config.train['metric']['score']] > \
                winner_report['report'][config.train['metric']['base']][config.train['metric']['score']]:

            report = classification_report(y_test, np.argmax(np.concatenate(all_logits_test, axis=0), axis=1),
-                                           output_dict=True)
+                                           output_dict=True,zero_division=1)


Same question here. Does it make sense to show 1 in case of no retrieved elements? Or would 0 be more appropriate?

shubham-s-agarwal · 2025-01-24T15:53:22Z

I've addressed all the changes!

mart-r

Looks good to me.

Though perhaps the alternative_class_names config option's doc strings could do with a small change to indicate that it's a list of lists of alternative class names, rather than a map. But I think it's fine since there's a clear example anyway.

tomolopolis · 2025-01-27T09:28:13Z

medcat/config_meta_cat.py

+    Example: For Experiencer, the alternate name is Subject
+    alternative_category_names: ['Experiencer','Subject']
+
+    In the case that one specified in 'category_name' parameter does not match the data, this ensures no error is raised and it is automatically mapped


Can you also say that the model output will be the value configured in <whatever the config property> is

I am not sure I completely understood this
Can you please explain this?

tomolopolis · 2025-01-27T09:30:07Z

medcat/config_meta_cat.py

    category_value2id: Dict = {}
    """Map from category values to ID, if empty it will be autocalculated during training"""
+    alternative_class_names: List[List] = [[]]
+    """Map that stores the variations of possible class names for the given category (task)


Please change to List of lists that stores..., and again specify that the final output of the model will be the class name label specified in the the config <property>

tomolopolis · 2025-01-27T09:53:07Z

medcat/meta_cat.py

+        self.config.model['nclasses'] = len(category_value2id)
+
+        # This is now handled in data_utils where an exception is raised when mismatch is found
+        # Make sure that the categoryvalue2id if present is same as the labels found


these commented lines are needed?

shubham-s-agarwal added 2 commits January 24, 2025 12:27

Pushing update for MetaCAT

f72fc28

- Addressing the multiple zero-division-error warnings per epoch while training - Accommodating the variations in category name and class name across NHS sites

Adding comments

874954e

shubham-s-agarwal added the enhancement New feature or request label Jan 24, 2025

shubham-s-agarwal requested review from tomolopolis and mart-r January 24, 2025 12:35

shubham-s-agarwal self-assigned this Jan 24, 2025

mart-r requested changes Jan 24, 2025

View reviewed changes

Pushing requested changes

ea8e6fe

Pushing type fix

c703014

mart-r approved these changes Jan 24, 2025

View reviewed changes

tomolopolis reviewed Jan 27, 2025

View reviewed changes

Pushing updates to metacat config

55d2165

mart-r mentioned this pull request Jan 29, 2025

CU-8697qfvzz train metacat on sup train #516

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Updates for MetaCAT #515

Updates for MetaCAT #515

shubham-s-agarwal commented Jan 24, 2025

mart-r left a comment

mart-r Jan 24, 2025

mart-r Jan 24, 2025

mart-r Jan 24, 2025

mart-r Jan 24, 2025

mart-r Jan 24, 2025

mart-r Jan 24, 2025

mart-r Jan 24, 2025

mart-r Jan 24, 2025

shubham-s-agarwal commented Jan 24, 2025

mart-r left a comment

tomolopolis Jan 27, 2025

shubham-s-agarwal Jan 27, 2025

tomolopolis Jan 27, 2025

tomolopolis Jan 27, 2025

Updates for MetaCAT #515

Are you sure you want to change the base?

Updates for MetaCAT #515

Conversation

shubham-s-agarwal commented Jan 24, 2025

mart-r left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

shubham-s-agarwal commented Jan 24, 2025

mart-r left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment