Skip to content

Conversation

devrimcavusoglu
Copy link
Contributor

Fixes #1986

Description: Former "remove & write" implementation order is reversed to "write & remove", api call has been added for ClearMLSaver remove functionality.

Check list:

  • New tests are added (if a new feature is added)
  • New doc strings: description and/or example code are in RST format
  • Documentation is updated (if required)

@github-actions github-actions bot added module: contrib Contrib module module: handlers Core Handlers module labels May 12, 2021
@vfdev-5
Copy link
Collaborator

vfdev-5 commented May 12, 2021

@devrimcavusoglu thanks for the PR, i'll review it asap

Comment on lines 809 to 813
from clearml.storage.helper import StorageHelper

clearml_logger = self._task.get_logger()
helper = StorageHelper.get(filename, logger=clearml_logger)
helper.delete(filename)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@devrimcavusoglu could you please explain why you are adding this code ?
I understand that the idea is to use clearml's v1.0 new feature and make remove work as expected for others.
I'm not sure how it would work with the previous logic where a queue is used...

cc @bmartinn any suggestion on how we could recode everything here ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you said, this would make use of clearml v1.0 new feature, and used to actually remove the remote or local file(s). For backward compatibility, I'd appreciate any suggestions.

Copy link
Contributor

@bmartinn bmartinn May 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to pass logger to StorageHelper.get , basically this should be fine:

helper = StorageHelper.get(filename)
helper.delete(filename)

clearml v1.0 supports older versions of clearml-server, so pushing the requirements for clearml>1.0 should not be limiting in any way for users.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review & comment @bmartinn. I'll update PR with that change asap. I and @vfdev-5 were mainly concerned about backward compatibility as it may negatively impact users using clearml<1.0; however, I see that it is not a concern.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@devrimcavusoglu I rechecked backwards compatibility, the only exception will be uploading models directly to clearml-server < 1.0. If this is the case the delete function might return an error as the http file server did not support the delete operation. I really think that just protecting with except ValueError should be enough. wdyt?

try:
  helper.delete(filename)
except ValueError:
  # log something ?!

Copy link
Collaborator

@vfdev-5 vfdev-5 May 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bmartinn can you please confirm whether we still need _CallbacksContext and _checkpoint_slots code with clearml v1.X ?

I have an impression that if we could remove data from remote storage then we can drastically simplify ClearMLSaver for v1.X ?

The idea could be to setup ClearMLSaver as ClearMLSaverV1 if clearml is present and has version >= 1.0.

For clearml version < 1.0 we can keep the existing code as ClearMLSaverV0 and try to temporary hack it in order to make it work with the new write->remove logic. Later we will remove the support for clearml version < 1.0.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@devrimcavusoglu I wonder if we could recode __call__ and remove methods for ClearMLSaverV0 to delay saving execution. We have two possibilities when saving new checkpoint:

  • checkpoint is intermediate in the slots -> only save is executed
  • checkpoint should replace a previous one in the slots -> save and remove are executed

Second case could be covered easily such that we call "saving" ops in remove method:

    ...
    slots[slots.index(filename)] = None
    ...
    try:
        super(ClearMLSaver, self).__call__(checkpoint, filename, metadata)
    finally:
        WeightsFileHandler.remove_pre_callback(pre_cb_id)
        WeightsFileHandler.remove_post_callback(post_cb_id)

The first case is complicated as remove method is not called. Maybe we could schedule a thread to execute saving ops after a delay of 5-10 seconds ?

What do you think ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vfdev-5 Hi there, sorry for delayed reply (was in tight schedule). Option 2 seems more solid to me, and also more practical. Regarding the class seperation (former comment) for different versions, I think rather than pointing to V0 and V1, handling it under ClearMLSaver would be more concrete, wdyt ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the class seperation (former comment) for different versions, I think rather than pointing to V0 and V1, handling it under ClearMLSaver would be more concrete, wdyt ?

I was thinking that doing things into two separate classes will help us later to quickly remove V0 when no one is using it anymore...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @vfdev-5,

@bmartinn can you please confirm whether we still need _CallbacksContext and _checkpoint_slots code with clearml v1.X ?

I think that if you handle storing a new checkpoint to a temporary-named newly uploaded, file, removing the previously uploaded file and renaming the temporary-named file to the previous name, you won't need the checkpoints 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module: contrib Contrib module module: handlers Core Handlers module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ModelCheckpoint improvement from "remove & write" to "write & remove"
4 participants