Issue 1986 Checkpointing write & remove order change #1995
@devrimcavusoglu thanks for the PR, I'll review it ASAP.
from clearml.storage.helper import StorageHelper

clearml_logger = self._task.get_logger()
helper = StorageHelper.get(filename, logger=clearml_logger)
helper.delete(filename)
@devrimcavusoglu could you please explain why you are adding this code?
I understand that the idea is to use the new feature in clearml v1.0 and make remove work as expected for others.
I'm not sure how it would work with the previous logic, where a queue is used...
cc @bmartinn any suggestion on how we could recode everything here?
As you said, this makes use of the new clearml v1.0 feature and is used to actually remove the remote or local file(s). For backward compatibility, I'd appreciate any suggestions.
No need to pass logger to StorageHelper.get, basically this should be fine:
helper = StorageHelper.get(filename)
helper.delete(filename)
clearml v1.0 supports older versions of clearml-server, so pushing the requirements for clearml>1.0 should not be limiting in any way for users.
@devrimcavusoglu I rechecked backwards compatibility; the only exception will be uploading models directly to clearml-server < 1.0. If this is the case, the delete function might return an error, as the HTTP file server did not support the delete operation. I really think that just protecting with except ValueError should be enough. wdyt?
try:
    helper.delete(filename)
except ValueError:
    # log something ?!
    pass
@bmartinn can you please confirm whether we still need the _CallbacksContext and _checkpoint_slots code with clearml v1.X?
I have the impression that if we can remove data from remote storage, then we can drastically simplify ClearMLSaver for v1.X.
The idea could be to set up ClearMLSaver as ClearMLSaverV1 if clearml is present and has version >= 1.0.
For clearml version < 1.0 we can keep the existing code as ClearMLSaverV0 and try to temporarily hack it in order to make it work with the new write->remove logic. Later we will remove the support for clearml version < 1.0.
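A minimal sketch of that version switch, under stated assumptions: the two classes below are empty placeholders for the proposed ClearMLSaverV0 and ClearMLSaverV1, and `parse_version` and `select_saver_class` are hypothetical helpers written for this example. In real code the version string would come from the installed clearml package.

```python
import re


def parse_version(version):
    """Return the leading numeric components of a version string as a tuple."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


class ClearMLSaverV0:
    """Placeholder for the existing slot-based saver (clearml < 1.0)."""


class ClearMLSaverV1:
    """Placeholder for a simplified saver that can delete remote files."""


def select_saver_class(clearml_version):
    """Pick the saver implementation for the given clearml version string."""
    if parse_version(clearml_version) >= (1, 0):
        return ClearMLSaverV1
    return ClearMLSaverV0
```

Note that this simple parser treats a pre-release such as "1.0.0rc1" as 1.0.0; a stricter comparison would need a proper version library.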
@devrimcavusoglu I wonder if we could recode the __call__ and remove methods for ClearMLSaverV0 to delay the saving execution. We have two possibilities when saving a new checkpoint:
- the checkpoint is intermediate in the slots -> only save is executed
- the checkpoint should replace a previous one in the slots -> save and remove are executed
The second case could be covered easily by calling the "saving" ops in the remove method:
...
slots[slots.index(filename)] = None
...
try:
    super(ClearMLSaver, self).__call__(checkpoint, filename, metadata)
finally:
    WeightsFileHandler.remove_pre_callback(pre_cb_id)
    WeightsFileHandler.remove_post_callback(post_cb_id)
The first case is complicated, as the remove method is not called. Maybe we could schedule a thread to execute the saving ops after a delay of 5-10 seconds?
What do you think?
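The delayed-save idea could be sketched roughly as follows. This is not the ClearMLSaver code: `DelayedSaver`, `save_fn` and `remove_fn` are hypothetical names, and the sketch only shows the control flow, where a save waits until either remove is called or a timer expires.

```python
import threading


class DelayedSaver:
    """Defers each save until remove() is called or a timer expires."""

    def __init__(self, save_fn, remove_fn, delay=5.0):
        self._save_fn = save_fn
        self._remove_fn = remove_fn
        self._delay = delay
        self._lock = threading.Lock()
        self._pending = None  # ((checkpoint, filename), timer) or None

    def __call__(self, checkpoint, filename):
        self.flush()  # never keep more than one save pending
        timer = threading.Timer(self._delay, self.flush)
        timer.daemon = True
        with self._lock:
            self._pending = ((checkpoint, filename), timer)
        timer.start()

    def remove(self, filename):
        # Second case: free the slot first, then run the waiting save.
        self._remove_fn(filename)
        self.flush()

    def flush(self):
        # First case: nobody called remove(), so the timer ends up here.
        with self._lock:
            pending, self._pending = self._pending, None
        if pending is not None:
            args, timer = pending
            timer.cancel()
            self._save_fn(*args)
```

The lock ensures each pending save runs exactly once, whether it is triggered by remove, by the timer, or by the next save call.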
@vfdev-5 Hi there, sorry for the delayed reply (I was on a tight schedule). Option 2 seems more solid to me, and also more practical. Regarding the class separation (former comment) for different versions, I think rather than pointing to V0 and V1, handling it under ClearMLSaver would be more concrete, wdyt?
> Regarding the class separation (former comment) for different versions, I think rather than pointing to V0 and V1, handling it under ClearMLSaver would be more concrete, wdyt?
I was thinking that splitting this into two separate classes would help us later to quickly remove V0 when no one is using it anymore...
Hi @vfdev-5,
> @bmartinn can you please confirm whether we still need _CallbacksContext and _checkpoint_slots code with clearml v1.X ?
I think that if you handle storing a new checkpoint by uploading it to a temporarily named file, removing the previously uploaded file, and renaming the temporarily named file to the previous name, you won't need the checkpoint slots 🙂
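The write-to-temporary, remove, rename sequence described above can be sketched on a local filesystem as follows. This is only an illustration: `replace_checkpoint` is a hypothetical helper, and whether a given remote storage backend offers a rename operation is an assumption that would need checking.

```python
import os
import tempfile


def replace_checkpoint(data, final_path):
    """Write `data` under a temporary name, drop the old file, then rename."""
    directory = os.path.dirname(os.path.abspath(final_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        # 1. Store the new checkpoint under a temporary name.
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        # 2. Remove the previously stored file, if any.
        if os.path.exists(final_path):
            os.remove(final_path)
        # 3. Give the new file the previous name.
        os.rename(tmp_path, final_path)
    except BaseException:
        # Clean up the temporary file so failed writes leave nothing behind.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

On a local disk, `os.replace` would merge steps 2 and 3 into one call; they are kept separate here to mirror the order given in the comment.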
Fixes #1986
Description: The former "remove & write" implementation order is reversed to "write & remove", and an API call has been added for the ClearMLSaver remove functionality.
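As a rough sketch of the "write & remove" order (not the actual Checkpoint implementation), the example below keeps the last `n_saved` files and deletes the oldest one only after the new one has been written. `KeepLastN`, `save_fn` and `remove_fn` are hypothetical names used for illustration.

```python
from collections import deque


class KeepLastN:
    """Keeps at most `n_saved` checkpoints, writing before removing."""

    def __init__(self, save_fn, remove_fn, n_saved=2):
        self._save_fn = save_fn
        self._remove_fn = remove_fn
        self._n_saved = n_saved
        self._saved = deque()  # filenames, oldest first

    def __call__(self, checkpoint, filename):
        # Write first: if this raises, no existing checkpoint has been lost.
        self._save_fn(checkpoint, filename)
        self._saved.append(filename)
        # Remove only after the new file is safely stored.
        while len(self._saved) > self._n_saved:
            self._remove_fn(self._saved.popleft())
```

With this order there are briefly `n_saved + 1` files in storage, which is the trade-off for never being left without a valid checkpoint.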