-
Notifications
You must be signed in to change notification settings - Fork 360
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: fix handling of loading empty metadata
file for queue
#1042
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add some test that would capture the nature of this scenario.
@@ -51,7 +51,8 @@ async def persist_metadata_if_enabled(*, data: dict, entity_directory: str, writ | |||
|
|||
# Write the metadata to the file | |||
file_path = os.path.join(entity_directory, METADATA_FILENAME) | |||
f = await asyncio.to_thread(open, file_path, mode='wb') | |||
mode = 'r+b' if os.path.exists(file_path) else 'wb' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the file exists we can still open it with "wb" mode.If we open it with "r+b" mode, then we are not overwriting the whole file with new content but just changing the file starting from the beginning of the file. I doubt that is what we want.
Imagine file with content b"abc"
and you want to change it
with open(path, 'r+b') as f:
f.write(b"x")
-> b"xbc"
with open(path, 'wb') as f:
f.write(b"x")
-> b"x"
Maybe I have missed the point here, but so far it does not seem right to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I suppose it is because of wb
+ the fact that we work with files asynchronously that we have a situation where empty files are created.
We open the file in wb
mode, deleting the contents and switching the asynchronous context. If the crawler is interrupted at this point, the file remains empty.
I settled on r+b
mode because we always write formatted json
files. These are also metadata files, so the fields in them are not changed. So I think it should work, since we will be overwriting the same number of lines each time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explanation. I think I am getting it now. But for sure this would be great to have in well commented test as it is by no means self-explanatory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After testing and thinking about it, I decided to go back to wb
. It's quite a rare bug and it doesn't have a critical impact. But if we encounter some artifacts due to r+b
, it can be much more painful to find the cause.
I added a test to reproduce the case of creating an empty file. Maybe we can use it if we try to find a more stable solution for overwriting files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, LGTM.
Description
Issues