When training LLaVA 1.6, missing image files are sometimes unavoidable, so the dataset has this catching logic:
```python
def __getitem__(self, i) -> Dict[str, torch.Tensor]:
    attempt, max_attempt = 0, 10
    while attempt < max_attempt:
        try:
            # sample an item
            data_dict = self._sample_item(i)
            # if data_dict is not None:
            break
        except Exception as e:
            attempt += 1
            print(f"Error in loading {i}, retrying...")
            import traceback
            print(e)
            traceback.print_exc()
            i = random.randint(0, len(self.list_data_dict) - 1)
    return data_dict
```
This actually works fine when launching with `deepspeed train.py` or `torchrun train.py`.

But when using `accelerate launch`, especially with `auto_find_batch_size = True`, training hangs when a file is not found, even though the exception is caught.

I'm not sure how to make this work with Accelerate. I found that Accelerate uses less memory than DeepSpeed, so I'd like to keep using it.
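For reference, here is a minimal, self-contained sketch of the same retry pattern, runnable outside the training stack. `RetryDataset`, `_sample_item`, and the simulated `broken_indices` are hypothetical stand-ins for the real LLaVA dataset, which loads images and tokenized conversations. The sketch also initializes `data_dict = None` up front, since the snippet above can raise `UnboundLocalError` if every attempt fails.

```python
import random
from typing import Dict, Optional

class RetryDataset:
    """Minimal stand-in for the LLaVA dataset's retry-on-failure logic.

    `broken_indices` simulates samples whose image file is missing.
    """

    def __init__(self, items, broken_indices=()):
        self.list_data_dict = list(items)
        self.broken = set(broken_indices)

    def _sample_item(self, i) -> Dict:
        # Hypothetical loader: raises the way a real missing-image read would.
        if i in self.broken:
            raise FileNotFoundError(f"image for sample {i} is missing")
        return {"index": i, "value": self.list_data_dict[i]}

    def __getitem__(self, i) -> Optional[Dict]:
        data_dict = None  # avoid UnboundLocalError if every attempt fails
        attempt, max_attempt = 0, 10
        while attempt < max_attempt:
            try:
                data_dict = self._sample_item(i)
                break
            except Exception as e:
                attempt += 1
                print(f"Error in loading {i}, retrying... ({e})")
                # Fall back to a random other sample, as in the original.
                i = random.randint(0, len(self.list_data_dict) - 1)
        return data_dict

# Usage: index 2 always fails, so we get some other sample (or None) back.
random.seed(0)
ds = RetryDataset(["a", "b", "c", "d"], broken_indices={2})
item = ds[2]
```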
This is more of a programming issue on your end than an issue with Accelerate.