accelerate launch hangs when the training-loop data loader hits a file-not-found error (even though it is caught) #3375

lucasjinreal · 2025-02-02T08:59:22Z

When training LLaVA 1.6, some image files are inevitably missing, so the dataset has this catch-and-retry logic:

def __getitem__(self, i) -> Dict[str, torch.Tensor]:
        attempt, max_attempt = 0, 10
        while attempt < max_attempt:
            try:
                # sample an item
                data_dict = self._sample_item(i)
                # if data_dict is not None:
                break
            except Exception as e:
                attempt += 1
                print(f"Error in loading {i}, retrying...")
                import traceback

                print(e)
                traceback.print_exc()
                i = random.randint(0, len(self.list_data_dict) - 1)
        return data_dict

This works fine when launching with deepspeed train.py or torchrun train.py.

But with accelerate launch, especially with auto_find_batch_size = True, training hangs when a file is not found, even though the exception is caught.

I'm not sure how to make this work with accelerate. I found that accelerate uses less memory than deepspeed, so I have to use it.


KeshavSingh29 · 2025-02-06T05:43:41Z

This is more of a programming issue on your end than an issue with accelerate.
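One thing that can be said for certain about the posted snippet: if all 10 attempts fail, data_dict is never assigned, so the final return raises UnboundLocalError instead of reporting the real failure. Whether that is connected to the hang is not established in this thread. Below is a minimal sketch of the same retry idea that fails loudly once the attempts are exhausted. The RetryingDataset class, its loader argument, and the choice to catch only OSError and KeyError are assumptions made for illustration; they are not LLaVA's or accelerate's code.

```python
import random
import traceback
from typing import Any, Callable, Dict, List


class RetryingDataset:
    """Hypothetical stand-in for the dataset in the issue (not LLaVA's class)."""

    def __init__(
        self,
        list_data_dict: List[dict],
        loader: Callable[[dict], Dict[str, Any]],  # assumed per-record loader
        max_attempts: int = 10,
    ):
        self.list_data_dict = list_data_dict
        self._loader = loader
        self.max_attempts = max_attempts

    def __len__(self) -> int:
        return len(self.list_data_dict)

    def _sample_item(self, i: int) -> Dict[str, Any]:
        return self._loader(self.list_data_dict[i])

    def __getitem__(self, i: int) -> Dict[str, Any]:
        last_error = None
        for _ in range(self.max_attempts):
            try:
                # Return as soon as one sample loads successfully.
                return self._sample_item(i)
            except (OSError, KeyError) as e:  # FileNotFoundError is an OSError
                last_error = e
                print(f"Error loading sample {i}: {e!r}; retrying with a random index")
                traceback.print_exc()
                i = random.randint(0, len(self.list_data_dict) - 1)
        # Every attempt failed: raise explicitly instead of returning an unbound name.
        raise RuntimeError(
            f"no loadable sample after {self.max_attempts} attempts"
        ) from last_error
```

Catching a narrower set of exceptions than a bare Exception keeps unrelated bugs visible, and the explicit RuntimeError preserves the original error as its cause, which makes it easier to see in the logs what failed in a multi-process run.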
