Accelerate launch hangs when the training-loop data loader hits a file-not-found error (even though the error is caught) #3375

Open
lucasjinreal opened this issue Feb 2, 2025 · 1 comment

Comments

@lucasjinreal

When training LLaVA 1.6, some image files are inevitably missing, so the dataset has this retry logic:

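import random
import traceback
from typing import Dict

import torch

def __getitem__(self, i) -> Dict[str, torch.Tensor]:
    # Retry with a random replacement index when a sample fails to load.
    max_attempt = 10
    for _ in range(max_attempt):
        try:
            # sample an item
            return self._sample_item(i)
        except Exception as e:
            print(f"Error in loading {i}, retrying...")
            print(e)
            traceback.print_exc()
            i = random.randint(0, len(self.list_data_dict) - 1)
    # Fail loudly if every attempt failed, rather than returning an unbound data_dict.
    raise RuntimeError(f"Failed to load a sample after {max_attempt} attempts")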

This works fine when launching with deepspeed train.py or torchrun train.py.

But with accelerate launch, especially with auto_find_batch_size = True, training hangs when a file is not found, even though the exception is caught.

I'm not sure how to make it work with accelerate. I found that accelerate uses less memory than deepspeed, so I have to use it.
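
For reference, the retry pattern above can be exercised on its own, without LLaVA, torch, or any launcher. The following is only a minimal sketch: `ToyDataset`, its `missing` set, and the `seed` argument are hypothetical stand-ins, not the real dataset class, and a raised `FileNotFoundError` simulates a missing image.

```python
import random


class ToyDataset:
    """Hypothetical stand-in for the LLaVA dataset; not the real class."""

    def __init__(self, n, missing, max_attempt=10, seed=0):
        self.list_data_dict = [{"id": k} for k in range(n)]
        self.missing = set(missing)  # indices whose "image file" is absent
        self.max_attempt = max_attempt
        self.rng = random.Random(seed)  # seeded so runs are repeatable

    def __len__(self):
        return len(self.list_data_dict)

    def _sample_item(self, i):
        # Simulate a missing image by raising, as a real loader would.
        if i in self.missing:
            raise FileNotFoundError(f"image for sample {i} not found")
        return self.list_data_dict[i]

    def __getitem__(self, i):
        # Same idea as the retry logic above: on failure, try a random index.
        for _ in range(self.max_attempt):
            try:
                return self._sample_item(i)
            except FileNotFoundError as e:
                print(f"Error in loading {i}: {e}, retrying...")
                i = self.rng.randint(0, len(self) - 1)
        raise RuntimeError(f"no loadable sample after {self.max_attempt} attempts")


ds = ToyDataset(n=100, missing={3, 7})
item = ds[3]  # index 3 is "missing", so a replacement sample is returned
print(item)
```

Running this outside any launcher helps confirm whether the retry loop itself terminates before looking at how the launcher is involved.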

@KeshavSingh29

This is more of a programming issue on your end than an issue with accelerate.
