Update Kaggle-Orpheus_(3B)-TTS.ipynb to fix issue #24 #27

Merged
merged 3 commits into from
Apr 26, 2025

Conversation

rupaut98
Contributor

The training script failed with a ValueError because the default data collator cannot handle the variable sequence lengths inherent in the dataset (text and audio lengths differ between examples). This PR fixes issue #24 by explicitly passing a DataCollatorForSeq2Seq to the Trainer. The collator pads the sequences in each batch to the longest sequence in that batch, using the tokenizer's pad token for input IDs, 0 for attention masks, and -100 for labels (so that padded positions are ignored in the loss calculation).
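The padding behavior described above can be sketched in plain Python. This is an illustrative re-implementation, not the transformers source: `pad_batch`, the toy token IDs, and `pad_token_id=0` are assumptions made up for the example, and it assumes labels have the same length as the input IDs.

```python
# Illustrative sketch of per-batch (dynamic) padding; not the library's code.
LABEL_PAD_ID = -100  # ignored by PyTorch's cross-entropy loss by default


def pad_batch(features, pad_token_id):
    """Pad every example to the longest input_ids length in this batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        gap = max_len - len(f["input_ids"])
        # Inputs get the pad token, masks get 0, labels get -100.
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * gap)
        batch["attention_mask"].append(f["attention_mask"] + [0] * gap)
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * gap)
    return batch


# Two examples of different lengths, standing in for text/audio token sequences.
features = [
    {"input_ids": [5, 6, 7], "attention_mask": [1, 1, 1], "labels": [5, 6, 7]},
    {"input_ids": [8, 9], "attention_mask": [1, 1], "labels": [8, 9]},
]
batch = pad_batch(features, pad_token_id=0)  # pad id 0 is a made-up value
print(batch["input_ids"])
print(batch["attention_mask"])
print(batch["labels"])
```

Without this step the two examples have lengths 3 and 2, so they cannot be stacked into one rectangular tensor, which is the kind of failure the ValueError points to.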

@darkacorn

Must be Kaggle-specific. We're looking into it.

@rupaut98
Contributor Author

rupaut98 commented Apr 6, 2025

@darkacorn thank you!

@Etherll
Contributor

Etherll commented Apr 26, 2025

Hey, can you remove the comma after DataCollatorForSeq2Seq()? It causes issues.

data_collator = DataCollatorForSeq2Seq(
...
), # <-------
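For context on why that comma matters: in Python, a trailing comma after an assignment's right-hand side turns the value into a one-element tuple. A minimal sketch follows, using a hypothetical `make_collator` stand-in (not the real DataCollatorForSeq2Seq) that returns a callable, as a data collator would be:

```python
def make_collator(**kwargs):
    """Hypothetical stand-in for a collator constructor; returns a callable."""
    def collate(features):
        return features
    return collate


# Trailing comma: the name is bound to a 1-tuple wrapping the collator.
with_comma = make_collator(padding=True),
# No trailing comma: the name is bound to the collator itself.
without_comma = make_collator(padding=True)

print(type(with_comma).__name__)
print(callable(with_comma))
print(callable(without_comma))
```

A tuple cannot be called on a list of features, so anything that expects a callable collator is likely to fail when handed `with_comma`.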

@rupaut98
Contributor Author

@Etherll Didn't realize I had that extra "," there. Just fixed it! Thank you!

@shimmyshimmer
Contributor

Thanks guys!

@shimmyshimmer shimmyshimmer merged commit b29c5fb into unslothai:main Apr 26, 2025

4 participants