Expected format of rcm_lang_tagged.txt #3
Comments
The rcm_lang_tagged.txt file still seems to be missing.
It's not missing; that is the file containing Real Code-Mixed (RCM) data, which the user has to provide. For our internal experiments we used a small dataset from Twitter, which we can't release. You can take any real-world code-mixed data, save it in a file with this name, and run the entire pipeline.
There is no specific format required for this file. It should just contain code-mixed sentences, one per line, like a regular .txt file.
You can also give this file any name of your choice and update the corresponding parameter in the config file.
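As a rough illustration of the one-sentence-per-line layout described above, here is a minimal Python sketch. The helper names `write_rcm_file` and `read_rcm_file` are hypothetical and not part of this project, and the two Hindi-English sentences are made-up placeholders rather than data from any real dataset.

```python
import tempfile
from pathlib import Path


def write_rcm_file(sentences, path="rcm_lang_tagged.txt"):
    """Write one sentence per line as UTF-8, skipping empty entries.

    Hypothetical helper: the project itself only expects a plain text file.
    """
    # Collapse internal whitespace (including stray newlines) so each
    # sentence occupies exactly one line.
    cleaned = [" ".join(s.split()) for s in sentences]
    cleaned = [s for s in cleaned if s]
    Path(path).write_text("".join(s + "\n" for s in cleaned), encoding="utf-8")
    return len(cleaned)


def read_rcm_file(path="rcm_lang_tagged.txt"):
    """Read the file back as a list of non-empty, stripped lines."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    # Made-up placeholder sentences, for illustration only.
    samples = [
        "yeh movie bahut achhi thi, must watch",
        "kal office jaana hai but mood nahi hai",
    ]
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "rcm_lang_tagged.txt"
        count = write_rcm_file(samples, target)
        print(count, read_rcm_file(target))
```

If you save the file under a different name, the config parameter mentioned above would need to point at that name instead.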
Thank you @mohdsanadzakirizvi
Hi,
I am exploring this project and would like to use SPF as the sampling method. I found that the rcm_lang_tagged.txt file is not present in the project. I understand that this file should contain a real code-mixed dataset for the language pair. I would like to know what format this data file should have. Please provide a sample rcm_lang_tagged.txt file of Hindi-English code-mixed data.
Thanks