Expected format of rcm_lang_tagged.txt #3

Closed
maheshmylavarapu0057 opened this issue Jun 30, 2021 · 5 comments
Comments

@maheshmylavarapu0057

Hi,
I am exploring this project and would like to use SPF as the sampling method. I found that the rcm_lang_tagged.txt file is not present in the repository. I understand that this file should contain a real code-mixed dataset for the language pair. What format should this data file have? Could you please provide a sample rcm_lang_tagged.txt file with Hindi-English code-mixed data?
Thanks

@AmirHussein96

The rcm_lang_tagged.txt still seems to be missing.

@mohdsanadzakirizvi
Contributor

It's not missing. That file contains the Real Code-Mixed (RCM) data, which the user has to provide. In our internal experiments we used a small dataset from Twitter, which we can't release. You can take any real-world code-mixed data, save it in a file with this name, and run the entire pipeline.

@mohdsanadzakirizvi
Contributor

No specific format is required for this file. It should just contain code-mixed sentences, one per line, like a regular txt file.
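For anyone preparing such a file, here is a minimal sketch of the one-sentence-per-line layout described above. The helper names `write_rcm_file` and `read_rcm_file` are hypothetical and not part of this project, and UTF-8 encoding is an assumption rather than a stated requirement.

```python
from pathlib import Path


def write_rcm_file(sentences, path="rcm_lang_tagged.txt"):
    """Write code-mixed sentences to a plain text file, one per line.

    Hypothetical helper, not part of the project. Whitespace inside each
    sentence (including stray newlines) is collapsed to single spaces so
    that every sentence stays on exactly one line; empty entries are dropped.
    Returns the number of sentences written.
    """
    cleaned = [" ".join(s.split()) for s in sentences]
    cleaned = [s for s in cleaned if s]
    # UTF-8 is an assumption; it also covers Devanagari or other scripts.
    Path(path).write_text("".join(s + "\n" for s in cleaned), encoding="utf-8")
    return len(cleaned)


def read_rcm_file(path="rcm_lang_tagged.txt"):
    """Read the file back as a list of non-empty, stripped lines."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Calling `write_rcm_file(my_sentences)` with your own list of real code-mixed sentences produces the file under the default name, and `read_rcm_file()` can be used to check that each sentence ended up on its own line.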

@mohdsanadzakirizvi
Contributor

You can also give this file any name of your choice and update the parameter in the config file.

@AmirHussein96

Thank you @mohdsanadzakirizvi
