Tajik language contains large chunks of Uzbek sentences in Cyrillic script. #6

Open
Muhtasham opened this issue Oct 5, 2021 · 0 comments
Labels
lang:tg Language: Tajik ver:21.09 Version: OSCAR 21.09

Comments

@Muhtasham
Hello there,

Since OSCAR relies on the fastText language classifier, which was trained on Wikipedia, the datasets also contain sentences in other languages. For instance, the Tajik file (tg.txt) contains large chunks of Uzbek sentences in Cyrillic script.

For example:
File: tg.txt Line: 660247: Маҳаллий расмийлар агар у пахта теримига чиқмаса, набираси учун тўланадиган нафақадан маҳрум этилиши билан таҳдид қилишгани.
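One rough way to spot such lines without a classifier is to look at the letters themselves. The sketch below assumes that ў belongs to the Uzbek Cyrillic alphabet but not the Tajik one, and that ӣ, ӯ and ҷ belong to the Tajik alphabet but not the Uzbek Cyrillic one. The function name `script_hint` and the two letter sets are hypothetical and only illustrate the idea; this is a heuristic, not a replacement for proper language identification.

```python
import unicodedata

# Letter sets used as a rough hint (an assumption of this sketch,
# not an exhaustive linguistic rule).
UZBEK_ONLY = set("\u045e\u040e")                      # ў Ў
TAJIK_ONLY = set("\u04e3\u04e2\u04ef\u04ee\u04b7\u04b6")  # ӣ Ӣ ӯ Ӯ ҷ Ҷ


def script_hint(line: str) -> str:
    """Return 'uz', 'tg', or 'unknown' based on language-specific letters."""
    # Compose base letter + combining mark into single code points first.
    text = unicodedata.normalize("NFC", line)
    uz = sum(ch in UZBEK_ONLY for ch in text)
    tg = sum(ch in TAJIK_ONLY for ch in text)
    if uz > tg:
        return "uz"
    if tg > uz:
        return "tg"
    return "unknown"


# The sentence quoted above from tg.txt
sentence = "Маҳаллий расмийлар агар у пахта теримига чиқмаса, набираси учун тўланадиган нафақадан маҳрум этилиши билан таҳдид қилишгани."
print(script_hint(sentence))
```

Lines without any of these letters come back as 'unknown', so a check like this can only flag candidates for closer inspection.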

If you do a simple check using fastText:

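import fasttext

# Load the pretrained fastText language-identification model (compressed variant).
model = fasttext.load_model('lid.176.ftz')
# Request the two most likely language labels for the sentence.
print(model.predict('Маҳаллий расмийлар агар у пахта теримига чиқмаса, набираси учун тўланадиган нафақадан маҳрум этилиши билан таҳдид қилишгани.', k=2))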

Output will be

#(('__label__tg', '__label__bg'), array([0.38605371, 0.14384778]))

The classifier labels the sentence as Tajik, with a probability of only about 0.39, but it is in fact Uzbek. A "nutritional table" should therefore be created on the website to warn people about such issues in the datasets.
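The low top probability in the output quoted above suggests one possible mitigation: treat a prediction as unreliable when its probability falls below a cutoff. The following is a minimal sketch of that idea, not what OSCAR actually does. The helper `confident_label` and the 0.8 threshold are hypothetical; the function takes the `(labels, probabilities)` pair in the shape that `model.predict` printed above.

```python
def confident_label(labels, probs, min_prob=0.8):
    """Return the top language code if its probability reaches min_prob, else None.

    min_prob=0.8 is an arbitrary illustrative cutoff, not a recommended value.
    """
    if len(labels) == 0 or float(probs[0]) < min_prob:
        return None
    # fastText labels look like '__label__tg'; strip the prefix.
    return labels[0].replace("__label__", "", 1)


# Values copied from the output quoted in this issue
labels = ("__label__tg", "__label__bg")
probs = [0.38605371, 0.14384778]
print(confident_label(labels, probs))  # None, since 0.386 is below the 0.8 cutoff
```

A filter like this would drop the example sentence from tg.txt instead of keeping it as Tajik, at the cost of also dropping genuine Tajik lines that the model is unsure about.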

@Uinelj transferred this issue from oscar-project/oscar-website Nov 2, 2021
@Uinelj changed the title from "Create a "nutritional table" on website warning people about the issues in datasets" to "Tajik language contains large chunks of Uzbek sentences in Cyrillic script." Nov 2, 2021
@Uinelj added the lang:tg (Language: Tajik) and ver:21.09 (Version: OSCAR 21.09) labels Nov 2, 2021
@Uinelj added this to OSCAR Feb 10, 2022
@Uinelj mentioned this issue Sep 7, 2022
6 tasks