Because OSCAR is limited by the fastText language classifier, which was trained on Wikipedia, its datasets also contain sentences in other languages. For instance, the Tajik file (tg.txt) contains large chunks of Uzbek sentences in Cyrillic script.
For example:
File: tg.txt Line: 660247: Маҳаллий расмийлар агар у пахта теримига чиқмаса, набираси учун тўланадиган нафақадан маҳрум этилиши билан таҳдид қилишгани.
If you run a simple check using fastText:
import fasttext
model = fasttext.load_model('lid.176.ftz')
print(model.predict('Маҳаллий расмийлар агар у пахта теримига чиқмаса, набираси учун тўланадиган нафақадан маҳрум этилиши билан таҳдид қилишгани.', k=2))
The output indicates that the sentence is Tajik, but it is in fact Uzbek. A "nutritional table" should therefore be created on the website to warn people about such issues.
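One rough way to find such lines without relying on the classifier is to look at the letters themselves. The sketch below is only a heuristic and is not part of OSCAR or fastText: it assumes that ў appears in Uzbek Cyrillic but not in Tajik, and that ӣ, ӯ and ҷ appear in Tajik but not in Uzbek. The function names `looks_uzbek` and `suspicious_lines` are made up for illustration.

```python
import unicodedata

# Hypothetical heuristic, not part of OSCAR or fastText.
# Assumption: U+045E/U+040E (short u) is used in Uzbek Cyrillic but not in Tajik.
UZBEK_ONLY = set("\u045e\u040e")
# Assumption: these letters are used in Tajik Cyrillic but not in Uzbek.
TAJIK_ONLY = set("\u04e3\u04e2\u04ef\u04ee\u04b7\u04b6")


def looks_uzbek(line: str) -> bool:
    """Return True if the line has an Uzbek-only letter and no Tajik-only letter."""
    # NFC composes base letter + combining mark into the single code points above.
    chars = set(unicodedata.normalize("NFC", line))
    return bool(chars & UZBEK_ONLY) and not (chars & TAJIK_ONLY)


def suspicious_lines(lines):
    """Yield (line_number, text) for every line that looks Uzbek rather than Tajik."""
    for number, line in enumerate(lines, start=1):
        if looks_uzbek(line):
            yield number, line.rstrip("\n")


# The sentence quoted above from tg.txt.
sample = (
    "Маҳаллий расмийлар агар у пахта теримига чиқмаса, набираси учун "
    "тўланадиган нафақадан маҳрум этилиши билан таҳдид қилишгани."
)
print(looks_uzbek(sample))
```

To scan the whole file, something like `for n, text in suspicious_lines(open("tg.txt", encoding="utf-8")): print(n, text)` should work. Short lines with no distinguishing letters will be missed, so this can only supplement a proper language identifier, not replace it.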
Uinelj transferred this issue from oscar-project/oscar-website on Nov 2, 2021
Uinelj changed the title from "Create a "nutritional table" on website warning people about the issues in datasets" to "Tajik language contains large chunks of Uzbek sentences in Cyrillic script." on Nov 2, 2021