add arabic vocabs and some modification for the detection model so the errors are more clear #1957
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -13,7 +13,7 @@ | |
| # Arabic & Persian | ||
| "arabic_diacritics": "ًٌٍَُِّْ", | ||
| "arabic_digits": "٠١٢٣٤٥٦٧٨٩", | ||
| "arabic_letters": "ءآأؤإئابةتثجحخدذرزسشصضطظعغـفقكلمنهوىي", | ||
| "arabic_letters": "- ء آ أ ؤ إ ئ ا ٪ ب ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ٰیٕ٪ ل م ن ه ة و ي پ چ ڢ ڤ گ ﻻ ﻷ ﻹ ﻵ ﺀ ﺁ ﺃ ﺅ ﺇ ﺉ ﺍ ﺏ ﺕ ﺙ ﺝ ﺡ ﺥ ﺩ ﺫ ﺭ ﺯ ﺱ ﺵ ﺹ ﺽ ﻁ ﻅ ﻉ ﻍ ﻑ ﻕ ﻙ ﻝ ﻡ ﻥ ﻩ ﻩ ﻭ ﻱ ﺑ ﺗ ﺛ ﺟ ﺣ ﺧ ﺳ ﺷ ﺻ ﺿ ﻃ ﻇ ﻋ ﻏ ﻓ ﻗ ﻛ ﻟ ﻣ ﻧ ﻫ ﻳ ﺒ ﺘ ﺜ ﺠ ﺤ ﺨ ﺴ ﺸ ﺼ ﺾ ﻄ ﻈ ﻌ ﻐ ﻔ ﻘ ﻜ ﻠ ﻤ ﻨ ﻬ ﻴ ﺎ ﺐ ﺖ ﺚ ﺞ ﺢ ﺦ ﺪ ﺬ ﺮ ﺰ ﺲ ﺶ ﺺ ﺾ ﻂ ﻆ ﻊ ﻎ ﻒ ﻖ ﻚ ﻞ ﻢ ﻦ ﻪ ﺔ ﺓﺋ ﺓﺋ ى ﻼوفرّٕ ﺊ ﻯ ﻀ ﻯ ﻼ ﺋ ﺊﺓى ﻀال ص ح x ـ ـوx ﻰ ﻮ ﻲ ً ٌ ؟ ؛ « » — ! # $ % & ' ( ) * + , - . / : ; < = > ? @ [ ] ^ _ { | } ~", | ||
| "arabic_punctuation": "؟؛«»—", | ||
| "persian_letters": "پچڢڤگ", | ||
| # Bangla | ||
|
|
@@ -786,7 +786,8 @@ | |
| VOCABS["multilingual"] = "".join( | ||
| dict.fromkeys( | ||
| # latin_based | ||
| VOCABS["english"] | ||
| VOCABS["arabic"] | ||
|
Collaborator
Let's revert this for the moment; we will add it once we have a multilingual dataset that includes Arabic 👍 |
||
| +VOCABS["english"] | ||
| + VOCABS["albanian"] | ||
| + VOCABS["afrikaans"] | ||
| + VOCABS["azerbaijani"] | ||
|
|
||
|
Collaborator
In general, it's a really good idea to add a sanity check 👍 But we need to rethink the implementation a bit, your current code fits only for the
Here we can add a boolean argument. This logic can be added as a private method to the class and called before polygon formatting. Afterwards, a test needs to be added here: doctr/tests/pytorch/test_datasets_pt.py Line 135 in b547085
and doctr/tests/tensorflow/test_datasets_tf.py Line 108 in b547085
Once these parts are done, we can add an extra arg to the detection training scripts
and correspondingly update the
val_set = DetectionDataset(
img_folder=os.path.join(args.val_path, "images"),
label_path=os.path.join(args.val_path, "labels.json"),
sanity_check=args.check_dataset,
.... |
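The wiring suggested in this comment could look roughly like the sketch below. Everything in it is an assumption made for illustration: `TinyDetectionDataset` is a hypothetical stand-in (not docTR's `DetectionDataset`), the label layout and the checks inside `_sanity_check` are guesses at what a sanity check might verify, and only the names `sanity_check` and `check_dataset` come from the snippet above.

```python
import argparse
from typing import Any


class TinyDetectionDataset:
    """Hypothetical stand-in for a detection dataset; not docTR's actual class."""

    def __init__(self, labels: dict[str, Any], sanity_check: bool = False) -> None:
        self.labels = labels
        # The check is opt-in, so existing callers keep their current behaviour
        if sanity_check:
            self._sanity_check()

    def _sanity_check(self) -> None:
        # Assumed label layout: {image_name: {"polygons": [[[x, y], ... 4 points], ...]}}
        for img_name, entry in self.labels.items():
            polygons = entry.get("polygons")
            if polygons is None:
                raise ValueError(f"{img_name}: missing 'polygons' key")
            for idx, poly in enumerate(polygons):
                if len(poly) != 4 or any(len(pt) != 2 for pt in poly):
                    raise ValueError(f"{img_name}: polygon {idx} is not 4 points of (x, y)")


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Flag name chosen to match `args.check_dataset` in the snippet above
    parser.add_argument("--check_dataset", action="store_true", help="validate labels before training")
    return parser.parse_args(argv)
```

With this shape, a training script would pass `sanity_check=args.check_dataset` when building the dataset, and a malformed label fails early with a message naming the image.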
I would suggest only adding any missing characters to the existing `arabic_letters`, and any additional Arabic-specific punctuation to `arabic_punctuation`, because the `arabic` entry already includes western punctuation :)
Additionally, it should not include whitespace: our models can't work well with whitespace, so please remove it. If we want to make it more readable then:
Hi,
Thanks for your feedback!
Just to clarify: in Arabic, letters change shape depending on their position in the word (beginning, middle, or end).
The characters I included cover all these contextual forms, which makes them more suitable for training the model accurately.
Also, the whitespaces between characters are not meant for natural spacing but are used intentionally to differentiate between the different forms of each letter during training.
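For context on contextual forms: Unicode encodes the positional shapes as separate code points in the Arabic Presentation Forms blocks, and compatibility normalization (NFKC) folds them back to the base letters. The sketch below uses only Python's standard `unicodedata` module; the helper name `to_base_letters` is hypothetical, and whether the training labels contain presentation-form code points or base letters is an assumption that would need to be checked against the dataset.

```python
import unicodedata


def to_base_letters(text: str) -> str:
    """Fold Arabic presentation forms (and other compatibility characters) to base letters."""
    return unicodedata.normalize("NFKC", text)


# Final, initial, and medial forms of BEH (U+FE90, U+FE91, U+FE92)
for ch in ["\uFE90", "\uFE91", "\uFE92"]:
    base = to_base_letters(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> U+{ord(base):04X} {unicodedata.name(base)}")
```

If the labels only ever contain base letters, the presentation forms would add vocabulary entries that never occur in training; if they do contain presentation forms, normalizing the labels is one alternative to enlarging the vocabulary.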
Hm, understood.
Could we split this into vowels, consonants, and diacritics?
In the end each char needs to be unique, and whitespace is not allowed, as mentioned. To keep characters from visually merging we can use
Punctuation should be removed because it is added later to the `arabic` entry :)
If I merge both I get this:
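The merged string itself is not reproduced here. Separately, the constraints from this thread (no whitespace, each character unique, punctuation kept out of the letters entry) can be sketched with the same `dict.fromkeys` idiom the multilingual vocab uses. The helper name `clean_vocab` and the sample strings are hypothetical and only illustrate the idea.

```python
import string


def clean_vocab(raw: str, exclude: str = "") -> str:
    """Drop whitespace and excluded chars, then keep the first occurrence of each char."""
    kept = (ch for ch in raw if not ch.isspace() and ch not in exclude)
    # dict.fromkeys preserves insertion order, so the first occurrence wins
    return "".join(dict.fromkeys(kept))


# Illustrative input only: a few letters with spaces, one duplicate, and punctuation
raw = "ء آ أ ب ب ت ؟ ! ,"
punctuation = "؟؛" + string.punctuation  # sample subset, not the project's full list
print(clean_vocab(raw, exclude=punctuation))
```

Because order is preserved, the cleaned string stays diff-friendly: appending new characters later only changes the end of the entry.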