Comparison with DjVuSolo 3.1: I see a bold character where it shouldn't be. #14
Comments
Thanks, I'll check this out. What were the minidjvu-mod params? "-l"?
Yes
Hey, which version of minidjvu-mod did you use? I see no AT&T tag in the DjVu header. I think I've fixed this bug in a recent version...
Commit 94a78c5
I would expect that just counting the black pixels of the bold and non-bold characters would reveal a significant percentage difference.
I guess it's less than 10%. minidjvu-mod already counts pixels and uses a 10% difference threshold. That's a bitmap parameter called "mass" in the program. But the problem isn't finding differences. If it were, we would end up with lossless compression. The problem is finding the visually important differences, the ones that let the page keep its original look while it is compressed by substituting "equal" characters. The disadvantages of decreasing the "mass" threshold further are too big: narrowing its comparison threshold will significantly increase the file size. It's too simple a feature.
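To make the "mass" comparison above concrete, here is a minimal Python sketch. It is not minidjvu-mod's code: the function names, the list-of-rows bitmap representation, and the choice to measure the difference relative to the larger mass are all assumptions made for illustration. Only the idea of counting black pixels and the 10% threshold come from the comment above.

```python
def mass(bitmap):
    """Count black pixels in a bitmap given as rows of 0/1 values."""
    return sum(sum(row) for row in bitmap)


def mass_differs(a, b, threshold=0.10):
    """Return True when the two masses differ by more than `threshold`.

    The difference is taken relative to the larger mass; that
    normalisation is an assumption, not taken from minidjvu-mod.
    """
    ma, mb = mass(a), mass(b)
    bigger = max(ma, mb)
    if bigger == 0:
        return False  # two empty bitmaps are trivially "equal"
    return abs(ma - mb) / bigger > threshold


# A 1-pixel-wide stroke, the same stroke drawn 2 pixels wide,
# and the thin stroke with a single extra (noise) pixel.
thin = [[1, 0, 0]] * 10
thick = [[1, 1, 0]] * 10
noisy = [[1, 0, 0]] * 9 + [[1, 1, 0]]

print(mass_differs(thin, thick))  # doubling the stroke width is caught
print(mass_differs(thin, noisy))  # one stray pixel stays under 10%
```

On a real glyph a bold variant adds far less than a doubling of the stroke, which is why the difference can stay under the threshold.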
Browsing the articles you've suggested (https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.735.9226&rep=rep1&type=pdf), I've found a better image "feature": the ratio of the average height to the average width of the stripes. It looks like it works better than mass at distinguishing bold from non-bold, and it is easy to calculate. But during the tests I found that the classification bug I hoped was already fixed in the current version is still there. So now I'm working on it. I've probably found a bug. It will take days to fix it and fine-tune the encoding parameters again.
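One possible reading of the height-to-width feature, sketched in Python. The comment does not define "stripes", so this sketch assumes they are maximal runs of black pixels: column-wise runs give the heights and row-wise runs give the widths. The function names are hypothetical and this is not how minidjvu-mod is known to compute it.

```python
def run_lengths(lines):
    """Collect the lengths of all maximal runs of 1s in each line."""
    runs = []
    for line in lines:
        length = 0
        for px in line:
            if px:
                length += 1
            elif length:
                runs.append(length)
                length = 0
        if length:  # run that reaches the end of the line
            runs.append(length)
    return runs


def stripe_ratio(bitmap):
    """Average vertical run length divided by average horizontal run length."""
    widths = run_lengths(bitmap)         # row-wise runs
    heights = run_lengths(zip(*bitmap))  # column-wise runs
    if not widths:
        return 0.0  # empty bitmap: no runs in either direction
    avg_w = sum(widths) / len(widths)
    avg_h = sum(heights) / len(heights)
    return avg_h / avg_w


thin = [[0, 1, 0, 0]] * 8   # vertical stroke, 1 pixel wide
thick = [[0, 1, 1, 0]] * 8  # same stroke, 2 pixels wide

print(stripe_ratio(thin), stripe_ratio(thick))
```

Under this reading, thickening a vertical stroke widens the row-wise runs while the column-wise runs keep their height, so the ratio drops for the bold variant.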
Would it be an idea to collect a set of example files to somehow automate the testing and finetuning?
Does the algorithm compare individual mass percentages, or does it first compute some heuristic/statistic over them to pinpoint a dip? I could imagine that bold often comes in complete words, which would help when it's hard to decide for just one character. I could even imagine that OCR recognition of a word in a language could raise the confidence of recognizing all symbols in that word, including the dots and other little marks near the letters.
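The "pinpoint some dip" idea could look roughly like the Python sketch below: instead of comparing two glyphs against a fixed threshold, collect a measurement (mass, for instance) for all instances of one letter and split them at the largest relative gap. This only illustrates the question; nothing in the thread says minidjvu-mod works this way, and `split_at_dip` and `min_gap` are hypothetical names.

```python
def split_at_dip(values, min_gap=0.10):
    """Split values into (low, high) groups at the largest relative gap.

    Returns (all_values_sorted, []) when no gap reaches `min_gap`,
    i.e. when the values look like a single cluster.
    """
    ordered = sorted(values)
    best_i, best_gap = None, 0.0
    for i in range(1, len(ordered)):
        if ordered[i] <= 0:
            continue  # avoid dividing by zero on empty glyphs
        gap = (ordered[i] - ordered[i - 1]) / ordered[i]
        if gap > best_gap:
            best_i, best_gap = i, gap
    if best_i is None or best_gap < min_gap:
        return ordered, []
    return ordered[:best_i], ordered[best_i:]


# Made-up masses for several instances of the same letter on a page.
masses = [100, 102, 98, 131, 129, 101]
regular, bold = split_at_dip(masses)
print(regular, bold)
```

A per-word vote over such splits would be one way to use the observation that bold tends to cover whole words, but that step is left out here.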
Hi Alex,
I made a sample PBM with ScanTailor and compressed it with DjVuSolo 3.1 (bitonal, 300 dpi) and with minidjvu-mod --lossy for comparison.
Sample.zip
In the minidjvu-mod version you can see that a normal character has been switched for a bold one:
That switch didn't take place in the DjVuSolo 3.1 bitonal 300 dpi version, which still came out 3k smaller.
Some links on the subject of recognizing italic and bold, just from some googling:
https://www.researchgate.net/publication/235412971_Automatic_Text_Clustering_and_Classification_Based_on_Font_Geometrical_Characteristics
https://stackoverflow.com/questions/62947592/does-google-cloud-vision-api-detect-formatting-in-ocred-text-like-bold-italics
https://github.com/tesseract-ocr/tesseract/issues/1371
https://studylib.net/doc/18711914/detection-of-bold-italic-and-underline-fonts-for-hindi-ocr
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.735.9226&rep=rep1&type=pdf