Tesseract might output multiple HOCR bodies when a TIFF is multilayered #104

Open
DiegoPino opened this issue Jan 31, 2025 · 0 comments
Labels: bug (Something isn't working), External Bug (Not us, them), ocrhighlight, Post processor Plugins (The ones with a ->run() method), Solr Indexing (Putting things where they can be found)

What?

Never a dull day. Multi-layered TIFFs (and possibly pyramidal ones?) might be processed by Tesseract as a single file with two outputs.
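
One way to spot affected files before OCR might be to count the image directories (IFDs) in the TIFF. The sketch below is only an illustration under stated assumptions: it handles classic TIFF only (not BigTIFF), it assumes each top-level IFD can end up as a separate Tesseract output, and `count_tiff_ifds` is a hypothetical helper name, not something from this project.

```python
import struct


def count_tiff_ifds(data: bytes) -> int:
    """Count top-level IFDs (pages/layers) in a classic TIFF byte string.

    Hypothetical helper for illustration; BigTIFF is not handled.
    """
    if data[:2] == b"II":
        endian = "<"  # little-endian
    elif data[:2] == b"MM":
        endian = ">"  # big-endian
    else:
        raise ValueError("not a TIFF: bad byte-order mark")

    # Header: 2-byte magic number (42) and 4-byte offset of the first IFD.
    magic, offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF (BigTIFF uses magic 43)")

    count = 0
    seen = set()  # guard against circular IFD chains
    while offset != 0 and offset not in seen:
        seen.add(offset)
        # Each IFD: 2-byte entry count, 12 bytes per entry, 4-byte next offset.
        (n_entries,) = struct.unpack_from(endian + "H", data, offset)
        next_pos = offset + 2 + 12 * n_entries
        (offset,) = struct.unpack_from(endian + "I", data, next_pos)
        count += 1
    return count
```

A result greater than 1 would flag the file as a candidate for this bug. Whether pyramidal TIFFs trigger it the same way is still an open question here.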

The biggest issue with that is that the HOCR body will have duplicated HTML IDs, making the parser fail.
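
Until there is a fix, the duplicated IDs could at least be detected before the output reaches the parser. This is a minimal sketch, not the project's code: `duplicated_ids` and `_IdCollector` are hypothetical names, the sample markup is made up for illustration, and it only reports the problem rather than repairing it.

```python
from collections import Counter
from html.parser import HTMLParser


class _IdCollector(HTMLParser):
    """Collects every `id` attribute value seen in the markup."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags via handle_startendtag.
        for name, value in attrs:
            if name == "id" and value:
                self.ids.append(value)


def duplicated_ids(hocr: str) -> dict:
    """Return {id: occurrences} for every id used more than once."""
    collector = _IdCollector()
    collector.feed(hocr)
    collector.close()
    return {i: n for i, n in Counter(collector.ids).items() if n > 1}


# Illustrative input: two page bodies that reuse the same page id.
sample = """
<div class='ocr_page' id='page_1'><span class='ocrx_word' id='word_1_1'>a</span></div>
<div class='ocr_page' id='page_1'><span class='ocrx_word' id='word_1_2'>b</span></div>
"""
print(duplicated_ids(sample))
```

An empty result means the IDs are unique. Anything else could be logged and the file skipped or handled separately instead of letting the parser fail.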

I have no solution yet.

@DiegoPino added the bug, External Bug, ocrhighlight, Post processor Plugins, and Solr Indexing labels Jan 31, 2025
@DiegoPino added this to the 0.9.0 milestone Jan 31, 2025
@DiegoPino self-assigned this Jan 31, 2025