ocr path is wrong in 2 cases: image and pdf #1988

Open
tukao89 opened this issue Dec 9, 2024 · 2 comments
tukao89 commented Dec 9, 2024

Describe the bug

When I set the OCR path as below (ending with tesseract.exe), PDF files are processed correctly.

Job Settings

  ocr:
    enabled: true
    path: "C:\\Program Files\\Tesseract-OCR\\tesseract.exe"
    data_path: "C:\\Program Files\\Tesseract-OCR\\tessdata"
    pdf_strategy: "ocr_and_text"
    language: "vie+eng"
    output_type: "txt"

But it does not work for image files (for example, .png). So I changed the OCR path as below (removing tesseract.exe):

  ocr:
    enabled: true
    path: "C:\\Program Files\\Tesseract-OCR\\"
    data_path: "C:\\Program Files\\Tesseract-OCR\\tessdata"
    pdf_strategy: "ocr_and_text"
    language: "vie+eng"
    output_type: "txt"

==> This causes an error for PDF files.

Logs

2024-12-09 16:06:30,950 [ERROR] [test-ocr.pdf][\test-ocr.pdf] Unable to extract PDF content -> Unable to end a page -> I regret that I couldn't find an OCR parser to handle image/ocr-png.Please set the OCR_STRATEGY to NO_OCR or configure yourOCR parser correctly
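
The two configurations above differ only in whether `path` names the executable or the directory that contains it. As a minimal sketch (this is not FSCrawler's actual code; `OcrPathSketch`, `toInstallDir` and `toExecutable` are hypothetical names introduced for illustration), one way to accept both forms is to normalize the configured value once and derive the other form from it:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of FSCrawler: normalizes an OCR "path" setting
// that may point either at the executable or at its parent directory.
class OcrPathSketch {

    // Returns the directory assumed to contain the executable.
    static Path toInstallDir(String configured, String execName) {
        Path p = Paths.get(configured);
        if (Files.isDirectory(p)) {
            return p; // already an existing directory
        }
        Path last = p.getFileName();
        if (last != null && last.toString().equalsIgnoreCase(execName)) {
            Path parent = p.getParent();
            return parent != null ? parent : Paths.get("");
        }
        return p; // unknown shape: leave the value untouched
    }

    // Returns the full path to the executable, whichever form was configured.
    static Path toExecutable(String configured, String execName) {
        return toInstallDir(configured, execName).resolve(execName);
    }

    public static void main(String[] args) {
        String configured = args.length > 0 ? args[0] : "tesseract";
        String execName = args.length > 1 ? args[1] : "tesseract";
        System.out.println("dir  = " + toInstallDir(configured, execName));
        System.out.println("exec = " + toExecutable(configured, execName));
    }
}
```

With a helper like this, both settings shown above would resolve to the same directory and the same executable, so the image and PDF code paths could each be given the form they expect.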

Versions:

  • OS: Windows
  • Version 2.10
@tukao89 tukao89 added the check_for_bug Needs to be reproduced label Dec 9, 2024
@dadoonet dadoonet added bug For confirmed bugs and removed check_for_bug Needs to be reproduced labels Jan 17, 2025
@dadoonet dadoonet self-assigned this Jan 17, 2025
@dadoonet dadoonet added this to the 2.10 milestone Jan 17, 2025
@dadoonet (Owner)

This does look like a bug in how the parsers are configured. Let me try to reproduce it.

@dadoonet (Owner)

Maybe it's related to Windows or to the Tesseract version you are using. I'm not able to reproduce the problem yet. I added an integration test for it:

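// JDK imports used by this test (project, Log4j, JUnit, Hamcrest and JsonPath imports omitted)
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class FsCrawlerTestOcrIT extends AbstractFsCrawlerITCase {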
    private static final Logger logger = LogManager.getLogger();

    @Test
    public void test_ocr() throws Exception {
        String exec = "tesseract";
        Optional<Path> tessPath = Stream.of(System.getenv("PATH").split(Pattern.quote(File.pathSeparator)))
                .map(Paths::get)
                .filter(path -> Files.exists(path.resolve(exec)))
                .findFirst();
        assumeTrue("We need to have tesseract installed and present in path to run this test", tessPath.isPresent());
        Path tessDirPath = tessPath.get();
        Path tesseract = tessDirPath.resolve(exec);
        logger.info("Tesseract is installed at [{}]", tesseract);

        // Default behaviour
        {
            crawler = startCrawler();

            // We expect two documents to be indexed
            ESSearchResponse searchResponse = countTestHelper(new ESSearchRequest().withIndex(getCrawlerName()), 2L, null);

            // Every hit's extracted content should contain the string "words"
            for (ESSearchHit hit : searchResponse.getHits()) {
                assertThat(JsonPath.read(hit.getSource(), "$.content"), containsString("words"));
            }

            crawler.close();
            crawler = null;
        }

        // OCR path set to the tesseract executable itself
        {
            Fs fs = startCrawlerDefinition()
                    .setOcr(Ocr.builder()
                            .setEnabled(true)
                            .setPath(tesseract.toString())
                            .setPdfStrategy("ocr_and_text")
                            .setLanguage("vie+eng")
                            .setOutputType("txt")
                            .build())
                    .build();

            crawler = startCrawler(getCrawlerName(), fs, endCrawlerDefinition(getCrawlerName()), null);

            // We expect two documents to be indexed
            ESSearchResponse searchResponse = countTestHelper(new ESSearchRequest().withIndex(getCrawlerName()), 2L, null);

            // Every hit's extracted content should contain the string "words"
            for (ESSearchHit hit : searchResponse.getHits()) {
                assertThat(JsonPath.read(hit.getSource(), "$.content"), containsString("words"));
            }

            crawler.close();
            crawler = null;
        }

        // OCR path set to the directory that contains the tesseract executable
        {
            Fs fs = startCrawlerDefinition()
                    .setOcr(Ocr.builder()
                            .setEnabled(true)
                            .setPath(tessDirPath.toString())
                            .setPdfStrategy("ocr_and_text")
                            .setLanguage("vie+eng")
                            .setOutputType("txt")
                            .build())
                    .build();

            crawler = startCrawler(getCrawlerName(), fs, endCrawlerDefinition(getCrawlerName()), null);

            // We expect two documents to be indexed
            ESSearchResponse searchResponse = countTestHelper(new ESSearchRequest().withIndex(getCrawlerName()), 2L, null);

            // Every hit's extracted content should contain the string "words"
            for (ESSearchHit hit : searchResponse.getHits()) {
                assertThat(JsonPath.read(hit.getSource(), "$.content"), containsString("words"));
            }
        }
    }
}

I'm wondering what is happening here.
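
One detail that may matter when running this test on Windows: the lookup searches `PATH` for a file named exactly `tesseract`, while the reporter's install uses `tesseract.exe`, so the `assumeTrue` guard could skip the test there. The following is only a sketch of a lookup that tries both names; `TesseractLocator`, `candidateNames` and `findOnPath` are hypothetical names, not existing FSCrawler helpers.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical helper: finds the tesseract executable on PATH on any OS.
class TesseractLocator {

    // Executable names to try, based on the value of the "os.name" property.
    static List<String> candidateNames(String osName) {
        boolean windows = osName != null
                && osName.toLowerCase(Locale.ROOT).startsWith("windows");
        return windows
                ? Arrays.asList("tesseract.exe", "tesseract")
                : Arrays.asList("tesseract");
    }

    // Returns the first matching regular file found in the PATH-style string.
    static Optional<Path> findOnPath(String pathEnv, List<String> names) {
        if (pathEnv == null || pathEnv.isEmpty()) {
            return Optional.empty();
        }
        for (String entry : pathEnv.split(Pattern.quote(File.pathSeparator))) {
            if (entry.isEmpty()) {
                continue; // skip empty entries instead of resolving against the working directory
            }
            Path dir;
            try {
                dir = Paths.get(entry);
            } catch (InvalidPathException e) {
                continue; // ignore malformed PATH entries
            }
            for (String name : names) {
                Path candidate = dir.resolve(name);
                if (Files.isRegularFile(candidate)) {
                    return Optional.of(candidate);
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<Path> found = findOnPath(System.getenv("PATH"),
                candidateNames(System.getProperty("os.name")));
        System.out.println(found.map(Path::toString).orElse("tesseract not found on PATH"));
    }
}
```

In the test, the returned path would give the executable form and its parent would give the directory form, which are the two cases from the original report.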

dadoonet added a commit that referenced this issue Jan 30, 2025