ocr path is wrong in 2 cases: image and pdf #1988

tukao89 · 2024-12-09T09:14:00Z

Describe the bug

When I set the OCR path as below (ending with tesseract.exe), OCR works correctly for PDF files.

Job Settings

  ocr:
    enabled: true
    path: "C:\\Program Files\\Tesseract-OCR\\tesseract.exe"
    data_path: "C:\\Program Files\\Tesseract-OCR\\tessdata"
    pdf_strategy: "ocr_and_text"
    language: "vie+eng"
    output_type: "txt"

But it does not work for image files (for example .png). So I changed the OCR path as below (removing tesseract.exe):

  ocr:
    enabled: true
    path: "C:\\Program Files\\Tesseract-OCR\\"
    data_path: "C:\\Program Files\\Tesseract-OCR\\tessdata"
    pdf_strategy: "ocr_and_text"
    language: "vie+eng"
    output_type: "txt"

==> With this setting, PDF files fail with an error.

Logs

2024-12-09 16:06:30,950 [ERROR] [test-ocr.pdf][\test-ocr.pdf] Unable to extract PDF content -> Unable to end a page -> I regret that I couldn't find an OCR parser to handle image/ocr-png.Please set the OCR_STRATEGY to NO_OCR or configure yourOCR parser correctly

Versions:

OS: Windows
Version 2.10
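
The two configurations above differ only in whether `path` names the executable or its directory. As a minimal sketch of how both forms could be reduced to the same directory, assuming a purely name-based check is acceptable: `OcrPathNormalizer` and `toDirectory` are hypothetical names for illustration, not FSCrawler's actual code, and a directory that is itself called `tesseract` would be misread by this heuristic.

```java
import java.util.Locale;

/** Hypothetical helper (not part of FSCrawler): accepts either form of ocr.path. */
public final class OcrPathNormalizer {
    private OcrPathNormalizer() {}

    /**
     * Returns the directory part of the configured path, whether it ends with
     * the tesseract executable name or already points at the directory.
     * Works on the raw string so Windows paths are handled on any platform.
     */
    public static String toDirectory(String configuredPath) {
        // Drop trailing separators so "C:\dir\" and "C:\dir" are treated alike.
        String trimmed = configuredPath.replaceAll("[/\\\\]+$", "");
        int lastSep = Math.max(trimmed.lastIndexOf('/'), trimmed.lastIndexOf('\\'));
        String lastSegment = trimmed.substring(lastSep + 1).toLowerCase(Locale.ROOT);
        // Name-based heuristic: only these two file names are treated as the executable.
        boolean isExecutable = lastSegment.equals("tesseract") || lastSegment.equals("tesseract.exe");
        return (isExecutable && lastSep >= 0) ? trimmed.substring(0, lastSep) : trimmed;
    }
}
```

With this sketch, both `path` values from the settings above would yield the same directory string, so a parser could be configured identically for images and PDFs.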

dadoonet · 2025-01-17T22:18:28Z

It looks like a bug indeed with the configuration of the parsers. Let me try to reproduce.

dadoonet · 2025-01-30T14:15:56Z

Maybe it's related to Windows or to the Tesseract version you are using. I'm not able to reproduce the problem yet. I added an integration test for it:

public class FsCrawlerTestOcrIT extends AbstractFsCrawlerITCase {
    private static final Logger logger = LogManager.getLogger();

    @Test
    public void test_ocr() throws Exception {
        String exec = "tesseract";
        Optional<Path> tessPath = Stream.of(System.getenv("PATH").split(Pattern.quote(File.pathSeparator)))
                .map(Paths::get)
                .filter(path -> Files.exists(path.resolve(exec)))
                .findFirst();
        assumeTrue("We need to have tesseract installed and present in path to run this test", tessPath.isPresent());
        Path tessDirPath = tessPath.get();
        Path tesseract = tessDirPath.resolve(exec);
        logger.info("Tesseract is installed at [{}]", tesseract);

        // Default behaviour
        {
            crawler = startCrawler();

            // We expect two documents to be indexed
            ESSearchResponse searchResponse = countTestHelper(new ESSearchRequest().withIndex(getCrawlerName()), 2L, null);

            // The extracted content of each document should contain "words"
            for (ESSearchHit hit : searchResponse.getHits()) {
                assertThat(JsonPath.read(hit.getSource(), "$.content"), containsString("words"));
            }

            crawler.close();
            crawler = null;
        }

        {
            Fs fs = startCrawlerDefinition()
                    .setOcr(Ocr.builder()
                            .setEnabled(true)
                            .setPath(tesseract.toString())
                            .setPdfStrategy("ocr_and_text")
                            .setLanguage("vie+eng")
                            .setOutputType("txt")
                            .build())
                    .build();

            crawler = startCrawler(getCrawlerName(), fs, endCrawlerDefinition(getCrawlerName()), null);

            // We expect two documents to be indexed
            ESSearchResponse searchResponse = countTestHelper(new ESSearchRequest().withIndex(getCrawlerName()), 2L, null);

            // With ocr.path set to the executable, the content should still contain "words"
            for (ESSearchHit hit : searchResponse.getHits()) {
                assertThat(JsonPath.read(hit.getSource(), "$.content"), containsString("words"));
            }

            crawler.close();
            crawler = null;
        }

        {
            Fs fs = startCrawlerDefinition()
                    .setOcr(Ocr.builder()
                            .setEnabled(true)
                            .setPath(tessDirPath.toString())
                            .setPdfStrategy("ocr_and_text")
                            .setLanguage("vie+eng")
                            .setOutputType("txt")
                            .build())
                    .build();

            crawler = startCrawler(getCrawlerName(), fs, endCrawlerDefinition(getCrawlerName()), null);

            // We expect two documents to be indexed
            ESSearchResponse searchResponse = countTestHelper(new ESSearchRequest().withIndex(getCrawlerName()), 2L, null);

            // With ocr.path set to the directory, the content should still contain "words"
            for (ESSearchHit hit : searchResponse.getHits()) {
                assertThat(JsonPath.read(hit.getSource(), "$.content"), containsString("words"));
            }
        }
    }
}

I'm wondering what is happening here.
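
One possible factor on Windows: the lookup in the test resolves the literal name `tesseract`, while the Windows executable is named `tesseract.exe`, so `Files.exists` may not match and `assumeTrue` would then skip the test rather than exercise the reported setup. A minimal sketch of a lookup that tries both names, assuming that is relevant here (`TesseractLocator` and `find` are hypothetical names, not part of FSCrawler):

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;
import java.util.stream.Stream;

/** Hypothetical helper (not part of FSCrawler) for locating tesseract on a PATH-style string. */
public final class TesseractLocator {
    // Candidate file names: plain name for Unix-like systems, .exe for Windows.
    private static final List<String> CANDIDATE_NAMES = List.of("tesseract", "tesseract.exe");

    private TesseractLocator() {}

    /** Returns the first existing tesseract executable found in the given PATH-style string. */
    public static Optional<Path> find(String pathVariable) {
        if (pathVariable == null || pathVariable.isEmpty()) {
            return Optional.empty();
        }
        return Stream.of(pathVariable.split(Pattern.quote(File.pathSeparator)))
                .filter(entry -> !entry.isEmpty())
                .map(Paths::get)
                // Try every candidate name inside each PATH directory.
                .flatMap(dir -> CANDIDATE_NAMES.stream().map(dir::resolve))
                .filter(Files::isRegularFile)
                .findFirst();
    }
}
```

In the test, `TesseractLocator.find(System.getenv("PATH"))` could stand in for the stream that computes `tessPath`; the returned path is the executable itself, so its parent would be the directory form of the setting.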

See #1988.

tukao89 added the check_for_bug Needs to be reproduced label Dec 9, 2024

dadoonet added bug For confirmed bugs and removed check_for_bug Needs to be reproduced labels Jan 17, 2025

dadoonet self-assigned this Jan 17, 2025

dadoonet added this to the 2.10 milestone Jan 17, 2025

dadoonet added a commit that referenced this issue Jan 30, 2025

Add test for OCR

1eeefa6

See #1988.

dadoonet mentioned this issue Jan 30, 2025

Add test for OCR #2006

Open
