ocr path is wrong in 2 cases: image and pdf #1988
Comments
This does look like a bug in the configuration of the parsers. Let me try to reproduce it.
Maybe it's related to Windows or to the Tesseract version you are using. I'm not able to reproduce the problem yet. I added an integration test for it:
public class FsCrawlerTestOcrIT extends AbstractFsCrawlerITCase {
    private static final Logger logger = LogManager.getLogger();

    @Test
    public void test_ocr() throws Exception {
        // Look for a file named "tesseract" in every directory listed in PATH
        String exec = "tesseract";
        Optional<Path> tessPath = Stream.of(System.getenv("PATH").split(Pattern.quote(File.pathSeparator)))
                .map(Paths::get)
                .filter(path -> Files.exists(path.resolve(exec)))
                .findFirst();
        assumeTrue("We need to have tesseract installed and present in path to run this test", tessPath.isPresent());
        Path tessDirPath = tessPath.get();
        Path tesseract = tessDirPath.resolve(exec);
        logger.info("Tesseract is installed at [{}]", tesseract);

        // Default behaviour
        {
            crawler = startCrawler();
            // We expect two documents to be indexed
            ESSearchResponse searchResponse = countTestHelper(new ESSearchRequest().withIndex(getCrawlerName()), 2L, null);
            // Each indexed document should contain the expected text
            for (ESSearchHit hit : searchResponse.getHits()) {
                assertThat(JsonPath.read(hit.getSource(), "$.content"), containsString("words"));
            }
            crawler.close();
            crawler = null;
        }

        // OCR path set to the full path of the tesseract executable
        {
            Fs fs = startCrawlerDefinition()
                    .setOcr(Ocr.builder()
                            .setEnabled(true)
                            .setPath(tesseract.toString())
                            .setPdfStrategy("ocr_and_text")
                            .setLanguage("vie+eng")
                            .setOutputType("txt")
                            .build())
                    .build();
            crawler = startCrawler(getCrawlerName(), fs, endCrawlerDefinition(getCrawlerName()), null);
            // We expect two documents to be indexed
            ESSearchResponse searchResponse = countTestHelper(new ESSearchRequest().withIndex(getCrawlerName()), 2L, null);
            // Each indexed document should contain the expected text
            for (ESSearchHit hit : searchResponse.getHits()) {
                assertThat(JsonPath.read(hit.getSource(), "$.content"), containsString("words"));
            }
            crawler.close();
            crawler = null;
        }

        // OCR path set to the directory that contains the tesseract executable
        {
            Fs fs = startCrawlerDefinition()
                    .setOcr(Ocr.builder()
                            .setEnabled(true)
                            .setPath(tessDirPath.toString())
                            .setPdfStrategy("ocr_and_text")
                            .setLanguage("vie+eng")
                            .setOutputType("txt")
                            .build())
                    .build();
            crawler = startCrawler(getCrawlerName(), fs, endCrawlerDefinition(getCrawlerName()), null);
            // We expect two documents to be indexed
            ESSearchResponse searchResponse = countTestHelper(new ESSearchRequest().withIndex(getCrawlerName()), 2L, null);
            // Each indexed document should contain the expected text
            for (ESSearchHit hit : searchResponse.getHits()) {
                assertThat(JsonPath.read(hit.getSource(), "$.content"), containsString("words"));
            }
        }
    }
}
I'm wondering what is happening here.
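One detail of the test above may matter for the Windows case: the PATH lookup checks for a file named exactly `tesseract`, so on a machine where the binary is `tesseract.exe` the lookup would find nothing and `assumeTrue` would skip the test. A minimal sketch of a lookup that accepts both names is shown below. The class `TesseractLocator` and the method `findTesseractDir` are hypothetical names introduced for illustration; they are not part of FSCrawler.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;
import java.util.stream.Stream;

// Hypothetical helper, not part of FSCrawler.
final class TesseractLocator {

    // Candidate file names; "tesseract.exe" covers Windows installs.
    static final List<String> CANDIDATES = List.of("tesseract", "tesseract.exe");

    // Returns the first directory of a PATH-style string that contains
    // one of the candidate names as a regular file.
    public static Optional<Path> findTesseractDir(String pathVariable) {
        if (pathVariable == null || pathVariable.isEmpty()) {
            return Optional.empty();
        }
        return Stream.of(pathVariable.split(Pattern.quote(File.pathSeparator)))
                .filter(entry -> !entry.isEmpty())
                .map(Paths::get)
                .filter(dir -> CANDIDATES.stream()
                        .anyMatch(name -> Files.isRegularFile(dir.resolve(name))))
                .findFirst();
    }

    public static void main(String[] args) {
        Optional<Path> dir = findTesseractDir(System.getenv("PATH"));
        System.out.println(dir.map(Path::toString).orElse("tesseract not found on PATH"));
    }
}
```

With something like this, the test could derive both the directory and the executable path on Windows as well, which is where the reporter sees the problem.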
Describe the bug
When I set the OCR path as below (ending with tesseract.exe), it works correctly for PDF files.
Job Settings
But it does not work for image files (for example, .png). So I changed the OCR path as below (removing tesseract.exe):
==> This causes an error for PDF files.
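To make the two path shapes described above concrete, here is a minimal sketch that derives both forms (the directory and the full executable path) from a single configured value. It is only an illustration under stated assumptions: `OcrPathShapes` and its methods are hypothetical names, not FSCrawler or Tika code, and the check is purely name-based, so a directory that is itself called `tesseract` would be misclassified.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

// Hypothetical helper, not part of FSCrawler or Tika.
final class OcrPathShapes {

    // True when the last path element looks like the Tesseract binary.
    public static boolean pointsToExecutable(Path configured) {
        Path fileName = configured.getFileName();
        if (fileName == null) {
            return false;
        }
        String name = fileName.toString().toLowerCase(Locale.ROOT);
        return name.equals("tesseract") || name.equals("tesseract.exe");
    }

    // Directory form: strips the executable name if it is present.
    public static Path toDirectory(Path configured) {
        if (pointsToExecutable(configured) && configured.getParent() != null) {
            return configured.getParent();
        }
        return configured;
    }

    // Executable form: appends the given executable name to a directory.
    public static Path toExecutable(Path configured, String executableName) {
        return pointsToExecutable(configured) ? configured : configured.resolve(executableName);
    }

    public static void main(String[] args) {
        Path configured = Paths.get(args.length > 0 ? args[0] : "tesseract");
        System.out.println("directory:  " + toDirectory(configured));
        System.out.println("executable: " + toExecutable(configured, "tesseract.exe"));
    }
}
```

If the PDF and image code paths really do expect different shapes, normalizing the setting once in this way would let either form of the configuration serve both.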
Logs
Versions: