Tika parser may not be parsing embedded documents #358

tballison · 2016-10-12T17:12:44Z

If calling Tika's parse() with 4 parameters (using the ParseContext), you need to add a Parser.class to the ParseContext to handle embedded documents.
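For illustration, here is a minimal sketch of that registration, assuming an `AutoDetectParser` and a `BodyContentHandler`. The class name `EmbeddedAwareParse` and the helpers `contextFor` and `extractText` are hypothetical names used only for this example; they are not StormCrawler or Tika code.

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical helper class, for illustration only.
public class EmbeddedAwareParse {

    // Build a ParseContext that tells Tika which parser to use for
    // embedded documents. Without this entry, the 4-parameter parse()
    // does not descend into attachments.
    static ParseContext contextFor(Parser parser) {
        ParseContext context = new ParseContext();
        context.set(Parser.class, parser);
        return context;
    }

    // Extract text from a stream, including text from embedded documents.
    static String extractText(InputStream stream)
            throws IOException, SAXException, TikaException {
        Parser parser = new AutoDetectParser();
        // -1 disables the default write limit on the handler.
        BodyContentHandler handler = new BodyContentHandler(-1);
        parser.parse(stream, handler, new Metadata(), contextFor(parser));
        return handler.toString();
    }
}
```

Registering the same parser instance that handles the container document is the usual choice, since it lets embedded files go through the same type detection as the outer file.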

See TIKA-2096 for a proposal to fix our API in Tika 2.0, and for a list of other projects (including Tika!) that fell victim to this.

I'd open a PR, but I can't quickly see how to test Tika here. For a test file, I'd recommend grabbing our test_recursive_embedded.docx.

jnioche · 2016-10-13T12:47:42Z

Thanks @tballison, I've implemented your suggestion and created a test environment for Tika, which will be useful for other things as well.

Just to check that I understand it properly: when dealing with embedded docs, is there a way to separate out each individual subdoc, or does it all get lumped together in a single object? StormCrawler can generate discrete subdocuments from an original one, so we could use that.

tballison · 2016-10-13T13:16:35Z

Well, now that you mention it, :), see #361. Let me know if you have any questions.

jnioche closed this as completed in 3a3a8ab Oct 13, 2016

jnioche added this to the 1.2 milestone Oct 13, 2016

jnioche added the parser label Oct 13, 2016
