Thanks @tballison, I've implemented your suggestion and created a test environment for Tika, which will be useful for other things as well.
Just to check that I understand it properly: when dealing with embedded docs, is there a way to separate each individual subdoc, or does it all get lumped into a single object? StormCrawler can generate discrete subdocuments from an original one, so we could use that.
If you call Tika's four-argument parse() (the one that takes a ParseContext), you need to register a parser under Parser.class in the ParseContext for embedded documents to be handled.
See TIKA-2096 for a proposal to fix our API in Tika 2.0, and for a list of other projects (including Tika!) that fell victim to this.
I'd open a PR, but I can't quickly see how to test Tika. I'd recommend grabbing our test_recursive_embedded.docx.
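For illustration, here is a minimal sketch of the ParseContext registration described above. The class and method names (`EmbeddedParseExample`, `contextFor`, `extractText`) are hypothetical and not part of this project or of Tika; only the Tika classes themselves are real. It assumes a Tika 1.x-style setup with the parser modules on the classpath.

```java
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, for illustration only.
public class EmbeddedParseExample {

    // Builds a ParseContext that tells Tika which parser to use
    // when it encounters an embedded document.
    static ParseContext contextFor(Parser parser) {
        ParseContext context = new ParseContext();
        context.set(Parser.class, parser);
        return context;
    }

    // Extracts text from a file using the four-argument parse() call.
    static String extractText(Path file) throws Exception {
        Parser parser = new AutoDetectParser();
        ParseContext context = contextFor(parser);
        // -1 disables the default write limit of BodyContentHandler.
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        try (InputStream in = Files.newInputStream(file)) {
            parser.parse(in, handler, metadata, context);
        }
        return handler.toString();
    }
}
```

One way to check the behavior would be to call `extractText` on a local copy of test_recursive_embedded.docx (the path is up to you) and compare the output with and without the `context.set(Parser.class, parser)` line.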