Tika parser may not be parsing embedded documents #358

Closed
tballison opened this issue Oct 12, 2016 · 2 comments

Comments

@tballison
Contributor

If you call Tika's four-argument parse() (the one that takes a ParseContext), you need to register a parser in the ParseContext under the Parser.class key, e.g. context.set(Parser.class, parser). Otherwise embedded documents are not parsed.
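To illustrate why the missing registration fails silently, here is a minimal sketch of the mechanism as I understand it. It is a stdlib-only toy model, not Tika's code: MiniParser, MiniContext, EMPTY and CONTAINER are hypothetical stand-ins for Tika's Parser, ParseContext, its do-nothing fallback parser and a container-format parser. The point is only that the container parser looks up its delegate in the context by type, and gets a parser that extracts nothing if none was registered.

```java
import java.util.HashMap;
import java.util.Map;

public class EmbeddedContextSketch {

    /** Hypothetical stand-in for a parser: returns the text extracted from a document. */
    interface MiniParser {
        String parse(String doc, MiniContext context);
    }

    /** Hypothetical stand-in for a type-keyed context object. */
    static class MiniContext {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key);
            return value == null ? defaultValue : key.cast(value);
        }
    }

    /** Fallback used when nothing was registered: it extracts no text at all. */
    static final MiniParser EMPTY = (doc, context) -> "";

    /** Container parser: emits its own text, then hands the embedded document
     *  to whatever parser the context holds. Only names starting with "outer"
     *  are treated as containers in this toy model. */
    static final MiniParser CONTAINER = new MiniParser() {
        @Override
        public String parse(String doc, MiniContext context) {
            String own = "text of " + doc;
            if (!doc.startsWith("outer")) {
                return own; // leaf document, nothing embedded
            }
            MiniParser delegate = context.get(MiniParser.class, EMPTY);
            String inner = delegate.parse("inner-of-" + doc, context);
            return inner.isEmpty() ? own : own + " + " + inner;
        }
    };

    /** No parser registered: the embedded document is silently dropped. */
    static String parseWithoutRegistration() {
        MiniContext context = new MiniContext();
        return CONTAINER.parse("outer.docx", context);
    }

    /** Parser registered under its type key: the embedded document is parsed too. */
    static String parseWithRegistration() {
        MiniContext context = new MiniContext();
        context.set(MiniParser.class, CONTAINER);
        return CONTAINER.parse("outer.docx", context);
    }

    public static void main(String[] args) {
        System.out.println(parseWithoutRegistration());
        System.out.println(parseWithRegistration());
    }
}
```

In real code the fix should be the single context.set(Parser.class, parser) call before parse(); no exception is thrown in the unregistered case, which is presumably why the problem is easy to miss.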

See TIKA-2096 for a proposal to fix our API in Tika 2.0, and for a list of other projects (including Tika!) that fell victim to this.

I'd open a PR, but I can't quickly see how to test the Tika parser here. For a test file, I'd recommend grabbing our test_recursive_embedded.docx.

@jnioche jnioche added this to the 1.2 milestone Oct 13, 2016
@jnioche jnioche added the parser label Oct 13, 2016
@jnioche
Contributor

jnioche commented Oct 13, 2016

Thanks @tballison, I've implemented your suggestion and created a test environment for Tika, which will be useful for other things as well.

Just to check that I understand this properly: when dealing with embedded docs, is there a way to separate out each individual subdocument, or does it all get lumped together into a single object? StormCrawler can generate discrete subdocuments from an original one, so we could make use of that.

@tballison
Contributor Author

Well, now that you mention it, :), see #361. Let me know if you have any questions.
