@yoavhhh yoavhhh commented Jun 17, 2025

Thanks for your contribution to Apache Tika! Your help is appreciated!

Before opening the pull request, please verify that

  • there is an open issue on the Tika issue tracker which describes the problem or the improvement. We cannot accept pull requests without an issue because the change wouldn't be listed in the release notes.
  • the issue ID (TIKA-XXXX)
    • is referenced in the title of the pull request
    • and placed in front of your commit messages surrounded by square brackets ([TIKA-XXXX] Issue or pull request title)
  • commits are squashed into a single one (or few commits for larger changes)
  • Tika is successfully built and unit tests pass by running mvn clean test
  • there should be no conflicts when merging the pull request branch into the recent main branch. If there are conflicts, please try to rebase the pull request branch on top of a freshly pulled main branch
  • if you add a new module that downstream users will depend upon, add it to the relevant group in tika-bom/pom.xml.

We will be able to integrate your pull request faster if these conditions are met. If you have any questions about how to fix your problem or about using Tika in general, please sign up for the Tika mailing list. Thanks!

@tballison
Contributor

Thank you for opening this. I'm not sure that it fits within the current design goals of Tika, but there may be ways forward.

I deeply respect sleuthkit and would be interested in pursuing whatever we can to work together.

I also may misunderstand your use case and this PR. Please bear with me.

My major concern is that Tika is intended to process individual files one at a time. Even with a single large docx or PDF, Tika can go out of memory.

If we treat an entire filesystem as a single file (obviously with embedded files), I think we're heading for serious problems.

There are two ways I could see some kind of integration point with Tika.

  1. Create a PipesIterator and fetchers so that Tika could iterate through NTFS or any other format handled by sleuthkit.

  2. Create a standardized "Unpackaging" API in Tika that would use sleuthkit command-line tool(s?) to extract binary files for further processing. I've seen lots of use cases where "unpackaging" is required rather than the usual parsing. This is typically a pre-parsing step needed to unpack a bundle of files that someone packaged for transfer. For example, this can be useful with zips, PSTs, mbox, etc.
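To make the second option a bit more concrete, here is a minimal sketch of the listing half of such an unpackaging step. Everything in it is hypothetical: `FlsListing` and `Entry` are illustrative names, not Tika APIs, and the line format it parses is an assumption about what sleuthkit's `fls -r -p <image>` prints (a type field, an optional `*` for deleted entries, a metadata address, then a tab and the path). The `icat <image> <address>` command shape is likewise assumed and should be checked against the installed sleuthkit version.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch only: none of these types exist in Tika.
class FlsListing {

    /** One allocated regular-file entry taken from an fls-style listing. */
    static final class Entry {
        final String metaAddress; // e.g. "66-128-2"
        final String path;        // e.g. "docs/report.docx"

        Entry(String metaAddress, String path) {
            this.metaAddress = metaAddress;
            this.path = path;
        }
    }

    // Assumed line shape: "r/r 66-128-2:<TAB>docs/report.docx".
    // Deleted entries are assumed to carry a "*" before the address.
    private static final Pattern LINE = Pattern.compile(
            "^(\\S+)\\s+(\\*\\s+)?([0-9][0-9-]*)(?:\\(realloc\\))?:\\s+(.+)$");

    /** Keeps allocated regular files; skips directories, deleted entries and unparseable lines. */
    static List<Entry> parse(List<String> lines) {
        List<Entry> out = new ArrayList<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (!m.matches()) {
                continue;
            }
            boolean regularFile = m.group(1).startsWith("r/");
            boolean deleted = m.group(2) != null;
            if (regularFile && !deleted) {
                out.add(new Entry(m.group(3), m.group(4)));
            }
        }
        return out;
    }

    /** Builds the (assumed) command line that would stream one entry's bytes to stdout. */
    static List<String> icatCommand(String imagePath, Entry entry) {
        return List.of("icat", imagePath, entry.metaAddress);
    }
}
```

Each extracted file could then be handed to Tika individually (for example through `ProcessBuilder` and a temp file), which keeps the one-file-at-a-time model instead of treating the whole image as a single container.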

@tballison
Contributor

tballison commented Jul 9, 2025

The above is all high-level. At a lower level, I'm concerned about platform-dependent binary code in Tika. We definitely have it in tika-parsers-extended (sqlite3) and in tika-parsers-ml. Another way to handle that is to require users to install the binaries on their system (or in Docker) first, as we do with tesseract (which is also why we opted not to integrate with tess4j).
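For the "install the binaries first" route, a parser would normally check that the external tool is present before trying to use it. The following is a rough sketch of such a check, not how Tika's tesseract integration actually does it; `BinaryLocator` and `findOnPath` are made-up names, and the sketch ignores Windows conventions such as `.exe` suffixes.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper, not part of Tika.
class BinaryLocator {

    /**
     * Looks for an executable called {@code name} in each directory of a
     * PATH-style string and returns the first match, if any.
     */
    static Optional<Path> findOnPath(String name, String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            Path candidate = Paths.get(dir, name);
            if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }
}
```

A caller might use `BinaryLocator.findOnPath("fls", System.getenv("PATH"))` and skip the sleuthkit-backed handling when the result is empty, so nothing platform-specific has to ship inside the Tika jars.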
