gracetyy
This PR adds a new script, PDF Image Extractor, which recursively scans a directory tree for PDF files and extracts all embedded images from each document.

  • By default, extracted images are saved to a subfolder named PDF inside the input root directory; use --out to choose a different location.
  • Each PDF gets its own folder containing all images extracted from that document.
  • The script supports an optional --dedup flag to enable per-PDF deduplication of images.
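
The layout and deduplication described above can be sketched roughly as follows. This is an illustration, not the code in this PR: the helper names (`find_pdfs`, `output_dir_for`, `dedup_images`) are hypothetical, naming each per-PDF folder after the file stem is an assumption, and hashing image bytes with SHA-256 is only one plausible way to implement --dedup. The image extraction step itself is left out because the description does not say which PDF library the script uses.

```python
import hashlib
from pathlib import Path


def find_pdfs(root):
    """Recursively collect PDF files under root (extension matched case-insensitively)."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def output_dir_for(pdf_path, root, out=None):
    """Return the folder that would hold one PDF's images.

    Defaults to <root>/PDF/<pdf stem>; `out` stands in for the --out option.
    Using the file stem as the folder name is an assumption.
    """
    out_root = Path(out) if out is not None else Path(root) / "PDF"
    return out_root / Path(pdf_path).stem


def dedup_images(blobs):
    """Drop byte-identical images within a single PDF, keeping first occurrences.

    Assumption: duplicates are detected by a SHA-256 digest of the raw bytes.
    """
    seen = set()
    unique = []
    for data in blobs:
        digest = hashlib.sha256(data).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(data)
    return unique
```

Because the set of seen digests is created per call, running `dedup_images` once per document keeps deduplication scoped to that PDF, matching the per-PDF behavior described for --dedup.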

Additional notes:

  • Please let me know if you’d like any changes to the folder naming or CLI options.
  • Happy to update documentation or add more examples if needed.
