feat: add pypdfium2 as optional PDF parser #298

Open
Arvuno wants to merge 12 commits into VectifyAI:main from Arvuno:add-pypdfium2-parser

Arvuno commented May 25, 2026

Summary

Add pypdfium2 as an optional PDF parser, providing 3-5x faster parsing with cleaner text extraction (no broken words, correct Unicode).

Changes

  • Add pypdfium2 as optional PDF parser (lazy-imported, not required)
  • Make PageIndexClient parser-agnostic; pdf_parser is now configurable per index() call
  • Move pdf_parser off the doc dict and pass it via call arguments instead
  • Centralize default parser as DEFAULT_PDF_PARSER constant
  • Keep pdf_parser default in code, not config.yaml

Testing

Default behavior unchanged. Users can opt in via pdf_parser="pypdfium2".

2 participants