This repository was archived by the owner on Jul 17, 2025. It is now read-only.

Commit 2578148

refactor: pdf extractor (#18)
* feat: Update langfuse dependency to version 3.0.0 and adjust related imports
  - Updated langfuse version in pyproject.toml and poetry.lock files.
  - Modified import statements in langfuse_ragas_evaluator.py to reflect new package structure.
  - Adjusted langfuse_manager.py to use labels instead of is_active for prompt management.
  - Refactored langfuse_traced_chain.py to utilize the new CallbackHandler import.
  - Enhanced traced_chain.py to initialize langfuse client and update tracing logic.
* Add comprehensive tests for PDFExtractor functionality
  - Introduced test suite for enhanced PDF extraction capabilities in `test_enhanced_pdfs.py`.
  - Created new test files for various PDF types including text-based, mixed content, and scanned documents.
  - Implemented detailed tests for PDFExtractor's classification, extraction, and linking functionalities in `test_pdf_extractorv2_new.py`.
  - Added quick functionality verification tests in `test_pdf_functionality.py` to ensure correct operation with real PDF files.
  - Established mock classes and fixtures to facilitate unit testing of PDF extraction methods.
* feat: Update dependencies and modify PDF extractor import
  - Added a new source for PyTorch and its related packages with CPU support in pyproject.toml.
  - Included additional dependencies: camelot-py, tabula, and easyocr.
  - Changed the import statement for PDFExtractor to use the new version (pdf_extractorv2) in dependency_container.py.
* feat: add pytest-asyncio support for asynchronous testing
* Refactor PDF extractor tests: remove old test files and implement comprehensive test suite for PDFExtractor class
  - Deleted outdated test files: test_pdf_extractorv2.py, test_pdf_extractorv2_new.py, and test_pdf_functionality.py.
  - Introduced a new comprehensive test suite for the PDFExtractor class, covering various functionalities including content extraction from different PDF types, error handling, and performance testing.
  - Added mock dependencies and fixtures to streamline testing processes.
  - Implemented tests for text extraction, table extraction, language detection, and related ID mapping.
  - Ensured compatibility with multiple PDF formats and validated metadata completeness in extracted content.
* refactor: Moved tests from test_pdf_extractor.py to pdf_extractor_test.py, ensuring comprehensive coverage and maintaining functionality. Removed old test file to streamline the testing structure.
* refactor: update flake8 exclusions and clean up PDFExtractor tests for improved readability and maintainability
* chore: add pdf files using git lfs
* refactor: update parameter names in PDFExtractor class for clarity and consistency; enhance test suite with additional logging and assertions
* chore: remove PyTorch and related dependencies from pyproject.toml
* refactor: remove unused text-based PDF document from test data
* chore: add sample PDF document for testing in extractor-api-lib
* refactor: remove unused test methods and main execution block from pdf_extractor_test.py
* chore: add pytest-asyncio as a development dependency
* Remove unused dependencies: tabula and easyocr from pyproject.toml
1 parent f0666c7 commit 2578148

11 files changed: +2755 -1083 lines changed

extractor-api-lib/.gitignore

Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@
 __pycache__/
 *.py[cod]
 *$py.class
+**/.DS_Store
 
 # C extensions
 *.so

extractor-api-lib/poetry.lock

Lines changed: 1059 additions & 1005 deletions
Some generated files are not rendered by default.

extractor-api-lib/pyproject.toml

Lines changed: 8 additions & 1 deletion
@@ -9,8 +9,13 @@ description = "Extracts the content of documents, websites, etc and maps it to a
 authors = ["STACKIT Data and AI Consulting <[email protected]>"]
 packages = [{ include = "extractor_api_lib", from = "src" }]
 
+[[tool.poetry.source]]
+name = "pytorch_cpu"
+url = "https://download.pytorch.org/whl/cpu"
+priority = "explicit"
+
 [tool.flake8]
-exclude = [".eggs", "./src/extractor_api_lib/models/*", ".git", ".hg", ".mypy_cache", ".tox", ".venv", ".devcontainer", "venv", "_build", "buck-out", "build", "dist", "**/__init__.py"]
+exclude = [".eggs", "./src/extractor_api_lib/models/*", ".git", ".hg", ".mypy_cache", ".tox", ".venv", ".devcontainer", "venv", "_build", "buck-out", "build", "dist", "**/__init__.py", "tests/test_data/generate_test_pdfs.py"]
 statistics = true
 show-source = false
 max-complexity = 10
@@ -93,10 +98,12 @@ langchain-community = "^0.3.23"
 atlassian-python-api = "^4.0.3"
 markdownify = "^1.1.0"
 langchain-core = "0.3.63"
+camelot-py = {extras = ["cv"], version = "^1.0.0"}
 fake-useragent = "^2.2.0"
 
 [tool.poetry.group.dev.dependencies]
 pytest = "^8.3.5"
+pytest-asyncio = "^0.26.0"
 coverage = "^7.8.0"
 flake8 = "^7.2.0"
 flake8-black = "^0.3.6"
