feat: convenience unstructured-get-json.sh update #3971

Merged: @cragwolfe merged 37 commits into main from crag/unstructured-get-json-update on Mar 31, 2025

Conversation

@cragwolfe (Contributor) commented Mar 28, 2025

  • script now supports:
    • the --vlm flag, to process the document with the VLM strategy
    • optional --vlm-model and --vlm-provider args
    • optionally writing .html outputs (--write-html) by converting the unstructured .json output
    • optionally opening those .html outputs in a browser (--open-html)
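
For readers who have not opened the script, the flag handling described above might look roughly like the following. This is a minimal sketch, not the actual unstructured-get-json.sh source: the function name `parse_args`, the variable names, and the mapping of flags to strategy strings are all assumptions made for illustration. Only the flag names themselves come from this PR.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the flag parsing; NOT the real script's code.
# Variable names and strategy strings below are assumptions.

parse_args() {
  STRATEGY=""          # set by --fast / --hi-res / --ocr-only / --vlm
  VLM_MODEL=""         # optional, set by --vlm-model
  VLM_PROVIDER=""      # optional, set by --vlm-provider
  WRITE_HTML=false     # set by --write-html
  OPEN_HTML=false      # set by --open-html
  INPUT_FILE=""        # the positional argument

  while [ "$#" -gt 0 ]; do
    case "$1" in
      --fast)       STRATEGY="fast" ;;
      --hi-res)     STRATEGY="hi_res" ;;
      --ocr-only)   STRATEGY="ocr_only" ;;
      --vlm)        STRATEGY="vlm" ;;
      --vlm-model)
        # Options that take a value consume the next argument.
        [ "$#" -ge 2 ] || { echo "--vlm-model needs a value" >&2; return 1; }
        VLM_MODEL="$2"
        shift ;;
      --vlm-provider)
        [ "$#" -ge 2 ] || { echo "--vlm-provider needs a value" >&2; return 1; }
        VLM_PROVIDER="$2"
        shift ;;
      --write-html) WRITE_HTML=true ;;
      --open-html)  OPEN_HTML=true ;;
      -*)           echo "unknown option: $1" >&2; return 1 ;;
      *)            INPUT_FILE="$1" ;;
    esac
    shift
  done

  # A document to process is required.
  if [ -z "$INPUT_FILE" ]; then
    echo "missing input file" >&2
    return 1
  fi
}
```

Calling `parse_args --vlm --vlm-provider openai --vlm-model gpt-4o layout-parser-paper-p2.pdf` in this sketch leaves `STRATEGY=vlm`, `VLM_PROVIDER=openai`, and `VLM_MODEL=gpt-4o`, with both HTML switches off.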

Tested with:

unstructured-get-json.sh --write-html --open-html --fast                                                                layout-parser-paper-p2.pdf
unstructured-get-json.sh --write-html --open-html --hi-res                                                              layout-parser-paper-p2.pdf
unstructured-get-json.sh --write-html --open-html --ocr-only                                                            layout-parser-paper-p2.pdf
unstructured-get-json.sh --write-html --open-html --vlm                                                                 layout-parser-paper-p2.pdf
unstructured-get-json.sh --write-html --open-html --vlm --vlm-provider openai    --vlm-model gpt-4o                     layout-parser-paper-p2.pdf
unstructured-get-json.sh --write-html --open-html --vlm --vlm-provider vertexai  --vlm-model gemini-2.0-flash-001       layout-parser-paper-p2.pdf
unstructured-get-json.sh --write-html --open-html --vlm --vlm-provider anthropic --vlm-model claude-3-5-sonnet-20241022 layout-parser-paper-p2.pdf

layout-parser-paper-p2.pdf

@cragwolfe changed the title from "feat: more permissive conversion to html, script updates" to "feat: convenience unstructured-get-json.sh update" on Mar 29, 2025
@cragwolfe merged commit 19fc1fc into main on Mar 31, 2025
43 checks passed
@cragwolfe deleted the crag/unstructured-get-json-update branch on March 31, 2025 16:45