[RHAIENG-1095] Add information extraction example notebook #28
Note: Other AI code review bot(s) detected
CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.
Walkthrough
Adds a new notebook that demonstrates end-to-end document information extraction with Docling: installation with VLM support, DocumentExtractor configuration (allowed formats), four template formats (JSON string, dict, Pydantic class, Pydantic instance), invoice extraction examples, result interpretation, validation, and best-practices guidance.
Sequence Diagram(s)
sequenceDiagram
autonumber
participant User
participant Notebook as Notebook (cells)
participant Docling as Docling Engine
participant VLM as VLM Support
participant Extractor as DocumentExtractor
participant Template as Template Formats
User->>Notebook: run cells
Notebook->>Docling: install & import
Notebook->>VLM: configure VLM support
Notebook->>Extractor: init (allowed formats)
User->>Notebook: define template
Notebook->>Template: JSON / dict / Pydantic class / instance
User->>Notebook: request extraction (invoice URL)
Notebook->>Extractor: submit document
Extractor->>Docling: parse & extract
Docling-->>Extractor: extracted fields
Extractor->>Template: map & validate
Template-->>Extractor: validated output
Extractor-->>Notebook: return results
Notebook-->>User: display results & guidance
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
f2517bc to c2edf36
Actionable comments posted: 3
🧹 Nitpick comments (4)
notebooks/use-cases/information-extraction.ipynb (4)
22-23: Pin or externalize Docling install for reproducibility.
Unpinned %pip install -q docling[vlm] can break silently as APIs change. Pin a version range or move deps to a requirements file used by CI images.
Option A (inline pin):
-%pip install -q docling[vlm] # Install the Docling package with VLM support
+%pip install -q "docling[vlm]>=X.Y,<X.(Y+1)" # pin minor series to avoid breaking changes
Option B (preferred): add notebooks/requirements-info-extraction.txt and use:
-%pip install -q docling[vlm]
+%pip install -q -r notebooks/requirements-info-extraction.txt
74-81: Stabilize external sample asset; add offline/CI fallback.
Embedding a remote PDF via iframe can fail (network/X-Frame-Options) and breaks offline CI. Download to a temp file and use the local path if fetch succeeds; otherwise keep the link.
import os, tempfile, urllib.request
invoice_url = "https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf"
local_invoice_path = None
try:
    fd, tmp = tempfile.mkstemp(suffix=".pdf"); os.close(fd)
    urllib.request.urlretrieve(invoice_url, tmp)
    local_invoice_path = tmp
except Exception:
    pass  # fallback to URL
invoice_input = local_invoice_path or invoice_url
display.HTML(f'<a href="{invoice_url}" target="_blank">Open invoice</a>')
Then use invoice_input below.
100-104: Extractor init: confirm API surface and resource notes.
Docling’s extractor APIs evolve; verify DocumentExtractor(...).extract(str_url, template=...) is supported in your pinned version. Consider adding a short note on VLM model size/compute needs to set expectations for users.
169-178: Pydantic defaults may mislead extraction; make required fields explicit.
total defaulting to 10 biases results; make required fields required and keep optional truly optional.
class Invoice(BaseModel):
-    invoice_number: str = Field(examples=["INV-001", "12345"])
-    total: float = Field(default=10, examples=[100.0, 250.50])
-    vendor_name: Optional[str] = Field(default=None, examples=["ACME Corp", "Tech Solutions Inc"])
+    invoice_number: str = Field(..., examples=["INV-001", "12345"])
+    total: float = Field(..., examples=[100.0, 250.50])
+    vendor_name: Optional[str] = Field(None, examples=["ACME Corp", "Tech Solutions Inc"])
If targeting Pydantic v2 only, you may prefer json_schema_extra={"examples":[...]} for forward-compat metadata.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
notebooks/use-cases/information-extraction.ipynb
(1 hunks)
🧰 Additional context used
🪛 GitHub Actions: Smoke Tests for Notebooks
notebooks/use-cases/information-extraction.ipynb
[error] 1-1: Notebook information-extraction.ipynb does not have any cell tagged with 'parameters'.
🔇 Additional comments (2)
notebooks/use-cases/information-extraction.ipynb (2)
127-129: Template type keywords: verify accepted schema names.
Confirm whether Docling expects "float" or "number" (and whether "integer"/nested objects are supported). Use consistent, supported type names across examples to avoid confusion.
Also applies to: 148-150
197-201: Good illustration of instance templates.
Using a model instance to set fallbacks is clear and useful.
"cell_type": "code",
"execution_count": null,
"id": "d58fb60e",
"metadata": {},
"outputs": [],
"source": [
"%pip install -q docling[vlm] # Install the Docling package with VLM support"
]
},
🧩 Analysis chain
CI blocker: add a 'parameters' cell (Papermill smoke test).
Insert a code cell at the top tagged with 'parameters' and move runtime-tunable values there (e.g., invoice_url, allowed formats). This unblocks the "Smoke Tests for Notebooks".
Example cell (ensure cell metadata has tags: ["parameters"]):
# Parameters
invoice_url = "https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf"
ALLOWED_FORMATS = ["PDF", "IMAGE"] # used below
Verification script:
🏁 Script executed:
#!/bin/bash
# Fails if no cell has the 'parameters' tag
jq -e '[.cells[].metadata.tags? // [] | index("parameters")] | any' notebooks/use-cases/information-extraction.ipynb >/dev/null || {
echo "Missing 'parameters' tag"; exit 1; }
Length of output: 24
🏁 Script executed:
#!/bin/bash
# Extract notebook structure: cells, metadata, kernelspec, and content summary
jq '{
kernelspec: .metadata.kernelspec,
cell_count: (.cells | length),
cells_summary: [.cells[] | {
type: .cell_type,
has_parameters_tag: ((.metadata.tags // []) | contains(["parameters"])),
tags: (.metadata.tags // []),
content_preview: (if .cell_type == "code" then (.source | join("") | .[0:100]) else "markdown" end)
}]
}' notebooks/use-cases/information-extraction.ipynb
Length of output: 3296
CI blocker confirmed: add a 'parameters' cell (Papermill smoke test).
Verification confirms no cell is tagged 'parameters'. Insert a code cell at the top with metadata tag 'parameters' containing runtime-tunable values (e.g., invoice_url, allowed formats). This unblocks the "Smoke Tests for Notebooks" pipeline.
Example cell (ensure metadata has tags: ["parameters"]):
# Parameters
invoice_url = "https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf"
ALLOWED_FORMATS = ["PDF", "IMAGE"]
🤖 Prompt for AI Agents
In notebooks/use-cases/information-extraction.ipynb around lines 16 to 24, the
notebook is missing a Papermill 'parameters' cell which blocks CI; add a new
code cell at the top of the notebook whose metadata includes tags:
["parameters"] and define runtime-tunable variables such as invoice_url and
ALLOWED_FORMATS (example values as in the review) so the Papermill smoke test
can execute; ensure the cell is the first code cell and the metadata tags field
exactly contains "parameters".
@fabianofranz I added # Parameters to line 74 but I'm still having some trouble passing the test. Is this the right intent of that check?
@alinaryan It checks for a code cell containing the parameters tag, set as specified here: https://papermill.readthedocs.io/en/latest/usage-parameterize.html. The intention is to make sure we have a cell that can be parameterized to run e.g. in CI in headless mode by papermill. ;)
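To make the difference between a # Parameters comment and the parameters tag concrete, here is a minimal sketch of such a check in Python. It is not the actual CI script (which isn't shown in this thread); the helper names has_parameters_cell and load_notebook are hypothetical, and the logic simply mirrors the jq verification above using only the standard library.

```python
import json


def has_parameters_cell(notebook: dict) -> bool:
    """Return True if any code cell carries the 'parameters' tag in its metadata."""
    for cell in notebook.get("cells", []):
        tags = cell.get("metadata", {}).get("tags", [])
        if cell.get("cell_type") == "code" and "parameters" in tags:
            return True
    return False


def load_notebook(path: str) -> dict:
    """Read an .ipynb file, which is plain JSON on disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Two in-memory cells with the same source. Only the second one is tagged:
# the '# Parameters' comment alone does not count.
untagged = {"cells": [{"cell_type": "code", "metadata": {},
                       "source": ["# Parameters\n"]}]}
tagged = {"cells": [{"cell_type": "code", "metadata": {"tags": ["parameters"]},
                     "source": ["# Parameters\n"]}]}

print(has_parameters_cell(untagged))
print(has_parameters_cell(tagged))
```

For the notebook in this PR, has_parameters_cell(load_notebook("notebooks/use-cases/information-extraction.ipynb")) would be the equivalent call, assuming it is run from the repository root.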
Got it! I think I added the correct tag this time, it's now passing :)
"from IPython import display\n",
"from pydantic import BaseModel, Field\n",
"from rich import print"
]
Avoid extra dependency on rich or ensure it’s installed.
from rich import print isn’t necessary here and may not exist in CI images.
-from rich import print
+# use built-in print to avoid extra dependency
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In notebooks/use-cases/information-extraction.ipynb around lines 42 to 45,
remove the unnecessary third-party import "from rich import print" (or replace
it with a safe optional import pattern) so the notebook doesn't rely on a
dependency that may not be present in CI; either delete that line and use the
built-in print throughout, or wrap the import in a try/except that falls back to
Python's print and update requirements only if you choose to keep rich.
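The optional-import pattern mentioned in the prompt above could look roughly like this. It is a sketch, not the change that landed in the PR; the alias pretty_print and the sample dict are made up for illustration.

```python
import builtins

try:
    # Use rich's print when the package happens to be installed.
    from rich import print as pretty_print
except ImportError:
    # Otherwise fall back to the built-in print, so no extra dependency is needed.
    pretty_print = builtins.print

# Illustrative value only; not output from the notebook.
pretty_print({"invoice_number": "INV-001", "total": 100.0})
```

Binding the result to a separate name, rather than shadowing print, keeps it obvious in later cells which function is being called.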
c2edf36 to 5dc3323
Actionable comments posted: 1
♻️ Duplicate comments (3)
notebooks/use-cases/information-extraction.ipynb (3)
42-45: Remove unnecessary rich dependency.
The import from rich import print on line 44 adds an unnecessary dependency that may not be available in CI environments. Use Python's built-in print instead.
Apply this diff:
from IPython import display
from pydantic import BaseModel, Field
-from rich import print
68-72: CI blocker: Add 'parameters' tag to cell metadata.
This cell has the # Parameters comment but the metadata lacks the required "tags": ["parameters"] entry. This blocks the Papermill smoke tests in CI.
Update the cell metadata to include:
"metadata": { "tags": ["parameters"] }
256-273: Align notebook Python version with CI environment.
The notebook metadata specifies Python 3.13.7 (line 272), but CI uses Python 3.12. Either update the version to "3.12" or remove the version field to prevent drift and potential compatibility issues.
Apply this diff to align with CI:
"pygments_lexer": "ipython3",
- "version": "3.13.7"
+ "version": "3.12"
🧹 Nitpick comments (2)
notebooks/use-cases/information-extraction.ipynb (2)
15-34: Reorder cells for better narrative flow.
The pip install cell (lines 16-24) executes before the markdown cell (lines 25-34) that explains the installation. Swap these cells so the explanation precedes the action.
172-175: Consider using 0.0 as the default for total.
Line 174 sets default=10 for the total field, which seems arbitrary. A default of 0.0 or no default (making it required) would be more intuitive for a monetary amount.
Apply this diff if you prefer a zero default:
- total: float = Field(default=10, examples=[100.0, 250.50])
+ total: float = Field(default=0.0, examples=[100.0, 250.50])
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
notebooks/use-cases/information-extraction.ipynb
(1 hunks)
🔇 Additional comments (3)
notebooks/use-cases/information-extraction.ipynb (3)
95-105: LGTM!
The DocumentExtractor setup correctly specifies allowed formats (IMAGE and PDF) for the extraction workflow.
121-202: Excellent template format coverage!
The notebook effectively demonstrates all four template formats (string, dict, Pydantic class, Pydantic instance) with clear examples and progressive complexity. This provides users with multiple options for different use cases.
209-227: Well-structured guidance for users.
The template selection guidelines and extraction tips provide clear, actionable advice that helps users choose the appropriate template format and improve extraction accuracy.
"display.HTML(f'''\n",
"<iframe src=\"{invoice_url}\" width=\"100%\" height=\"600px\">\n",
" <p>Your browser does not support iframes. <a href=\"{invoice_url}\">Click here to view the invoice</a></p>\n",
"</iframe>\n",
"''')"
Fix display.HTML() to properly render the iframe.
The display.HTML() call constructs the HTML object but doesn't explicitly display it. In a notebook cell, wrap it with display.display() or ensure it's the last expression returned.
Apply this diff:
-display.HTML(f'''
+display.display(display.HTML(f'''
<iframe src="{invoice_url}" width="100%" height="600px">
<p>Your browser does not support iframes. <a href="{invoice_url}">Click here to view the invoice</a></p>
</iframe>
-''')
+'''))
🤖 Prompt for AI Agents
In notebooks/use-cases/information-extraction.ipynb around lines 77 to 81, the
call constructs an HTML object via display.HTML(...) but does not actually
render it; wrap the HTML object in display.display(display.HTML(...)) (or make
the display.HTML(...) call the last expression in the cell so it is returned) so
the iframe is rendered; ensure display is imported from IPython.display if not
already.
5dc3323 to f188f91
Pull Request Overview
This PR adds a comprehensive tutorial notebook demonstrating how to extract structured information from documents using Docling's extraction API. The notebook walks users through different template formats for information extraction, from simple JSON strings to type-safe Pydantic models.
Key Changes:
- Added a complete tutorial notebook covering information extraction workflows
- Demonstrated four template approaches: JSON strings, Python dictionaries, Pydantic classes, and Pydantic instances
- Included practical guidance on template selection, best practices, and result interpretation
"metadata": {},
"outputs": [],
"source": [
"%pip install -q docling[vlm] # Install the Docling package with VLM support"
Copilot AI, Oct 17, 2025
This installation cell appears before the 'Installation' section header (line 30). Consider removing this duplicate installation command or moving the markdown section above it to maintain logical flow.
"\n",
"class Invoice(BaseModel):\n",
" invoice_number: str = Field(examples=[\"INV-001\", \"12345\"])\n",
" total: float = Field(default=10, examples=[100.0, 250.50])\n",
Copilot AI, Oct 17, 2025
The default value of 10 for 'total' seems inconsistent with the provided examples (100.0, 250.50). Consider using a default value that aligns better with the examples, such as 0.0 or removing the default to make it required.
This notebook provides an example of how to extract structured information from complex business documents using Docling's extraction API. Signed-off-by: Alina Ryan <[email protected]>
f188f91 to 64d919f
"source": [
"invoice_url = \"https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf\"\n",
"\n",
"display.HTML(f'''\n",
Use whichever option you prefer, but IPython's display has something to draw iFrames:
display.IFrame(invoice_url, width="100%", height=600)
"source": [
"## Information Extraction with Templates\n",
"\n",
"Docling supports different template formats for information extraction. Templates define the structure and data types of the information you want to extract from documents. Let's explore the different approaches:\n",
If documentation about this exists in the Docling docs, maybe worth adding a link here.
The Docling Advanced Pydantic Model is a super cool example that would be worth mentioning in the Pydantic Model templates section.
"metadata": {},
"outputs": [],
"source": [
"res = extractor.extract(invoice_url, template={\"invoice_number\": \"string\", \"total\": \"float\"})\n",
extractor.extract takes some time to run on CPU, so maybe make this a comment in the instructions of the previous code cell, or something like that, to avoid running extractor.extract twice with the same params if I'm running all cells.
Oh nevermind, I realize it's doing the same thing multiple times for educational purposes.
"metadata": {},
"outputs": [],
"source": [
"res = extractor.extract(invoice_url, template='{\"invoice_number\": \"string\", \"total\": \"float\"}')\n",
I think we should only have one call of extractor.extract. It's great to point out all of the different ways you can include a template, but I could see users calling extract() over and over by accident the way it's laid out now.
Having a cell after all of the different template types are initialized, where you can just specify which template you want to use, would be great. Something like what we're doing in the [conversion pipeline notebook](https://github.com/opendatahub-io/odh-data-processing/blob/main/notebooks/use-cases/document-conversion-standard.ipynb?short_path=316e413#L176) is what I think would be more user friendly.
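One way the "pick a template, then extract once" cell could be laid out is sketched below. The names TEMPLATES, TEMPLATE_CHOICE and select_template are hypothetical, only the two stdlib-friendly template formats are shown (the Pydantic class and instance templates would be added to the same dict), and the extract call is left commented out because it depends on the extractor and invoice_url defined earlier in the notebook.

```python
import json

# All template variants are defined up front; nothing is extracted yet.
TEMPLATES = {
    "json_string": '{"invoice_number": "string", "total": "float"}',
    "dict": {"invoice_number": "string", "total": "float"},
    # "pydantic_class": Invoice,              # assumed: defined in an earlier cell
    # "pydantic_instance": Invoice(...),      # assumed: defined in an earlier cell
}

# The single knob a user changes before running the extraction cell.
TEMPLATE_CHOICE = "dict"


def select_template(name, templates=TEMPLATES):
    """Look up a template by name, with a readable error for typos."""
    if name not in templates:
        raise ValueError(f"Unknown template {name!r}; choose one of {sorted(templates)}")
    return templates[name]


template = select_template(TEMPLATE_CHOICE)

# The one and only extraction call, using the signature shown in this PR's diff:
# res = extractor.extract(invoice_url, template=template)
```

With this layout, "Run all" triggers a single slow extraction, while the earlier cells still show every template format side by side.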
"id": "ls58xx8sbwp", | ||
"metadata": {}, | ||
"source": [ | ||
"## Understanding the Results\n", |
The Template Selection Guidelines and Tips for Better Extraction are great, but in my opinion they would be more effective if they were incorporated where the templates and extraction classes are initialized above. I would move the information in this section above so that the guidelines and tips are associated with code cell blocks.
"source": [
"### Pydantic Instance Template Format\n",
"\n",
"You can also use a Pydantic model instance as a template, which allows you to override the defaults and provide specific fallback values:"
Only after reading the Docling example for this did I understand why you added this section.
Can you add something grounded in the invoice we're using in the notebook, similar to what the Docling example says:
This can be very useful in scenarios where we happen to have available context that is more relevant than the default values predefined in the model definition.
E.g. in the example below:
bill_no and total are actually set from the value extracted from the data,
there was no tax_id to be extracted, so the updated default we provided was applied
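The fallback behaviour described in that quote can be illustrated with plain dictionaries. This is only a sketch of the described outcome, not Docling's or Pydantic's implementation; apply_fallbacks is a hypothetical helper and every value below is made up for illustration.

```python
def apply_fallbacks(instance_defaults: dict, extracted: dict) -> dict:
    """Extracted values win; fields that were not found keep the provided default."""
    merged = dict(instance_defaults)
    merged.update({key: value for key, value in extracted.items() if value is not None})
    return merged


# Defaults supplied through the template instance (made-up values).
instance_defaults = {"bill_no": "unknown", "total": 0.0, "tax_id": "no-tax-id-found"}

# What an extraction might return when the document has no tax_id (made-up values).
extracted = {"bill_no": "INV-001", "total": 100.0, "tax_id": None}

result = apply_fallbacks(instance_defaults, extracted)
print(result)
```

Here bill_no and total come from the extracted data, while tax_id keeps the default that was supplied, which is the situation the quoted passage describes.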
This notebook provides an example of how to extract structured information from
complex business documents using Docling's extraction API.
Adapted from https://docling-project.github.io/docling/examples/extraction/
Description
How Has This Been Tested?
Merge criteria:
Summary by CodeRabbit