Conversation

@alinaryan (Contributor) commented Oct 13, 2025

This notebook provides an example of how to extract structured information from
complex business documents using Docling's extraction API.
Adapted from https://docling-project.github.io/docling/examples/extraction/

Description

How Has This Been Tested?

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

Summary by CodeRabbit

  • Documentation
    • Added a tutorial notebook demonstrating end-to-end document information extraction.
    • Includes setup for VLM-enabled extraction, extractor configuration, and supported formats.
    • Demonstrates four template approaches (JSON string, dict, Pydantic class, Pydantic instance) with extraction examples.
    • Shows sample field extraction (e.g., invoice number, total), rendering guidance, result interpretation, validation tips, and best practices.

coderabbitai bot commented Oct 13, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds a new notebook that demonstrates end-to-end document information extraction with Docling: installation with VLM support, DocumentExtractor configuration (allowed formats), four template formats (JSON string, dict, Pydantic class, Pydantic instance), invoice extraction examples, result interpretation, validation, and best-practices guidance.

Changes

| Cohort / File(s) | Change Summary |
| --- | --- |
| New Information Extraction Notebook: notebooks/use-cases/information-extraction.ipynb | Adds a comprehensive notebook demonstrating Docling installation with VLM support, DocumentExtractor initialization and allowed-format configuration, four template formats (JSON string, dict, Pydantic model class, Pydantic model instance), example invoice extraction (sample URL), iframe rendering snippet, result interpretation, Pydantic validation examples, and best-practices guidance. |

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant User
    participant Notebook as Notebook (cells)
    participant Docling as Docling Engine
    participant VLM as VLM Support
    participant Extractor as DocumentExtractor
    participant Template as Template Formats

    User->>Notebook: run cells
    Notebook->>Docling: install & import
    Notebook->>VLM: configure VLM support
    Notebook->>Extractor: init (allowed formats)

    User->>Notebook: define template
    Notebook->>Template: JSON / dict / Pydantic class / instance

    User->>Notebook: request extraction (invoice URL)
    Notebook->>Extractor: submit document
    Extractor->>Docling: parse & extract
    Docling-->>Extractor: extracted fields
    Extractor->>Template: map & validate
    Template-->>Extractor: validated output
    Extractor-->>Notebook: return results
    Notebook-->>User: display results & guidance

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐇 I hopped through notebooks, sniffed the lines,

Found invoice numbers between the signs,
Templates neat, validation true,
Fields returned — a joyful view,
A little rabbit's extraction rhyme.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The pull request title "[RHAIENG-1095] Add information extraction example notebook" directly and clearly describes the main change in the changeset, which is the addition of a new notebook demonstrating information extraction from documents using Docling. The title is concise, specific, and uses clear language without vague terms or unnecessary noise. The inclusion of the ticket number is standard practice and helps with tracking. A teammate scanning the repository history would immediately understand that this PR adds a new example notebook for information extraction functionality. |
| Docstring Coverage | ✅ Passed | No functions found in the changes. Docstring coverage check skipped. |
✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f188f91 and 64d919f.

📒 Files selected for processing (1)
  • notebooks/use-cases/information-extraction.ipynb (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • notebooks/use-cases/information-extraction.ipynb
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

@alinaryan alinaryan force-pushed the info-extraction-nb branch 2 times, most recently from f2517bc to c2edf36 Compare October 17, 2025 20:42
@alinaryan alinaryan marked this pull request as ready for review October 17, 2025 20:42
@alinaryan alinaryan requested a review from a team as a code owner October 17, 2025 20:42
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (4)
notebooks/use-cases/information-extraction.ipynb (4)

22-23: Pin or externalize Docling install for reproducibility.

Unpinned %pip install -q docling[vlm] can break silently as APIs change. Pin a version range or move deps to a requirements file used by CI images.

Option A (inline pin):

-%pip install -q docling[vlm] # Install the Docling package with VLM support
+%pip install -q "docling[vlm]>=X.Y,<X.(Y+1)"  # pin minor series to avoid breaking changes

Option B (preferred): add notebooks/requirements-info-extraction.txt and use:

-%pip install -q docling[vlm]
+%pip install -q -r notebooks/requirements-info-extraction.txt

74-81: Stabilize external sample asset; add offline/CI fallback.

Embedding a remote PDF via iframe can fail (network/X-Frame-Options) and breaks offline CI. Download to a temp file and use the local path if fetch succeeds; otherwise keep the link.

import os
import tempfile
import urllib.request

from IPython import display

invoice_url = "https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf"
local_invoice_path = None
try:
    # Download to a temp file so extraction does not depend on the remote host
    fd, tmp = tempfile.mkstemp(suffix=".pdf")
    os.close(fd)
    urllib.request.urlretrieve(invoice_url, tmp)
    local_invoice_path = tmp
except Exception:
    pass  # fall back to the URL

invoice_input = local_invoice_path or invoice_url
display.HTML(f'<a href="{invoice_url}" target="_blank">Open invoice</a>')

Then use invoice_input below.


100-104: Extractor init: confirm API surface and resource notes.

Docling’s extractor APIs evolve; verify DocumentExtractor(...).extract(str_url, template=...) is supported in your pinned version. Consider adding a short note on VLM model size/compute needs to set expectations for users.


169-178: Pydantic defaults may mislead extraction; make required fields explicit.

total defaulting to 10 biases results; make required fields required and keep optional truly optional.

 class Invoice(BaseModel):
-    invoice_number: str = Field(examples=["INV-001", "12345"])
-    total: float = Field(default=10, examples=[100.0, 250.50])
-    vendor_name: Optional[str] = Field(default=None, examples=["ACME Corp", "Tech Solutions Inc"])
+    invoice_number: str = Field(..., examples=["INV-001", "12345"])
+    total: float = Field(..., examples=[100.0, 250.50])
+    vendor_name: Optional[str] = Field(None, examples=["ACME Corp", "Tech Solutions Inc"])

If targeting Pydantic v2 only, you may prefer json_schema_extra={"examples":[...]} for forward-compat metadata.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a9b9cb8 and c2edf36.

📒 Files selected for processing (1)
  • notebooks/use-cases/information-extraction.ipynb (1 hunks)
🧰 Additional context used
🪛 GitHub Actions: Smoke Tests for Notebooks
notebooks/use-cases/information-extraction.ipynb

[error] 1-1: Notebook information-extraction.ipynb does not have any cell tagged with 'parameters'.

🔇 Additional comments (2)
notebooks/use-cases/information-extraction.ipynb (2)

127-129: Template type keywords: verify accepted schema names.

Confirm whether Docling expects "float" or "number" (and whether "integer"/nested objects are supported). Use consistent, supported type names across examples to avoid confusion.

Also applies to: 148-150


197-201: Good illustration of instance templates.

Using a model instance to set fallbacks is clear and useful.

Comment on lines +16 to +24
"cell_type": "code",
"execution_count": null,
"id": "d58fb60e",
"metadata": {},
"outputs": [],
"source": [
"%pip install -q docling[vlm] # Install the Docling package with VLM support"
]
},
@coderabbitai coderabbitai bot Oct 17, 2025

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

CI blocker: add a 'parameters' cell (Papermill smoke test).

Insert a code cell at the top tagged with 'parameters' and move runtime-tunable values there (e.g., invoice_url, allowed formats). This unblocks the "Smoke Tests for Notebooks".

Example cell (ensure cell metadata has tags: ["parameters"]):

# Parameters
invoice_url = "https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf"
ALLOWED_FORMATS = ["PDF", "IMAGE"]  # used below

Verification script:


🏁 Script executed:

#!/bin/bash
# Fails if no cell has the 'parameters' tag
jq -e '[.cells[].metadata.tags? // [] | index("parameters")] | any' notebooks/use-cases/information-extraction.ipynb >/dev/null || {
  echo "Missing 'parameters' tag"; exit 1; }

Length of output: 24


🏁 Script executed:

#!/bin/bash
# Extract notebook structure: cells, metadata, kernelspec, and content summary
jq '{
  kernelspec: .metadata.kernelspec,
  cell_count: (.cells | length),
  cells_summary: [.cells[] | {
    type: .cell_type,
    has_parameters_tag: ((.metadata.tags // []) | contains(["parameters"])),
    tags: (.metadata.tags // []),
    content_preview: (if .cell_type == "code" then (.source | join("") | .[0:100]) else "markdown" end)
  }]
}' notebooks/use-cases/information-extraction.ipynb

Length of output: 3296


CI blocker confirmed: add a 'parameters' cell (Papermill smoke test).

Verification confirms no cell is tagged 'parameters'. Insert a code cell at the top with metadata tag 'parameters' containing runtime-tunable values (e.g., invoice_url, allowed formats). This unblocks the "Smoke Tests for Notebooks" pipeline.

Example cell (ensure metadata has tags: ["parameters"]):

# Parameters
invoice_url = "https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf"
ALLOWED_FORMATS = ["PDF", "IMAGE"]
🤖 Prompt for AI Agents
In notebooks/use-cases/information-extraction.ipynb around lines 16 to 24, the
notebook is missing a Papermill 'parameters' cell which blocks CI; add a new
code cell at the top of the notebook whose metadata includes tags:
["parameters"] and define runtime-tunable variables such as invoice_url and
ALLOWED_FORMATS (example values as in the review) so the Papermill smoke test
can execute; ensure the cell is the first code cell and the metadata tags field
exactly contains "parameters".

Contributor Author

@fabianofranz I added # Parameters to line 74 but I'm still having trouble passing the test. Is that what the check is intended to look for?

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Contributor

@alinaryan It checks for a code cell containing the parameters tag, set as specified here: https://papermill.readthedocs.io/en/latest/usage-parameterize.html. The intention is to make sure we have a cell that can be parameterized to run e.g. in CI in headless mode by papermill. ;)
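
For reference, the check described above can be approximated in a few lines of Python. This is only a sketch, not the repository's actual CI script: the helper name `has_parameters_cell` is hypothetical, and it assumes the usual notebook JSON layout in which each cell may carry a `metadata.tags` list.

```python
def has_parameters_cell(notebook: dict) -> bool:
    """Return True if any code cell carries the 'parameters' tag."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        tags = cell.get("metadata", {}).get("tags", [])
        if "parameters" in tags:
            return True
    return False


# A "# Parameters" comment in the source is not enough; the tag lives in the cell metadata.
untagged = {"cells": [{"cell_type": "code", "metadata": {}, "source": ["# Parameters\n"]}]}
tagged = {
    "cells": [
        {"cell_type": "code", "metadata": {"tags": ["parameters"]}, "source": ["# Parameters\n"]}
    ]
}

print(has_parameters_cell(untagged))  # False
print(has_parameters_cell(tagged))    # True
```

To run it against a real notebook, load the `.ipynb` file with `json.load` and pass the resulting dict to the helper.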

Contributor Author

Got it! I think I added the correct tag this time, it's now passing :)

Comment on lines +42 to +45
"from IPython import display\n",
"from pydantic import BaseModel, Field\n",
"from rich import print"
]

⚠️ Potential issue | 🟡 Minor

Avoid extra dependency on rich or ensure it’s installed.

from rich import print isn’t necessary here and may not exist in CI images.

-from rich import print
+# use built-in print to avoid extra dependency

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In notebooks/use-cases/information-extraction.ipynb around lines 42 to 45,
remove the unnecessary third-party import "from rich import print" (or replace
it with a safe optional import pattern) so the notebook doesn't rely on a
dependency that may not be present in CI; either delete that line and use the
built-in print throughout, or wrap the import in a try/except that falls back to
Python's print and update requirements only if you choose to keep rich.
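
A minimal sketch of the optional-import pattern mentioned above, assuming the notebook only needs `print` from rich. The sample dict is illustrative and reuses example values from the review.

```python
try:
    # Prefer rich's pretty printer when the package is available
    from rich import print
except ImportError:
    # Fall back to the built-in print so the notebook still runs without rich
    from builtins import print

print({"invoice_number": "INV-001", "total": 100.0})
```

Either branch leaves a callable named `print` in scope, so the remaining cells need no changes.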

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (3)
notebooks/use-cases/information-extraction.ipynb (3)

42-45: Remove unnecessary rich dependency.

The import from rich import print on line 44 adds an unnecessary dependency that may not be available in CI environments. Use Python's built-in print instead.

Apply this diff:

 from IPython import display
 from pydantic import BaseModel, Field
-from rich import print

68-72: CI blocker: Add 'parameters' tag to cell metadata.

This cell has the # Parameters comment but the metadata lacks the required "tags": ["parameters"] entry. This blocks the Papermill smoke tests in CI.

Update the cell metadata to include:

"metadata": {
  "tags": ["parameters"]
}
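
If hand-editing the notebook JSON is error-prone, the tag can also be added with a short script. This is a sketch under stated assumptions: the helper name `tag_parameters_cell` and the `# Parameters` marker convention are illustrative, and the notebook is treated as plain JSON rather than going through a notebook library.

```python
def tag_parameters_cell(notebook: dict, marker: str = "# Parameters") -> bool:
    """Tag the first code cell whose source starts with `marker`.

    Returns True if such a cell was found (newly tagged or already tagged).
    """
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # Cell source may be stored as one string or as a list of lines
        text = source if isinstance(source, str) else "".join(source)
        if text.lstrip().startswith(marker):
            tags = cell.setdefault("metadata", {}).setdefault("tags", [])
            if "parameters" not in tags:
                tags.append("parameters")
            return True
    return False


nb = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["## Setup\n"]},
        {
            "cell_type": "code",
            "metadata": {},
            "source": ["# Parameters\n", 'ALLOWED_FORMATS = ["PDF", "IMAGE"]\n'],
        },
    ]
}
tag_parameters_cell(nb)
print(nb["cells"][1]["metadata"])  # {'tags': ['parameters']}
```

In practice the dict would come from `json.load` on the `.ipynb` file and be written back with `json.dump` after tagging.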

256-273: Align notebook Python version with CI environment.

The notebook metadata specifies Python 3.13.7 (line 272), but CI uses Python 3.12. Either update the version to "3.12" or remove the version field to prevent drift and potential compatibility issues.

Apply this diff to align with CI:

    "pygments_lexer": "ipython3",
-   "version": "3.13.7"
+   "version": "3.12"
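
The alternative of removing the version field can be scripted so it does not drift back after a local run. A sketch only: `drop_language_version` is a hypothetical helper, and it assumes the version is stored under `metadata.language_info.version`, as the surrounding `pygments_lexer` key suggests.

```python
def drop_language_version(notebook: dict) -> dict:
    """Remove metadata.language_info.version in place, if present."""
    language_info = notebook.get("metadata", {}).get("language_info", {})
    language_info.pop("version", None)
    return notebook


nb = {
    "metadata": {
        "language_info": {
            "name": "python",
            "pygments_lexer": "ipython3",
            "version": "3.13.7",
        }
    },
    "cells": [],
}
drop_language_version(nb)
print(nb["metadata"]["language_info"])
# {'name': 'python', 'pygments_lexer': 'ipython3'}
```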
🧹 Nitpick comments (2)
notebooks/use-cases/information-extraction.ipynb (2)

15-34: Reorder cells for better narrative flow.

The pip install cell (lines 16-24) executes before the markdown cell (lines 25-34) that explains the installation. Swap these cells so the explanation precedes the action.


172-175: Consider using 0.0 as the default for total.

Line 174 sets default=10 for the total field, which seems arbitrary. A default of 0.0 or no default (making it required) would be more intuitive for a monetary amount.

Apply this diff if you prefer a zero default:

-    total: float = Field(default=10, examples=[100.0, 250.50])
+    total: float = Field(default=0.0, examples=[100.0, 250.50])
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c2edf36 and 5dc3323.

📒 Files selected for processing (1)
  • notebooks/use-cases/information-extraction.ipynb (1 hunks)
🔇 Additional comments (3)
notebooks/use-cases/information-extraction.ipynb (3)

95-105: LGTM!

The DocumentExtractor setup correctly specifies allowed formats (IMAGE and PDF) for the extraction workflow.


121-202: Excellent template format coverage!

The notebook effectively demonstrates all four template formats (string, dict, Pydantic class, Pydantic instance) with clear examples and progressive complexity. This provides users with multiple options for different use cases.


209-227: Well-structured guidance for users.

The template selection guidelines and extraction tips provide clear, actionable advice that helps users choose the appropriate template format and improve extraction accuracy.

Comment on lines +77 to +89
"display.HTML(f'''\n",
"<iframe src=\"{invoice_url}\" width=\"100%\" height=\"600px\">\n",
" <p>Your browser does not support iframes. <a href=\"{invoice_url}\">Click here to view the invoice</a></p>\n",
"</iframe>\n",
"''')"

⚠️ Potential issue | 🟡 Minor

Fix display.HTML() to properly render the iframe.

The display.HTML() call constructs the HTML object but doesn't explicitly display it. In a notebook cell, wrap it with display.display() or ensure it's the last expression returned.

Apply this diff:

-display.HTML(f'''
+display.display(display.HTML(f'''
 <iframe src="{invoice_url}" width="100%" height="600px">
   <p>Your browser does not support iframes. <a href="{invoice_url}">Click here to view the invoice</a></p>
 </iframe>
-''')
+'''))
🤖 Prompt for AI Agents
In notebooks/use-cases/information-extraction.ipynb around lines 77 to 81, the
call constructs an HTML object via display.HTML(...) but does not actually
render it; wrap the HTML object in display.display(display.HTML(...)) (or make
the display.HTML(...) call the last expression in the cell so it is returned) so
the iframe is rendered; ensure display is imported from IPython.display if not
already.

@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds a comprehensive tutorial notebook demonstrating how to extract structured information from documents using Docling's extraction API. The notebook walks users through different template formats for information extraction, from simple JSON strings to type-safe Pydantic models.

Key Changes:

  • Added a complete tutorial notebook covering information extraction workflows
  • Demonstrated four template approaches: JSON strings, Python dictionaries, Pydantic classes, and Pydantic instances
  • Included practical guidance on template selection, best practices, and result interpretation

"metadata": {},
"outputs": [],
"source": [
"%pip install -q docling[vlm] # Install the Docling package with VLM support"
Copilot AI Oct 17, 2025

This installation cell appears before the 'Installation' section header (line 30). Consider removing this duplicate installation command or moving the markdown section above it to maintain logical flow.

"\n",
"class Invoice(BaseModel):\n",
" invoice_number: str = Field(examples=[\"INV-001\", \"12345\"])\n",
" total: float = Field(default=10, examples=[100.0, 250.50])\n",
Copilot AI Oct 17, 2025

The default value of 10 for 'total' seems inconsistent with the provided examples (100.0, 250.50). Consider using a default value that aligns better with the examples, such as 0.0 or removing the default to make it required.

This notebook provides an example of how to extract structured information from
complex business documents using Docling's extraction API.

Signed-off-by: Alina Ryan <[email protected]>
"source": [
"invoice_url = \"https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf\"\n",
"\n",
"display.HTML(f'''\n",
Contributor

Use whichever option you prefer, but IPython's display has something to draw iFrames:

display.IFrame(invoice_url, width="100%", height=600)

"source": [
"## Information Extraction with Templates\n",
"\n",
"Docling supports different template formats for information extraction. Templates define the structure and data types of the information you want to extract from documents. Let's explore the different approaches:\n",
Contributor

If documentation about this exists in the Docling docs, maybe worth adding a link here.

Contributor

The Docling Advanced Pydantic Model is a super cool example that would be worth mentioning in the Pydantic Model templates section.

"metadata": {},
"outputs": [],
"source": [
"res = extractor.extract(invoice_url, template={\"invoice_number\": \"string\", \"total\": \"float\"})\n",
Contributor

extractor.extract takes some time to run on CPU, so maybe make this a comment in the instructions of the previous code cell, or something like that. That would avoid running extractor.extract twice with the same params when I'm running all cells.

Contributor

Oh never mind, I realize it's doing the same thing multiple times for educational purposes.

"metadata": {},
"outputs": [],
"source": [
"res = extractor.extract(invoice_url, template='{\"invoice_number\": \"string\", \"total\": \"float\"}')\n",
Contributor

I think we should only have one call of extractor.extract. It's great to point out all of the different ways you can include a template, but I could see users calling extract() over and over by accident the way it's laid out now.

Having a cell after all of the different template types are initialized where you can just specify which template you want to use would be great. Something like what we're doing in the [conversion pipeline notebook](https://github.com/opendatahub-io/odh-data-processing/blob/main/notebooks/use-cases/document-conversion-standard.ipynb?short_path=316e413#L176) is what I think would be more user friendly.
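
The single-call layout suggested above could look roughly like this. It is a sketch, not the notebook's code: `TEMPLATES` and `TEMPLATE_CHOICE` are made-up names, only the two plain-Python template forms are filled in, and the `extract` call stays commented out because it assumes the `extractor` and `invoice_url` objects defined earlier in the notebook.

```python
# Collect the template variants in one place; nothing is extracted yet.
TEMPLATES = {
    "json_string": '{"invoice_number": "string", "total": "float"}',
    "dict": {"invoice_number": "string", "total": "float"},
    # The Pydantic class and a Pydantic instance from the notebook
    # would be registered here as two further entries.
}

# Pick exactly one template, so running all cells triggers a single extraction.
TEMPLATE_CHOICE = "dict"
template = TEMPLATES[TEMPLATE_CHOICE]

# res = extractor.extract(invoice_url, template=template)
print(f"Selected template '{TEMPLATE_CHOICE}': {template}")
```

The earlier cells would then only define templates, and switching formats means changing one string rather than re-running a different extraction cell.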

"id": "ls58xx8sbwp",
"metadata": {},
"source": [
"## Understanding the Results\n",
Contributor

The Template Selection Guidelines and Tips for Better Extraction are great, but in my opinion they would be more effective if they were incorporated where the templates and extraction classes are initialized above. I would move the information in this section up so that the guidelines and tips sit next to the relevant code cells.

"source": [
"### Pydantic Instance Template Format\n",
"\n",
"You can also use a Pydantic model instance as a template, which allows you to override the defaults and provide specific fallback values:"
Contributor

Only after reading the docling example for this did I understand why you added this section.

Can you add an explanation grounded in the invoice we're using in this notebook, similar to this snippet from the Docling example:

This can be very useful in scenarios where we happen to have available context that is more relevant than the default values predefined in the model definition.

E.g. in the example below:

    bill_no and total are actually set from the value extracted from the data,
    there was no tax_id to be extracted, so the updated default we provided was applied


3 participants