Merged
375 changes: 375 additions & 0 deletions notebooks/use-cases/information-extraction.ipynb
@@ -0,0 +1,375 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "7c83150f-fa8b-42a1-8974-ce9483912fba",
"metadata": {},
"source": [
"# Information Extraction with Docling\n",
"\n",
"This notebook demonstrates how to extract structured information from unstructured documents using [Docling](https://docling-project.github.io/docling/). You'll learn how to use different template formats to extract specific data fields from documents like invoices.\n",
"\n",
"*Note: The extraction API is currently in beta and may change without prior notice.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d58fb60e",
"metadata": {},
"outputs": [],
"source": [
"%pip install -q docling[vlm] # Install the Docling package with VLM support"
Copilot AI Oct 17, 2025

This installation cell appears before the 'Installation' section header (line 30). Consider removing this duplicate installation command or moving the markdown section above it to maintain logical flow.

]
},
Comment on lines +16 to +24
@coderabbitai coderabbitai bot Oct 17, 2025

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

CI blocker: add a 'parameters' cell (Papermill smoke test).

Insert a code cell at the top tagged with 'parameters' and move runtime-tunable values there (e.g., invoice_url, allowed formats). This unblocks the "Smoke Tests for Notebooks".

Example cell (ensure cell metadata has tags: ["parameters"]):

# Parameters
invoice_url = "https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf"
ALLOWED_FORMATS = ["PDF", "IMAGE"]  # used below

Verification script:


🏁 Script executed:

#!/bin/bash
# Fails if no cell has the 'parameters' tag
jq -e '[.cells[].metadata.tags? // [] | index("parameters")] | any' notebooks/use-cases/information-extraction.ipynb >/dev/null || {
  echo "Missing 'parameters' tag"; exit 1; }

Length of output: 24


🏁 Script executed:

#!/bin/bash
# Extract notebook structure: cells, metadata, kernelspec, and content summary
jq '{
  kernelspec: .metadata.kernelspec,
  cell_count: (.cells | length),
  cells_summary: [.cells[] | {
    type: .cell_type,
    has_parameters_tag: ((.metadata.tags // []) | contains(["parameters"])),
    tags: (.metadata.tags // []),
    content_preview: (if .cell_type == "code" then (.source | join("") | .[0:100]) else "markdown" end)
  }]
}' notebooks/use-cases/information-extraction.ipynb

Length of output: 3296


CI blocker confirmed: add a 'parameters' cell (Papermill smoke test).

Verification confirms no cell is tagged 'parameters'. Insert a code cell at the top with metadata tag 'parameters' containing runtime-tunable values (e.g., invoice_url, allowed formats). This unblocks the "Smoke Tests for Notebooks" pipeline.

Example cell (ensure metadata has tags: ["parameters"]):

# Parameters
invoice_url = "https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf"
ALLOWED_FORMATS = ["PDF", "IMAGE"]
🤖 Prompt for AI Agents
In notebooks/use-cases/information-extraction.ipynb around lines 16 to 24, the
notebook is missing a Papermill 'parameters' cell which blocks CI; add a new
code cell at the top of the notebook whose metadata includes tags:
["parameters"] and define runtime-tunable variables such as invoice_url and
ALLOWED_FORMATS (example values as in the review) so the Papermill smoke test
can execute; ensure the cell is the first code cell and the metadata tags field
exactly contains "parameters".

Contributor Author

@fabianofranz I added # Parameters to line 74 but I'm still having trouble passing the test. Is that the intent of the check?

Contributor

@alinaryan It checks for a code cell containing the parameters tag, set as specified here: https://papermill.readthedocs.io/en/latest/usage-parameterize.html. The intention is to make sure we have a cell that can be parameterized to run e.g. in CI in headless mode by papermill. ;)
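
For illustration, a check like the one described above could be sketched in Python using only plain dictionaries. This is a minimal sketch, not the actual CI implementation: the function name `find_parameters_cells` and the sample notebook content are hypothetical, and the only assumption taken from the thread is that the check looks for a code cell whose metadata tags include "parameters".

```python
def find_parameters_cells(notebook: dict) -> list[int]:
    """Return the indices of code cells whose metadata tags include 'parameters'."""
    return [
        index
        for index, cell in enumerate(notebook.get("cells", []))
        if cell.get("cell_type") == "code"
        and "parameters" in cell.get("metadata", {}).get("tags", [])
    ]


# Hypothetical in-memory notebook standing in for a real .ipynb file.
sample_notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {"tags": ["parameters"]},
            "source": ['invoice_url = "https://example.com/invoice.pdf"'],
        },
    ]
}

print(find_parameters_cells(sample_notebook))
```

A comment such as `# Parameters` inside the cell source would not be picked up by a check of this shape; only the `tags` entry in the cell metadata counts. For a notebook on disk, the same function could be applied to the result of `json.load` on the `.ipynb` file.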

Contributor Author

Got it! I think I added the correct tag this time, it's now passing :)

Comment on lines +16 to +24
⚠️ Potential issue | 🟡 Minor

Remove duplicate installation cell or reorganize sections.

This installation cell appears before the "Installation" section header (line 30). Either remove this cell or move the Installation markdown header above it to maintain logical flow.

Based on learnings

🤖 Prompt for AI Agents
In notebooks/use-cases/information-extraction.ipynb around lines 16 to 24,
there's a pip install code cell placed before the "Installation" section
header—duplicate or out of order; remove this early installation cell or move it
so the "Installation" markdown header comes before the install cell, ensuring
only one installation cell remains and the notebook sections flow logically.

{
"cell_type": "markdown",
"id": "jfbjatzrbg",
"metadata": {},
"source": [
"## Installation\n",
"\n",
"First, we need to install the Docling package with VLM (Vision Language Model) support for information extraction:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8cf4340c-cfd4-418c-955b-be8c0d544e67",
"metadata": {},
"outputs": [],
"source": [
"from IPython import display\n",
"from pydantic import BaseModel, Field\n",
"from rich import print"
]
Comment on lines +42 to +45
⚠️ Potential issue | 🟡 Minor

Avoid extra dependency on rich or ensure it’s installed.

from rich import print isn’t necessary here and may not exist in CI images.

-from rich import print
+# use built-in print to avoid extra dependency

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In notebooks/use-cases/information-extraction.ipynb around lines 42 to 45,
remove the unnecessary third-party import "from rich import print" (or replace
it with a safe optional import pattern) so the notebook doesn't rely on a
dependency that may not be present in CI; either delete that line and use the
built-in print throughout, or wrap the import in a try/except that falls back to
Python's print and update requirements only if you choose to keep rich.
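
The "safe optional import pattern" mentioned above might look like the following. This is a sketch of one common approach, not necessarily what the PR ended up using: it prefers `rich`'s `print` when the package is importable and otherwise falls back to the built-in.

```python
import builtins

try:
    # Prefer rich's pretty-printing print when the package is installed.
    from rich import print
except ImportError:
    # Fall back to the built-in print so the notebook still runs without rich.
    print = builtins.print

print("Templates configured")
```

Either branch leaves a callable named `print` in scope, so the rest of the notebook does not need to change.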

},
{
"cell_type": "markdown",
"id": "99a1kzl8aw8",
"metadata": {},
"source": [
"## Setup and Configuration\n",
"\n",
"Let's import the necessary libraries and set up our environment:"
]
},
{
"cell_type": "markdown",
"id": "6f5e2659-e626-4235-97fc-f311adf8f5b7",
"metadata": {},
"source": [
"### Sample Document\n",
"\n",
"For this demonstration, we'll work with a sample invoice document. This will help us understand how information extraction works with real-world documents:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f0daf568-dd36-4e07-9d61-f2a4f196c449",
"metadata": {
"editable": true,
"raw_mimetype": "",
"slideshow": {
"slide_type": ""
},
"tags": [
"parameters"
]
},
"outputs": [],
"source": [
"invoice_url = \"https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf\"\n",
"\n",
"display.IFrame(invoice_url, width=\"100%\", height=600)"
]
},
{
"cell_type": "markdown",
"id": "20fa56da",
"metadata": {},
"source": [
"### Document Extractor Setup\n",
"\n",
"Now let's configure the document extractor to handle PDF and image formats. The extractor is the main component that will process our documents and extract information using the templates we define:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "46dd4e6e",
"metadata": {},
"outputs": [],
"source": [
"from docling.datamodel.base_models import InputFormat\n",
"from docling.document_extractor import DocumentExtractor\n",
"\n",
"extractor = DocumentExtractor(allowed_formats=[InputFormat.IMAGE, InputFormat.PDF])"
]
},
{
"cell_type": "markdown",
"id": "extraction-templates",
"metadata": {},
"source": [
"## Information Extraction with Templates\n",
"\n",
"Docling supports different template formats for [information extraction](https://docling-project.github.io/docling/examples/extraction/). Templates define the structure and data types of the information you want to extract from documents.\n",
"\n",
"### Configure extraction templates\n",
"\n",
"The next cell configures different template formats available for information extraction. Each template format has its own advantages and use cases:\n",
"\n",
"### Template Selection Guidelines\n",
"\n",
"- **String templates**: Simple JSON format, fastest to write and good for basic extractions\n",
" - Best for: Quick prototyping, simple data structures, minimal setup\n",
" - Example use case: Extracting just a few basic fields like invoice number and total\n",
"\n",
"\n",
"- **Dictionary templates**: Python dictionaries, provides better integration with Python code \n",
" - Best for: Structured data with nested objects, better Python integration\n",
" - Example use case: When you need nested data structures or complex field relationships\n",
"\n",
"\n",
"- **Pydantic model templates**: Recommended for production use with type validation\n",
" - Best for: Production applications, type safety, complex data structures, documentation\n",
" - Example use case: Enterprise applications where data validation and type safety are critical\n",
"\n",
"\n",
"- **Pydantic instance templates**: Useful when you need specific default values that override the model defaults\n",
" - Best for: When you have contextual information that should be used as fallbacks\n",
" - Example use case: Processing invoices from a specific vendor where you know the vendor name but want to extract it if present, or using a known invoice series number as a fallback\n",
" - **Why this matters**: In our sample invoice, if the vendor name isn't clearly extractable but you're processing a batch from \"WordPress Invoice Plugin\", you can set that as the default while still allowing extraction to override it\n",
"\n",
"In a later cell you'll choose which template format to use for the actual extraction."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "template-definition",
"metadata": {},
"outputs": [],
"source": [
"from typing import Optional\n",
"\n",
"# String template format - simple JSON string\n",
"string_template = '{\"invoice_number\": \"string\", \"total\": \"float\"}'\n",
"\n",
"# Dictionary template format - Python dictionary\n",
"dict_template = {\"invoice_number\": \"string\", \"total\": \"float\"}\n",
"\n",
"# Pydantic model template format - class definition with validation\n",
"# Notice how we use Field() with examples and defaults - this improves extraction accuracy!\n",
"class Invoice(BaseModel):\n",
" invoice_number: str = Field(examples=[\"INV-001\", \"12345\"])\n",
" total: float = Field(default=10, examples=[100.0, 250.50])\n",
" vendor_name: Optional[str] = Field(default=None, examples=[\"ACME Corp\", \"Tech Solutions Inc\"])\n",
"\n",
"pydantic_model_template = Invoice\n",
"\n",
"# Pydantic instance template format - model instance with specific defaults\n",
"# This is useful when you have contextual information about the documents you're processing.\n",
"# In this example, imagine you're processing a batch of invoices from a specific context:\n",
"pydantic_instance_template = Invoice(\n",
" invoice_number=\"WP-UNKNOWN\", # Fallback for WordPress invoice plugin documents\n",
" total=0.0, # Safe default when total can't be extracted\n",
" vendor_name=\"WordPress Invoice Plugin\" # Known vendor for this document batch\n",
")\n",
"\n",
"# Why use Pydantic instances? Consider this scenario with our sample invoice:\n",
"# - invoice_number and total will be extracted from the document if found\n",
"# - If vendor_name isn't clearly extractable, \"WordPress Invoice Plugin\" will be used\n",
"# - If invoice_number is missing, \"WP-UNKNOWN\" provides a meaningful fallback\n",
"# - This is very useful for batch processing where you have context about the document source\n",
"\n",
"def get_extraction_template(template_name: str = \"string\"):\n",
" \"\"\"Get the configured extraction template based on name.\n",
"\n",
" Args:\n",
" template_name: One of \"string\", \"dict\", \"pydantic_model\", or \"pydantic_instance\"\n",
" \n",
" Returns:\n",
" Template for extraction\n",
" \n",
" Raises:\n",
" ValueError: If template_name is not recognized\n",
" \"\"\"\n",
" templates = {\n",
" \"string\": string_template,\n",
" \"dict\": dict_template,\n",
" \"pydantic_model\": pydantic_model_template,\n",
" \"pydantic_instance\": pydantic_instance_template\n",
" }\n",
"\n",
" if template_name not in templates:\n",
" raise ValueError(\n",
" f\"Unknown template name: '{template_name}'. \"\n",
" f\"Choose from {list(templates.keys())}\"\n",
" )\n",
" \n",
" return templates[template_name]\n",
"\n",
"# Tips for Better Extraction:\n",
"print(\"💡 Tips for Better Extraction:\")\n",
"print(\"1. Use descriptive field names that clearly indicate what information you're looking for\")\n",
"print(\"2. Provide examples in Pydantic Field definitions to guide the extraction\") \n",
"print(\"3. Specify appropriate data types (string, float, int, etc.) for better accuracy\")\n",
"print(\"4. Use optional fields for data that might not always be present\")\n",
"print(\"5. Test with different template formats to find what works best for your use case\")\n",
"print(\"\")\n",
"print(\"🔧 Pydantic Instance Use Case:\")\n",
"print(\"In our sample invoice, the Pydantic instance template provides:\")\n",
"print(\"- Known vendor fallback: 'WordPress Invoice Plugin' (useful if vendor name is unclear)\")\n",
"print(\"- Meaningful invoice number fallback: 'WP-UNKNOWN' (better than generic defaults)\")\n",
"print(\"- Safe total fallback: 0.0 (prevents errors if extraction fails)\")\n",
"print(\"- Extracted data will still override these defaults when found in the document\")"
]
},
{
"cell_type": "markdown",
"id": "vlm-pipeline",
"metadata": {},
"source": [
"### Choose an extraction template\n",
"\n",
"Next we choose the template format to be used for information extraction.\n",
"\n",
"Each template format has different characteristics:\n",
"\n",
"- **string**: Simple JSON format, fastest to write and good for basic extractions\n",
"- **dict**: Python dictionary, provides better integration with Python code \n",
"- **pydantic_model**: Pydantic model class, recommended for production use with type validation\n",
"- **pydantic_instance**: Pydantic model instance, useful when you need specific default values\n",
"\n",
"Just set `template_to_use` to one of the available template formats."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "vlm-setup",
"metadata": {},
"outputs": [],
"source": [
"# Set the template to use (choose from: \"string\", \"dict\", \"pydantic_model\", \"pydantic_instance\")\n",
"template_to_use = \"pydantic_model\"\n",
"\n",
"extraction_template = get_extraction_template(template_to_use)\n",
"\n",
"print(f\"✓ Using '{template_to_use}' template format\")\n",
"print(f\"Template: {extraction_template}\")"
]
},
{
"cell_type": "markdown",
"id": "extraction-demo",
"metadata": {},
"source": [
"## ✨ Information Extraction\n",
"\n",
"Now we'll perform the information extraction using the selected template format. The extractor will analyze the invoice document and extract the structured information according to the template we configured."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "run-extraction",
"metadata": {},
"outputs": [],
"source": [
"# Perform information extraction using the selected template\n",
"print(f\"Extracting information using '{template_to_use}' template...\")\n",
"\n",
"result = extractor.extract(invoice_url, template=extraction_template)\n",
"\n",
"print(\"✓ Extraction completed successfully!\")\n",
"print(\"Extracted data:\")\n",
"print(result)"
]
},
{
"cell_type": "markdown",
"id": "ls58xx8sbwp",
"metadata": {},
"source": [
"## Understanding the Results\n",
Contributor

The Template Selection Guidelines and Tips for Better Extraction are great, but in my opinion they would be more effective if they were incorporated where the templates and extraction classes are initialized above. I would move the information in this section up so that the guidelines and tips sit next to the code cells they describe.

Contributor Author

Added

"\n",
"The extraction results contain the structured data extracted from the document according to your selected template. The extractor uses vision-language models to understand the document content and map it to the requested fields.\n",
"\n",
"### Interpreting Results\n",
"\n",
"- **Successful extraction**: When the extractor finds the requested information, it will return the structured data in the format specified by your template\n",
"- **Missing fields**: Optional fields may be `None` or use default values if the information isn't found in the document\n",
"- **Data types**: Results will be converted to the specified types (string, float, int, etc.) when possible\n",
"- **Confidence**: The accuracy depends on document quality, field descriptiveness, and template complexity\n",
"\n",
"### Pydantic Instance Template Results Explained\n",
"\n",
"If you chose the `pydantic_instance` template, you'll see how contextual defaults work in practice:\n",
"\n",
"- **`invoice_number` and `total`**: These will be extracted from the actual invoice document if found\n",
"- **`vendor_name`**: If the vendor name isn't clearly visible or extractable from our sample invoice, the fallback \"WordPress Invoice Plugin\" will be used instead of `None`\n",
"\n",
"This is useful when you have context about the documents that is more relevant than the default values predefined in the model definition.\n",
"\n",
"For example, in the results below:\n",
"- `invoice_number` and `total` are actually set from the values extracted from the document data\n",
"- If there was no clear `vendor_name` to be extracted, the updated default \"WordPress Invoice Plugin\" we provided would be applied instead of the model's default `None`\n",
"\n",
"### Experimenting with Different Templates\n",
"\n",
"You can easily experiment with different template formats by going back to the template selection cell and changing the `template_to_use` variable, then re-running the extraction. Try comparing the `pydantic_model` vs `pydantic_instance` results to see how the contextual defaults affect the output."
]
},
{
"cell_type": "markdown",
"id": "of2v47p5rd",
"metadata": {},
"source": [
"## Additional Resources\n",
"\n",
"### Documentation\n",
"- [Docling Project Documentation](https://docling-project.github.io/docling/)\n",
"- [Pydantic Documentation](https://docs.pydantic.dev/latest/)\n",
"- [Open Data Hub Data Processing Repository](https://github.com/opendatahub-io/odh-data-processing)\n",
"\n",
"### Next Steps\n",
"- Try extracting information from your own documents\n",
"- Experiment with more complex Pydantic models\n",
"- Explore batch processing of multiple documents\n",
"- Integrate extraction into your data processing pipelines\n",
"\n",
"### Feedback and Contributions\n",
"We welcome feedback and contributions! Please visit the [ODH Data Processing repository](https://github.com/opendatahub-io/odh-data-processing) to:\n",
"- Report issues or bugs\n",
"- Suggest improvements\n",
"- Contribute examples and documentation\n",
"- Share your use cases"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.7"
}
},
Comment on lines +354 to +372
⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify Python version metadata aligns with CI.

The notebook metadata shows Python 3.13.7 (line 370), but a previous review indicated CI uses Python 3.12. Although marked as addressed in commit f188f91, the version still shows 3.13.7. Please verify the current CI Python version and update the metadata accordingly or remove the version field to avoid drift.

Run the following script to check CI Python version:


🏁 Script executed:

#!/bin/bash
# Check Python version specified in CI workflows
fd -e yml -e yaml . .github/workflows -x grep -H "python-version" {} \;

# Check for other Python version specifications
rg -n "python.*version|setup-python" .github/

Length of output: 211


Update notebook metadata to match CI Python version (3.12).

The CI workflow specifies Python 3.12, but the notebook metadata shows Python 3.13.7 (line 370). This mismatch should be corrected by updating the notebook's metadata "version": "3.13.7" to "version": "3.12" to align with the CI environment and prevent potential compatibility issues.

🤖 Prompt for AI Agents
In notebooks/use-cases/information-extraction.ipynb around lines 354 to 372, the
notebook metadata lists Python version "3.13.7"; update the metadata entry
"version": "3.13.7" to "3.12" so the notebook matches the CI Python 3.12
environment; edit the JSON metadata block under "language_info" -> "version" to
"3.12" and save the notebook.
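
As a rough sketch of the metadata change suggested above, the version field can be rewritten on the parsed notebook dictionary. The helper name `set_language_version` and the sample dictionary are hypothetical; the key path `metadata` -> `language_info` -> `version` is the one shown in the notebook JSON in this diff.

```python
import copy


def set_language_version(notebook: dict, version: str) -> dict:
    """Return a copy of the notebook dict with language_info.version replaced."""
    updated = copy.deepcopy(notebook)
    language_info = updated.setdefault("metadata", {}).setdefault("language_info", {})
    language_info["version"] = version
    return updated


# Hypothetical, trimmed-down notebook dict mirroring the metadata block in the diff.
sample = {
    "cells": [],
    "metadata": {"language_info": {"name": "python", "version": "3.13.7"}},
}

updated = set_language_version(sample, "3.12")
print(updated["metadata"]["language_info"]["version"])
```

The function works on a deep copy, so the original dictionary is left untouched; writing the result back to the `.ipynb` file with `json.dump` would be a separate step.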

"nbformat": 4,
"nbformat_minor": 5
}