[RHAIENG-1095] Add information extraction example notebook #28
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,375 @@ | ||
| { | ||
| "cells": [ | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "7c83150f-fa8b-42a1-8974-ce9483912fba", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "# Information Extraction with Docling\n", | ||
| "\n", | ||
| "This notebook demonstrates how to extract structured information from unstructured documents using [Docling](https://docling-project.github.io/docling/). You'll learn how to use different template formats to extract specific data fields from documents like invoices.\n", | ||
| "\n", | ||
| "*Note: The extraction API is currently in beta and may change without prior notice.*" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "d58fb60e", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "%pip install -q docling[vlm] # Install the Docling package with VLM support" | ||
| ] | ||
| }, | ||
|
Comment on lines +16 to +24

🧩 Analysis chain

CI blocker: add a 'parameters' cell (Papermill smoke test). Insert a code cell at the top tagged with 'parameters' and move runtime-tunable values there (e.g., invoice_url, allowed formats). This unblocks the "Smoke Tests for Notebooks". Example cell (ensure cell metadata has tags: ["parameters"]):

    # Parameters
    invoice_url = "https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf"
    ALLOWED_FORMATS = ["PDF", "IMAGE"]  # used below

Verification script:

🏁 Script executed:

    #!/bin/bash
    # Fails if no cell has the 'parameters' tag
    jq -e '[.cells[].metadata.tags? // [] | index("parameters")] | any' notebooks/use-cases/information-extraction.ipynb >/dev/null || {
      echo "Missing 'parameters' tag"; exit 1; }

Length of output: 24

🏁 Script executed:

    #!/bin/bash
    # Extract notebook structure: cells, metadata, kernelspec, and content summary
    jq '{
      kernelspec: .metadata.kernelspec,
      cell_count: (.cells | length),
      cells_summary: [.cells[] | {
        type: .cell_type,
        has_parameters_tag: ((.metadata.tags // []) | contains(["parameters"])),
        tags: (.metadata.tags // []),
        content_preview: (if .cell_type == "code" then (.source | join("") | .[0:100]) else "markdown" end)
      }]
    }' notebooks/use-cases/information-extraction.ipynb

Length of output: 3296

CI blocker confirmed: add a 'parameters' cell (Papermill smoke test). Verification confirms no cell is tagged 'parameters'. Insert a code cell at the top with metadata tag 'parameters' containing runtime-tunable values (e.g., invoice_url, allowed formats). This unblocks the "Smoke Tests for Notebooks" pipeline. Example cell (ensure metadata has tags: ["parameters"]):

    # Parameters
    invoice_url = "https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf"
    ALLOWED_FORMATS = ["PDF", "IMAGE"]

@fabianofranz I added

@alinaryan It checks for a code cell containing the

Got it! I think I added the correct tag this time, it's now passing :)
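
For readers without `jq`, the tag check discussed in this thread can be approximated in Python. This is a minimal sketch, not the project's smoke test: the helper name `has_parameters_cell` and the in-memory sample notebook are hypothetical, and restricting the check to code cells is an assumption based on the reply above.

```python
def has_parameters_cell(notebook: dict) -> bool:
    """Return True if any code cell carries the 'parameters' tag."""
    for cell in notebook.get("cells", []):
        tags = cell.get("metadata", {}).get("tags", [])
        # Assumption: only code cells count, mirroring the reviewer's description.
        if cell.get("cell_type") == "code" and "parameters" in tags:
            return True
    return False


# Tiny in-memory notebook used only for illustration.
sample_notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {"tags": ["parameters"]},
            "source": ['invoice_url = "https://example.com/invoice.pdf"'],
        },
    ]
}

print(has_parameters_cell(sample_notebook))
```

For a notebook on disk, the same function could be applied to the dictionary returned by `json.load` on the `.ipynb` file.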
Comment on lines +16 to +24

Remove duplicate installation cell or reorganize sections. This installation cell appears before the "Installation" section header (line 30). Either remove this cell or move the Installation markdown header above it to maintain logical flow. Based on learnings |
||
| { | ||
| "cell_type": "markdown", | ||
| "id": "jfbjatzrbg", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## Installation\n", | ||
| "\n", | ||
| "First, we need to install the Docling package with VLM (Vision Language Model) support for information extraction:" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "8cf4340c-cfd4-418c-955b-be8c0d544e67", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "from IPython import display\n", | ||
| "from pydantic import BaseModel, Field\n", | ||
| "from rich import print" | ||
| ] | ||
|
Comment on lines +42 to +45

Avoid extra dependency on rich or ensure it’s installed.

    -from rich import print
    +# use built-in print to avoid extra dependency
|
||
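
A third option, besides dropping `rich` or installing it explicitly, is an optional import with a fallback. The sketch below is one possible approach and is not part of this PR; the alias `show` and the sample dictionary are hypothetical.

```python
try:
    # rich.print pretty-prints objects when the package is available.
    from rich import print as show
except ImportError:
    # Fall back to the built-in print so there is no hard dependency on rich.
    show = print

# Illustrative values only, not taken from the sample invoice.
show({"invoice_number": "INV-001", "total": 100.0})
```

Using an alias instead of shadowing `print` keeps it obvious at each call site which function is in use.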
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "99a1kzl8aw8", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## Setup and Configuration\n", | ||
| "\n", | ||
| "Let's import the necessary libraries and set up our environment:" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "6f5e2659-e626-4235-97fc-f311adf8f5b7", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "### Sample Document\n", | ||
| "\n", | ||
| "For this demonstration, we'll work with a sample invoice document. This will help us understand how information extraction works with real-world documents:" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "f0daf568-dd36-4e07-9d61-f2a4f196c449", | ||
| "metadata": { | ||
| "editable": true, | ||
| "raw_mimetype": "", | ||
| "slideshow": { | ||
| "slide_type": "" | ||
| }, | ||
| "tags": [ | ||
| "parameters" | ||
| ] | ||
| }, | ||
| "outputs": [], | ||
| "source": [ | ||
| "invoice_url = \"https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf\"\n", | ||
| "\n", | ||
| "display.IFrame(invoice_url, width=\"100%\", height=600)" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "20fa56da", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "### Document Extractor Setup\n", | ||
| "\n", | ||
| "Now let's configure the document extractor to handle PDF and image formats. The extractor is the main component that will process our documents and extract information using the templates we define:" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "46dd4e6e", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "from docling.datamodel.base_models import InputFormat\n", | ||
| "from docling.document_extractor import DocumentExtractor\n", | ||
| "\n", | ||
| "extractor = DocumentExtractor(allowed_formats=[InputFormat.IMAGE, InputFormat.PDF])" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "extraction-templates", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## Information Extraction with Templates\n", | ||
| "\n", | ||
| "Docling supports different template formats for [information extraction](https://docling-project.github.io/docling/examples/extraction/). Templates define the structure and data types of the information you want to extract from documents.\n", | ||
| "\n", | ||
| "### Configure extraction templates\n", | ||
| "\n", | ||
| "The next cell configures different template formats available for information extraction. Each template format has its own advantages and use cases:\n", | ||
| "\n", | ||
| "### Template Selection Guidelines\n", | ||
| "\n", | ||
| "- **String templates**: Simple JSON format, fastest to write and good for basic extractions\n", | ||
| " - Best for: Quick prototyping, simple data structures, minimal setup\n", | ||
| " - Example use case: Extracting just a few basic fields like invoice number and total\n", | ||
| "\n", | ||
| "\n", | ||
| "- **Dictionary templates**: Python dictionaries, provides better integration with Python code \n", | ||
| " - Best for: Structured data with nested objects, better Python integration\n", | ||
| " - Example use case: When you need nested data structures or complex field relationships\n", | ||
| "\n", | ||
| "\n", | ||
| "- **Pydantic model templates**: Recommended for production use with type validation\n", | ||
| " - Best for: Production applications, type safety, complex data structures, documentation\n", | ||
| " - Example use case: Enterprise applications where data validation and type safety are critical\n", | ||
| "\n", | ||
| "\n", | ||
| "- **Pydantic instance templates**: Useful when you need specific default values that override the model defaults\n", | ||
| " - Best for: When you have contextual information that should be used as fallbacks\n", | ||
| " - Example use case: Processing invoices from a specific vendor where you know the vendor name but want to extract it if present, or using a known invoice series number as a fallback\n", | ||
| " - **Why this matters**: In our sample invoice, if the vendor name isn't clearly extractable but you're processing a batch from \"WordPress Invoice Plugin\", you can set that as the default while still allowing extraction to override it\n", | ||
| "\n", | ||
| "In a later cell you'll choose which template format to use for the actual extraction." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "template-definition", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "from typing import Optional\n", | ||
| "\n", | ||
| "# String template format - simple JSON string\n", | ||
| "string_template = '{\"invoice_number\": \"string\", \"total\": \"float\"}'\n", | ||
| "\n", | ||
| "# Dictionary template format - Python dictionary\n", | ||
| "dict_template = {\"invoice_number\": \"string\", \"total\": \"float\"}\n", | ||
| "\n", | ||
| "# Pydantic model template format - class definition with validation\n", | ||
| "# Notice how we use Field() with examples and defaults - this improves extraction accuracy!\n", | ||
| "class Invoice(BaseModel):\n", | ||
| " invoice_number: str = Field(examples=[\"INV-001\", \"12345\"])\n", | ||
| " total: float = Field(default=10, examples=[100.0, 250.50])\n", | ||
| " vendor_name: Optional[str] = Field(default=None, examples=[\"ACME Corp\", \"Tech Solutions Inc\"])\n", | ||
| "\n", | ||
| "pydantic_model_template = Invoice\n", | ||
| "\n", | ||
| "# Pydantic instance template format - model instance with specific defaults\n", | ||
| "# This is useful when you have contextual information about the documents you're processing.\n", | ||
| "# In this example, imagine you're processing a batch of invoices from a specific context:\n", | ||
| "pydantic_instance_template = Invoice(\n", | ||
| " invoice_number=\"WP-UNKNOWN\", # Fallback for WordPress invoice plugin documents\n", | ||
| " total=0.0, # Safe default when total can't be extracted\n", | ||
| " vendor_name=\"WordPress Invoice Plugin\" # Known vendor for this document batch\n", | ||
| ")\n", | ||
| "\n", | ||
| "# Why use Pydantic instances? Consider this scenario with our sample invoice:\n", | ||
| "# - invoice_number and total will be extracted from the document if found\n", | ||
| "# - If vendor_name isn't clearly extractable, \"WordPress Invoice Plugin\" will be used\n", | ||
| "# - If invoice_number is missing, \"WP-UNKNOWN\" provides a meaningful fallback\n", | ||
| "# - This is very useful for batch processing where you have context about the document source\n", | ||
| "\n", | ||
| "def get_extraction_template(template_name: str = \"string\"):\n", | ||
| " \"\"\"Get the configured extraction template based on name.\n", | ||
| "\n", | ||
| " Args:\n", | ||
| " template_name: One of \"string\", \"dict\", \"pydantic_model\", or \"pydantic_instance\"\n", | ||
| " \n", | ||
| " Returns:\n", | ||
| " Template for extraction\n", | ||
| " \n", | ||
| " Raises:\n", | ||
| " ValueError: If template_name is not recognized\n", | ||
| " \"\"\"\n", | ||
| " templates = {\n", | ||
| " \"string\": string_template,\n", | ||
| " \"dict\": dict_template,\n", | ||
| " \"pydantic_model\": pydantic_model_template,\n", | ||
| " \"pydantic_instance\": pydantic_instance_template\n", | ||
| " }\n", | ||
| "\n", | ||
| " if template_name not in templates:\n", | ||
| " raise ValueError(\n", | ||
| " f\"Unknown template name: '{template_name}'. \"\n", | ||
| " f\"Choose from {list(templates.keys())}\"\n", | ||
| " )\n", | ||
| " \n", | ||
| " return templates[template_name]\n", | ||
| "\n", | ||
| "# Tips for Better Extraction:\n", | ||
| "print(\"💡 Tips for Better Extraction:\")\n", | ||
| "print(\"1. Use descriptive field names that clearly indicate what information you're looking for\")\n", | ||
| "print(\"2. Provide examples in Pydantic Field definitions to guide the extraction\") \n", | ||
| "print(\"3. Specify appropriate data types (string, float, int, etc.) for better accuracy\")\n", | ||
| "print(\"4. Use optional fields for data that might not always be present\")\n", | ||
| "print(\"5. Test with different template formats to find what works best for your use case\")\n", | ||
| "print(\"\")\n", | ||
| "print(\"🔧 Pydantic Instance Use Case:\")\n", | ||
| "print(\"In our sample invoice, the Pydantic instance template provides:\")\n", | ||
| "print(\"- Known vendor fallback: 'WordPress Invoice Plugin' (useful if vendor name is unclear)\")\n", | ||
| "print(\"- Meaningful invoice number fallback: 'WP-UNKNOWN' (better than generic defaults)\")\n", | ||
| "print(\"- Safe total fallback: 0.0 (prevents errors if extraction fails)\")\n", | ||
| "print(\"- Extracted data will still override these defaults when found in the document\")" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "vlm-pipeline", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "### Choose an extraction template\n", | ||
| "\n", | ||
| "Next we choose the template format to be used for information extraction.\n", | ||
| "\n", | ||
| "Each template format has different characteristics:\n", | ||
| "\n", | ||
| "- **string**: Simple JSON format, fastest to write and good for basic extractions\n", | ||
| "- **dict**: Python dictionary, provides better integration with Python code \n", | ||
| "- **pydantic_model**: Pydantic model class, recommended for production use with type validation\n", | ||
| "- **pydantic_instance**: Pydantic model instance, useful when you need specific default values\n", | ||
| "\n", | ||
| "Just set `template_to_use` to one of the available template formats." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "vlm-setup", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "# Set the template to use (choose from: \"string\", \"dict\", \"pydantic_model\", \"pydantic_instance\")\n", | ||
| "template_to_use = \"pydantic_model\"\n", | ||
| "\n", | ||
| "extraction_template = get_extraction_template(template_to_use)\n", | ||
| "\n", | ||
| "print(f\"✓ Using '{template_to_use}' template format\")\n", | ||
| "print(f\"Template: {extraction_template}\")" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "extraction-demo", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## ✨ Information Extraction\n", | ||
| "\n", | ||
| "Now we'll perform the information extraction using the selected template format. The extractor will analyze the invoice document and extract the structured information according to the template we configured." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "run-extraction", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "# Perform information extraction using the selected template\n", | ||
| "print(f\"Extracting information using '{template_to_use}' template...\")\n", | ||
| "\n", | ||
| "result = extractor.extract(invoice_url, template=extraction_template)\n", | ||
| "\n", | ||
| "print(\"✓ Extraction completed successfully!\")\n", | ||
| "print(\"Extracted data:\")\n", | ||
| "print(result)" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "ls58xx8sbwp", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## Understanding the Results\n", | ||
|
The

Added |
||
| "\n", | ||
| "The extraction results contain the structured data extracted from the document according to your selected template. The extractor uses vision-language models to understand the document content and map it to the requested fields.\n", | ||
| "\n", | ||
| "### Interpreting Results\n", | ||
| "\n", | ||
| "- **Successful extraction**: When the extractor finds the requested information, it will return the structured data in the format specified by your template\n", | ||
| "- **Missing fields**: Optional fields may be `None` or use default values if the information isn't found in the document\n", | ||
| "- **Data types**: Results will be converted to the specified types (string, float, int, etc.) when possible\n", | ||
| "- **Confidence**: The accuracy depends on document quality, field descriptiveness, and template complexity\n", | ||
| "\n", | ||
| "### Pydantic Instance Template Results Explained\n", | ||
| "\n", | ||
| "If you chose the `pydantic_instance` template, you'll see how contextual defaults work in practice:\n", | ||
| "\n", | ||
| "- **`invoice_number` and `total`**: These will be extracted from the actual invoice document if found\n", | ||
| "- **`vendor_name`**: If the vendor name isn't clearly visible or extractable from our sample invoice, the fallback \"WordPress Invoice Plugin\" will be used instead of `None`\n", | ||
| "\n", | ||
| "This is useful when you have context about a document batch that is more relevant than the defaults declared in the model definition.\n", | ||
| "\n", | ||
| "For example, in the extraction output above:\n", | ||
| "- `invoice_number` and `total` are set from the values extracted from the document\n", | ||
| "- If no clear `vendor_name` can be extracted, the default we provided (\"WordPress Invoice Plugin\") is applied instead of the model's default `None`\n", | ||
| "\n", | ||
| "### Experimenting with Different Templates\n", | ||
| "\n", | ||
| "You can easily experiment with different template formats by going back to the template selection cell and changing the `template_to_use` variable, then re-running the extraction. Try comparing the `pydantic_model` vs `pydantic_instance` results to see how the contextual defaults affect the output." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "of2v47p5rd", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## Additional Resources\n", | ||
| "\n", | ||
| "### Documentation\n", | ||
| "- [Docling Project Documentation](https://docling-project.github.io/docling/)\n", | ||
| "- [Pydantic Documentation](https://docs.pydantic.dev/latest/)\n", | ||
| "- [Open Data Hub Data Processing Repository](https://github.com/opendatahub-io/odh-data-processing)\n", | ||
| "\n", | ||
| "### Next Steps\n", | ||
| "- Try extracting information from your own documents\n", | ||
| "- Experiment with more complex Pydantic models\n", | ||
| "- Explore batch processing of multiple documents\n", | ||
| "- Integrate extraction into your data processing pipelines\n", | ||
| "\n", | ||
| "### Feedback and Contributions\n", | ||
| "We welcome feedback and contributions! Please visit the [ODH Data Processing repository](https://github.com/opendatahub-io/odh-data-processing) to:\n", | ||
| "- Report issues or bugs\n", | ||
| "- Suggest improvements\n", | ||
| "- Contribute examples and documentation\n", | ||
| "- Share your use cases" | ||
| ] | ||
| } | ||
| ], | ||
| "metadata": { | ||
| "kernelspec": { | ||
| "display_name": "Python 3 (ipykernel)", | ||
| "language": "python", | ||
| "name": "python3" | ||
| }, | ||
| "language_info": { | ||
| "codemirror_mode": { | ||
| "name": "ipython", | ||
| "version": 3 | ||
| }, | ||
| "file_extension": ".py", | ||
| "mimetype": "text/x-python", | ||
| "name": "python", | ||
| "nbconvert_exporter": "python", | ||
| "pygments_lexer": "ipython3", | ||
| "version": "3.13.7" | ||
| } | ||
| }, | ||
coderabbitai[bot] marked this conversation as resolved.
Comment on lines +354 to +372

🧩 Analysis chain

Verify Python version metadata aligns with CI. The notebook metadata shows Python 3.13.7 (line 370), but a previous review indicated CI uses Python 3.12. Although marked as addressed in commit f188f91, the version still shows 3.13.7. Please verify the current CI Python version and update the metadata accordingly or remove the version field to avoid drift. Run the following script to check CI Python version:

🏁 Script executed:

    #!/bin/bash
    # Check Python version specified in CI workflows
    fd -e yml -e yaml . .github/workflows -x grep -H "python-version" {} \;
    # Check for other Python version specifications
    rg -n "python.*version|setup-python" .github/

Length of output: 211

Update notebook metadata to match CI Python version (3.12). The CI workflow specifies Python 3.12, but the notebook metadata shows Python 3.13.7 (line 370). This mismatch should be corrected by updating the notebook's metadata |
||
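
The version drift described in this comment could also be caught by a small check on the notebook JSON. The following is a sketch under assumptions: the helper names `notebook_python_minor` and `matches_ci` are hypothetical, and the CI version is passed in by the caller rather than read from the workflow files.

```python
from typing import Optional


def notebook_python_minor(notebook: dict) -> Optional[str]:
    """Return 'major.minor' from metadata.language_info.version, or None if absent."""
    version = notebook.get("metadata", {}).get("language_info", {}).get("version")
    if not version:
        return None
    return ".".join(version.split(".")[:2])


def matches_ci(notebook: dict, ci_version: str) -> bool:
    """True when the notebook records no version or the same major.minor as CI."""
    recorded = notebook_python_minor(notebook)
    return recorded is None or recorded == ci_version


# Metadata shaped like the notebook in this diff.
sample = {"metadata": {"language_info": {"name": "python", "version": "3.13.7"}}}
print(notebook_python_minor(sample), matches_ci(sample, "3.12"))
```

Treating a missing version field as a match reflects the reviewer's alternative of removing the field to avoid drift.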
| "nbformat": 4, | ||
| "nbformat_minor": 5 | ||
| } | ||
This installation cell appears before the 'Installation' section header (line 30). Consider removing this duplicate installation command or moving the markdown section above it to maintain logical flow.
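
Separately from the cell-ordering point, the fallback behavior that the notebook describes for the `pydantic_instance` template (extracted values override the instance's defaults, and defaults fill in whatever is missing) can be illustrated with plain dataclasses. This is only a sketch of those semantics as the notebook text describes them. It is not how Docling implements extraction, and `InvoiceDefaults`, `apply_extracted`, and the sample values are hypothetical.

```python
from dataclasses import dataclass, fields, replace
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class InvoiceDefaults:
    # Defaults mirror the pydantic_instance_template in the notebook.
    invoice_number: str = "WP-UNKNOWN"
    total: float = 0.0
    vendor_name: Optional[str] = "WordPress Invoice Plugin"


def apply_extracted(defaults: InvoiceDefaults, extracted: Dict[str, Any]) -> InvoiceDefaults:
    """Overlay extracted values on the defaults, skipping unknown or empty fields."""
    known = {f.name for f in fields(defaults)}
    found = {k: v for k, v in extracted.items() if k in known and v is not None}
    return replace(defaults, **found)


# Hypothetical extraction output in which the vendor name was not found.
extracted = {"invoice_number": "INV-001", "total": 100.0, "vendor_name": None}
result = apply_extracted(InvoiceDefaults(), extracted)
print(result)
```

Here `invoice_number` and `total` come from the extracted values, while `vendor_name` keeps the contextual default because nothing was extracted for it.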