|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "id": "7c83150f-fa8b-42a1-8974-ce9483912fba", |
| 6 | + "metadata": {}, |
| 7 | + "source": [ |
| 8 | + "# Information Extraction with Docling\n", |
| 9 | + "\n", |
| 10 | + "This notebook demonstrates how to extract structured information from unstructured documents using [Docling](https://docling-project.github.io/docling/). You'll learn how to use different template formats to extract specific data fields from documents like invoices.\n", |
| 11 | + "\n", |
| 12 | + "*Note: The extraction API is currently in beta and may change without prior notice.*" |
| 13 | + ] |
| 14 | + }, |
| 15 | + { |
| 16 | + "cell_type": "markdown", |
| 17 | + "id": "jfbjatzrbg", |
| 18 | + "metadata": {}, |
| 19 | + "source": [ |
| 20 | + "## Installation\n", |
| 21 | + "\n", |
| 22 | + "First, we need to install the Docling package with VLM (Vision Language Model) support for information extraction:" |
| 23 | + ] |
| 24 | + }, |
| 25 | + { |
| 26 | + "cell_type": "code", |
| 27 | + "execution_count": null, |
| 28 | + "id": "d58fb60e", |
| 29 | + "metadata": {}, |
| 30 | + "outputs": [], |
| 31 | + "source": [ |
| 32 | + "%pip install -q docling[vlm] # Install the Docling package with VLM support" |
| 33 | + ] |
| 34 | + }, |
| 35 | + { |
| 36 | + "cell_type": "markdown", |
| 37 | + "id": "99a1kzl8aw8", |
| 38 | + "metadata": {}, |
| 39 | + "source": [ |
| 40 | + "## Setup and Configuration\n", |
| 41 | + "\n", |
| 42 | + "Let's import the necessary libraries and set up our environment:" |
| 43 | + ] |
| 44 | + }, |
| 45 | + { |
| 46 | + "cell_type": "code", |
| 47 | + "execution_count": null, |
| 48 | + "id": "8cf4340c-cfd4-418c-955b-be8c0d544e67", |
| 49 | + "metadata": {}, |
| 50 | + "outputs": [], |
| 51 | + "source": [ |
| 52 | + "from IPython import display\n", |
| 53 | + "from pydantic import BaseModel, Field\n", |
| 54 | + "from rich import print" |
| 55 | + ] |
| 56 | + }, |
| 57 | + { |
| 58 | + "cell_type": "markdown", |
| 59 | + "id": "6f5e2659-e626-4235-97fc-f311adf8f5b7", |
| 60 | + "metadata": {}, |
| 61 | + "source": [ |
| 62 | + "### Sample Document\n", |
| 63 | + "\n", |
| 64 | + "For this demonstration, we'll work with a sample invoice document. This will help us understand how information extraction works with real-world documents:" |
| 65 | + ] |
| 66 | + }, |
| 67 | + { |
| 68 | + "cell_type": "code", |
| 69 | + "execution_count": null, |
| 70 | + "id": "d8d12166-8b40-4c46-9147-27cfc1c8b09a", |
| 71 | + "metadata": {}, |
| 72 | + "outputs": [], |
| 73 | + "source": [ |
| 74 | + "# Parameters\n", |
| 75 | + "invoice_url = \"https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf\"\n", |
| 76 | + "\n", |
| 77 | + "display.HTML(f'''\n", |
| 78 | + "<iframe src=\"{invoice_url}\" width=\"100%\" height=\"600px\">\n", |
| 79 | + " <p>Your browser does not support iframes. <a href=\"{invoice_url}\">Click here to view the invoice</a></p>\n", |
| 80 | + "</iframe>\n", |
| 81 | + "''')" |
| 82 | + ] |
| 83 | + }, |
| 84 | + { |
| 85 | + "cell_type": "markdown", |
| 86 | + "id": "20fa56da", |
| 87 | + "metadata": {}, |
| 88 | + "source": [ |
| 89 | + "### Document Extractor Setup\n", |
| 90 | + "\n", |
| 91 | + "Now let's configure the document extractor to handle PDF and image formats. The extractor is the main component that will process our documents and extract information using the templates we define:" |
| 92 | + ] |
| 93 | + }, |
| 94 | + { |
| 95 | + "cell_type": "code", |
| 96 | + "execution_count": null, |
| 97 | + "id": "46dd4e6e", |
| 98 | + "metadata": {}, |
| 99 | + "outputs": [], |
| 100 | + "source": [ |
| 101 | + "from docling.datamodel.base_models import InputFormat\n", |
| 102 | + "from docling.document_extractor import DocumentExtractor\n", |
| 103 | + "\n", |
| 104 | + "extractor = DocumentExtractor(allowed_formats=[InputFormat.IMAGE, InputFormat.PDF])" |
| 105 | + ] |
| 106 | + }, |
| 107 | + { |
| 108 | + "cell_type": "markdown", |
| 109 | + "id": "extraction-templates", |
| 110 | + "metadata": {}, |
| 111 | + "source": [ |
| 112 | + "## Information Extraction with Templates\n", |
| 113 | + "\n", |
| 114 | + "Docling supports different template formats for information extraction. Templates define the structure and data types of the information you want to extract from documents. Let's explore the different approaches:\n", |
| 115 | + "\n", |
| 116 | + "### String Template Format\n", |
| 117 | + "\n", |
| 118 | + "A template can be a JSON string that maps each field name to its expected type. This is the simplest approach for basic extraction needs:" |
| 119 | + ] |
| 120 | + }, |
| 121 | + { |
| 122 | + "cell_type": "code", |
| 123 | + "execution_count": null, |
| 124 | + "id": "template-definition", |
| 125 | + "metadata": {}, |
| 126 | + "outputs": [], |
| 127 | + "source": [ |
| 128 | + "res = extractor.extract(invoice_url, template='{\"invoice_number\": \"string\", \"total\": \"float\"}')\n", |
| 129 | + "print(res)" |
| 130 | + ] |
| 131 | + }, |
| 132 | + { |
| 133 | + "cell_type": "markdown", |
| 134 | + "id": "vlm-pipeline", |
| 135 | + "metadata": {}, |
| 136 | + "source": [ |
| 137 | + "### Dictionary Template Format\n", |
| 138 | + "\n", |
| 139 | + "Templates can also be Python dictionaries, which provides better integration with Python code and allows for more complex data structures:" |
| 140 | + ] |
| 141 | + }, |
| 142 | + { |
| 143 | + "cell_type": "code", |
| 144 | + "execution_count": null, |
| 145 | + "id": "vlm-setup", |
| 146 | + "metadata": {}, |
| 147 | + "outputs": [], |
| 148 | + "source": [ |
| 149 | + "res = extractor.extract(invoice_url, template={\"invoice_number\": \"string\", \"total\": \"float\"})\n", |
| 150 | + "print(res)" |
| 151 | + ] |
| 152 | + }, |
| 153 | + { |
| 154 | + "cell_type": "markdown", |
| 155 | + "id": "extraction-demo", |
| 156 | + "metadata": {}, |
| 157 | + "source": [ |
| 158 | + "### Pydantic Model Template Format\n", |
| 159 | + "\n", |
| 160 | + "For more advanced use cases, templates can be [Pydantic](https://docs.pydantic.dev/latest/) model classes or instances. This approach provides type validation, default values, and better documentation:" |
| 161 | + ] |
| 162 | + }, |
| 163 | + { |
| 164 | + "cell_type": "code", |
| 165 | + "execution_count": null, |
| 166 | + "id": "run-extraction", |
| 167 | + "metadata": {}, |
| 168 | + "outputs": [], |
| 169 | + "source": [ |
| 170 | + "from typing import Optional\n", |
| 171 | + "\n", |
| 172 | + "class Invoice(BaseModel):\n", |
| 173 | + " invoice_number: str = Field(examples=[\"INV-001\", \"12345\"])\n", |
| 174 | + " total: float = Field(default=10.0, examples=[100.0, 250.50])\n", |
| 175 | + " vendor_name: Optional[str] = Field(default=None, examples=[\"ACME Corp\", \"Tech Solutions Inc\"])\n", |
| 176 | + "\n", |
| 177 | + "res = extractor.extract(invoice_url, template=Invoice)\n", |
| 178 | + "print(res)" |
| 179 | + ] |
| 180 | + }, |
| 181 | + { |
| 182 | + "cell_type": "markdown", |
| 183 | + "id": "evaluation", |
| 184 | + "metadata": {}, |
| 185 | + "source": [ |
| 186 | + "### Pydantic Instance Template Format\n", |
| 187 | + "\n", |
| 188 | + "You can also use a Pydantic model instance as a template, which allows you to override the defaults and provide specific fallback values:" |
| 189 | + ] |
| 190 | + }, |
| 191 | + { |
| 192 | + "cell_type": "code", |
| 193 | + "execution_count": null, |
| 194 | + "id": "evaluate-results", |
| 195 | + "metadata": {}, |
| 196 | + "outputs": [], |
| 197 | + "source": [ |
| 198 | + "template_instance = Invoice(invoice_number=\"999\", total=999.99, vendor_name=\"Default Vendor\")\n", |
| 199 | + "\n", |
| 200 | + "res = extractor.extract(invoice_url, template=template_instance)\n", |
| 201 | + "print(res)" |
| 202 | + ] |
| 203 | + }, |
| 204 | + { |
| 205 | + "cell_type": "markdown", |
| 206 | + "id": "ls58xx8sbwp", |
| 207 | + "metadata": {}, |
| 208 | + "source": [ |
| 209 | + "## Understanding the Results\n", |
| 210 | + "\n", |
| 211 | + "Each result holds the structured data that Docling pulled from the document, with fields matching your template. The extractor uses vision-language models to understand the document content and map it to the requested fields.\n", |
| 212 | + "\n", |
| 213 | + "### Template Selection Guidelines\n", |
| 214 | + "\n", |
| 215 | + "- **String templates**: Best for simple, quick extractions with basic data types\n", |
| 216 | + "- **Dictionary templates**: Good for structured data with nested objects\n", |
| 217 | + "- **Pydantic models**: Recommended for production use, providing type safety and validation\n", |
| 218 | + "- **Pydantic instances**: Useful when you need specific default values or fallbacks\n", |
| 219 | + "\n", |
| 220 | + "### Tips for Better Extraction\n", |
| 221 | + "\n", |
| 222 | + "1. **Use descriptive field names** that clearly indicate what information you're looking for\n", |
| 223 | + "2. **Provide examples** in Pydantic Field definitions to guide the extraction\n", |
| 224 | + "3. **Specify appropriate data types** (string, float, int, etc.) for better accuracy\n", |
| 225 | + "4. **Use optional fields** for data that might not always be present\n", |
| 226 | + "5. **Test with different template formats** to find what works best for your use case" |
| 227 | + ] |
| 228 | + }, |
| 229 | + { |
| 230 | + "cell_type": "markdown", |
| 231 | + "id": "of2v47p5rd", |
| 232 | + "metadata": {}, |
| 233 | + "source": [ |
| 234 | + "## Additional Resources\n", |
| 235 | + "\n", |
| 236 | + "### Documentation\n", |
| 237 | + "- [Docling Project Documentation](https://docling-project.github.io/docling/)\n", |
| 238 | + "- [Pydantic Documentation](https://docs.pydantic.dev/latest/)\n", |
| 239 | + "- [Open Data Hub Data Processing Repository](https://github.com/opendatahub-io/odh-data-processing)\n", |
| 240 | + "\n", |
| 241 | + "### Next Steps\n", |
| 242 | + "- Try extracting information from your own documents\n", |
| 243 | + "- Experiment with more complex Pydantic models\n", |
| 244 | + "- Explore batch processing of multiple documents\n", |
| 245 | + "- Integrate extraction into your data processing pipelines\n", |
| 246 | + "\n", |
| 247 | + "### Feedback and Contributions\n", |
| 248 | + "We welcome feedback and contributions! Please visit the [ODH Data Processing repository](https://github.com/opendatahub-io/odh-data-processing) to:\n", |
| 249 | + "- Report issues or bugs\n", |
| 250 | + "- Suggest improvements\n", |
| 251 | + "- Contribute examples and documentation\n", |
| 252 | + "- Share your use cases" |
| 253 | + ] |
| 254 | + } |
| 255 | + ], |
| 256 | + "metadata": { |
| 257 | + "kernelspec": { |
| 258 | + "display_name": "Python 3 (ipykernel)", |
| 259 | + "language": "python", |
| 260 | + "name": "python3" |
| 261 | + }, |
| 262 | + "language_info": { |
| 263 | + "codemirror_mode": { |
| 264 | + "name": "ipython", |
| 265 | + "version": 3 |
| 266 | + }, |
| 267 | + "file_extension": ".py", |
| 268 | + "mimetype": "text/x-python", |
| 269 | + "name": "python", |
| 270 | + "nbconvert_exporter": "python", |
| 271 | + "pygments_lexer": "ipython3", |
| 272 | + "version": "3.13.7" |
| 273 | + } |
| 274 | + }, |
| 275 | + "nbformat": 4, |
| 276 | + "nbformat_minor": 5 |
| 277 | +} |