Skip to content

Example of how to create to extract PDFs from an Azure Storage Account, process them using Azure Document Intelligence, and store the results in Cosmos DB for further analysis.

License

Notifications You must be signed in to change notification settings

MicrosoftCloudEssentials-LearningHub/PDF-Processing-Fapp-DocIntelligence

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

24 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Automated PDF Invoice Processing (full-code approach)

Azure Storage + Document Intelligence + Function App + Cosmos DB

Costa Rica

GitHub GitHub brown9804

Last updated: 2025-05-16


Important

This example is based on a public network site and is intended for demonstration purposes only. It showcases how several Azure resources can work together to achieve the desired result. Consider the section below about Important Considerations for Production Environment. Please note that these demos are intended as a guide and are based on my personal experiences. For official guidance, support, or more detailed information, please refer to Microsoft's official documentation or contact Microsoft directly: Microsoft Sales and Support

How to parse PDFs from an Azure Storage Account, process them using Azure Document Intelligence, and store the results in Cosmos DB for further analysis.

  1. Upload your PDFs to an Azure Blob Storage container.
  2. An Azure Function is triggered by the upload, which calls the Azure Document Intelligence API to analyze the PDFs.
  3. The extracted data is parsed and subsequently stored in a Cosmos DB database, ensuring a seamless and automated workflow from document upload to data storage.

Note

Advantages of Document Intelligence for organizations handling with large volumes of documents:

  • Utilizes natural language processing, computer vision, deep learning, and machine learning.
  • Handles structured, semi-structured, and unstructured documents.
  • Automates the extraction and transformation of data into usable formats like JSON or CSV
image
List of References (Click to expand)
Table of Content (Click to expand)

Important Considerations for Production Environment

Private Network Configuration

For enhanced security, consider configuring your Azure resources to operate within a private network. This can be achieved using Azure Virtual Network (VNet) to isolate your resources and control inbound and outbound traffic. Implementing private endpoints for services like Azure Blob Storage and Azure Functions can further secure your data by restricting access to your VNet.

Security

Ensure that you implement appropriate security measures when deploying this solution in a production environment. This includes:

  • Securing Access: Use Azure Entra ID (formerly known as Azure Active Directory or Azure AD) for authentication and role-based access control (RBAC) to manage permissions.
  • Managing Secrets: Store sensitive information such as connection strings and API keys in Azure Key Vault.
  • Data Encryption: Enable encryption for data at rest and in transit to protect sensitive information.
Scalability

While this example provides a basic setup, you may need to scale the resources based on your specific requirements. Azure services offer various scaling options to handle increased workloads. Consider using:

  • Auto-scaling: Configure auto-scaling for Azure Functions and other services to automatically adjust based on demand.
  • Load Balancing: Use Azure Load Balancer or Application Gateway to distribute traffic and ensure high availability.
Cost Management

Monitor and manage the costs associated with your Azure resources. Use Azure Cost Management and Billing to track usage and optimize resource allocation.

Compliance

Ensure that your deployment complies with relevant regulations and standards. Use Azure Policy to enforce compliance and governance policies across your resources.

Disaster Recovery

Implement a disaster recovery plan to ensure business continuity in case of failures. Use Azure Site Recovery and backup solutions to protect your data and applications.

Overview

Azure Document Intelligence, formerly known as Form Recognizer, is a powerful AI service that extracts structured data from documents. It uses machine learning models to analyze and process various types of documents, such as invoices, receipts, business cards, and more.

Key Features Details
Prebuilt Models - Invoice Model: Extracts fields like invoice ID, date, vendor information, line items, totals, and more.
- Receipt Model: Extracts merchant name, transaction date, total amount, and line items.
- Business Card Model: Extracts contact information such as name, company, phone number, and email.
Custom Models - Training: You can train custom models using labeled data. This involves uploading a set of documents and manually labeling the fields you want to extract.
- Model Management: Manage versions of your custom models, retrain them with new data, and evaluate their performance.
APIs and SDKs - REST API: Provides endpoints for analyzing documents, managing models, and retrieving results.
- SDKs: Available in multiple languages (e.g., Python, C#, JavaScript) to simplify integration into your applications.

Important

Regarding Networking, this example will cover Public access configuration, and system-managed identity. However, please ensure you review your privacy requirements and adjust network and access settings as necessary for your specific case.

Step 1: Set Up Your Azure Environment

An Azure Resource Group is a container that holds related resources for an Azure solution. It can include all the resources for the solution or only those you want to manage as a group. Typically, resources that share the same lifecycle are added to the same resource group, allowing for easier deployment, updating, and deletion as a unit. Resource groups also store metadata about the resources, and you can apply access control, locks, and tags to them for better management and organization.

  1. Create an Azure Account: If you don't have one, sign up for an Azure account.
  2. Create a Resource Group:
    • Go to the Azure portal.

    • Navigate to Resource groups.

    • Click + Create.

      image
    • Enter the Resource Group name (e.g., RGContosoAIDoc) and select a region (e.g., East US 2). You can add tags if needed.

    • Click Review + create and then Create.

      image

Step 2: Set Up Azure Blob Storage for PDF Ingestion

Create a Storage Account

An Azure Storage Account provides a unique namespace in Azure for your data, allowing you to store and manage various types of data such as blobs, files, queues, and tables. It serves as the foundation for all Azure Storage services, ensuring high availability, scalability, and security for your data.

  • In the Azure portal, navigate to your Resource Group.

  • Click + Create.

    image
  • Search for Storage Account.

    image
  • Select the Resource Group you created.

  • Enter a Storage Account name (e.g., invoicecontosostorage).

  • Choose the region and performance options, and click Next to continue.

    image
  • If you need to modify anything related to Security, Access protocols, Blob Storage Tier, you can do that in the Advanced tab.

    image
  • Regarding Networking, this example will cover Public access configuration. However, please ensure you review your privacy requirements and adjust network and access settings as necessary for your specific case.

    image
  • Click Review + create and then Create. Once is done, you'll be able to see it in your Resource Group.

    image

Create a Blob Container

A Blob Container is a logical grouping of blobs within an Azure Storage Account, similar to a directory in a file system. Containers help organize and manage blobs, which can be any type of unstructured data like text or binary data. Each container can store an unlimited number of blobs, and you must create a container before uploading any blobs.

Within the Storage Account, create a Blob Container to store your PDFs.

  • Go to your Storage Account.

  • Under Data storage, select Containers.

  • Click + Container.

  • Enter a name for the container (e.g., pdfinvoices) and set the public access level to Private.

  • Click Create.

    image

Allow storage account key access

If you plan to use access keys, please ensure that the setting "Allow storage account key access" is enabled. When this setting is disabled, any requests to the account authorized with Shared Key, including shared access signatures (SAS), will be denied. Click here to learn more

image

Step 3: Set Up Azure Cosmos DB

Create a Cosmos DB Account

Azure Cosmos DB is a globally distributed,multi-model database service provided by Microsoft Azure. It is designed to offer high availability, scalability, and low-latency access to data for modern applications. Unlike traditional relational databases, Cosmos DB is a NoSQL database, meaning it can handle unstructured, semi-structured, and structured data types. It supports multiple data models, including document, key-value, graph, and column-family, making it versatile for various use cases.

  • In the Azure portal, navigate to your Resource Group.

  • Click + Create.

  • Search for Cosmos DB, click on Create:

    image
  • Choose your desired API type, for this will be using Azure Cosmos DB for NoSQL. This option supports a SQL-like query language, which is familiar and powerful for querying and analyzing your invoice data. It also integrates well with various client libraries, making development easier and more flexible.

    image
  • Please enter an account name (e.g., contosoinvoiceaicosmos). As with the previously configured resources, we will use the Public network for this example. Ensure that you adjust the architecture to include your networking requirements.

  • Select the region and other settings.

  • Click Review + create and then Create.

    image

Create a Database and Container

An Azure Cosmos DB container is a logical unit within a Cosmos DB database where data is stored. Containers are schema-agnostic, meaning they can store items with different structures. Each container is automatically partitioned to scale out across multiple servers, providing virtually unlimited throughput and storage. Containers are the primary scalability unit in Cosmos DB, and they use a partition key to distribute data efficiently across partitions.

  • Go to your Cosmos DB account.

  • Under Data Explorer, click New Database.

    image
  • Enter a database name (e.g., ContosoDBDocIntellig) and click OK.

    image
  • Click New Container.

    image
  • Enter a container name (e.g., Invoices) and set the partition key (e.g., /transactionId).

  • Click OK.

    image image

Step 4: Set Up Azure Document Intelligence

Azure Document Intelligence offers robust capabilities for extracting structured data from various document types using advanced machine learning models. Technically, it provides prebuilt models for common documents like invoices, receipts, and business cards, which can quickly extract key information without custom training. For more specific needs, it allows training custom models using labeled data, enabling precise extraction tailored to unique document formats. The service is accessible via REST APIs and SDKs in multiple languages, facilitating seamless integration into applications. It supports key-value pair extraction, table recognition, and text extraction, making it a powerful tool for automating data entry, enhancing document management systems, and streamlining business processes.

Create Document Intelligence Resource

  • Go to the Azure Portal.

  • Create a New Resource:

    • Click on Create a resource and search for document intelligence.

    • Select Document Intelligence and click Create.

      image
  • Configure the Resource:

    • Subscription: Select your Azure subscription.
    • Resource Group: Choose an existing resource group or create a new one.
    • Region: Select the region closest to your location.
    • Name: Provide a unique name for your Form Recognizer resource.
    • Pricing Tier: Choose the pricing tier that fits your needs (e.g., Standard S0).
  • Review your settings and click Create to deploy the resource.

    image

Configure Models

Using Prebuilt Models

  • Access Form Recognizer Studio:

    • Navigate to your Form Recognizer resource in the Azure Portal.

    • Check your Resource Group if needed:

      image
    • Under Overview, click on Go to Document Intelligence Studio:

      image
  • Select Prebuilt Models: Choose the prebuilt model that matches your document type (e.g., "Invoices" for your PDF example).

    image
  • If the service resource for usage and billing is not configured, a window will appear requesting the resource information. In this case, we will use the one we recently created.

    image
  • Analyze Document:

    • Upload your PDF document to the Form Recognizer Studio.

      image
    • Click on Run analysis, the prebuilt model will automatically extract fields such as invoice ID, date, vendor information, line items, and totals.

      image
    • Validate your results:

      image

    Training Custom Models (optional/if needed)

  • Prepare Training Data:

  • Upload Training Data: Upload the labeled documents to an Azure Blob Storage container.

  • Grant the necessary role (Storage Blob Data Reader) to the Document Intelligence Account for the Storage Account to access the information. Otherwise, you may encounter an error like this:

    image
    • For this example we'll be using the system assigned identity to do that. Under Identy within your Document Intelligence Account, change the status to On, and click on Save:

      A system assigned managed identity is restricted to one per resource and is tied to the lifecycle of this resource. You can grant permissions to the managed identity by using Azure role-based access control (Azure RBAC). The managed identity is authenticated with Microsoft Entra ID, so you don’t have to store any credentials in code.

      image
    • Go to your Storage Account, under Access Control (IAM) click on + Add, and then Add role assigment:

      image
    • Search for Storage Blob Data Reader, click Next. Then, click on select members and search for your Document intelligence identity. Finally click on Review + assign:

      image
  • In the Form Recognizer Studio, select Custom extraction model.

    image
  • Scroll down, and click on Create a project (e.g, pdfinvoiceproject, Extract information from pdf invoices):

    image
  • Configure the service resource for the project, choose subscription, resource group, Document Intelligence or Cognitive Service Resource and the api version.

    image
  • Connect training data source: Provide the information of the Azure Blob Storage account and the folder that contains your training data.

    image
  • You can also Auto label if it's required:

    image
  • Test the Model:

    • Upload a new document to test the custom model.
    • Verify that the model correctly extracts the desired fields.

Step 5: Set Up Azure Functions for Document Ingestion and Processing

An Azure Function App is a container for hosting individual Azure Functions. It provides the execution context for your functions, allowing you to manage, deploy, and scale them together. Each function app can host multiple functions, which are small pieces of code that run in response to various triggers or events, such as HTTP requests, timers, or messages from other Azure services.

Azure Functions are designed to be lightweight and event-driven, enabling you to build scalable and serverless applications. You only pay for the resources your functions consume while they are running, making it a cost-effective solution for many scenarios.

Create a Function App

  • In the Azure portal, go to your Resource Group.

  • Click + Create.

    image
  • Search for Function App, click on Create:

    image
  • Choose a hosting option; for this example, we will use Functions Premium. Click here for a quick overview of hosting options:

    image
  • Enter a name for the Function App (e.g., ContosoFAaiDocIntellig).

  • Choose your runtime stack (e.g., .NET or Python).

  • Select the region and other settings.

    image
  • Select Review + create and then Create. Verify the resources created in your Resource Group.

    image

Important

This example is using system-assigned managed identity to assign RBACs (Role-based Access Control). image

  • Please assign the Storage Blob Data Contributor and Storage File Data SMB Share Contributor roles to the Function App within the Storage Account related to the runtime (the one created with the function app).

    image
  • Assign Storage Blob Data Reader to the Function App within the Storage Account that will contains the invoices, click Next. Then, click on select members and search for your Function App identity. Finally click on Review + assign:

    image
  • Also add Cosmos DB Operator, DocumentDB Account Contributor, Azure AI Administrator, Cosmos DB Account Reader Role, Contributor:

    image
  • To assign the Microsoft.DocumentDB/databaseAccounts/readMetadata permission, you need to create a custom role in Azure Cosmos DB. This permission is required for accessing metadata in Cosmos DB. Click here to understand more about it.

    Aspect Data Plane Access Control Plane Access
    Scope Focuses on data operations within databases and containers. This includes actions such as reading, writing, and querying data in your databases and containers. Focuses on management operations at the account level. This includes actions such as creating, deleting, and configuring databases and containers.
    Roles - Cosmos DB Built-in Data Reader: Provides read-only access to data within the databases and containers.
    - Cosmos DB Built-in Data Contributor: Allows read and write access to data within the databases and containers.
    - Cosmos DB Built-in Data Owner: Grants full access to manage data within the databases and containers.
    - Contributor: Grants full access to manage all Azure resources, including Cosmos DB.
    - Owner: Grants full access to manage all resources, including the ability to assign roles in Azure RBAC.
    - Cosmos DB Account Contributor: Allows management of Cosmos DB accounts, including creating and deleting databases and containers.
    - Cosmos DB Account Reader: Provides read-only access to Cosmos DB account metadata.
    Permissions - Reading documents
    - Writing documents
    - Managing data within containers.
    - Creating or deleting databases and containers
    - Configuring settings
    - Managing account-level configurations.
    Authentication Uses Azure Active Directory (AAD) tokens or resource tokens for authentication. Uses Azure Active Directory (AAD) for authentication.

Steps to assing it:

  1. Open Azure CLI: Go to the Azure portal and click on the icon for the Azure CLI.

    image
  2. List Role Definitions: Run the following command to list all of the role definitions associated with your Azure Cosmos DB for NoSQL account. Review the output and locate the role definition named Cosmos DB Built-in Data Contributor.

    az cosmosdb sql role definition list \
        --resource-group "<your-resource-group>" \
        --account-name "<your-account-name>"
    image
  3. Get Cosmos DB Account ID: Run this command to get the ID of your Cosmos DB account. Record the value of the id property as it is required for the next step.

    az cosmosdb show --resource-group "<your-resource-group>" --name "<your-account-name>" --query "{id:id}"

    Example output:

    {                                                               
     "id": "/subscriptions/{subscription-id}/resourceGroups/{resource-group-name}/providers/Microsoft.DocumentDB/databaseAccounts/{cosmos-account-name}"
    }     
    image
  4. Assign the Role: Assign the new role using az cosmosdb sql role assignment create. Use the previously recorded role definition ID for the --role-definition-id argument, the unique identifier for your identity for the --principal-id argument, and your account's ID for the --scope argument.

    You can extract the principal-id, from Identity of the Function App:

    image
    az cosmosdb sql role assignment create \
        --resource-group "<your-resource-group>" \
        --account-name "<your-account-name>" \
        --role-definition-id "<role-definition-id>" \
        --principal-id "<principal-id>" \
        --scope "/subscriptions/{subscriptions-id}/resourceGroups/{resource-group-name}/providers/Microsoft.DocumentDB/databaseAccounts/{cosmos-account-name}"
    image

    After a few minutes, you will see something like this:

    image
  5. Verify Role Assignment: Use az cosmosdb sql role assignment list to list all role assignments for your Azure Cosmos DB for NoSQL account. Review the output to ensure your role assignment was created.

    az cosmosdb sql role assignment list \
        --resource-group "<your-resource-group>" \
        --account-name "<your-account-name>"
    image

Configure/Validate the Environment variables

  • Under Settings, go to Environment variables. And + Add the following variables:

    • COSMOS_DB_ENDPOINT: Your Cosmos DB account endpoint.

    • COSMOS_DB_KEY: Your Cosmos DB account key.

    • COSMOS_DB_CONNECTION_STRING: Your Cosmos DB connection string.

    • invoicecontosostorage_STORAGE: Your Storage Account connection string.

    • FORM_RECOGNIZER_ENDPOINT: For example: https://<your-form-recognizer-endpoint>.cognitiveservices.azure.com/

    • FORM_RECOGNIZER_KEY: Your Documment Intelligence Key (Form Recognizer).

    • FUNCTIONS_EXTENSION_VERSION: ~4 (Review the existence of this, if not create it)

    • FUNCTIONS_NODE_BLOCK_ON_ENTRY_POINT_ERROR: true (This setting ensures that all entry point errors are visible in your application insights logs).

      image image image image
    • Click on Apply to save your configuration.

      image

Develop the Function

  • You need to install VSCode

  • Install python from Microsoft store:

    image
  • Open VSCode, and install some extensions: python, and Azure Tools.

    image image
  • Click on the Azure icon, and sign in into your account. Allow the extension Azure Resources to sign in using Microsoft, it will open a browser window. After doing so, you will be able to see your subscription and resources.

    image
  • Under Workspace, click on Create Function Project, and choose a path in your local computer to develop your function.

    image
  • Choose the language, in this case is python:

    image
  • Select the model version, for this example let's use v2:

    image
  • For the python interpreter, let's use the one installed via Microsoft Store:

    image
  • Choose a template (e.g., Blob trigger) and configure it to trigger on new PDF uploads in your Blob container.

    image
  • Provide a function name, like BlobTriggerContosoPDFInvoicesDocIntelligence:

    image
  • Next, it will prompt you for the path of the blob container where you expect the function to be triggered after a file is uploaded. In this case is pdfinvoices as was previously created.

    image
  • Click on Create new local app settings, and then choose your subscription.

    image
  • Choose Azure Storage Account for remote storage, and select one. I'll be using the invoicecontosostorage.

    image
  • Then click on Open in the current window. You will see something like this:

    image
  • Now we need to update the function code to extract data from PDFs and store it in Cosmos DB, use this an example:

    1. PDF Upload: A PDF is uploaded to the Azure Blob Storage container named pdfinvoices.
    2. Trigger Azure Function: The upload triggers the Azure Function BlobTriggerContosoPDFInvoicesDocIntelligence.
    3. Initialize Clients: Sets up connections to Document Intelligence and Cosmos DB.
      • The function initializes the DocumentAnalysisClient to interact with Azure Document Intelligence.
      • It also initializes the CosmosClient to interact with Cosmos DB.
    4. Read PDF from Blob Storage: The function reads the PDF content from the Blob Storage into a byte stream.
    5. Analyze PDF: Uses Document Intelligence to extract data.
      • The function calls the begin_analyze_document method of the DocumentAnalysisClient using the prebuilt invoice model to analyze the PDF.
      • It waits for the analysis to complete and retrieves the results.
    6. Extract Data: Structures the extracted data.
      • The function extracts relevant fields from the analysis result, such as customer name, email, address, company name, phone, address, and rental details.
      • It structures this extracted data into a dictionary (invoice_data).
    7. Save Data to Cosmos DB: Inserts the data into Cosmos DB.
      • The function calls save_invoice_data_to_cosmos to save the structured data into Cosmos DB.
      • It ensures the database and container exist, then inserts the extracted data.
    8. Logging (process and errors): Throughout the process, the function logs various steps and any errors encountered for debugging and monitoring purposes.
    • Update the function_app.py:

      Template Blob Trigger Function Code updated
      image image
      Function Code (Click to expand)
       import logging
       import azure.functions as func
       from azure.ai.formrecognizer import DocumentAnalysisClient
       from azure.core.credentials import AzureKeyCredential
       from azure.cosmos import CosmosClient, PartitionKey, exceptions
       from azure.identity import DefaultAzureCredential
       import os
       import uuid
       
       app = func.FunctionApp(http_auth_level=func.AuthLevel.FUNCTION)
       
       ## DEFINITIONS 
       def initialize_form_recognizer_client():
           endpoint = os.getenv("FORM_RECOGNIZER_ENDPOINT")
           key = os.getenv("FORM_RECOGNIZER_KEY")
           if not isinstance(key, str):
               raise ValueError("FORM_RECOGNIZER_KEY must be a string")
           logging.info(f"Form Recognizer endpoint: {endpoint}")
           return DocumentAnalysisClient(endpoint=endpoint, credential=AzureKeyCredential(key))
       
       def read_pdf_content(myblob):
           logging.info(f"Reading PDF content from blob: {myblob.name}")
           return myblob.read()
       
       def analyze_pdf(form_recognizer_client, pdf_bytes):
           logging.info("Starting PDF analysis.")
           poller = form_recognizer_client.begin_analyze_document(
               model_id="prebuilt-invoice",
               document=pdf_bytes
           )
           logging.info("PDF analysis in progress.")
           return poller.result()
       
       def extract_invoice_data(result):
           logging.info("Extracting invoice data from analysis result.")
           invoice_data = {
               "id": str(uuid.uuid4()),
               "customer_name": "",
               "customer_email": "",
               "customer_address": "",
               "company_name": "",
               "company_phone": "",
               "company_address": "",
               "rentals": []
           }
       
           def serialize_field(field):
               if field:
                   return str(field.value)  # Convert to string
               return ""
           
           for document in result.documents:
               fields = document.fields
               invoice_data["customer_name"] = serialize_field(fields.get("CustomerName"))
               invoice_data["customer_email"] = serialize_field(fields.get("CustomerEmail"))
               invoice_data["customer_address"] = serialize_field(fields.get("CustomerAddress"))
               invoice_data["company_name"] = serialize_field(fields.get("VendorName"))
               invoice_data["company_phone"] = serialize_field(fields.get("VendorPhoneNumber"))
               invoice_data["company_address"] = serialize_field(fields.get("VendorAddress"))
       
               items = fields.get("Items").value if fields.get("Items") else []
               for item in items:
                   item_value = item.value if item.value else {}
                   rental = {
                       "rental_date": serialize_field(item_value.get("Date")),
                       "title": serialize_field(item_value.get("Description")),
                       "description": serialize_field(item_value.get("Description")),
                       "quantity": serialize_field(item_value.get("Quantity")),
                       "total_price": serialize_field(item_value.get("TotalPrice"))
                   }
                   invoice_data["rentals"].append(rental)
       
           logging.info(f"Successfully extracted invoice data: {invoice_data}")
           return invoice_data
       
       def save_invoice_data_to_cosmos(invoice_data):
           try:
               endpoint = os.getenv("COSMOS_DB_ENDPOINT")
               key = os.getenv("COSMOS_DB_KEY")
               aad_credentials = DefaultAzureCredential()
               client = CosmosClient(endpoint, credential=aad_credentials, consistency_level='Session')
               logging.info("Successfully connected to Cosmos DB using AAD default credential")
           except Exception as e:
               logging.error(f"Error connecting to Cosmos DB: {e}")
               return
           
           database_name = "ContosoDBDocIntellig"
           container_name = "Invoices"
       
           
           try: # Check if the database exists
               # If the database does not exist, create it
               database = client.create_database_if_not_exists(database_name)
               logging.info(f"Database '{database_name}' does not exist. Creating it.")
           except exceptions.CosmosResourceExistsError: # If error get name, keep going 
               database = client.get_database_client(database_name)
               logging.info(f"Database '{database_name}' already exists.")
       
           database.read()
           logging.info(f"Reading into '{database_name}' DB")
       
           try: # Check if the container exists
               # If the container does not exist, create it
               container = database.create_container(
                   id=container_name,
                   partition_key=PartitionKey(path="/transactionId"),
                   offer_throughput=400
               )
               logging.info(f"Container '{container_name}' does not exist. Creating it.")
           except exceptions.CosmosResourceExistsError:
               container = database.get_container_client(container_name)
               logging.info(f"Container '{container_name}' already exists.")
           except exceptions.CosmosHttpResponseError:
               raise
       
           container.read()
           logging.info(f"Reading into '{container}' container")
       
           try:
               response = container.upsert_item(invoice_data)
               logging.info(f"Saved processed invoice data to Cosmos DB: {response}")
           except Exception as e:
               logging.error(f"Error inserting item into Cosmos DB: {e}")
       
       ## MAIN 
       @app.blob_trigger(arg_name="myblob", path="pdfinvoices/{name}",
                         connection="invoicecontosostorage_STORAGE")
       def BlobTriggerContosoPDFInvoicesDocIntelligence(myblob: func.InputStream):
           logging.info(f"Python blob trigger function processed blob\n"
                        f"Name: {myblob.name}\n"
                        f"Blob Size: {myblob.length} bytes")
       
           try:
               form_recognizer_client = initialize_form_recognizer_client()
               pdf_bytes = read_pdf_content(myblob)
               logging.info("Successfully read PDF content from blob.")
           except Exception as e:
               logging.error(f"Error reading PDF: {e}")
               return
       
           try:
               result = analyze_pdf(form_recognizer_client, pdf_bytes)
               logging.info("Successfully analyzed PDF using Document Intelligence.")
           except Exception as e:
               logging.error(f"Error analyzing PDF: {e}")
               return
       
           try:
               invoice_data = extract_invoice_data(result)
               logging.info(f"Extracted invoice data: {invoice_data}")
           except Exception as e:
               logging.error(f"Error extracting invoice data: {e}")
               return
       
           try:
               save_invoice_data_to_cosmos(invoice_data)
               logging.info("Successfully saved invoice data to Cosmos DB.")
           except Exception as e:
               logging.error(f"Error saving invoice data to Cosmos DB: {e}")
    • Now, let's update the requirements.txt:

      Template requirements.txt Updated requirements.txt
      image image
       azure-functions
       azure-ai-formrecognizer
       azure-core
       azure-cosmos==4.3.0
       azure-identity==1.7.0
      
    • Since this function has already been tested, you can deploy your code to the function app in your subscription. If you want to test, you can use run your function locally for testing.

      • Click on the Azure icon.

      • Under workspace, click on the Function App icon.

      • Click on Deploy to Azure.

         <img width="550" alt="image" src="https://github.com/user-attachments/assets/12405c04-fa43-4f09-817d-f6879fbff035">
        
      • Select your subscription, your function app, and accept the prompt to overwrite:

         <img width="550" alt="image" src="https://github.com/user-attachments/assets/1882e777-6ba0-4e18-9d7b-5937204c7217">
        
      • After completing, you see the status in your terminal:

         <img width="550" alt="image" src="https://github.com/user-attachments/assets/aa090cfc-f5b3-4ef2-9c2d-6be4f00b83b8">
        
         <img width="550" alt="image" src="https://github.com/user-attachments/assets/369ecfc7-cc31-403c-a625-bb1f6caa271c">
        

Important

If you need further assistance with the code, please click here to view all the function code.

Note

Please ensure that all specified roles are assigned to the Function App. The provided example used System assigned for the Function App to facilitate the role assignment.

Step 6: Test the solution

Important

Please ensure that the user/system admin responsible for uploading the PDFs to the blob container has the necessary permissions. The error below illustrates what might occur if these roles are missing.
image
In that case, go to Access Control (IAM), click on + Add, and Add role assignment:
image
Search for Storage Blob Data Contributor, click Next.
image
Then, click on select members and search for your user/systen admin. Finally click on Review + assign.

Upload sample PDF invoices to the Blob container and verify that data is correctly ingested and stored in Cosmos DB.

  • Click on Upload, then select Browse for files and choose your PDF invoices to be stored in the blob container, which will trigger the function app to parse them.

    image
  • Check the logs, and traces from your function with Application Insights:

    image
  • Under Investigate, click on Performance. Filter by time range, and drill into the samples. Sort the results by date (if you have many, like in my case) and click on the last one.

    image
  • Click on View all:

    image
  • Check all the logs, and traces generated. Also review the information parsed:

    image
  • Validate that the information was uploaded to the Cosmos DB. Under Data Explorer, check your Database.

    image

Total Visitors

Visitor Count

About

Example of how to create to extract PDFs from an Azure Storage Account, process them using Azure Document Intelligence, and store the results in Cosmos DB for further analysis.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages