Incorrect Figure Extraction Using Databricks ai_parse_document bbox Coordinates #18

@moryachok

Problem:
When attempting to extract figures from a PDF using the bbox coordinates from the Databricks ai_parse_document output stored in treasury_bulletin_2025_09.json, the extracted figures were consistently misaligned or incorrectly cropped, despite multiple attempts with different DPI settings and coordinate conversion approaches.

Root Cause Analysis:

Ambiguous Coordinate System: The ai_parse_document bbox format provides coordinates as [x0, y0, x1, y1] integers without specifying a unit. The documentation states that the coordinates are integers but does not define whether they represent:

Pixels at a specific DPI
Points (1/72 inch)
Normalized coordinates
Some other unit system

Inconsistent Coordinate Space Across Pages: Analysis revealed that the maximum coordinates varied significantly by page, which suggests that different pages may have been analyzed at different resolutions or with different coordinate spaces.
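
The per-page comparison described above can be sketched roughly as follows. This is a minimal illustration, not the code used in the investigation: the function names are hypothetical, and the input is assumed to be a plain list of (page_id, bbox) pairs already pulled out of the parse output, since the exact JSON layout is not shown in this issue.

```python
from collections import defaultdict


def max_coords_per_page(boxes):
    """Return {page_id: (max_x, max_y)} over all bboxes on each page.

    `boxes` is an iterable of (page_id, (x0, y0, x1, y1)) pairs.
    This input shape is an assumption made for illustration.
    """
    maxima = defaultdict(lambda: (0, 0))
    for page_id, (x0, y0, x1, y1) in boxes:
        mx, my = maxima[page_id]
        maxima[page_id] = (max(mx, x0, x1), max(my, y0, y1))
    return dict(maxima)


def implied_scale(maxima, page_sizes_pt):
    """Lower bound on bbox units per PDF point for each page.

    `page_sizes_pt` maps page_id to (width_pt, height_pt).
    If the coordinate space were uniform, these values should be
    similar across pages of the same physical size.
    """
    return {
        page_id: max(mx / page_sizes_pt[page_id][0], my / page_sizes_pt[page_id][1])
        for page_id, (mx, my) in maxima.items()
    }
```

A large spread in the implied scale between pages of identical size does not prove the coordinate spaces differ (a page may simply have no content near its edges), but it is a cheap first check.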

Failed Approaches Attempted:

Direct pixel usage at 2x zoom (144 DPI): Coordinates exceeded page bounds
2.5x zoom (180 DPI): Based on max coordinates analysis, still produced misaligned cuts
300 DPI rendering: Standard document analysis DPI, figures were off
200 DPI rendering: Minimum DPI where coords fit bounds, still incorrect
Inch-based conversion: Attempted treating coords as if in inches like Azure Doc Intelligence, but values were too large
Dynamic per-page scaling: Calculated page-specific scale factors based on max coords, still failed
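
The bounds checks behind these attempts come down to simple arithmetic, sketched below. The helper names are hypothetical, and the sketch assumes the bbox values are interpreted directly as pixels of a page rendered at a given DPI, which is exactly the assumption that turned out not to hold.

```python
def zoom_to_dpi(zoom):
    """PDF user space is 72 units per inch, so a zoom factor maps to 72 * zoom DPI."""
    return 72.0 * zoom


def rendered_size_px(page_w_pt, page_h_pt, dpi):
    """Pixel size of a page (given in points) when rendered at `dpi`."""
    return (page_w_pt * dpi / 72.0, page_h_pt * dpi / 72.0)


def fits(bbox, page_w_pt, page_h_pt, dpi):
    """True if the bbox, read as pixels, lies inside the page rendered at `dpi`."""
    w, h = rendered_size_px(page_w_pt, page_h_pt, dpi)
    x0, y0, x1, y1 = bbox
    return 0 <= x0 <= x1 <= w and 0 <= y0 <= y1 <= h


def min_dpi_to_fit(bbox, page_w_pt, page_h_pt):
    """Smallest DPI at which the bbox's far corner is still on the page."""
    _, _, x1, y1 = bbox
    return 72.0 * max(x1 / page_w_pt, y1 / page_h_pt)
```

Note that fitting inside the page bounds is only a necessary condition: as the 200 DPI attempt showed, a DPI at which all coordinates fit can still produce wrong crops.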

Why We Failed: The fundamental issue is that the Databricks ai_parse_document bbox coordinate system is undocumented and appears to be inconsistent. Without knowing:

The exact resolution/DPI used during document analysis
Whether coordinates represent the full rendered page or some subset
If there's any padding, offset, or transformation applied
Why coordinates vary in scale across different pages
...it's impossible to reliably convert the coordinates to accurate pixel positions for cropping.

Solution:
Switched to Azure Document Intelligence output, which provides:

Clear coordinate format: Polygon coordinates explicitly in inches
Documented standard: Azure Doc Intelligence reports coordinates for PDF input in inches, so the page size in inches is the PDF page size in points divided by 72 (page_width/72, page_height/72)
Consistent conversion: Simple formula: pixel = (inch_coord / page_size_inches) * rendered_page_pixels
Reliable results: Produces perfectly aligned figure extractions
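
The conversion formula above can be sketched as a small helper. This is an illustration under stated assumptions rather than the exact code used: the function name is hypothetical, and the polygon is assumed to be a flat list [x1, y1, x2, y2, ...] in inches; adapt the unpacking if your SDK version returns point objects instead.

```python
import math


def polygon_to_pixel_box(polygon_in, page_w_in, page_h_in, img_w_px, img_h_px):
    """Convert an inch-based polygon to a (left, top, right, bottom) pixel box.

    Applies pixel = (inch_coord / page_size_inches) * rendered_page_pixels
    per axis, then takes the polygon's axis-aligned bounding box.
    """
    xs = polygon_in[0::2]  # even indices are x values (assumed flat layout)
    ys = polygon_in[1::2]  # odd indices are y values
    sx = img_w_px / page_w_in
    sy = img_h_px / page_h_in
    # Round outward so the crop never cuts into the figure, then clamp to the image.
    left = max(0, math.floor(min(xs) * sx))
    top = max(0, math.floor(min(ys) * sy))
    right = min(img_w_px, math.ceil(max(xs) * sx))
    bottom = min(img_h_px, math.ceil(max(ys) * sy))
    return (left, top, right, bottom)
```

The returned tuple follows the (left, upper, right, lower) order that Pillow's Image.crop expects, so it can be passed to a crop call on the rendered page image. Because the scale is derived from the rendered image size, the result does not depend on which DPI was chosen for rendering.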

Recommendation:
The Databricks ai_parse_document documentation should be updated to clearly specify:

The unit system for bbox coordinates
The resolution/DPI at which documents are analyzed
Whether coordinates are relative to the full page or some other reference frame
Example code showing proper coordinate conversion for image extraction use cases
