Commit ad7d582

Merge pull request #159 from ArtifexSoftware/docs-GetTables
Adds documentation for GetTables.
2 parents 20c4922 + da6aa60 commit ad7d582

2 files changed: 76 additions, 0 deletions

docs/classes/Page.rst

Lines changed: 36 additions & 0 deletions
@@ -67,6 +67,7 @@ In a nutshell, this is what you can do with MuPDF.NET:
 :meth:`Page.GetBboxlog` List of rectangles that envelop text, drawing or image objects
 :meth:`Page.GetContents` PDF only: return a list of content :data:`xref` numbers
 :meth:`Page.GetDisplayList` Create the page's display list
+:meth:`Page.GetTables` Extract the page's tables as a list
 :meth:`Page.GetTextBlocks` Extract text blocks as a list
 :meth:`Page.GetTextWords` Extract text words as a list
 :meth:`Page.ClusterDrawings` PDF only: bounding boxes of vector graphics
@@ -557,6 +558,41 @@ In a nutshell, this is what you can do with MuPDF.NET:

 :rtype: :ref:`DisplayList`
 :returns: the display list of the page.
+
+.. method:: GetTables(Rect clip: null, string strategy: null, string vertical_strategy: "lines", string horizontal_strategy: "lines", List<Line> add_lines: null, List<Edge> vertical_lines: null, List<Edge> horizontal_lines: null, float snap_tolerance: TableFlags.TABLE_DEFAULT_SNAP_TOLERANCE, float snap_x_tolerance: 0.0f, float snap_y_tolerance: 0.0f, float join_tolerance: TableFlags.TABLE_DEFAULT_JOIN_TOLERANCE, float join_x_tolerance: 0.0f, float join_y_tolerance: 0.0f, float edge_min_length: 3.0f, float min_words_vertical: TableFlags.TABLE_DEFAULT_MIN_WORDS_VERTICAL, float min_words_horizontal: TableFlags.TABLE_DEFAULT_MIN_WORDS_HORIZONTAL, float intersection_tolerance: 3.0f, float intersection_x_tolerance: 0.0f, float intersection_y_tolerance: 0.0f, float text_tolerance: 3.0f, float text_x_tolerance: 3.0f, float text_y_tolerance: 3.0f)
+
+Find tables on the page and return a list with related information. The default values of the many parameters are usually sufficient; adjustments should only be needed in corner cases.
+
+:arg Rect clip: Specify a region within the page rectangle to consider; everything outside it is ignored. The default `null` means the full page.
+
+:arg string strategy: Request a **table detection** strategy. Valid values are "lines", "lines_strict" and "text".
+
+Default is **"lines"**, which uses all vector graphics on the page to detect grid lines.
+
+Strategy **"lines_strict"** ignores borderless rectangle vector graphics. Sometimes single pieces of text have background colors, which may lead to false columns or lines. This strategy ignores them and can thus increase detection precision.
+
+If **"text"** is specified, text positions are used to generate "virtual" column and / or row boundaries. Use the `min_words_*` parameters to set how many words must line up before their coordinates are considered.
+
+Use parameters `vertical_strategy` and `horizontal_strategy` **instead** to control each dimension separately.
+
+:arg List<Line> add_lines: Specify a list of "lines" (i.e. pairs of `Line` objects) as **additional**, "virtual" vector graphics. These lines may help with table and / or cell detection and do not otherwise influence the detection strategy. In particular, unlike `horizontal_lines` and `vertical_lines`, they do not prevent rows or columns from being detected in other ways. They are treated exactly like "real" vector graphics with respect to joining, snapping, intersecting, minimum length and containment in the `clip` rectangle. Likewise, lines not parallel to either coordinate axis are ignored.
+
+:arg float snap_tolerance: Any two horizontal lines whose y-values differ by no more than this value will be **snapped** into one. The same applies to vertical lines and their x-values. Default is 3. Separate values can be specified instead for the dimensions, using `snap_x_tolerance` and `snap_y_tolerance`.
+
+:arg float join_tolerance: Any two lines will be **joined** into one if the end point of one and the start point of the other differ by no more than this value (in points). Default is 3. Instead of this value, separate values can be specified for the dimensions using `join_x_tolerance` and `join_y_tolerance`.
+
+:arg float edge_min_length: Ignore a line if its length does not exceed this value (points). Default is 3.
+
+:arg int min_words_vertical: Relevant for vertical strategy option "text": at least this many words must coincide to establish a **virtual column** boundary.
+
+:arg int min_words_horizontal: Relevant for horizontal strategy option "text": at least this many words must coincide to establish a **virtual row** boundary.
+
+:arg float intersection_tolerance: When combining lines into cell borders, orthogonal lines must be within this value (points) to be considered intersecting. Default is 3. Instead of this value, separate values can be specified for the dimensions using `intersection_x_tolerance` and `intersection_y_tolerance`.
+
+:arg float text_tolerance: Characters will be combined into words only if their distance is no larger than this value (points). Default is 3. Instead of this value, separate values can be specified for the dimensions using `text_x_tolerance` and `text_y_tolerance`.
+
+:rtype: List
+:return: a list of `Table`
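A minimal usage sketch of the call documented above, written with C# named arguments so that only the parameters that differ from their defaults are spelled out. The parameter names are taken from the signature above; the `MuPDF.NET` namespace, the `Document` constructor taking a file path, and the `doc[0]` page indexer are assumptions that this section does not show.

```csharp
using System;
using MuPDF.NET; // assumed namespace of the library

class GetTablesExample
{
    static void Main(string[] args)
    {
        // Assumption: a Document opens from a file path and pages are indexed from 0.
        Document doc = new Document(args[0]);
        Page page = doc[0];

        // All defaults: the "lines" strategy is used in both dimensions.
        var tables = page.GetTables();
        Console.WriteLine($"Default settings: {tables.Count} table(s)");

        // Ignore borderless (background) rectangles and snap lines more generously.
        var strict = page.GetTables(strategy: "lines_strict", snap_tolerance: 5.0f);
        Console.WriteLine($"lines_strict: {strict.Count} table(s)");

        // Borderless table: derive virtual columns and rows from word positions.
        var textBased = page.GetTables(
            vertical_strategy: "text",
            horizontal_strategy: "text",
            min_words_vertical: 3,
            min_words_horizontal: 1);
        Console.WriteLine($"text strategy: {textBased.Count} table(s)");
    }
}
```

Comparing the three counts on a problematic page is a quick way to see which strategy fits before tuning any of the tolerance values.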

 .. method:: GetTextBlocks(Rect clip: null, int flags: 0, TextPage textPage: null, bool sort: false)

docs/glossary/Utils.rst

Lines changed: 40 additions & 0 deletions
@@ -66,6 +66,7 @@ Yet others are handy, general-purpose utilities.
 :meth:`GetGlyphText` Adobe Glyph List function
 :meth:`GetImageExtension` Return extension for MuPDF image type
 :meth:`GetLinkText` Define skeletons for `/Annots` object texts
+:meth:`GetTables` Return the tables detected from a `Page` instance
 :meth:`GetWidgetProperties` Populate a Widget object with the values from a PDF form field
 :meth:`InsertContents` Insert a buffer as a new separate `/Contents` object of a page
 :meth:`Integer2Letter` Return letter sequence string for integer `i`
@@ -533,6 +534,45 @@ Yet others are handy, general-purpose utilities.

 :returns: annot string.

+-----
+
+.. method:: GetTables(Page page, Rect clip: null, string strategy: null, string vertical_strategy: "lines", string horizontal_strategy: "lines", List<Line> add_lines: null, List<Edge> vertical_lines: null, List<Edge> horizontal_lines: null, float snap_tolerance: TableFlags.TABLE_DEFAULT_SNAP_TOLERANCE, float snap_x_tolerance: 0.0f, float snap_y_tolerance: 0.0f, float join_tolerance: TableFlags.TABLE_DEFAULT_JOIN_TOLERANCE, float join_x_tolerance: 0.0f, float join_y_tolerance: 0.0f, float edge_min_length: 3.0f, float min_words_vertical: TableFlags.TABLE_DEFAULT_MIN_WORDS_VERTICAL, float min_words_horizontal: TableFlags.TABLE_DEFAULT_MIN_WORDS_HORIZONTAL, float intersection_tolerance: 3.0f, float intersection_x_tolerance: 0.0f, float intersection_y_tolerance: 0.0f, float text_tolerance: 3.0f, float text_x_tolerance: 3.0f, float text_y_tolerance: 3.0f)
+
+Find tables on the page and return a list with related information. The default values of the many parameters are usually sufficient; adjustments should only be needed in corner cases.
+
+:arg Page page: The page instance to use for table detection.
+
+:arg Rect clip: Specify a region within the page rectangle to consider; everything outside it is ignored. The default `null` means the full page.
+
+:arg string strategy: Request a **table detection** strategy. Valid values are "lines", "lines_strict" and "text".
+
+Default is **"lines"**, which uses all vector graphics on the page to detect grid lines.
+
+Strategy **"lines_strict"** ignores borderless rectangle vector graphics. Sometimes single pieces of text have background colors, which may lead to false columns or lines. This strategy ignores them and can thus increase detection precision.
+
+If **"text"** is specified, text positions are used to generate "virtual" column and / or row boundaries. Use the `min_words_*` parameters to set how many words must line up before their coordinates are considered.
+
+Use parameters `vertical_strategy` and `horizontal_strategy` **instead** to control each dimension separately.
+
+:arg List<Line> add_lines: Specify a list of "lines" (i.e. pairs of `Line` objects) as **additional**, "virtual" vector graphics. These lines may help with table and / or cell detection and do not otherwise influence the detection strategy. In particular, unlike `horizontal_lines` and `vertical_lines`, they do not prevent rows or columns from being detected in other ways. They are treated exactly like "real" vector graphics with respect to joining, snapping, intersecting, minimum length and containment in the `clip` rectangle. Likewise, lines not parallel to either coordinate axis are ignored.
+
+:arg float snap_tolerance: Any two horizontal lines whose y-values differ by no more than this value will be **snapped** into one. The same applies to vertical lines and their x-values. Default is 3. Separate values can be specified instead for the dimensions, using `snap_x_tolerance` and `snap_y_tolerance`.
+
+:arg float join_tolerance: Any two lines will be **joined** into one if the end point of one and the start point of the other differ by no more than this value (in points). Default is 3. Instead of this value, separate values can be specified for the dimensions using `join_x_tolerance` and `join_y_tolerance`.
+
+:arg float edge_min_length: Ignore a line if its length does not exceed this value (points). Default is 3.
+
+:arg int min_words_vertical: Relevant for vertical strategy option "text": at least this many words must coincide to establish a **virtual column** boundary.
+
+:arg int min_words_horizontal: Relevant for horizontal strategy option "text": at least this many words must coincide to establish a **virtual row** boundary.
+
+:arg float intersection_tolerance: When combining lines into cell borders, orthogonal lines must be within this value (points) to be considered intersecting. Default is 3. Instead of this value, separate values can be specified for the dimensions using `intersection_x_tolerance` and `intersection_y_tolerance`.
+
+:arg float text_tolerance: Characters will be combined into words only if their distance is no larger than this value (points). Default is 3. Instead of this value, separate values can be specified for the dimensions using `text_x_tolerance` and `text_y_tolerance`.
+
+:rtype: List
+:return: a list of `Table`
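To picture what the `snap_tolerance` description means, the following self-contained sketch groups line positions that lie within a tolerance of their neighbour and replaces each group with its average. `SnapSketch` and `Snap` are hypothetical names used only for illustration, and chaining neighbours into one group is just one plausible reading of the rule; this is not the library's actual implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class SnapSketch
{
    // Sort the positions, start a new group whenever the gap to the previous
    // position exceeds `tolerance`, and replace each group by its average.
    public static List<float> Snap(IEnumerable<float> positions, float tolerance)
    {
        var result = new List<float>();
        var group = new List<float>();
        foreach (float p in positions.OrderBy(v => v))
        {
            if (group.Count > 0 && p - group[group.Count - 1] > tolerance)
            {
                result.Add(group.Average());
                group.Clear();
            }
            group.Add(p);
        }
        if (group.Count > 0)
        {
            result.Add(group.Average());
        }
        return result;
    }

    static void Main()
    {
        // y-values of four horizontal lines; the first three lie within 3 points of a neighbour.
        var snapped = Snap(new[] { 100.0f, 101.5f, 102.0f, 250.0f }, 3.0f);
        Console.WriteLine($"{snapped.Count} line positions remain after snapping");
    }
}
```

With a tolerance of 3, the first three y-values collapse into a single position and the fourth stays on its own. A larger tolerance merges more aggressively, which can help with slightly misaligned rulings but may also fuse rows that are genuinely distinct.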
+
 -----

 .. method:: GetWidgetProperties(Annot annot, Widget widget)
