parsing pages with multi-processing: close #43
dothinking committed Sep 5, 2020
2 parents 200f2e5 + e786c06 commit 0b7d231
Showing 14 changed files with 251 additions and 132 deletions.
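
The commit title refers to parsing pages with multi-processing, but the implementation itself is not visible in this excerpt. As a rough sketch of the general pattern only, assuming a hypothetical worker `parse_chunk` and hypothetical helpers `split_pages` and `parse_pages` (none of these names come from pdf2docx), page indexes can be split into contiguous chunks and handed to a `multiprocessing.Pool`:

```python
from multiprocessing import Pool, cpu_count


def split_pages(page_indexes, n_chunks):
    """Split page indexes into at most n_chunks contiguous, near-equal chunks."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    pages = list(page_indexes)
    # Never create more chunks than there are pages.
    n_chunks = min(n_chunks, len(pages)) or 1
    size, extra = divmod(len(pages), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional page each.
        stop = start + size + (1 if i < extra else 0)
        chunks.append(pages[start:stop])
        start = stop
    return [chunk for chunk in chunks if chunk]


def parse_chunk(chunk):
    """Hypothetical worker: stands in for parsing one group of pages."""
    return [(index, f"layout of page {index}") for index in chunk]


def parse_pages(page_indexes, processes=None):
    """Parse pages in parallel and return {page index: parsed result}."""
    chunks = split_pages(page_indexes, processes or cpu_count())
    if not chunks:
        return {}
    # One worker process per chunk; pool.map keeps the chunk order.
    with Pool(processes=len(chunks)) as pool:
        results = pool.map(parse_chunk, chunks)
    return {index: layout for chunk in results for index, layout in chunk}
```

A call such as `parse_pages(range(5, 10), processes=2)` should sit under an `if __name__ == "__main__":` guard so that it also works on platforms that spawn rather than fork worker processes.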
17 changes: 9 additions & 8 deletions README.md
@@ -19,11 +19,13 @@
- [x] shading style, i.e. background color
- [x] merged cells
- [x] vertical direction cell
- [x] table with partly hidden borders
- [x] Rebuild page layout in docx
- [x] text in horizontal direction: from left to right
- [x] text in vertical direction: from bottom to top
- [x] in-line image
- [x] paragraph layout: horizontal and vertical spacing
- [x] Parsing pages with multi-processing

*It can also be used as a tool to extract table contents since both table content and format/style is parsed.*

@@ -34,7 +36,6 @@
- horizontal/vertical paragraph/line/word
- no word transformation, e.g. rotation
- No floating images
- Full borders table only


## Installation
@@ -77,6 +78,7 @@ $ pdf2docx test.pdf test.docx --start=5 --end=10

```
$ pdf2docx test.pdf test.docx --pages=5,7,9
$ pdf2docx test.pdf --multi_processing=True
```

```
@@ -86,26 +88,25 @@ NAME
pdf2docx - Run the pdf2docx parser.
SYNOPSIS
-pdf2docx PDF_FILE DOCX_FILE <flags>
+pdf2docx PDF_FILE <flags>
DESCRIPTION
Run the pdf2docx parser.
POSITIONAL ARGUMENTS
PDF_FILE
PDF filename to read from
DOCX_FILE
DOCX filename to write to
FLAGS
--docx_file=DOCX_FILE
DOCX filename to write to
--start=START
first page to process, starting from zero
--end=END
last page to process, starting from zero
--pages=PAGES
range of pages
--debug=DEBUG
create illustration pdf showing layouts if True, else do nothing
--multi_processing=MULTI_PROCESSING
NOTES
You can also use flags syntax for POSITIONAL ARGUMENTS
@@ -118,7 +119,7 @@ NOTES
`pip install pdf2docx`, or `python setup.py install`.
'''

-from pdf2docx.main import parse
+from pdf2docx import parse

pdf_file = '/path/to/sample.pdf'
docx_file = 'path/to/sample.docx'
@@ -130,7 +131,7 @@ parse(pdf_file, docx_file, start=0, end=1)
Or just to extract tables,

```python
-from pdf2docx.main import extract_tables
+from pdf2docx import extract_tables

pdf_file = '/path/to/sample.pdf'

3 changes: 3 additions & 0 deletions pdf2docx/__init__.py
@@ -0,0 +1,3 @@
from .converter import Converter
from .layout.Layout import Layout
from .main import parse, extract_tables
2 changes: 1 addition & 1 deletion pdf2docx/common/BBox.py
@@ -1,7 +1,7 @@
# -*- coding: utf-8 -*-

'''
-Object with a boundary box, e.g. Block, Line, Span.
+Object with a bounding box, e.g. Block, Line, Span.
Based on `PyMuPDF`, the coordinates are provided relative to the un-rotated page; while this
`pdf2docx` library works under real page coordinate system, i.e. with rotation considered.
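
The docstring above distinguishes un-rotated page coordinates from the real, rotated page coordinate system. The following is only an illustrative sketch of that mapping, not code from `pdf2docx` or `PyMuPDF`: it assumes a top-left origin with y pointing down and a clockwise page rotation in multiples of 90 degrees, and `rotate_bbox` is a hypothetical helper name.

```python
def rotate_bbox(bbox, rotation, width, height):
    """Map (x0, y0, x1, y1) from un-rotated to rotated page coordinates.

    Assumptions: top-left origin, y pointing down, clockwise rotation of
    0, 90, 180 or 270 degrees. `width` and `height` are the dimensions of
    the un-rotated page.
    """
    rotation %= 360
    if rotation not in (0, 90, 180, 270):
        raise ValueError("rotation must be a multiple of 90 degrees")

    def rotate_point(x, y):
        if rotation == 90:
            return height - y, x
        if rotation == 180:
            return width - x, height - y
        if rotation == 270:
            return y, width - x
        return x, y  # rotation == 0

    x0, y0, x1, y1 = bbox
    xa, ya = rotate_point(x0, y0)
    xb, yb = rotate_point(x1, y1)
    # Re-normalize so (x0, y0) is again the top-left corner.
    return (min(xa, xb), min(ya, yb), max(xa, xb), max(ya, yb))
```

For a 100 x 200 page rotated by 90 degrees, the rotated page is 200 wide and 100 high, and a box near the top-left of the un-rotated page ends up near the top-right of the rotated one.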