Network and Hybrid parsers #153

Draft: wants to merge 129 commits into base: master
816471e
Fix unit tests, lint, drop Python 2 support
Apr 19, 2020
697289e
Refactor base classes and improve plotting
Apr 19, 2020
5838687
Prep work for new hybrid parser introduction
Apr 19, 2020
cff7a96
Further refactor
Apr 19, 2020
50f1186
Lint, refactor
Apr 19, 2020
c27a802
More linting, refactor
Apr 19, 2020
d673a3b
Fix unit test with plotting
Apr 19, 2020
58823e5
More refactoring / linting
Apr 19, 2020
d520a77
Initial Hybrid parser, for now identical to Stream
Apr 19, 2020
e59b3f5
Fix unit test
Apr 19, 2020
89fe090
Linting
Apr 19, 2020
69c7728
More linting
Apr 20, 2020
dec8f2d
Try to silence bandit messages on valid asserts
Apr 20, 2020
d0bd1cf
More linting
Apr 20, 2020
57c5957
Interim check-in, test failing and lots of todos
Apr 20, 2020
175655d
Add support for region/area for hybrid
Apr 20, 2020
fb69bd9
Improve hybrid plotting
Apr 20, 2020
ad27a11
Refactor code in plotting
Apr 21, 2020
cd338ff
Draw parse constraints for easier debug
Apr 21, 2020
9a82408
Prettier plotting, improve gaps calculation
Apr 22, 2020
0be58de
Fix in table diff
Apr 22, 2020
df3d288
Loosen cells header expansion algorithm
Apr 22, 2020
fab13ee
Unit test fix
Apr 22, 2020
7b0ac03
Prefer showing diffs at the row level
Apr 22, 2020
6962c71
Unit test fix
Apr 22, 2020
ec0ca1e
Unit test fixes
Apr 22, 2020
489e996
Address last unit test
Apr 22, 2020
36d5a09
Refactor common code hybrid / stream
Apr 23, 2020
414708d
Move generic code to utils
Apr 23, 2020
adb14d3
Refactoring TextEdges code across hybrid and stream
Apr 23, 2020
5db49d4
More refactoring across stream and hybrid.
Apr 23, 2020
3ea8d81
Update test to reflect different order of edges
Apr 23, 2020
58b2c1d
Define TextEdge as a bounded TextAlignment
Apr 24, 2020
efe8129
Enforce text_edge as subcase of text_alignment
Apr 24, 2020
8ad9e56
Further simplification
Apr 24, 2020
5290fb6
Refactor out _text_bbox
Apr 24, 2020
f42557a
Common parent TextBaseParser for Stream and Hybrid
Apr 24, 2020
bb842f2
Further refactoring
Apr 25, 2020
22f4287
Improve edgeplot for hybrid
Apr 25, 2020
84ec5c6
Rename member for clarity, fixed unit test
Apr 26, 2020
0167769
Plot improvements, address 132
Apr 26, 2020
2624010
Remove f-strings, fix url based unit tests
Apr 26, 2020
56dd310
Remove another f-string
Apr 26, 2020
30a0b2e
Add Parser comparison notebook to help visualizing
Apr 26, 2020
a2c5ee7
Add parser comparizon notebook
Apr 26, 2020
c51c24a
Linting
Apr 26, 2020
6add19a
Prep for vertical text improvements
Apr 28, 2020
3220b02
Create notebook to help debug hybrid parser algo
Apr 28, 2020
918416e
Improve hybrid table body discovery algo
Apr 29, 2020
04fc542
Fix off by one error in column identification
Apr 29, 2020
c0903b8
Improve column detection for hybrid flavor
Apr 29, 2020
8a63e8e
Minor linting
Apr 29, 2020
f3aded5
Linting
Apr 29, 2020
d663dd1
Fix plotting unit tests
Apr 30, 2020
c7ab3a4
Raise tolerance of plot differences
May 1, 2020
6711f87
Rename WIP parser "network", actual Hybrid to come
May 2, 2020
77d289b
WIP: Introduce actual hybrid parser
May 4, 2020
79ea4ad
Add baseline test for hybrid
May 5, 2020
ae429fc
Hybrid parser fixes
May 5, 2020
ba5169b
Enable process_background option for hybrid
May 8, 2020
bd2aab5
Fix unit tests, lint, drop Python 2 support
Apr 19, 2020
161f712
Refactor base classes and improve plotting
Apr 19, 2020
37483ca
Prep work for new hybrid parser introduction
Apr 19, 2020
ff2ce6f
Further refactor
Apr 19, 2020
20f18b4
Lint, refactor
Apr 19, 2020
f37ed50
More linting, refactor
Apr 19, 2020
8ed4cdf
Fix unit test with plotting
Apr 19, 2020
64576fd
More refactoring / linting
Apr 19, 2020
f9a6543
Initial Hybrid parser, for now identical to Stream
Apr 19, 2020
e8e80a8
Fix unit test
Apr 19, 2020
07e2e16
Linting
Apr 19, 2020
878ef96
More linting
Apr 20, 2020
931b2f2
Try to silence bandit messages on valid asserts
Apr 20, 2020
c1c9358
More linting
Apr 20, 2020
f5fe92c
Interim check-in, test failing and lots of todos
Apr 20, 2020
e0e3ff4
Add support for region/area for hybrid
Apr 20, 2020
1ccaa06
Improve hybrid plotting
Apr 20, 2020
310a8cd
Refactor code in plotting
Apr 21, 2020
d2cf852
Draw parse constraints for easier debug
Apr 21, 2020
1a47c3d
Prettier plotting, improve gaps calculation
Apr 22, 2020
a2a8311
Fix in table diff
Apr 22, 2020
356af84
Loosen cells header expansion algorithm
Apr 22, 2020
549ab0e
Unit test fix
Apr 22, 2020
db64562
Prefer showing diffs at the row level
Apr 22, 2020
13268be
Unit test fix
Apr 22, 2020
d3d625a
Unit test fixes
Apr 22, 2020
bfc2719
Address last unit test
Apr 22, 2020
14cd328
Refactor common code hybrid / stream
Apr 23, 2020
7ad5b84
Move generic code to utils
Apr 23, 2020
92c8abd
Refactoring TextEdges code across hybrid and stream
Apr 23, 2020
8903ef7
More refactoring across stream and hybrid.
Apr 23, 2020
0b8aac9
Update test to reflect different order of edges
Apr 23, 2020
2d97fbc
Define TextEdge as a bounded TextAlignment
Apr 24, 2020
22b6e33
Enforce text_edge as subcase of text_alignment
Apr 24, 2020
87d95a0
Further simplification
Apr 24, 2020
a401d33
Refactor out _text_bbox
Apr 24, 2020
1858164
Common parent TextBaseParser for Stream and Hybrid
Apr 24, 2020
c9a73a1
Further refactoring
Apr 25, 2020
a0e4691
Improve edgeplot for hybrid
Apr 25, 2020
dbaab66
Rename member for clarity, fixed unit test
Apr 26, 2020
81de841
Plot improvements, address 132
Apr 26, 2020
9eb4f65
Remove f-strings, fix url based unit tests
Apr 26, 2020
15d99b1
Remove another f-string
Apr 26, 2020
90f8d11
Add Parser comparison notebook to help visualizing
Apr 26, 2020
f7aafcd
Add parser comparizon notebook
Apr 26, 2020
e1572a1
Linting
Apr 26, 2020
8f5e2bb
Prep for vertical text improvements
Apr 28, 2020
a04e770
Create notebook to help debug hybrid parser algo
Apr 28, 2020
21dc6a4
Improve hybrid table body discovery algo
Apr 29, 2020
e31e978
Fix off by one error in column identification
Apr 29, 2020
ada4809
Improve column detection for hybrid flavor
Apr 29, 2020
55fd459
Minor linting
Apr 29, 2020
4b3eee4
Linting
Apr 29, 2020
9e385bf
Fix plotting unit tests
Apr 30, 2020
2867aec
Raise tolerance of plot differences
May 1, 2020
edad1ef
Rename WIP parser "network", actual Hybrid to come
May 2, 2020
4a76161
WIP: Introduce actual hybrid parser
May 4, 2020
7fae107
Add baseline test for hybrid
May 5, 2020
63adfd5
Hybrid parser fixes
May 5, 2020
9abdd00
Enable process_background option for hybrid
May 8, 2020
529ea36
Updated comparison notebook
Jun 12, 2020
1813b80
Merge fix
Jun 13, 2020
4145361
Merge branch 'hybrid-parser' of https://github.com/FrancoisHuet/camel…
Jun 13, 2020
4fb1e93
Bump dev libraries requirements to avoid conflicts
Jun 13, 2020
b43aca8
Merge branch 'master' into hybrid-parser
Jun 14, 2020
92322e1
Address post-merge linting issues.
Jun 14, 2020
9c971a1
Linting
Jun 14, 2020
71805f9
Fix issues following pass across most test cases
Jun 16, 2020
42f8321
Clean up notebooks, address review comments
Jul 4, 2020
2 changes: 1 addition & 1 deletion .gitignore
@@ -16,4 +16,4 @@ _build/
 htmlcov/

 # vscode
-.vscode
+.vscode
126 changes: 115 additions & 11 deletions camelot/cli.py
@@ -18,7 +18,7 @@
 logger.setLevel(logging.INFO)


-class Config(object):
+class Config():
     def __init__(self):
         self.config = {}

@@ -31,7 +31,8 @@ def set_config(self, key, value):

 @click.group(name="camelot")
 @click.version_option(version=__version__)
-@click.option("-q", "--quiet", is_flag=False, help="Suppress logs and warnings.")
+@click.option("-q", "--quiet", is_flag=False,
+              help="Suppress logs and warnings.")
 @click.option(
     "-p",
     "--pages",
@@ -57,7 +58,7 @@ def set_config(self, key, value):
     "-flag",
     "--flag_size",
     is_flag=True,
-    help="Flag text based on" " font size. Useful to detect super/subscripts.",
+    help="Flag text based on font size. Useful to detect super/subscripts.",
 )
 @click.option(
     "-strip",
@@ -98,7 +99,8 @@ def cli(ctx, *args, **kwargs):
     " where x1, y1 -> left-top and x2, y2 -> right-bottom.",
 )
 @click.option(
-    "-back", "--process_background", is_flag=True, help="Process background lines."
+    "-back", "--process_background", is_flag=True,
+    help="Process background lines."
 )
 @click.option(
     "-scale",
@@ -127,7 +129,8 @@ def cli(ctx, *args, **kwargs):
     "-l",
     "--line_tol",
     default=2,
-    help="Tolerance parameter used to merge close vertical" " and horizontal lines.",
+    help="Tolerance parameter used to merge close vertical"
+    " and horizontal lines.",
 )
 @click.option(
     "-j",
@@ -197,12 +200,15 @@ def lattice(c, *args, **kwargs):
             raise ImportError("matplotlib is required for plotting.")
     else:
         if output is None:
-            raise click.UsageError("Please specify output file path using --output")
+            raise click.UsageError(
+                "Please specify output file path using --output")
         if f is None:
-            raise click.UsageError("Please specify output file format using --format")
+            raise click.UsageError(
+                "Please specify output file format using --format")

     tables = read_pdf(
-        filepath, pages=pages, flavor="lattice", suppress_stdout=quiet, **kwargs
+        filepath, pages=pages, flavor="lattice", suppress_stdout=quiet,
+        **kwargs
     )
     click.echo(f"Found {tables.n} tables")
     if plot_type is not None:
@@ -247,7 +253,8 @@ def lattice(c, *args, **kwargs):
     "-r",
     "--row_tol",
     default=2,
-    help="Tolerance parameter" " used to combine text vertically, to generate rows.",
+    help="Tolerance parameter"
+    " used to combine text vertically, to generate rows.",
 )
 @click.option(
     "-c",
@@ -288,9 +295,11 @@ def stream(c, *args, **kwargs):
             raise ImportError("matplotlib is required for plotting.")
     else:
         if output is None:
-            raise click.UsageError("Please specify output file path using --output")
+            raise click.UsageError(
+                "Please specify output file path using --output")
         if f is None:
-            raise click.UsageError("Please specify output file format using --format")
+            raise click.UsageError(
+                "Please specify output file format using --format")

     tables = read_pdf(
         filepath, pages=pages, flavor="stream", suppress_stdout=quiet, **kwargs
@@ -302,3 +311,98 @@ def stream(c, *args, **kwargs):
             plt.show()
     else:
         tables.export(output, f=f, compress=compress)
+
+
+@cli.command("network")
+@click.option(
+    "-R",
+    "--table_regions",
+    default=[],
+    multiple=True,
+    help="Page regions to analyze. Example: x1,y1,x2,y2"
+    " where x1, y1 -> left-top and x2, y2 -> right-bottom.",
+)
+@click.option(
+    "-T",
+    "--table_areas",
+    default=[],
+    multiple=True,
+    help="Table areas to process. Example: x1,y1,x2,y2"
+    " where x1, y1 -> left-top and x2, y2 -> right-bottom.",
+)
+@click.option(
+    "-C",
+    "--columns",
+    default=[],
+    multiple=True,
+    help="X coordinates of column separators.",
+)
+@click.option(
+    "-e",
+    "--edge_tol",
+    default=50,
+    help="Tolerance parameter" " for extending textedges vertically.",
+)
+@click.option(
+    "-r",
+    "--row_tol",
+    default=2,
+    help="Tolerance parameter"
+    " used to combine text vertically, to generate rows.",
+)
+@click.option(
+    "-c",
+    "--column_tol",
+    default=0,
+    help="Tolerance parameter"
+    " used to combine text horizontally, to generate columns.",
+)
+@click.option(
+    "-plot",
+    "--plot_type",
+    type=click.Choice(["text", "grid", "contour", "textedge"]),
+    help="Plot elements found on PDF page for visual debugging.",
+)
+@click.argument("filepath", type=click.Path(exists=True))
+@pass_config
+def network(c, *args, **kwargs):
+    """Use spaces between text to parse the table."""
+    conf = c.config
+    pages = conf.pop("pages")
+    output = conf.pop("output")
+    f = conf.pop("format")
+    compress = conf.pop("zip")
+    quiet = conf.pop("quiet")
+    plot_type = kwargs.pop("plot_type")
+    filepath = kwargs.pop("filepath")
+    kwargs.update(conf)
+
+    table_regions = list(kwargs["table_regions"])
+    kwargs["table_regions"] = None if not table_regions else table_regions
+    table_areas = list(kwargs["table_areas"])
+    kwargs["table_areas"] = None if not table_areas else table_areas
+    columns = list(kwargs["columns"])
+    kwargs["columns"] = None if not columns else columns
+
+    if plot_type is not None:
+        if not _HAS_MPL:
+            raise ImportError("matplotlib is required for plotting.")
+    else:
+        if output is None:
+            raise click.UsageError(
+                "Please specify output file path using --output")
+        if f is None:
+            raise click.UsageError(
+                "Please specify output file format using --format")
+
+    tables = read_pdf(
+        filepath, pages=pages, flavor="network",
+        suppress_stdout=quiet, **kwargs
+    )
+    click.echo(f"Found {tables.n} tables")
+    if plot_type is not None:
+        for table in tables:
+            plot(table, kind=plot_type)
+            plt.show()
+    else:
+        tables.export(output, f=f, compress=compress)
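
The new `network` command repeats two steps that `lattice` and `stream` also perform: empty multi-value options (`table_regions`, `table_areas`, `columns`) are turned into `None`, and an output path and format are required unless a plot was requested. A minimal standalone sketch of those two steps follows. The helper names `normalize_multi_options` and `check_export_args` are hypothetical and not part of this PR, `ValueError` stands in for `click.UsageError` so the sketch needs no third-party imports, and the matplotlib availability check is left out.

```python
from typing import Any, Dict, Iterable, Optional


def normalize_multi_options(
    kwargs: Dict[str, Any], keys: Iterable[str]
) -> Dict[str, Any]:
    """Return a copy of kwargs where empty multi-value options become None.

    Hypothetical helper: mirrors the three `list(...)` / `None if not ...`
    pairs in the command body, without mutating the caller's dict.
    """
    result = dict(kwargs)
    for key in keys:
        values = list(result.get(key) or [])
        result[key] = values if values else None
    return result


def check_export_args(
    plot_type: Optional[str], output: Optional[str], fmt: Optional[str]
) -> None:
    """Require an output path and format unless a plot was requested.

    Hypothetical helper: raises ValueError where the CLI raises
    click.UsageError.
    """
    if plot_type is not None:
        return  # plotting does not export, so no path or format is needed
    if output is None:
        raise ValueError("Please specify output file path using --output")
    if fmt is None:
        raise ValueError("Please specify output file format using --format")


if __name__ == "__main__":
    # A repeatable option that was never passed arrives as an empty sequence.
    raw = {
        "table_regions": (),
        "table_areas": ("316,499,566,337",),
        "columns": (),
    }
    print(normalize_multi_options(raw, ["table_regions", "table_areas", "columns"]))
```

Factoring the shared steps out this way would be one option for reducing the duplication across the three commands; whether that fits the PR's refactoring direction is a judgment call for the author.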