Post-OCR parsing: building simple and robust parser via BIO tagging

Wonseok Hwang, Seonghyeon Kim, Minjoon Seo, Jinyeong Yim, Seunghyun Park, Sungrae Park, Junyeop Lee, Bado Lee, Hwalsuk Lee

@inproceedings{hwang2019post,
title={Post-{OCR} parsing: building simple and robust parser via {BIO} tagging},
author={Wonseok Hwang and Seonghyeon Kim and Minjoon Seo and Jinyeong Yim and Seunghyun Park and Sungrae Park and Junyeop Lee and Bado Lee and Hwalsuk Lee},
booktitle={Workshop on Document Intelligence at NeurIPS 2019},
year={2019},
url={https://openreview.net/forum?id=SJgjf695UB}
}

Pipeline

| Receipt detection | Receipt localization | Receipt normalization | Text line segmentation | Optical character recognition | Semantic analysis |
| --- | --- | --- | --- | --- | --- |
| ❌ | ❌ | ❌ | ❌ | ❗ | ✔️ |

Optical character recognition

An in-house OCR system consisting of the CRAFT text detector and the Comb.best text recognizer.

Semantic analysis

Fields extracted (CORD-like annotations):
- store information
- menu:
  - name
  - unit price
  - total price
  - sub-menu
    - ...
- payment information
Serialization:
- Text segments are sorted lexically by their coordinates, top to bottom and then left to right, with the y axis as the primary key.
- Text segments that lie on the same line in the image are grouped together.
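
The two serialization steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Segment` fields, the `serialize` function, and the `tolerance` rule for deciding that two segments share a line are all assumptions, since the paper gives few details.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    text: str
    x: float       # left edge of the text box (assumed field)
    y: float       # vertical centre of the text box (assumed field)
    height: float  # box height, used for the same-line test (assumed field)


def serialize(segments, tolerance=0.5):
    """Sort segments by (y, x) and group those that fall on the same line."""
    ordered = sorted(segments, key=lambda s: (s.y, s.x))
    lines = []
    for seg in ordered:
        # Assumed rule: same line if the vertical offset from the previous
        # segment is at most `tolerance` times the segment height.
        if lines and abs(seg.y - lines[-1][-1].y) <= tolerance * seg.height:
            lines[-1].append(seg)
        else:
            lines.append([seg])
    # Small y jitter can disturb the x order inside a line, so re-sort by x.
    return [sorted(line, key=lambda s: s.x) for line in lines]


segments = [
    Segment("Total", 10, 100, 10),
    Segment("4.50", 80, 101, 10),
    Segment("Latte", 10, 50, 10),
    Segment("4.50", 80, 49, 10),
]
lines = serialize(segments)
texts = [[seg.text for seg in line] for line in lines]
```

The index of each inner list can then serve as the line number used later for the line group embedding.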
BIO Tagging:
- The text segments are tokenized and mapped to input vectors by adding token, segment, (sequential) position, coordinate, and line group embeddings. The first three embeddings are prepared in the same way as in BERT. The coordinate embedding represents the spatial information of the visually embedded text segments. The line group embedding is obtained by embedding the line number found during serialization.
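
A rough sketch of how such an input vector could be assembled is shown below. Only the element-wise sum of the five embedding types comes from the description above; everything else is an assumption made for illustration, including the table sizes, the random initialisation, the `bucket` quantisation of coordinates, and the use of separate x and y tables for the coordinate embedding.

```python
import random

DIM = 8  # hypothetical embedding size, kept small for readability


def make_table(num_rows, seed, dim=DIM):
    """Build a randomly initialised lookup table (stand-in for learned weights)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_rows)]


# Hypothetical lookup tables, one per embedding type.
token_emb = make_table(100, seed=1)
segment_emb = make_table(2, seed=2)
position_emb = make_table(64, seed=3)
x_emb = make_table(32, seed=4)   # assumed: quantised x coordinate
y_emb = make_table(32, seed=5)   # assumed: quantised y coordinate
line_emb = make_table(32, seed=6)


def bucket(value, max_value, num_buckets=32):
    """Map a coordinate in [0, max_value] to an integer bucket index."""
    return min(int(value / max_value * num_buckets), num_buckets - 1)


def embed(token_id, segment_id, position, x, y, line_no, width, height):
    """Sum all embeddings element-wise to form one input vector."""
    rows = [
        token_emb[token_id],
        segment_emb[segment_id],
        position_emb[position],
        x_emb[bucket(x, width)],
        y_emb[bucket(y, height)],
        line_emb[line_no],
    ]
    return [sum(values) for values in zip(*rows)]


vec = embed(token_id=5, segment_id=0, position=3,
            x=40.0, y=120.0, line_no=2, width=200.0, height=400.0)
```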
Parse generation:
- In the receipt parsing task, there is an additional group tag (not to be confused with the line group) that reflects the hierarchical structure of the parses. For example, fields such as name, count, and price are grouped together based on the item they represent.
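
To make the role of the group tag concrete, here is a small decoding sketch. It is an assumption about how the tags might be combined, not the paper's procedure: the tag names (`B-name`, `I-count`, ...), the convention that a group tag of `B` opens a new item, and the `decode` function are all hypothetical, and tokens are treated as whole words joined by spaces.

```python
def decode(tokens, field_tags, group_tags):
    """Merge BIO-tagged tokens into field values and collect them per group."""
    groups = []
    current_field = None  # field whose value is currently being extended
    for token, field_tag, group_tag in zip(tokens, field_tags, group_tags):
        if field_tag == "O":
            # Token belongs to no field; it also ends any running value.
            current_field = None
            continue
        prefix, name = field_tag.split("-", 1)
        if group_tag == "B" or not groups:
            # Assumed convention: a group tag of "B" starts a new item.
            groups.append({})
            current_field = None
        values = groups[-1].setdefault(name, [])
        if prefix == "B" or name != current_field or not values:
            values.append(token)          # start a new value for this field
        else:
            values[-1] += " " + token     # continue the running value
        current_field = name
    return groups


tokens = ["Iced", "Latte", "2", "9.00", "Muffin", "1", "3.50"]
field_tags = ["B-name", "I-name", "B-count", "B-price",
              "B-name", "B-count", "B-price"]
group_tags = ["B", "I", "I", "I", "B", "I", "I"]
parse = decode(tokens, field_tags, group_tags)
```

Here `parse` holds one dictionary per menu item, each mapping a field name to the list of values found for it.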
Refinements:
- (1) Various special symbols in cnt and price values and (2) the thousands separator in price are refined to have a unified representation.
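
A refinement step of this kind might look like the sketch below. The paper does not spell out its rules, so the heuristics here are assumptions: a `.` or `,` followed by exactly three digits is treated as a thousands separator, and everything around the first number is treated as a special symbol to drop. The function names `refine_price` and `refine_count` are hypothetical.

```python
import re


def refine_price(raw):
    """Strip thousands separators and surrounding symbols from a price string."""
    # Assumed heuristic: "." or "," followed by exactly three digits
    # is a thousands separator, so "12.500" and "12,500" both give "12500".
    no_thousands = re.sub(r"[.,](?=\d{3}(?!\d))", "", raw)
    # Keep the first number, with an optional decimal part.
    match = re.search(r"\d+(?:\.\d+)?", no_thousands)
    return match.group() if match else raw


def refine_count(raw):
    """Keep only the digits of a count such as "x2" or "2 x"."""
    match = re.search(r"\d+", raw)
    return match.group() if match else raw


prices = [refine_price(p) for p in ["Rp 12.500", "$4.50", "1,234,567"]]
counts = [refine_count(c) for c in ["x2", "2 x"]]
```

Note that a value such as `1.234` is ambiguous under this heuristic and is read as one thousand two hundred thirty-four.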

Notes

Not many details: it is unclear how serialization works and how the embeddings are calculated. It also appears to work on tokens, not characters.
