
LayoutLM: Pre-training of Text and Layout for Document Image Understanding

Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou


```bibtex
@article{xu2019layout,
  title={LayoutLM: Pre-training of Text and Layout for Document Image Understanding},
  author={Yiheng Xu and Minghao Li and Lei Cui and Shaohan Huang and Furu Wei and Ming Zhou},
  journal={ArXiv},
  year={2019},
  volume={abs/1912.13318}
}
```

Pipeline

| Receipt detection | Receipt localization | Receipt normalization | Text line segmentation | Optical character recognition | Semantic analysis |
| --- | --- | --- | --- | --- | --- |
|  |  |  |  |  | ✔️ |

Semantic analysis

  • Fields extracted:

    • FUNSD:

      9,707 semantic entities and 31,485 words. The forms are organized as a list of interlinked semantic entities. Each semantic entity comprises a unique identifier, a label (i.e., question, answer, header, or other), a bounding box, a list of links to other entities, and a list of words (see the example record after this list)

    • SROIE:

      • company,
      • date,
      • address,
      • total
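For reference, a FUNSD-style entity record looks roughly like the following. This is a sketch based on the description above; the exact keys in the released JSON may differ.

```python
# Hypothetical FUNSD-style semantic entity (field names are illustrative)
entity = {
    "id": 3,                       # unique identifier
    "label": "question",           # question, answer, header, or other
    "box": [84, 109, 136, 119],    # bounding box in image coordinates
    "linking": [[3, 4]],           # links to other entities (e.g. question -> answer)
    "words": [{"text": "DATE:", "box": [84, 109, 136, 119]}],
}
```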
  • Based on the BERT model

    attention-based bidirectional language modeling

    Tokens are processed with WordPiece; the input embeddings are computed by summing the corresponding word embeddings, position embeddings, and segment embeddings (see the sketch below)
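A minimal PyTorch sketch of this embedding sum (class and parameter names are illustrative; defaults follow BERT-base):

```python
import torch
import torch.nn as nn

class BertStyleEmbeddings(nn.Module):
    """BERT input embeddings: word + 1-D position + segment, summed."""
    def __init__(self, vocab_size=30522, hidden_size=768, max_len=512, num_segments=2):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden_size)
        self.pos_emb = nn.Embedding(max_len, hidden_size)
        self.seg_emb = nn.Embedding(num_segments, hidden_size)

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, seq_len) integer tensors
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.word_emb(token_ids) + self.pos_emb(positions) + self.seg_emb(segment_ids)
```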

  • When it comes to visually rich documents, there is much more information that can be encoded into the pre-trained model

    • Document Layout Information

      relative positions of words in a document contribute a lot to the semantic representation

    • Visual Information

  • LayoutLM adds two types of new input embeddings: a 2-D position embedding and an image embedding (see the sketch after this list)

    • 2-D Position Embedding

    • Image Embedding

      Using the bounding box of each word from the OCR results, the image is split into pieces that have a one-to-one correspondence with the words. Image region features for these pieces are generated with the Faster R-CNN (Ren et al., 2015) model and used as the token image embeddings

      A ResNet-101 model, pre-trained on the Visual Genome dataset, serves as the backbone network of the Faster R-CNN model
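A sketch of how the two new embeddings could be combined with the word embedding. Names and the exact summation point are simplifications; the paper normalizes box coordinates to integers in [0, 1000] and shares one lookup table for x-coordinates and one for y-coordinates:

```python
import torch
import torch.nn as nn

class LayoutEmbeddings(nn.Module):
    """Sketch: word embedding + 2-D position embedding + image region embedding."""
    def __init__(self, vocab_size=30522, hidden_size=768, coord_bins=1001):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden_size)
        # Shared lookup tables for the two corner points of each word's box:
        # (x0, y0) upper left and (x1, y1) lower right.
        self.x_emb = nn.Embedding(coord_bins, hidden_size)
        self.y_emb = nn.Embedding(coord_bins, hidden_size)

    def forward(self, token_ids, bboxes, image_feats):
        # bboxes: (batch, seq_len, 4) integer coordinates in [0, 1000]
        # image_feats: (batch, seq_len, hidden_size) Faster R-CNN region
        # features per word crop, assumed already projected to hidden_size
        x0, y0, x1, y1 = bboxes.unbind(-1)
        pos_2d = self.x_emb(x0) + self.y_emb(y0) + self.x_emb(x1) + self.y_emb(y1)
        return self.word_emb(token_ids) + pos_2d + image_feats
```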

  • Pretraining on two tasks:

    • Masked Visual-Language Model

      randomly mask some of the input tokens but keep the corresponding 2-D position embeddings and other text embeddings; the model is then trained to predict the masked tokens given the context (a masking sketch follows this list)

    • Multi-label Document Classification
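A sketch of the MVLM input corruption, simplified: BERT-style masking in practice also replaces some tokens with random words or leaves them unchanged.

```python
import torch

def mask_for_mvlm(token_ids, mask_token_id, mask_prob=0.15):
    """Randomly mask text tokens while the 2-D positions (bounding boxes)
    are passed through unchanged, so the model must predict each masked
    word from its textual context and its location on the page."""
    labels = token_ids.clone()
    mask = torch.rand(token_ids.shape, device=token_ids.device) < mask_prob
    corrupted = token_ids.masked_fill(mask, mask_token_id)
    labels[~mask] = -100  # ignore index for the cross-entropy loss
    return corrupted, labels  # bounding boxes stay untouched
```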

  • Fine-tuning

    For the form and receipt understanding tasks, LayoutLM predicts {B, I, E, S, O} tags for each token and uses sequence labeling to detect each type of entity in the dataset (a small decoding sketch follows)
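A small decoder from {B, I, E, S, O} tags to entity spans, as a sketch of how the sequence labeling output could be read out (assumes well-formed tag sequences):

```python
def bieso_to_spans(tags):
    """Decode tags like 'B-total', 'E-total', 'S-date' into
    (label, start, end_inclusive) tuples."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags):
        if tag == "O":
            start, label = None, None
        elif tag.startswith("S-"):          # single-token entity
            spans.append((tag[2:], i, i))
        elif tag.startswith("B-"):          # entity begins
            start, label = i, tag[2:]
        elif tag.startswith("E-") and label == tag[2:]:  # entity ends
            spans.append((label, start, i))
            start, label = None, None
    return spans

print(bieso_to_spans(["O", "B-total", "E-total", "S-date", "O"]))
# [('total', 1, 2), ('date', 3, 3)]
```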

Notes

  • LayoutLM jointly models the interaction between text and layout information across scanned document images

  • It is a pre-training method of text and layout for document image understanding tasks

  • Limitations of other methods:
    • They only relied on a few human-labeled training samples, yet did not fully explore the possibility of using large-scale unlabeled training samples

    • They usually leveraged either pre-trained CV models or NLP models, but did not consider a joint training of textual and layout information

  • Pre-trained on 8 NVIDIA Tesla V100 32GB GPUs with a total batch size of 80

  • 80 hours to finish one epoch on 11M documents

  • SOTA on SROIE dataset