# Attention-based OCR

Visual attention-based OCR model for image recognition, with additional tools for creating TFRecords datasets and exporting the trained model with weights as a [SavedModel](https://www.tensorflow.org/api_docs/python/tf/saved_model) or a frozen graph.

## Acknowledgements

This project is based on a model by [Qi Guo](http://qiguo.ml) and [Yuntian Deng](https://github.com/da03). You can find the original model in the [da03/Attention-OCR](https://github.com/da03/Attention-OCR) repository.

## The model

Authors: [Qi Guo](http://qiguo.ml) and [Yuntian Deng](https://github.com/da03).

The model first runs a sliding CNN on the image (images are resized to a height of 32 pixels while preserving the aspect ratio). Then an LSTM is stacked on top of the CNN. Finally, an attention model is used as a decoder to produce the final outputs.

## Installation

```
pip install aocr
```

Note: TensorFlow 1.2 and NumPy will be installed as dependencies. Additional dependencies are `PIL`/`Pillow`, `distance`, and `six`.
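
The package installs a single `aocr` command-line tool, used in all of the examples below. As a quick sanity check (optionally inside a virtual environment), you can print the CLI help; the environment name `aocr-env` and the standard `--help` flag are assumptions here:

```
# Optional: install into an isolated environment
# (assumes the virtualenv tool is available).
virtualenv aocr-env
source aocr-env/bin/activate
pip install aocr

# Verify that the command-line tool is on the PATH and list the available options.
aocr --help
```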

## Usage

### Create a dataset

```
aocr dataset datasets/annotations-training.txt datasets/training.tfrecords
aocr dataset datasets/annotations-testing.txt datasets/testing.tfrecords
```

Annotations are plain text files containing the image paths (either absolute or relative to your working directory) and their corresponding labels, one space-separated pair per line:

```
datasets/images/hello.jpg hello
datasets/images/world.jpg world
```
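
Any process that produces lines in this format will work. As a minimal sketch, assuming a directory of JPEG images whose file names (without the extension) are the ground-truth labels, a shell loop can generate the annotations file:

```
# Build an annotations file from images named after their labels,
# e.g. datasets/images/hello.jpg -> label "hello".
for f in datasets/images/*.jpg; do
    echo "$f $(basename "$f" .jpg)"
done > datasets/annotations-training.txt
```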

### Train

```
aocr train datasets/training.tfrecords
```

A new model will be created, and the training will start. Note that it takes quite a long time to reach convergence, since we are training the CNN and the attention model simultaneously.

The `--steps-per-checkpoint` parameter determines how often model checkpoints are saved (the default output directory is `checkpoints/`).

**Important:** there are many more training options available. See the CLI help or the Parameters section of this README.
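
For example, the defaults can be overridden on the command line. The values below are purely illustrative (see the Parameters section for what each flag does):

```
# Illustrative settings only -- adjust to your dataset and hardware.
aocr train datasets/training.tfrecords \
  --batch-size=65 \
  --steps-per-checkpoint=500 \
  --num-epoch=100
```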

### Test and visualize

```
aocr test datasets/testing.tfrecords
```

Additionally, you can visualize the attention results during testing (saved to `results/` by default):

```
aocr test --visualize datasets/testing.tfrecords
```

Example output images in `results/correct` (captions follow the format `Image index (predicted/ground truth)`):

Image 0 (j/j), Image 1 (u/u), Image 2 (n/n), Image 3 (g/g), Image 4 (l/l), Image 5 (e/e).

### Export

```
aocr export exported-model
```

This loads the weights from the latest checkpoint and exports the model into the `./exported-model` directory.
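
Judging by the `format` option listed in the Parameters section below, the export type can presumably be selected explicitly; a hedged sketch (the `--format` flag spelling is an assumption based on that parameter name):

```
# Export a frozen graph instead of a SavedModel
# (--format corresponds to the `format` parameter documented below).
aocr export exported-model --format=frozengraph
```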

## Google Cloud ML Engine

To train the model on the [Google Cloud Machine Learning Engine](https://cloud.google.com/ml-engine/), upload the training dataset to a Google Cloud Storage bucket and start a training job with the `gcloud` tool.

1. Set the environment variables:

```
# Prefix for the job name.
export JOB_PREFIX="aocr"

# Region to launch the training job in.
# Should be the same as the storage bucket region.
export REGION="us-central1"

# Your storage bucket.
export GS_BUCKET="gs://aocr-bucket"

# Path to store your training dataset in the bucket.
export DATASET_UPLOAD_PATH="training.tfrecords"
```
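
If the bucket does not exist yet, it can be created in the chosen region first. This step is an assumption about your setup (skip it if the bucket is already in place):

```
# Create the storage bucket in the same region as the training job.
gsutil mb -l $REGION $GS_BUCKET
```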

2. Upload the training dataset:

```
gsutil cp datasets/training.tfrecords $GS_BUCKET/$DATASET_UPLOAD_PATH
```

3. Launch the ML Engine job:

```
export NOW=$(date +"%Y%m%d_%H%M%S")
export JOB_NAME="$JOB_PREFIX$NOW"
export JOB_DIR="$GS_BUCKET/$JOB_NAME"

gcloud ml-engine jobs submit training $JOB_NAME \
  --job-dir $JOB_DIR \
  --module-name=aocr \
  --package-path=aocr \
  --region=$REGION \
  --scale-tier=BASIC_GPU \
  --runtime-version 1.2 \
  -- \
  train $GS_BUCKET/$DATASET_UPLOAD_PATH \
  --steps-per-checkpoint=3000
```
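
Once the job is submitted, you can follow its progress from the terminal; a sketch using the standard `gcloud` job commands (the job name comes from the variables set above):

```
# Stream the training logs and check the job status.
gcloud ml-engine jobs stream-logs $JOB_NAME
gcloud ml-engine jobs describe $JOB_NAME
```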

## Parameters

### Global
 * `log-path`: Path for the log file.

### Testing
 * `visualize`: Output the attention maps on the original image.

### Exporting
 * `format`: Format for the export (either `savedmodel` or `frozengraph`).

### Training
 * `steps-per-checkpoint`: How often to checkpoint (print the perplexity and save the model), in steps.
 * `num-epoch`: The number of whole data passes.
 * `batch-size`: Batch size.
 * `initial-learning-rate`: Initial learning rate. Note that we use AdaDelta, so the initial value does not matter much.
 * `target-embedding-size`: Embedding dimension for each target.
 * `attn-use-lstm`: Whether or not to use the LSTM attention decoder cell.
 * `attn-num-hidden`: Number of hidden units in the attention decoder cell.
 * `attn-num-layers`: Number of layers in the attention decoder cell. (The encoder's number of hidden units will be `attn-num-hidden` * `attn-num-layers`.)
 * `target-vocab-size`: Target vocabulary size. The default is 39 (26 + 10 + 3): 0 is PADDING, 1 is GO, 2 is EOS, and the remaining indices map to 0-9 and a-z.
 * `no-resume`: Create new weights even if there are checkpoints present.
 * `max-gradient-norm`: Clip gradients to this norm.
 * `no-gradient-clipping`: Do not perform gradient clipping.
 * `gpu-id`: GPU to use.
 * `use-gru`: Use GRU cells.
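
As a hedged illustration of combining these training flags (the values and the particular combination below are arbitrary examples, not recommendations):

```
# Start from fresh weights on GPU 0, use GRU cells,
# and clip gradients to a custom norm.
aocr train datasets/training.tfrecords \
  --no-resume \
  --use-gru \
  --gpu-id=0 \
  --max-gradient-norm=5.0
```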

## References

[Convert a formula to its LaTeX source](https://github.com/harvardnlp/im2markup)