<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta http-equiv="Cache-Control" content="no-store, must-revalidate" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Optical Character Recognition (OCR) with Grooper</title>
<link rel="stylesheet" href="styles/font-awesome/web-fonts-with-css/css/fontawesome-all.min.css">
<link rel="stylesheet" href="styles/main.css">
<link href="https://fonts.googleapis.com/css?family=Roboto:100,100i,300,300i,400,400i,500,500i,700,700i,900,900i" rel="stylesheet">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.2.1/jquery.min.js"></script>
<script id="mcjs">!function(c,h,i,m,p){m=c.createElement(h),p=c.getElementsByTagName(h)[0],m.async=1,m.src=i,p.parentNode.insertBefore(m,p)}(document,"script","https://chimpstatic.com/mcjs-connected/js/users/2e3eeca20c1420746bffa37d4/96f5c58e37a4abf4a30ce18a4.js");</script>
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-101038961-2"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'UA-101038961-2');
</script>
</head>
<body>
<nav id="nav_primary">
<a href="./index.html"></a>
<input id="checkbox_click" type="checkbox" onclick="noScroll()"/>
<i class="fa fa-bars"></i>
<i class="fa fa-times"></i>
<ul>
<li><a href="./features.html">Features</a></li>
<li><a href="./roadmap.html">Roadmap</a></li>
<!--
<li><a href="./pricing.html">Pricing</a></li>
<li><a href="./training.html">Training</a></li>
-->
<li><a href="./support.html">Support</a></li>
<li><a href="./contact.html">Contact</a></li>
<li><a href="http://xchange.grooper.com">Grooper x Change</a></li>
</ul>
<div class="social">
<a href="https://www.facebook.com/BusinessImagingSystems"><i class="fab fa-facebook"></i></a>
<a href="https://plus.google.com/u/0/b/115220706750340433625/+Bisok/posts"><i class="fab fa-google-plus"></i></a>
<a href="https://twitter.com/BIS_Tweets"><i class="fab fa-twitter"></i></a>
<a href="https://www.linkedin.com/company/business-imaging-systems?trk=prof-exp-company-name"><i class="fab fa-linkedin"></i></a>
<a href="https://www.youtube.com/channel/UCiJPKqS_enHrFsX49cngqag"><i class="fab fa-youtube"></i></a>
</div>
</nav>
<header>
<canvas id="header_canvas"></canvas>
<h1>OCR and Electronic Text Collection</h1>
<p>Who said you can't teach an old dog new tricks? We've taken dated OCR technology and brought it into modern times. Grooper's patented <strong>Synthetic OCR</strong> generates the most accurate text from images and electronic files, regardless of which OCR engine you use. </p>
</header>
<div id="nav_secondary">
<nav id="nav_secondary_container">
<div id="nav_secondary_contents" class="nav_secondary_contents">
<a href="paper-capture.html">Modern Paper Capture</a>
<a href="electronic-documents.html">Electronic Document Processing</a>
<a href="image-optimization.html">Image Optimization</a>
<a href="ocr.html" aria-selected="true">Synthetic OCR</a>
<!--<a href="atomic-regex.html">Atomic RegEx</a>-->
<a href="classification.html">Document Classification</a>
<a href="natural-language-processing.html">Natural Language Processing (NLP)</a>
<a href="design-studio.html">Design Studio</a>
<a href="modern-architecture.html">Modern Architecture</a>
<span id="nav_secondary_indicator"></span>
</div>
</nav>
<button id="nav_secondary_left" type="button">
<svg class="nav_advancer_icon" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 551 1024"><path d="M445.44 38.183L-2.53 512l447.97 473.817 85.857-81.173-409.6-433.23v81.172l409.6-433.23L445.44 38.18z"/></svg>
</button>
<button id="nav_secondary_right" type="button">
<svg class="nav_advancer_icon" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 551 1024"><path d="M105.56 985.817L553.53 512 105.56 38.183l-85.857 81.173 409.6 433.23v-81.172l-409.6 433.23 85.856 81.174z"/></svg>
</button>
</div>
<main>
<section class="col-1 side_by_side figure-medium">
<h2>It all starts with image quality</h2>
<p>Before any OCR action takes place, you'll want to make sure you're handing the OCR activity an image that is straight and free of artifacts. The key is to remove everything from the page that isn't text. Grooper lets you process images through a growing arsenal of exclusive tools and out-of-the-box profiles specifically designed for this task. The best part is these tools won't alter the original version of the image you want to permanently retain.</p>
<figure>
<img class="img-med" src="images/features/image-processing/grooper-image-optimization-line-removal.jpg" />
<figcaption class="text_left">
<h3>Examples</h3>
<ul>
<li>Remove lines</li>
<li>Ensure edges are clean</li>
<li>Remove small specks</li>
<li>Remove large non-text objects</li>
<li>Invert white-on-black zones</li>
<li>Remove hole punches</li>
</ul>
</figcaption>
</figure>
</section>
<section class="col-1 side_by_side figure-medium">
<h5>If at first you don't succeed...</h5>
<h2>Use Synthetic OCR</h2>
<p>No matter how clean and pristine your images may appear, outdated OCR engines still have a difficult time collecting accurate text from images with multiple columns, different font sizes, and image shear. Grooper's patented OCR synthesis engine intelligently performs multiple passes of OCR on different portions of the image and <em>Groops</em> the results together as a single unit, keeping only the most accurate text results.</p>
<figure>
<figcaption>
<i class="fal fa-repeat" aria-hidden="true"></i>
<h3>Iterative OCR</h3>
<p>Iterative OCR is a technique we've developed to capture text that OCR engines simply miss the first time around. We run a pass of OCR on the entire document, drop out any portions of the page where we were able to obtain text, then run another OCR iteration on the new image. With far fewer distractions in the new image, OCR can more clearly find text it missed during the previous passes.</p>
</figcaption>
<img class="img-med" src="images/features/ocr/grooper-ocr-iterative-processing.jpg" />
</figure>
<figure>
<figcaption>
<i class="fal fa-th" aria-hidden="true"></i>
<h3>Cellular Validation</h3>
<p>Multi-column layouts present a unique challenge for OCR. Text on each side of a document may have different font sizes, or the lines of text may be slightly offset from each other. A standard OCR process will have a complete breakdown of accuracy on one of the two sides. Grooper's Cellular Validation OCR splits the image into a grid of multiple areas and OCRs them independently. The result: industry-leading accuracy when it comes to OCRing your documents.</p>
</figcaption>
<img class="img-med" src="images/features/ocr/grooper-ocr-cellular-validation.jpg" />
</figure>
<figure>
<figcaption>
<i class="fal fa-align-center" aria-hidden="true"></i>
<h3>Segment Reprocessing</h3>
<p>The final synthesis task is to perform segment analysis. A "segment" is a small block or line of text on a page. If any segment gets a low OCR confidence score, Grooper independently re-runs OCR on that segment to obtain optimum quality.</p>
</figcaption>
<img class="img-med" src="images/features/ocr/grooper-ocr-segment-reprocessing.jpg" />
</figure>
</section>
<section id="spell_correction" class="col-1 side_by_side figure-large">
<h5>And when all else fails</h5>
<h2>We've got spell-correction</h2>
<p>Powered by our <strong>Atomic RegEx</strong> engine, Grooper can perform OCR correction to fix some pretty ugly stuff.</p>
<figure>
<img class="img-lrg" src="images/features/ocr/grooper-ocr-spell-correction.jpg" />
<figcaption class="text_left">
<h3>Examples</h3>
<ul>
<li>Correct simple OCR mistakes in strings that don't match words in a language of your choice</li>
<li>Fix existing, human-generated typos on documents</li>
<li>Re-insert spaces where OCR falsely jammed multiple words together</li>
<li>Delete strings of non-alphanumeric characters that resemble somebody's attempt at censorship, like "$#@! ^&amp;*"</li>
<li>Repair numeric values where overly aggressive image cleanup has inadvertently removed commas and periods</li>
</ul>
</figcaption>
</figure>
</section>
<section class="col-3 align_top">
<figure>
<i class="fal fa-tachometer" aria-hidden="true"></i>
<figcaption>
<h3>Performance Balancing</h3>
<p>Grooper's new “Run Speed” option lets you strike the right balance between accuracy and performance.</p>
</figcaption>
</figure>
<figure>
<i class="fal fa-globe" aria-hidden="true"></i>
<figcaption>
<h3>Language Support</h3>
<p>Grooper supports 35 languages, which can be individually enabled or disabled, and performs automatic language detection.</p>
</figcaption>
</figure>
<figure>
<i class="fal fa-file-pdf" aria-hidden="true"></i>
<figcaption>
<h3>Electronic Text</h3>
<p>Grooper avoids OCR altogether when dealing with original text-based files like Word, Excel, and Text PDFs. Instead, Grooper pulls complete and perfect text directly out of the file.</p>
</figcaption>
</figure>
</section>
<section class="col-1 side_by_side figure-medium">
<h5>Our secret blend of</h5>
<h2>PDF Text Extraction</h2>
<p>PDF has become the most widely used document standard in the world. With that adoption comes a variety of challenges you'll have to face in order to get the best text from every page. Some PDFs are purely text-based, others are just images repackaged into a PDF format, and still others have combinations of the two scattered throughout their pages.</p>
<figure>
<img class="img-med" src="images/features/ocr/grooper-ocr-pdf-text-extraction.jpg" />
<figcaption>
<h3>Our hybrid approach</h3>
<ul>
<li>Grooper examines each page within a PDF to place the page into one of three categories: image-based, text-based, or mixed-content. Then each page is handled accordingly.</li>
<li>If a PDF page contains a single image which covers the entire page, it is considered an image-based page, and is processed using OCR.</li>
<li>If a PDF page contains no images, it is considered a text-based page, and we extract only the raw text behind the page.</li>
<li>For mixed-content pages, each image on the page is extracted to a temporary image. Each temporary image is processed through OCR. Then the OCR results are merged with the native text.</li>
</ul>
</figcaption>
</figure>
</section>
</main>
<div id="prev_next">
<div>
<a href="./image-optimization.html">
<i class="fa fa-arrow-left" aria-hidden="true"></i>
<span class="direction">Previous</span>
<span>Image Optimization</span>
</a>
</div>
<div>
<!--<a href="./atomic-regex.html">-->
<a href="./classification.html">
<i class="fa fa-arrow-right" aria-hidden="true"></i>
<span class="direction">Next</span>
<span>Classification</span>
</a>
</div>
</div>
<footer>
<section id="schedule_demo" class="col-1 bg-dark gray">
<a class="btn-lrg" href="https://goo.gl/forms/kwEe075qCFCWIl2X2" target="_blank">Schedule Demo</a>
<h2></h2>
<p>Meet with one of our Grooper Gurus to ask any questions you have about Grooper's functionality and pricing model. Then, see a live presentation that demonstrates Grooper's uniqueness, the end-user experience, and how the magic works behind the scenes. No "smoke and mirrors" here.</p>
</section>
<div id="footer_links">
<div id="contact">
<h4>Contact</h4>
<ul>
<li><a href="mailto:info@grooper.com">info@grooper.com</a></li>
<li><a href="tel:1-800-408-5668">Support: 1-800-408-5668</a></li>
<li><a href="tel:1-405-507-7000">Sales: 1-405-507-7000</a></li>
<li>13900 N. Harvey Avenue</li>
<li>Edmond, OK 73013</li>
</ul>
</div>
<div id="discover">
<h4>Discover</h4>
<ul>
<li><a href="./features.html">Features</a></li>
<li><a href="./roadmap.html">Roadmap</a></li>
<!--
<li><a href="./pricing.html">Plans & Pricing</a></li>
<li><a href="./training.html">Training Program</a></li>
-->
<li><a href="http://xchange.grooper.com">Grooper x Change</a></li>
<li><a href="./support.html">Get Support</a></li>
<li><a href="./contact.html">Contact Us</a></li>
</ul>
</div>
<div id="signup">
<h4>Newsletter</h4>
<form action="https://bisok.us2.list-manage.com/subscribe/post?u=2e3eeca20c1420746bffa37d4&id=98680ff514" method="post" id="mc-embedded-subscribe-form" name="mc-embedded-subscribe-form" class="validate" target="_blank" novalidate>
<div id="mc_embed_signup_scroll">
<div class="mc-field-group">
<!--<label for="mce-EMAIL">Email Address </label>-->
<input type="email" value="" name="EMAIL" class="required email" id="mce-EMAIL" placeholder="Email Address">
</div>
<div class="mc-field-group">
<!--<label for="mce-FNAME">First Name </label>-->
<input type="text" value="" name="FNAME" class="" id="mce-FNAME" placeholder="First Name">
</div>
<div class="mc-field-group">
<!--<label for="mce-LNAME">Last Name </label>-->
<input type="text" value="" name="LNAME" class="" id="mce-LNAME" placeholder="Last Name">
</div>
<div class="mc-field-group">
<!--<label for="mce-COMPANY">Company </label>-->
<input type="text" value="" name="COMPANY" class="" id="mce-COMPANY" placeholder="Company">
</div>
<div id="mce-responses" class="clear">
<div class="response" id="mce-error-response" style="display:none"></div>
<div class="response" id="mce-success-response" style="display:none"></div>
</div> <!-- real people should not fill this in and expect good things - do not remove this or risk form bot signups-->
<div style="position: absolute; left: -5000px;" aria-hidden="true"><input type="text" name="b_2e3eeca20c1420746bffa37d4_98680ff514" tabindex="-1" value=""></div>
<div class="clear content__button"><input type="submit" value="Subscribe" name="subscribe" id="mc-embedded-subscribe" class="button"></div>
</div>
</form>
</div>
</div>
<div class="social">
<a href="https://www.facebook.com/BusinessImagingSystems"><i class="fab fa-facebook"></i></a>
<a href="https://plus.google.com/u/0/b/115220706750340433625/+Bisok/posts"><i class="fab fa-google-plus"></i></a>
<a href="https://twitter.com/BIS_Tweets"><i class="fab fa-twitter"></i></a>
<a href="https://www.linkedin.com/company/business-imaging-systems?trk=prof-exp-company-name"><i class="fab fa-linkedin"></i></a>
<a href="https://www.youtube.com/channel/UCiJPKqS_enHrFsX49cngqag"><i class="fab fa-youtube"></i></a>
</div>
<div id="copyright">
<span>©2018 Grooper, LLC. All Rights Reserved.</span>
<img src="images/g_symbol_64.png" />
</div>
</footer>
<script src="js/animate-header.js"></script>
<script src="js/nav-primary.js"></script>
<script src="js/nav-secondary.js"></script>
<script src="js/slideshow.js"></script>
</body>
</html>