OCR & OCR Engines
OCR Engines used in solutions we provide:
OpenText RecoStar OCR Engine
Nuance OCR Engine
Google Tesseract OCR Engine
OCR stands for Optical Character Recognition. An OCR application recognizes and extracts text from scanned documents stored as PDF, TIFF, or other image files. A PDF converter with OCR capability can convert a scanned PDF document into editable text.
Optical Character Recognition (OCR) is a process by which text characters are input to a computer by providing the computer with an image. The computer uses an OCR engine, a program whose specific function is to guess which letter (recognizable to a computer) an image (recognizable to a human) represents.
Paperless includes an OCR engine, which it uses to recognize text and numerical values. To understand how the OCR engine in Paperless produces OCR results, it also helps to understand how OCR engines make these guesses.
How OCR Engines Work
An OCR engine scans an image for elements that resemble letters it is programmed to recognize. OCR engines use sets of parameters to discern one character from another. For example:
The letters E and F look a lot alike; the most noticeable difference between the two characters is the horizontal bar at the bottom of the letter E.
The letters P and D look a lot alike. The primary difference between them is that the letter P has a vertical line that extends below the loop shape at the top of the letter; the letter D does not.
The letters e and a look a lot alike.
The letter q and the number 9 look a lot alike in certain fonts.
The number 2 and the letter Z look a lot alike.
The semicolon (;) and the colon (:) symbols are nearly identical.
The period (.) and the comma (,) symbols are nearly identical.
At the most basic level, OCR locates regions of an image that resemble characters it is trained to recognize; once it finds what it believes is a match, it returns a letter (recognizable to the computer) in the OCR results.
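The matching idea can be sketched with a toy example. The code below is a minimal illustration only, not how any of the engines listed above work internally: the 5x3 bitmaps, the TEMPLATES table, and the score and classify functions are all invented for this sketch. It compares a glyph with each known shape pixel by pixel and returns the letter whose shape agrees best.

```python
# Toy 5x3 bitmaps: '#' is ink, '.' is background. These shapes are invented
# for illustration and do not come from any real OCR engine.
TEMPLATES = {
    "E": ["###",
          "#..",
          "###",
          "#..",
          "###"],
    "F": ["###",
          "#..",
          "###",
          "#..",
          "#.."],  # identical to E except for the missing bottom bar
}


def score(glyph, template):
    """Return the fraction of pixels on which glyph and template agree."""
    total = 0
    matches = 0
    for glyph_row, template_row in zip(glyph, template):
        for g, t in zip(glyph_row, template_row):
            total += 1
            matches += (g == t)
    return matches / total


def classify(glyph):
    """Return (best_letter, best_score) for the given glyph."""
    scores = {letter: score(glyph, tpl) for letter, tpl in TEMPLATES.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# A slightly damaged E: one pixel of the middle bar is missing.
noisy_e = ["###",
           "#..",
           "##.",
           "#..",
           "###"]

letter, confidence = classify(noisy_e)
print(letter, round(confidence, 2))
```

In this sketch the damaged glyph still agrees with the E shape on 14 of 15 pixels and with the F shape on only 12, so E is the best guess. Damage to the bottom bar instead would make the two scores much closer, which is why E and F are easy to confuse.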
OCR engines are also programmed to recognize certain fonts. Thus, if the OCR engine is programmed to recognize Bitstream Vera Sans, it will recognize text characters rendered in Bitstream Vera Sans. If the engine has not also been configured to recognize Linux Libertine, it may recognize text in Linux Libertine less accurately than text in Bitstream Vera Sans, or not at all.
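The effect of an unfamiliar font can be illustrated the same way. In the hedged sketch below, "font A" and "font B" are invented 5x3 drawings of the letter E, and the 0.9 threshold is an arbitrary assumption; real engines use far richer models. The point is only that a shape the engine was not configured for agrees less well with what it knows, and may fall below the level at which the engine accepts a match.

```python
# Hypothetical "font A" shape for the letter E on a 5x3 grid ('#' is ink).
FONT_A_E = ["###", "#..", "###", "#..", "###"]


def agreement(glyph, template):
    """Return the fraction of pixels on which glyph and template agree."""
    pairs = [(g, t)
             for glyph_row, template_row in zip(glyph, template)
             for g, t in zip(glyph_row, template_row)]
    return sum(g == t for g, t in pairs) / len(pairs)


def recognize(glyph, threshold=0.9):
    """Return 'E' if the glyph is close enough to the known font, else None."""
    return "E" if agreement(glyph, FONT_A_E) >= threshold else None


e_in_font_a = ["###", "#..", "###", "#..", "###"]
# The same letter in an invented "font B": rounded corners, short middle bar.
e_in_font_b = [".##", "#..", "##.", "#..", ".##"]

print(recognize(e_in_font_a))  # recognized: the shape matches font A exactly
print(recognize(e_in_font_b))  # None: only 12 of 15 pixels agree
```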
The OCR Process
Here is a very basic overview of how an OCR engine processes an image to return the text contained in it:
An image of the document is acquired by the computer.
The image is submitted as input to an OCR engine.
The OCR engine matches portions of the image to shapes it is instructed to recognize.
Using the logic parameters it has been instructed to use, the OCR engine makes its best guess as to which letter each shape represents.
OCR results are returned as text.
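The steps above can be sketched end to end on a tiny hand-made image. This is a simplified illustration under stated assumptions, not the pipeline of any particular engine: the image is a list of strings rather than a scanned file, characters are assumed to be separated by a fully blank column, and TEMPLATES, split_glyphs, best_guess, and recognize are hypothetical names made up for this example.

```python
# Shapes the toy engine is "instructed to recognize" ('#' is ink).
TEMPLATES = {
    "E": ("###", "#..", "###", "#..", "###"),
    "F": ("###", "#..", "###", "#..", "#.."),
    "L": ("#..", "#..", "#..", "#..", "###"),
}


def split_glyphs(image):
    """Cut the image into glyphs wherever a fully blank column occurs."""
    width = len(image[0])
    glyphs, start = [], None
    for x in range(width + 1):
        # Treat the position just past the right edge as a blank column.
        blank = x == width or all(row[x] == "." for row in image)
        if not blank and start is None:
            start = x  # first inked column of a new glyph
        elif blank and start is not None:
            glyphs.append(tuple(row[start:x] for row in image))
            start = None
    return glyphs


def best_guess(glyph):
    """Return the letter whose template agrees with the glyph on most pixels."""
    def agreement(template):
        return sum(g == t
                   for glyph_row, template_row in zip(glyph, template)
                   for g, t in zip(glyph_row, template_row))
    return max(TEMPLATES, key=lambda letter: agreement(TEMPLATES[letter]))


def recognize(image):
    """Match each glyph in the image and return the guesses as text."""
    return "".join(best_guess(glyph) for glyph in split_glyphs(image))


# Step 1: the "acquired" image (here simply typed in by hand).
image = [
    "###.#...###",
    "#...#...#..",
    "###.#...###",
    "#...#...#..",
    "###.###.#..",
]

# Steps 2 to 5: submit the image, match shapes, guess letters, return text.
print(recognize(image))  # prints ELF
```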
