The first official alpha version of OCRopus, the Google-sponsored OCR software for Linux, was released yesterday. OCRopus is built on top of Tesseract, the venerable open-source optical character recognition (OCR) engine originally developed at HP, and is distributed under the Apache License 2.0.
OCRopus uses Tesseract for character recognition but has its own layout analysis system that is optimized for accuracy. The OpenFst library is used for language modeling, although the language modeling component still has some performance issues. OCRopus is designed to be modular, so that other character recognition and language modeling components can be swapped in, which should eventually make it possible to add support for non-Latin languages. An embedded Lua interpreter is used for scripting and configuration. The developers chose Lua rather than Python because Lua is slimmer and easier to embed. This release also includes some new image cleanup and de-skewing code.
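To illustrate what a modular design of this kind can look like, here is a minimal sketch of a pipeline with interchangeable recognition and language modeling stages. It is not OCRopus code and does not use the OCRopus, Tesseract, or OpenFst APIs: every name in it (`Recognizer`, `LanguageModel`, `ToyRecognizer`, `VocabularyModel`, `Pipeline`, `split_lines`) is hypothetical, the "page" is a plain string standing in for a scanned image, and it is written in Python for readability even though OCRopus itself is scripted in Lua.

```python
from dataclasses import dataclass
from typing import Protocol


class Recognizer(Protocol):
    """Hypothetical interface: turns one text line into candidate readings."""
    def candidates(self, line: str) -> list[str]: ...


class LanguageModel(Protocol):
    """Hypothetical interface: picks the most plausible candidate."""
    def pick(self, candidates: list[str]) -> str: ...


def split_lines(page: str) -> list[str]:
    # Stand-in for layout analysis: a real system segments an image,
    # here we only split a string into its non-empty lines.
    return [line for line in page.splitlines() if line.strip()]


class ToyRecognizer:
    # Stand-in for a character recognizer: offers the raw reading plus
    # an alternative in which the digit 0 is read as the letter o.
    def candidates(self, line: str) -> list[str]:
        return [line, line.replace("0", "o")]


class VocabularyModel:
    # Stand-in for a language model: prefers the candidate containing
    # the most known words (ties go to the first candidate).
    def __init__(self, vocabulary: set[str]) -> None:
        self.vocabulary = set(vocabulary)

    def pick(self, candidates: list[str]) -> str:
        def score(text: str) -> int:
            return sum(word in self.vocabulary for word in text.split())
        return max(candidates, key=score)


@dataclass
class Pipeline:
    # Either stage can be replaced without touching the other.
    recognizer: Recognizer
    language_model: LanguageModel

    def run(self, page: str) -> list[str]:
        return [
            self.language_model.pick(self.recognizer.candidates(line))
            for line in split_lines(page)
        ]


if __name__ == "__main__":
    pipeline = Pipeline(ToyRecognizer(), VocabularyModel({"hello", "world"}))
    print(pipeline.run("hell0 w0rld\nhello"))
```

Because the pipeline depends only on the two small interfaces, supporting a different script would, in a design like this, mean supplying a different recognizer and language model rather than rewriting the layout analysis or the pipeline itself.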