houqp/leptess

Productive and safe Rust bindings/wrappers for Tesseract and Leptonica

Build dependencies

Make sure you have clang, Leptonica and Tesseract installed.

Tesseract should be version 4.0.0 or above.

Ubuntu

sudo apt-get install libleptonica-dev libtesseract-dev clang

You will also need to install tesseract language data based on your OCR needs:

sudo apt-get install tesseract-ocr-eng

Mac

brew install tesseract leptonica

Windows

On Windows, this library uses Microsoft's vcpkg to provide tesseract.

Install vcpkg and set up user-wide integration; otherwise the vcpkg crate won't be able to find the library.

To install tesseract:

REM from the vcpkg directory

REM 32 bit
.\vcpkg install tesseract:x86-windows

REM 64 bit
.\vcpkg install tesseract:x64-windows

To run the tests, configure the vcpkg crate to find the tesseract library:

SET VCPKGRS_DYNAMIC=true
cargo test

Usage

let mut lt = leptess::LepTess::new(None, "eng").unwrap();
lt.set_image("path/to/page.bmp");
println!("{}", lt.get_utf8_text().unwrap());

For more examples, see the docs and the examples directory.

To run the demos in the examples directory, try:

cargo run --example low_level_ocr_full_page

Development

To run the tests, you will need Tesseract 4.x to match what we have in tests/tessdata/eng.traineddata. See the CircleCI config for how to replicate the setup.

Issues

Collection of the latest Issues

darklajid


Hey.

Most OCR work I've seen so far uses (b/w, CCITT compressed) multi-page documents. I'd like to make these work with leptess, but it seems (unless I'm missing something?) that there is support only for Pix (not PixA), and no mapping for direct TIFF I/O (say, pixaReadMultipageTiff from Leptonica). The high-level wrapper (leptess::LepTess) also doesn't expose a method to set_image a Pix directly, but that would be the most trivial thing to change.

In other words: I was hoping for a Rust (leptess) workflow that allows

  • reading a multi-page TIFF as PixA
  • iterating over each page -> Pix and collecting the recognition results

Is that something you'd be willing to support? Am I missing a way how this would work today already? I could offer to look into this, but I admit that I'm a Rust beginner at this point in time.
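The workflow requested above (load every page, run recognition per page, collect the results) can be sketched without depending on any particular binding. Everything in this sketch is hypothetical: `Page` stands in for a decoded Leptonica Pix, `Recognizer` for an OCR engine that accepts an in-memory page, and `MockRecognizer` fakes the OCR step. None of these names exist in leptess; the sketch only illustrates the shape of the iteration.

```rust
/// Hypothetical stand-in for one decoded page (a Leptonica Pix in practice).
struct Page {
    index: usize,
    pixels: Vec<u8>,
}

/// Hypothetical abstraction over an OCR engine that accepts an in-memory page.
trait Recognizer {
    fn recognize(&mut self, page: &Page) -> Result<String, String>;
}

/// Mock engine used only to make the sketch runnable; it performs no real OCR.
struct MockRecognizer;

impl Recognizer for MockRecognizer {
    fn recognize(&mut self, page: &Page) -> Result<String, String> {
        if page.pixels.is_empty() {
            Err(format!("page {} is empty", page.index))
        } else {
            Ok(format!("text of page {} ({} bytes)", page.index, page.pixels.len()))
        }
    }
}

/// Run the recognizer over every page in order, stopping at the first failure.
fn recognize_all<R: Recognizer>(engine: &mut R, pages: &[Page]) -> Result<Vec<String>, String> {
    // Collecting an iterator of `Result`s into `Result<Vec<_>, _>` short-circuits on `Err`.
    pages.iter().map(|page| engine.recognize(page)).collect()
}

fn main() {
    // In a real program these pages would come from a multi-page TIFF reader.
    let pages = vec![
        Page { index: 0, pixels: vec![0; 4] },
        Page { index: 1, pixels: vec![255; 8] },
    ];
    let mut engine = MockRecognizer;
    match recognize_all(&mut engine, &pages) {
        Ok(texts) => {
            for text in texts {
                println!("{}", text);
            }
        }
        Err(e) => eprintln!("recognition failed: {}", e),
    }
}
```

Keeping the engine behind a trait means the per-page loop stays the same whether pages come from a multi-page TIFF or from separate files.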

tcastelly


Hello,

Thank you for this work!

I see a curious behavior when I try to retrieve the text from the image below on the command line:

I have as result, `

But when I use the wrapper

I have:

I've tried to use the traineddata from this repository. Or nothing. But same result.

Maybe the command line uses default parameters.

Thanks in advance

image

DartMen


A way to feed a list of words to supplement Tesseract would be very nice.

In a .NET implementation this seems possible by setting: Tess.setVariable("user_words_suffix", "user-words");

Also consider adding a test method to verify user supplied words are loaded and actually used by Tesseract.

kangalioo


In LepTess::set_image, bool is used as a return value to indicate success or failure. Booleans are not a good choice for a return value for this use case. Better is Option<()> or Result<(), ()> (I personally think Option<()> is nicer, but Result<(), ()> objectively makes more sense)

As a side-effect of this, the code inside the functions can become much more concise and idiomatic:

https://github.com/kangalioo/leptess/commit/786a7519f904e7a6eafa9be18f8a5f201bff9120
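To illustrate the point about return types, here is a minimal standalone sketch comparing a boolean-returning setter with a `Result`-returning one. It does not use leptess: `set_image_bool`, `set_image_result`, `ImageError`, and the extension check are hypothetical stand-ins chosen only to show how `?` propagation and a descriptive error become possible once the function returns `Result`.

```rust
use std::fmt;
use std::path::Path;

/// Hypothetical error type for this sketch; not part of leptess.
#[derive(Debug, PartialEq)]
enum ImageError {
    UnsupportedExtension(String),
}

impl fmt::Display for ImageError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ImageError::UnsupportedExtension(p) => write!(f, "unsupported image file: {}", p),
        }
    }
}

impl std::error::Error for ImageError {}

/// Stand-in for "could the image be loaded?" so the sketch needs no image library.
fn has_supported_extension(path: &str) -> bool {
    matches!(
        Path::new(path).extension().and_then(|e| e.to_str()),
        Some("bmp") | Some("png") | Some("tif")
    )
}

/// Boolean style: the caller learns only that something failed, and the
/// compiler does not warn if the return value is ignored.
fn set_image_bool(path: &str) -> bool {
    has_supported_extension(path)
}

/// Result style: the failure carries a reason, and `Result` is `#[must_use]`.
fn set_image_result(path: &str) -> Result<(), ImageError> {
    if has_supported_extension(path) {
        Ok(())
    } else {
        Err(ImageError::UnsupportedExtension(path.to_string()))
    }
}

/// With `Result`, the `?` operator propagates the first failure to the caller.
fn load_pages(paths: &[&str]) -> Result<usize, ImageError> {
    for &path in paths {
        set_image_result(path)?;
    }
    Ok(paths.len())
}

fn main() {
    // Ignoring a bool is silent; this failure would go unnoticed.
    let _ = set_image_bool("page.txt");

    match load_pages(&["a.bmp", "b.txt"]) {
        Ok(n) => println!("loaded {} pages", n),
        Err(e) => println!("error: {}", e),
    }
}
```

A unit error type, as in `Result<(), ()>`, would already enable `?`; a dedicated error enum additionally tells the caller what went wrong.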

Information - Updated Sep 14, 2022

Stars: 158
Forks: 20
Issues: 7
