tesseract-ocr use ascii only?

Source: https://www.devze.com, 2022-12-26 23:26 (reposted from the web)

I have been using tesseract-ocr (in .NET), which has been working well. The images I feed it are ASCII only (A-Z, a-z, 0-9). Is there a way I can tell it not to use special characters?


There's a new thread about this question over at the Google forum linked above. The first answer concludes that it probably isn't possible.

As far as I know, this is correct, if you're using the language data files that are packaged with Tesseract. You can, however, very easily limit the output characters if you're training on your own box files. It's practically automatic: if unicharset_extractor doesn't find any non-ASCII characters in the box files, you'll never see non-ASCII characters in the output.

I was similarly frustrated by all the interpuncts and other unusual characters in my output when I first started using Tesseract, and training on my own box files solved the problem. You can even use the Tesseract training data as a starting point.
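
Before training, it can help to confirm that the box files really are ASCII-only. Here is a rough sketch in Python (not .NET), assuming the usual box-file layout of one glyph per line in the form "symbol left bottom right top page". The function name and the sample lines are made up for illustration.

```python
def non_ascii_glyphs(box_lines):
    """Return the set of non-ASCII symbols found in box-file lines.

    Assumes each non-empty line looks like:
        symbol left bottom right top page
    """
    found = set()
    for line in box_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        symbol = fields[0]
        if not symbol.isascii():  # str.isascii() needs Python 3.7+
            found.add(symbol)
    return found


if __name__ == "__main__":
    # Hypothetical sample; in practice, read the lines from your .box file.
    sample = [
        "A 10 20 30 40 0",
        "\u00b7 31 20 35 40 0",  # an interpunct that should be fixed by hand
        "b 36 20 50 40 0",
    ]
    print(non_ascii_glyphs(sample))
```

If the returned set is empty, the box files contain nothing outside ASCII; otherwise, correct those entries before running unicharset_extractor.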

Use the tessedit_char_whitelist config option to restrict recognition to the characters you list.
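
A minimal sketch of the whitelist approach, written in Python rather than .NET and assuming the tesseract command-line program is on the PATH and accepts "-c name=value" config overrides. The helper names and file paths are hypothetical, and whether the whitelist is honored may depend on the Tesseract version and recognition engine, so check the output on your own images.

```python
import string
import subprocess

# A-Z, a-z, 0-9: the only characters the images are expected to contain.
ASCII_WHITELIST = string.ascii_uppercase + string.ascii_lowercase + string.digits


def build_tesseract_command(image_path, output_base, whitelist=ASCII_WHITELIST):
    """Build the argument list for a tesseract run limited to `whitelist`."""
    return [
        "tesseract",
        image_path,
        output_base,  # tesseract writes its text to <output_base>.txt
        "-c",
        f"tessedit_char_whitelist={whitelist}",
    ]


def run_ocr(image_path, output_base):
    """Run tesseract; raises CalledProcessError if the command fails."""
    subprocess.run(build_tesseract_command(image_path, output_base), check=True)


if __name__ == "__main__":
    # Hypothetical paths, shown only to illustrate the resulting command.
    print(" ".join(build_tesseract_command("scan.png", "scan_out")))
```

The same variable can also be set from a config file or through the API wrapper you are using, as long as the wrapper exposes Tesseract's config variables.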