
tesseract-ocr: use ASCII only?

Developer https://www.devze.com 2022-12-26 23:26 Source: Internet

I have been using tesseract-ocr (in .NET), which has been working well. The images I feed it are ASCII only (A-Za-z0-9). Is there a way I can tell it not to use special characters?


There's a new thread about this question over at the Google forum linked above. The first answer concludes that it probably isn't possible.

As far as I know, this is correct, if you're using the language data files that are packaged with Tesseract. You can, however, very easily limit the output characters if you're training on your own box files. It's practically automatic: if unicharset_extractor doesn't find any non-ASCII characters in the box files, you'll never see non-ASCII characters in the output.

I was similarly frustrated by all the interpuncts and other unusual characters in my output when I first started using Tesseract, and training on my own box files solved the problem. You can even use the Tesseract training data as a starting point.
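If you go the training route, it may be worth checking your box files for stray non-ASCII symbols before running unicharset_extractor. The following is a minimal sketch, not part of the Tesseract tooling: the function name non_ascii_symbols is hypothetical, and it assumes the classic box-file layout in which each line starts with the symbol, followed by its coordinates and page number.

```python
def non_ascii_symbols(box_text):
    """Return the set of symbols in box-file text that contain non-ASCII characters.

    Assumes each non-empty line looks like: <symbol> <left> <bottom> <right> <top> <page>
    """
    found = set()
    for line in box_text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        symbol = fields[0]
        # str.isascii() is available from Python 3.7 onwards.
        if not symbol.isascii():
            found.add(symbol)
    return found


if __name__ == "__main__":
    # Hypothetical sample: the second line holds an interpunct.
    sample = "A 10 10 20 30 0\n\u00b7 22 10 26 30 0\nb 30 10 40 30 0\n"
    print(non_ascii_symbols(sample))
```

An empty result suggests the box file contains only ASCII symbols, which is the condition the answer above relies on.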


Use the tessedit_char_whitelist config option to list the only characters Tesseract is allowed to output.
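As a rough illustration of the whitelist approach, here is a sketch in Python rather than .NET. It assumes the pytesseract wrapper and Pillow are installed. The helper names whitelist_config and ocr_ascii_only are hypothetical. How strictly the whitelist is honoured has varied between Tesseract versions and engine modes, so treat this as a starting point to verify against your own installation.

```python
import string

# Characters the engine is allowed to emit: A-Z, a-z, 0-9.
ASCII_ALNUM = string.ascii_letters + string.digits


def whitelist_config(allowed=ASCII_ALNUM):
    """Build the -c argument that sets tessedit_char_whitelist."""
    return "-c tessedit_char_whitelist=" + allowed


def ocr_ascii_only(image_path):
    """Run OCR restricted to the whitelist (assumes pytesseract and Pillow)."""
    # Imported lazily so whitelist_config() works without the OCR stack installed.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(
        Image.open(image_path), config=whitelist_config()
    )
```

The same variable can be passed to the command-line tool with its -c option. A .NET wrapper would need its own equivalent way of setting the variable before recognition, which depends on the wrapper in use.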
