Training your Tesseract on Windows 10

Mar 6, 2020

Tesseract is an open-source OCR (Optical Character Recognition) engine originally developed at HP Labs and later maintained by Google. Unlike Microsoft Office Document Imaging (MODI), it can be trained continuously on new samples, so its ability to convert images into text keeps improving. If your team needs to go deeper, it can also serve as a starting point for developing an OCR engine that meets your own needs.

Installation

Double-click the installer package and follow the steps. After a successful installation, there will be a Tesseract-OCR folder on the corresponding disk. Then add this folder's path to the PATH environment variable.

Open the command line, type tesseract, and press Enter. If the usage information is printed, the installation works.
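If you would rather check the setup from a script, a minimal Python sketch like the one below tests whether the executable is reachable through PATH. The helper name find_executable is made up for illustration; the actual lookup is done by shutil.which from the standard library.

```python
import shutil


def find_executable(name="tesseract"):
    """Return the full path of `name` if it is reachable via PATH, else None."""
    # shutil.which searches the directories listed in PATH; on Windows it
    # also tries the extensions listed in PATHEXT (such as .exe).
    return shutil.which(name)


path = find_executable()
if path is None:
    print("tesseract not found: check the PATH environment variable")
else:
    print("tesseract found at", path)
```

If the first message is printed, the Tesseract-OCR folder is most likely missing from PATH, or the command line was opened before the variable was changed.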

How to use

First prepare an image file, such as test.png.

In the command line, change to the directory that contains the target image file, then enter the following command:

tesseract test.png output_1 -l eng

The general syntax is:

tesseract imagename outputbase [-l lang] [--psm pagesegmode] [configfile…]

imagename is the target image file name, including its format suffix; outputbase is the base name of the result file (Tesseract appends .txt to it); lang is the language name, which corresponds to a traineddata file in the tessdata folder under Tesseract-OCR (for example, eng.traineddata for eng). If -l is not specified, the language defaults to eng.
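To call the same command from a script, one option is to assemble the argument list in Python and pass it to subprocess. This is only a sketch: build_ocr_command and run_ocr are hypothetical helper names, and the --psm spelling assumes a Tesseract 4.x-style command line like the one used in the makebox command later in this article.

```python
import subprocess


def build_ocr_command(imagename, outputbase, lang=None, psm=None):
    """Build the argument list for: tesseract imagename outputbase [-l lang] [--psm N]."""
    cmd = ["tesseract", imagename, outputbase]
    if lang is not None:
        cmd += ["-l", lang]          # omitted -> tesseract falls back to eng
    if psm is not None:
        cmd += ["--psm", str(psm)]   # assumption: Tesseract 4.x flag spelling
    return cmd


def run_ocr(imagename, outputbase, lang=None, psm=None):
    """Run tesseract and return the text it wrote to <outputbase>.txt."""
    subprocess.run(build_ocr_command(imagename, outputbase, lang, psm), check=True)
    with open(outputbase + ".txt", encoding="utf-8") as f:
        return f.read()


# Only prints the command; nothing is executed here.
print(build_ocr_command("test.png", "output_1", lang="eng"))
# -> ['tesseract', 'test.png', 'output_1', '-l', 'eng']
```

Calling run_ocr("test.png", "output_1", lang="eng") would then run the command and return the recognized text, provided tesseract is on PATH and test.png exists.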

Training

Tesseract is powerful out of the box, but it is not always accurate enough. So is there any way to improve its character recognition accuracy? Next, we will use the companion training tool jTessBoxEditor to train on our own samples and improve the accuracy!

1. Merge sample files
Open jTessBoxEditor, Tools-> Merge TIFF, select all the sample files, and save the merged file as num.font.exp0.tif

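The file name follows the pattern [lang].[fontname].exp[num].tif, where lang is the language, fontname is the font, and num is a custom number. In this example, the language name is num, the font name is font, and the number is 0.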

2. Generate BOX file
Open a command line, change to the directory where num.font.exp0.tif is located, and enter the following command to generate a file named num.font.exp0.box:

tesseract --psm 7 num.font.exp0.tif num.font.exp0 batch.nochop makebox

Syntax:

tesseract  [lang].[fontname].exp[num].tif  [lang].[fontname].exp[num] batch.nochop makebox

lang is the language name, fontname is the font name, and num is the serial number. Tesseract's training tools rely on this naming format, so follow it exactly.
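Because a misnamed file is an easy mistake to make, it can help to validate the name before running the command. The sketch below checks a file name against the [lang].[fontname].exp[num].tif pattern and builds the matching makebox command. The regular expression and the names parse_training_name and makebox_command are my own illustration, not part of Tesseract.

```python
import re

# [lang].[fontname].exp[num].tif, with no dots inside lang or fontname.
TRAINING_NAME = re.compile(
    r"^(?P<lang>[^.]+)\.(?P<fontname>[^.]+)\.exp(?P<num>\d+)\.tif$"
)


def parse_training_name(filename):
    """Split a training image name into (lang, fontname, num)."""
    match = TRAINING_NAME.match(filename)
    if match is None:
        raise ValueError(
            f"{filename!r} does not look like [lang].[fontname].exp[num].tif"
        )
    return match.group("lang"), match.group("fontname"), int(match.group("num"))


def makebox_command(filename, psm=7):
    """Build the makebox command shown above for a validated file name."""
    parse_training_name(filename)          # raises ValueError on a bad name
    outputbase = filename[: -len(".tif")]  # the same name without the suffix
    return ["tesseract", "--psm", str(psm), filename, outputbase,
            "batch.nochop", "makebox"]


print(parse_training_name("num.font.exp0.tif"))   # -> ('num', 'font', 0)
print(" ".join(makebox_command("num.font.exp0.tif")))
```

The second print reproduces the command from this step, so the helper can be reused for other language and font names.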

3. Character correction
Open jTessBoxEditor, BOX Editor-> Open, open num.font.exp0.tif, then edit all the bounding boxes if needed.
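Each line of the .box file that jTessBoxEditor edits has the form "symbol left bottom right top page", with coordinates measured from the bottom-left corner of the image. As a rough aid for spotting boxes that need correction, the following sketch parses such lines and flags boxes with no area. Box, parse_box_line and suspicious_boxes are hypothetical names, and the two sample lines are invented for the example.

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line):
    """Parse one 'symbol left bottom right top page' line of a .box file."""
    # Split from the right so the symbol is whatever precedes the 5 numbers.
    parts = line.rstrip("\r\n").rsplit(" ", 5)
    if len(parts) != 6:
        raise ValueError(f"malformed box line: {line!r}")
    symbol, *numbers = parts
    return Box(symbol, *(int(n) for n in numbers))


def suspicious_boxes(boxes):
    """Return boxes whose width or height is zero or negative."""
    return [b for b in boxes if b.right <= b.left or b.top <= b.bottom]


# Invented sample data: the second box has zero width.
sample = ["1 10 20 30 60 0", "7 40 20 40 60 0"]
boxes = [parse_box_line(line) for line in sample]
for box in suspicious_boxes(boxes):
    print("check this box:", box)
```

This only catches degenerate boxes; wrong characters still have to be corrected by eye in the BOX Editor.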

4. Define character profile
Create a text file named font_properties in the target folder with the following content:

font 0 0 0 0 0

You can also create the file from the command line by entering:

echo font 0 0 0 0 0 >font_properties

The font_properties file must be encoded as UTF-8 for the following steps to work.

This PowerShell command converts a .txt file to UTF-8:

Get-Content .\test.txt | Set-Content -Encoding utf8 test-utf8.txt
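Each line of font_properties has the form "fontname italic bold fixed serif fraktur", where every flag is 0 or 1 and the font name matches the fontname part of the training file name. As an alternative to the echo and PowerShell commands, the sketch below writes the file from Python, which encodes it as UTF-8 without a byte order mark. The function names are hypothetical.

```python
from pathlib import Path


def font_properties_line(fontname, italic=False, bold=False,
                         fixed=False, serif=False, fraktur=False):
    """Format one entry: <fontname> <italic> <bold> <fixed> <serif> <fraktur>."""
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([fontname] + [str(int(flag)) for flag in flags])


def write_font_properties(folder, lines):
    """Write the entries to <folder>/font_properties as UTF-8 and return the path."""
    path = Path(folder) / "font_properties"
    # encoding="utf-8" writes no BOM; newline="\n" avoids \r\n translation.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for line in lines:
            f.write(line + "\n")
    return path


print(font_properties_line("font"))   # -> font 0 0 0 0 0
```

Calling write_font_properties(".", [font_properties_line("font")]) in the training folder creates the same file as the echo command.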

5. The final step.

Save the following commands as a batch (.bat) file in the same folder and run it. This will generate a traineddata file.

echo Run Tesseract for Training..
tesseract.exe num.font.exp0.tif num.font.exp0 nobatch box.train

echo Compute the Character Set..
unicharset_extractor.exe num.font.exp0.box
shapeclustering -F font_properties -U unicharset -O num.unicharset num.font.exp0.tr
mftraining -F font_properties -U unicharset -O num.unicharset num.font.exp0.tr

echo Clustering..
cntraining.exe num.font.exp0.tr

echo Rename Files..
rename normproto num.normproto
rename inttemp num.inttemp
rename pffmtable num.pffmtable
rename shapetable num.shapetable

echo Create Tessdata..
combine_tessdata.exe num.
echo. & pause
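The same sequence of steps can also be sketched in Python, which makes it easier to reuse for other language and font names. This is not a replacement for the batch file, only a parameterized restatement of it: training_commands, renamed_files and run_training are hypothetical names, and the sketch assumes all the training tools are on PATH.

```python
import os
import subprocess


def training_commands(lang, fontname, num=0):
    """Return the training commands from the batch file as argument lists."""
    base = f"{lang}.{fontname}.exp{num}"
    tr = base + ".tr"
    unicharset_args = ["-F", "font_properties", "-U", "unicharset",
                       "-O", f"{lang}.unicharset", tr]
    return [
        ["tesseract", base + ".tif", base, "nobatch", "box.train"],
        ["unicharset_extractor", base + ".box"],
        ["shapeclustering"] + unicharset_args,
        ["mftraining"] + unicharset_args,
        ["cntraining", tr],
    ]


def renamed_files(lang):
    """Map each intermediate file to the lang-prefixed name combine_tessdata expects."""
    names = ("normproto", "inttemp", "pffmtable", "shapetable")
    return {name: f"{lang}.{name}" for name in names}


def run_training(lang, fontname, num=0):
    """Run every step in order; stops at the first command that fails."""
    for cmd in training_commands(lang, fontname, num):
        subprocess.run(cmd, check=True)
    for old, new in renamed_files(lang).items():
        os.replace(old, new)
    subprocess.run(["combine_tessdata", lang + "."], check=True)


# Only prints the commands; call run_training("num", "font") to execute them.
for cmd in training_commands("num", "font"):
    print(" ".join(cmd))
```

Using check=True means a failing step raises an exception instead of silently continuing, which the batch file does not do.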

Written by Gaopeng Bai
