Enabling Tesseract For Ghostscript 9.53 and later

Starting with release 9.53, Ghostscript gained preliminary support for OCR devices, using the open-source Tesseract and Leptonica libraries.

As of version 9.54, the Tesseract and Leptonica sources are included in the Ghostscript release archive. The supplied release binaries contain the OCR devices, but no traineddata files. Version 9.53 shipped without Tesseract or Leptonica in the release. If you wish to enable OCR support in 9.53, you will need to build your own version of Ghostscript with this support included.

We encourage people to use the 9.54 release rather than the 9.53 release unless they have a very good reason to stay on 9.53.

This page gives step-by-step instructions, both for building Ghostscript with the OCR devices enabled and for actually using them.

Building on any platform

Step 0 – (Version 9.53 only) Update the Ghostscript source

The code as shipped in 9.53.3 was found to have minor problems on some systems. As we identified and fixed such problems, we kept an updated branch in git, called ghostpdl-9.53.x-ocr-fixes.

A snapshot of this can be found here.

Corresponding Tesseract and Leptonica archives can be found here and here respectively.

Step 1 – Ensure you have the Tesseract Source

If you are using the 9.54 release archives you already have the Tesseract source. Please skip to the next section.

If you are building 9.53 or from a git checkout of Ghostscript, then you will need to import a copy of Tesseract into your source tree.

To make Ghostscript work as efficiently as possible with Tesseract, we have made some modifications to Tesseract. By and large, these modifications have been passed back upstream, so Ghostscript should now work with a standard, unmodified version of Tesseract. As this has not always been the case (and because it takes time to pass new changes back upstream), we suggest that people start with our own version of Tesseract, which is guaranteed to include all our changes and not to have been broken by upstream changes.

Our version of the Tesseract source is kept on the 'artifex' branch in the following git repository:

https://git.ghostscript.com/?p=thirdparty-tesseract.git;a=shortlog;h=refs/heads/artifex

The artifex branch is updated over time to track improvements. For the 9.53.3 release, you want to use the artifex-9.53.3 tag. For the 9.54 release, you want to use the artifex-9.54 tag.

For the Ghostscript 9.53 release, you can download a snapshot of this source here.

If our server is overloaded, downloads from that location may fail. In that case, use the versions linked to above instead.

Download the snapshot and unpack it into a directory called 'tesseract' within the ghostpdl sources.
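As a rough sketch of the unpacking step, a small helper might look like the following. It assumes the snapshot is a tar archive with a single top-level directory (an assumption; the actual archive layout may differ), and the function name unpack_into is hypothetical.

```shell
# unpack_into ARCHIVE DESTDIR   (hypothetical helper)
# Extracts a tar archive into DESTDIR, dropping the archive's single
# top-level directory so the sources sit directly inside DESTDIR.
# Assumes a tar that supports --strip-components (GNU tar, bsdtar).
unpack_into() {
    archive=$1
    dest=$2
    mkdir -p "$dest" || return 1
    tar -xf "$archive" -C "$dest" --strip-components=1
}
```

Run from the top of the ghostpdl sources, a call such as unpack_into tesseract-snapshot.tar.gz tesseract would populate the 'tesseract' directory; the archive filename here is only a placeholder.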

Step 2 – Fetch the Leptonica Source

If you are using the 9.54 release archives you already have the Leptonica source. Please skip to the next section.

If you are building 9.53, or from a git checkout of Ghostscript, then you will need to import a copy of Leptonica into your source tree.

To make Ghostscript work as efficiently as possible with Leptonica, we have made some modifications to Leptonica. By and large, these modifications have been passed back upstream, so Ghostscript should now work with a standard, unmodified version of Leptonica. As this has not always been the case (and because it takes time to pass new changes back upstream), we suggest that people start with our own version of Leptonica, which is guaranteed to include all our changes and not to have been broken by upstream changes.

Our version of the Leptonica source is kept on the 'artifex' branch in the following git repository:

https://git.ghostscript.com/?p=thirdparty-leptonica.git;a=shortlog;h=refs/heads/artifex

The artifex branch is updated over time to track improvements. For the 9.53.3 release, you want to use the artifex-9.53.3 tag. For the 9.54 release, you want to use the artifex-9.54 tag.

For the Ghostscript 9.53 release, you can download a snapshot of this source here.

If our server is overloaded, downloads from that location may fail. In that case, use the versions linked to above instead.

Download the snapshot and unpack it into a directory called 'leptonica' within the ghostpdl sources.
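Before moving on, it may be worth confirming that both library directories are in place. The sketch below is one way to do that; the function name check_ocr_sources is hypothetical, and it only checks that the directories exist and are non-empty, not that their contents are the right version.

```shell
# check_ocr_sources GHOSTPDL_ROOT   (hypothetical helper)
# Reports whether the 'tesseract' and 'leptonica' directories exist
# and are non-empty inside the given ghostpdl source tree.
# Returns non-zero if either one is missing or empty.
check_ocr_sources() {
    root=$1
    status=0
    for lib in tesseract leptonica; do
        if [ -d "$root/$lib" ] && [ -n "$(ls -A "$root/$lib" 2>/dev/null)" ]; then
            echo "found: $root/$lib"
        else
            echo "missing or empty: $root/$lib" >&2
            status=1
        fi
    done
    return $status
}
```

For example, check_ocr_sources . run from the top of the ghostpdl sources prints one line per library.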

Step 3 – Fetch 'traineddata'

Tesseract relies on encapsulated knowledge so it can recognise particular languages and/or scripts. This knowledge comes in the form of 'traineddata' files. In order for Tesseract to work, it must have access to the appropriate 'traineddata' file for the selected language(s).

To complicate matters further, Tesseract can be built with different engines. These engines work in different ways, and hence need different information in the 'traineddata' file. It is therefore important to match your traineddata file to the build of Tesseract that you are using. Currently, by default, Ghostscript uses the 'LSTM' engine (also known as the 'modern' engine). The alternative is the 'legacy' engine. You can select the engine with the -dOCREngine= flag when you call Ghostscript. Details can be found in the Ghostscript documentation, and we will not cover this further here.

Traineddata files are created by training Tesseract on a range of inputs. This is an involved and painstaking process that we will not cover here.

Fortunately, various sources on the net offer ready-made traineddata files.

By default, the Ghostscript OCR devices have OCRLanguage set to 'eng', so the system needs 'eng.traineddata' in order to run.

Now, you have a choice. You can either build your traineddata file(s) into the Ghostscript executable, or you can make them available on disc. Executables supplied by us do not have any data built into them.

To build them into the executable, create a 'Tesseract' directory (note the capitalisation!) within the 'Resource' directory on disc and store your traineddata file(s) there before rebuilding.
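A minimal sketch of that step follows. The function name embed_traineddata is hypothetical, and it assumes the 'Resource' directory sits at the top of the ghostpdl source tree you pass in.

```shell
# embed_traineddata GHOSTPDL_ROOT FILE...   (hypothetical helper)
# Copies traineddata files into Resource/Tesseract (capital 'T')
# inside the given source tree, creating the directory if needed.
embed_traineddata() {
    root=$1
    shift
    mkdir -p "$root/Resource/Tesseract" || return 1
    for f in "$@"; do
        cp "$f" "$root/Resource/Tesseract/" || return 1
    done
}
```

For example, embed_traineddata . /path/to/eng.traineddata stages the English data; the source path is a placeholder.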

If you would rather make them available on disc, you can either put them into the current directory when Ghostscript is run, or set the environment variable 'TESSDATA_PREFIX' to point to the directory in which they live.

With the 9.53.3 release source, in order to allow Tesseract language data to be read from TESSDATA_PREFIX, you also need to tell Ghostscript to permit file reading from this location. For example:

export TESSDATA_PREFIX=/my/tesseract/data/
gs --permit-file-read=/my/tesseract/data/ -sDEVICE=...

Note the trailing '/' on the paths. With the code from ghostpdl-9.53.3-ocr-fixes and for release 9.54 (and above), the need for --permit-file-read has been lifted.
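To check in advance whether an on-disc traineddata file can be found in either of the locations described above, something like the following sketch could be used. The function name find_traineddata is hypothetical, and checking the current directory before TESSDATA_PREFIX is an assumption, not documented Ghostscript behaviour. Data built into the executable cannot be detected this way.

```shell
# find_traineddata [LANG]   (hypothetical helper)
# Prints the path of LANG.traineddata (default: eng) if it exists in
# the current directory or in the directory named by TESSDATA_PREFIX.
# The lookup order used here is an assumption made for this sketch.
find_traineddata() {
    lang=${1:-eng}
    file="$lang.traineddata"
    if [ -f "./$file" ]; then
        echo "./$file"
    elif [ -n "${TESSDATA_PREFIX:-}" ] && [ -f "${TESSDATA_PREFIX%/}/$file" ]; then
        # ${VAR%/} strips one trailing '/' so the joined path is clean
        echo "${TESSDATA_PREFIX%/}/$file"
    else
        echo "$file not found" >&2
        return 1
    fi
}
```

Calling find_traineddata with no argument looks for eng.traineddata, matching the default OCRLanguage.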

Step 4 – Rebuild Ghostscript

Do a full rebuild of Ghostscript.

On Windows, use the 'Rebuild' option from the MSVC solution.

On Unix, rerun the configure step if working from a release (or rerun autogen.sh if working from git). Then run make as usual.
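The Unix steps can be sketched as a small shell function. The name rebuild_gs is hypothetical, and using the presence of a '.git' directory to tell a git checkout from a release archive is an assumption of this sketch.

```shell
# rebuild_gs GHOSTPDL_ROOT   (hypothetical helper)
# Regenerates the build configuration, then runs make.
# Assumption: a '.git' directory marks a git checkout; anything else
# is treated as an unpacked release archive.
rebuild_gs() {
    (
        # Subshell, so the caller's working directory is unchanged.
        cd "$1" || exit 1
        if [ -d .git ]; then
            ./autogen.sh || exit 1
        else
            ./configure || exit 1
        fi
        make
    )
}
```

From the top of the ghostpdl sources, rebuild_gs . performs the whole sequence and returns non-zero if any stage fails.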

This should leave you with a working copy of Ghostscript that supports Tesseract.
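One quick way to confirm that an OCR device was compiled in is to look for it in the device list that Ghostscript prints with -h. The sketch below does this for a given binary; the function name has_device is hypothetical, and a simple whole-word match on the help output is assumed to be good enough.

```shell
# has_device GS_BINARY DEVICE   (hypothetical helper)
# Succeeds if DEVICE appears as a whole word in the output of
# 'GS_BINARY -h', which includes the list of available devices.
has_device() {
    "$1" -h 2>/dev/null | grep -qw -- "$2"
}
```

For example, has_device bin/gs pdfocr8 && echo "OCR device present" checks for the device used in the test run.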

Step 5 – Run a Test

On Windows, run:

bin/gswin32c.exe -sDEVICE=pdfocr8 -o out.pdf -r600 -dDownScaleFactor=3 zlib/zlib.3.pdf

On Unix, run:

bin/gs -sDEVICE=pdfocr8 -o out.pdf -r600 -dDownScaleFactor=3 zlib/zlib.3.pdf

You should then get an out.pdf containing the contents of zlib/zlib.3.pdf, rendered and OCRed.

Give us your feedback!

Please let us know how this works for you. The future of these devices will depend upon what feedback we get. Please let us know what they do well for you, what they do badly, what they don't do, but really should, etc. Feedback can be sent to support@artifex.com.