[gs-devel] Converting PDF to a single JPG

Ken Sharp ken.sharp at artifex.com
Mon Oct 15 08:23:40 PDT 2007


At 10:12 15/10/2007 -0400, Barry wrote:
>I'm a new Ghostscript user and I seem to be getting errors no matter what
>I do with Ghostscript 8.60 for Windows 32 on XP. In the console that comes
>with Ghostscript, I can't even enter -h without getting "error: /undefined
>in -h."

Hmm, the console is a *PostScript* interpreter which allows you to type 
PostScript directly and have it interpreted. gswin32 allows you to view the 
result of rendering a PostScript or PDF file directly on screen (think 
Acrobat Reader).

If you want help then you could try using the command line version 
(gswin32c) which understands command line switches. This sounds more like 
what you want to use anyway.

Use of Ghostscript is documented in the HTML files, which (assuming you 
have a copy of everything) should be in /gs/doc. I would suggest 'use.htm' 
to start with.

Feel free to ask questions here if anything isn't clear.


>My goal is to convert a multipage pdf file, which consists of images of
>text, into a jpg file so my OCR software can work with it. If I could get
>Ghostscript to just combine all of the PDF pages into one that I can copy
>to the clipboard, that should work too because I could paste it into other
>software for the conversion to jpg.

Uh, I'm not sure exactly what you want here. If you want all the pages in a 
PDF file rendered to a single output (bitmap) file, then I'm not sure this 
is totally possible. In the general case a PDF file can have a huge number 
of pages (I've seen 80,000 before now).

If you want a single bitmap per page, then that's entirely possible. 
Starting from use.htm, select the 'one page per file' link. This shows how 
to set up automatic numbering. I won't go into detail here since it's in the 
docs.
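
As a rough illustration of the automatic numbering (a sketch in Python, not 
Ghostscript itself): a placeholder such as %03d in the output file name is a 
printf-style format that gets replaced by the page number, zero-padded to 
three digits. The function name and the 'page_%03d.tif' template are made up 
for this example, and the sketch assumes pages are counted from 1.

```python
def numbered_output_names(template, page_count):
    """Return the file names a printf-style template expands to.

    Illustrative helper only (not part of Ghostscript). Assumes pages
    are counted from 1 and that 'template' contains one %d-style
    placeholder, e.g. 'page_%03d.tif'.
    """
    return [template % page for page in range(1, page_count + 1)]


names = numbered_output_names("page_%03d.tif", 3)
print(names)
```

With a three-page input this lists page_001.tif, page_002.tif and 
page_003.tif.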


Now, I'm going to assume you don't really want a JPEG file if you intend to 
run this through OCR. JPEG is a lossy format, and the artefacts it 
introduces will degrade the performance of your OCR package. This will be 
especially true with (for example) script fonts.

I would suggest using TIFF instead; the Ghostscript TIFF devices mentioned 
below write it without lossy compression.

A possible command line would then be:

gswin32c -sOutputFile=OCR_%03d.tif -sDEVICE=tiff32nc -dBATCH -dNOPAUSE myfile.pdf

This will produce files called OCR_001.tif, OCR_002.tif etc (the page 
numbering starts at 1). The output file format will be TIFF, CMYK, 8 bits 
per channel. (myfile.pdf is, obviously, your PDF file.)

You may decide you would prefer to ignore colour, in which case 
-sDEVICE=tiffgray will produce gray scale output. You can find a list of 
devices in /gs/doc/devices.htm, in case you would prefer to use a different 
file format.

NB you may well want to change the resolution as well, using the -r switch 
(for example -r300 for 300 dpi).
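
If you end up scripting this, the pieces above could be put together along 
the following lines. This is only a sketch in Python: the function names are 
invented for the example, and the defaults (tiffgray, 300 dpi, an executable 
called gswin32c on the PATH) are assumptions to adjust for your own setup. 
The switches themselves are the ones discussed above plus -r for resolution.

```python
import subprocess


def build_gs_command(pdf_path, output_template, device="tiffgray",
                     resolution=300, executable="gswin32c"):
    """Build the argument list for a one-file-per-page Ghostscript run.

    Hypothetical helper: the defaults are assumptions, not
    recommendations from the Ghostscript documentation.
    """
    return [
        executable,
        "-sOutputFile=" + output_template,  # e.g. OCR_%03d.tif
        "-sDEVICE=" + device,               # e.g. tiffgray or tiff32nc
        "-r" + str(resolution),             # resolution in dpi
        "-dBATCH",
        "-dNOPAUSE",
        pdf_path,
    ]


def convert(pdf_path, output_template, **options):
    """Run Ghostscript; raises CalledProcessError on a non-zero exit."""
    command = build_gs_command(pdf_path, output_template, **options)
    subprocess.run(command, check=True)


command = build_gs_command("myfile.pdf", "OCR_%03d.tif")
print(" ".join(command))
```

Passing the arguments as a list means no shell is involved, so the % in the 
output file name reaches Ghostscript untouched. (In a Windows batch file you 
would need to write it as %%03d instead.)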


                             Ken

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
"The force can have a strong influence on the weak minded"
Obi-wan Kenobi, Star Wars
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -



