[gs-devel] Converting PDF to a single JPG
Ken Sharp
ken.sharp at artifex.com
Mon Oct 15 08:23:40 PDT 2007
At 10:12 15/10/2007 -0400, Barry wrote:
>I'm a new Ghostscript user and I seem to be getting errors no matter what
>I do with Ghostscript 8.60 for Windows 32 on XP. In the console that comes
>with Ghostscript, I can't even enter -h without getting "error: /undefined
>in -h."
Hmm, the console is a *PostScript* interpreter which allows you to type
PostScript directly and have it interpreted. gswin32 allows you to view the
result of rendering a PostScript or PDF file directly on screen (think
Acrobat Reader).
If you want help then you could try using the command line version
(gswin32c) which understands command line switches. This sounds more like
what you want to use anyway.
Use of Ghostscript is documented in the HTML files, which (assuming you
have a copy of everything) should be in /gs/doc. I would suggest 'use.htm'
to start with.
Feel free to ask questions here if anything isn't clear.
>My goal is to convert a multipage pdf file, which consists of images of
>text, into a jpg file so my OCR software can work with it. If I could get
>Ghostscript to just combine all of the PDF pages into one that I can copy
>to the clipboard, that should work too because I could paste it into other
>software for the conversion to jpg.
Uh, I'm not sure exactly what you want here. If you want all the pages in a
PDF file rendered to a single output (bitmap) file, then I'm not sure this
is totally possible. In the general case a PDF file can have a huge number
of pages (I've seen 80,000 before now).
If you want a single bitmap per page, then that's entirely possible.
Starting from use.htm, select the 'one page per file' link. This shows how
to set up automatic numbering. I won't go into detail here since it's in the
docs.
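
As a rough illustration of the automatic numbering: the %d part of the output file name is a printf-style field that Ghostscript fills in with the page number, counting from 1. The Python sketch below only imitates that expansion so you can see which file names to expect. The function name output_names is made up for this example and is not part of Ghostscript.

```python
def output_names(template: str, page_count: int) -> list[str]:
    """Expand a printf-style output template for pages 1..page_count.

    Hypothetical helper: it mimics how the page number is substituted
    into an -sOutputFile template; it does not call Ghostscript.
    """
    # Pages are counted from 1, so the first file is ...001, not ...000.
    return [template % page for page in range(1, page_count + 1)]


print(output_names("OCR_%03d.tif", 3))
```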
Now, I'm going to assume you don't really want a JPEG file if you intend to
run this through OCR. JPEG is a lossy format, and the artefacts it introduces
will degrade the performance of your OCR package. This will be especially
true with (for example) script fonts.
I would suggest using TIFF instead, which is a lossless format.
A possible command line would then be:
gswin32c -sOutputFile=OCR_%03d.tif -sDEVICE=tiff32nc -dBATCH -dNOPAUSE myfile.pdf
This will produce files called OCR_001.tif, OCR_002.tif etc. (page numbering
starts at 1). The output file format will be TIFF, CMYK, 8 bits per channel.
(myfile.pdf is, obviously, your PDF file.)
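
If you would rather drive this from a script than type it each time, something like the following Python sketch might do. It uses only the switches from the command line above. The function names build_gs_command and convert are invented for this example, and it assumes gswin32c can be found on your PATH.

```python
import subprocess


def build_gs_command(pdf_path, template="OCR_%03d.tif",
                     device="tiff32nc", exe="gswin32c"):
    """Return the argument list for one-file-per-page conversion.

    Hypothetical helper; the switches are the ones from the command
    line above, nothing else is added.
    """
    return [
        exe,
        "-sOutputFile=" + template,  # %03d is replaced by the page number
        "-sDEVICE=" + device,        # tiff32nc: TIFF, CMYK, 8 bits per channel
        "-dBATCH",                   # exit when the input has been processed
        "-dNOPAUSE",                 # do not wait for a keypress between pages
        pdf_path,
    ]


def convert(pdf_path):
    """Run Ghostscript; assumes the executable is on PATH."""
    subprocess.run(build_gs_command(pdf_path), check=True)


print(" ".join(build_gs_command("myfile.pdf")))
```

Passing the arguments as a list avoids shell quoting problems. If you put the plain command in a .bat file instead, the percent sign has to be doubled (%%03d).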
You may decide you would prefer to ignore colour, in which case
-sDEVICE=tiffgray will produce gray scale output. You can find a list of
devices in /gs/doc/devices.htm, in case you would prefer to use a different
file format.
NB you may well want to change the resolution as well, using the -r switch
(for example -r300 for 300 dpi).
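
To get a feel for what the resolution does to the output, here is a small sketch. PDF page sizes are measured in points (72 per inch), so the bitmap size in pixels is the page size times dpi / 72. The 300 dpi figure and the US Letter page size (612 x 792 points) are assumptions chosen for illustration, and the function names are invented here. -r is Ghostscript's resolution switch.

```python
def pixel_size(width_pt, height_pt, dpi):
    """Bitmap size in pixels for a page given in points (1 pt = 1/72 inch)."""
    return (round(width_pt * dpi / 72), round(height_pt * dpi / 72))


def resolution_switch(dpi):
    """Format the -r switch for a given resolution (hypothetical helper)."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return "-r%d" % dpi


# US Letter page (assumed example): 612 x 792 points
print(pixel_size(612, 792, 72))
print(pixel_size(612, 792, 300))
print(resolution_switch(300))
```

The switch goes on the command line alongside -sDEVICE. Higher values give the OCR package more pixels per character at the cost of larger files.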
Ken
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
"The force can have a strong influence on the weak minded"
Obi-wan Kenobi, Star Wars
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -