[gs-devel] Extracting PDF metadata and exploding pages
SaGS
sags5495 at hotmail.com
Wed May 2 18:58:37 UTC 2012
----- Original Message -----
From: "Scott Gifford" <sgifford at suspectclass.com>
To: <gs-devel at ghostscript.com>
Sent: Tuesday, 1 May 2012 06:43
Subject: [gs-devel] Extracting PDF metadata and exploding pages
> ...
> First, it uses poppler's pdfinfo to extract metadata from the PDF, like
> this:
>
> Title: t10_4C
> Creator: Adobe Illustrator CS4
> Producer: Adobe PDF library 9.00
> CreationDate: Fri Dec 16 18:26:22 2011
> ModDate: Fri Dec 16 18:26:22 2011
> Tagged: no
> Pages: 1
> Encrypted: no
> Page size: 270 x 162 pts
> File size: 955508 bytes
> Optimized: yes
> PDF version: 1.4
Try Ghostscript's toolbin\pdf_info.ps. It may even be more suitable, depending
on exactly what metadata you need. For example, 'Page size' above is vague:
different pages may have different sizes, and there are also different
'boxes' for each page (MediaBox, CropBox, and others). If some info you need
is not already provided, you can modify pdf_info.ps with only a little
PostScript programming.
Another tool to try is pdftk, see its dump_data command.
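
For the pdf_info.ps route, here is a minimal Python sketch of how the call
could be assembled from a script. It assumes the usage given in the comments at
the top of pdf_info.ps (input name passed as -sFile=...), which you should
check against your copy. The function name and the script path default are
made up for illustration, and newer Ghostscript releases may need extra
file-access switches.

```python
def pdf_info_command(pdf_path, script="toolbin/pdf_info.ps", gs="gs"):
    """Build an argv list that runs pdf_info.ps on one PDF (hypothetical helper)."""
    return [
        gs,
        "-dNODISPLAY",          # no output device is needed, only text on stdout
        "-q",                   # suppress the startup banner
        "-sFile=" + pdf_path,   # assumption: the script reads its input name from File
        script,                 # assumption: path to the script in your source tree
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works without gs.
    print(" ".join(pdf_info_command("input.pdf")))
```

Passing the list to subprocess.run(..., capture_output=True, text=True) would
give you the report as text to parse.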
>
>
> Next, it splits a multi-page PDF into many single-page PDFs, with "pdftk
> burst".
>
> After that it uses ghostscript to generate PNG thumbnails of each page.
From your description it doesn't seem you *need* those one-page PDFs.
Convert the original PDF to one-PNG-per-page in one go by using %d in
Ghostscript's output filename. The %d gets replaced with the page number. If
you prefer fixed width 0-padded numbers use something like %04d (yes, it's
just C printf() formatting).
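
A rough sketch of the %d approach, written as a Python helper that builds the
Ghostscript command line. The device (png16m), the 72 dpi resolution, the
file names and the function names are my assumptions for illustration; pick
whatever suits your thumbnails.

```python
def png_per_page_command(pdf_path, pattern="page-%04d.png", dpi=72, gs="gs"):
    """Build an argv list that renders every page of pdf_path to its own PNG."""
    return [
        gs,
        "-dBATCH",                  # exit after the last page
        "-dNOPAUSE",                # do not wait for a keypress between pages
        "-sDEVICE=png16m",          # 24-bit colour PNG (assumed choice of device)
        "-r%d" % dpi,               # resolution in dpi (assumed thumbnail value)
        "-sOutputFile=" + pattern,  # Ghostscript substitutes the page number for %04d
        pdf_path,
    ]


def page_filename(pattern, page_number):
    """Predict the file name for a page; the pattern is plain printf formatting."""
    return pattern % page_number


if __name__ == "__main__":
    print(" ".join(png_per_page_command("input.pdf")))
    print(page_filename("page-%04d.png", 7))
```

If you run the command through a shell or a Windows batch file rather than
passing an argument list, the % may need escaping.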
>
> The user then re-orders the pages in a Web UI using the thumbnails.
OK (that's your app).
> Finally, it puts them back together in a different order with ghostscript.
pdftk is more suitable for this task, out-of-the-box, see its cat command.
For example 'pdftk IN.PDF cat 3 1 10-5 2 4 11-end output OUT.PDF' shuffles
the first 10 pages and leaves the rest untouched.
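
To tie this to your Web UI, here is a small Python sketch. The first function
expands the simple range forms used in the example above (N, N-M, N-end) so
you can check what order they produce; the second builds a pdftk cat command
from an explicit list of page numbers. Both function names are hypothetical,
and the range handling covers only those simple forms, not everything pdftk
accepts.

```python
def expand_cat_ranges(ranges, total_pages):
    """Expand simple pdftk-style ranges ('3', '10-5', '11-end') to page numbers."""
    pages = []
    for item in ranges:
        first, sep, last = item.partition("-")
        start = total_pages if first == "end" else int(first)
        if not sep:
            stop = start                    # a single page such as '3'
        else:
            stop = total_pages if last == "end" else int(last)
        step = 1 if stop >= start else -1   # '10-5' counts downwards
        pages.extend(range(start, stop + step, step))
    return pages


def pdftk_cat_command(in_pdf, out_pdf, page_order, pdftk="pdftk"):
    """Build an argv list for 'pdftk IN cat <pages...> output OUT'."""
    return [pdftk, in_pdf, "cat"] + [str(p) for p in page_order] + ["output", out_pdf]


if __name__ == "__main__":
    # Assumes a 12-page input purely for the sake of the demonstration.
    order = expand_cat_ranges(["3", "1", "10-5", "2", "4", "11-end"], total_pages=12)
    print(order)
    print(" ".join(pdftk_cat_command("IN.PDF", "OUT.PDF", order)))
```

In your application the page order would come straight from the user's
rearranged thumbnails, so only the second function is really needed.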
The page reordering can be done using Ghostscript alone, without fully
interpreting the input file and generating a brand new output PDF, but this
requires [a lot?] more PostScript programming and knowledge about PDF
internals. Start with toolbin\pdfinflt.ps. This tool loads the input PDF
without interpreting it (= without translating it to a series of drawing
operations) then writes it out with the streams uncompressed. You can do
some surgery on the PDF Page tree between loading the input and writing the
output (and there's no need to suppress compressing the streams). A much
more complex example is lib\pdfopt.ps; this one loads a PDF and writes it
out linearised ('Web-optimised').
> ...
> I would really like to be able to load the PDF file into ghostscript one
> time, extract the data I need, then convert the pages one at a time to
> individual PDF files then to PNGs. Is it possible to drive ghostscript
> like this, having it do multiple operations on each page?
Not multiple operations on each page, but I think it's possible to get the
metadata and the one-PNG-per-page output in just one execution of Ghostscript.
I haven't tried it though; I can imagine ways of doing this, but none of
them is straightforward. In any case I don't think it's worth the trouble:
your only gain is that you start Ghostscript once instead of twice, and most
of the time is spent interpreting the PDF and generating the PNGs.
> ...