[gs-devel] Extracting PDF metadata and exploding pages

SaGS sags5495 at hotmail.com
Wed May 2 18:58:37 UTC 2012


----- Original Message ----- 
From: "Scott Gifford" <sgifford at suspectclass.com>
To: <gs-devel at ghostscript.com>
Sent: Tuesday, 1 May 2012 06:43
Subject: [gs-devel] Extracting PDF metadata and exploding pages


> ...
> First, it uses poppler's pdfinfo to extract metadata from the PDF, like
> this:
>
> Title:          t10_4C
> Creator:        Adobe Illustrator CS4
> Producer:       Adobe PDF library 9.00
> CreationDate:   Fri Dec 16 18:26:22 2011
> ModDate:        Fri Dec 16 18:26:22 2011
> Tagged:         no
> Pages:          1
> Encrypted:      no
> Page size:      270 x 162 pts
> File size:      955508 bytes
> Optimized:      yes
> PDF version:    1.4

Try Ghostscript's toolbin\pdf_info.ps. It may even be more suitable, depending 
on exactly what metadata you need. For example, 'Page size' above is vague: 
different pages may have different sizes, and each page also has several 
'boxes' (MediaBox, CropBox, and others). If some info you need 
is not already provided, you can modify pdf_info.ps with only a little 
PostScript programming.
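
A minimal sketch of driving pdf_info.ps from a script, in Python. It only
builds the command line. The -sFile= switch is what I recall from the usage
comment at the top of pdf_info.ps, so check that comment in your Ghostscript
version; the function names, the default script path and the "gs" executable
name (gswin32c on Windows) are assumptions of this example.

```python
import subprocess


def pdf_info_command(pdf_path, script="toolbin/pdf_info.ps", gs="gs"):
    """Build a Ghostscript command line that runs pdf_info.ps on one PDF."""
    return [
        gs,
        "-q",                  # quiet: no startup banner in the output
        "-dNODISPLAY",         # nothing is rendered, so no output device
        "-sFile=" + pdf_path,  # assumed: how pdf_info.ps is told the input
        script,
    ]


def run_pdf_info(pdf_path):
    """Run the command and return whatever the script printed (hypothetical helper)."""
    result = subprocess.run(
        pdf_info_command(pdf_path),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling run_pdf_info("in.pdf") needs Ghostscript on the PATH and the script
path adjusted to wherever toolbin lives on your system.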

Another tool to try is pdftk, see its dump_data command.
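
If you go the pdftk route, its output is easy to post-process. The sketch
below, in Python, assumes dump_data prints InfoKey:/InfoValue: pairs and a
NumberOfPages: line; the SAMPLE text is made up for illustration from the
pdfinfo listing quoted above, it is not real pdftk output, and
parse_dump_data is a hypothetical helper name.

```python
def parse_dump_data(text):
    """Parse pdftk dump_data style text into (info dict, page count)."""
    info = {}
    pages = None
    pending_key = None
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # a line without "name: value" form carries no data
        value = value.strip()
        if name == "InfoKey":
            pending_key = value
        elif name == "InfoValue" and pending_key is not None:
            info[pending_key] = value
            pending_key = None
        elif name == "NumberOfPages":
            pages = int(value)
    return info, pages


# Illustrative input only (assumed format).
SAMPLE = """InfoKey: Title
InfoValue: t10_4C
InfoKey: Creator
InfoValue: Adobe Illustrator CS4
NumberOfPages: 1
"""

info, pages = parse_dump_data(SAMPLE)
print(info.get("Title"), pages)
```

Splitting on the first colon only keeps values that themselves contain
colons (dates, for example) intact.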

>
>
> Next, it splits a multi-page PDF into many single-page PDFs, with "pdftk
> burst".
>
> After that it uses ghostscript to generate PNG thumbnails of each page.

From your description it doesn't seem you *need* those one-page PDFs. 
Convert the original PDF to one-PNG-per-page in one go by using %d in 
Ghostscript's output filename. The %d gets replaced with the page number. If 
you prefer fixed width 0-padded numbers use something like %04d (yes, it's 
just C printf() formatting).
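
A sketch of that single-pass conversion, in Python. The png16m device, the
72 dpi resolution, the file name pattern and the function names are choices
made for this example, not something your setup requires; only the %d
substitution in the output file name is the point.

```python
def thumbnail_command(pdf_path, pattern="page-%04d.png", dpi=72, gs="gs"):
    """Build a Ghostscript command that writes one PNG per page."""
    return [
        gs,
        "-q", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=png16m",          # 24-bit colour PNG (example choice)
        "-r%d" % dpi,               # low resolution is enough for thumbnails
        "-sOutputFile=" + pattern,  # %04d becomes the page number
        pdf_path,
    ]


def expected_name(pattern, page):
    """File name for a given page, using the same printf-style formatting."""
    return pattern % page


print(" ".join(thumbnail_command("in.pdf")))
print(expected_name("page-%04d.png", 7))
```

Passing the arguments as a list (for example to subprocess.run) avoids
shell quoting problems with the % character; in a Windows batch file the %
would need to be doubled.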

>
> The user then re-orders the pages in a Web UI using the thumbnails.

OK (that's your app).

> Finally, it puts them back together in a different order with ghostscript.

pdftk is more suitable for this task, out-of-the-box, see its cat command. 
For example 'pdftk IN.PDF cat 3 1 10-5 2 4 11-end output OUT.PDF' shuffles 
the first 10 pages and leaves the rest untouched.
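
To tie this to your Web UI, here is a sketch in Python that turns the
user's new page order into a pdftk cat command, plus a small checker for
page lists like the one above. The function names are hypothetical, and
expand() understands only plain numbers, N-M ranges and 'end', which is a
small subset of what pdftk accepts.

```python
def expand(spec, last_page):
    """Expand tokens such as '3', '10-5' or '11-end' into a list of pages."""
    pages = []
    for token in spec.split():
        first, sep, second = token.partition("-")
        start = last_page if first == "end" else int(first)
        if not sep:
            stop = start
        else:
            stop = last_page if second == "end" else int(second)
        step = 1 if stop >= start else -1  # '10-5' counts downwards
        pages.extend(range(start, stop + step, step))
    return pages


def cat_command(in_pdf, order, out_pdf, pdftk="pdftk"):
    """Build 'pdftk IN cat <pages...> output OUT' from a list of page numbers."""
    return [pdftk, in_pdf, "cat"] + [str(p) for p in order] + ["output", out_pdf]


order = expand("3 1 10-5 2 4 11-end", last_page=12)
# Every page must appear exactly once for a pure reordering.
assert sorted(order) == list(range(1, 13))
print(" ".join(cat_command("IN.PDF", order, "OUT.PDF")))
```

Checking that the list is a permutation of all pages before calling pdftk
catches pages that were dropped or duplicated in the UI.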

The page reordering can be done using Ghostscript alone, without fully 
interpreting the input file and generating a brand new output PDF, but this 
requires [a lot?] more PostScript programming and knowledge of PDF 
internals. Start with toolbin\pdfinflt.ps. This tool loads the input PDF 
without interpreting it (that is, without translating it into a series of 
drawing operations), then writes it out with the streams uncompressed. You 
can do some surgery on the PDF page tree between loading the input and 
writing the output (and for your purpose there's no need to suppress 
compressing the streams). A much more complex example is lib\pdfopt.ps, 
which loads a PDF and writes it out linearised ('Web-optimised').

> ...
> I would really like to be able to load the PDF file into ghostscript one
> time, extract the data I need, then convert the pages one at a time to
> individual PDF files then to PNGs.  Is it possible to drive ghostscript
> like this, having it do multiple operations on each page?

Not multiple operations on each page, but I think it's possible to get the 
metadata and the one-PNG-per-page output in a single execution of Ghostscript. 
I haven't tried it, though; I can imagine ways of doing this, but none of 
them is straightforward. In any case I don't think it's worth the trouble: 
your only gain is that you start Ghostscript once instead of twice. Most of 
the time is spent interpreting the PDF and generating the PNGs.

> ...


