[gs-devel] Extracting PDF metadata and exploding pages
ken.sharp at artifex.com
Wed May 2 14:57:24 UTC 2012
At 23:43 30/04/2012 -0400, Scott Gifford wrote:
>The user then re-orders the pages in a Web UI using the
>thumbnails. Finally, it puts them back together in a different order with
>I have not been able to find a reasonable way to extract the PDF metadata
The information is available, but you have to extract it yourself, which
would require a little PostScript programming.
> or to "burst" a multi-page PDF document into many one-page PDF documents
> (well, I could use -dFirstPage and -dLastPage for every page, but that
> requires many calls to gs for a big document and is much, much slower
> than pdftk).
Yes, that's because pdftk just extracts the streams, whereas Ghostscript
fully interprets the original into marking operations and then uses those
marking operations to build a brand new PDF file. No part of the original
content is reused. This is much slower than simply pulling bits out of a PDF
file. It's also (probably) a bad thing to do in your case, since information
is potentially lost while doing so.
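
For reference, the per-page approach from the question amounts to one gs
invocation per page. A minimal Python sketch of that loop follows; the helper
name burst_commands and the output file pattern are assumptions made for
illustration, and the function only builds the argument lists rather than
running anything.

```python
from typing import List


def burst_commands(pdf_path: str, page_count: int,
                   out_pattern: str = "page-{:04d}.pdf") -> List[List[str]]:
    """Build one gs argument list per page; nothing is executed here.

    burst_commands and out_pattern are hypothetical names for this sketch.
    """
    commands = []
    for page in range(1, page_count + 1):  # gs page numbers start at 1
        commands.append([
            "gs", "-q", "-dNOPAUSE", "-dBATCH",
            "-sDEVICE=pdfwrite",
            f"-dFirstPage={page}",   # restrict the run to a single page
            f"-dLastPage={page}",
            "-sOutputFile=" + out_pattern.format(page),
            pdf_path,
        ])
    return commands
```

Each list could be handed to subprocess.run. Every call starts a fresh
Ghostscript process and opens the input file again, which is where the cost
for large documents comes from.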
>I would really like to be able to load the PDF file into ghostscript one
>time, extract the data I need, then convert the pages one at a time to
>individual PDF files then to PNGs. Is it possible to drive ghostscript
>like this, having it do multiple operations on each page?
OK, you could do this with Ghostscript, but I suspect it is not the
correct tool for the job. I would suggest that you use MuPDF instead. It
also won't do what you want 'out of the box', but a little programming
would enable you to create such a tool relatively easily.
The document Info dictionary is available to you, and the metadata you want
is (mostly at least) present there. MuPDF will readily allow you to make
low resolution thumbnails of each page. It won't allow you to make separate
PDFs one per page, but it seems to me you don't really want to do that anyway.
Part of the internal structure of the PDF file is the 'Pages tree'; this is
a tree-like list of the pages in the document. In order to rearrange the
pages in a PDF file, all you really need to do is modify the pages tree. You
can leave all the rest of the structure intact (you may need to rebuild the
xref if you alter the number of bytes occupied by the pages tree).
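
To illustrate the idea only, here is a toy Python model of a pages tree. It
is not a PDF parser and not MuPDF code: the nodes are plain dicts, the "Name"
key is a made-up label for the example, and the function names are
assumptions. It flattens the tree to its leaf pages and builds a new Pages
node with the Kids in the requested order.

```python
def leaf_pages(node):
    """Return the leaf Page nodes of a toy page tree in document order."""
    if node["Type"] == "Page":
        return [node]
    pages = []
    for kid in node["Kids"]:  # intermediate Pages nodes hold a Kids list
        pages.extend(leaf_pages(kid))
    return pages


def reorder(root, new_order):
    """Build a flat Pages node whose Kids follow new_order (0-based)."""
    pages = leaf_pages(root)
    if sorted(new_order) != list(range(len(pages))):
        raise ValueError("new_order must be a permutation of all pages")
    kids = [pages[i] for i in new_order]
    return {"Type": "Pages", "Kids": kids, "Count": len(kids)}


# Hypothetical three-page document with one intermediate Pages node.
tree = {"Type": "Pages", "Count": 3, "Kids": [
    {"Type": "Pages", "Count": 2, "Kids": [
        {"Type": "Page", "Name": "A"},
        {"Type": "Page", "Name": "B"},
    ]},
    {"Type": "Page", "Name": "C"},
]}
new_root = reorder(tree, [2, 0, 1])
```

In a real file the page objects themselves would stay where they are and
only the Kids arrays would change. A real implementation would also have to
update each page's Parent entry and carry over any attributes (Resources,
MediaBox, Rotate and so on) that a page inherits from an intermediate Pages
node being flattened away.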
Currently MuPDF doesn't offer this as a capability, but I believe it would
be easy to add using the available API. I did have a brief discussion with
one of the MuPDF developers and we believe that this should be possible.