Changes to the PDF Interpreter

For the past few years, the Ghostscript development team has, amongst many other things, been working on a rewrite of the PDF interpreter from the ground up. This code is now feature complete and has been released in Ghostscript 9.55.0. However, it is not yet the default PDF interpreter and must be enabled explicitly. We encourage our customers and free users to try it out and send us feedback.

Read on as we attempt to answer what we expect to be frequently asked questions, and highlight the changes.

Why the Change?

The original PDF interpreter, as currently supplied with Ghostscript, is written in PostScript. When the original implementation was done this made good sense; the graphics models of PostScript and PDF were compatible and the PDF syntax is (or at least was) broadly similar to PostScript. Indeed, that original PDF interpreter has served us well for decades.

However, there are problems, mainly invisible to our users but nevertheless still present. PostScript has been described, with some justification, as a ‘write-only’ language and, being now an elderly language, it is a rare skill among developers, which makes it quite hard to recruit new engineers with PostScript programming skills. Not all of the Artifex development team are experienced PostScript programmers, and even for those of us skilled in the language, the PDF interpreter code is now so large and arcane that it is difficult to fully understand some aspects of the PostScript program which performs the PDF interpretation.

In addition, the PDF specification has continued to evolve, whereas the PostScript language has remained static. PDF has added features like transparency, which have no equivalent in PostScript, and the only way for us to support these has been to add special, often undocumented, PostScript extensions. These extensions have proven to be a security problem in the past and we would like to remove our PDF interpreter’s dependence on them.

It has also become increasingly evident that many PDF producers do not create PDF files that conform to the specification. Since there is no means to ‘verify’ that a PDF file conforms, creators fall back on using Adobe Acrobat, the de facto standard. If Acrobat will open the file then it must be OK! Sadly it turns out that Acrobat is really very tolerant of badly formed PDF files and will always attempt to open them. Often it silently repairs the file in the background; the first time an alert user would be aware of this is when Acrobat offers to ‘save changes’ to a file the user has not modified, and frequently Acrobat doesn’t even do that.

Because Acrobat will open these files, there is considerable pressure for Ghostscript to do so as well, though we do try to at least flag warnings to the user when something is found to be incorrect, giving the user a chance to intervene.

But Ghostscript’s PDF interpreter is, as noted, written in PostScript, and PostScript is not a great language for handling error conditions and recovering. In general, when something goes wrong in a PostScript program the expectation is that the PostScript interpreter will generate an error message and stop. It is possible to do better, but it is not trivial. As time has gone on, and we have encountered more and more PDF files with ever more unexpected deviations from the specification, it has become harder and harder to come up with new strategies to work around these faults without re-introducing previously fixed problems or failing to process compliant files. It is also true that many of these workarounds have led to decreased performance when processing all PDF files, not just the malformed ones.

Finally, because the PDF interpreter is written in PostScript, there is no way to divorce it from Ghostscript and its PostScript interpreter. This has performance implications (starting up a PostScript interpreter is quite a complex process) and imposes a resource overhead in that we need both the PostScript interpreter and a complex PostScript program before we even start to interpret the PDF file. Using the PostScript interpreter also exposes us to potential security issues due to the use of non-standard PostScript extensions. There is also the possibility of being forced to run PostScript XObjects (long since deprecated) in a PDF file, which potentially opens up some security problems as this program is run in the PDF environment which is less protected than regular PostScript.

What's New?

The new PDF interpreter is written entirely in C, but interfaces to the same underlying graphics library as the existing PostScript interpreter. So operations in PDF should render exactly the same as they always have (though this is affected slightly by differing numerical accuracy), and all the devices currently supported by the Ghostscript family, as well as any new ones in the future, should work seamlessly.

Because the interpreter no longer relies on PostScript, however, it can be divorced from it. It is now possible to create a stand-alone PDF interpreter, GhostPDF, and it is integrated as a separate module in the language-switching product GhostPDL.

This offers us some advantages in that the memory footprint is smaller, and the startup time of the stand-alone PDF interpreter is less than starting up the PostScript interpreter.

That said, we do recognise that people are used to being able to process PDF files through Ghostscript, and indeed over the years we have offered customers and free users a wide range of solutions which are based on the fact that the PDF interpreter is currently written in PostScript, and its behaviour can be controlled or influenced from the PostScript environment.

So one of the goals of this project was to enable the C PDF interpreter to be integrated into the PostScript environment in such a way that PostScript can be used to influence the graphics state of the PDF interpreter, and PostScript functionality like BeginPage and EndPage continues to function with it. And, returning to that initial point, Ghostscript today can process PDF files and our users will expect that ability to continue. We’ll set out some of the means for that below.

When Will the Changes Occur?

The code is now released, and the new executable 'gpdf' (gpdfwin32 or gpdfwin64 on Windows) can be used directly to process PDF files without using the PostScript interpreter. Ghostscript (the PostScript interpreter) can still process PDF files, but to use the new interpreter you must add -dNEWPDF to the command line.

The current plan is to change the priority of the interpreters in Ghostscript at the next (Spring 2022) release so that the C-based interpreter is the default, with the old code still present as a fallback in case of problems. In the release after that (Autumn 2022), we plan to remove the old PostScript-based interpreter completely. This schedule will depend on feedback from customers and users.

What Currently Doesn't Work?

We believe the current code is feature complete, and working well. A very small number of files are known not to work perfectly with specific devices, and we will be working to correct those, but equally a larger number of files that previously did not work, or did not process correctly, now do.

Using the New Code

If you are using Ghostscript, then adding -dNEWPDF to the command line will use the new PDF interpreter instead of the old one. This is functionally equivalent to -dNEWPDF=true, and you may prefer the explicit form because a future release will change the default, requiring the use of -dNEWPDF=false to return to the old interpreter. Explicitly setting NEWPDF to true or false makes it clearer what is required.
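
As a minimal sketch of naming the interpreter explicitly, the helper below builds a command line with either -dNEWPDF=true or -dNEWPDF=false. The function name gs_pdf_cmd and the file names in.pdf and out.pdf are placeholders chosen for illustration; only the gs switches themselves come from this article. The function prints the command rather than running it, so it can be checked before use.

```shell
#!/bin/sh
# gs_pdf_cmd <new|old> <input.pdf> <output.pdf>
# Hypothetical helper: prints a Ghostscript command line that states
# explicitly which PDF interpreter is wanted. It does not run gs.
gs_pdf_cmd() {
    case "$1" in
        new) flag="-dNEWPDF=true" ;;    # the C-based interpreter
        old) flag="-dNEWPDF=false" ;;   # the PostScript-based interpreter
        *)   echo "usage: gs_pdf_cmd new|old input.pdf output.pdf" >&2
             return 1 ;;
    esac
    echo "gs $flag -sDEVICE=pdfwrite -o $3 $2"
}

# in.pdf and out.pdf are placeholder names.
gs_pdf_cmd new in.pdf out.pdf
gs_pdf_cmd old in.pdf out.pdf
```

Because the switch is spelled out in both cases, a script written this way should keep selecting the same interpreter after the default changes.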

Command line switches should work in both cases the same as they do in Ghostscript right now. Please note that the gpdf executable does not permit you to use the pdfmark operator (or otherwise send arbitrary PostScript to the interpreter using the -c switch). The pdfmark operator is a PostScript operator and therefore requires you to use the PostScript interpreter.

Obviously, the gpdf interpreter will not execute PostScript XObjects embedded in PDF files, for the same reason.
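
One way to act on this restriction in a script is to pick the executable from the arguments: anything that passes PostScript with -c (a pdfmark, for instance) needs gs, while plain PDF processing can use gpdf. The sketch below assumes that rule and nothing more; choose_interpreter is a hypothetical helper name, the file names are placeholders, and on Windows the gpdf executable is named gpdfwin32 or gpdfwin64 instead.

```shell
#!/bin/sh
# choose_interpreter <args...>
# Hypothetical helper: prints "gs" if the argument list contains -c
# (PostScript supplied on the command line), otherwise prints "gpdf".
choose_interpreter() {
    for arg in "$@"; do
        if [ "$arg" = "-c" ]; then
            echo gs
            return 0
        fi
    done
    echo gpdf
}

# Plain PDF processing: no PostScript is involved.
choose_interpreter -sDEVICE=pdfwrite -o out.pdf in.pdf

# A pdfmark passed with -c requires the PostScript interpreter.
choose_interpreter -sDEVICE=pdfwrite -o out.pdf \
    -c "[ /Title (Example) /DOCINFO pdfmark" -f in.pdf
```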

Using the PDF Interpreter From PostScript

The new code has been integrated in the same way as the old PDF interpreter; if all you want to do is process a PDF file then simply putting the file on the Ghostscript command line is sufficient. Also, the definition of the PostScript ‘run’ operator works with the new PDF interpreter, so you can still use code such as ‘(/home/myfile.pdf) run’.

This is covered in https://ghostscript.com/doc/9.54.0/Language.htm#PDF_scripting

Finally, here is a simple example program to make use of the new PDF interpreter. This PostScript program opens two PDF files simultaneously, then reads pages from each in turn. Used with this command line:

gs -sDEVICE=pdfwrite -o new.pdf --permit-file-read=/temp/ test.ps

it will create a new PDF file containing all the pages from the input PDF files, interleaved. Notice the use of --permit-file-read to allow the PostScript program to open the input PDF files.

%!
% simple program to read two PDF files simultaneously
% and interleave them on output.
%

userdict begin
/MyDict 20 dict def
MyDict begin

% First things first, create two PDF contexts, we
% need one for each input PDF file
%
/File1_Context .PDFInit def
/File2_Context .PDFInit def

% Now open the files on disk, using each context
%
(/temp/test1.pdf) File1_Context .PDFFile
(/temp/test2.pdf) File2_Context .PDFFile

% Now loop around showing each page of the input from
% file 1, followed by the same page from file 2.
%
0 1 3 {
  dup                   % copy the loop counter
  File1_Context exch    % stack: counter context counter
  .PDFDrawPage          % draw page 'n'. Stack: counter
  showpage              % finish the page
  File2_Context exch    % stack: context counter
  .PDFDrawPage
  showpage              % finish the page
} for

% Close each of the PDF contexts, which will
% also close the PDF files
%
File1_Context .PDFClose
File2_Context .PDFClose

% and finally tidy up our dictionaries
end                    % MyDict
end                    % userdict

The two input PDF files were created using Ghostscript as well, from two simple PostScript programs. The command lines to create the example input PDF files are:

gs -sDEVICE=pdfwrite -o test1.pdf test1.ps
gs -sDEVICE=pdfwrite -o test2.pdf test2.ps

And the PostScript programs:

%!
% Test1.ps draw some simple content
%
1 0 0 setrgbcolor
180 500 moveto
/Helvetica findfont 100 scalefont setfont
(File 1) show
150 250 moveto
(Page 1) show
showpage
1 0 0 setrgbcolor
180 500 moveto
/Helvetica findfont 100 scalefont setfont
(File 1) show
0 1 0 setrgbcolor
150 250 moveto
(Page 2) show
showpage
1 0 0 setrgbcolor
180 500 moveto
/Helvetica findfont 100 scalefont setfont
(File 1) show
0 0 1 setrgbcolor
150 250 moveto
(Page 3) show
showpage
1 0 0 setrgbcolor
180 500 moveto
/Helvetica findfont 100 scalefont setfont
(File 1) show
0 0 0 setrgbcolor
150 250 moveto
(Page 4) show
showpage

%!
% Test2.ps draw some simple content
%
0 1 0 setrgbcolor
180 500 moveto
/Helvetica findfont 100 scalefont setfont
(File 2) show
1 0 0 setrgbcolor
150 250 moveto
(Page 1) show
showpage
0 1 0 setrgbcolor
180 500 moveto
/Helvetica findfont 100 scalefont setfont
(File 2) show
0 1 0 setrgbcolor
150 250 moveto
(Page 2) show
showpage
0 1 0 setrgbcolor
180 500 moveto
/Helvetica findfont 100 scalefont setfont
(File 2) show
0 0 1 setrgbcolor
150 250 moveto
(Page 3) show
showpage
0 1 0 setrgbcolor
180 500 moveto
/Helvetica findfont 100 scalefont setfont
(File 2) show
0 0 0 setrgbcolor
150 250 moveto
(Page 4) show
showpage
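
The steps in this section can be strung together in one script. The sketch below assumes that test1.ps, test2.ps and test.ps are saved in the current directory and that /temp/ is writable; it writes the two input PDF files directly into /temp/ because that is where test.ps looks for them. The run helper and the DRY_RUN variable are conveniences invented for this example, not Ghostscript features. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Dry-run by default so the commands can be inspected first.
DRY_RUN=${DRY_RUN:-1}

# run <command...>
# Hypothetical helper: print the command when DRY_RUN=1, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

interleave_example() {
    # 1. Build the two input PDF files where test.ps expects to find them.
    run gs -sDEVICE=pdfwrite -o /temp/test1.pdf test1.ps &&
    run gs -sDEVICE=pdfwrite -o /temp/test2.pdf test2.ps &&
    # 2. Interleave their pages, permitting reads from /temp/.
    run gs -sDEVICE=pdfwrite -o new.pdf --permit-file-read=/temp/ test.ps
}

interleave_example
```

Chaining the steps with && stops the script at the first failing command, so the interleaving step is not attempted if either input file could not be created.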

Moving Forward

We view this as an important and necessary advancement to the product and are confident that the end result will provide a more robust and flexible PDF rendering tool.

Feedback and bugs can be reported to our development team through our bugtracker at https://bugs.ghostscript.com or via our Discord.


Last revised: 17 September 2021