31.4 PDF Operator Processors

Graphical content within a PDF file is given as streams of PDF “operators”. These operators describe marking operations on a conceptual page. In order to display a PDF file the interpreter needs to run through these operators processing each in turn.

In addition, certain manipulations of PDF files (such as redaction, sanitisation and appending) are best done by operating directly on these operator streams. The alternative scheme, of first converting the operators to graphical objects and then resynthesising an operator stream from them, leads to problems with round-trip conversions and the potential loss of structure.

For this reason, the PDF interpreter within MuPDF is structured around an extensible class of pdf_processors. A pdf_processor is a set of functions, one for each operator. The interpreter runs through the operators and handles them by calling the appropriate functions.

By changing the pdf_processor in use, we can therefore change the effect of interpreting the page.
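The idea of "one function per operator" can be illustrated with a small, self-contained model. This is only a sketch: the toy_* types and functions below are hypothetical names invented for the illustration, they cover just four operators, and they are not MuPDF's actual pdf_processor definition. The same operator list is fed to two different processors, and each produces a different effect.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: these toy_* names are illustrative, not MuPDF types. */
typedef enum { TOY_OP_q, TOY_OP_Q, TOY_OP_re, TOY_OP_f } toy_opcode;

typedef struct {
    toy_opcode code;
    float args[4];   /* operands; only used by "re" here */
} toy_op;

/* A processor is a table of callbacks, one per operator, plus its state. */
typedef struct toy_processor toy_processor;
struct toy_processor {
    void (*op_q)(toy_processor *proc);
    void (*op_Q)(toy_processor *proc);
    void (*op_re)(toy_processor *proc, float x, float y, float w, float h);
    void (*op_f)(toy_processor *proc);
    int calls;      /* used by the counting processor */
    int depth;      /* used by the depth processor */
    int max_depth;  /* used by the depth processor */
};

/* The interpreter walks the operators and dispatches each to its callback.
   A NULL callback means the processor ignores that operator. */
void toy_interpret(toy_processor *proc, const toy_op *ops, int n)
{
    assert(proc != NULL && (ops != NULL || n == 0));
    for (int i = 0; i < n; i++) {
        const float *a = ops[i].args;
        switch (ops[i].code) {
        case TOY_OP_q:  if (proc->op_q)  proc->op_q(proc); break;
        case TOY_OP_Q:  if (proc->op_Q)  proc->op_Q(proc); break;
        case TOY_OP_re: if (proc->op_re) proc->op_re(proc, a[0], a[1], a[2], a[3]); break;
        case TOY_OP_f:  if (proc->op_f)  proc->op_f(proc); break;
        }
    }
}

/* Processor 1: counts every operator it is handed. */
void toy_count_plain(toy_processor *proc) { proc->calls++; }
void toy_count_re(toy_processor *proc, float x, float y, float w, float h)
{
    (void)x; (void)y; (void)w; (void)h;
    proc->calls++;
}

toy_processor toy_new_counting_processor(void)
{
    toy_processor p = { toy_count_plain, toy_count_plain, toy_count_re,
                        toy_count_plain, 0, 0, 0 };
    return p;
}

/* Processor 2: tracks q/Q nesting only, and ignores everything else. */
void toy_depth_q(toy_processor *proc)
{
    proc->depth++;
    if (proc->depth > proc->max_depth)
        proc->max_depth = proc->depth;
}
void toy_depth_Q(toy_processor *proc) { proc->depth--; }

toy_processor toy_new_depth_processor(void)
{
    toy_processor p = { toy_depth_q, toy_depth_Q, NULL, NULL, 0, 0, 0 };
    return p;
}
```

Calling toy_interpret with a counting processor tallies the operators, while calling it on the same list with a depth processor reports only the q/Q nesting. The interpreter loop itself is unchanged in both cases, which is the property the processor design is built around.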

MuPDF contains four different pdf_processor implementations, described below, though the system is deliberately open ended, and more can be supplied by any user of the library. Some can even be chained together in powerful ways.

31.4.1 Run processor

The first, and most commonly used, processor is the pdf_run_processor. This processor interprets the incoming operators and turns them into device calls (i.e. graphical objects rendered on a page).

When using the standard fz_run_page (and similar) function(s) this is the pdf_processor that is used automatically. It can still be useful to create these manually, especially when coupling them with a pdf_filter_processor (or similar).

Such processors can be created using:

/* 
   pdf_new_run_processor: Create a new "run" processor. This maps 
   from PDF operators to fz_device level calls. 
 
   dev: The device to which the resulting device calls are to be 
   sent. 
 
   ctm: The initial transformation matrix to use. 
 
   usage: A NULL terminated string that describes the usage of 
   this interpretation. Typically 'View', though 'Print' is also 
   defined within the PDF reference manual, and others are possible. 
 
   gstate: The initial graphics state. 
 
   nested: The nested depth of this interpreter. This should be 
   0 for an initial call, and will be incremented in nested calls 
   due to Type 3 fonts. 
*/ 
pdf_processor *pdf_new_run_processor(fz_context *ctx, fz_device *dev, const fz_matrix *ctm, const char *usage, pdf_gstate *gstate, int nested);

The component parts of this processor are generally functions named pdf_run_..., and frequently call back into the main pdf interpreter (to handle nested content streams as found in XObjects etc).

31.4.2 Filter processor

The pdf_filter_processor is an example of a processor that allows chaining. PDF operators are fed into the processor, which then ‘filters’ them and passes them out to another processor.

/* 
   pdf_new_filter_processor: Create a filter processor. This 
   filters the PDF operators it is fed, and passes them down 
   (with some changes) to the child processor. 
 
   The changes made by the filter are: 
 
   * No operations are allowed to change the top level gstate. 
   Additional q/Q operators are inserted to prevent this. 
 
   * Repeated/unnecessary colour operators are removed (so, 
   for example, "0 0 0 rg 0 1 rg 0.5 g" would be sanitised to 
   "0.5 g") 
 
   The intention of these changes is to provide a simpler, 
   but equivalent stream, repairing problems with mismatched 
   operators, maintaining structure (such as BMC, EMC calls) 
   and leaving the graphics state in a known (default) state 
   so that subsequent operations (such as synthesising new 
   operators to be appended to the stream) are easier. 
 
   The net graphical effect of the filtered operator stream 
   should be identical to the incoming operator stream. 
 
   chain: The child processor to which the filtered operators 
   will be fed. 
 
   old_res: The incoming resource dictionary. 
 
   new_res: An (initially empty) resource dictionary that will 
   be populated by copying entries from the old dictionary to 
   the new one as they are used. At the end therefore, this 
   contains exactly those resource objects actually required. 
 
*/ 
pdf_processor *pdf_new_filter_processor(fz_context *ctx, pdf_processor *chain, pdf_obj *old_res, pdf_obj *new_res);
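The two changes listed in the comment above can be modelled on a list of operator strings. The following is a deliberately simplified sketch, not the logic MuPDF uses: the toy_* functions are hypothetical, only the g, rg and k fill colour operators are recognised, and a colour operator is dropped only when the very next operator is another fill colour operator.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: does this operator string set the fill colour?
   The operator name is the last space-separated token. Only g, rg and k
   are recognised in this sketch (sc/scn are ignored for brevity). */
int toy_is_fill_colour_op(const char *op)
{
    assert(op != NULL);
    const char *name = strrchr(op, ' ');
    name = name ? name + 1 : op;
    return strcmp(name, "g") == 0 ||
           strcmp(name, "rg") == 0 ||
           strcmp(name, "k") == 0;
}

/* Copy operators from in[] to out[], dropping any fill colour operator
   that is immediately overridden by another one. Returns the number of
   operators kept. out[] must have room for n entries. */
int toy_drop_overridden_colours(const char **in, int n, const char **out)
{
    int kept = 0;
    for (int i = 0; i < n; i++) {
        int overridden = toy_is_fill_colour_op(in[i]) &&
                         i + 1 < n &&
                         toy_is_fill_colour_op(in[i + 1]);
        if (!overridden)
            out[kept++] = in[i];
    }
    return kept;
}

/* Count the q operators left without a matching Q. A filter could append
   this many Q operators so the stream leaves the top level graphics state
   untouched. Unmatched Q operators are simply ignored here. */
int toy_unmatched_saves(const char **ops, int n)
{
    int depth = 0;
    for (int i = 0; i < n; i++) {
        if (strcmp(ops[i], "q") == 0)
            depth++;
        else if (strcmp(ops[i], "Q") == 0 && depth > 0)
            depth--;
    }
    return depth;
}
```

Given the three operators "0 0 0 rg", "1 0 0 rg" and "0.5 g", toy_drop_overridden_colours keeps only "0.5 g", since neither of the earlier colours is ever used to paint anything.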

Similar filtering processors could be written for other tasks, such as discarding all the text from a page, changing all occurrences of a particular font for another, or converting all the objects on a page to a given colorspace.
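As an illustration of the first of these ideas, here is a minimal model of a chained filter that discards text. It is a sketch under stated assumptions: toy_proc and its functions are hypothetical names, operators are passed as whole strings, and "text" is taken to mean everything between BT and ET. A real filter would also need to consider text state set outside text objects and text used for clipping.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical chained processor: one callback for every operator,
   an optional child to pass operators on to, and some private state. */
typedef struct toy_proc toy_proc;
struct toy_proc {
    void (*op)(toy_proc *self, const char *text);
    toy_proc *chain;   /* child processor, or NULL for the end of the chain */
    int in_text;       /* filter state: are we between BT and ET? */
    char out[256];     /* sink state: operators received so far */
};

/* End of the chain: append each operator, newline separated, to out[]. */
void toy_sink_op(toy_proc *self, const char *text)
{
    size_t used = strlen(self->out);
    assert(used + strlen(text) + 2 <= sizeof self->out);
    snprintf(self->out + used, sizeof self->out - used, "%s\n", text);
}

/* Filter: swallow BT, ET and everything between them; pass the rest on. */
void toy_drop_text_op(toy_proc *self, const char *text)
{
    assert(self->chain != NULL);
    if (strcmp(text, "BT") == 0) { self->in_text = 1; return; }
    if (strcmp(text, "ET") == 0) { self->in_text = 0; return; }
    if (self->in_text)
        return;
    self->chain->op(self->chain, text);
}

/* Stand-in for the interpreter: feed each operator to the first processor. */
void toy_run(toy_proc *first, const char **ops, int n)
{
    for (int i = 0; i < n; i++)
        first->op(first, ops[i]);
}
```

The sink sees only what the filter chooses to forward, so replacing the sink with a different processor changes where the filtered stream ends up without touching the filter itself.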

The component parts of this processor are generally functions named pdf_filter_....

31.4.3 Buffer processor

The pdf_buffer_processor is designed to produce an fz_buffer from an input stream of operators. This is frequently found coupled with a pdf_filter_processor, to gather up the filtered version of the operator stream ready for reinsertion into the document.

/* 
   pdf_new_buffer_processor: Create a buffer processor. This 
   collects the incoming PDF operator stream into an fz_buffer. 
 
   buffer: The (possibly empty) buffer to which operators will be 
   appended. 
 
   ahxencode: If 0, then image streams will be sent as binary, 
   otherwise they will be ASCIIHex encoded. 
*/ 
pdf_processor *pdf_new_buffer_processor(fz_context *ctx, fz_buffer *buffer, int ahxencode);

This is built using a pdf_output_processor.
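One way to picture this layering is a processor that only knows how to write to an abstract output, combined with an output whose write function appends to a growable buffer. The sketch below is a model of that arrangement only; every toy_* name is hypothetical and nothing here reflects how MuPDF implements fz_output or fz_buffer internally.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer, kept NUL terminated for convenience. */
typedef struct {
    char *data;
    size_t len, cap;
} toy_buffer;

/* Hypothetical output abstraction: a write callback plus opaque state. */
typedef struct toy_output toy_output;
struct toy_output {
    void (*write)(toy_output *out, const char *data, size_t n);
    void *state;
};

/* Write callback that appends to a toy_buffer, growing it as required. */
void toy_buffer_write(toy_output *out, const char *data, size_t n)
{
    toy_buffer *buf = out->state;
    if (buf->len + n + 1 > buf->cap) {
        size_t cap = (buf->len + n + 1) * 2;
        char *p = realloc(buf->data, cap);
        assert(p != NULL);
        buf->data = p;
        buf->cap = cap;
    }
    memcpy(buf->data + buf->len, data, n);
    buf->len += n;
    buf->data[buf->len] = '\0';
}

/* Wrap an existing (possibly empty) buffer in an output. */
toy_output toy_output_for_buffer(toy_buffer *buf)
{
    toy_output out = { toy_buffer_write, buf };
    return out;
}

/* The "output processor" writes each operator, followed by a newline,
   to whatever output it was given. It knows nothing about buffers. */
typedef struct {
    toy_output *out;
} toy_output_processor;

void toy_out_op(toy_output_processor *proc, const char *text)
{
    proc->out->write(proc->out, text, strlen(text));
    proc->out->write(proc->out, "\n", 1);
}
```

A "buffer processor" in this model is then nothing more than a toy_output_processor constructed over toy_output_for_buffer(&buf), so all of the operator formatting lives in one place.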

31.4.4 Output processor

The pdf_output_processor is designed to produce an output stream from an input stream of operators. This is frequently found coupled with a pdf_filter_processor, to send the filtered version of the operator stream to an fz_output ready for reinsertion into the document.

/* 
   pdf_new_output_processor: Create an output processor. This 
   sends the incoming PDF operator stream to an fz_output stream. 
 
   out: The output stream to which operators will be sent. 
 
   ahxencode: If 0, then image streams will be sent as binary, 
   otherwise they will be ASCIIHex encoded. 
*/ 
pdf_processor *pdf_new_output_processor(fz_context *ctx, fz_output *out, int ahxencode);

The component parts of this processor are generally functions named pdf_out_....
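For readers unfamiliar with the encoding that the ahxencode flag selects, ASCIIHex represents each byte as two hexadecimal digits and marks the end of the data with '>'. The function below is a minimal sketch of that encoding; toy_asciihex_encode is a hypothetical name, and the exact layout MuPDF emits (digit case, line breaks) is not specified here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: ASCIIHex encode n bytes from in[] into out[].
   Writes two hex digits per byte, then the '>' end-of-data marker and a
   terminating NUL. Returns the number of characters written, not
   counting the NUL. out_size must be at least 2 * n + 2. */
size_t toy_asciihex_encode(const unsigned char *in, size_t n,
                           char *out, size_t out_size)
{
    static const char digits[] = "0123456789abcdef";
    size_t j = 0;

    assert(out != NULL && out_size >= 2 * n + 2);
    for (size_t i = 0; i < n; i++) {
        out[j++] = digits[in[i] >> 4];    /* high nibble */
        out[j++] = digits[in[i] & 0x0f];  /* low nibble */
    }
    out[j++] = '>';   /* end-of-data marker for ASCIIHexDecode */
    out[j] = '\0';
    return j;
}
```

The encoded form is roughly twice the size of the binary data, which is the price paid for keeping the content stream free of arbitrary binary bytes.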