31.5 Copying objects between PDF documents

PDF objects vary in complexity from simple values (booleans, integers, floats, names, etc.) to more complex entities (arrays, dictionaries, streams, indirect references, etc.). While the simplest object types are independent of any particular document, the more complex types are implicitly bound to the document in which they appear.

Most of the time this is all taken care of automatically by the MuPDF core, but special care must be taken when trying to copy objects from one PDF file to another.

31.5.1 The problem

To illustrate this, imagine that you have two PDF documents open, docA and docB, and that you want to look up an object in docA and insert it into docB. A naive code fragment to do this might be:

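/* Naive: stores docA's object in docB unchanged.
   pdf_dict_getp returns a borrowed reference, so the
   non-dropping pdf_dict_putp is used here. */
pdf_dict_putp(ctx,
              pdf_trailer(ctx, docB),
              "Root/Example",
              pdf_dict_getp(ctx,
                            pdf_trailer(ctx, docA),
                            "Root/Example"));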

Whether this works depends on the object being copied. Suppose docA contains the following:

1 0 obj
<<
/Type /Catalog
/Pages 3 0 R
/Metadata 9 0 R
/Example true
>>
endobj
...
trailer
<<
/Root 1 0 R
>>

The value of Root/Example is read as true, which can safely be written into another file.

Now suppose docA instead contains:

2 0 obj
/Complex
endobj
1 0 obj
<<
/Type /Catalog
/Pages 3 0 R
/Metadata 9 0 R
/Example [ (More) 2 0 R ]
>>
endobj
...
trailer
<<
/Root 1 0 R
>>

In this case the value of Root/Example is read as an array of 2 elements: the first is the string “More”, and the second is a reference to object 2 in the file.

If this were written directly into the new file, we’d still have an array of 2 elements, with the first element being the string “More”. The second, however, would refer to whatever object 2 in the new file happens to be, rather than to the object we intended.

The solution requires us to walk the directed (possibly cyclic) graph of child objects reachable from the object to be copied, and to ‘deep copy’ its contents from one file to the other.
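As a rough illustration of such a graph walk, here is a minimal sketch in plain C. It does not use MuPDF and is not how MuPDF implements the copy: toy_obj, toy_doc, toy_add, toy_link and toy_deep_copy are hypothetical names invented for this example, and an "object" is reduced to a label plus a list of object numbers it refers to. The point it tries to show is that each copied object gets a fresh number in the destination, and that recording that number before descending into the children is what lets a cyclic graph terminate.

```c
#include <assert.h>

#define MAX_OBJS 16
#define MAX_REFS 4

/* Hypothetical toy stand-in for an indirect object: a label plus
   references to other objects in the same document, held as object
   numbers. */
typedef struct {
    const char *label;
    int refs[MAX_REFS];
    int nrefs;
} toy_obj;

/* Hypothetical toy stand-in for a document. Object numbers start at 1;
   slot 0 is unused, so 0 can mean "no object". */
typedef struct {
    toy_obj objs[MAX_OBJS];
    int count; /* highest object number in use */
} toy_doc;

/* Append a new object to doc and return its object number. */
int toy_add(toy_doc *doc, const char *label)
{
    assert(doc->count + 1 < MAX_OBJS);
    doc->count++;
    doc->objs[doc->count].label = label;
    doc->objs[doc->count].nrefs = 0;
    return doc->count;
}

/* Make object `from` refer to object `to` within doc. */
void toy_link(toy_doc *doc, int from, int to)
{
    toy_obj *o = &doc->objs[from];
    assert(o->nrefs < MAX_REFS);
    o->refs[o->nrefs++] = to;
}

/* Deep copy object `num` of src into dst and return its new number.
   seen[n] is the number already assigned in dst to source object n,
   or 0 if source object n has not been copied yet. */
int toy_deep_copy(const toy_doc *src, toy_doc *dst, int num,
                  int seen[MAX_OBJS])
{
    int i, new_num;

    if (seen[num] != 0)
        return seen[num]; /* already copied: reuse, do not recurse */

    new_num = toy_add(dst, src->objs[num].label);
    seen[num] = new_num; /* record before recursing so cycles stop */

    for (i = 0; i < src->objs[num].nrefs; i++) {
        int child = toy_deep_copy(src, dst, src->objs[num].refs[i], seen);
        toy_link(dst, new_num, child);
    }
    return new_num;
}
```

In this sketch, copying an object whose child refers back to it produces exactly two new objects in the destination, numbered after whatever the destination already contained, with the references rewritten to the new numbers.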

31.5.2 Grafting objects

MuPDF provides pdf_graft_object to perform this deep copy:

/*
   pdf_graft_object: Return a deep copied object equivalent to the
   supplied object, suitable for use within the given document.

   dst: The document in which the returned object is to be used.

   obj: The object to deep copy.

   Note: If grafting multiple objects, you should use a pdf_graft_map
   to avoid potential duplication of target objects.
*/
pdf_obj *pdf_graft_object(fz_context *ctx, pdf_document *dst, pdf_obj *obj);

This takes an object in one document, and returns an equivalent object that can safely be written into document dst. Any indirect references within the original object will have been copied across as new objects within dst as a side effect of this call.

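The naive fragment from earlier can therefore be made safe as follows:

/* pdf_graft_object returns a new reference, which
   pdf_dict_putp_drop takes ownership of. */
pdf_dict_putp_drop(ctx,
               pdf_trailer(ctx, docB),
               "Root/Example",
               pdf_graft_object(ctx, docB,
                              pdf_dict_getp(ctx,
                                          pdf_trailer(ctx, docA),
                                          "Root/Example")));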

31.5.3 A further problem

Grafting objects one at a time still leaves a problem. Consider the following file:

2 0 obj
/Complex
endobj
1 0 obj
<<
/Type /Catalog
/Pages 3 0 R
/Metadata 9 0 R
/Example [ (More) 2 0 R ]
/Example2 [ (Even more) 2 0 R ]
>>
endobj
...
trailer
<<
/Root 1 0 R
>>

Suppose we want to copy both Root/Example and Root/Example2 between files. If we read the first of these and write it, object 2 will be copied to the new file (as a new object, 99 say). When we read the second one and write that, object 2 will be copied into the new file again (as object 100 perhaps).

In the example above, where the shared object is a single name, this duplication may not matter. But when you consider that objects might be dictionaries with lots of contents, or even streams with many megabytes of data attached, the problem becomes clear.

31.5.4 Graft maps

A pdf_graft_map is a mapping from one pdf_document to another that ensures each object in the source document is copied into the target document at most once.
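To make the effect of such a mapping concrete, here is a small sketch in plain C. It is a toy model under stated assumptions, not MuPDF code and not a description of MuPDF's internals: toy_doc, toy_graft_map, toy_graft_map_init, toy_graft_mapped and toy_copy_both are hypothetical names, and each toy object refers to at most one child. Source objects 1 and 2 play the role of two parents that share a child, object 3.

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX 32

/* Hypothetical toy document: child[n] is the object number that
   object n refers to, or 0 if it refers to nothing. */
typedef struct {
    int child[TOY_MAX];
    int count; /* highest object number in use */
} toy_doc;

/* Hypothetical toy graft map: remembers, for each source object
   number, the number it was given in the destination (0 = not
   copied yet). */
typedef struct {
    toy_doc *dst;
    int to_dst[TOY_MAX];
} toy_graft_map;

/* Start (or restart) a map targeting dst with nothing copied yet. */
void toy_graft_map_init(toy_graft_map *map, toy_doc *dst)
{
    memset(map, 0, sizeof *map);
    map->dst = dst;
}

/* Copy source object `num` (and its child chain) into the map's
   destination, reusing any object the map has already copied. */
int toy_graft_mapped(toy_graft_map *map, const toy_doc *src, int num)
{
    toy_doc *dst = map->dst;
    int new_num;

    if (num == 0)
        return 0;
    if (map->to_dst[num] != 0)
        return map->to_dst[num];

    assert(dst->count + 1 < TOY_MAX);
    new_num = ++dst->count;
    map->to_dst[num] = new_num;
    dst->child[new_num] = toy_graft_mapped(map, src, src->child[num]);
    return new_num;
}

/* Copy source objects 1 and 2 into an empty destination and return
   how many objects the destination ends up with. If share_map is 0,
   the map is reset between the two copies, which models grafting
   each object independently. */
int toy_copy_both(const toy_doc *src, int share_map)
{
    toy_doc dst;
    toy_graft_map map;

    memset(&dst, 0, sizeof dst);
    toy_graft_map_init(&map, &dst);
    toy_graft_mapped(&map, src, 1);
    if (!share_map)
        toy_graft_map_init(&map, &dst); /* forget earlier copies */
    toy_graft_mapped(&map, src, 2);
    return dst.count;
}
```

In this model, resetting the map between the two copies leaves the destination with four objects, because the shared child is copied twice. Keeping one map across both copies leaves it with three.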

/*
   pdf_new_graft_map: Prepare a graft map object to allow objects
   to be deep copied from one document to the given one, avoiding
   problems with duplicated child objects.

   dst: The document to copy objects to.

   Note: all the source objects must come from the same document.
*/
pdf_graft_map *pdf_new_graft_map(fz_context *ctx, pdf_document *dst);

/*
   pdf_drop_graft_map: Drop a graft map.
*/
void pdf_drop_graft_map(fz_context *ctx, pdf_graft_map *map);

/*
   pdf_graft_mapped_object: Return a deep copied object equivalent
   to the supplied object, suitable for use within the target
   document of the map.

   map: A map targeted at the document in which the returned
   object is to be used.

   obj: The object to deep copy.

   Note: Copying multiple objects via the same graft map ensures
   that any shared children are not duplicated.
*/
pdf_obj *pdf_graft_mapped_object(fz_context *ctx, pdf_graft_map *map, pdf_obj *obj);

A ‘safe’ version of the example given earlier that copies both Root/Example and Root/Example2 would therefore be:

pdf_graft_map *map = pdf_new_graft_map(ctx, docB);

/* Both grafts go through the same map, so object 2 is
   copied into docB only once. */
pdf_dict_putp_drop(ctx,
               pdf_trailer(ctx, docB),
               "Root/Example",
               pdf_graft_mapped_object(
                              ctx, map,
                              pdf_dict_getp(ctx,
                                          pdf_trailer(ctx, docA),
                                          "Root/Example")));
pdf_dict_putp_drop(ctx,
               pdf_trailer(ctx, docB),
               "Root/Example2",
               pdf_graft_mapped_object(
                              ctx, map,
                              pdf_dict_getp(ctx,
                                          pdf_trailer(ctx, docA),
                                          "Root/Example2")));
pdf_drop_graft_map(ctx, map);
