31.5 Copying objects between PDF documents

PDF objects vary in complexity from simple values (booleans, integers, floats, names, etc.) to more complex entities (arrays, dictionaries, streams, indirect references, etc.). While the simplest object types are independent of any particular document, the more complex types are implicitly bound to the document in which they appear.

Most of the time the MuPDF core handles this binding automatically, but special care is needed when copying objects from one PDF file to another.

31.5.1 The problem

To illustrate this, imagine that you have two PDF documents open, docA and docB, and that you want to look up an object in docA and insert it into docB. A naive code fragment to do this might be:

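/* Naive: inserts an object that belongs to docA straight into docB.
   pdf_dict_getp returns a borrowed reference, so the non-dropping
   pdf_dict_putp is used here. */
pdf_dict_putp(ctx,
              pdf_trailer(ctx, docB),
              "Root/Example",
              pdf_dict_getp(ctx,
                            pdf_trailer(ctx, docA),
                            "Root/Example"));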

This may actually work in limited cases, such as:

1 0 obj 
<< 
 /Type /Catalog 
 /Pages 3 0 R 
 /Metadata 9 0 R 
 /Example true 
>> 
endobj 
... 
trailer 
<< 
 /Root 1 0 R 
>>

The value of Root/Example is read as true, which can safely be written into another file.

This can easily fall down though, as can be seen in more complex cases:

2 0 obj 
/Complex 
endobj 
1 0 obj 
<< 
 /Type /Catalog 
 /Pages 3 0 R 
 /Metadata 9 0 R 
 /Example [ (More) 2 0 R ] 
>> 
endobj 
... 
trailer 
<< 
 /Root 1 0 R 
>>

In this case the value of Root/Example is read as an array of two elements: the first is the string “More”, and the second is a reference to object 2 in the file.

If this were written directly into the new file, we’d still have an array of two elements, the first being the string “More”, but the second would refer to whatever object 2 in the new file happens to be, if such an object exists at all.
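
The failure can be reproduced in a few lines of standalone C. The sketch below does not use MuPDF at all; toy_doc, toy_ref, toy_resolve and toy_ref_is_portable are hypothetical names invented for illustration, and a ‘document’ is modelled as nothing more than a table of strings indexed by object number. It shows only that a reference is a bare number whose meaning depends on the table it is resolved against.

```c
#include <assert.h>
#include <string.h>

/* Toy model, not MuPDF's data structures: a "document" is just a
   table of object values indexed by object number. */
typedef struct {
    const char *objects[8]; /* objects[n] is the value of "n 0 obj" */
} toy_doc;

/* An indirect reference stores only the object number. */
typedef struct {
    int num;
} toy_ref;

/* Resolving a reference always goes through one document's table. */
static const char *toy_resolve(const toy_doc *doc, toy_ref ref)
{
    return doc->objects[ref.num];
}

/* Returns 1 if the reference resolves to the same value in both
   documents, 0 otherwise. */
static int toy_ref_is_portable(const toy_doc *a, const toy_doc *b, toy_ref ref)
{
    return strcmp(toy_resolve(a, ref), toy_resolve(b, ref)) == 0;
}
```

With one toy document holding /Complex as object 2 and another holding something else in that slot, toy_ref_is_portable reports that the reference { 2 } does not survive being moved between them.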

The solution is to walk the directed (possibly cyclic) graph of objects reachable from the object being copied, and to ‘deep copy’ each of them into the destination file.

We refer to this process as ‘grafting’ objects from one tree into another.
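
The walk described above can be sketched in standalone C. This is a toy model under stated assumptions, not MuPDF's implementation: toy_obj, toy_doc and toy_graft are hypothetical names, objects carry an integer payload and at most two references, and the record of already copied objects is a plain array. The point it illustrates is that each source object is given its new number before its children are visited, which is what stops a cyclic graph from recursing forever.

```c
#include <assert.h>

#define TOY_MAX 16

/* Toy object: a payload plus up to two indirect references (0 = none). */
typedef struct {
    int payload;
    int refs[2];
} toy_obj;

/* Toy document: objects[1..count] are in use; object 0 is never used. */
typedef struct {
    toy_obj objects[TOY_MAX];
    int count;
} toy_doc;

/* Deep copy source object src_num into dst, following references.
   map[n] holds the destination number already assigned to source
   object n (0 = not yet copied). Returns the new object number. */
static int toy_graft(toy_doc *dst, const toy_doc *src, int src_num, int map[TOY_MAX])
{
    int dst_num, i;

    if (src_num == 0)
        return 0;               /* no reference to follow */
    if (map[src_num] != 0)
        return map[src_num];    /* already copied: reuse it */

    assert(dst->count + 1 < TOY_MAX);
    dst_num = ++dst->count;
    map[src_num] = dst_num;     /* record before recursing, so cycles terminate */

    dst->objects[dst_num].payload = src->objects[src_num].payload;
    for (i = 0; i < 2; i++)
        dst->objects[dst_num].refs[i] =
            toy_graft(dst, src, src->objects[src_num].refs[i], map);
    return dst_num;
}
```

For a source in which object 1 refers to object 2 and object 2 refers back to object 1, grafting object 1 into a document that already holds three objects produces new objects 4 and 5 that refer to each other, and the recursion terminates.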

31.5.2 Grafting objects

To graft a single object into another document, use pdf_graft_object:

/* 
   pdf_graft_object: Return a deep copied object equivalent to the 
   supplied object, suitable for use within the given document. 
 
   dst: The document in which the returned object is to be used. 
 
   obj: The object to deep copy.
 
   Note: If grafting multiple objects, you should use a pdf_graft_map 
   to avoid potential duplication of target objects. 
*/ 
pdf_obj *pdf_graft_object(fz_context *ctx, pdf_document *dst, pdf_obj *obj);

This takes an object in one document, and returns an equivalent object that can safely be written into document dst. Any indirect references within the original object will have been copied across as new objects within dst as a side effect of this call.

The ‘safe’ version of the code given above would therefore be:

pdf_dict_putp_drop(ctx,
                   pdf_trailer(ctx, docB),
                   "Root/Example",
                   pdf_graft_object(ctx, docB,
                                    pdf_dict_getp(ctx,
                                                  pdf_trailer(ctx, docA),
                                                  "Root/Example")));

31.5.3 A further problem

Even this is not perfect. Consider the example:

2 0 obj 
/Complex 
endobj 
1 0 obj 
<< 
 /Type /Catalog 
 /Pages 3 0 R 
 /Metadata 9 0 R 
 /Example [ (More) 2 0 R ] 
 /Example2 [ (Even more) 2 0 R ] 
>> 
endobj 
... 
trailer 
<< 
 /Root 1 0 R 
>>

Suppose we want to copy both Root/Example and Root/Example2 between files using pdf_graft_object. Reading the first and writing it causes object 2 to be copied to the new file (as a new object; 99, say). Reading the second and writing it causes object 2 to be copied into the new file again (as object 100, perhaps).

In the example above, where the shared object is a single name, this duplication may not matter. But objects might be dictionaries with many entries, or streams with many megabytes of data attached, and then the cost of duplication becomes clear.

The solution to this is to use a pdf_graft_map.

31.5.4 Graft maps

A pdf_graft_map records which objects in a source document have already been copied into a target document, ensuring that each source object is copied at most once.
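
The effect of sharing a map between calls can be seen in a standalone toy model. Nothing here is MuPDF code: toy_graft_map, toy_graft_map_init, toy_graft_mapped and toy_copy_two are hypothetical names, and the ‘map’ is assumed to be a simple array from source object number to destination object number. Source objects 1 and 2 stand in for Root/Example and Root/Example2, and both refer to object 3, which plays the part of the shared /Complex object.

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX 16

/* Toy object: a payload and one indirect reference (0 = none). */
typedef struct {
    int payload;
    int child;
} toy_obj;

/* Toy document: objects[1..count] are in use. */
typedef struct {
    toy_obj objects[TOY_MAX];
    int count;
} toy_doc;

/* Hypothetical stand-in for a graft map: remembers, per source
   object number, the number it was given in the destination. */
typedef struct {
    toy_doc *dst;
    int dst_num[TOY_MAX];
} toy_graft_map;

static void toy_graft_map_init(toy_graft_map *map, toy_doc *dst)
{
    memset(map, 0, sizeof *map);
    map->dst = dst;
}

/* Deep copy src_num into the map's destination, reusing any
   object that this map has already copied. */
static int toy_graft_mapped(toy_graft_map *map, const toy_doc *src, int src_num)
{
    int n;

    if (src_num == 0)
        return 0;
    if (map->dst_num[src_num] != 0)
        return map->dst_num[src_num];

    assert(map->dst->count + 1 < TOY_MAX);
    n = ++map->dst->count;
    map->dst_num[src_num] = n;
    map->dst->objects[n].payload = src->objects[src_num].payload;
    map->dst->objects[n].child =
        toy_graft_mapped(map, src, src->objects[src_num].child);
    return n;
}

/* Graft source objects 1 and 2 (both referring to object 3) into an
   empty destination and return how many objects it ends up with. */
static int toy_copy_two(int share_map)
{
    toy_doc src, dst;
    toy_graft_map map;

    memset(&src, 0, sizeof src);
    memset(&dst, 0, sizeof dst);
    src.count = 3;
    src.objects[1].payload = 1;
    src.objects[1].child = 3;
    src.objects[2].payload = 2;
    src.objects[2].child = 3;
    src.objects[3].payload = 99;

    toy_graft_map_init(&map, &dst);
    toy_graft_mapped(&map, &src, 1);
    if (!share_map)
        toy_graft_map_init(&map, &dst); /* forget what was copied so far */
    toy_graft_mapped(&map, &src, 2);
    return dst.count;
}
```

toy_copy_two(0) resets the map between the two grafts and leaves four objects in the destination, with the shared object copied twice; toy_copy_two(1) keeps the map and leaves three.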

/* 
   pdf_new_graft_map: Prepare a graft map object to allow objects 
   to be deep copied from one document to the given one, avoiding 
   problems with duplicated child objects. 
 
   dst: The document to copy objects to. 
 
   Note: all the source objects must come from the same document. 
*/ 
pdf_graft_map *pdf_new_graft_map(fz_context *ctx, pdf_document *dst); 
 
/* 
   pdf_drop_graft_map: Drop a graft map. 
*/ 
void pdf_drop_graft_map(fz_context *ctx, pdf_graft_map *map); 
 
/* 
   pdf_graft_mapped_object: Return a deep copied object equivalent 
   to the supplied object, suitable for use within the target 
   document of the map. 
 
   map: A map targeted at the document in which the returned 
   object is to be used. 
 
   obj: The object to deep copy.
 
   Note: Copying multiple objects via the same graft map ensures
   that any shared child objects are copied only once.
*/ 
pdf_obj *pdf_graft_mapped_object(fz_context *ctx, pdf_graft_map *map, pdf_obj *obj);

A ‘safe’ version of the example given earlier that copies both Root/Example and Root/Example2 would therefore be:

pdf_graft_map *map = pdf_new_graft_map(ctx, docB); 
 
pdf_dict_putp_drop(ctx,
                   pdf_trailer(ctx, docB),
                   "Root/Example",
                   pdf_graft_mapped_object(ctx, map,
                                           pdf_dict_getp(ctx,
                                                         pdf_trailer(ctx, docA),
                                                         "Root/Example")));
pdf_dict_putp_drop(ctx,
                   pdf_trailer(ctx, docB),
                   "Root/Example2",
                   pdf_graft_mapped_object(ctx, map,
                                           pdf_dict_getp(ctx,
                                                         pdf_trailer(ctx, docA),
                                                         "Root/Example2")));
pdf_drop_graft_map(ctx, map);