PDF objects vary in complexity from simple values (booleans, integers, floats, names, etc.) to more complex entities (arrays, dictionaries, streams, indirect references, etc.). While the simplest object types are independent of any particular document, the more complex types are implicitly bound to the document in which they appear.
Most of the time this is all taken care of automatically by the MuPDF core, but special care must be taken when trying to copy objects from one PDF file to another.
To illustrate this, imagine that we have two PDF documents open, docA and docB, and that we want to look up an object in docA and insert it into docB. A naive code fragment to do this might be:
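A minimal sketch of such a fragment, assuming a valid `fz_context *ctx` and two already opened `pdf_document` handles. `pdf_trailer`, `pdf_dict_getp` and `pdf_dict_putp` are MuPDF calls; the wrapper name `naive_copy` is hypothetical and exists only for illustration.

```c
#include <mupdf/fitz.h>
#include <mupdf/pdf.h>

/* Warning: bad code! Only safe when the value is a simple object.
 * naive_copy is a hypothetical name used for this sketch. */
static void naive_copy(fz_context *ctx, pdf_document *docA, pdf_document *docB)
{
	/* Borrowed reference: the object still belongs to docA. */
	pdf_obj *obj = pdf_dict_getp(ctx, pdf_trailer(ctx, docA), "Root/Example");

	/* Stores the object unchanged, so any indirect references inside it
	 * still carry docA's object numbers. */
	pdf_dict_putp(ctx, pdf_trailer(ctx, docB), "Root/Example", obj);
}
```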
This may actually work in limited cases, such as:
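For instance, docA might contain a catalog like the following. This is an illustrative fragment: the object number is arbitrary, and the trailer's Root entry is assumed to point at object 1.

```
1 0 obj
<<
	/Type /Catalog
	/Example true
>>
endobj
```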
The value of Root/Example is read as true, which can safely be written into another file.
This can easily fall down though, as can be seen in more complex cases:
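An illustrative fragment of this kind, again assuming that the trailer's Root entry points at object 1. The contents of object 2 are invented for the example.

```
1 0 obj
<<
	/Type /Catalog
	/Example [ (More) 2 0 R ]
>>
endobj

2 0 obj
/Complex
endobj
```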
In this case the value of Root/Example is read as an array of 2 elements; the first element being the string “More”, and the second being a reference to object 2 in the file.
If this were to be written directly into the new file, we’d still have an array of 2 elements, with the first element being the string “More”. The second would refer to whatever object 2 in the new file happens to be, which need not have anything to do with the original.
The solution to this requires us to walk the directed (possibly cyclic) graph of child objects within the object to be copied from one file to another, and to ‘deep copy’ the contents.
We refer to this process as ‘grafting’ objects from one tree into another.
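The walk described above can be sketched with a toy model. This is not MuPDF code and makes no claim about how MuPDF implements grafting: `toy_doc`, `toy_obj`, `toy_add` and `toy_graft` are hypothetical names, a "document" is just a table of numbered objects, and each object holds an integer payload and up to two references to other objects. The sketch shows the essential idea: every source object is given a fresh number in the destination, references are rewritten to those new numbers, and a table of already copied objects stops cycles from recursing forever.

```c
#include <assert.h>

/* Toy model, not MuPDF. Slot 0 is unused so that 0 can mean "no reference". */
#define MAX_OBJS 16

typedef struct {
	int payload;
	int refs[2]; /* object numbers within the same document, 0 for none */
} toy_obj;

typedef struct {
	toy_obj objs[MAX_OBJS];
	int count; /* highest object number in use */
} toy_doc;

/* Append a new object to doc and return its object number. */
static int toy_add(toy_doc *doc, int payload, int ref0, int ref1)
{
	int num = ++doc->count;
	assert(num < MAX_OBJS);
	doc->objs[num].payload = payload;
	doc->objs[num].refs[0] = ref0;
	doc->objs[num].refs[1] = ref1;
	return num;
}

/* Deep copy object src_num from src into dst and return its number in dst.
 * map[n] holds the destination number already assigned to source object n
 * (0 if it has not been copied yet). */
static int toy_graft(const toy_doc *src, toy_doc *dst, int *map, int src_num)
{
	int i, dst_num;

	if (src_num == 0)
		return 0; /* no reference to follow */
	if (map[src_num] != 0)
		return map[src_num]; /* already copied: reuse it */

	/* Record the mapping before recursing, so that a cycle leading back
	 * to src_num finds it instead of copying again. */
	dst_num = toy_add(dst, src->objs[src_num].payload, 0, 0);
	map[src_num] = dst_num;

	/* Copy the children and rewrite the references to their new numbers. */
	for (i = 0; i < 2; i++)
		dst->objs[dst_num].refs[i] = toy_graft(src, dst, map, src->objs[src_num].refs[i]);

	return dst_num;
}
```

A caller would zero-initialise an `int map[MAX_OBJS]` once and pass the same array to every `toy_graft` call between the same pair of documents.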
To move a single object to a new tree, use pdf_graft_object:
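In recent MuPDF releases the function is declared as shown below. Earlier releases used a different parameter list, so check the headers of the version you build against.

```c
pdf_obj *pdf_graft_object(fz_context *ctx, pdf_document *dst, pdf_obj *obj);
```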
This takes an object in one document, and returns an equivalent object that can safely be written into document dst. Any indirect references within the original object will have been copied across as new objects within dst as a side effect of this call.
The ‘safe’ version of the code given above would therefore be:
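A minimal sketch, under the same assumptions as before (a valid `ctx` and two open documents). The wrapper name `graft_example` is hypothetical. Because `pdf_graft_object` returns a new reference, the sketch uses `pdf_dict_putp_drop`, which stores the value and releases the caller's reference to it.

```c
#include <mupdf/fitz.h>
#include <mupdf/pdf.h>

/* graft_example is a hypothetical name used for this sketch. */
static void graft_example(fz_context *ctx, pdf_document *docA, pdf_document *docB)
{
	/* Borrowed reference into docA. */
	pdf_obj *obj = pdf_dict_getp(ctx, pdf_trailer(ctx, docA), "Root/Example");

	/* New reference that is valid in docB; any objects it refers to
	 * are copied across as new objects in docB. */
	pdf_obj *grafted = pdf_graft_object(ctx, docB, obj);

	/* The _drop variant takes over our reference to the grafted object. */
	pdf_dict_putp_drop(ctx, pdf_trailer(ctx, docB), "Root/Example", grafted);
}
```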
Even this is not perfect. Consider the example:
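An illustrative fragment in which two catalog entries share the same indirect object. The exact contents are invented for the example; the trailer's Root entry is assumed to point at object 1.

```
1 0 obj
<<
	/Type /Catalog
	/Example [ (More) 2 0 R ]
	/Example2 2 0 R
>>
endobj

2 0 obj
/Complex
endobj
```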
Suppose we want to copy both Root/Example and Root/Example2 between files. If we read the first of these, and write it, it will cause object 2 to be copied to the new file (as a new object, 99, say). When we read the second one, and write that, it will cause object 2 to be copied into the new file again (as object 100, perhaps).
In the example above, with the object consisting of a single name, this duplication may not matter, but when you consider that objects might be dictionaries with lots of contents, or even streams with many megabytes of data attached, the problem becomes clear.
The solution to this is to use a pdf_graft_map.
A pdf_graft_map is a mapping from one pdf_document to another that ensures each object in the source document is copied into the target document at most once.
A ‘safe’ version of the example given earlier that copies both Root/Example and Root/Example2 would therefore be:
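A minimal sketch, assuming a valid `ctx` and two open documents, and using the `pdf_new_graft_map`, `pdf_graft_mapped_object` and `pdf_drop_graft_map` calls found in recent MuPDF releases. The wrapper name `graft_both` is hypothetical. The map is created once for the destination document and shared by both grafts, so the second graft reuses the copy of object 2 made by the first.

```c
#include <mupdf/fitz.h>
#include <mupdf/pdf.h>

/* graft_both is a hypothetical name used for this sketch. */
static void graft_both(fz_context *ctx, pdf_document *docA, pdf_document *docB)
{
	/* One map per destination document, shared by every graft into it. */
	pdf_graft_map *map = pdf_new_graft_map(ctx, docB);

	fz_try(ctx)
	{
		pdf_obj *obj;

		obj = pdf_dict_getp(ctx, pdf_trailer(ctx, docA), "Root/Example");
		obj = pdf_graft_mapped_object(ctx, map, obj);
		pdf_dict_putp_drop(ctx, pdf_trailer(ctx, docB), "Root/Example", obj);

		/* Objects already copied through this map are reused, not duplicated. */
		obj = pdf_dict_getp(ctx, pdf_trailer(ctx, docA), "Root/Example2");
		obj = pdf_graft_mapped_object(ctx, map, obj);
		pdf_dict_putp_drop(ctx, pdf_trailer(ctx, docB), "Root/Example2", obj);
	}
	fz_always(ctx)
		pdf_drop_graft_map(ctx, map); /* released on success and on error */
	fz_catch(ctx)
		fz_rethrow(ctx);
}
```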