Ghostscript : ZUGFeRD

ZUGFeRD is an acronym; the original name is Zentraler User Guide des Forums elektronische Rechnung Deutschland (or Central User Guide for Electronic Invoicing in Germany, for those of us who don’t speak German). It describes a system for electronic invoicing originally developed for use in Germany. There is a de facto identical French e-invoice format called Factur-X, and both are intended (with version 2.0 at least) to be consistent with EU Directive 2014/55/EU and the EN 16931 standard for electronic invoicing in the EU.

The goal is to deliver human-readable invoices which can also be processed quickly and accurately by computerised systems to reduce the costs of invoicing and improve accounting efficiency.

In technical terms a ZUGFeRD document is a PDF file with an embedded XML file which contains the machine-readable structured data.

While PDF is widely supported and does a good job of rendering visible content as the author intended, it is extremely difficult for computers to process reliably in order to extract the invoicing data. The XML format is nearly the opposite; it is simple for computers to parse and extract the relevant tagged data, but difficult for unskilled humans to read.

Obviously the information in the XML should be used to generate the appearance displayed by rendering the PDF file!

Where is Ghostscript involved?

In order to create a ZUGFeRD file a user may often have an existing PDF file and an XML file which contains the invoicing data, and need to combine the two into a single ZUGFeRD file. In addition the ZUGFeRD PDF file must conform to a particular PDF standard; PDF/A-3. Finally some additional information is required in the XML Metadata of the PDF file.

Ghostscript can readily transform the PDF file containing the visual representation of the invoice, and with some additional effort can combine the XML file at the same time, as well as adding the extra required XML Metadata which will result in a conforming ZUGFeRD file.

The following sections describe in more technical detail what exactly constitutes a ZUGFeRD file, what steps need to be taken to create one, and document those steps. Less technically minded readers might want to skip these sections, they are mainly intended for the technically curious, especially for anyone interested in modifying the process. The final solution is described in the summary at the end.

The PDF/A standard

Before we go any further we need to consider the standards we are required to use. I won’t look at the XML file, because Ghostscript is completely incapable of reading that. We just have to assume that the file is valid and the PDF file which contains the visual representation is accurate. But we do need to understand a little about the PDF/A standard.

PDF/A is the standard for ‘archive’ format PDF, the intention is that files which conform to this standard follow certain additional rules which will make them relatively future-proof and suitable for long-term archiving. For example, PDF/A requires that all fonts used in the document are embedded in the PDF file so that there is no requirement in future to find the relevant fonts in order to display the contents.

Because the main goal of PDF is to display the same, as far as possible, across different platforms a great deal of the PDF/A specification covers things like colour spaces. PDF/A files must use characterised colour spaces for example, again in order to ensure that in future the file will display the same as it does today. Much of this is not critical for the task of displaying invoices but it is useful to adopt a standard, and this standard is at least intended for long-term archiving.

There are multiple versions of the PDF/A standard. At the time of writing there are version numbers 1 to 3 and version 4 is due for release but has been delayed presumably because of the ongoing global pandemic. Fundamentally the versions are as follows;

Version 1, the original and the simplest is based on PDF 1.4.
Version 2 is based on PDF 1.7, permits transparency to be used (it is forbidden in version 1) and permits the embedding of PDF/A files within the PDF file (embedded files are barred in version 1).
Version 3 permits files other than PDF/A-conformant files to be embedded into the PDF file (barred in earlier versions).

Since ZUGFeRD requires us to embed an XML file inside the PDF we obviously must create at least a PDF/A-3 file.

PDF/A also has what are called conformance levels. The basic level of conformance is ‘b’, and the advanced (or accessible) level is ‘a’. Version 2 adds a third conformance level, ‘u’. I won’t go into details here, but essentially conformance level ‘a’ means that the PDF file contains additional information beyond the marks on the page, including details such as structure data (heading, body, etc) and Unicode character maps (ToUnicode CMaps). This information is useful for search/copy/paste and also for tools such as text-to-speech converters. The ‘u’ conformance level indicates that a file has ToUnicode information, but no other additional metadata.

Because of the way that Ghostscript’s pdfwrite device works (https://www.ghostscript.com/doc/9.54.0/VectorDevices.htm#Overview) we cannot be certain that a given input file will have all the information required to satisfy anything but the basic requirements of PDF/A. Section 6.8.1 of the PDF/A-1 specification (ISO 19005-1) has a note stating:

PDF/A-1 writers should not add structural or semantic information that is not explicitly or implicitly present in the source material solely for the purpose of achieving conformance.

Similarly Section 6.7.1 of the PDF/A-2 specification (ISO 19005-2) has a note:

PDF/A-2 writers should not add structural or semantic information that is not explicitly or implicitly present in the source material solely for the purpose of achieving conformance.

As a result Ghostscript’s pdfwrite device always writes PDF/A conforming files, of whatever version, as level ‘b’ conformance.

Creating PDF/A-3 files with Ghostscript

Because of the colour space requirements of PDF/A this is a little more involved than we would really like. The process is documented here https://www.ghostscript.com/doc/9.54.0/VectorDevices.htm#PDFA but we’ll cover it in this document.

The first thing you need to do is to use the pdfwrite device, so that Ghostscript will write a PDF file, instead of a TIFF file or a JPEG or simply draw on the display. We do that using the DEVICE switch, and that switch takes a string as a parameter (the name of the device) so it’s a -s switch. For example (note that on Windows you would use gswin64c, not gs):

gs -sDEVICE=pdfwrite

Now we also need to tell Ghostscript what we would like the output file to be called, and we can do that in a couple of different ways. Firstly there’s the OutputFile switch, which again takes a string and so is a -s. Eg:

gs -sDEVICE=pdfwrite -sOutputFile=foo.pdf

You could also use the -o switch which is ‘special’, it not only sets the output filename, but it also includes the BATCH and NOPAUSE switches, so that Ghostscript will not wait for confirmation between pages, and will exit when done instead of returning to the interactive PostScript interpreter prompt. Eg:

gs -sDEVICE=pdfwrite -o foo.pdf

Finally we need to tell Ghostscript the name of the input file. We’re assuming here that the input file is a PDF file but it could also be a PostScript program if that was what you created as a representation of the XML. It is technically possible to create PDF/A files from XPS and PCL (including PXL/PCL6) input, but that’s beyond the scope of this document. Anyway, the full command line to create a PDF file would look like this:

gs -sDEVICE=pdfwrite -o foo.pdf bar.pdf

Where bar.pdf is the existing PDF file.
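As a rough sketch of what -o stands for, the helper below prints the long-form equivalent of the command above, using only the OutputFile, BATCH and NOPAUSE switches already mentioned. The function name long_form is made up for this illustration, and the command is printed rather than executed so that it can be inspected first.

```shell
# Print the long-form equivalent of: gs -sDEVICE=pdfwrite -o OUT IN
# long_form is a hypothetical helper name used only for this illustration.
long_form() {
    out=$1   # output PDF file
    in=$2    # input file
    printf 'gs -sDEVICE=pdfwrite -dBATCH -dNOPAUSE -sOutputFile=%s %s\n' "$out" "$in"
}

long_form foo.pdf bar.pdf
```

On Windows the printed 'gs' would need to be replaced with gswin64c, as noted earlier.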

Now we need to tell pdfwrite that we want to make a PDF/A file, and in fact specifically a PDF/A-3 file, this time we add the PDFA switch and it has a numeric parameter (the PDF/A version) so it takes a -d switch:

gs -sDEVICE=pdfwrite -dPDFA=3 -o foo.pdf bar.pdf

Now if you try that command line, and then verify the PDF file, it’s quite likely that it will fail to verify as a conformant PDF/A file. The reason is that we need to do some extra work surrounding colour spaces.

PDF/A requires that all colour specifications be ‘characterised’. That means we can’t just give a random RGB triplet, because different devices will display the RGB values differently (even if both are RGB devices). What we need to do is supply an ICC profile for the colour.

What the ICC profile does is turn the colour (say an RGB colour) into a device-independent colour space (the CIE XYZ colour space). That’s essentially a mathematical model of colour, and what we can do is use a 2 way transformation. Let’s say for example that we have chosen a colour that looks the way we want it to, and we’re using a Sony LCD panel for which we have an ICC profile. We take the RGB values which we selected and the ICC profile and we put both of those in the PDF file. Now we take that PDF file to another device, say an LG monitor, for which we also have an ICC profile. The PDF consumer takes the component values, uses the Sony ICC profile to turn the numbers into a value in CIE space, then uses the LG ICC profile to turn the CIE space value into an RGB triplet which we can display on the LG monitor. If the profiles and colour management engine are doing their job properly the colour on the LG monitor should be the same as it was on the Sony monitor.

The point of this, of course, is to ensure that the archive format can continue to be displayed in the future and look the same as it originally did. It’s probably the case that we don’t really care that much about colour accuracy in our invoice, but it’s part of the standard.

So we need to make sure that all the colour specifications in our PDF file are either in a device-independent colour space (ICC profiles or one of the built in spaces like Lab) or are in one device space, and we provide an ICC profile for that space.

There are two parts to this, firstly we want to have pdfwrite convert all colours for us, and we do that by specifying the ColorConversionStrategy. This is an unusual parameter, it can either be a string or a name. If you want to specify it as a string then you use the -s switch and if you want to specify it as a name then you use -d. If you are using a string then the possible values are DeviceIndependentColor, RGB or CMYK and if you are using the name format then the possible values are /DeviceIndependentColor, /DeviceRGB or /DeviceCMYK. You could use Gray or /DeviceGray but you probably don’t want to do that in this case, as it will turn all the colours into shades of grey.

Examples:

gs -sDEVICE=pdfwrite -dPDFA=3 -sColorConversionStrategy=RGB -o foo.pdf bar.pdf
gs -sDEVICE=pdfwrite -dPDFA=3 -dColorConversionStrategy=/DeviceRGB -o foo.pdf bar.pdf
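The correspondence between the string form and the name form can be written down as a small lookup. This is only a sketch of the mapping described above; strategy_name is a hypothetical helper name, not part of Ghostscript.

```shell
# Map the string form of ColorConversionStrategy (used with -s)
# to the equivalent name form (used with -d).
# strategy_name is a hypothetical helper name.
strategy_name() {
    case $1 in
        DeviceIndependentColor) echo /DeviceIndependentColor ;;
        RGB)  echo /DeviceRGB ;;
        CMYK) echo /DeviceCMYK ;;
        Gray) echo /DeviceGray ;;
        *)    echo "unknown strategy: $1" >&2; return 1 ;;
    esac
}

strategy_name RGB
```

So -sColorConversionStrategy=RGB and -dColorConversionStrategy=/DeviceRGB ask for the same conversion.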

Observant readers will have spotted that we still haven’t actually told pdfwrite anything about an ICC profile, which we obviously need to do. The only way to add that currently is to use the PostScript programming language. To help you out there’s a boilerplate example included in /ghostpdl/lib/pdfa_def.ps but you do need to edit it. The lines marked with the PostScript comment “% Customise” are the ones you need to change, of these the most critical is this one:

% Define an ICC profile :

/ICCProfile (srgb.icc) % Customise

You must replace the ‘srgb.icc’ with a fully qualified path to an appropriate ICC profile, in particular this must be the right kind of profile. If you specified ColorConversionStrategy=RGB then it needs to be an RGB profile, if you specified CMYK then it needs to be a CMYK profile. If you are concerned about colour fidelity then you should use the ICC profile appropriate for your device. If you decide that’s not hugely relevant to you then you can use any convenient profile of the correct type. Ghostscript ships with its own default ICC profiles in ghostpdl/iccprofiles and you can use one of those.

The other line you really need to alter is this one:

/OutputConditionIdentifier (sRGB) % Customize

This ought to be either one of the standard spaces from the ICC Characterisation Data Registry or Custom.
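If you would rather not edit the file by hand, the two changes can be scripted. The sketch below assumes the two lines appear in pdfa_def.ps exactly as quoted above; customise_pdfa_def is a hypothetical helper name, and it writes a modified copy rather than changing the original.

```shell
# Write a customised copy of pdfa_def.ps with the ICC profile path
# and OutputConditionIdentifier filled in.
# customise_pdfa_def is a hypothetical helper; it assumes the two
# lines look exactly as quoted above and that the values contain
# no '|', '&' or backslash characters (these are special to sed).
customise_pdfa_def() {
    src=$1      # original pdfa_def.ps
    dst=$2      # customised copy to write
    profile=$3  # fully qualified path to the ICC profile
    ident=$4    # for example sRGB or Custom
    sed -e "s|/ICCProfile (srgb.icc)|/ICCProfile ($profile)|" \
        -e "s|/OutputConditionIdentifier (sRGB)|/OutputConditionIdentifier ($ident)|" \
        "$src" > "$dst"
}
```

A call might look like: customise_pdfa_def ghostpdl/lib/pdfa_def.ps /usr/home/me/pdfa_def.ps /usr/home/me/profile.icc Custom. It is worth checking the result by eye, since any other ‘% Customise’ lines are left untouched.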

Once you’ve modified that file then you need to tell Ghostscript to use it, for example:

gs -sDEVICE=pdfwrite -dPDFA=3 -sColorConversionStrategy=RGB -o foo.pdf /usr/home/me/pdfa_def.ps bar.pdf

But if you try that with current versions of Ghostscript it may throw an invalidfileaccess error. This is because the pdfa_def.ps file is a PostScript program and it is trying to open a file on disk (the ICC profile). Recent versions of Ghostscript have defaulted to running in SAFER mode which bars the interpreter from reading files on disk. There’s considerable effort in the code to try and make this transparent to the user, so input files (like pdfa_def.ps and bar.pdf) are automatically added to a ‘permitted’ list, along with some other paths and resources. But in general you can’t just go around opening files.

So to deal with that we have two options, the first is to use -dNOSAFER which turns off that protection. It’s fine to use that while you are working things out, but we really wouldn’t recommend you do that in the long term as it’s a security risk. Instead you should add the file to the list of permitted files for reading, which you do using the --permit-file-read control. This is documented here https://www.ghostscript.com/doc/9.54.0/Use.htm#Other_parameters under -dSAFER, but for now all you really need to do is add the filename of the ICC profile to the read list:

gs --permit-file-read=/usr/home/me/profile.icc -sDEVICE=pdfwrite -dPDFA=3 -sColorConversionStrategy=RGB -o foo.pdf /usr/home/me/pdfa_def.ps bar.pdf
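The path given to --permit-file-read has to be the same one written into pdfa_def.ps, so one way to keep the two in step is to read it back out of the file. This is a sketch only: icc_profile_path is a hypothetical helper, and it assumes the /ICCProfile line has the form quoted earlier, with the path in the first pair of parentheses.

```shell
# Extract the ICC profile path from a customised pdfa_def.ps so that
# the same path can be passed to --permit-file-read.
# icc_profile_path is a hypothetical helper name.
icc_profile_path() {
    sed -n 's|^[[:space:]]*/ICCProfile (\([^)]*\)).*$|\1|p' "$1"
}

def=/usr/home/me/pdfa_def.ps   # illustrative path, as in the command above

# Only run the conversion when the file and Ghostscript are both present.
if [ -f "$def" ] && command -v gs >/dev/null 2>&1; then
    profile=$(icc_profile_path "$def")
    gs --permit-file-read="$profile" -sDEVICE=pdfwrite -dPDFA=3 \
       -sColorConversionStrategy=RGB -o foo.pdf "$def" bar.pdf
fi
```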

At this point, all being well, you will have turned your original PDF file into a PDF/A-3 file. There is just one further consideration: what happens if the original file contains something which cannot be put into a PDF/A-3 file?

Because PDF/A is a subset of the PDF specification, and it forbids certain kinds of content, it is possible that, when the input is a PDF file, it is not possible to deal with some of the content and still produce a valid PDF/A file. What happens then is controlled by the PDFACompatibilityPolicy switch. This is a numeric switch with three possible values:

0 - (default) Include the feature or operation in the output file, the file will not be PDF/A compliant. Because the document Catalog is emitted before this is encountered, the file will still contain PDF/A metadata but will not be compliant. A warning will be emitted in this case.
1 - The feature or operation is ignored, the resulting PDF file will be PDF/A compliant. A warning will be emitted for every elided feature.
2 - Processing of the file is aborted with an error, the exact error may vary depending on the nature of the PDF/A incompatibility.
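For unattended use, value 2 is probably the most convenient because a failure can be detected from the script. The sketch below assumes that Ghostscript exits with a non-zero status when processing is aborted; convert_strict and the GS variable are hypothetical names introduced for this example.

```shell
# Convert one file with PDFACompatibilityPolicy=2 and report failures.
# convert_strict is a hypothetical helper; GS may be set to name a
# different executable (for example gswin64c on Windows).
convert_strict() {
    in=$1       # existing PDF file
    out=$2      # PDF/A-3 file to create
    def=$3      # customised pdfa_def.ps
    profile=$4  # ICC profile named in pdfa_def.ps
    if "${GS:-gs}" --permit-file-read="$profile" -sDEVICE=pdfwrite \
            -dPDFA=3 -dPDFACompatibilityPolicy=2 \
            -sColorConversionStrategy=RGB -o "$out" "$def" "$in"; then
        echo "converted: $in"
    else
        echo "not convertible to PDF/A-3: $in" >&2
        return 1
    fi
}
```

A call such as convert_strict bar.pdf foo.pdf /usr/home/me/pdfa_def.ps /usr/home/me/profile.icc then either reports success or returns a failure status that the calling script can act on.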

So now that we’ve covered creating a PDF/A file, let’s look at making a ZUGFeRD file.

Creating a ZUGFeRD file

So we’ve created a PDF/A-3b file, all we need to do now is embed the XML file into the PDF, and wrap it up with the relevant Metadata. To do this we’re going to need some more PostScript programming. You can either add this to the modified pdfa_def.ps file or create a new file with the embedding code in it. You can also add the embedding code at the Ghostscript command line using the -c switch to introduce PostScript, but as the amount of coding required is rather large it’s almost certainly better to write a file to hold it all.

Embedding XML

The first thing we need to do is define a PDF stream object to hold the XML invoice:

[ /_objdef {InvoiceStream} /type /stream /OBJ pdfmark

Now we need to add the required entries to that stream dictionary:

[ {InvoiceStream} << /Type /EmbeddedFile /Subtype (text/xml) cvn /Params << /ModDate (D:20130121081433+01'00') >> >> /PUT pdfmark

Then read the XML invoice data from the file and store it in the PDF stream:

[ {InvoiceStream} (/usr/home/me/invoice1.xml) (r) file /PUT pdfmark

Finally close the PDF stream:

[ {InvoiceStream} /CLOSE pdfmark

Now that we have a stream containing the XML invoice embedded into the PDF file we need to add the internal plumbing that tells the PDF file consumer that there is an embedded file and what its properties are. To do this we create a File Specification dictionary which describes the embedded file, create an Associated Files array and put a reference to the File Specification into it. Finally we add an /AF entry to the PDF file’s Catalog dictionary which references that Associated Files array.

A PDF consumer can look in the Catalog dictionary to see that there is an AF entry, which tells it that there are embedded files. Then the array has a File Specification for each embedded file, allowing the consumer to deal with them.

So we start by creating the File Specification dictionary for the embedded XML invoice:

[ /_objdef {Invoice_FSDict} /type /dict /OBJ pdfmark

And just like the stream we populate the dictionary with the required keys:

[ {Invoice_FSDict} << /Type /FileSpec /F (/usr/home/me/invoice1.xml) /UF (/usr/home/me/invoice1.xml) /Desc (ZUGFeRD XML invoice) /AFRelationship /Alternative /EF << /F {InvoiceStream} /UF {InvoiceStream} >> >> /PUT pdfmark

Now we create the Associated Files array to hold the File Specification dictionary:

[ /_objdef {AFArray} /type /array /OBJ pdfmark

And we add the File Specification dictionary to the array:

[ {AFArray} {Invoice_FSDict} /APPEND pdfmark

Note that if you wanted to include multiple embedded files you could do so by creating each one as we did for the XML invoice and then use multiple APPEND pdfmarks to add them to the Associated Files array. But you would only need one Associated Files array.

Anyway, having created the Associated Files array, now we need to put it into the PDF file’s Catalog dictionary (PDF files are created with a Catalog dictionary so we don’t need to make one).

[ {Catalog} << /AF {AFArray} >> /PUT pdfmark

And now, finally, we can tell the pdfwrite device to actually embed the file. Technically what this does is create an EmbeddedFiles dictionary in the PDF file’s name tree (and create the name tree if there isn’t one already). This pdfmark is quite simple:

[ /Name (/usr/home/me/invoice1.xml) /FS {Invoice_FSDict} /EMBED pdfmark
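Since the same ten pdfmarks are needed for every invoice, with only the XML path and modification date changing, the PostScript file can be generated. The sketch below simply writes out the pdfmarks from this section, using the object name Invoice_FSDict throughout; write_embed_ps is a hypothetical helper name.

```shell
# Write the embedding pdfmarks described above to a PostScript file.
# write_embed_ps is a hypothetical helper. The XML path must not
# contain parentheses or backslashes, since it is placed inside
# PostScript strings.
write_embed_ps() {
    xml=$1       # fully qualified path to the XML invoice
    moddate=$2   # PDF date string, e.g. D:20130121081433+01'00'
    out=$3       # PostScript file to create
    cat > "$out" <<EOF
% Stream object holding the XML invoice
[ /_objdef {InvoiceStream} /type /stream /OBJ pdfmark
[ {InvoiceStream} << /Type /EmbeddedFile /Subtype (text/xml) cvn /Params << /ModDate ($moddate) >> >> /PUT pdfmark
[ {InvoiceStream} ($xml) (r) file /PUT pdfmark
[ {InvoiceStream} /CLOSE pdfmark
% File Specification dictionary describing the stream
[ /_objdef {Invoice_FSDict} /type /dict /OBJ pdfmark
[ {Invoice_FSDict} << /Type /FileSpec /F ($xml) /UF ($xml) /Desc (ZUGFeRD XML invoice) /AFRelationship /Alternative /EF << /F {InvoiceStream} /UF {InvoiceStream} >> >> /PUT pdfmark
% Associated Files array, referenced from the Catalog
[ /_objdef {AFArray} /type /array /OBJ pdfmark
[ {AFArray} {Invoice_FSDict} /APPEND pdfmark
[ {Catalog} << /AF {AFArray} >> /PUT pdfmark
% Ask pdfwrite to embed the file
[ /Name ($xml) /FS {Invoice_FSDict} /EMBED pdfmark
EOF
}
```

The generated file would then be given to Ghostscript on the command line ahead of the input PDF, in the same way as pdfa_def.ps, and the XML file would need to be on the permitted read list.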

Additional XML Metadata

In order to be a valid ZUGFeRD file the PDF file must contain XML Metadata, which is required for a PDF/A file anyway, and that XML Metadata must contain some additional data. This is somewhat problematic; there is a Metadata pdfmark but as far as I can tell this completely replaces the XML Metadata in the PDF file.

There are two problems with this; firstly the portion of the XML Metadata which is not specific to ZUGFeRD must match the Document Information dictionary values (although this is optional we choose to always emit it) which would be very difficult to arrange. Secondly the content of the non-ZUGFeRD specific XML is quite extensive.

To address this we have added a non-standard pdfmark called Ext_Metadata. Any XML supplied in that fashion is added to the XML Metadata when it is written. This permits us to fairly trivially add the XML required and still have pdfwrite do the hard work of synchronising the Document Information dictionary and the remaining XML values. The required command is large but never needs to be modified:


[
    /XML
(
    <!-- XMP extension schema container for the zugferd schema -->
    <rdf:Description rdf:about=""
        xmlns:pdfaExtension="http://www.aiim.org/pdfa/ns/extension/"
        xmlns:pdfaSchema="http://www.aiim.org/pdfa/ns/schema#"
        xmlns:pdfaProperty="http://www.aiim.org/pdfa/ns/property#">
        <!-- Container for all embedded extension schema descriptions -->
        <pdfaExtension:schemas>
            <rdf:Bag>
                <rdf:li rdf:parseType="Resource">
                    <!-- Optional description of schema -->
                    <pdfaSchema:schema>ZUGFeRD PDFA Extension Schema</pdfaSchema:schema>
                    <!-- Schema namespace URI -->
                    <pdfaSchema:namespaceURI>urn:ferd:pdfa:invoice:rc#</pdfaSchema:namespaceURI>
                    <!-- Preferred schema namespace prefix -->
                    <pdfaSchema:prefix>zf</pdfaSchema:prefix>
                    <!-- Description of schema properties -->
                    <pdfaSchema:property>
                        <rdf:Seq>
                            <rdf:li rdf:parseType="Resource">
                                <!-- DocumentFileName: Name of the embedded file; must be equal with the value of the /F tag in the /EF structure -->
                                <pdfaProperty:name>DocumentFileName</pdfaProperty:name>
                                <pdfaProperty:valueType>Text</pdfaProperty:valueType>
                                <pdfaProperty:category>external</pdfaProperty:category>
                                <pdfaProperty:description>name of the embedded xml invoice file</pdfaProperty:description>
                            </rdf:li>
                            <rdf:li rdf:parseType="Resource">
                                <!-- DocumentType: INVOICE -->
                                <pdfaProperty:name>DocumentType</pdfaProperty:name>
                                <pdfaProperty:valueType>Text</pdfaProperty:valueType>
                                <pdfaProperty:category>external</pdfaProperty:category>
                                <pdfaProperty:description>INVOICE</pdfaProperty:description>
                            </rdf:li>
                            <rdf:li rdf:parseType="Resource">
                                <!-- Version: The actual version of the ZUGFeRD standard -->
                                <pdfaProperty:name>Version</pdfaProperty:name>
                                <pdfaProperty:valueType>Text</pdfaProperty:valueType>
                                <pdfaProperty:category>external</pdfaProperty:category>
                                <pdfaProperty:description>The actual version of the ZUGFeRD data</pdfaProperty:description>
                            </rdf:li>
                            <rdf:li rdf:parseType="Resource">
                                <!-- ConformanceLevel: The actual conformance level of the ZUGFeRD standard, e.g. BASIC, COMFORT, EXTENDED -->
                                <pdfaProperty:name>ConformanceLevel</pdfaProperty:name>
                                <pdfaProperty:valueType>Text</pdfaProperty:valueType>
                                <pdfaProperty:category>external</pdfaProperty:category>
                                <pdfaProperty:description>The conformance level of the ZUGFeRD data</pdfaProperty:description>
                            </rdf:li>
                        </rdf:Seq>
                    </pdfaSchema:property>
                </rdf:li>
            </rdf:Bag>
        </pdfaExtension:schemas>
    </rdf:Description>
    <rdf:Description rdf:about="" xmlns:zf="urn:ferd:pdfa:invoice:rc#">
        <zf:DocumentType>INVOICE</zf:DocumentType>
        <zf:DocumentFileName>ZUGFeRD-invoice.xml</zf:DocumentFileName>
        <zf:Version>RC</zf:Version>
        <zf:ConformanceLevel>BASIC</zf:ConformanceLevel>
    </rdf:Description>
)
/Ext_Metadata pdfmark

And that’s it. The result should be a conforming ZUGFeRD file.

Summary

It is now possible to create ZUGFeRD compliant invoices using Ghostscript provided you have a valid XML invoice, and a PDF representation of that invoice. But as anyone reading this entire document will immediately realise, the process is not entirely simple.

To make life easier for non-technical users I’ve created a PostScript program which does pretty much all the work involved, and requires only a few command line settings to be provided. I would recommend that the files required for the creation are all stored in a single directory because that way it’s easy to add permissions for Ghostscript to read the files, otherwise each file would need its own permission.

The file is called zugferd.ps and will be included in Ghostscript releases from 9.54.0 onwards, in the 'lib' directory. More information on zugferd.ps can be found in the Appendix below. Note: a new version of this program, capable of supporting more versions of the ZUGFeRD specification, was added in version 9.55.0.

The example command line below assumes you are using a Linux or MacOS version of Ghostscript where the Ghostscript binary is called ‘gs’, on Windows the Ghostscript executable is called one of gswin32, gswin32c, gswin64 or gswin64c depending on whether you have installed the 32-bit or 64-bit version and whether you want to use the command line or windowed executable.
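A script that has to run on more than one platform could look for whichever executable is installed. This is a sketch only: find_gs is a hypothetical helper, and it checks just the command line executables from the names listed above, since those are the ones a script would normally want.

```shell
# Print the name of the first Ghostscript executable found on PATH.
# The candidate names are taken from the paragraph above; find_gs
# is a hypothetical helper name.
find_gs() {
    for name in gs gswin64c gswin32c; do
        if command -v "$name" >/dev/null 2>&1; then
            echo "$name"
            return 0
        fi
    done
    echo "no Ghostscript executable found on PATH" >&2
    return 1
}
```

The result can be stored in a variable, for example GS=$(find_gs), and used in place of ‘gs’ in the commands shown in this document.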

Example Command Line

Note the ‘\’ indicates a line break purely to fit the command line on this page, you should not include these!

gs --permit-file-read=/usr/home/me/zugferd/ -sDEVICE=pdfwrite -dPDFA=3 \
-sColorConversionStrategy=RGB -sZUGFeRDXMLFile=/usr/home/me/zugferd/invoice.xml \
-sZUGFeRDProfile=/usr/home/me/zugferd/rgb.icc -sZUGFeRDVersion=2p1 -sZUGFeRDConformanceLevel=BASIC \
-o /usr/home/me/zugferd/zugferd.pdf \
/usr/home/me/zugferd/zugferd.ps /usr/home/me/zugferd/original.pdf

Clearly the paths here are for illustration only and need to be customised.
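Because a single --permit-file-read entry only covers one directory, it can be useful to confirm that everything is actually in that directory before running Ghostscript. The sketch below does nothing more than test that each file is readable; check_zugferd_dir is a hypothetical helper name.

```shell
# Verify that every file needed for the conversion is present in one
# directory, so that one --permit-file-read entry covers them all.
# check_zugferd_dir is a hypothetical helper name.
check_zugferd_dir() {
    dir=$1
    shift
    missing=0
    for f in "$@"; do
        if [ ! -r "$dir/$f" ]; then
            echo "missing or unreadable: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

It would be called with the directory followed by the file names, for example check_zugferd_dir /usr/home/me/zugferd zugferd.ps invoice.xml original.pdf, and the Ghostscript command run only if it succeeds.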

Possible Future Enhancements

The code currently produces a PDF file which is either ZUGFeRD 1.0 or ZUGFeRD 2.0 compliant, the differences between the two depend (I believe) on the content of the XML invoice file. The current standard is 2.1 which requires small changes to the boilerplate XML. It is trivial to make these changes to the PostScript program if required.

Ghostscript’s pdfwrite device does not currently support digital signing of PDF files, which would obviously improve the security of the document. We may add the ability to sign documents in the future.

References

The ZUGFeRD format (version 1.0)
https://konik.io/ZUGFeRD-Spezifikation/Das-ZUGFeRD-Format_1p0.pdf

As above in English
https://konik.io/ZUGFeRD-Spezifikation/ZUGFeRD-Format_1p0_Specification-english.pdf

The ZUGFeRD format (version 2.01)
https://www.ferd-net.de/en/downloads/publications/details/zugferd-211-english

The ZUGFeRD format (version 2.1.1)
https://www.inposia.com/en/download-new-zugferd-specification-2-1-1-for-free/

The ICC Characterisation Data Registry
http://www.color.org/registry2.xalter

Appendix

ZUGFeRD.PS Listing

The ZUGFeRD.PS program assists in the creation of a ZUGFeRD document.

The program requires two command line parameters; -sZUGFeRDProfile= which contains a fully qualified path to an appropriate (correct colour space) ICC profile, and -sZUGFeRDXMLFile= which contains a fully qualified path to the XML invoice file.

There are additionally two optional parameters; -sZUGFeRDVersion= allows selection of the version of the ZUGFeRD standard, possible values are rc, 1p0, 2p0, 2p1. The default is 2p1. -sZUGFeRDConformanceLevel= defines the level of conformance with the standard, possible values are MINIMUM, BASIC, BASIC WL, EN 16931, EXTENDED and XRECHNUNG, the default is BASIC

An example command line is given in the comments of the program, and a usage message is printed if the user fails to provide either of the required parameters. Obviously the user must also set -dPDFA=3 and -sColorConversionStrategy in order to create a valid PDF/A-3b file.
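The values permitted for the two optional parameters can be checked before Ghostscript is started. This sketch only encodes the lists and defaults given above and does not know which conformance levels are valid for which version of the standard; check_zugferd_options is a hypothetical helper name.

```shell
# Check -sZUGFeRDVersion and -sZUGFeRDConformanceLevel values against
# the lists given above before invoking Ghostscript.
# check_zugferd_options is a hypothetical helper name.
check_zugferd_options() {
    version=${1:-2p1}   # default taken from the text above
    level=${2:-BASIC}   # default taken from the text above
    case $version in
        rc|1p0|2p0|2p1) ;;
        *) echo "unsupported ZUGFeRD version: $version" >&2; return 1 ;;
    esac
    case $level in
        MINIMUM|BASIC|"BASIC WL"|"EN 16931"|EXTENDED|XRECHNUNG) ;;
        *) echo "unsupported conformance level: $level" >&2; return 1 ;;
    esac
    echo "version=$version level=$level"
}

check_zugferd_options 2p1 "EN 16931"
```

Note that levels containing a space, such as BASIC WL and EN 16931, need to be quoted on the command line.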

For the full source code, please see our Git repository.
