[gs-bugs] [Bug 690661] CID-Cmap font characters

bugs.ghostscript.com-bugzilla-daemon at ghostscript.com
Tue Jul 28 06:40:20 PDT 2009


http://bugs.ghostscript.com/show_bug.cgi?id=690661

ken.sharp at artifex.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P1                          |P4



------- Additional Comments From ken.sharp at artifex.com  2009-07-28 06:40 -------
This is quite complicated. The font is a CIDFont with TrueType outlines, though
you don't have to worry about the outline type, just the fact that it's a CIDFont.

So firstly you need to use a different font method, returning glyphs instead of
character codes:

	    code = font->procs.next_char_glyph(&scan, &chr, &glyph);

&scan is a pointer to the text enumerator, chr is a gs_char and glyph is a gs_glyph.

CIDFonts use a different kind of encoding to regular type 1 fonts, and as you've
realised this may mean using more than a single byte for the CID. In your case
some of the glyphs have 2-byte CIDs.

Now the 'real' character code really is the 2-byte number: 0x110, 0x102 and
0x19f for the CIDs in your case. As you'll immediately realise, these cannot
correspond to ASCII character values.
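
To make the 2-byte point concrete, here is a minimal sketch of splitting a
string into character codes. It assumes every code is exactly two bytes,
big-endian, which is consistent with the <0000> <FFFF> codespace range in this
font's ToUnicode CMap but is not true of every CMap. The function name
split_two_byte_codes is made up for illustration and is not part of
Ghostscript; inside Ghostscript you would let the font's next_char_glyph
method do this work rather than rolling your own.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a Ghostscript API.
 * Splits a byte string into 2-byte big-endian character codes.
 * Writes at most max_out codes to out and returns how many were written.
 * A trailing odd byte, if any, is ignored. */
size_t split_two_byte_codes(const unsigned char *str, size_t len,
                            uint16_t *out, size_t max_out)
{
    size_t n = 0;
    size_t i;

    for (i = 0; i + 1 < len && n < max_out; i += 2) {
        /* High byte first, then low byte. */
        out[n++] = (uint16_t)((str[i] << 8) | str[i + 1]);
    }
    return n;
}
```

Fed the six bytes 01 10 01 02 01 9F, this yields the three codes 0x110, 0x102
and 0x19f mentioned above.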

In fact even type 1 or type 3 fonts need not have an Encoding which matches
ASCII, so simply retrieving the character code is not really sufficient. This is
one of the reasons why editing PDF files is erratic at best.

In your case, the font does include Unicode information, in the form of a
ToUnicode CMap. You can use this to return the Unicode code points which each
CID refers to. Here is the decoded CMap:

12 0 obj 
<<
/Length 565
>>
stream
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
16 beginbfchar
<0003> <0020>
<0102> <0061>
<002F> <0049>
<0110> <0063>
<011E> <0065>
<012E> <FB01>
<005A> <0052>
<0147> <FB02>
<015D> <0069>
<016F> <006C>
<0176> <006E>
<0190> <0073>
<0355> <002C>
<019F> <00740069>
<01A9> <00740074>
<01AB> <007400740069>
endbfchar
endcmap CMapName currentdict /CMap defineresource pop end end

endstream 
endobj 

You can see that 0x110 equates to Unicode code point 0x63, 0x102 to 0x61 and
0x19f to 0x00740069. If you check these code points, e.g.:

http://www.fileformat.info/info/unicode/char/0063/index.htm
http://www.fileformat.info/info/unicode/char/0061/index.htm

You will see that these correspond to lower case 'c' and lower case 'a'.

It looks like the final glyph is a 'ti' ligature, and therefore has two Unicode
code points, 0074 and 0069, which map to 't' and 'i' respectively. (Note that
there also appear to be 'tt' and 'tti' ligatures defined in this ToUnicode CMap.)
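
The mapping can be sketched as a plain lookup table. The following is only an
illustration, not how Ghostscript stores the CMap: the 16 entries are copied
from the beginbfchar section above, while the type bfchar_entry and the
functions lookup_cid and cids_to_utf16 are names invented for this example.
Unmapped CIDs are replaced with U+FFFD, which is my own choice here.

```c
#include <stddef.h>
#include <stdint.h>

/* One bfchar entry: a CID and the UTF-16 code units it maps to.
 * Hypothetical type, for illustration only. */
typedef struct {
    uint16_t cid;
    uint8_t  count;     /* number of valid entries in units[] */
    uint16_t units[3];  /* the longest entry in this CMap has three units */
} bfchar_entry;

/* The 16 beginbfchar entries from the decoded CMap. */
static const bfchar_entry to_unicode[] = {
    { 0x0003, 1, { 0x0020 } },                  /* space */
    { 0x0102, 1, { 0x0061 } },                  /* a */
    { 0x002F, 1, { 0x0049 } },                  /* I */
    { 0x0110, 1, { 0x0063 } },                  /* c */
    { 0x011E, 1, { 0x0065 } },                  /* e */
    { 0x012E, 1, { 0xFB01 } },                  /* fi ligature */
    { 0x005A, 1, { 0x0052 } },                  /* R */
    { 0x0147, 1, { 0xFB02 } },                  /* fl ligature */
    { 0x015D, 1, { 0x0069 } },                  /* i */
    { 0x016F, 1, { 0x006C } },                  /* l */
    { 0x0176, 1, { 0x006E } },                  /* n */
    { 0x0190, 1, { 0x0073 } },                  /* s */
    { 0x0355, 1, { 0x002C } },                  /* comma */
    { 0x019F, 2, { 0x0074, 0x0069 } },          /* t i */
    { 0x01A9, 2, { 0x0074, 0x0074 } },          /* t t */
    { 0x01AB, 3, { 0x0074, 0x0074, 0x0069 } }   /* t t i */
};

/* Linear search; returns NULL when the CID has no ToUnicode entry. */
const bfchar_entry *lookup_cid(uint16_t cid)
{
    size_t n = sizeof to_unicode / sizeof to_unicode[0];
    size_t i;

    for (i = 0; i < n; i++) {
        if (to_unicode[i].cid == cid)
            return &to_unicode[i];
    }
    return NULL;
}

/* Convert a run of CIDs to UTF-16 code units.
 * Writes at most max_out units and returns how many were written.
 * Unmapped CIDs become U+FFFD (an assumption of this sketch). */
size_t cids_to_utf16(const uint16_t *cids, size_t ncids,
                     uint16_t *out, size_t max_out)
{
    size_t n = 0;
    size_t i;
    uint8_t k;

    for (i = 0; i < ncids; i++) {
        const bfchar_entry *e = lookup_cid(cids[i]);

        if (e == NULL) {
            if (n < max_out)
                out[n++] = 0xFFFD;
            continue;
        }
        for (k = 0; k < e->count && n < max_out; k++)
            out[n++] = e->units[k];
    }
    return n;
}
```

Passing the three CIDs 0x110, 0x102 and 0x19f gives four code units, 0x63 0x61
0x74 0x69: one glyph can expand to more than one character, so the output
length need not match the number of glyphs.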

If at all possible you should use the ToUnicode CMap to give you the text
definitions. Encodings are not always reliably ASCII encoded, even for Latin
text. I suspect you have simply been lucky not to encounter this so far. If you
don't have ToUnicode information you should check the glyph names to see if they
match an ASCII encoding.
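
As a rough idea of what the glyph-name fallback could look like, here is a
small sketch. It covers only single-letter names and a handful of standard
Adobe glyph names (space, comma, period, hyphen and the digits); a real
implementation would need a far larger table. The function
glyph_name_to_ascii is a made-up name for this example, and returning -1 for
anything unrecognised is my own convention.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a Ghostscript API.
 * Returns the ASCII character a glyph name stands for, or -1 if the
 * name is not in this (deliberately tiny) table. */
int glyph_name_to_ascii(const char *name)
{
    static const struct { const char *name; char ch; } named[] = {
        { "space", ' ' }, { "comma", ',' }, { "period", '.' },
        { "hyphen", '-' },
        { "zero", '0' }, { "one", '1' }, { "two", '2' }, { "three", '3' },
        { "four", '4' }, { "five", '5' }, { "six", '6' }, { "seven", '7' },
        { "eight", '8' }, { "nine", '9' }
    };
    size_t i;

    if (name == NULL || name[0] == '\0')
        return -1;

    /* Single-letter names such as "a" or "R" name themselves. */
    if (name[1] == '\0' &&
        ((name[0] >= 'A' && name[0] <= 'Z') ||
         (name[0] >= 'a' && name[0] <= 'z')))
        return (unsigned char)name[0];

    for (i = 0; i < sizeof named / sizeof named[0]; i++) {
        if (strcmp(name, named[i].name) == 0)
            return (unsigned char)named[i].ch;
    }
    return -1;
}
```

A name like "fi" falls through and returns -1, which is the signal that the
glyph cannot be treated as a single ASCII character.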

Note that you will not get ToUnicode CMaps for PostScript, only PDF files. There
is a 'similar', undocumented, table which the Adobe PostScript driver (only!)
produces called GlyphNames2Unicode.

Now, your next problem is that Ghostscript doesn't care about ToUnicode CMaps in
general, and so does not process them. In gs/Resource/Init in pdf_font.ps:

/.processToUnicode   % <font-resource> <font-dict> <encoding|null>
                     %   .processToUnicode -
{
  % Currently pdfwrite is only device which can handle GlyphNames2Unicoide to 
  % generate a ToUnicode CMaps. So don't bother with other devices.
  currentdevice .devicename /pdfwrite eq {

You will have to change the last line to something like :

  true {

or add the name of your device so that GS will process the ToUnicode CMap for you.

You can retrieve the Unicode code point via :

unicode = font->procs.decode_glyph((gs_font *)font, glyph);

As usual I recommend you look at pdfwrite, which is currently the only device
which does anything like what you want. In particular I suggest the routines
pdf_text_process, which (for CIDFonts) calls process_cmap_text, which calls
pdf_add_ToUnicode.

If all this seems unreasonably complicated you can simply decide not to handle
this type of font for the present, of course.




------- You are receiving this mail because: -------
You are the QA contact for the bug, or are watching the QA contact.


