[gs-bugs] [Bug 691506] converting pdf with accented characters to text

bugzilla-daemon at ghostscript.com bugzilla-daemon at ghostscript.com
Wed Jul 28 13:52:31 UTC 2010


Ken Sharp <ken.sharp at artifex.com> changed:

           What    |Removed                     |Added
             Status|NEW                         |RESOLVED
         Resolution|                            |WORKSFORME

--- Comment #4 from Ken Sharp <ken.sharp at artifex.com> 2010-07-28 13:52:30 UTC ---
(In reply to comment #2)
> Hi Ken,
> Thanks for the quick answer.
> I'm uploading one particular pdf, although any french pdf will do.
> The text on the first page:
> Jacques Chirac présidera une journée
> is extracted as:
> Jacques Chirac pre'sidera une journe'

The ps2ascii.ps script deliberately outputs each accented glyph as the regular
glyph plus an 'accent'; the accent characters are defined in the file in a
particular way. So eacute is e', ecaron would be e^, adieresis would be a", and
so on. This appears to be done in order to use plain old ASCII (i.e. 7-bit,
*not* the extended ASCII range); all the accented characters are in the
extended ASCII range, > 127.
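
As a rough illustration of that decomposition, here is a small Python sketch.
It is not the PostScript in ps2ascii.ps; the accent table below is an
assumption chosen to match the examples in this message, and the function name
is made up for the example.

```python
import unicodedata

# Illustrative accent table (an assumption, not copied from ps2ascii.ps):
# Unicode combining mark -> 7-bit ASCII stand-in.
ASCII_ACCENTS = {
    "\u0301": "'",   # acute
    "\u0300": "`",   # grave
    "\u0302": "^",   # circumflex
    "\u0308": '"',   # dieresis
}

def to_ascii_accents(text):
    """Replace each accented letter with its base letter plus an ASCII accent."""
    out = []
    # NFD splits a precomposed letter such as e-acute into "e" + combining acute.
    for ch in unicodedata.normalize("NFD", text):
        out.append(ASCII_ACCENTS.get(ch, ch))
    return "".join(out)

print(to_ascii_accents("Jacques Chirac présidera une journée"))
# prints: Jacques Chirac pre'sidera une journe'e
```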

So this is working as designed. 

You can (of course!) change the way it works, but you'll need to edit
ps2ascii.ps. Look for the section commented '% Encode the ISO accented
characters.'. That currently breaks the ISO Latin named accented glyphs into
their two components, then glues the resulting characters back together to
make a 2-character string. You'll need to change all of that.

A quick and dirty solution would be to add something like:

/eacute <XX> 
/egrave <XX>

and so on, where XX is the hexadecimal value of the character you want to use.
Define these *after* the loop defining the ISO Latin characters.
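
If it helps with filling in the XX values, the following Python sketch prints
candidate lines. It assumes you want ISO 8859-1 (Latin-1) byte values in the
output; that encoding choice and the helper name override_line are assumptions
for the example, so substitute whatever encoding your text consumer expects.

```python
# Glyph names from this message, paired with the characters they name.
GLYPHS = {"eacute": "é", "egrave": "è"}

def override_line(name, char, encoding="latin-1"):
    """Format one '/name <XX>' line, with XX the character's byte value in hex."""
    code = char.encode(encoding)
    return "/%s <%s>" % (name, code.hex().upper())

for name, char in GLYPHS.items():
    print(override_line(name, char))
# prints:
# /eacute <E9>
# /egrave <E8>
```

The printed lines still have to be placed by hand in ps2ascii.ps, after the
loop mentioned above.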

Closing as 'worksforme'.
