IRC Logs

Log of #ghostscript at irc.freenode.net.

2012/12/29
JakeSays i need to manually construct a pdf that will display every glyph in a font. any ideas on how i'd go about that?00:04.50 
jroes hey guys, trying to install ghostscript via homebrew on a mac, and I'm getting a 404 for this URL: http://ghostscript.com/~giles/jbig2/jbig2dec/jbig2dec-0.11.tar.gz01:35.15 
  anyone know what might have happened to that file?01:35.30 
ray_work jroes: (for the logs). You don't need (or want) a separate jbig2dec set of sources. What you need is already a subdirectory in the gs directory01:59.52 
  JakeSays: Since PDF isn't a programming language, you have to do it by hand (tedious maximus), but using PostScript a simple program can display all glyphs. See "prfont.ps" which can do a "catalog" of all fonts in the FontDirectory. If you use this with the Ghostscript pdfwrite device, you will get a PDF02:26.38 
JakeSays ray_work: that should work, although a png would work even better. i have fonts that i extracted from a pdf. can i use those with prfont.ps?02:28.08 
ray_work JakeSays: Note that if you use -sFONTPATH=... and ask for a nonexistent font, ALL of the fonts on the FONTPATH will be identified, so you can develop a massive catalog of glyphs02:28.18 
JakeSays ah cool02:28.29 
  so i can set fontpath to the dir with my extracted fonts02:28.38 
ray_work JakeSays: -sFONTPATH=___ looks for all Type 1 and TrueType (and I think OpenType) fonts to add them02:29.19 
JakeSays ray_work: excellent. i'll give it a try. thanks!02:29.41 
ray_work JakeSays: I am not sure if it works with Type3 (bitmap) fonts, but it's worth a try02:29.52 
JakeSays the fonts i have are ttf02:30.05 
ray_work JakeSays: OK. GS should find them02:30.21 
  JakeSays: I recommend rendering a font to the 'display' (or x11) device first to see if the format suits you.02:31.19 
  JakeSays: Note that TTF fonts are loaded by the font name in the TTF file, NOT the name of the TTF file02:31.59 
JakeSays ray_work: thats fine. i'm trying to map the glyphs to characters. i've managed to do it manually for one pdf, but every embedded font is different, so i need to automate. my thinking is to extract the fonts from the pdf, render/ocr them to get the chars02:33.57 
  its a hack upon hack but..02:34.05 
ray_work JakeSays: thus if you have a font file "arial.ttf" that has "ArialMT" in it, it will be found if you try to load "ArialMT", not "Arial" (unless the Fontmap has an alias, as Ghostscript02:34.28 
  has)02:34.32 
  JakeSays: yes, the problem with fonts is that the mapping of glyphs in an embedded font is often arbitrary, although sometimes the "Encoding" or the "ToUnicode" in an embedded font helps (that's what txtwrite uses)02:36.38 
JakeSays ray_work: yeah, txtwrite maps everything to xFFFF02:37.16 
ray_work JakeSays: then there probably is not a ToUnicode directory02:37.40 
JakeSays directory?02:38.01 
  oh you mean in the font02:38.15 
ray_work JakeSays: sorry, I'm not on my main system, but IIRC it is a directory in the FontDirectory PDF object02:38.55 
JakeSays ray_work: no prob. so does my reasoning make sense, or is there another way to map w/o support from the pdf/font itself?02:40.36 
ray_work JakeSays: from the PDF spec "this information can be deduced from the encoding used to represent the text in the PDF file. Otherwise, the PDF producer application should specify the mapping explicitly by including a special object, the ToUnicode CMap."02:41.11 
JakeSays so if txtwrite can't resolve it, then there most likely isn't a ToUnicode directory02:42.31 
ray_work JakeSays: OCR (or at least outline shape at a specific size to a hash lookup) is all that we have been able to arrive at02:42.46 
  JakeSays: BTW, the ToUnicode CMap stream is in the Font (not the FontDirectory). sorry.02:43.43 
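The "outline shape to a hash lookup" idea mentioned above could be sketched roughly as follows. This is only an illustration, not how Ghostscript or any OCR tool does it: the bitmap format (a list of '0'/'1' strings, one per pixel row) and the names `glyph_key` and `known` are hypothetical, and exact-match hashing only works when every glyph is rendered deterministically at the same size.

```python
import hashlib


def glyph_key(bitmap):
    """Hash a glyph bitmap after cropping away blank margins.

    bitmap is a list of equal-length strings of '0'/'1', one per pixel row
    (a hypothetical format chosen for this sketch).
    """
    rows = list(bitmap)
    # Drop blank rows at the top and bottom so vertical position is ignored.
    while rows and "1" not in rows[0]:
        rows.pop(0)
    while rows and "1" not in rows[-1]:
        rows.pop()
    if not rows:
        return hashlib.sha1(b"").hexdigest()
    # Crop blank columns on the left and right so horizontal position is ignored.
    inked = [r for r in rows if "1" in r]
    left = min(r.index("1") for r in inked)
    right = max(r.rindex("1") for r in inked)
    cropped = [r[left:right + 1] for r in rows]
    return hashlib.sha1("\n".join(cropped).encode("ascii")).hexdigest()


# Lookup table built once from glyphs whose characters are already known.
known = {glyph_key(["110", "100"]): "r"}

# The same shape with extra padding resolves to the same character.
guess = known.get(glyph_key(["00000", "01100", "01000", "00000"]))
```

A table like `known` would have to be built from reference fonts rendered at the same size; anything that changes even one pixel defeats the exact match, which is where real OCR becomes necessary.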
JakeSays so this is either a busted pdf print driver, or a clever attempt at obfuscation on the part of the application generating the output02:44.12 
ray_work JakeSays: I recommend careful reading (and re-reading) of section "5.9 Extraction of Text Content"02:44.51 
JakeSays pulls up spec02:45.14 
ray_work JakeSays: we see PDFs like that all the time that just don't bother to make their text mappable / searchable. Also, creating a PDF from a PCL file can lose all encoding info02:46.06 
JakeSays ray_work: all of the fonts have names like LGAZPG+TTEDt0002:48.41 
ray_work JakeSays: off to dinner. I'll check back later (probably as ray_laptop), but I'll review the logs to see if you posted anything02:48.47 
JakeSays ray_work: ok. hey thanks for your help!02:48.59 
ray_work JakeSays: that's OK since the LGAZPG+ is just a unique id within a file for an embedded subset font02:49.58 
  the TTEDt00 is usually a dummy name created by the PostScript creator (on Windows) whose output is then converted to PDF02:50.46 
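For anyone scripting this, the subset prefix described above (six uppercase letters and a plus sign, per the PDF spec's convention for subset fonts) can be split off with a small helper. The function name `split_subset_tag` is made up for this sketch.

```python
import re

# Subset tag: exactly six uppercase letters followed by '+'.
_SUBSET_RE = re.compile(r"^([A-Z]{6})\+(.+)$")


def split_subset_tag(font_name):
    """Return (tag, base_name); tag is None when there is no subset prefix."""
    match = _SUBSET_RE.match(font_name)
    if match:
        return match.group(1), match.group(2)
    return None, font_name


tag, base = split_subset_tag("LGAZPG+TTEDt00")
```

The tag only distinguishes subsets within one file, so the base name is the part worth grouping on, and even that may be a dummy as in the name above.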
JakeSays ah ok02:50.57 
  a quick search of a pdf finds no ToUnicode02:51.21 
ray_work JakeSays: I assume you are searching a 'decoded' PDF (as from gs's pdfinflt.ps or mupdf -d)02:52.24 
JakeSays ray_work: yeah - using mutool actually02:52.44 
  mutool clean02:52.50 
ray_work JakeSays: you also want to look for Encoding and Differences02:53.08 
  I usually use the old name, pdfclean -d, instead of mutool clean -d02:53.35 
JakeSays ray_work: no luck on either of those02:53.35 
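The search described in this exchange can be automated with a naive byte scan. This is a rough sketch that assumes the PDF has already been decompressed (for example with mutool clean -d, as above); it does not parse PDF syntax, and the names `KEYS` and `count_font_keys` are invented for illustration.

```python
import re

# Name keys that hint at how character codes map to text.
KEYS = (b"ToUnicode", b"Encoding", b"Differences")


def count_font_keys(pdf_bytes):
    """Count occurrences of /ToUnicode, /Encoding and /Differences.

    Only meaningful on a decompressed file; keys hidden inside compressed
    streams are not seen by this scan.
    """
    counts = {}
    for key in KEYS:
        # Match the exact name, not a longer name that merely starts with it.
        pattern = re.compile(rb"/" + key + rb"(?![A-Za-z0-9])")
        counts[key.decode("ascii")] = len(pattern.findall(pdf_bytes))
    return counts


sample = b"<< /Type /Font /Encoding /WinAnsiEncoding /ToUnicode 12 0 R >>"
counts = count_font_keys(sample)
```

On a real file, read it in binary mode (`open(path, "rb").read()`) and pass the bytes in; all-zero counts match the "no luck" result reported here.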
ray_work JakeSays: and I assume that Adobe Acrobat can extract meaningful text from these files?02:55.10 
JakeSays ray_work: good question. i wont have access to acrobat until next week02:56.07 
ray_work if so, then either Adobe does OCR (not likely) or there is some magic (undocumented?) algorithm for mapping to useful codes.02:56.08 
  JakeSays: have you posted a sample file anywhere ?02:56.23 
JakeSays foxit's editor cant map them though02:56.28 
ray_work JakeSays: I have full Acrobat 902:56.48 
JakeSays ray_work: i havent. unfortunately the pdfs are medical statements.02:57.05 
ray_work and there are files that we can't figure out HOW Adobe gets text codes from them02:57.20 
  JakeSays: well, usually people provide a "dummy" patient file (Name: John Doe, Diagnosis: dead as a doornail)02:58.39 
JakeSays ray_work: i wish. we're a medical claims clearing house. basically people ship unimaginable crap to us02:59.22 
ray_work sort of a Lorem Ipsum patient record02:59.27 
JakeSays and we're supposed to make sense of it02:59.31 
ray_work JakeSays: well, if you find anything useful as far as mapping, please let us know.03:00.42 
JakeSays ray_work: will do. i'm going to try my ocr thing and see how it goes. 03:01.07 
ray_work JakeSays: but in your case, resorting to OCR might be needed.03:01.12 
henrys marcosw you around today?17:53.00 
  hi robin_watts_mac are you glowing yet?17:56.47 
robin_watts_mac henrys: Just got back.17:57.59 
  Fab trip. Got far closer than I would have thought possible.17:58.19 
henrys able to see the building?17:59.00 
robin_watts_mac Within about 400 yards :)17:59.11 
henrys wow17:59.22 
robin_watts_mac yeah.17:59.33 
  Will try to upload photos later.17:59.42 
henrys love to see them18:00.16 
JakeSays so do you guys use postscript as a general purpose language much?19:35.00 