| 2012/12/29 |
JakeSays | i need to manually construct a pdf that will display every glyph in a font. any ideas on how i'd go about that? | 00:04.50 |
jroes | hey guys, trying to install ghostscript via homebrew on a mac, and I'm getting a 404 for this URL: http://ghostscript.com/~giles/jbig2/jbig2dec/jbig2dec-0.11.tar.gz | 01:35.15 |
| anyone know what might have happened to that file? | 01:35.30 |
ray_work | jroes: (for the logs). You don't need (or want) a separate jbig2dec set of sources. What you need is already a subdirectory in the gs directory | 01:59.52 |
| JakeSays: Since PDF isn't a programming language, you have to do it by hand (tedious maximus), but using PostScript a simple program can display all glyphs. See "prfont.ps" which can do a "catalog" of all fonts in the FontDirectory. If you use this with the Ghostscript pdfwrite device, you will get a PDF | 02:26.38 |
JakeSays | ray_work: that should work, although a png would work even better. i have fonts that i extracted from a pdf. can i use those with prfont.ps? | 02:28.08 |
ray_work | JakeSays: Note that if you use -sFONTPATH=... and ask for a non-existent font, ALL of the fonts on the FONTPATH will be identified, so you can develop a massive catalog of glyphs | 02:28.18 |
JakeSays | ah cool | 02:28.29 |
| so i can set fontpath to the dir with my extracted fonts | 02:28.38 |
ray_work | JakeSays: -sFONTPATH=___ looks for all Type 1 and TrueType (and I think OpenType) fonts to add them | 02:29.19 |
JakeSays | ray_work: excellent. i'll give it a try. thanks! | 02:29.41 |
ray_work | JakeSays: I am not sure if it works with Type3 (bitmap) fonts, but it's worth a try | 02:29.52 |
JakeSays | the fonts i have are ttf | 02:30.05 |
ray_work | JakeSays: OK. GS should find them | 02:30.21 |
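A rough sketch of the command line being described here, built as an argument list in Python. The directory, font name, and output file are placeholders. It assumes `prfont.ps` is on Ghostscript's search path and defines a `DoFont` procedure, as the copies shipped in Ghostscript's lib directory appear to, so check your own copy before relying on it.

```python
def catalog_command(font_dir, font_name, out_pdf):
    """Build a gs argument list that renders a glyph chart for one font.

    Assumption: prfont.ps provides a DoFont procedure that takes a font name.
    """
    return [
        "gs",
        "-sFONTPATH=" + font_dir,   # directory scanned for Type 1 / TrueType fonts
        "-sDEVICE=pdfwrite",        # write the chart out as a PDF
        "-o", out_pdf,              # output file (also implies batch, no pause)
        "prfont.ps",
        "-c", "/" + font_name + " DoFont",
    ]


if __name__ == "__main__":
    # Placeholder values; substitute the directory holding the extracted fonts.
    print(" ".join(catalog_command("./extracted", "ArialMT", "catalog.pdf")))
```

The list can be handed to `subprocess.run` as-is. Swapping `pdfwrite` for a raster device would give image output instead of a PDF.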
| JakeSays: I recommend trying a font on the 'display' (or x11) device to see if the format suits you. | 02:31.19 |
| JakeSays: Note that TTF fonts are loaded by the font name in the TTF file, NOT the name of the TTF file | 02:31.59 |
JakeSays | ray_work: thats fine. i'm trying to map the glyphs to characters. i've managed to do it manually for one pdf, but every embedded font is different, so i need to automate. my thinking is to extract the fonts from the pdf, render/ocr them to get the chars | 02:33.57 |
| its a hack upon hack but.. | 02:34.05 |
ray_work | JakeSays: thus if you have a font file "arial.ttf" that has "ArialMT" in it, it will be found if you try and load "ArialMT", not under "Arial" (unless the Fontmap has an alias, as Ghostscript | 02:34.28 |
| has) | 02:34.32 |
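To see which name a TTF file carries internally, the `name` table can be read directly. This is a minimal sketch, assuming the name Ghostscript registers corresponds to the PostScript name (name ID 6); that mapping is my assumption, not something stated in this log. The function name is made up for illustration, and TrueType collections (.ttc) are not handled.

```python
import struct


def postscript_name(data):
    """Return the PostScript name (name ID 6) from raw TTF bytes, or None."""
    # sfnt header: version (4 bytes), numTables (2), then 6 bytes we skip.
    num_tables = struct.unpack_from(">H", data, 4)[0]
    for i in range(num_tables):
        tag, _checksum, offset, _length = struct.unpack_from(
            ">4sIII", data, 12 + 16 * i)
        if tag == b"name":
            break
    else:
        return None  # no 'name' table in this file

    # name table header: format, record count, offset to string storage.
    _fmt, count, string_offset = struct.unpack_from(">HHH", data, offset)
    for i in range(count):
        platform, _enc, _lang, name_id, length, str_off = struct.unpack_from(
            ">6H", data, offset + 6 + 12 * i)
        if name_id != 6:
            continue
        start = offset + string_offset + str_off
        raw = data[start:start + length]
        # Unicode and Windows platforms store UTF-16BE; treat the rest as 8-bit.
        return raw.decode("utf-16-be" if platform in (0, 3) else "latin-1")
    return None
```

Calling it on the bytes of each extracted font, for example `postscript_name(open(path, "rb").read())`, gives the names to try loading.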
| JakeSays: yes, the problem with fonts is that the mapping of glyphs in an embedded font is often arbitrary, although sometimes the "Encoding" or the "ToUnicode" in an embedded font helps (that's what txtwrite uses) | 02:36.38 |
JakeSays | ray_work: yeah, txtwrite maps everything to xFFFF | 02:37.16 |
ray_work | JakeSays: then there probably is not a ToUnicode directory | 02:37.40 |
JakeSays | directory? | 02:38.01 |
| oh you mean in the font | 02:38.15 |
ray_work | JakeSays: sorry, I'm not on my main system, but IIRC it is a directory in the FontDirectory PDF object | 02:38.55 |
JakeSays | ray_work: no prob. so does my reasoning make sense, or is there another way to map w/o support from the pdf/font itself? | 02:40.36 |
ray_work | JakeSays: from the PDF spec "this information can be deduced from the encoding used to represent the text in the PDF file. Otherwise, the PDF producer application should specify the mapping explicitly by including a special object, the ToUnicode CMap." | 02:41.11 |
JakeSays | so if txtwrite can't resolve it, then there most likely isn't a ToUnicode directory | 02:42.31 |
ray_work | JakeSays: OCR (or at least outline shape at a specific size to a hash lookup) is all that we have been able to arrive at | 02:42.46 |
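The "outline shape at a specific size to a hash lookup" idea could look roughly like this. It is only a sketch: it assumes each glyph has already been rendered to a bitmap at one fixed size, represented here as rows of '0'/'1' characters, and every name in it is hypothetical. An exact hash only matches pixel-identical renderings, so a different rasterizer or size needs its own table.

```python
import hashlib

# Maps bitmap hash -> character, filled from glyphs that were identified once.
KNOWN_GLYPHS = {}


def glyph_key(bitmap):
    """Hash a glyph bitmap given as a list of equal-length '0'/'1' strings."""
    h = hashlib.sha256()
    for row in bitmap:
        h.update(row.encode("ascii") + b"\n")
    return h.hexdigest()


def learn(bitmap, char):
    """Record that this exact bitmap is the given character."""
    KNOWN_GLYPHS[glyph_key(bitmap)] = char


def lookup(bitmap):
    """Return the character for a previously learned bitmap, else None."""
    return KNOWN_GLYPHS.get(glyph_key(bitmap))
```

Since embedded subsets of the same typeface tend to share outlines, a table built from one file may carry over to others rendered the same way.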
| JakeSays: BTW, the ToUnicode CMap stream is in the Font (not the FontDirectory). sorry. | 02:43.43 |
JakeSays | so this is either a busted pdf print driver, or a clever attempt at obfuscation on the part of the application generating the output | 02:44.12 |
ray_work | JakeSays: I recommend careful reading (and re-reading) of section "5.9 Extraction of Text Content" | 02:44.51 |
JakeSays | pulls up spec | 02:45.14 |
ray_work | JakeSays: we see PDFs like that all the time that just don't bother to make their text mappable / searchable. Also creating a PDF from a PCL file can lose all encoding info | 02:46.06 |
JakeSays | ray_work: all of the fonts have names like LGAZPG+TTEDt00 | 02:48.41 |
ray_work | JakeSays: off to dinner. I'll check back later (probably as ray_laptop), but I'll review the logs to see if you posted anything | 02:48.47 |
JakeSays | ray_work: ok. hey thanks for your help! | 02:48.59 |
ray_work | JakeSays: that's OK since the LGAZPG+ is just a unique id within a file for an embedded subset font | 02:49.58 |
| the TTEDt00 is usually a dummy name created by the PostScript creator (Windows) whose output is then converted to a PDF | 02:50.46 |
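For automating over many files, the subset tag can be split off mechanically. A small sketch, relying on the PDF convention that a subset tag is exactly six uppercase letters followed by a plus sign; the function name is made up for illustration.

```python
import re

# Subset tag: six uppercase letters, then '+', then the base font name.
_SUBSET_RE = re.compile(r"^([A-Z]{6})\+(.+)$")


def split_subset_name(name):
    """Return (tag, base_name); tag is None when the name has no subset prefix."""
    m = _SUBSET_RE.match(name)
    if m:
        return m.group(1), m.group(2)
    return None, name


if __name__ == "__main__":
    print(split_subset_name("LGAZPG+TTEDt00"))
```

The tag only distinguishes subsets inside one file, so the base name is the part worth comparing across fonts, and with dummy names like these even that says little about the real typeface.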
JakeSays | ah ok | 02:50.57 |
| a quick search of a pdf finds no ToUnicode | 02:51.21 |
ray_work | JakeSays: I assume you are searching a 'decoded' PDF (as from gs's pdfinflt.ps or mupdf -d) | 02:52.24 |
JakeSays | ray_work: yeah - using mutool actually | 02:52.44 |
| mutool clean | 02:52.50 |
ray_work | JakeSays: you also want to look for Encoding and Differences | 02:53.08 |
| I usually use the old names pdfclean -d instead of mutool clean -d | 02:53.35 |
JakeSays | ray_work: no luck on either of those | 02:53.35 |
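The same check can be scripted across a batch of files. This sketch just counts occurrences of the three key names in the raw bytes, so it only means something on a PDF that has already been decompressed (as with the `mutool clean -d` step mentioned above); the function name and default key list are my own.

```python
def count_font_keys(pdf_bytes,
                    keys=(b"/ToUnicode", b"/Encoding", b"/Differences")):
    """Count raw occurrences of each key name in decompressed PDF bytes."""
    return {key.decode("ascii"): pdf_bytes.count(key) for key in keys}


if __name__ == "__main__":
    # Tiny hand-written fragment, not a real PDF.
    sample = b"<< /Type /Font /Subtype /TrueType /BaseFont /LGAZPG+TTEDt00 >>"
    print(count_font_keys(sample))
```

All three counts being zero matches the "no luck" result here. A nonzero count still needs a look at the object itself, since a plain byte search cannot tell which font a key belongs to.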
ray_work | JakeSays: and I assume that Adobe Acrobat can extract meaningful text from these files? | 02:55.10 |
JakeSays | ray_work: good question. i wont have access to acrobat until next week | 02:56.07 |
ray_work | if so, then either Adobe does OCR (not likely) or there is some magic (undocumented?) algorithm for mapping to useful codes. | 02:56.08 |
| JakeSays: have you posted a sample file anywhere ? | 02:56.23 |
JakeSays | foxit's editor cant map them though | 02:56.28 |
ray_work | JakeSays: I have full Acrobat 9 | 02:56.48 |
JakeSays | ray_work: i havent. unfortunately the pdfs are medical statements. | 02:57.05 |
ray_work | and there are files that we can't figure out HOW Adobe gets text codes from them | 02:57.20 |
| JakeSays: well, usually people provide a "dummy" patient file (Name: John Doe, Diagnosis: dead as a doornail) | 02:58.39 |
JakeSays | ray_work: i wish. we're a medical claims clearing house. basically people ship unimaginable crap to us | 02:59.22 |
ray_work | sort of a Lorem Ipsum patient record | 02:59.27 |
JakeSays | and we're supposed to make sense of it | 02:59.31 |
ray_work | JakeSays: well, if you find anything useful as far as mapping, please let us know. | 03:00.42 |
JakeSays | ray_work: will do. i'm going to try my ocr thing and see how it goes. | 03:01.07 |
ray_work | JakeSays: but in your case, resorting to OCR might be needed. | 03:01.12 |
henrys | marcosw you around today? | 17:53.00 |
| hi robin_watts_mac are you glowing yet? | 17:56.47 |
robin_watts_mac | henrys: Just got back. | 17:57.59 |
| Fab trip. Got far closer than I would have thought possible. | 17:58.19 |
henrys | able to see the building? | 17:59.00 |
robin_watts_mac | Within about 400 yards :) | 17:59.11 |
henrys | wow | 17:59.22 |
robin_watts_mac | yeah. | 17:59.33 |
| Will try to upload photos later. | 17:59.42 |
henrys | love to see them | 18:00.16 |
JakeSays | so do you guys use postscript as a general purpose language much? | 19:35.00 |