Ghostscript IRC logs

Log of #ghostscript at irc.freenode.net.

	2012/12/29
JakeSays	i need to manually construct a pdf that will display every glyph in a font. any ideas on how i'd go about that?	00:04.50
jroes	hey guys, trying to install ghostscript via homebrew on a mac, and I'm getting a 404 for this URL: http://ghostscript.com/~giles/jbig2/jbig2dec/jbig2dec-0.11.tar.gz	01:35.15
	anyone know what might have happened to that file?	01:35.30
ray_work	jroes: (for the logs). You don't need (or want) a separate jbig2dec set of sources. What you need is already a subdirectory in the gs directory	01:59.52
	JakeSays: Since PDF isn't a programming language, you have to do it by hand (tedious maximus), but using PostScript a simple program can display all glyphs. See "prfont.ps" which can do a "catalog" of all fonts in the FontDirectory. If you use this with the Ghostscript pdfwrite device, you will get a PDF	02:26.38
JakeSays	ray_work: that should work, although a png would work even better. i have fonts that i extracted from a pdf. can i use those with prfont.ps?	02:28.08
ray_work	JakeSays: Note that if you use -sFONTPATH=... and ask for a nonexistent font, ALL of the fonts on the FONTPATH will be identified, so you can develop a massive catalog of glyphs	02:28.18
JakeSays	ah cool	02:28.29
	so i can set fontpath to the dir with my extracted fonts	02:28.38
ray_work	JakeSays: -sFONTPATH=___ looks for all Type 1 and TrueType (and I think OpenType) fonts to add them	02:29.19
JakeSays	ray_work: excellent. i'll give it a try. thanks!	02:29.41
ray_work	JakeSays: I am not sure if it works with Type3 (bitmap) fonts, but it's worth a try	02:29.52
JakeSays	the fonts i have are ttf	02:30.05
ray_work	JakeSays: OK. GS should find them	02:30.21
	JakeSays: I recommend trying a font to the 'display' (or x11) device to see if the format suits you.	02:31.19
	JakeSays: Note that TTF fonts are loaded by the font name in the TTF file, NOT the name of the TTF file	02:31.59
JakeSays	ray_work: thats fine. i'm trying to map the glyphs to characters. i've managed to do it manually for one pdf, but every embedded font is different, so i need to automate. my thinking is to extract the fonts from the pdf, render/ocr them to get the chars	02:33.57
	its a hack upon hack but..	02:34.05
ray_work	JakeSays: thus if you have a font file "arial.ttf" that has "ArialMT" in it, it will be found if you try to load "ArialMT", not just "Arial" (unless the Fontmap has an alias, as Ghostscript	02:34.28
	has	02:34.32
	JakeSays: yes, the problem with fonts is that the mapping of glyphs in an embedded font is often arbitrary, although sometimes the "Encoding" or the "ToUnicode" in an embedded font helps (that's what txtwrite uses)	02:36.38
JakeSays	ray_work: yeah, txtwrite maps everything to xFFFF	02:37.16
ray_work	JakeSays: then there probably is not a ToUnicode directory	02:37.40
JakeSays	directory?	02:38.01
	oh you mean in the font	02:38.15
ray_work	JakeSays: sorry, I'm not on my main system, but IIRC it is a directory in the FontDirectory PDF object	02:38.55
JakeSays	ray_work: no prob. so does my reasoning make sense, or is there another way to map w/o support from the pdf/font itself?	02:40.36
ray_work	JakeSays: from the PDF spec "this information can be deduced from the encoding used to represent the text in the PDF file. Otherwise, the PDF producer application should specify the mapping explicitly by including a special object, the ToUnicode CMap."	02:41.11
JakeSays	so if txtwrite can't resolve it, then there most likely isn't a ToUnicode directory	02:42.31
ray_work	JakeSays: OCR (or at least outline shape at a specific size to a hash lookup) is all that we have been able to arrive at	02:42.46
	JakeSays: BTW, the ToUnicode CMap stream is in the Font (not the FontDirectory). sorry.	02:43.43
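For readers following along: when a font does carry a ToUnicode CMap, the simplest entries are `bfchar` pairs of hex strings (character code, then UTF-16BE text). A minimal sketch of reading them, assuming the CMap stream has already been decompressed to text; the function name `parse_bfchar` and the sample data are made up for illustration, and `bfrange` sections are deliberately ignored:

```python
import re


def parse_bfchar(cmap_text):
    """Return {character code: Unicode string} from the bfchar sections of a CMap."""
    mapping = {}
    # Each section looks like: N beginbfchar <src> <dst> ... endbfchar
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            # The destination is UTF-16BE text written as hex digits.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


# Hypothetical sample, not taken from any real file.
sample = """
2 beginbfchar
<01> <0041>
<02> <0042>
endbfchar
"""

print(parse_bfchar(sample))
```

If a font has no such stream at all, as in the files discussed here, there is nothing for this kind of lookup to read.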
JakeSays	so this is either a busted pdf print driver, or a clever attempt at obfuscation on the part of the application generating the output	02:44.12
ray_work	JakeSays: I recommend careful reading (and re-reading) of section "5.9 Extraction of Text Content"	02:44.51
*JakeSays*	pulls up spec	02:45.14
ray_work	JakeSays: we see PDFs like that all the time that just don't bother to make their text mappable / searchable. Also, creating a PDF from a PCL file can lose all encoding info	02:46.06
JakeSays	ray_work: all of the fonts have names like LGAZPG+TTEDt00	02:48.41
ray_work	JakeSays: off to dinner. I'll check back later (probably as ray_laptop), but I'll review the logs to see if you posted anything	02:48.47
JakeSays	ray_work: ok. hey thanks for your help!	02:48.59
ray_work	JakeSays: that's OK since the LGAZPG+ is just a unique id within a file for an embedded subset font	02:49.58
	the TTEDt00 is usually a dummy name created by the (Windows) PostScript creator that then creates a PDF from it	02:50.46
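The subset tag described here is six uppercase letters followed by a plus sign in front of the font name. A small sketch of splitting it off, assuming the font name has already been read from the PDF as a plain string; `split_subset_name` is an illustrative helper, not part of Ghostscript or MuPDF:

```python
import re

# Subset tag: exactly six uppercase ASCII letters, then "+".
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def split_subset_name(base_font):
    """Return (tag, name); tag is None when the font name has no subset prefix."""
    if SUBSET_TAG.match(base_font):
        return base_font[:6], base_font[7:]
    return None, base_font


print(split_subset_name("LGAZPG+TTEDt00"))
print(split_subset_name("ArialMT"))
```

The tag only distinguishes subsets within one file, so the part after the plus sign is what matters when grouping fonts.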
JakeSays	ah ok	02:50.57
	a quick search of a pdf finds no ToUnicode	02:51.21
ray_work	JakeSays: I assume you are searching a 'decoded' PDF (as from gs's pdfinflt.ps or mupdf -d)	02:52.24
JakeSays	ray_work: yeah - using mutool actually	02:52.44
	mutool clean	02:52.50
ray_work	JakeSays: you also want to look for Encoding and Differences	02:53.08
	I usually use the old names pdfclean -d instead of mutool clean -d	02:53.35
JakeSays	ray_work: no luck on either of those	02:53.35
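The check being described (look through a decoded PDF for ToUnicode, Encoding and Differences) can be roughly automated. A sketch under the assumption that the file has already been decompressed, for example with the tools mentioned above, so the keys appear as plain bytes; `count_font_keys` and the sample bytes are hypothetical:

```python
def count_font_keys(pdf_bytes, keys=(b"/ToUnicode", b"/Encoding", b"/Differences")):
    """Count raw occurrences of each key name in already-decompressed PDF bytes."""
    # A plain substring count: it does not parse objects, so treat it as a hint only.
    return {key.decode("ascii"): pdf_bytes.count(key) for key in keys}


# Hypothetical font dictionary, not taken from any real file.
sample = b"<< /Type /Font /BaseFont /LGAZPG+TTEDt00 /Encoding /WinAnsiEncoding >>"

print(count_font_keys(sample))
```

All zeros across a file's fonts matches the "no luck" result here; it says nothing about keys hidden inside streams that are still compressed.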
ray_work	JakeSays: and I assume that Adobe Acrobat can extract meaningful text from these files?	02:55.10
JakeSays	ray_work: good question. i wont have access to acrobat until next week	02:56.07
ray_work	if so, then either Adobe does OCR (not likely) or there is some magic (undocumented?) algorithm for mapping to useful codes.	02:56.08
	JakeSays: have you posted a sample file anywhere ?	02:56.23
JakeSays	foxit's editor cant map them though	02:56.28
ray_work	JakeSays: I have full Acrobat 9	02:56.48
JakeSays	ray_work: i havent. unfortunately the pdfs are medical statements.	02:57.05
ray_work	and there are files that we can't figure out HOW Adobe gets text codes from them	02:57.20
	JakeSays: well, usually people provide a "dummy" patient file (Name: John Doe, Diagnosis: dead as a doornail)	02:58.39
JakeSays	ray_work: i wish. we're a medical claims clearing house. basically people ship unimaginable crap to us	02:59.22
ray_work	sort of a Lorem Ipsum patient record	02:59.27
JakeSays	and we're supposed to make sense of it	02:59.31
ray_work	JakeSays: well, if you find anything useful as far as mapping, please let us know.	03:00.42
JakeSays	ray_work: will do. i'm going to try my ocr thing and see how it goes.	03:01.07
ray_work	JakeSays: but in your case, resorting to OCR might be needed.	03:01.12
henrys	marcosw you around today?	17:53.00
	hi robin_watts_mac are you glowing yet?	17:56.47
robin_watts_mac	henrys: Just got back.	17:57.59
	Fab trip. Got far closer than I would have thought possible.	17:58.19
henrys	able to see the building?	17:59.00
robin_watts_mac	Within about 400 yards :)	17:59.11
henrys	wow	17:59.22
robin_watts_mac	yeah.	17:59.33
	Will try to upload photos later.	17:59.42
henrys	love to see them	18:00.16
JakeSays	so do you guys use postscript as a general purpose language much?	19:35.00