Ghostscript IRC logs

Log of #ghostscript at irc.freenode.net.

2014/10/10
rsc	So I can influence the font name for example.	09:50.29
	Right now the "font name" is "DejaVuLGCSans-Identity-H"	09:50.43
kens	Well tell it to use DejaVuLGCSans then. But I have no idea if that will work at all	09:51.07
	The PostScript has 2 byte encoded text	09:51.32
rsc	Is it enough to do a search and replace in the *.ps?	09:51.37
kens	God no.	09:51.43
	PostScript is a programming language	09:51.54
	As I said, your text is double byte, you would need a type 0 or CIDFont to be able to handle that properly	09:52.27
	The application generating the PostScript needs to not do that in order for you to get something which will work	09:52.55
rsc	How to get "type 0"? I thought CIDFont is an issue?	09:54.47
kens	Type 0 is what a CIDFont turns into when you load it.	09:55.13
rsc	No I am totally confused. I thought CIDFont is the reason why I can't copy & paste without garbled results.	09:55.43
	*Now	09:55.46
kens	But it can also be manufactured by other means. However I doubt you can do it from the application	09:55.48
	The fact that the application is using a CIDFont is why the copy/paste doesn't work, yes	09:56.08
	And because the application is using a CIDFont, it emits the text in a form encoded suitably for the CIDFont. That form will NOT work with a type 1 or type 42 (what you would think of as TrueType) font.	09:56.57
	So you cannot simply search and replace the font name, replacing a CIDFont with a regular font, because the text will not then be suitably encoded for that font.	09:57.44
	FWIW the CIDSystemInfo attached to the CIDFont in the PDF file does say that the Ordering is Unicode, so a smart application could use that to figure out what the text is	09:58.33
rsc	Then neither evince nor Adobe is smart.	09:58.54
kens	Well it's a heuristic, and not totally reliable. It would be effort to code that, so I guess most people don't bother	09:59.22
	The chances of it being present, and correct, are small	09:59.34
	I must admit I'm not completely sure how our own txtwrite device is getting mostly useful text out, and I wrote that device.....	10:00.10
rsc	The copy & paste result from Adobe looks "correct" but it is "10 00 41" - while "41" is "A"	10:00.39
kens	Like I said, you are using 2 byte encodings (Unicode) so the 1st byte is always going to be 0x00 for Western languages	10:01.27
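A minimal sketch of the point above, assuming the two-byte codes are plain UTF-16 big-endian values (which is what the log suggests, but is not confirmed for every character). The helper name two_byte_codes is made up for illustration.

```python
def two_byte_codes(text):
    """Split text into 2-byte big-endian codes (UTF-16BE).

    Assumption: every character is in the Basic Multilingual Plane,
    so each character becomes exactly one 2-byte code.
    """
    data = text.encode("utf-16-be")
    return [data[i:i + 2] for i in range(0, len(data), 2)]


for code in two_byte_codes("A"):
    # Print each code as hex bytes; the high byte is 0x00 for ASCII text.
    print(" ".join(f"{b:02x}" for b in code))
```

For Western text every code starts with a zero byte, which is why "A" shows up as 00 41 rather than just 41.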
	Frankly there is no way you are going to get a PDF file you can reliably cut and paste from, starting from the PostScript you are using. All the PDF consumers are going to be forced to fall back on guesswork (because there is no ToUnicode information available) so therefore unreliable	10:03.25
	Some may work better than others.	10:03.43
rsc	But the text for copy & paste is not separately supplied? It is generated from the pdf writer?	10:07.03
	And I can not make it trimming the first byte simply? :)	10:07.36
kens	trimming the first byte from what exactly ? The PostScript file ? The PDF file ? The cut and pasted text ?	10:08.25
	I don't know what you mean by the text being separately supplied	10:08.53
rsc	Copy & paste result is 0x10 0x00 0x41 if I am not mistaken. So where does this exactly come from?	10:09.03
	Is it possible to have some kind of hackish workaround there to only have 0x41 or 0x00 0x41 instead?	10:09.42
	(to get a correct copy & paste result)	10:09.54
kens	Well it comes from the application doing the cut and paste I guess. Where it comes from exactly I can't guess. However the text is present in the PDF file so it comes (basically) from there	10:09.58
rsc	If I have an "A" in a PDF, is it there twice? Once for representation and once for copy & paste?	10:10.42
kens	rsc by changing which file ? The only thing you can change is the cut and pasted text, change either the PostScript file by removing the bytes and it will give you an error when you try to process it, change the PDF file by removing the bytes and it will not open	10:11.07
	rsc, no the text is only there once.	10:11.17
	Cut/paste/search is done by examining the text in the PDF file. First you look up which font is being used (also in the PDF file) then you take the correct number of bytes and make a numeric character code from it.	10:12.24
	What happens after that depends on the font and the information available.	10:12.35
	If there's a ToUnicode CMap then you take the character code as an index, and that tells you the corresponding Unicode code point.	10:13.03
	That's 100% reliable and the way most things work	10:13.14
	If you don't have a ToUnicode CMap then you are left with guessing.	10:13.28
rsc	So if I have to stick to CIDFont (which is likely because I can not change the application fundamentally), I need ToUnicode CMap definitely to get rid of this, right?	10:13.53
kens	You can use the glyph names from type 1 fonts. You can look up the POST table (if it's present) from a TrueType font. If neither of those is available then most apps simply say 'lets hope its ASCII'	10:14.19
	rsc yes, if you are using a CIDFont the only reliable mechanism is a ToUnicode CMap	10:14.45
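The lookup kens describes can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name extract_text is hypothetical, a plain dict stands in for a parsed ToUnicode CMap, and the glyph-name and POST-table fallbacks are left out.

```python
def extract_text(raw, bytes_per_code, to_unicode=None):
    """Turn the raw bytes of a text string into Unicode text.

    raw            -- the bytes shown by a text operator
    bytes_per_code -- 1 for single-byte fonts, 2 for a 2-byte CIDFont encoding
    to_unicode     -- dict {character code: str} standing in for a parsed
                      ToUnicode CMap, or None when the font has none
    """
    out = []
    for i in range(0, len(raw), bytes_per_code):
        # Take the correct number of bytes and make a numeric character code.
        code = int.from_bytes(raw[i:i + bytes_per_code], "big")
        if to_unicode is not None and code in to_unicode:
            out.append(to_unicode[code])   # reliable: the CMap says what it is
        elif code < 0x80:
            out.append(chr(code))          # guesswork: hope it is ASCII
        else:
            out.append("\ufffd")           # no information at all
    return "".join(out)
```

With a subset font the codes need not resemble Unicode at all, for example extract_text(b"\x00\x05", 2, {5: "A"}) gives "A". Reading the same kind of two-byte data one byte at a time without a map yields a stray NUL in front of each letter, which is the sort of garbling discussed in the conversation.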
rsc	Okay. For that usecase it would be enough if I cover characters from Western Europe.	10:15.07
kens	You are using subset fonts, so you can't produce a 'one size fits all' ToUnicode CMap	10:15.55
rsc	Let me go one step back. That fscking application here supports either "latin1" only by using Type1 fonts or "unicode" by using TTF.	10:17.20
kens	If you say so.	10:17.35
rsc	Can I somehow figure out if it uses CID for the "latin1 only" stuff?	10:17.43
kens	Look at the PostScript and see what font name it uses	10:17.57
	If it's a name of the form <font name>-Identity-H or similar then it's a CIDFont	10:18.24
	Also you can look at the text in the PostScript and see if it's single byte or double byte encoded	10:18.44
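Both checks can be approximated in a few lines. This is a rough heuristic sketch, not a real detector: the function names are hypothetical, "-Identity-V" is added as an assumed example of "or similar", and the byte test only recognises Western text whose high bytes are all zero.

```python
def looks_like_cidfont_name(font_name):
    """Heuristic: the font name ends in an Identity CMap suffix."""
    # "-Identity-V" is an assumption covering the "or similar" case.
    return font_name.endswith(("-Identity-H", "-Identity-V"))


def looks_double_byte_western(raw):
    """Heuristic: even length and every first byte of each pair is 0x00."""
    return len(raw) > 0 and len(raw) % 2 == 0 and all(b == 0 for b in raw[0::2])


print(looks_like_cidfont_name("DejaVuLGCSans-Identity-H"))
print(looks_like_cidfont_name("NimbusSanL-Regu"))
```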
rsc	NimbusSanL-Regu, Type 1C tells Evince here.	10:19.30
kens	If your PostScript contained a GlyphNames2Unicode entry in the font dictionary then you would get a ToUnicode CMap generated for you, but since the PostScript doesn't actually have the font embedded, that can't happen	10:20.01
	rsc yes that's a type 2 font, but it's basically the same	10:20.26
	You should have single byte encoded text, I would guess it will copy/paste/search as you expect	10:20.48
rsc	Copy/paste/search works, thus likely single byte encoded.	10:21.17
kens	Yes.	10:21.23
	Like I said, in the absence of any other information, applications will usually assume ASCII, and Latin1 is basically ASCII	10:21.54
rsc	How can I generate such a "ToUnicode CMap"?	10:22.52
kens	Like I said, you can't, it needs to be done programmatically by the application embedding the font.	10:23.24
	In case you hadn't guessed, you're in a very complicated area of PDF here	10:23.45
rsc	Can't I provide some mapping list to Ghostscript?	10:25.00
kens	Not really, no.	10:25.14
rsc	Means a "ToUnicode CMap" is only a hypothetical but not practical solution?	10:25.35
kens	It's highly practical for certain tasks; starting from another PDF file, or PostScript generated on Windows for instance.	10:26.16
	But if your application isn't generating it, it's not easy to add afterwards.	10:26.35
rsc	What would the application have to do exactly?	10:27.33
kens	OK well there is no concept of a ToUnicode CMap in PostScript. The Windows PostScript driver has a specific extension which includes a /GlyphNames2Unicode entry in an embedded font dictionary and we support that extension.	10:28.38
	So an application (or PostScript producer) would have to firstly embed the font (your app doesn't so it fails at the first hurdle) then it would have to add the entry to the dictionary and fill it in so that the character codes are matched to Unicode (actually UTF-16) values. In your case that would be an identity mapping of course.	10:29.49
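What an "identity mapping" from character codes to UTF-16 values means can be sketched like this. The function name is hypothetical, and the sketch deliberately does not reproduce the actual layout of a /GlyphNames2Unicode dictionary, which the log does not show; it only builds the code-to-UTF-16 pairs for the codes a subset font actually uses.

```python
def identity_code_to_utf16(codes):
    """Map each 2-byte character code to the UTF-16BE bytes of the same value.

    Assumption: the character codes already are Unicode code points in the
    Basic Multilingual Plane, as in the case discussed in the log.
    """
    mapping = {}
    for code in codes:
        if not 0 <= code <= 0xFFFF or 0xD800 <= code <= 0xDFFF:
            raise ValueError(f"code {code:#x} has no single 2-byte UTF-16 value")
        mapping[code] = chr(code).encode("utf-16-be")
    return mapping


# Codes used by a hypothetical subset font containing "A" and "é".
for code, utf16 in identity_code_to_utf16([0x41, 0xE9]).items():
    print(f"{code:04x} -> {utf16.hex()}")
```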
rsc	Uhm. Nothing that can be easily done as non-C-programmer I guess.	10:32.40
kens	No, I'm afraid not.	10:32.49
	Just embedding the font would be a complex task	10:32.58
rsc	But it is generic and not really application specific?	10:35.17
	So is it something where Artifex could stick a price to it?	10:35.42
kens	The ToUnicode CMap is part of the PDF specification, the GlyphNames2Unicode extension is specific to the Adobe PostScript driver on Windows	10:35.54
rsc	No Windows involved here, just Linux.	10:36.09
kens	I'm not sure what you are asking about....	10:36.25
chrisl	We'd have to modify every application that emits PostScript......	10:36.45
rsc	I thought if it could be an option to let you change the application to include the GlyphNames2Unicode entries to the PostScript.	10:37.40
kens	As chrisl says, we would have to modify every application that emits PostScript. We would also have to change at least the one you are using to embed the fonts too. We don't have that kind of manpower	10:38.23
rsc	Why every application that emits PostScript? I thought the issue is that my application here doesn't just do the right thing?	10:38.56
kens	You seem to be talking generically, not about a specific application	10:39.20
rsc	Oh, sorry if I was imprecise about that.	10:39.47
kens	If you mean your specific application then it would need to be modified to embed fonts in the output, and add the relevant GlyphNames2Unicode information	10:39.53
chrisl	And, frankly, especially right now, we don't have the man power to take on work like that	10:40.05
kens	It would be a major undertaking for the people who maintain that application, well outside of anything we could undertake, especially at the moment.	10:40.43
rsc	kens: okay, because it takes months to change that?	10:41.03
	kens: can you give me a very rough estimation how huge it would be?	10:41.16
	I anyway need to run to somebody and ask for budget etc.	10:41.34
kens	Well we don't have any background in that application, so we would first have to understand it. Embedding fonts is a very complicated process and that in itself would take an experienced engineer (experienced with fonts and PostScript) months to write and test fully.	10:42.17
rsc	Okay, so months.	10:42.44
kens	Please don't ask us to undertake such a task, we would have to say no.	10:42.50
	rsc months if you have an engineer experienced in PostScript and fonts.	10:43.12
rsc	kens: yes, I got this.	10:43.20
kens	There are very few of those in the world.	10:43.20
rsc	Is changing the application from CIDFont to something else better faster done?	10:44.03
	s/better/	10:44.11
kens	I imagine the application uses CIDFonts for the very excellent reason that it's the only way to support non-Western languages	10:44.40
rsc	Is it? But how does say, libreoffice, solve this? I don't see "TrueType (CID)" there in such PDFs.	10:45.29
kens	So changing to another font type probably isn't an option. I imagine that text is stored internally as Unicode code point values, so it would be hard to change	10:45.37
	rsc You can include 2 methods of course, one for Western text and one for non-Western (>256 characters in the language)	10:46.26
	More complex of course	10:46.34
	Supporting two methods for achieving the same end is usually something engineers ahte	10:46.59
	s/ahte/hate/	10:47.40
chrisl	kens: what's the procedure when a bountiable bug is resolved?	10:50.02
kens	I don't recall right now	10:50.15
	Probably best to notify henry	10:50.33
chrisl	I'll do that....	10:50.45
kens	Doesn't Shelly already know the procedure ? He must have claimed before.....	10:51.27
chrisl	Yeh, I wasn't sure if it's a "pull" procedure from Shelly's end, or a "push" procedure from henrys	10:52.02
kens	I have a suspicion it's a pull, but I could easily be mistaken, no harm in contacting henry anyway	10:52.22
chrisl	Okay, I've let both Henry and Shelly know......	11:07.24
kens	Seems reasonable	11:07.32
nsz	tor8: yesterday i tried the urls on http://git.ghostscript.com/?p=user/tor/mujs.git;a=summary but could not clone them	11:25.08
	looking at the commit diff in browser looked ok, except i'd use isalpha instead of manual 'a'<=c && ...	11:25.50
	libc isalpha generates smaller and faster code	11:26.04
	http://git.musl-libc.org/cgit/musl/tree/include/ctype.h#n30	11:26.23
	this is how isalpha should be implemented	11:26.35
	hm actually libc isalpha is not correct semantically but the musl implementation is how to do efficiently what you do there	11:30.20
tor8	nsz: libc isalpha is setlocale dependent, so unusable	11:39.12
	and musl's isalpha (while minimal and elegant) only tests A-Za-z, not the full unicode range	11:40.31
	nsz: I'm concerned that you couldn't clone the repo though	11:41.00
nsz	i mean you could do muslisalpha(c) || isalpharune(c)	11:42.16
	but it's just a minor nitpick	11:42.29
tor8	nsz: ahem, my bad. I'd confused git-export-daemon-ok and git-daemon-export-ok. should be able to clone now.	11:42.45
nsz	:)	11:42.55
tor8	nsz: true, but as you said, it's a minor nitpick :)	11:43.00
nsz	i'd assume the current code is just optimization and isalpharune handles the ascii case as well	11:43.34
tor8	nsz: yeah. isalpharune handles ascii as well, but it's quite a bit slower since it involves a binary search through a table	11:44.03
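The pattern under discussion (a cheap ASCII test in front of a slower table search) can be sketched as below. This is not the mujs code: the function names and the range table are hypothetical, and the table is a tiny illustrative excerpt rather than a complete list of alphabetic code points.

```python
from bisect import bisect_right

# Hypothetical, heavily truncated table of (first, last) alphabetic ranges,
# sorted by first code point. A real table would be far longer.
_ALPHA_RANGES = [
    (0x41, 0x5A), (0x61, 0x7A),      # A-Z, a-z
    (0xC0, 0xD6), (0xD8, 0xF6),      # Latin-1 letters, skipping the multiplication sign
    (0xF8, 0x17F),                   # rest of Latin-1 plus Latin Extended-A
    (0x391, 0x3A1),                  # Greek capital Alpha to Rho
]
_STARTS = [first for first, _ in _ALPHA_RANGES]


def is_alpha_rune(c):
    """Slow path: binary search for the range that could contain c."""
    i = bisect_right(_STARTS, c) - 1
    return i >= 0 and c <= _ALPHA_RANGES[i][1]


def is_alpha(c):
    """Fast path for ASCII letters, table search for everything else."""
    if 0x41 <= c <= 0x5A or 0x61 <= c <= 0x7A:
        return True
    return is_alpha_rune(c)
```

The fast path is purely an optimization: is_alpha_rune alone gives the same answers, it just pays for the binary search even on plain ASCII input. Neither function depends on the locale.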
nsz	clone works but i can only check things later	11:49.07
tor8	nsz: no rush	11:49.45
nsz	btw locale is not an issue with isalpha unless setlocale is called (and the libc supports more than one 8bit encoding)	11:49.58
	the problem is that c>255 is ub	11:50.19
rsc	kens: okay, thanks so far.	11:52.31
tor8	nsz: we're a library, we have no control over whether the user has called setlocale or not :(	12:42.29
	hence we need to reimplement strtof and printf. such a stupid design, setlocale.	12:42.49
nsz	yes that's a shame	13:05.34
	btw strtof and float printf are tricky to implement correctly	13:06.32
	(musl libc has correctly rounded implementations of these in c)	13:06.54
henrys	chrisl, kens: shelly usually batches up a few and sends me an email, then I review them. You don't need to do anything	14:19.12
kens	thanks henrys	14:19.48
zx	hello	14:20.38
ghostbot	Welcome to #ghostscript, the channel for Ghostscript and MuPDF. If you have a question, please ask it, don't ask to ask it. Do be prepared to wait for a reply as devs will check the logs and reply when they come on line.	14:20.38
chrisl	henrys: okay, cool, thanks!	14:21.04
zx	I want to insert an image into an existing PDF file with mupdf. But there is no documentation, I hope you can help me. Can you give me some examples? I use C language. I have added an annotation, I can see the rect but can't find the image.	14:21.43
	anyone here?	14:23.27
Robin_Watts	zx: yes, people are here	14:23.58
kens	I see we're getting customer emails not cc'ed to support again.	14:24.14
	Halfway through some conversation :-(	14:24.23
zx	can you give me a good way to insert an image into an existing pdf file	14:26.43
Robin_Watts	zx: Not using current mupdf, no.	14:27.13
kens	Adobe Illustrator, possibly Photoshop	14:27.17
Robin_Watts	zx: You could try to use the new filter stuff in mupdf.	14:27.46
	That would enable you to tack on arbitrary content to the end of the content streams.	14:28.15
zx	could you give me an example	14:28.20
Robin_Watts	but that requires a degree of PDF knowledge.	14:28.25
	No examples, no, it's still very new code.	14:28.36
	It was written to allow people to add watermarks.	14:28.42
	It may still only be on my repo...	14:29.06
	zx: http://git.ghostscript.com/?p=user/robin/mupdf.git;a=summary	14:30.56
	The 'add post processing option to page operator cleaning' commit is the one you need.	14:31.13
zx	ok thanks a lot	14:32.13
Robin_Watts	essentially you call pdf_clean_page_contents and pass in the page you want to work with.	14:32.28
	You also pass in a pdf_page_contents_process_fn.	14:32.38
	That is called back after the page contents are cleaned, with the page contents in a buffer.	14:33.02
	You can then append to the buffer.	14:33.09
	Let me know how you get on with it. It's very new (almost completely untested) code.	14:33.24
*kens*	gives up on the customer email, one for marcos to sanitise	14:35.36
henrys	chrisl: NOCACHE doesn't work in pcl because it is done in gs_init.ps. so I need a call to 0 setcachelimit in all the other languages when we parse the parameter.	15:47.40
chrisl	henrys: you could implement NOCACHE in pcl	15:48.14
	or I can do it....	15:48.45
henrys	chrisl: no I got it.	15:51.21
mvrhel_laptop	good morning	15:53.11
kens	morning	15:53.19
henrys	chrisl: I hate booleans that start with NO but I guess NOCACHE is something we are stuck with.	17:10.18
rayjj	henrys: there's a lot of NO... options in the Ghostscript set	17:11.36
henrys	we should try and be more positive	17:12.09
rayjj	I think Peter sort of changed styles over time.	17:12.16
	but I agree that -dUseCache=/false would be better (or even better -dUseFontCache=/false so we know which cache)	17:13.06
chrisl	I guess the preference was for options that didn't need a "=....."	17:16.07
*kens*	is amused by the Good emails :-)	19:53.27