Ghostscript IRC logs

	<<<Back 1 day (to 2020/11/16)	Fwd 1 day (to 2020/11/18) >>>	20201117
Intruder777	Hi. I'm trying to merge several PDF files into one using gswin64.exe tool. The text in resulting pdf looks good, but when I try to copy it - it copies some garbage instead of text (before merging the copying worked fine). Can someone help with this?		15:50.28
artifexirc-bot	<KenSharp> Intruder777 everyone's in a meeting at the moment, hold on a bit please		15:51.07
Intruder777	Ok, thank you. Here is the command line I'm using: `gswin64.exe -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/default -dNOPAUSE -dQUIET -dBATCH -dDetectDuplicateImages -dCompressFonts=true -r150 -sOutputFile=output.pdf file1.pdf file2.pdf file3.pdf` . So file1.pdf has text which can be copied, and in the output.pdf the clipboard gets garbage instead of text when copying		15:53.34
artifexirc-bot	<KenSharp> Intruder777 I may get called away but let's start.		16:07.17
	<KenSharp> The first point is that Ghostscript (and more particularly the pdfwrite device) doesn't 'merge' PDF files		16:07.38
	<KenSharp> The process is described here: https://www.ghostscript.com/doc/9.53.3/VectorDevices.htm#Overview		16:07.45
	<KenSharp> What happens is that each input PDF file is interpreted into a series of marks on the page, which are sent to the pdfwrite device which then reassembles a new PDF file from them		16:08.18
	<KenSharp> So what's in the output PDF file (in terms of the language and syntax) bears no relation to what was in the input		16:08.36
	<KenSharp> Now, it sounds to me like your input files do not have ToUnicode CMaps		16:08.49
	<KenSharp> In which case the PDF consumer lacks search information.		16:09.02
	<KenSharp> The best it can do (probably) is to use the character Encoding and pretend it's ASCII		16:09.17
	<KenSharp> The problem, probably, is that when pdfwrite creates the fonts for output, it is unable to preserve that encoding and uses a custom encoding		16:09.51
	<KenSharp> For example, if you had the text "Hello World" then H would be assigned the character code 1, e would be assigned 2, l would be assigned 3 and so on		16:10.25
	<KenSharp> Obviously that is not ASCII		16:10.30
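A minimal Python sketch of the re-encoding described above. It is not pdfwrite's actual code, and the function names are made up for illustration. It assigns character codes in order of first appearance, then shows why a viewer that treats those codes as ASCII copies garbage, while a code-to-character map (the role a ToUnicode CMap plays) recovers the text.

```python
def assign_custom_encoding(text):
    """Give each distinct character the next free code, starting at 1."""
    encoding = {}
    for ch in text:
        if ch not in encoding:
            encoding[ch] = len(encoding) + 1
    return encoding


def encode(text, encoding):
    """Turn text into the list of character codes stored in the output."""
    return [encoding[ch] for ch in text]


def copy_as_ascii(codes):
    """What a viewer without ToUnicode data might do: treat codes as ASCII."""
    return "".join(chr(code) for code in codes)


def copy_with_tounicode(codes, encoding):
    """With a code -> character map available, the original text survives."""
    to_unicode = {code: ch for ch, code in encoding.items()}
    return "".join(to_unicode[code] for code in codes)


encoding = assign_custom_encoding("Hello World")
codes = encode("Hello World", encoding)
print(codes)                                # H=1, e=2, l=3, ...
print(repr(copy_as_ascii(codes)))           # control characters, not text
print(copy_with_tounicode(codes, encoding))
```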
Intruder777	The file1.pdf was created by MS Word - it was a plain .docx file which was "saved as..." as pdf. And it has non-english characters		16:10.35
artifexirc-bot	<KenSharp> Well, obviously I'm guessing because I haven't seen your PDF files 🙂		16:11.00
	<KenSharp> If the non-Latin text is searchable then it ought to have ToUnicode CMaps (assuming by non-English you don't mean something like French)		16:11.32
	<KenSharp> You also haven't mentioned which version of Ghostscript you are using		16:12.17
Intruder777	GPL Ghostscript 9.20		16:12.57
artifexirc-bot	<KenSharp> Well the current version is 9.53.3		16:13.10
	<KenSharp> 9.20 is 4 years old		16:13.25
	<KenSharp> So first I'd suggest you try updating		16:13.33
Intruder777	I see. As for the original pdf file - it has something much worse than French - Cyrillic		16:13.49
artifexirc-bot	<KenSharp> Other than that I'd need to see the input files. The simplest way to provide those is to open a bug report at bugs.ghostscript.com		16:14.02
	<KenSharp> But broadly speaking I'd expect that the problem is the fonts are being re-encoded and there is either no ToUnicode information, or the ToUnicode is being lost		16:14.47
Intruder777	the text in console says "Substituting font Helvetica for AralMT. Loading NimbusSans-Regular font from %rom%Resource/Font/NibusSans-Regular... "		16:15.15
artifexirc-bot	<KenSharp> Well that's a bad start		16:15.26
Intruder777	I see. I'm going to try to install recent version now...		16:15.36
artifexirc-bot	<KenSharp> But again it depends on what kind of font it is. If it's a CIDFont then the font ought to be embedded, if it isn't then there will be problems		16:16.19
	<KenSharp> Got to go to another channel, will bbs		16:16.29
Intruder777	Looks like latest version of GS helped. Now same command line produces output pdf where I can copy/search cyrillic text		16:29.01
artifexirc-bot	<KenSharp> Well we do improve things from time to time 🙂		16:29.21
Intruder777	Thank you for your help. I figured out that I've already had latest version installed into my ProgramFiles folder, but there was some other stuff in my %PATH% so some old 9.20 version was used which was embedded into some other stuff.		16:31.14
artifexirc-bot	<KenSharp> Yeah the $PATH environment variable doesn't get set by the installer on Windows so you have to do that manually		16:31.47
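One way to spot the kind of stale copy described above is to list every Ghostscript executable reachable through PATH, in search order. This is only a sketch: `find_on_path` is a hypothetical helper, and the file names passed to it are guesses at what an installation might contain.

```python
import os


def find_on_path(names, path=None):
    """Return every match for the given file names, in PATH search order."""
    if path is None:
        path = os.environ.get("PATH", "")
    matches = []
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        for name in names:
            candidate = os.path.join(directory, name)
            if os.path.isfile(candidate):
                matches.append(candidate)
    return matches


# The first entry is the one a bare command name would normally run.
for hit in find_on_path(["gswin64.exe", "gswin64c.exe", "gs"]):
    print(hit)
```

If more than one path is printed, running the first one with `--version` should show which release is actually being picked up.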
Intruder777	BTW, the latest 9.53.3 version still says those things on the console: "Substituting font Helvetica for AralMT. Loading NimbusSans-Regular font from %rom%Resource/Font/NibusSans-Regular... ". But anyways, the output result is much better.		16:33.21
artifexirc-bot	<KenSharp> Probably means that that font wasn't the one causing you a problem. I'd guess the Cyrillic font is a different one		16:33.53
	<KenSharp> You can always use the Windows TT font as a substitute for a Font (but not a CIDFont)		16:34.20
	<KenSharp> You just need to tell Ghostscript that it's an alias by editing the font map		16:34.33
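As a rough illustration of that font map edit: to the best of my knowledge, Fontmap entries take the form `/Name (file) ;` for a font file and `/Alias /RealName ;` for an alias. The Python helpers below only format such lines. The helper names, font names and file path are placeholders, not taken from the files discussed here.

```python
def fontmap_file_entry(font_name, font_file):
    """Format a Fontmap line mapping a font name to a font file."""
    # Forward slashes avoid backslash escapes inside the PostScript string.
    return "/{} ({}) ;".format(font_name, font_file.replace("\\", "/"))


def fontmap_alias_entry(alias, real_name):
    """Format a Fontmap line that makes one font name an alias of another."""
    return "/{} /{} ;".format(alias, real_name)


# Placeholder name and path, for illustration only.
print(fontmap_file_entry("ExampleFont", r"C:\Windows\Fonts\example.ttf"))
print(fontmap_alias_entry("ExampleFont-Alt", "ExampleFont"))
```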
Intruder777	The font name in original docx file (which was saved as pdf) is Calibri		16:35.41
artifexirc-bot	<KenSharp> Then I'm guessing Calibri was embedded in the PDF file or you would get a message it was being substituted		16:36.12
Intruder777	yeah, so probably issue was totally related to the GS version only...		16:37.18
artifexirc-bot	<KenSharp> I would think so, it's presumably something we've enhanced over the years		16:37.40
Intruder777	Thanks again!		16:37.55
artifexirc-bot	<KenSharp> No problem have a good day		16:38.05
	<KenSharp> @NancyABQ and there's a bmpcmp now as well		16:38.45

Log of #ghostscript at irc.freenode.net.