| <<<Back 1 day (to 2020/11/16) | Fwd 1 day (to 2020/11/18) >>> | 20201117 |
Intruder777 | Hi. I'm trying to merge several PDF files into one using the gswin64.exe tool. The text in the resulting PDF looks good, but when I try to copy it, I get garbage instead of text (before merging, copying worked fine). Can someone help with this? | 15:50.28 |
artifexirc-bot | <KenSharp> Intruder777 everyone's in a meeting at the moment, hold on a bit please | 15:51.07 |
Intruder777 | Ok, thank you. Here is the command line I'm using: `gswin64.exe -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/default -dNOPAUSE -dQUIET -dBATCH -dDetectDuplicateImages -dCompressFonts=true -r150 -sOutputFile=output.pdf file1.pdf file2.pdf file3.pdf` . So file1.pdf has text which can be copied, and in the output.pdf the clipboard gets garbage instead of text when copying | 15:53.34 |
artifexirc-bot | <KenSharp> Intruder777 I may get called away but let's start. | 16:07.17 |
| <KenSharp> The first point is that Ghostscript (and more particularly the pdfwrite device) doesn't 'merge' PDF files | 16:07.38 |
| <KenSharp> The process is described here: https://www.ghostscript.com/doc/9.53.3/VectorDevices.htm#Overview | 16:07.45 |
| <KenSharp> What happens is that each input PDF file is interpreted into a series of marks on the page, which are sent to the pdfwrite device which then reassembles a new PDF file from them | 16:08.18 |
| <KenSharp> So what's in the output PDF file (in terms of the language and syntax) bears no relation to what was in the input | 16:08.36 |
| <KenSharp> Now, it sounds to me like your input files do not have ToUnicode CMaps | 16:08.49 |
| <KenSharp> In which case the PDF consumer lacks search information. | 16:09.02 |
| <KenSharp> The best it can do (probably) is to use the character Encoding and pretend it's ASCII | 16:09.17 |
| <KenSharp> The problem, probably, is that when pdfwrite creates the fonts for output, it is unable to preserve that encoding and uses a custom encoding | 16:09.51 |
| <KenSharp> For example, if you had the text "Hello World" then H would be assigned the character code 1, e would be assigned 2, l would be assigned 3 and so on | 16:10.25 |
| <KenSharp> Obviously that is not ASCII | 16:10.30 |
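The re-encoding described above can be illustrated with a small Python sketch. It is only a toy model, not pdfwrite's actual code: the function names are hypothetical, and treating the space as just another glyph that gets its own code is an assumption made for simplicity.

```python
def assign_custom_codes(text):
    """Give each distinct character a code in order of first use, starting at 1."""
    codes = {}
    for ch in text:
        if ch not in codes:
            codes[ch] = len(codes) + 1
    return codes


def naive_ascii_copy(text, codes):
    """Mimic a viewer with no ToUnicode data: read each character code as ASCII."""
    return "".join(chr(codes[ch]) for ch in text)


codes = assign_custom_codes("Hello World")
copied = naive_ascii_copy("Hello World", codes)
print(codes)         # H gets 1, e gets 2, l gets 3, and so on
print(repr(copied))  # control characters rather than readable text
```

The glyphs drawn on the page are unchanged, which is why the text still looks right; only the codes a viewer falls back on for copy and search no longer mean anything.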
Intruder777 | The file1.pdf was created by MS Word - it was a plain .docx file which was "saved as..." PDF. And it has non-English characters | 16:10.35 |
artifexirc-bot | <KenSharp> Well, obviously I'm guessing because I haven't seen your PDF files 🙂 | 16:11.00 |
| <KenSharp> If the non-Latin text is searchable then it ought to have ToUnicode CMaps (assuming by non-English you don't mean something like French) | 16:11.32 |
| <KenSharp> You also haven't mentioned which version of Ghostscript you are using | 16:12.17 |
Intruder777 | GPL Ghostscript 9.20 | 16:12.57 |
artifexirc-bot | <KenSharp> Well the current version is 9.53.3 | 16:13.10 |
| <KenSharp> 9.20 is 4 years old | 16:13.25 |
| <KenSharp> So first I'd suggest you try updating | 16:13.33 |
Intruder777 | I see. As for the original PDF file - it has something much worse than French - Cyrillic | 16:13.49 |
artifexirc-bot | <KenSharp> Other than that I'd need to see the input files. The simplest way to provide those is to open a bug report at bugs.ghostscript.com | 16:14.02 |
| <KenSharp> But broadly speaking I'd expect that the problem is the fonts are being re-encoded and there is either no ToUnicode information, or the ToUnicode is being lost | 16:14.47 |
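One rough way to compare the input and output files for ToUnicode data is to count raw occurrences of the `/ToUnicode` key. This is a crude sketch with a hypothetical helper name: it only sees keys stored uncompressed, so a PDF that keeps its font dictionaries inside compressed object streams (possible from PDF 1.5 onwards) can report zero even when the CMaps are present.

```python
from pathlib import Path


def count_tounicode(pdf_path):
    """Count raw occurrences of the /ToUnicode key in a PDF file's bytes."""
    return Path(pdf_path).read_bytes().count(b"/ToUnicode")


if __name__ == "__main__":
    import sys

    # Pass the PDF files to compare as arguments, e.g. file1.pdf output.pdf
    for name in sys.argv[1:]:
        print(name, count_tounicode(name))
```

A count that drops to zero between input and output would be consistent with the ToUnicode information being lost, but a proper PDF parser is needed to be sure.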
Intruder777 | the text in the console says "Substituting font Helvetica for ArialMT. Loading NimbusSans-Regular font from %rom%Resource/Font/NimbusSans-Regular... " | 16:15.15 |
artifexirc-bot | <KenSharp> Well that's a bad start | 16:15.26 |
Intruder777 | I see. I'm going to try installing the latest version now... | 16:15.36 |
artifexirc-bot | <KenSharp> But again it depends on what kind of font it is. If it's a CIDFont then the font **ought** to be embedded, if it isn't then there will be problems | 16:16.19 |
| <KenSharp> Got to go to another channel, will bbs | 16:16.29 |
Intruder777 | Looks like the latest version of GS helped. Now the same command line produces an output PDF where I can copy/search the Cyrillic text | 16:29.01 |
artifexirc-bot | <KenSharp> Well we do improve things from time to time 🙂 | 16:29.21 |
Intruder777 | Thank you for your help. I figured out that I already had the latest version installed in my Program Files folder, but there was some other stuff in my %PATH%, so an old 9.20 version embedded in some other software was used instead. | 16:31.14 |
artifexirc-bot | <KenSharp> Yeah, the PATH environment variable doesn't get set by the installer on Windows, so you have to do that manually | 16:31.47 |
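To see which Ghostscript executable a bare command name actually resolves to, a short Python check along these lines can help. It is a sketch under assumptions: the helper name and the list of candidate executable names are illustrative, and it relies on the console executable printing its version number when run with `--version`.

```python
import shutil
import subprocess


def resolve_ghostscript(names=("gswin64c", "gswin64", "gs")):
    """Return (path, version) for the first Ghostscript found on PATH, else None."""
    for name in names:
        path = shutil.which(name)  # same lookup the shell does via PATH
        if path:
            result = subprocess.run(
                [path, "--version"], capture_output=True, text=True
            )
            return path, result.stdout.strip()
    return None


if __name__ == "__main__":
    print(resolve_ghostscript())
```

If the reported path points somewhere other than the expected install folder, an older copy earlier in PATH is being picked up, as happened here with 9.20.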
Intruder777 | BTW, the latest 9.53.3 version still prints those messages on the console: "Substituting font Helvetica for ArialMT. Loading NimbusSans-Regular font from %rom%Resource/Font/NimbusSans-Regular... ". But anyway, the output is much better. | 16:33.21 |
artifexirc-bot | <KenSharp> Probably means that that font wasn't the one causing you a problem. I'd guess the Cyrillic font is a different one | 16:33.53 |
| <KenSharp> You can always use the Windows TT font as a substitute for a Font (but not a CIDFont) | 16:34.20 |
| <KenSharp> You just need to tell Ghostscript that it's an alias by editing the font map | 16:34.33 |
Intruder777 | The font name in original docx file (which was saved as pdf) is Calibri | 16:35.41 |
artifexirc-bot | <KenSharp> Then I'm guessing Calibri was embedded in the PDF file or you would get a message it was being substituted | 16:36.12 |
Intruder777 | yeah, so probably the issue was down to the GS version only... | 16:37.18 |
artifexirc-bot | <KenSharp> I would think so, it's presumably something we've enhanced over the years | 16:37.40 |
Intruder777 | Thanks again! | 16:37.55 |
artifexirc-bot | <KenSharp> No problem, have a good day | 16:38.05 |
| <KenSharp> @NancyABQ and there's a bmpcmp now as well | 16:38.45 |