Ghostscript IRC logs

	<<<Back 1 day (to 2022/06/16)	Fwd 1 day (to 2022/06/18) >>>	20220617
artifexirc-bot	<byrnejb> Is this the place to ask questions about gpcl6?		13:14.09
	<KenSharp> It's as good as any		13:17.58
	<KenSharp> I can't promise an answer but we'll certainly try		13:18.16
	<byrnejb> OK. I am running version 9.54.0 that I built from source on a FreeBSD host, version 12.?.		13:21.38
	<byrnejb> I am using this with socat to create a virtual pdf printer for a legacy host that produces old fashioned green bar 132 column reports.		13:22.33
	<byrnejb> The setup works fine.		13:22.49
	<KenSharp> OK.....		13:22.54
	<byrnejb> However, the pdf documents it produces are not searchable.		13:23.21
	<KenSharp> Very possibly not		13:23.29
	<byrnejb> This is a feature that is desired.		13:23.34
	<KenSharp> There is no guarantee of searchability in any PDF file.		13:23.41
	<KenSharp> PCL cannot supply ToUnicode information for its text		13:23.57
	<KenSharp> (There is simply no defined mechanism for it in the language)		13:24.08
	<byrnejb> I looked into the pdfwrite docs on https://ghostscript.com/doc/current/VectorDevices.htm#PDF		13:24.09
	<KenSharp> So there is no way to know that (eg) character code 0x41 is a U+0041		13:24.31
	<KenSharp> Have you got an example PCL file ?		13:24.48
	<byrnejb> This makes mention of a -s option called UseOCR		13:25.15
	<KenSharp> Umm yes.....		13:25.24
	<KenSharp> That's a somewhat experimental feature		13:25.30
	<KenSharp> To use it you will need to build with Tesseract		13:25.41
	<byrnejb> however that throws a segmentation fault when used		13:25.42
	<KenSharp> And Leptonica		13:25.45
	<KenSharp> Well it should not seg fault		13:25.59
	<KenSharp> Again, do you have an example file and a command line ?		13:26.12
	<byrnejb> Well, here is the situation. I start with an ASCII 7bit data stream that is piped directly to gpcl6		13:27.33
	<KenSharp> Hmm OK so straight text, PCL will accept that. No idea what font you get		13:28.01
	<byrnejb> here is the command line: socat -d -d -lf/var/log/lpd_socat_${HPNP} TCP4-LISTEN:${HPNP},bind=${IPNP},fork,reuseaddr,su=hp3000 SYSTEM:'gpcl6 -dNOSAFE -dNOPAUSE -sDEVICE=pdfwrite -sUseOCR=Always -sOutputFile=/var/spool/hp3000/np${HPNP}/HP3000-RPT-$(date -Iseconds).pdf -' &		13:28.43
	<byrnejb> Courier is fine. These are just stocktab reports.		13:29.02
	<KenSharp> OK I don't really speak bash so I'm trying to get some assistance here 🙂		13:29.14
	<KenSharp> Does any text input cause a seg fault ?		13:29.40
	<KenSharp> Because if so I can just use a random text file here		13:29.50
	<byrnejb> here" network connection piped to gpcl6. gpcl options are: gpcl6 -dNOSAFE -dNOPAUSE -sDEVICE=pdfwrite -sUseOCR=Always -sOutputFile=whatever -' # ( - read from stdin)		13:31.50
	<byrnejb> I believe so but I will test		13:32.06
	<KenSharp> OK well I'm going to go and experience the joys of trying to build this, I'll need a few minutes.		13:32.36
	<KenSharp> BTW the parameter is NOSAFER (has a trailing R) and I really wouldn't use that, especially for PCL		13:33.06
	<byrnejb> Yes. A straight cat of an ASCII text file taken from the legacy host causes a segmentation fault		13:33.35
	<byrnejb> cat QZARATBD | gpcl6 -dNOSAFE -dNOPAUSE -dHaveTrueTypes=true -sDEVICE=pdfwrite -sUseOCR=Always -sOutputFile=test2.pdf -		13:33.44
	<byrnejb> Segmentation fault		13:33.45
	<KenSharp> OK I'm going to need a few minutes		13:34.05
	<KenSharp> Before I can try this		13:34.10
	<byrnejb> that is fine. I appreciate the help.		13:34.22
	<KenSharp> BTW where did you get Tesseract and Leptonica from ?		13:34.38
	<chrisl> And the training data <sigh>		13:35.04
	<KenSharp> Umm oh yes		13:35.11
	<KenSharp> forgot about that		13:35.16
	<chrisl> If I do that, I get a searchable PDF		13:40.41
	<byrnejb> I have neither to my knowledge. I was simply trying to understand the documentation of a utility of which I have next to no experience.		13:41.12
	<KenSharp> Oh, well if you haven't put the Tesseract and Leptonica sources in the ghostpdl tree before you build it, then OCR won't work		13:41.46
	<KenSharp> Tesseract is the OCR engine, we need that for it to do anything		13:42.03
	<byrnejb> Ah.		13:42.16
	<chrisl> The tesseract and leptonica sources were in the release archives		13:42.21
	<KenSharp> @chrisl is that with Leptonica and Tesseract ?		13:42.27
	<chrisl> Yes		13:42.33
	<KenSharp> OK well if they were in the release archive, then they should be present		13:43.01
	<KenSharp> If it works for you then I'm stumped really		13:43.12
	<byrnejb> But for me the resort to OCR was not planned, just stumbled upon in an effort to discover how to get gpcl6 to produce searchable pdf documents. It seemed to me that given the basic input is in ASCII 7bit that it should just work. If I use A2pdf then the pdf document it produces from the same source file is searchable.		13:45.38
	<chrisl> But there's no ToUnicode in there, or anything. And it's just using NimbusMono-Regular so it's basically ASCII character codes		13:45.41
	<KenSharp> For me a simple 7-bit ASCII file sent through gpcl6 to pdfwrite results in searchable text (ASCII character codes) without any special action		13:45.44
	<KenSharp> I just used a handy log file I had and it came out searchable		13:46.24
	<KenSharp> As @chrisl says, for me it's all in NimbusMono-Regular, so Courier in effect		13:47.14
	<KenSharp> But it seems to work adequately		13:47.23
	<KenSharp> byrnejb your best bet is probably to send us an input file we can try, otherwise we can't be sure we are duplicating what you are doing		13:48.08
	<chrisl> It's certainly never going to work with anything Unicode....		13:48.41
	<KenSharp> Well no....		13:48.49
	<KenSharp> But PCL6 isn't supposed to accept anything except ASCII as 'text' is it ? I don't know if you can tell it to use some fancy encoding or not		13:49.17
	<byrnejb> I will get a test file that I can share off the legacy host and attach it here. It will take a few minutes.		13:50.01
	<chrisl> You can use different/custom "symbol tables", but you'd need to use PCL to set those		13:50.04
	<byrnejb> I think I see what may be the problem. These reports were originally set up to print on Laserjets and they include a PCL setup string at the start to set the font to line-printer 16.6 and landscape mode to print on letter size sheets.		13:55.18
	<byrnejb> I will strip that out and see if that affects anything.		13:55.50
	<KenSharp> Well it still should not crash....		13:56.33
	<byrnejb> I agree. But I do not need OCR if gpcl6 creates searchable pdfs from straight ascii.		13:57.20
	<KenSharp> Well it does for me and Chris 🙂		13:57.38
	<byrnejb> And so I hope for me too if it turns out that the issue is the PCL frontend commands.		13:58.10
	<chrisl> The OCR stuff made no difference for me, anyway. I suspect because it was using a built-in font with no customisations		14:05.52
	<byrnejb> Yes, that was it. If I strip out the PCL command then the pdf is searchable.		14:06.10
	<KenSharp> OK well that's good 🙂		14:06.22
	<byrnejb> very		14:06.29
	<byrnejb> That I can manage.		14:06.39
	<chrisl> line-printer probably has some hokey, special encoding		14:11.39
	<chrisl> Actually, it probably ends up in the PDF as a Type3 with a custom encoding, which would do it		14:12.22
	<byrnejb> Well, the pdf documents become searchable but the reports no longer fit on the page.		14:12.24
	<KenSharp> Straight text input to PCL just splats the text on the page, no formatting		14:13.15
	<chrisl> You'll probably want to change the PCL so it selects Courier		14:13.27
	<KenSharp> I presume that your prefix was selecting things like font and size, and possibly as Chris says 'other stuff' such as an encoding		14:13.42
	<byrnejb> this is the entire pcl command string: &l1o2a8d4e60F(s16.66H		14:14.52
	<KenSharp> There will be ESC characters in there somewhere		14:15.07
	<KenSharp> You need to paste it as a hex dump otherwise it can't be decoded reliably		14:15.32
	<byrnejb> I am not sure that this is what you need but I cut the pcl command string from the source file and tried to convert it using hexdump:		14:20.37
	<byrnejb> echo -n '&l1o2a8d4e60F(s16.66H ' | hexdump -e '1/1 "%01x"'		14:20.41
	<byrnejb> 266c316f326138643465363046287331362e36*		14:20.42
	<byrnejb> 4820*		14:20.43
	<KenSharp> Seriously ? No 0x1B ? I'm not certain that's even legal		14:21.14
	<byrnejb> echo -n '&l1o2a8d4e60F(s16.66H ' | hexdump -b		14:21.55
	<byrnejb> 0000000 046 154 061 157 062 141 070 144 064 145 066 060 106 050 163 061		14:21.56
	<byrnejb> 0000010 066 056 066 066 110 040 040		14:21.57
	<byrnejb> 0000017		14:21.58
	<byrnejb> I am not familiar with this stuff		14:22.15
	<KenSharp> I don't want to sound like I'm getting at you, but.... You've done 'echo' of a string, where did you get the string from ? Is it possible that you didn't copy some invisible characters ?		14:22.59
	<byrnejb> As I stated. I copied that string directly from the source file and pasted it into a single quote delimited string		14:23.48
	<byrnejb> You are not getting at me. I do not have a great deal of experience working at this level of detail.		14:24.26
	<KenSharp> Again, I don't like to sound doubtful, but &l1o isn't valid PCL whereas ESC&l1o is.		14:24.35
	<byrnejb> Yes, I know. The string in the source file looks like this: ^[&l1o2a8d4e60F^[(s16.66H		14:26.41
	<Robin_Watts> Hex 1b is ESC, right?		14:26.59
	<KenSharp> Right, that '^[' will be a 0x1B, an ESC character		14:27.06
	<Robin_Watts> d'oh. I misread. Ignore me, sorry.		14:27.16
	<byrnejb> But the ^[ non-displaying character (which is the escape character I presume) does not appear to copy through the clipboard		14:27.28
	<KenSharp> Yes, very possibly, it's why I kept pushing, sorry....		14:27.47
	<KenSharp> I was fairly convinced there needed to be some ESC characters		14:27.59
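The check being discussed here can be sketched in Python. This is a minimal illustration, assuming the setup string is exactly the one quoted at 14:26.41 with two ESC bytes and no trailing spaces; the helper name hex_dump is hypothetical. Writing the ESC bytes explicitly as \x1b avoids the clipboard problem, and the hex dump makes them visible.

```python
# Setup string as quoted in the log, with the ESC (0x1B) bytes written
# explicitly instead of relying on copy and paste (assumed content).
setup = b"\x1b&l1o2a8d4e60F\x1b(s16.66H"


def hex_dump(data: bytes) -> str:
    """Return the bytes as space-separated two-digit hex values."""
    return " ".join(f"{b:02x}" for b in data)


print(hex_dump(setup))
print("ESC bytes found:", setup.count(b"\x1b"))
```

To inspect a real report, the same helper could be applied to the first few dozen bytes read from the file in binary mode.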
	<byrnejb> no problem. you are helping me		14:28.01
	<KenSharp> Now I need to go try and find this stuff in the reference manual, which is never easy		14:28.21
	<byrnejb> My objective is to get the existing report format produced as a searchable pdf. A desirable outcome is one that does not involve reprogramming on the legacy host.		14:29.03
	<KenSharp> Well it 'looks like' the ESC&l1o should set the orientation of the page		14:29.44
	<KenSharp> TBH I'm struggling with the rest, PCL is not my area really		14:31.42
	<KenSharp> aha, ESC(s16.66H selects the "resident 16.66 pitch line printer font"		14:33.04
	<byrnejb> This is the actual code that produces the pcl string:		14:33.16
	<byrnejb> ;Define ASCII <esc> character		14:33.21
	<byrnejb> DEFINE QZD-ESC-I INTEGER SIZE 2 = 27		14:33.22
	<byrnejb> DEFINE QZD-ESC-C CHARACTER SIZE 1 = (CHARACTERS(QZD-ESC-I))[2:1]		14:33.24
	<byrnejb> ;Define PCL printer format strings for printing on 8.5x11 letter		14:33.25
	<byrnejb> DEFINE QZD-ESC-SU CHARACTER SIZE 80 = &		14:33.26
	<byrnejb> ;Set to Landscape 8LPI 16.6 CPI 60 LPP (132 landscape orientation)		14:33.27
	<byrnejb> QZD-ESC-C + "&l1o2a8d4e60F" + QZD-ESC-C + "(s16.66H"		14:33.29
	<byrnejb> ;Set to Portrait 8LP 16.6 CPI (132 portrait orientation) 80 LPP		14:33.30
	<byrnejb> ; QZD-ESC-C + "&l0o2a8d4e80F" + QZD-ESC-C + "(s16.66H"		14:33.31
	<KenSharp> Well I'd suggest that it's the selection of the line printer font which is causing your encoding problem. Probably because it's doing something weird and not using a real font at all, some kind of type 3 font.		14:35.04
	<KenSharp> You could try just leaving off the QZD-ESC-C + "(s16.66H" which will leave the rest of the setup in place		14:35.52
	<byrnejb> hmm. well the reports will not fit on letter size paper if the font used is any larger		14:36.11
	<KenSharp> I've no idea what the font will be		14:38.30
	<KenSharp> But you either need to use a built-in font or download a font. Downloading a font is not really an option		14:38.59
	<byrnejb> the printer font is built in to the printers. All laserjet compatible printers that I have used have it.		14:39.49
	<KenSharp> The reason I suggest leaving off the font selection is simply to isolate the problem		14:40.02
	<byrnejb> kk will try that		14:40.22
	<KenSharp> Yes the line printer font is indeed built in. But it's giving you a result you don't want (wrong encoding)		14:40.27
	<KenSharp> So you need to use a different font, and select a size that will fit. But first let's make sure that you get searchable text		14:41.03
	<byrnejb> Removing the font selection and leaving the rest results in a non-searchable pdf		14:43.05
	<KenSharp> Well that's kind of annoying, you did remove all the copies of that ? Because I see it listed twice in your text above		14:43.43
	<byrnejb> also, the resulting pdf truncates the report lines.		14:44.00
	<byrnejb> however, the orientation is correct		14:44.11
	<KenSharp> Well, the best I can suggest is you remove commands one by one till you find out which one is causing the text to be non-searchable		14:44.41
	<KenSharp> I'm not in a position to do that easily		14:44.48
	<byrnejb> I will do so, but I think that anything with an escaped character is likely to give the same results. I already removed the whole string and that gave a searchable pdf but the resulting display is unusable		14:52.38
	<byrnejb> the data is there but its layout is not easy to read		14:53.14
	<KenSharp> Well each ESC introduces a new PCL command, so your string contains several. I imagine most of them are benign (media selection, orientation, etc). So if you find out which one is causing the problem then we can maybe think about a way to work around it. However none of us currently here are PCL experts so it's hard for us to give you guidance in advance.		14:54.21
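Removing commands one at a time is easier once the setup string is split into individual commands. The following Python sketch does that, assuming the PCL 5 convention that a combined sequence uses a lowercase terminator for every command except the last, which ends with an uppercase one. The function name split_pcl is hypothetical, and the sketch handles only two-character parameterized sequences of the kind quoted in this log.

```python
import re

ESC = "\x1b"

# ESC, a parameterized character, a group character, then one or more
# value/terminator pairs; the sequence ends at the first uppercase terminator.
SEQUENCE = re.compile(
    r"\x1b([!-/][`-~])((?:[-+]?[0-9.]*[`-~])*[-+]?[0-9.]*[@-^])"
)
PAIR = re.compile(r"([-+]?[0-9.]*)([`-~@-^])")


def split_pcl(data: str) -> list[str]:
    """Expand combined PCL sequences into one stand-alone command each."""
    commands = []
    for prefix, body in SEQUENCE.findall(data):
        for value, terminator in PAIR.findall(body):
            commands.append(f"{ESC}{prefix}{value}{terminator.upper()}")
    return commands


# Setup string as quoted in the log (assumed content).
setup = "\x1b&l1o2a8d4e60F\x1b(s16.66H"
for command in split_pcl(setup):
    print(command.replace(ESC, "ESC"))
```

Each printed command can then be dropped from the setup string in turn to see which one changes the searchability of the output.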
	<byrnejb> I understand. You have been most helpful in assisting me in pinning down where the problem lies.		14:55.04
	<byrnejb> 63		14:55.18
	<byrnejb> sry wrong screen		14:55.34
	<KenSharp> 🙂		14:57.28
	<byrnejb> well, I discovered that the problem is not with gpcl6. It does create searchable pdfs from the printer output.		19:53.56
	<byrnejb> The problem is caused by rotating the pdf pages in the viewer (atril).		19:54.34
	<byrnejb> The way the report pdfs are produced has the text at right angles to the way they need to be viewed. In short they are produced for a US letter-sized sheet, which is correct for hard copy but awkward for pdf viewing.		19:56.34

Log of #ghostscript at irc.freenode.net.