| <<<Back 1 day (to 2022/06/16) | Fwd 1 day (to 2022/06/18) >>> | 20220617 |
artifexirc-bot | <byrnejb> Is this the place to ask questions about gpcl6? | 13:14.09 |
| <KenSharp> It's as good as any | 13:17.58 |
| <KenSharp> I can't promise an answer but we'll certainly try | 13:18.16 |
| <byrnejb> OK. I am running version 9.54.0 that I built from source on a FreeBSD host, version 12.?. | 13:21.38 |
| <byrnejb> I am using this with socat to create a virtual PDF printer for a legacy host that produces old-fashioned green bar 132-column reports. | 13:22.33 |
| <byrnejb> The setup works fine. | 13:22.49 |
| <KenSharp> OK..... | 13:22.54 |
| <byrnejb> However, the pdf documents it produces are not searchable. | 13:23.21 |
| <KenSharp> Very possibly not | 13:23.29 |
| <byrnejb> This is a feature that is desired. | 13:23.34 |
| <KenSharp> There is no guarantee of searchability in **any** PDF file. | 13:23.41 |
| <KenSharp> PCL cannot supply ToUnicode information for its text | 13:23.57 |
| <KenSharp> (There is simply no defined mechanism for it in the language) | 13:24.08 |
| <byrnejb> I looked into the pdfwrite docs on https://ghostscript.com/doc/current/VectorDevices.htm#PDF | 13:24.09 |
| <KenSharp> So there is no way to know that (eg) character code 0x41 is a U+0041 | 13:24.31 |
| <KenSharp> Have you got an example PCL file ? | 13:24.48 |
| <byrnejb> This makes mention of a -s option called UseOCR | 13:25.15 |
| <KenSharp> Umm yes..... | 13:25.24 |
| <KenSharp> That's a somewhat experimental feature | 13:25.30 |
| <KenSharp> To use it you will need to build with Tesseract | 13:25.41 |
| <byrnejb> however that throws a segmentation fault when used | 13:25.42 |
| <KenSharp> And Leptonica | 13:25.45 |
| <KenSharp> Well it should not seg fault | 13:25.59 |
| <KenSharp> Again, do you have an example file and a command line ? | 13:26.12 |
| <byrnejb> Well, here is the situation. I start with an ASCII 7bit data stream that is piped directly to gpcl6 | 13:27.33 |
| <KenSharp> Hmm OK so straight text, PCL will accept that. No idea what font you get | 13:28.01 |
| <byrnejb> here is the command line: socat -d -d -lf/var/log/lpd_socat_${HPNP} TCP4-LISTEN:${HPNP},bind=${IPNP},fork,reuseaddr,su=hp3000 SYSTEM:'gpcl6 -dNOSAFE -dNOPAUSE -sDEVICE=pdfwrite -sUseOCR=Always -sOutputFile=/var/spool/hp3000/np${HPNP}/HP3000-RPT-$(date -Iseconds).pdf -' & | 13:28.43 |
| <byrnejb> Courier is fine. These are just stocktab reports. | 13:29.02 |
| <KenSharp> OK I don't really speak bash so I'm trying to get some assistance here 🙂 | 13:29.14 |
| <KenSharp> Does any text input cause a seg fault ? | 13:29.40 |
| <KenSharp> Because if so I can just use a random text file here | 13:29.50 |
| <byrnejb> here" network connection piped to gpcl6. gpcl options are: gpcl6 -dNOSAFE -dNOPAUSE -sDEVICE=pdfwrite -sUseOCR=Always -sOutputFile=whatever -' # ( - read from stdin) | 13:31.50 |
| <byrnejb> I believe so but I will test | 13:32.06 |
| <KenSharp> OK well I'm going to go and experience the joys of trying to build this, I'll need a few minutes. | 13:32.36 |
| <KenSharp> BTW the parameter is NOSAFER (has a trailing R) and I really wouldn't use that, especially for PCL | 13:33.06 |
| <byrnejb> Yes. A straight cat of an ASCII text file taken from the legacy host causes a segmentation fault | 13:33.35 |
| <byrnejb> cat QZARATBD | gpcl6 -dNOSAFE -dNOPAUSE -dHaveTrueTypes=true -sDEVICE=pdfwrite -sUseOCR=Always -sOutputFile=test2.pdf - | 13:33.44 |
| <byrnejb> Segmentation fault | 13:33.45 |
| <KenSharp> OK I'm going to need a few minutes | 13:34.05 |
| <KenSharp> Before I can try this | 13:34.10 |
| <byrnejb> that is fine. I appreciate the help. | 13:34.22 |
| <KenSharp> BTW where did you get Tesseract and Leptonica from ? | 13:34.38 |
| <chrisl> And the training data <sigh> | 13:35.04 |
| <KenSharp> Umm oh yes | 13:35.11 |
| <KenSharp> forgot about that | 13:35.16 |
| <chrisl> If I do that, I get a searchable PDF | 13:40.41 |
| <byrnejb> I have neither to my knowledge. I was simply trying to understand the documentation of a utility of which I have next to no experience. | 13:41.12 |
| <KenSharp> Oh, well if you haven't put the Tesseract and Leptonica sources in the ghostpdl tree before you build it, then OCR won't work | 13:41.46 |
| <KenSharp> Tesseract is the OCR engine, we need that for it to do anything | 13:42.03 |
| <byrnejb> Ah. | 13:42.16 |
| <chrisl> The tesseract and leptonica sources were in the release archives | 13:42.21 |
| <KenSharp> @chrisl is that with Leptonica and Tesseract ? | 13:42.27 |
| <chrisl> Yes | 13:42.33 |
| <KenSharp> OK well if they were in the release archive, then they should be present | 13:43.01 |
| <KenSharp> If it works for you then I'm stumped really | 13:43.12 |
| <byrnejb> But for me the resort to OCR was not planned, just stumbled upon in an effort to discover how to get gpcl6 to produce searchable pdf documents. It seemed to me that given the basic input is in ASCII 7bit that it should just work. If I use A2pdf then the pdf document it produces from the same source file is searchable. | 13:45.38 |
| <chrisl> But there's no ToUnicode in there, or anything. And it's just using NimbusMono-Regular so it's basically ASCII character codes | 13:45.41 |
| <KenSharp> For me a simple 7-bit ASCII file sent through gpcl6 to pdfwrite results in searchable text (ASCII character codes) without any special action | 13:45.44 |
| <KenSharp> I just used a handy log file I had and it came out searchable | 13:46.24 |
| <KenSharp> As @chrisl says, for me it's all in NimbusMono-Regular, so Courier in effect | 13:47.14 |
| <KenSharp> But it seems to work adequately | 13:47.23 |
| <KenSharp> byrnejb your best bet is probably to send us an input file we can try, otherwise we can't be sure we are duplicating what you are doing | 13:48.08 |
| <chrisl> It's certainly never going to work with anything Unicode.... | 13:48.41 |
| <KenSharp> Well no.... | 13:48.49 |
| <KenSharp> But PCL6 isn't supposed to accept anything except ASCII as 'text' is it ? I don't know if you can tell it to use some fancy encoding or not | 13:49.17 |
| <byrnejb> I will get a test file that I can share off the legacy host and attach it here. It will take a few minutes. | 13:50.01 |
| <chrisl> You can use different/custom "symbol tables", but you'd need to use PCL to set those | 13:50.04 |
| <byrnejb> I think I see what may be the problem. These reports were originally set up to print on Laserjets and they include a PCL setup string at the start to set the font to line-printer 16.6 and landscape mode to print on letter size sheets. | 13:55.18 |
| <byrnejb> I will strip that out and see if that affects anything. | 13:55.50 |
| <KenSharp> Well it still should not crash.... | 13:56.33 |
| <byrnejb> I agree. But I do not need OCR if gpcl6 creates searchable pdfs from straight ascii. | 13:57.20 |
| <KenSharp> Well it does for me and Chris 🙂 | 13:57.38 |
| <byrnejb> And so I hope for me too if it turns out that the issue is the PCL frontend commands. | 13:58.10 |
| <chrisl> The OCR stuff made no difference for me, anyway. I suspect because it was using a built-in font with no customisations | 14:05.52 |
| <byrnejb> Yes, that was it. If I strip out the PCL command then the pdf is searchable. | 14:06.10 |
| <KenSharp> OK well that's good 🙂 | 14:06.22 |
| <byrnejb> very | 14:06.29 |
| <byrnejb> That I can manage. | 14:06.39 |
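Stripping the PCL setup string, as tried above, can be scripted as a diagnostic step. The following is a minimal sketch in Python, not part of the original conversation. It assumes the setup string is the one quoted later in this log (two parameterized sequences, each starting with ESC, byte 0x1B) and that the usual PCL 5 layout applies: ESC, a prefix character, a group character, then parameters ending in an uppercase terminator. The function name `strip_pcl_sequences` and the sample report text are hypothetical. Two-character sequences such as a printer reset are deliberately not handled.

```python
import re

# Parameterized PCL sequence: ESC, a prefix character ('!'..'/'), a group
# character ('`'..'~'), then parameters up to a terminating character in
# the range '@'..'^' (the uppercase letters fall inside this range).
PCL_SEQUENCE = re.compile(rb"\x1b[!-/][`-~][^\x1b]*?[@-^]")


def strip_pcl_sequences(data):
    """Return data with parameterized PCL escape sequences removed."""
    return PCL_SEQUENCE.sub(b"", data)


# Hypothetical sample: the setup string from this log followed by report text.
sample = b"\x1b&l1o2a8d4e60F\x1b(s16.66HSTOCK REPORT\n"
print(strip_pcl_sequences(sample))
```

Working on bytes rather than text avoids any re-encoding of the report data before it reaches gpcl6.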
| <chrisl> line-printer probably has some hokey, special encoding | 14:11.39 |
| <chrisl> Actually, it probably ends up in the PDF as a Type3 with a custom encoding, which would do it | 14:12.22 |
| <byrnejb> Well, the pdf documents become searchable but the reports no longer fit on the page. | 14:12.24 |
| <KenSharp> Straight text input to PCL just splats the text on the page, no formatting | 14:13.15 |
| <chrisl> You'll probably want to change the PCL so it selects Courier | 14:13.27 |
| <KenSharp> I presume that your prefix was selecting things like font and size, and possibly as Chris says 'other stuff' such as an encoding | 14:13.42 |
| <byrnejb> this is the entire pcl command string: &l1o2a8d4e60F(s16.66H | 14:14.52 |
| <KenSharp> There will be ESC characters in there somewhere | 14:15.07 |
| <KenSharp> You need to paste it as a hex dump otherwise it can't be decoded reliably | 14:15.32 |
| <byrnejb> I am not sure that this is what you need but I cut the pcl command string from the source file and tried to convert it using hexdump: | 14:20.37 |
| <byrnejb> echo -n '&l1o2a8d4e60F(s16.66H ' | hexdump -e '1/1 "%01x"' | 14:20.41 |
| <byrnejb> 266c316f326138643465363046287331362e36* | 14:20.42 |
| <byrnejb> 4820* | 14:20.43 |
| <KenSharp> Seriously ? No 0x1B ? I'm not certain that's even legal | 14:21.14 |
| <byrnejb> echo -n '&l1o2a8d4e60F(s16.66H ' | hexdump -b | 14:21.55 |
| <byrnejb> 0000000 046 154 061 157 062 141 070 144 064 145 066 060 106 050 163 061 | 14:21.56 |
| <byrnejb> 0000010 066 056 066 066 110 040 040 | 14:21.57 |
| <byrnejb> 0000017 | 14:21.58 |
| <byrnejb> I am not familiar with this stuff | 14:22.15 |
| <KenSharp> I don't want to sound like I'm getting at you, but.... You've done 'echo' of a string, where did you get the string from ? Is it possible that you didn't copy some invisible characters ? | 14:22.59 |
| <byrnejb> As I stated. I copied that string directly from the source file and pasted it into a single quote delimited string | 14:23.48 |
| <byrnejb> You are not getting at me. I do not have a great deal of experience working at this level of detail. | 14:24.26 |
| <KenSharp> Again, I don't like to sound doubtful, but &l1o isn't valid PCL whereas ESC&l1o is. | 14:24.35 |
| <byrnejb> Yes, I know. The string in the source file looks like this: ^[&l1o2a8d4e60F^[(s16.66H | 14:26.41 |
| <Robin_Watts> Hex 1b is ESC, right? | 14:26.59 |
| <KenSharp> Right, that '^[' will be a 0x1B, an ESC character | 14:27.06 |
| <Robin_Watts> d'oh. I misread. Ignore me, sorry. | 14:27.16 |
| <byrnejb> But the ^[ non-displaying character (which is the escape character I presume) does not appear to copy through the clipboard | 14:27.28 |
| <KenSharp> Yes, very possibly, it's why I kept pushing, sorry.... | 14:27.47 |
| <KenSharp> I was fairly convinced there needed to be some ESC characters | 14:27.59 |
| <byrnejb> no problem. you are helping me | 14:28.01 |
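The hex dumps earlier in this exchange lost the ESC bytes because the string went through the clipboard. One way around that is to build the string with explicit 0x1B bytes and dump it. This is a minimal sketch in Python and was not part of the original conversation. It assumes the setup string is exactly what the source file shows, with each `^[` standing for a single ESC byte. The variable names are illustrative only.

```python
# Build the PCL setup string with explicit ESC (0x1B) bytes rather than
# pasting it through a clipboard, which dropped them in the dumps above.
ESC = b"\x1b"
setup = ESC + b"&l1o2a8d4e60F" + ESC + b"(s16.66H"

# Hex dump with a space between bytes so control characters are visible.
print(setup.hex(" "))

# Count the ESC bytes as a quick sanity check on the string.
print("ESC count:", setup.count(ESC))
```

The same check can be run against the first bytes of a real report file, opened in binary mode, to confirm what the legacy host actually emits.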
| <KenSharp> Now I need to go try and find this stuff in the reference manual, which is never easy | 14:28.21 |
| <byrnejb> My objective is to get the existing report format produced as a searchable pdf. A desirable outcome is one that does not involve reprogramming on the legacy host. | 14:29.03 |
| <KenSharp> Well it 'looks like' the ESC&l1o should set the orientation of the page | 14:29.44 |
| <KenSharp> TBH I'm struggling with the rest, PCL is not my area really | 14:31.42 |
| <KenSharp> aha, ESC(s16.66H selects the "resident 16.66 pitch line printer font" | 14:33.04 |
| <byrnejb> This is the actual code that produces the pcl string: | 14:33.16 |
| <byrnejb> ;Define ASCII <esc> character | 14:33.21 |
| <byrnejb> DEFINE QZD-ESC-I INTEGER SIZE 2 = 27 | 14:33.22 |
| <byrnejb> DEFINE QZD-ESC-C CHARACTER SIZE 1 = (CHARACTERS(QZD-ESC-I))[2:1] | 14:33.24 |
| <byrnejb> ;Define PCL printer format strings for printing on 8.5x11 letter | 14:33.25 |
| <byrnejb> DEFINE QZD-ESC-SU CHARACTER SIZE 80 = & | 14:33.26 |
| <byrnejb> ;Set to Landscape 8LPI 16.6 CPI 60 LPP (132 landscape orientation) | 14:33.27 |
| <byrnejb> QZD-ESC-C + "&l1o2a8d4e60F" + QZD-ESC-C + "(s16.66H" | 14:33.29 |
| <byrnejb> ;Set to Portrait 8LP 16.6 CPI (132 portrait orientation) 80 LPP | 14:33.30 |
| <byrnejb> ; QZD-ESC-C + "&l0o2a8d4e80F" + QZD-ESC-C + "(s16.66H" | 14:33.31 |
| <KenSharp> Well I'd suggest that it's the selection of the line printer font which is causing your encoding problem. Probably because it's doing something weird and not using a real font at all, some kind of type 3 font. | 14:35.04 |
| <KenSharp> You could try just leaving off the QZD-ESC-C + "(s16.66H" which will leave the rest of the setup in place | 14:35.52 |
| <byrnejb> hmm. well the reports will not fit on letter size paper if the font used is any larger | 14:36.11 |
| <KenSharp> I've no idea what the font will be | 14:38.30 |
| <KenSharp> But you either need to use a built-in font or download a font. Downloading a font is not really an option | 14:38.59 |
| <byrnejb> the printer font is built in to the printers. All laserjet compatible printers that I have used have it. | 14:39.49 |
| <KenSharp> The reason I suggest leaving off the font selection is simply to isolate the problem | 14:40.02 |
| <byrnejb> kk will try that | 14:40.22 |
| <KenSharp> Yes the line printer font is indeed built in. But it's giving you a result you don't want (wrong encoding) | 14:40.27 |
| <KenSharp> So you need to use a different font, and select a size that will fit. But first let's make sure that you get searchable text | 14:41.03 |
| <byrnejb> Removing the font selection and leaving the rest results in a non-searchable pdf | 14:43.05 |
| <KenSharp> Well that's kind of annoying, you did remove all the copies of that ? Because I see it listed twice in your text above | 14:43.43 |
| <byrnejb> also, the resulting pdf truncates the report lines. | 14:44.00 |
| <byrnejb> however, the orientation is correct | 14:44.11 |
| <KenSharp> Well, the best I can suggest is you remove commands one by one till you find out which one is causing the text to be non-searchable | 14:44.41 |
| <KenSharp> I'm not in a position to do that easily | 14:44.48 |
| <byrnejb> I will do so, but I think that anything with an escaped character is likely to give the same results. I already removed the whole string and that gave a searchable pdf but the resulting display is unusable | 14:52.38 |
| <byrnejb> the data is there but its layout is not easy to read | 14:53.14 |
| <KenSharp> Well each ESC introduces a new PCL command, so your string contains several. I imagine most of them are benign (media selection, orientation, etc). So if you find out which one is causing the problem then we can maybe think about a way to work around it. However none of us currently here are PCL experts so it's hard for us to give you guidance in advance. | 14:54.21 |
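Removing commands one at a time is easier once the setup string is split into individual commands. The following is a minimal sketch in Python and was not part of the original conversation. It assumes the PCL 5 convention for combined sequences: commands sharing the same two characters after ESC are chained, with lowercase parameter letters and an uppercase letter closing the chain. Under that assumption `ESC&l1o2a8d4e60F` carries five commands, not one. The function name `split_pcl_setup` is hypothetical.

```python
import re

ESC = "\x1b"


def split_pcl_setup(setup):
    """Expand combined PCL escape sequences into one sequence per command."""
    commands = []
    # Each sequence: ESC, a prefix character, a group character, then a run
    # of value+letter pairs that lasts until the next ESC.
    for prefix, body in re.findall(r"\x1b([!-/][`-~])([^\x1b]*)", setup):
        for value, letter in re.findall(r"([-+]?[0-9.]*)([a-zA-Z@])", body):
            # Written on its own, every command ends in an uppercase letter.
            commands.append(ESC + prefix + value + letter.upper())
    return commands


setup = ESC + "&l1o2a8d4e60F" + ESC + "(s16.66H"
commands = split_pcl_setup(setup)

# Build one trial prefix per command, each with exactly one command left out.
for i, cmd in enumerate(commands):
    trial = "".join(c for j, c in enumerate(commands) if j != i)
    print("without", repr(cmd), "->", repr(trial))
```

Each trial string can be prepended to the plain report text and fed to gpcl6 to see which command, if any, changes whether the PDF is searchable.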
| <byrnejb> I understand. You have been most helpful in assisting me in pinning down where the problem lies. | 14:55.04 |
| <byrnejb> 63 | 14:55.18 |
| <byrnejb> sry wrong screen | 14:55.34 |
| <KenSharp> 🙂 | 14:57.28 |
| <byrnejb> well, I discovered that the problem is not with gpcl6. It does create searchable pdfs from the printer output. | 19:53.56 |
| <byrnejb> The problem is caused by rotating the pdf pages in the viewer (atril). | 19:54.34 |
| <byrnejb> The way the report pdfs are produced has the text at right angles to the way they need to be viewed. In short they are produced for a US letter-sized sheet, which is correct for hard copy but awkward for pdf viewing. | 19:56.34 |