Log of #ghostscript at irc.freenode.net.

20220617 
artifexirc-bot <byrnejb> Is this the place to ask questions about gpcl6?13:14.09 
  <KenSharp> It's as good as any13:17.58 
  <KenSharp> I can't promise an answer but we'll certainly try13:18.16 
  <byrnejb> OK.  I am running version 9.54.0 that I built from source on a FreeBSD host, version 12.?.13:21.38 
  <byrnejb> I am using this with socat to create a virtual pdf printer for a legacy host that produces old fashioned green bar 132 column reports.13:22.33 
  <byrnejb> The setup works fine.13:22.49 
  <KenSharp> OK.....13:22.54 
  <byrnejb> However, the pdf documents it produces are not searchable.13:23.21 
  <KenSharp> Very possibly not13:23.29 
  <byrnejb> This is a feature that is desired.13:23.34 
  <KenSharp> There is no guarantee of searchability in **any** PDF file.13:23.41 
  <KenSharp> PCL cannot supply ToUnicode information for its text13:23.57 
  <KenSharp> (There is simply no defined mechanism for it in the language)13:24.08 
  <byrnejb> I looked into the pdfwrite docs on https://ghostscript.com/doc/current/VectorDevices.htm#PDF13:24.09 
  <KenSharp> So there is no way to know that (eg) character code 0x41 is a U+004113:24.31 
  <KenSharp> Have you got an example PCL file ?13:24.48 
  <byrnejb> This makes mention of a -s option called UseOCR13:25.15 
  <KenSharp> Umm yes.....13:25.24 
  <KenSharp> That's a somewhat experimental feature13:25.30 
  <KenSharp> To use it you will need to build with Tesseract13:25.41 
  <byrnejb> however that throws a segmentation fault when used13:25.42 
  <KenSharp> And Leptonica13:25.45 
  <KenSharp> Well it should not seg fault13:25.59 
  <KenSharp> Again, do you have an example file and a command line ?13:26.12 
  <byrnejb> Well, here is the situation.  I start with an ASCII 7bit data stream that is piped directly to gpcl613:27.33 
  <KenSharp> Hmm OK so straight text, PCL will accept that. No idea what font you get13:28.01 
  <byrnejb> here is the command line: socat -d -d -lf/var/log/lpd_socat_${HPNP} TCP4-LISTEN:${HPNP},bind=${IPNP},fork,reuseaddr,su=hp3000 SYSTEM:'gpcl6 -dNOSAFE -dNOPAUSE -sDEVICE=pdfwrite -sUseOCR=Always -sOutputFile=/var/spool/hp3000/np${HPNP}/HP3000-RPT-$(date -Iseconds).pdf -' &13:28.43 
  <byrnejb> Courier is fine.  These are just stocktab reports.13:29.02 
  <KenSharp> OK I don't really speak bash so I'm trying to get some assistance here 🙂13:29.14 
  <KenSharp> Does any text input cause a seg fault ?13:29.40 
  <KenSharp> Because if so I can just use a random text file here13:29.50 
  <byrnejb> here"   network connection  piped to gpcl6. gpcl options are: gpcl6 -dNOSAFE -dNOPAUSE -sDEVICE=pdfwrite -sUseOCR=Always -sOutputFile=whatever -' # ( - read from stdin)13:31.50 
  <byrnejb> I believe so but I will test13:32.06 
  <KenSharp> OK well I'm going to go and experience the joys of trying to build this, I'll need a few minutes.13:32.36 
  <KenSharp> BTW the parameter is NOSAFER (has a trailing R) and I really wouldn't use that, especially for PCL13:33.06 
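The point about NOSAFER can be sketched as a small wrapper that builds the gpcl6 command line without it. This is only an illustration: the function names `gpcl6_args` and `convert` are hypothetical, the flags are the ones quoted in this log, and it assumes `gpcl6` is on the PATH. The sketch leaves out both the misspelled `-dNOSAFE` and `-sUseOCR=Always`.

```python
import subprocess
from pathlib import Path


def gpcl6_args(output_pdf: str) -> list[str]:
    """Build a gpcl6 command line that reads its job from stdin.

    Only flags quoted in the log are used; -dNOSAFE and -sUseOCR
    are deliberately left out.
    """
    return [
        "gpcl6",
        "-dNOPAUSE",
        "-sDEVICE=pdfwrite",
        f"-sOutputFile={output_pdf}",
        "-",  # read the job from stdin
    ]


def convert(report: Path, output_pdf: str) -> None:
    """Feed a report file to gpcl6 on stdin (hypothetical helper)."""
    with report.open("rb") as job:
        subprocess.run(gpcl6_args(output_pdf), stdin=job, check=True)
```

Calling `convert(Path("QZARATBD"), "test2.pdf")` would correspond to the `cat QZARATBD | gpcl6 ...` test in the log, minus the two dropped options.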
  <byrnejb> Yes.  A straight cat of an ASCII text file taken from the legacy host causes a segmentation fault13:33.35 
  <byrnejb> cat QZARATBD | gpcl6 -dNOSAFE -dNOPAUSE -dHaveTrueTypes=true -sDEVICE=pdfwrite  -sUseOCR=Always -sOutputFile=test2.pdf -13:33.44 
  <byrnejb> Segmentation fault13:33.45 
  <KenSharp> OK I'm going to need a few minutes13:34.05 
  <KenSharp> Before I can try this13:34.10 
  <byrnejb> that is fine. I appreciate the help.13:34.22 
  <KenSharp> BTW where did you get Tesseract and Leptonica from ?13:34.38 
  <chrisl> And the training data <sigh>13:35.04 
  <KenSharp> Umm oh yes13:35.11 
  <KenSharp> forgot about that13:35.16 
  <chrisl> If I do that, I get a searchable PDF13:40.41 
  <byrnejb> I have neither to my knowledge.  I was simply trying to understand the documentation of a utility of which I have next to no experience.13:41.12 
  <KenSharp> Oh, well if you haven't put the Tesseract and Leptonica sources in the ghostpdl tree before you build it, then OCR won't work13:41.46 
  <KenSharp> Tesseract is the OCR engine, we need that for it to do anything13:42.03 
  <byrnejb> Ah.13:42.16 
  <chrisl> The tesseract and leptonica sources were in the release archives13:42.21 
  <KenSharp> @chrisl is that with Leptonica and Tesseract ?13:42.27 
  <chrisl> Yes13:42.33 
  <KenSharp> OK well if they were in the release archive, then they should be present13:43.01 
  <KenSharp> If it works for you then I'm stumped really13:43.12 
  <byrnejb> But for me the resort to OCR was not planned, just stumbled upon in an effort to discover how to get gpcl6 to produce searchable pdf documents.  It seemed to me that given the basic input is in ASCII 7bit that it should just work.  If I use A2pdf then the pdf document it produces from the same source file is searchable.13:45.38 
  <chrisl> But there's no ToUnicode in there, or anything. And it's just using NimbusMono-Regular so it's basically ASCII character codes13:45.41 
  <KenSharp> For me a simple 7-bit ASCII file sent through gpcl6 to pdfwrite results in searchable text (ASCII character codes) without any special action13:45.44 
  <KenSharp> I just used a handy log file I had and it came out searchable13:46.24 
  <KenSharp> As @chrisl says, for me it's all in NimbusMono-Regular, so Courier in effect13:47.14 
  <KenSharp> But it seems to work adequately13:47.23 
  <KenSharp> byrnejb your best bet is probably to send us an input file we can try, otherwise we can't be sure we are duplicating what you are doing13:48.08 
  <chrisl> It's certainly never going to work with anything Unicode....13:48.41 
  <KenSharp> Well no....13:48.49 
  <KenSharp> But PCL6 isn't supposed to accept anything except ASCII as 'text' is it ? I don't know if you can tell it to use some fancy encoding or not13:49.17 
  <byrnejb> I will get a test file that I can share off the legacy host and attach it here.  It will take a few minutes.13:50.01 
  <chrisl> You can use different/custom "symbol tables", but you'd need to use PCL to set those13:50.04 
  <byrnejb> I think I see what may be the problem.  These reports were originally set up to print on Laserjets and they include a PCL setup string at the start to set the font to line-printer 16.6 and landscape mode to print on letter size sheets.13:55.18 
  <byrnejb> I will strip that out and see if that affects anything.13:55.50 
  <KenSharp> Well it still should not crash....13:56.33 
  <byrnejb> I agree. But I do not need OCR if gpcl6 creates searchable pdfs from straight ascii.13:57.20 
  <KenSharp> Well it does for me and Chris 🙂13:57.38 
  <byrnejb> And so I hope for me too if it turns out that the issue is the PCL frontend commands.13:58.10 
  <chrisl> The OCR stuff made no difference for me, anyway. I suspect because it was using a built-in font with no customisations14:05.52 
  <byrnejb> Yes, that was it. If I strip out the PCL command then the pdf is searchable.14:06.10 
  <KenSharp> OK well that's good 🙂14:06.22 
  <byrnejb> very14:06.29 
  <byrnejb> That I can manage.14:06.39 
  <chrisl> line-printer probably has some hoaky, special encoding14:11.39 
  <chrisl> Actually, it probably ends up in the PDF as a Type3 with a custom encoding, which would do it14:12.22 
  <byrnejb> Well, the pdf documents become searchable but the reports no longer fit on the page.14:12.24 
  <KenSharp> Straight text input to PCL just splats the text on the page, no formatting14:13.15 
  <chrisl> You'll probably want to change the PCL so it selects Courier14:13.27 
  <KenSharp> I presume that your prefix was selecting things like font and size, and possibly as Chris says 'other stuff' such as an encoding14:13.42 
  <byrnejb> this is the entire pcl command string: &l1o2a8d4e60F(s16.66H14:14.52 
  <KenSharp> There will be ESC characters in there somewhere14:15.07 
  <KenSharp> You need to paste it as a hex dump otherwise it can't be decoded reliably14:15.32 
  <byrnejb> I am not sure that this is what you need but I cut the pcl command string from the source file and tried to convert it using hexdump:14:20.37 
  <byrnejb> echo -n '&l1o2a8d4e60F(s16.66H  ' | hexdump -e '1/1 "%01x"'14:20.41 
  <byrnejb> 266c316f326138643465363046287331362e36*14:20.42 
  <byrnejb> 4820*14:20.43 
  <KenSharp> Seriously ? No 0x1B ? I'm not certain that's even legal14:21.14 
  <byrnejb> echo -n '&l1o2a8d4e60F(s16.66H  ' | hexdump -b14:21.55 
  <byrnejb> 0000000 046 154 061 157 062 141 070 144 064 145 066 060 106 050 163 06114:21.56 
  <byrnejb> 0000010 066 056 066 066 110 040 04014:21.57 
  <byrnejb> 000001714:21.58 
  <byrnejb> I am not familiar with this stuff14:22.15 
  <KenSharp> I don't want to sound like I'm getting at you, but.... You've done 'echo' of a string, where did you get the string from ? Is it possible that you didn't copy some invisible characters ?14:22.59 
  <byrnejb> As I stated.  I copied that string directly from the source file and pasted it into a single quote delimited string14:23.48 
  <byrnejb> You are not getting at me.  I do not have a great deal of experience working at this level of detail.14:24.26 
  <KenSharp> Again, I don't like to sound doubtful, but &l1o isn't valid PCL whereas ESC&l1o is.14:24.35 
  <byrnejb> Yes, I know.  The string in the source file looks like this:  ^[&l1o2a8d4e60F^[(s16.66H14:26.41 
  <Robin_Watts> Hex 1b is ESC, right?14:26.59 
  <KenSharp> Right, that '^[' will be a 0x1B, an ESC character14:27.06 
  <Robin_Watts> d'oh. I misread. Ignore me, sorry.14:27.16 
  <byrnejb> But the ^[ non-displaying character (which is the escape character I presume) does not appear to copy through the clipboard14:27.28 
  <KenSharp> Yes, very possibly, it's why I kept pushing, sorry....14:27.47 
  <KenSharp> I was fairly convinced there needed to be some ESC characters14:27.59 
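The lost ESC bytes can be made visible by dumping raw bytes instead of pasting through the clipboard. A minimal sketch, assuming the setup string is exactly what the caret notation above shows (`^[` standing for 0x1B); the helper name `hex_dump` is hypothetical. In practice the bytes would be read from the report file itself in binary mode.

```python
# Setup string as written in the log, with ^[ replaced by the real ESC byte.
setup = b"\x1b&l1o2a8d4e60F\x1b(s16.66H"


def hex_dump(data: bytes) -> str:
    """Return the bytes as space-separated two-digit hex values."""
    return " ".join(f"{b:02x}" for b in data)


# Each 1b in the output marks an ESC that a clipboard paste may drop.
print(hex_dump(setup))
```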
  <byrnejb> no problem.  you are helping me14:28.01 
  <KenSharp> Now I need to go try and find this stuff in the reference manual, which is never easy14:28.21 
  <byrnejb> My objective is to get the existing report format produced as a searchable pdf.  A desirable outcome is one that does not involve reprogramming on the legacy host.14:29.03 
  <KenSharp> Well it 'looks like' the ESC&l1o should set the orientation of the page14:29.44 
  <KenSharp> TBH I'm struggling with the rest, PCL is not my area really14:31.42 
  <KenSharp> aha, ESC(s16.66H selects the "resident 16.66 pitch line printer font"14:33.04 
  <byrnejb> This is the actual code that produces the pcl string:14:33.16 
  <byrnejb> ;Define ASCII <esc> character14:33.21 
  <byrnejb> DEFINE QZD-ESC-I INTEGER SIZE 2 = 2714:33.22 
  <byrnejb> DEFINE QZD-ESC-C CHARACTER SIZE 1 = (CHARACTERS(QZD-ESC-I))[2:1]14:33.24 
  <byrnejb> ;Define PCL printer format strings for printing on 8.5x11 letter14:33.25 
  <byrnejb> DEFINE QZD-ESC-SU CHARACTER SIZE 80 =                              &14:33.26 
  <byrnejb> ;Set to Landscape 8LPI 16.6 CPI 60 LPP (132 landscape orientation)14:33.27 
  <byrnejb> QZD-ESC-C + "&l1o2a8d4e60F" + QZD-ESC-C + "(s16.66H"14:33.29 
  <byrnejb> ;Set to Portrait 8LP 16.6 CPI (132 portrait orientation) 80 LPP14:33.30 
  <byrnejb> ;      QZD-ESC-C + "&l0o2a8d4e80F" + QZD-ESC-C + "(s16.66H"14:33.31 
  <KenSharp> Well I'd suggest that it's the selection of the line printer font which is causing your encoding problem. Probably because it's doing something weird and not using a real font at all, some kind of type 3 font.14:35.04 
  <KenSharp> You could try just leaving off the QZD-ESC-C + "(s16.66H" which will leave the rest of the setup in place14:35.52 
  <byrnejb> hmm.  well the reports will not fit on letter size paper if the font used is any larger14:36.11 
  <KenSharp> I've no idea what the font will be14:38.30 
  <KenSharp> But you either need to use a built-in font or download a font. Downloading a font is not really an option14:38.59 
  <byrnejb> the printer font is builtin to the printers.  All laserjet compatible printers that I have used have it.14:39.49 
  <KenSharp> The reason I suggest leaving off the font selection is simply to isolate the problem14:40.02 
  <byrnejb> kk will try that14:40.22 
  <KenSharp> Yes the line printer font is indeed built in. But it's giving you a result you don't want (wrong encoding)14:40.27 
  <KenSharp> So you need to use a different font, and select a size that will fit. But first let's make sure that you get searchable text14:41.03 
  <byrnejb> Removing the font selection and leaving the rest results in a non-searchable pdf14:43.05 
  <KenSharp> Well that's kind of annoying, you did remove all the copies of that ? Because I see it listed twice in your text above14:43.43 
  <byrnejb> also, the resulting pdf truncates the report lines.14:44.00 
  <byrnejb> however, the orientation is correct14:44.11 
  <KenSharp> Well, the best I can suggest is you remove commands one by one till you find out which one is causing the text to be non-searchable14:44.41 
  <KenSharp> I'm not in a position to do that easily14:44.48 
  <byrnejb> I will do so, but I think that anything with an escaped character is likely to give the same results.  I already removed the whole string and that gave a searchable pdf but the resulting display is unusable14:52.38 
  <byrnejb> the data is there but its layout is not easy to read14:53.14 
  <KenSharp> Well each ESC introduces a new PCL command, so your string contains several. I imagine most of them are benign (media selection, orientation, etc). So if you find out which one is causing the problem then we can maybe think about a way to work around it. However none of us currently here are PCL experts so it's hard for us to give you guidance in advance.14:54.21 
  <byrnejb> I understand.  You have been most helpful in assisting me in pinning down where the problem lies.14:55.04 
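Removing commands one by one is easier once the setup string is split into individual commands. The sketch below is an illustration, not something anyone in the channel ran: the function name `split_pcl` is hypothetical, and it relies on my understanding of the PCL 5 convention that, within one escape sequence, a lowercase letter continues a combined command and an uppercase letter ends it. It handles only a setup string like the one quoted above, not report text or two-character sequences.

```python
import re

ESC = "\x1b"


def split_pcl(setup: str) -> list[str]:
    """Expand combined parameterized sequences into single commands."""
    commands = []
    for seq in setup.split(ESC):
        if len(seq) < 2:
            continue  # empty piece before the first ESC, or too short
        prefix, rest = seq[:2], seq[2:]  # e.g. "&l" and "1o2a8d4e60F"
        # Each part is a numeric value followed by one letter.
        for value, letter in re.findall(r"([-+0-9.]*)([A-Za-z])", rest):
            commands.append(f"{ESC}{prefix}{value}{letter.upper()}")
    return commands


for command in split_pcl("\x1b&l1o2a8d4e60F\x1b(s16.66H"):
    print(repr(command))
```

Dropping one entry from the returned list and joining the rest gives a reduced setup string to prepend to a test report.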
  <byrnejb> 6314:55.18 
  <byrnejb> sry wrong screen14:55.34 
  <KenSharp> 🙂14:57.28 
  <byrnejb> well, I discovered that the problem is not with gpcl6.  It does create searchable pdfs from the printer output.19:53.56 
  <byrnejb> The problem is caused by rotating the pdf pages in the viewer (atril).19:54.34 
  <byrnejb> The way the report pdfs are produced has the text at right angles to the way they need to be viewed.  In short they are produced for a US letter-sized sheet, which is correct for hard copy but awkward for pdf viewing.19:56.34 