| <<<Back 1 day (to 2022/05/09) | Fwd 1 day (to 2022/05/11) >>> | 20220510 |
artifexirc-bot | <Knaldgas> Debugging an issue (on AIX) I found a gs process listening on TCP/IP port 1237. ps -ef : /usr/bin/gs -P- -dSAFER -dCompatibilityLevel=1.4 -sPAPERSIZE=a4 -q -P- -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sstdout=%stderr -sOutputFile=xxx.pdf -P- -dSAFER -dCompatibilityLevel=1.4 -sPAPERSIZE=a4 -c .setpdfwrite -f xxx (probably truncated here). | 06:01.04 |
| <Knaldgas> I wasn't aware that gs could start listening for TCP/IP connections?! - Unfortunately in this case it seized a port that another process needed. Any thoughts? | 06:02.21 |
| <KenSharp> As far as I know Ghostscript (as supplied by us) has no ability to listen to TCP/IP at all. | 06:59.00 |
| <Knaldgas> KenSharp, right, that concurs with what I have found on that issue (nothing). But that leaves me quite baffled as to what I saw... The OS identified the gs process as "LISTENING" on that port, and when I killed the process, the port was released. | 07:11.14 |
| <KenSharp> I've no idea why that would be the case. | 07:11.42 |
| <Knaldgas> Odd | 07:11.43 |
| <Knaldgas> KenSharp, thanks a lot for your feedback! :-) | 07:12.02 |
| <KenSharp> You could try building Ghostscript from the source we supply and see if it behaves the same | 07:12.33 |
| <KenSharp> Rather than using an OS package | 07:12.41 |
| <Knaldgas> The gs process was started by ps2pdf package, not that I understand more from that. I'm not sure if I can get gs to repeat what it did, but building it from your sources might be on the agenda - thanks again :) | 07:18.13 |
| <KenSharp> Unless there's something odd going on, ps2pdf is nothing more than a shell script (a very over-complicated shell script) which starts GS with a couple of parameters. You can easily get the same result by just running Ghostscript directly | 07:19.14 |
| <Knaldgas> KenSharp, could try that, thanks | 07:20.04 |
| <KenSharp> We supply a ps2pdf script in the source somewhere, I was under the impression that was what the package maintainers use | 07:20.07 |
| <KenSharp> Yeah in ghostpdl/lib is ps2pdf which is the relevant script | 07:20.43 |
| <KenSharp> Yes, as I thought. ps2pdf calls another shell script based on the PDF version required, so usually it calls ps2pdf14, which in turn calls ps2pdfwr with -dCompatibilityLevel=1.4. ps2pdfwr adds on some more options: -P- -dSAFER -q -P- (again) -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sstdout=%stderr -sOutputFile= | 07:23.44 |
| <KenSharp> That is, of course, all assuming that it's using the ps2pdf script | 07:24.13 |
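The chain described above (ps2pdf, then ps2pdf14, then ps2pdfwr) can be sketched as a dry run. This is a minimal sketch, assuming the option list quoted in the conversation; the exact options vary between Ghostscript versions. `build_gs_cmd` is a hypothetical helper, not part of Ghostscript, and `input.ps` and `output.pdf` are placeholder names. It only prints the command and does not run Ghostscript.

```shell
#!/bin/sh
# Dry-run sketch: print the Ghostscript command that the
# ps2pdf -> ps2pdf14 -> ps2pdfwr chain is described as building.
# build_gs_cmd is a hypothetical helper, not part of Ghostscript.
build_gs_cmd() {
    infile=$1
    outfile=$2
    # -dCompatibilityLevel=1.4 is what ps2pdf14 is described as adding;
    # the remaining options are the ones attributed to ps2pdfwr.
    printf '%s\n' "gs -P- -dSAFER -dCompatibilityLevel=1.4 -q -P- -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sstdout=%stderr -sOutputFile=$outfile $infile"
}

# Print the command for inspection (file names are placeholders).
build_gs_cmd input.ps output.pdf
```

Printing the command first makes it easy to compare against what `ps -ef` shows for a suspect process before running Ghostscript directly. File names containing spaces would need quoting if the printed line is pasted into a shell.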
| <chrisl> We don't have any networking code in Ghostscript, so I'd have to guess it's a third party library. Might be worth looking at what dynamic libs are linked. | 08:13.47 |
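One way to follow up on this is to confirm which sockets are listening on the port, then inspect the shared libraries the gs binary links against. The following is a minimal sketch: `listening_on_port` is a hypothetical helper that filters `netstat -an` style output, assuming the local address is in the fourth column (as in the common BSD and Linux layouts). The sample lines are made up for illustration and are not taken from the reported system.

```shell
#!/bin/sh
# Hypothetical helper: read `netstat -an` style output on stdin and
# print the lines in LISTEN state whose local port equals $1.
# Assumes the local address is column 4 and ends in ".PORT" or ":PORT".
listening_on_port() {
    port=$1
    awk -v port="$port" '
        /LISTEN/ {
            # Split the local address on "." or ":"; the last piece is the port.
            n = split($4, parts, /[.:]/)
            if (parts[n] == port) print
        }'
}

# Demo on made-up sample lines (not real output from the reported system).
listening_on_port 1237 <<'EOF'
tcp4       0      0  *.1237        *.*           LISTEN
tcp4       0      0  *.22          *.*           LISTEN
tcp        0      0 0.0.0.0:1237   0.0.0.0:*     LISTEN
EOF

# Real use (assumes netstat and ldd are available on the system):
#   netstat -an | listening_on_port 1237
#   ldd "$(command -v gs)"    # list shared libraries linked into gs
```

If a networking-related library shows up in the linked libraries, that would support the third-party-library guess; the filter alone only confirms that something is listening on the port.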
| <qwertynik> Thanks @mvrhel for remembering the request and tagging me here. Appreciate it. Had taken a look at the link, looks like instead of BlackText will need to use those params. | 09:38.13 |
| <qwertynik> However, would text using Type 3 fonts be covered by these options? | 09:38.13 |
| <qwertynik> Thanks to Corona, businesses will digitize even faster. And naturally PDF documents will gain even higher traction. | 09:45.02 |
| <qwertynik> In terms of rendering and filling forms, the PDF format works great. However, extracting data, which it was not originally meant for anyway, is not so straightforward. | 09:45.03 |
| <qwertynik> Given the presence of experts here, wanted to understand if there would be changes to the spec to make data extraction *simpler*. Also, are there any social media handles such as this to follow to keep track of the latest developments in the PDF world - changes in specifications, availability of tools/services etc? | 09:45.06 |
| <qwertynik> @here | 09:47.20 |
| <Robin_Watts> Changes to the spec? Unlikely, IMHO. | 09:57.51 |
| <Robin_Watts> How well you can extract information from a PDF file depends, largely, on how well the PDF file is constructed in the first place. | 09:58.28 |
| <Robin_Watts> Most PDF construction programs are satisfied with getting stuff looking right - being able to be searched is a bonus. Actually being able to extract the data meaningfully is a very poor third place. | 10:01.25 |
| <Robin_Watts> If you follow the PDF spec then you can produce PDFs where the raw text can be extracted fairly well. Ghostscript, for example, does a good job of making PDFs where the text can be extracted as text, rather than gobbledegook. | 10:03.10 |
| <Robin_Watts> What's much harder is to make a PDF whereby the structure of a document is extractable (this text story flows down this column, then this one, then we have a table, then it continues on page 3 etc) | 10:04.03 |
| <Robin_Watts> I think the StructTree stuff is supposed to allow that kind of information to be encoded - but the problem is most PDF generators aren't given that information, so they can't hope to encode it into the generated PDF file. | 10:04.50 |
| <Robin_Watts> And if the info is rarely there, PDF consumers don't bother to implement the code to make use of it if it is. | 10:05.25 |
| <Robin_Watts> so if no one uses it, why generate it? It's a vicious circle. | 10:05.40 |
| <Robin_Watts> And if you're scanning/OCRing documents to get your PDF, you can't ever hope to accurately have that information either. | 10:06.36 |
| <Robin_Watts> So people tend to just work from first principles and try to guess stuff on extract. | 10:06.53 |
| <qwertynik> This would be great to have. However, having semantic representation of objects - headings, lists, paragraphs should be a good starting step. | 10:15.47 |
| <qwertynik> Had never heard of it. Thanks for mentioning. Hopefully its adoption increases. | 10:16.36 |
| <qwertynik> Yes. But given the increased digitization, being able to extract content with greater ease from PDFs could become essential. | 10:19.30 |
| <qwertynik> Yes. Cumbersome but the only option as of now. | 10:28.26 |