Ghostscript IRC logs

	<<<Back 1 day (to 2022/05/09)	Fwd 1 day (to 2022/05/11) >>>	20220510
artifexirc-bot	<Knaldgas> Debugging an issue (on AIX) I found a gs process listening on TCP/IP port 1237. ps -ef : /usr/bin/gs -P- -dSAFER -dCompatibilityLevel=1.4 -sPAPERSIZE=a4 -q -P- -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sstdout=%stderr -sOutputFile=xxx.pdf -P- -dSAFER -dCompatibilityLevel=1.4 -sPAPERSIZE=a4 -c .setpdfwrite -f xxx (probably truncated here).		06:01.04
	<Knaldgas> I wasn't aware that gs could start listening for TCP/IP connections?! - Unfortunately in this case it seized a port that another process needed. Any thoughts?		06:02.21
	<KenSharp> As far as I know Ghostscript (as supplied by us) has no ability to listen to TCP/IP at all.		06:59.00
	<Knaldgas> KenSharp, right, that concurs with what I have found on that issue (nothing). But that leaves me quite baffled as to what I saw... The OS identified the gs process as "LISTENING" on that port, and when I killed the process, the port was released.		07:11.14
	<KenSharp> I've no idea why that would be the case.		07:11.42
	<Knaldgas> Odd		07:11.43
	<Knaldgas> KenSharp, thanks a lot for your feedback! :-)		07:12.02
	<KenSharp> You could try building Ghostscript from the source we supply and see if it behaves the same		07:12.33
	<KenSharp> Rather than using an OS package		07:12.41
	<Knaldgas> The gs process was started by ps2pdf package, not that I understand more from that. I'm not sure if I can get gs to repeat what it did, but building it from your sources might be on the agenda - thanks again :)		07:18.13
	<KenSharp> Unless there's something odd going on, ps2pdf is nothing more than a shell script (a very over-complicated shell script) which starts GS with a couple of parameters. You can easily get the same result by just running Ghostscript directly		07:19.14
	<Knaldgas> KenSharp, could try that, thanks		07:20.04
	<KenSharp> We supply a ps2pdf script in the source somewhere, I was under the impression that was what the package maintainers use		07:20.07
	<KenSharp> Yeah in ghostpdl/lib is ps2pdf which is the relevant script		07:20.43
	<KenSharp> Yes, as I thought. ps2pdf calls another shell script based on the PDF version required, so usually it calls ps2pdf14, which in turn calls ps2pdfwr with -dCompatibilityLevel=1.4. ps2pdfwr adds on some more options: -P- -dSAFER -q -P- (again) -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sstdout=%stderr -sOutputFile=		07:23.44
	<KenSharp> That is, of course, all assuming that it's using the ps2pdf script		07:24.13
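A minimal sketch of running Ghostscript directly with the options KenSharp lists for ps2pdfwr, bypassing the wrapper scripts. The helper name `build_gs_command` and the file names are hypothetical; the options and their order are copied from the messages above and have not been checked against any particular release of the script.

```python
def build_gs_command(infile, outfile, compat_level="1.4", gs="gs"):
    """Return an argv list mirroring the options quoted above for ps2pdfwr."""
    return [
        gs,
        "-P-", "-dSAFER",
        f"-dCompatibilityLevel={compat_level}",
        "-q", "-P-", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=pdfwrite",
        "-sstdout=%stderr",
        f"-sOutputFile={outfile}",
        infile,
    ]


# Example use (file names are placeholders):
# import subprocess
# subprocess.run(build_gs_command("input.ps", "output.pdf"), check=True)
```

Building the argument list separately from running it makes it easy to print and compare against what `ps -ef` shows for the process started by the package's script.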
	<chrisl> We don't have any networking code in Ghostscript, so I'd have to guess it's a third party library. Might be worth looking at what dynamic libs are linked.		08:13.47
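One way to follow chrisl's suggestion is to list the shared libraries the gs binary is linked against. The sketch below assumes an `ldd` command is available on the system and that the binary lives at /usr/bin/gs (the path from the ps output above). The helper names are hypothetical, and since the output format of ldd differs between platforms, the parser only strips whitespace and drops blank lines.

```python
import subprocess


def parse_ldd_output(text):
    """Return the non-empty lines of ldd output, stripped of whitespace."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def linked_libraries(binary="/usr/bin/gs"):
    """Run ldd on the binary and return its dependency lines.

    Assumes `ldd` is on PATH; raises CalledProcessError if it fails.
    """
    result = subprocess.run(
        ["ldd", binary], capture_output=True, text=True, check=True
    )
    return parse_ldd_output(result.stdout)
```

Any entry in the result that is not expected in a stock Ghostscript build would be a reasonable first place to look for networking code.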
	<qwertynik> Thanks @mvrhel for remembering the request and tagging me here. Appreciate it. Had taken a look at the link, looks like instead of BlackText will need to use those params.		09:38.13
	<qwertynik> However, would text using Type 3 fonts be covered by these options?		09:38.13
	<qwertynik> Thanks to Corona, businesses will digitize even faster. And naturally PDF documents will gain even higher traction.		09:45.02
	<qwertynik> In terms of rendering and filling forms, the PDF format works great. However, extracting data, which it was not originally meant for anyway, is not so straightforward.		09:45.03
	<qwertynik> Given the presence of experts here, I wanted to understand if there would be changes to the spec to make data extraction simpler. Also, are there any social media handles such as this to follow to keep track of the latest developments in the PDF world - changes in specifications, availability of tools/services etc?		09:45.06
	<qwertynik> @here		09:47.20
	<Robin_Watts> Changes to the spec? Unlikely, IMHO.		09:57.51
	<Robin_Watts> How well you can extract information from a PDF file depends, largely, on how well the PDF file is constructed in the first place.		09:58.28
	<Robin_Watts> Most PDF construction programs are satisfied with getting stuff looking right - being able to be searched is a bonus. Actually being able to extract the data meaningfully is a very poor third place.		10:01.25
	<Robin_Watts> If you follow the PDF spec then you can produce PDFs where the raw text can be extracted fairly well. Ghostscript, for example, does a good job of making PDFs where the text can be extracted as text, rather than gobbledegook.		10:03.10
	<Robin_Watts> What's much harder is to make a PDF whereby the structure of a document is extractable (this text story flows down this column, then this one, then we have a table, then it continues on page 3 etc)		10:04.03
	<Robin_Watts> I think the StructTree stuff is supposed to allow that kind of information to be encoded - but the problem is most PDF generators aren't given that information, so they can't hope to encode it into the generated PDF file.		10:04.50
	<Robin_Watts> And if the info is rarely there, PDF consumers don't bother to implement the code to make use of it if it is.		10:05.25
	<Robin_Watts> so if no one uses it, why generate it? It's a vicious circle.		10:05.40
	<Robin_Watts> And if you're scanning/OCRing documents to get your PDF, you can't ever hope to accurately have that information either.		10:06.36
	<Robin_Watts> So people tend to just work from first principles and try to guess stuff on extract.		10:06.53
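A rough way to check whether a given PDF carries the structure information Robin_Watts refers to is to look for the /StructTreeRoot key, which is where the PDF specification places the structure tree in the document catalog. The sketch below is only a heuristic under stated assumptions: it scans the raw bytes, so it will miss files whose catalog is stored in a compressed object stream, and a hit says nothing about how good the tagging is. The function name is hypothetical.

```python
def mentions_struct_tree(pdf_bytes):
    """Heuristic: True if the raw PDF bytes contain a /StructTreeRoot key."""
    return b"/StructTreeRoot" in pdf_bytes


# Example use on a file (path is a placeholder):
# with open("document.pdf", "rb") as f:
#     print(mentions_struct_tree(f.read()))
```

A proper check would parse the catalog with a PDF library rather than search bytes; this version only gives a quick first impression of whether a generator emitted any structure tree at all.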
	<qwertynik> This would be great to have. However, having semantic representation of objects - headings, lists, paragraphs should be a good starting step.		10:15.47
	<qwertynik> Had never heard of it. Thanks for mentioning. Hopefully its adoption increases.		10:16.36
	<qwertynik> Yes. But given the increased digitization, being able to extract content with greater ease from PDFs could become essential.		10:19.30
	<qwertynik> Yes. Cumbersome but the only option as of now.		10:28.26

Log of #ghostscript at irc.freenode.net.