Log of #ghostscript at irc.freenode.net.

20210103
Beeblebrox-BSD Hello. I have a Q regarding pdf file structure for ATS systems. Someone very confidently told me that MS Word vs Linux generated pdf files are different. I don't think so, but where can I find experts on this topic?17:22.46 
ray_laptop Beeblebrox-BSD: pretty much every PDF creator ends up with PDF's that are internally different, but they conform to the PDF specification.17:38.22 
Beeblebrox-BSD Hi Ray, thanks. I thought as much with the ISO spec for PDF and all, but what would be the solution then to "simplify" so as to eliminate quirks?17:39.51 
ray_laptop what varies most widely is how, or if, they embed fonts, and how much of the PDF graphics they use, and how well they use it. There are efficient PDF's and horrendous ones17:39.52 
Beeblebrox-BSD I was thinking PDF flattening, but that's mostly for images17:40.15 
ray_laptop flattening is mostly for transparency17:40.31 
  (to get rid of it, and turn the page into an image)17:40.51 
Beeblebrox-BSD There are also PDF types - archive, e, etc. What would be the best type for ATS?17:41.03
  yeah, so flattening is useless for my problem basically...17:41.31 
ray_laptop if you run a PDF through Ghostscript (using ps2pdf or command line -sDEVICE=pdfwrite, -o out.pdf) it will process the input PDF, correct many common errors, and create a brand new PDF (by default with embedded font subsets), so they will all look pretty much the same.17:43.00 
  the same "style". And if you use GS to create a PDF/A (archive) format, it will conform with that format, and so not rely on anything not within the resulting PDF.17:44.21 
  but processing thru gs doesn't preserve everything in the original PDF (most notably "interactive" features -- JavaScript and potentially some annotations)17:45.43 
  Ghostscript can also "flatten" PDF's that have transparency, by specifying a PDF 1.3 output, or using the "pdfimage*" device rather than pdfwrite. These make the entire page into an image, but generally remove searchability (and cut/paste)17:47.40
  but generally keeping a high-level PDF is best -- smallest file size.17:49.23 
Beeblebrox-BSD I was just writing something similar; A solution I read was export as pdf (I use Libreoffice), pdf2ps and then ps2pdf to get a "simple" file. I tried it, but got a weird PDF doc. Reading your comments, it seems I can 1) skip step #2 and just do ps2pdf on a pdf file? and 2) I probably got a weird doc b/c I did not specify good flags?17:49.48
ray_laptop which is what gs pdfwrite does by default17:49.51 
Beeblebrox-BSD What's a high-level PDF? Is that by version?17:50.21 
ray_laptop you don't need the pdf2ps step, since that will remove transparency. just running it through ps2pdf (the Ghostscript step) is enough.17:51.10 
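The single-step conversion described above (feed the PDF straight to Ghostscript's pdfwrite device, no pdf2ps step) might be scripted roughly as follows. This is a minimal sketch, not a definitive recipe: the function names and the file names in.pdf / out.pdf are hypothetical, and it assumes the Ghostscript executable is called "gs" and is on PATH. Only the -sDEVICE=pdfwrite and -o options come from the log.

```python
import shutil
import subprocess


def build_pdfwrite_cmd(src, dst, gs="gs"):
    """Return the argument list that rewrites src as a new PDF via pdfwrite.

    The executable name "gs" is an assumption; on some systems the
    Ghostscript binary has a different name.
    """
    return [
        gs,
        "-sDEVICE=pdfwrite",  # high-level PDF output device
        "-o", dst,            # output file; -o also implies -dBATCH -dNOPAUSE
        src,
    ]


def normalize_pdf(src, dst, gs="gs"):
    """Run Ghostscript on src; raises CalledProcessError if gs fails."""
    if shutil.which(gs) is None:
        raise FileNotFoundError(f"{gs!r} not found on PATH")
    subprocess.run(build_pdfwrite_cmd(src, dst, gs), check=True)
```

Calling normalize_pdf("in.pdf", "out.pdf") would then produce a freshly written PDF. As noted in the log, interactive features of the input may not survive.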
Beeblebrox-BSD Awesome!17:51.31 
  and what do you mean by high-level PDF?17:52.40 
  Use the Archive flag or not?17:53.27 
ray_laptop pdf2ps is probably NOT ghostscript (you can look at the script to see) and since PostScript doesn't support everything that PDF does, you can lose a LOT. Our "ps2ps2" does as much as possible, and WILL be different than pdf2ps output, but that step WILL lose info17:53.52
Beeblebrox-BSD OK, /A is larger file size, so no /a17:54.02 
ray_laptop high level PDF is the default for gs (PDF 1.4 or later)17:54.27 
Beeblebrox-BSD Ok, version. Thanks.17:54.38 
  Ray, you're the man. Thank you for the insight, great stuff!17:55.23 
ray_laptop the Ghostscript option for that is -dCompatibilityLevel=1.x 17:55.42 
Beeblebrox-BSD cool...17:56.09 
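The -dCompatibilityLevel option and the device choice mentioned in this log can be combined into one small command builder. Again this is a sketch under assumptions: the function name and file names are hypothetical, "gs" is assumed to be the Ghostscript executable, and pdfimage24 is used as one member of the "pdfimage*" device family.

```python
def build_gs_cmd(src, dst, compat=None, device="pdfwrite", gs="gs"):
    """Build a Ghostscript command line as a list of arguments.

    compat -- optional PDF version string such as "1.3" or "1.4",
              passed as -dCompatibilityLevel; omitted when None so
              that Ghostscript uses its own default.
    device -- output device, "pdfwrite" unless overridden.
    """
    cmd = [gs, f"-sDEVICE={device}"]
    if compat is not None:
        cmd.append(f"-dCompatibilityLevel={compat}")
    cmd += ["-o", dst, src]
    return cmd


# PDF 1.3 output, which per the log flattens transparency.
flatten_cmd = build_gs_cmd("in.pdf", "flat.pdf", compat="1.3")

# Whole-page image output; loses searchability, per the log.
image_cmd = build_gs_cmd("in.pdf", "image.pdf", device="pdfimage24")
```

Either list can be passed to subprocess.run(cmd, check=True). Leaving compat unset keeps the high-level default that the log recommends for the smallest files.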
ray_laptop actually, our main pdfwrite expert is Ken Sharp (usual nick kens)17:56.20 
Beeblebrox-BSD Well, I got my answer thanks to your help so <thumbs up emoji>. I'll keep that name in mind though.17:57.25 
ray_laptop Beeblebrox-BSD: the main issue with PDF's is usually that they are created with fonts that don't have "ToUnicode" and use a funky font subset that prevents us from making a searchable PDF (at least for those fonts)17:58.34
  we are working on a solution for that which will take longer to process, and be somewhat error prone since it relies on OCR (tesseract+leptonica) to inspect the rendered glyph and get a unicode value.17:59.52 
Beeblebrox-BSD Hm, but don't all Linux/Unix fonts have ToUnicode? Might that be a Windows problem rather than Linux?18:01.33
  Not a fonts expert by any means here18:02.01 
ray_laptop I think MS is one of the common culprits in not always putting in the ToUnicode map for font subsets, but it may be that pdf2ps also causes a problem since there is no ToUnicode in PostScript (there is a GlyphNames2Unicode that Windows PS driver outputs, but not all PS creators have that)18:02.58 
Beeblebrox-BSD Ah...18:03.54 
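A very rough way to see whether a given PDF carries ToUnicode entries is to count the relevant name tokens in the raw file bytes. This is a crude heuristic and not how Ghostscript inspects fonts: it only sees uncompressed objects (PDF 1.5+ files often keep font dictionaries inside compressed object streams, where this finds nothing), and composite fonts contain more than one /Type /Font dictionary, so the two counts are not expected to match exactly. The function name is hypothetical.

```python
import re

# "/Type /Font" not followed by a letter, so /FontDescriptor is excluded.
FONT_RE = re.compile(rb"/Type\s*/Font(?![A-Za-z])")
TOUNICODE_RE = re.compile(rb"/ToUnicode(?![A-Za-z])")


def count_font_markers(data):
    """Return (font_dicts, tounicode_entries) seen in uncompressed bytes."""
    return len(FONT_RE.findall(data)), len(TOUNICODE_RE.findall(data))


# Synthetic example: two font dictionaries, only one with a ToUnicode map.
sample = (
    b"<< /Type /Font /Subtype /TrueType /ToUnicode 9 0 R >>\n"
    b"<< /Type /FontDescriptor /FontName /ABCDEF+Foo >>\n"
    b"<< /Type/Font /Subtype /Type1 >>\n"
)
fonts, maps = count_font_markers(sample)
print(fonts, maps)  # prints "2 1"
```

For a real file, read it in binary mode and pass the bytes to count_font_markers. A result with fonts but no ToUnicode entries is only a hint that text extraction may fail, not proof.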
ray_laptop I am not a fonts expert either. Chris Liddell (chrisl) and Ken (kens) are our primary font experts18:04.17 
Beeblebrox-BSD Thank you Ray for your help. I wish you the best in 2021 & happy new year.18:05.09 
  Bye for now.18:05.15 
ray_laptop Same to you.18:05.18 