IRC Logs

Log of #ghostscript at irc.freenode.net.

2016/09/10
sebras tor8: do we want an accessor function for annot->page->doc? similar to pdf_get_bound_document()?03:24.06 
  tor8: I saw in the js that you poke on the object directly, I decided to go the other way.03:24.39 
  tor8: so basically I'm doing pdf_get_page_document(ctx, pdf_annot_page(ctx, annot));03:35.14 
  tor8: but doing annot->page->doc is simpler of course. I'm happy to do it either way.03:35.41 
  tor8: why does pdf_set_ink_annot_list() take a color? is that color somehow different from the one being set using pdf_set_annot_color()?!04:16.29 
  tor8: I'm looking at the old app's implementation of InkAnnotation.04:16.51 
  also there is no corresponding pdf_ink_annot_list(). now I see what you meant by the annotation interfaces missing a few bits and pieces.04:17.56 
  and why only three components in the case of ink_annot? hrm.. :-/04:19.19 
  those APIs should just take a int n and float *colorv and not bother trying to be intelligent about colorspaces at all. :)04:19.59 
  tor8: InkAnnotation.setInkList() takes a float[] representing the x,y coordinates for each point of a path, but pdfref17.pdf page 636 actually allows for _several_ disjoint paths. so I believe we want to have float[][] or something like that.04:41.11 
  or possibly float coordinates[] and int pathStartIndicies[] indexing into the previous array.04:41.48 
  pdf_set_ink_annot_list() itself has a single float[] but it also has an int ncount and float counts[] so it knows how many coordinates belong to each path.04:44.02 
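The flat layout described above (one coordinate array plus a per-path count) can be sketched in Python as follows. This is a minimal illustration, not the MuPDF API: the function names are hypothetical, and it assumes each count is the number of points (x,y pairs) in a path rather than the number of floats.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def flatten_ink_paths(paths: Sequence[Sequence[Point]]) -> Tuple[List[int], List[float]]:
    """Flatten several disjoint paths into (counts, coords).

    counts[i] is the number of points in path i (an assumption, see above);
    coords holds the x,y values of all paths back to back.
    """
    counts: List[int] = []
    coords: List[float] = []
    for path in paths:
        counts.append(len(path))
        for x, y in path:
            coords.extend((float(x), float(y)))
    return counts, coords


def unflatten_ink_paths(counts: Sequence[int], coords: Sequence[float]) -> List[List[Point]]:
    """Rebuild the list of paths from the flat representation."""
    paths: List[List[Point]] = []
    pos = 0  # index of the first float belonging to the current path
    for n in counts:
        paths.append([(coords[pos + 2 * i], coords[pos + 2 * i + 1]) for i in range(n)])
        pos += 2 * n
    return paths
```

A float[][] on the Java side would map onto this by flattening before the call and unflattening when reading the list back.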
Hello71 how do I *actually* losslessly compress a PDF file? every source online seems to say 'just run gs -dPDFSETTINGS=/screen' which technically works but is not "lossless".13:23.25 
  in other words, why is it that 'wc -c somefile.pdf' is 971187 but 'gzip -1 -c somefile.pdf | wc -c' is 612208? I thought PDFs were supposed to already be DEFLATEd.13:24.21 
  I have already run "mutool clean -ggg -z" and "qpdf --min-version=1.7 --object-streams=generate" on this file. a suggestion to run --linearize unsurprisingly slightly increased the file size.13:25.24 
alexcher Oh, so many layers of confusion.13:29.45 
  First, gzip'd file will be always different from the source file. Counting words is not a valid test.13:31.44 
  Second, PDF usually compresses only some of the objects.13:32.51 
  Finally, you need to define "lossless" compression for PDF files.13:33.49 
Hello71 1. "wc" even in its default mode does not only count words.13:35.34 
  2. based on my rudimentary understanding, "grep '<<' somefile.pdf" shows that all the "object"s are /FlateDecode, which is what I would expect from qpdf.13:37.12 
  3. it is possible by some standard algorithm (or non-standard, but the size must be included in the final count) to produce a pixel-identical representation of the PDF at a hypothetical infinite zoom.13:38.18 
  in other words, the PDF in question contains only text, images, and metadata, but no videos, forms, Flash animations, or the like.13:39.06 
alexcher '<<' opens dictionary. This is unrelated to compression.13:39.52 
Hello71 additional points are given if it is possible to reproduce the original PDF byte-for-byte, which the "gzip" command qualifies for.13:39.54 
  yes, but AIUI, all objects must start with a dictionary, and the dictionary must list any filters (in this case, compressors) which must be used to open the object13:40.41 
alexcher To delete some of the features of a pdf file and keep the rest intact, you need to work with the file structure. Ghostscript has a utility to do the opposite -- decompress the compressed streams. This utility can be modified to patch the file as you want, but it requires some PostScript programming skills.13:46.27 
Hello71 what does that have to do with what I said13:47.02 
  look, forget everything I said13:47.08 
  *why does 'gzip -1' significantly decrease PDF file size*13:47.26 
alexcher Only part of PDF is normally compressed.13:47.56 
sebras and it is possible to ask a PDF producer to not compress _any_ parts of a PDF file.13:49.58 
  also PDF supports several compression filters (some lossy, some not).13:50.35 
Hello71 OK, which parts are not compressed13:50.53 
  again, I used "qpdf" which compresses stream data by default.13:51.18 
sebras I don't know qpdf so I don't know what parts it decides to compress.13:51.35 
  but normally pictures are compressed e.g.13:51.43 
  for bitmap images DCT-based compression is common (basically embedding a JPEG inside the PDF).13:52.29 
  the lossless compression methods _can_ be used for bitmap images13:52.49 
Hello71 some output of "mutool info": https://bpaste.net/show/96ea09bed26813:52.56 
sebras but I guess the idea of those is to be able to compress the graphic operators that describe vector graphics/text/etc.13:53.17 
Hello71 but again if the images are already compressed with DEFLATE, why would running "gzip" again help13:53.42 
sebras Hello71: right, so you have three images compressed by flate compression.13:53.52 
  Hello71: because a PDF consists of several objects, one type of object are the images, another one the fonts you see there from mutool info13:54.36 
  Hello71: but also the content stream which contains a description of draw this line here in that color, that line here in this color and here I want a circle containing a gradient of colors etc.13:55.11 
  Hello71: sometimes these content streams are compressed, sometimes not. mutool info doesn't tell you whether or not that is the case though.13:55.31 
Hello71 which is why I did "grep '<<'" to find all dictionaries, which I was under the impression preceded all objects.13:57.02 
  I also did "strings -n10" which revealed no apparent text other than the dictionaries and "endstream".13:57.35 
sebras Hello71: depending on the PDF that might be true, but not for all.13:57.37 
  Hello71: there is a special type of object that can contain other objects. the encapsulating object may be using e.g. ASCII85Encode and in that case you cannot see the internal objects using grep.13:58.36 
Hello71 yes, but if all the objects are already DEFLATEd then it should not make a difference what is inside13:59.13 
  at least from the perspective of the file size after 'gzip'13:59.22 
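The check being discussed (does every stream object declare a filter?) can be approximated with a rough scan like the one below. It is a heuristic sketch, not a PDF parser: the function name is hypothetical, it splits on the `endobj` keyword and so can be fooled by binary data that happens to contain that keyword, and it cannot see objects packed inside object streams.

```python
import re
from typing import List, Optional, Tuple


def report_stream_filters(pdf_bytes: bytes) -> List[Tuple[Optional[int], bool]]:
    """Return (object number, has_filter) for each stream object found.

    Heuristic only: looks at the text between "N G obj" and the first
    "stream" keyword and checks whether it mentions /Filter.
    """
    results: List[Tuple[Optional[int], bool]] = []
    for chunk in pdf_bytes.split(b"endobj"):
        head, sep, _ = chunk.partition(b"stream")
        if not sep:
            continue  # no stream keyword, so not a stream object
        m = re.search(rb"(\d+)\s+(\d+)\s+obj", head)
        objnum = int(m.group(1)) if m else None
        results.append((objnum, b"/Filter" in head))
    return results
```

Calling it on the bytes of a file (for example `report_stream_filters(open("somefile.pdf", "rb").read())`) lists any stream whose dictionary carries no /Filter entry as `False`.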
sebras Hello71: finally you have the xref at the end of the file, which basically is a table of contents describing where in the file each object can be found.14:00.17 
Hello71 OK. The file says that that must also be opened using FlateDecode.14:01.10 
sebras that part is normally not compressed though.14:01.18 
  but it _can_ be compressed using xref streams. and in that case using a lossless compression method like LZW or Flate or RLE.14:02.58 
  or maybe a combination of those.14:03.05 
  if it compresses better.14:03.11 
  for images you also have predictor parameter that can be tweaked to losslessly compress images even better.14:04.53 
  I doubt that most PDF producers (like qpdf) attempt to try all combinations of all these options to achieve a minimal file size.14:05.18 
  Hello71: why is it important to you to make a minimal file?14:05.40 
Hello71 because I want to store my PDFs in 100 MB instead of 200 MB14:06.30 
  maybe not so useful on desktop HDD, but more important on my phone with 8 GB storage space.14:07.26 
sebras Hello71: well, I don't know of any software that attempts to combine all these methods to get the smallest possible file.14:07.32 
Hello71 again, brute-force compression is not what I am looking for14:07.53 
  my question, for the third time, is why does gzip -9 decrease PDF file size by up to half14:08.22 
  when all objects are already /Filter /FlateDecode14:09.02 
sebras Hello71: without looking at this file I can't tell you. I can just explain how the PDF format works.14:09.55 
alexcher Hello71: open your file in a text editor and see what's inside.14:10.30 
sebras there might be some object that (surprisingly) is uncompressed.14:11.01 
  what alexchar is suggestion it what I'd do myself.14:11.16 
  suggesting.14:11.20 
Hello71 again, as I said, I used the strings utility to automatically locate all printable strings over 10 characters.14:13.55 
sebras Hello71: if you are having issues with the output of qpdf, why are you directing the questions to mupdf/ghostscript developers instead of the qpdf developers?14:15.34 
sebras foods.14:16.30 
Hello71 qpdf is irrelevant14:24.08 
  or at least only vaguely relevant14:24.16 
  see https://alxu.ca/Untitled.pdf14:25.35 
  file size: 473867, file size after gzip -1: 35073614:25.56 
  all objects in PDF are under FlateDecode already and the only human-readable parts are dictionaries and markers14:40.51 
alexcher Hello71: almost everything in your file is already compressed. If you need to reduce the file size, the presentation graphic should be simplified or removed.15:06.43 
  The question about the compressibility of your file is addressed to the wrong forum. This is a property of gzip. Perhaps, some of the streams have common parts.15:10.36 
sebras Hello71: https://github.com/qpdf/qpdf/blob/master/libqpdf/Pl_Flate.cc#L87 also qpdf is using the default compression level from zlib, not the best one. might matter.17:33.58 
  https://github.com/qpdf/qpdf/blob/master/libqpdf/Pl_PNGFilter.cc#L113 also it seems as if qpdf only implements the PNG up predictor when it may achieve better compression using another predictor for these particular images.17:45.29 
Hello71 even gzip -1 decreases file size by up to 30%17:49.24 
  again, I already ran mutool clean -z17:49.43 
sebras Hello71: the two content streams for pages 2 and 3 (objects 23 and 24) each compress about 50% better (reducing in size by about ~60kbyte each). now, why does this happen? each content stream adjusts the translates and paints the same image in excess of 37000 times which presumably equates to more repetition than the flate compression window can handle, so recompressing the data allows for flate to 18:22.53 
  discover more redundancy which it can compress.18:22.59 
  zlib uses the maximum window size it supports by default, and choosing Z_BEST_COMPRESSION level and allowing it to use a little more memory (DEF_MEM_LEVEL is 8, but can be 9), does not matter for compressing these content streams. in addition these two objects are located next to each other in the PDF file which likely gives gzip even more redundancy to take advantage of.18:25.10 
  I don't know if the gzip command line tool makes some more adjustments to some parameters that I'm not aware of.18:26.41 
  it might.18:26.45 
  actually if you run mutool show -b Untitled.pdf 23 > uncompressed.bin and then run gzip -vv9 uncompressed.bin you will see a similar size as the object in the PDF, but if you run gzip on the already gzipped file once more it will reduce in size again.18:28.30 
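The effect described above can be reproduced without the PDF at all. The sketch below uses Python's `zlib` on a synthetic, highly repetitive byte string and compares the size after one and after two Deflate passes. The sample data is a hypothetical stand-in for the real content stream (the operator text is made up), and one plausible mechanism is that a single Deflate match covers at most 258 bytes, so the first pass turns long repetition into a long run of near-identical match codes that a second pass can shrink again.

```python
import zlib
from typing import Tuple


def make_repetitive_stream(repeats: int = 37000) -> bytes:
    """Hypothetical stand-in for a content stream that paints one image many times."""
    return b"q 1 0 0 1 10 0 cm /Im0 Do Q\n" * repeats


def deflate_sizes(raw: bytes, level: int = 9) -> Tuple[int, int, int]:
    """Return (raw size, size after one Deflate pass, size after two passes)."""
    once = zlib.compress(raw, level)
    twice = zlib.compress(once, level)
    return len(raw), len(once), len(twice)


if __name__ == "__main__":
    raw_len, once_len, twice_len = deflate_sizes(make_repetitive_stream())
    print("raw:", raw_len, "one pass:", once_len, "two passes:", twice_len)
```

Real content streams vary the coordinates on each repetition, so the gain there is smaller than on this idealised input.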
Hello71 huh. there's actually an easier way to show that: qpdf --stream-data=uncompress Untitled.pdf Untitled-decomp.pdf; gzip -c Untitled-decomp.pdf | wc -c is 477600 (very close to the PDF size of 473867), but gzip -c Untitled-decomp.pdf | gzip -c | wc -c is 342985, very close to gzip -c Untitled.pdf = 34308018:43.33 
sebras Hello71: well, if you do that you allow gzip to utilize any redundancy _between_ objects too. I wanted to see if each individual object contributed anything.18:49.41 
  :w18:49.44 
  Hello71: if I just recompressed the flated objects an extra time and keep that version if they get smaller I can get mutool clean -z to compress to 348723 where the normal gzip achieves 343046 (presumably due to the extra redundancy and also because it compressed the object structures, i.e. endstream keywords etc).18:52.02 
Hello71 am I missing something or is that then not actually a PDF18:53.07 
  or can you do /Filter [ /FlateDecode /FlateDecode ]18:53.23 
sebras Hello71: you can.18:53.30 
  Hello71: or /Filter [ /FlateDecode /LZWDecode ] if you wish.18:53.49 
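A filter array like the one above is applied in order when decoding, so a writer has to encode in the reverse order. The following sketch shows that round trip and how the resulting stream object could be serialised. The helper names are hypothetical, only /FlateDecode is wired up, and the object layout is simplified (no /DecodeParms, generation number fixed at 0).

```python
import zlib
from typing import List

# Only the one filter discussed here is supported in this sketch.
DECODERS = {"/FlateDecode": zlib.decompress}
ENCODERS = {"/FlateDecode": lambda data: zlib.compress(data, 9)}


def encode_stream(data: bytes, filters: List[str]) -> bytes:
    """Encode so that decoding with `filters` in array order restores `data`."""
    for name in reversed(filters):
        data = ENCODERS[name](data)
    return data


def decode_stream(data: bytes, filters: List[str]) -> bytes:
    """Apply the filters in the order they appear in the /Filter array."""
    for name in filters:
        data = DECODERS[name](data)
    return data


def stream_object(objnum: int, data: bytes, filters: List[str]) -> bytes:
    """Serialise a simplified stream object with a /Filter array."""
    body = encode_stream(data, filters)
    header = "%d 0 obj\n<< /Length %d /Filter [ %s ] >>\nstream\n" % (
        objnum, len(body), " ".join(filters))
    return header.encode("ascii") + body + b"\nendstream\nendobj\n"
```

For example, `stream_object(23, raw, ["/FlateDecode", "/FlateDecode"])` produces an object whose data has been deflated twice.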
  it's 3am now. /me sleeps.18:55.42 
Hello71 huh. I thought running zopfli instead would fix that, but zopfli compresses the whole file to 420623 which gzip then reduces to 30243018:57.25 
ray_laptop Hello71: Please read http://ghostscript.com/doc/current/VectorDevices.htm#PDFWRITE21:19.49 
  this will describe the options you have for various parameters that allow Ghostscript (gs -sDEVICE=pdfwrite -o out.pdf .... in.pdf) to process a PDF and possibly reduce the file size. Note that by default, Ghostscript will compress the "Contents" which is the sequences of PDF operators (with Flate) that are used for vector, text, etc.21:22.24 
  Hello71: The monochrome images default to compression with CCITTG4 which is lossless (and pretty much the best you can get). Ghostscript pdfwrite automatically will detect images that are used in more than one place and only include them once21:23.40 
  Hello71: Don't bother with things like -dPDFSETTINGS=/screen (they are usually not optimal). In particular you may want -dColorImageFilter=/FlateEncode and -dGrayImageFilter=/FlateEncode21:27.47 
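Put together, the suggested invocation might look like the argument list built below. This is a sketch under assumptions: the function name is hypothetical, and the two `-dAutoFilter...=false` switches are an addition not spelled out in the advice above, included on the assumption that automatic filter selection has to be turned off for an explicit image filter to apply. The function only builds the command and does not run Ghostscript.

```python
from typing import List


def gs_lossless_args(src: str, dst: str) -> List[str]:
    """Build a Ghostscript pdfwrite command that asks for Flate image filters."""
    return [
        "gs",
        "-sDEVICE=pdfwrite",
        "-o", dst,
        # Assumption: disable automatic filter choice so the explicit filters apply.
        "-dAutoFilterColorImages=false",
        "-dAutoFilterGrayImages=false",
        "-dColorImageFilter=/FlateEncode",
        "-dGrayImageFilter=/FlateEncode",
        src,
    ]


if __name__ == "__main__":
    # Print the command instead of running it; pass the list to subprocess.run to execute.
    print(" ".join(gs_lossless_args("in.pdf", "out.pdf")))
```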
  Hello71: Note that embedded fonts aren't compressed since they tend not to compress well.21:29.50 
Gianormagantrous Do ghostscript have some sort of "hiding" or deleting part of a pdf ? Say for example, you had a pdf that had an awful logo on the bottom right of each page. Provided the coordinates to said logo, could I "replace it with white" or something ?21:43.52 
  I am not sure what term to look for (deleting, hiding, censuring, generate white square, etc)21:44.52 
ray_laptop Gian... Not patient enough. Yes, Ghostscript can paint over part of a PDF. It involves setting an "EndPage" PostScript procedure to draw on the page22:41.00 
Hello71 ray_laptop: dunno what exactly you said has anything to do with my question22:54.14 
  I already found the problem with my PDF, the chunks are too large so they overflow the standard DEFLATE buffer.22:54.58 
ray_laptop Hello71: you are trying to squash a PDF to make it as small as possible, but keep images "lossless", right ?22:55.03 
  Hello71: there is no such thing as a "standard deflate buffer"22:55.30 
Hello71 there is DEFLATE64 but it is not available in PDF (or even zlib), so might as well just use LZMA22:55.35 
  er, window, not buffer22:55.42 
  this PDF is fairly simple, so I am pretty sure I have exhausted the standard PDF compression methods.22:57.42 
ray_laptop Hello71: PDF has the Flate data in following a line "stream<cr><lf>" and you DEFINITELY don't want any other encoding filter added, such as ASCIIHexEncode (since that will make it bigger)22:59.52 
  Hello71: and PDF doesn't have LZMA23:00.10 
  Hello71: note that if you are starting with a PDF and it has DCTEncode images, then you don't want to use ghostscript since you want to retain the JPEG *ORIGINAL* data (and not make it bigger by making it into a FlateEncode image)23:02.30 
Hello71 [22:59:52] <+ray_laptop> Hello71: PDF has the Flate data in following a line "stream<cr><lf>" and you DEFINITELY don't want any other encoding filter added, such as ASCIIHexEncode (since that will make it bigger)23:49.12 
  if you examine the PDF I provided it uses just Flate objects23:49.33 
  [23:00:10] <+ray_laptop> Hello71: and PDF doesn't have LZMA23:49.38 
  my point is that if DEFLATE64 and LZMA are both not supported but LZMA is better then just use LZMA23:49.58 
  [23:02:30] <+ray_laptop> Hello71: note that if you are starting with a PDF and it has DCTEncode images, then you don't want to use ghostscript since you want to retain the JPEG *ORIGINAL* data (and not make it bigger by making it into a FlateEncode image)23:50.06 
  again, if you examine the PDF I provided all the images are Flate23:50.16 
  probably since I selected "lossless compression" in LibreOffice23:50.29 
ray_laptop Hello71: provided where ? (I may have missed that) was it attached to a bug report ?23:51.02 
Hello71 [14:25:35] < Hello71> see https://alxu.ca/Untitled.pdf23:51.52 
  it is not a bug in anything, just poor design of DEFLATE (relative to newer compression methods)23:52.17 
ray_laptop Hello71: We don't deal with LibreOffice, but once you have a file that is created lossless, gs _can_ decide "automagically" to use JPEG if an image looks to be suitable23:52.50 
  Hello71: as far as the compression methods available in PDF, talk to the ISO 32000 committee -- we have to follow the spec23:53.35 
  Hello71: since gs can "AutoFilter..." images, it may help. We have many customer/users that use that. Any "sharp" transitions are assumed to need lossless (Flate), otherwise if it is "smooth" then jpeg (DCTEncode) is used23:55.34 