Ghostscript IRC logs

Log of #ghostscript at irc.freenode.net.

20160910
sebras	tor8: do we want an accessor function for annot->page->doc? similar to pdf_get_bound_document()?	03:24.06
	tor8: I saw in the js that you poke at the object directly, I decided to go the other way.	03:24.39
	tor8: so basically I'm doing pdf_get_page_document(ctx, pdf_annot_page(ctx, annot));	03:35.14
	tor8: but doing annot->page->doc is simpler of course. I'm happy to do it either way.	03:35.41
	tor8: why does pdf_set_ink_annot_list() take a color? is that color somehow different from the one being set using pdf_set_annot_color()?!	04:16.29
	tor8: I'm looking at the old app's implementation of InkAnnotation.	04:16.51
	also there is no corresponding pdf_ink_annot_list(). now I see what you meant by the annotation interfaces missing a few bits and pieces.	04:17.56
	and why only three components in the case of ink_annot? hrm.. :-/	04:19.19
	those APIs should just take a int n and float *colorv and not bother trying to be intelligent about colorspaces at all. :)	04:19.59
	tor8: InkAnnotation.setInkList() takes a float[] representing the x,y coordinates for each point of a path, but pdfref17.pdf page 636 actually allows for _several_ disjoint paths. so I believe we want to have float[][] or something like that.	04:41.11
	or possibly float coordinates[] and int pathStartIndicies[] indexing into the previous array.	04:41.48
	pdf_set_ink_annot_list() itself has a single float[] but it also has an int ncount and float counts[] so it knows how many coordinates belong to each path.	04:44.02
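The two layouts discussed above (one array per path versus one flat coordinate array plus per-path counts) can be converted into each other. The following is a minimal Python sketch of that packing, not the mupdf API: the function names `flatten_ink_paths` and `split_ink_paths` are hypothetical, and the counts are assumed to be point counts (two floats per point).

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def flatten_ink_paths(paths: Sequence[Sequence[Point]]) -> Tuple[List[float], List[int]]:
    """Pack several disjoint paths into one flat x,y list plus per-path point counts."""
    coords: List[float] = []
    counts: List[int] = []
    for path in paths:
        counts.append(len(path))
        for x, y in path:
            coords.extend((x, y))
    return coords, counts


def split_ink_paths(coords: Sequence[float], counts: Sequence[int]) -> List[List[Point]]:
    """Inverse of flatten_ink_paths: rebuild the list of paths."""
    paths: List[List[Point]] = []
    i = 0
    for n in counts:
        paths.append([(coords[i + 2 * k], coords[i + 2 * k + 1]) for k in range(n)])
        i += 2 * n
    if i != len(coords):
        raise ValueError("counts do not account for every coordinate")
    return paths
```

A Java-side float[][] would map onto the same idea: one inner array per path, flattened just before the call into the C layer.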
Hello71	how do I actually losslessly compress a PDF file? every source online seems to say 'just run gs -dPDFSETTINGS=/screen' which technically works but is not "lossless".	13:23.25
	in other words, why is it that 'wc -c somefile.pdf' is 971187 but 'gzip -1 -c somefile.pdf | wc -c' is 612208? I thought PDFs were supposed to already be DEFLATEd.	13:24.21
	I have already run "mutool clean -ggg -z" and "qpdf --min-version=1.7 --object-streams=generate" on this file. a suggestion to run --linearize unsurprisingly slightly increased the file size.	13:25.24
alexcher	Oh, so many layers of confusion.	13:29.45
	First, gzip'd file will be always different from the source file. Counting words is not a valid test.	13:31.44
	Second, PDF usually compresses only some of the objects.	13:32.51
	Finally, you need to define "lossless" compression for PDF files.	13:33.49
Hello71	1. "wc" even in its default mode does not only count words.	13:35.34
	2. based on my rudimentary understanding, "grep '<<' somefile.pdf" shows that all the "object"s are /FlateDecode, which is what I would expect from qpdf.	13:37.12
	3. it is possible by some standard algorithm (or non-standard, but the size must be included in the final count) to produce a pixel-identical representation of the PDF at a hypothetical infinite zoom.	13:38.18
	in other words, the PDF in question contains only text, images, and metadata, but no videos, forms, Flash animations, or the like.	13:39.06
alexcher	'<<' opens dictionary. This is unrelated to compression.	13:39.52
Hello71	additional points are given if it is possible to reproduce the original PDF byte-for-byte, which the "gzip" command qualifies for.	13:39.54
	yes, but AIUI, all objects must start with a dictionary, and the dictionary must list any filters (in this case, compressors) which must be used to open the object	13:40.41
alexcher	To delete some of the features of a pdf file and keep the rest intact, you need to work with the file structure. Ghostscript has a an utility to do the opposite -- decompress the compressed streams. This utility can be modified to patch the file as you want, but it requires some PostScript programming skills.	13:46.27
Hello71	what does that have to do with what I said	13:47.02
	look, forget everything I said	13:47.08
	why does 'gzip -1' significantly decrease PDF file size	13:47.26
alexcher	Only part of PDF is normally compressed.	13:47.56
sebras	and it is possible to ask a PDF producer to not compress _any_ parts of a PDF file.	13:49.58
	also PDF supports several compression filters (some lossy, some not).	13:50.35
Hello71	OK, which parts are not compressed	13:50.53
	again, I used "qpdf" which compresses stream data by default.	13:51.18
sebras	I don't know qpdf so I don't know what parts it decides to compress.	13:51.35
	but normally pictures are compressed e.g.	13:51.43
	for bitmap images DCT-based compression is common (basically embedding a JPEG inside the PDF).	13:52.29
	the lossless compression methods _can_ be used for bitmap images	13:52.49
Hello71	some output of "mutool info": https://bpaste.net/show/96ea09bed268	13:52.56
sebras	but I guess the idea of those is to be able to compression the graphic operators that describe vector graphics/text/etc.	13:53.17
Hello71	but again if the images are already compressed with DEFLATE, why would running "gzip" again help	13:53.42
sebras	Hello71: right, so you have three images compressed by flat compression.	13:53.52
	Hello71: because a PDF consists of several objects, one type of object are the images, another one the fonts you see there from mutool info	13:54.36
	Hello71: but also the content stream, which contains a description like: draw this line here in that color, that line here in this color, and here I want a circle containing a gradient of colors, etc.	13:55.11
	Hello71: sometimes these content streams are compressed, sometimes not. mutool info doesn't tell you whether or not that is the case though.	13:55.31
Hello71	which is why I did "grep '<<'" to find all dictionaries, which I was under the impression preceded all objects.	13:57.02
	I also did "strings -n10" which revealed no apparent text other than the dictionaries and "endstream".	13:57.35
sebras	Hello71: depending on the PDF that might be true, but not for all.	13:57.37
	Hello71: there is a special type of object that can contain other objects. the encapsulating object may be using e.g. ASCII85Encode and in that case you cannot see the internal objects using grep.	13:58.36
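The grep check discussed above can be approximated in a few lines of Python. This is a crude sketch, not a PDF parser: the name `stream_filters` is hypothetical, the regular expression only looks at the text between "obj" and "stream", and it can be fooled by binary stream data that happens to look like an object header. As noted above, objects stored inside an object stream are invisible to this kind of scan.

```python
import re

# "N G obj", then anything that does not cross "endobj", then the "stream" keyword.
_STREAM_OBJ = re.compile(
    rb"(\d+)\s+(\d+)\s+obj\b((?:(?!endobj).)*?)\bstream\r?\n",
    re.DOTALL,
)


def stream_filters(pdf_bytes: bytes) -> dict:
    """Map object number -> True if its stream dictionary mentions /FlateDecode."""
    found = {}
    for match in _STREAM_OBJ.finditer(pdf_bytes):
        header = match.group(3)
        found[int(match.group(1))] = b"/FlateDecode" in header
    return found
```

Any object number that maps to False is a stream stored without Flate and is worth inspecting by hand.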
Hello71	yes, but if all the objects are already DEFLATEd then it should not make a difference what is inside	13:59.13
	at least from the perspective of the file size after 'gzip'	13:59.22
sebras	Hello71: finally you have the the xref at the end of the file, which basically is a table of contents describing where in the file each object can be found.	14:00.17
Hello71	OK. The file says that that must also be opened using FlateDecode.	14:01.10
sebras	that part is normally not compressed though.	14:01.18
	but it _can_ be compressed using xref streams. and in that case using a lossless compression method like LZW or Flate or RLE.	14:02.58
	or maybe a combination of those.	14:03.05
	if it compresses better.	14:03.11
	for images you also have predictor parameter that can be tweaked to losslessly compress images even better.	14:04.53
	I doubt that most PDF producers (like qpdf) attempt to try all combinations of all these options to achieve a minimal file size.	14:05.18
	Hello71: why is it important to you to make a minimal file?	14:05.40
Hello71	because I want to store my PDFs in 100 MB instead of 200 MB	14:06.30
	maybe not so useful on desktop HDD, but more important on my phone with 8 GB storage space.	14:07.26
sebras	Hello71: well, I don't know of any software that attempts to combine all these methods to get the smallest possible file.	14:07.32
Hello71	again, brute-force compression is not what I am looking for	14:07.53
	my question, for the third time, is why does gzip -9 decrease PDF file size by up to half	14:08.22
	when all objects are already /Filter /FlateDecode	14:09.02
sebras	Hello71: without looking at this file I can't tell you. I can just explain how the PDF format works.	14:09.55
alexcher	Hello71: open your file in a test editor and see, what's inside.	14:10.30
sebras	there might be some object that (surprisingly) is uncompressed.	14:11.01
	what alexchar is suggestion it what I'd do myself.	14:11.16
	suggesting.	14:11.20
Hello71	again, as I said, I used the strings utility to automatically locate all printable strings over 10 characters.	14:13.55
sebras	Hello71: if you are having issues with the output of qpdf, why are you directing the questions to mupdf/ghostscript developers instead of the qpdf developers?	14:15.34
*sebras*	foods.	14:16.30
Hello71	qpdf is irrelevant	14:24.08
	or at least only vaguely relevant	14:24.16
	see https://alxu.ca/Untitled.pdf	14:25.35
	file size: 473867, file size after gzip -1: 350736	14:25.56
	all objects in PDF are under FlateDecode already and the only human-readable parts are dictionaries and markers	14:40.51
alexcher	Hello71: almost everything in your file is already compressed. If you need to reduce the file size, the presentation graphic should be simplified or removed.	15:06.43
	The question about the compressibility of your file is addressed to the wrong forum. This is a property of gzip. Perhaps some of the streams have common parts.	15:10.36
sebras	Hello71: https://github.com/qpdf/qpdf/blob/master/libqpdf/Pl_Flate.cc#L87 also qpdf is using the default compression level from zlib, not the best one. might matter.	17:33.58
	https://github.com/qpdf/qpdf/blob/master/libqpdf/Pl_PNGFilter.cc#L113 also it seems as if qpdf only implements the PNG up predictor when it may achieve better compression using another predictor for these particular images.	17:45.29
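For readers unfamiliar with predictors, the PNG "Up" filter mentioned above replaces each byte with its difference from the byte directly above it, so that Flate sees long runs of small values. Below is a minimal Python sketch, not qpdf's implementation: the function names are hypothetical, each filtered row is prefixed with the PNG filter-type byte 2, and the test image is synthetic and deliberately favourable to this predictor.

```python
import random
import zlib


def png_up_encode(raw: bytes, row_bytes: int) -> bytes:
    """Apply the PNG 'Up' filter (type 2) to every row of raw image data."""
    if len(raw) % row_bytes:
        raise ValueError("data is not a whole number of rows")
    out = bytearray()
    prev = bytes(row_bytes)  # the row above the first row counts as all zeros
    for start in range(0, len(raw), row_bytes):
        row = raw[start:start + row_bytes]
        out.append(2)  # filter-type byte for 'Up'
        out.extend((cur - up) & 0xFF for cur, up in zip(row, prev))
        prev = row
    return bytes(out)


def png_up_decode(filtered: bytes, row_bytes: int) -> bytes:
    """Undo png_up_encode."""
    out = bytearray()
    prev = bytes(row_bytes)
    stride = row_bytes + 1
    for start in range(0, len(filtered), stride):
        if filtered[start] != 2:
            raise ValueError("only the 'Up' filter is handled in this sketch")
        row = bytes((d + up) & 0xFF for d, up in zip(filtered[start + 1:start + stride], prev))
        out.extend(row)
        prev = row
    return bytes(out)


def synthetic_image(rows: int = 64, row_bytes: int = 512, seed: int = 0) -> bytes:
    """Noisy first row, every later row one step brighter: a vertical gradient over noise."""
    rng = random.Random(seed)
    first = [rng.getrandbits(8) for _ in range(row_bytes)]
    return bytes((v + r) & 0xFF for r in range(rows) for v in first)
```

On the synthetic image, zlib.compress(png_up_encode(raw, 512), 9) comes out far smaller than zlib.compress(raw, 9). On real images another PNG filter type may do better than Up, which is the point being made above.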
Hello71	even gzip -1 decreases file size by up to 30%	17:49.24
	again, I already ran mutool clean -z	17:49.43
sebras	Hello71: the two content streams for pages 2 and 3 (objects 23 and 24) each compress about 50% better (reducing in size by about ~60kbyte each). now, why does this happen? each content stream adjusts the translates and paints the same image in excess of 37000 times which presumably equates to more repetition than the flate compression window can handle, so recompressing the data allows for flate to	18:22.53
	discover more redundancy which it can compress.	18:22.59
	zlib uses the maximum window size it supports by default, and choosing Z_BEST_COMPRESSION level and allowing it to use a little more memory (DEF_MEM_LEVEL is 8, but can be 9), does not matter for compressing these content streams. in addition these two objects are located next to each other in the PDF file which likely gives gzip even more redundancy to take advantage of.	18:25.10
	I don't know if the gzip command line tool makes some more adjustments to some parameters that I'm not aware of.	18:26.41
	it might.	18:26.45
	actually if you run mutool show -b Untitled.pdf 23 > uncompressed.bin and then run gzip -vv9 uncompressed.bin you will see a similar size as the object in the PDF, but if you run gzip on the already gzipped file once more it will reduce in size again.	18:28.30
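The effect described above can be reproduced without the original file. The following Python sketch is a rough stand-in for the experiment, under stated assumptions: the content stream is synthetic (the same operator sequence repeated 37000 times, without the varying translations of the real file), and the helper names `flate_sizes` and `synthetic_content_stream` are hypothetical.

```python
import zlib


def synthetic_content_stream(repeats: int = 37000) -> bytes:
    """A stand-in for a content stream that paints the same image over and over."""
    return b"q 1 0 0 1 10 20 cm /Im0 Do Q\n" * repeats


def flate_sizes(data: bytes, passes: int = 3, level: int = 9) -> list:
    """Return [original size, size after 1 pass, size after 2 passes, ...]."""
    sizes = [len(data)]
    for _ in range(passes):
        data = zlib.compress(data, level)
        sizes.append(len(data))
    return sizes
```

Calling flate_sizes(synthetic_content_stream()) shows the size still dropping on the second pass: the first pass emits a long, regular sequence of back-references, and that sequence is itself repetitive enough to compress again.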
Hello71	huh. there's actually an easier way to show that: qpdf --stream-data=uncompress Untitled.pdf Untitled-decomp.pdf; gzip -c Untitled-decomp.pdf \| wc -c is 477600 (very close to the PDF size of 473867), but gzip -c Untitled-decomp.pdf \| gzip -c \| wc -c is 342985, very close to gzip -c Untitled.pdf = 343080	18:43.33
sebras	Hello71: well, if you do that you allow gzip to utilize any redundancy _between_ objects too. I wanted to see if each individual object contributed anything.	18:49.41
	:w	18:49.44
	Hello71: if I just recompress the flated objects an extra time and keep that version if they get smaller, I can get mutool clean -z to compress to 348723 where the normal gzip achieves 343046 (presumably due to the extra redundancy and also because it compressed the object structures, i.e. endstream keywords etc).	18:52.02
Hello71	am I missing something or is that then not actually a PDF	18:53.07
	or can you do /Filter [ /FlateDecode /FlateDecode ]	18:53.23
sebras	Hello71: you can.	18:53.30
	Hello71: or /Filter [ /FlateDecode /LZWDecode ] if you wish.	18:53.49
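A /Filter array is applied in the listed order when decoding, so stacking filters is just function composition. The sketch below illustrates that in Python along with the "keep the extra layer only if it is smaller" idea from above. It is not mutool's code: `decode_stream`, `flate_encode_layers` and `ascii_hex_decode` are hypothetical names, and only FlateDecode and ASCIIHexDecode are covered.

```python
import binascii
import zlib


def ascii_hex_decode(data: bytes) -> bytes:
    """Decode ASCIIHexDecode data: hex digits, optional whitespace, '>' as end marker."""
    body = data.split(b">", 1)[0]
    digits = bytes(c for c in body if c not in b" \t\r\n\f\0")
    if len(digits) % 2:
        digits += b"0"  # an odd final digit is treated as if followed by 0
    return binascii.unhexlify(digits)


DECODERS = {
    "FlateDecode": zlib.decompress,
    "ASCIIHexDecode": ascii_hex_decode,
}


def decode_stream(data: bytes, filters: list) -> bytes:
    """Apply the filters in the order they appear in the /Filter array."""
    for name in filters:
        data = DECODERS[name](data)
    return data


def flate_encode_layers(data: bytes, max_layers: int = 2, level: int = 9):
    """Flate-compress, adding further Flate layers only while they make the data smaller."""
    encoded = zlib.compress(data, level)
    filters = ["FlateDecode"]
    while len(filters) < max_layers:
        again = zlib.compress(encoded, level)
        if len(again) >= len(encoded):
            break
        encoded = again
        filters.append("FlateDecode")
    return encoded, filters
```

For highly repetitive input, flate_encode_layers returns two FlateDecode entries, which corresponds to /Filter [ /FlateDecode /FlateDecode ]. For input that is already dense it stops at one.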
	it's 3am now. /me sleeps.	18:55.42
Hello71	huh. I thought running zopfli instead would fix that, but zopfli compresses the whole file to 420623 which gzip then reduces to 302430	18:57.25
ray_laptop	Hello71: Please read http://ghostscript.com/doc/current/VectorDevices.htm#PDFWRITE	21:19.49
	this will describe the options you have for various parameters that allow Ghostscript (gs -sDEVICE=pdfwrite -o out.pdf .... in.pdf) to process a PDF and possibly reduce the file size. Note that by default, Ghostscript will compress the "Contents" which is the sequences of PDF operators (with Flate) that are used for vector, text, etc.	21:22.24
	Hello71: The monochrome images default to compression with CCITTG4 which is lossless (and pretty much the best you can get). Ghostscript pdfwrite automatically will detect images that are used in more than one place and only include them once	21:23.40
	Hello71: Don't bother with things like -dPDFSETTINGS=/screen (they are usually not optimal). In particular you may want -dColorImageFilter=/FlateEncode and -dGrayImageFilter=/FlateEncode	21:27.47
	Hello71: Note that embedded fonts aren't compressed since they tend not to compress well.	21:29.50
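The advice above can be collected into a single command line. The Python sketch below only builds the argument list and does not run anything. The helper name `pdfwrite_lossless_args` is hypothetical, and the two -dAutoFilter...=false switches are an assumption that does not come from the chat: as far as I understand the pdfwrite documentation, an explicit image filter choice is only honoured when automatic filter selection is turned off.

```python
def pdfwrite_lossless_args(src: str, dst: str, gs: str = "gs") -> list:
    """Build a Ghostscript pdfwrite command that asks for Flate on colour and gray images."""
    return [
        gs,
        "-sDEVICE=pdfwrite",
        "-o", dst,
        # Assumption: disable automatic JPEG/Flate selection so the filters below are used.
        "-dAutoFilterColorImages=false",
        "-dAutoFilterGrayImages=false",
        "-dColorImageFilter=/FlateEncode",
        "-dGrayImageFilter=/FlateEncode",
        src,
    ]
```

The list can be passed to subprocess.run(args, check=True) on a machine where Ghostscript is installed. Whether the result is actually smaller depends on the input file.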
Gianormagantrous	Do ghostscript have some sort of "hiding" or deleting part of a pdf ? Say for example, you had a pdf that had an awful logo on the bottom right of each page. Provided the coordinates to said logo, could I "replace it with white" or something ?	21:43.52
	I am not sure what term to look for (deleting, hiding, censuring, generate white square, etc)	21:44.52
ray_laptop	Gian... Not patient enough. Yes, Ghostscript can paint over part of a PDF. It involves setting a "EndPage" PostScript procedure to draw on the page	22:41.00
Hello71	ray_laptop: dunno what exactly you said has anything to do with my question	22:54.14
	I already found the problem with my PDF, the chunks are too large so they overflow the standard DEFLATE buffer.	22:54.58
ray_laptop	Hello71: you are trying to squash a PDF to make it as small as possible, but keep images "lossless", right ?	22:55.03
	Hello71: there is no such thing as a "standard deflate buffer"	22:55.30
Hello71	there is DEFLATE64 but it is not available in PDF (or even zlib), so might as well just use LZMA	22:55.35
	er, window, not buffer	22:55.42
	this PDF is fairly simple, so I am pretty sure I have exhausted the standard PDF compression methods.	22:57.42
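The window limit being discussed is easy to observe in isolation. Deflate can only refer back about 32 KiB, so a repeat that lies further away is invisible to it, while LZMA (not available as a PDF filter, as noted in this discussion) uses a much larger dictionary. The Python sketch below uses synthetic data and hypothetical helper names to show the difference.

```python
import lzma
import random
import zlib


def repeated_block(block_size: int, seed: int = 0) -> bytes:
    """Return one pseudo-random block followed by an exact copy of itself."""
    rng = random.Random(seed)
    block = bytes(rng.getrandbits(8) for _ in range(block_size))
    return block + block


def compressed_sizes(data: bytes) -> dict:
    """Compare raw size with zlib (Deflate) and LZMA output sizes."""
    return {
        "raw": len(data),
        "zlib": len(zlib.compress(data, 9)),
        "lzma": len(lzma.compress(data)),
    }
```

With a 20000-byte block the copy is within Deflate's reach and zlib roughly halves the data. With a 40000-byte block the copy starts beyond the window, zlib saves essentially nothing, and LZMA still roughly halves it.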
ray_laptop	Hello71: PDF has the Flate data in following a line "stream<cr><lf>" and you DEFINITELY don't want any other encoding filter added, such as ASCIIHexEncode (since that will make it bigger)	22:59.52
	Hello71: and PDF doesn't have LZMA	23:00.10
	Hello71: note that if you are starting with a PDF and it has DCTEncode images, then you don't want to use ghostscript since you want to retain the JPEG ORIGINAL data (and not make it bigger by making it into a FlateEncode image)	23:02.30
Hello71	[22:59:52] <+ray_laptop> Hello71: PDF has the Flate data in following a line "stream<cr><lf>" and you DEFINITELY don't want any other encoding filter added, such as ASCIIHexEncode (since that will make it bigger)	23:49.12
	if you examine the PDF I provided it uses just Flate objects	23:49.33
	[23:00:10] <+ray_laptop> Hello71: and PDF doesn't have LZMA	23:49.38
	my point is that if DEFLATE64 and LZMA are both not supported but LZMA is better then just use LZMA	23:49.58
	[23:02:30] <+ray_laptop> Hello71: note that if you are starting with a PDF and it has DCTEncode images, then you don't want to use ghostscript since you want to retain the JPEG ORIGINAL data (and not make it bigger by making it into a FlateEncode image)	23:50.06
	again, if you examine the PDF I provided all the images are Flate	23:50.16
	probably since I selected "lossless compression" in LibreOffice	23:50.29
ray_laptop	Hello71: provided where ? (I may have missed that) was it attached to a bug report ?	23:51.02
Hello71	[14:25:35] < Hello71> see https://alxu.ca/Untitled.pdf	23:51.52
	it is not a bug in anything, just poor design of DEFLATE (relative to newer compression methods)	23:52.17
ray_laptop	Hello71: We don't deal with LibreOffice, but once you have a file that is created lossless, gs _can_ decide 'automagically" to use JPEG if an image looks to be suitable	23:52.50
	Hello71: as far as the compression methods available in PDF, talk to the ISO 32000 committee -- we have to follow the spec	23:53.35
	Hello71: since gs can "AutoFilter..." images, it may help. We have many customer/users that use that. Any "sharp" transitions are assumed to need lossless (Flate), otherwise if it is "smooth" then jpeg (DCTEncode) is used	23:55.34