MuPDF IRC logs

	<<<Back 1 day (to 2018/03/07)	20180308
Robin_Watts	n35xdxb0: There is no config file for MuPDF.	00:30.18
	You'd need to change the C code. It's not hard.	00:30.35
n35xdxb0	Robin_Watts: and then recompile?	01:36.36
sebras	tor8: I found the reasons for the issues in 699094	10:21.01
	tor8: I have an ugly patch that appears to fix the immediate issues on sebras/master	10:21.19
tor8	sebras: ohhh!	10:22.26
	or maybe we should just detect JPX and refuse to decompress?	10:22.41
sebras	tor8: we could, but should we? after all the user has supplied -d in order to decompress them.	10:23.13
tor8	that's a lot less useful for images than content streams and fonts	10:23.38
	I would almost consider a JPX image to be a 'resource' blob that shouldn't be decompressed	10:24.30
	and almost the same with JPEG	10:24.39
sebras	tor8: what about flated images?	10:24.53
tor8	since those things are self contained files if you just extract the raw compressed data	10:24.56
sebras	tor8: ah, I see where you are coming from now.	10:25.12
tor8	flated images aren't self contained -- they have no information about colorspace and size like JPEG and JPX do	10:25.21
	which is why we have a special case for extracting JPEG images in 'mutool extract' (though that ought to be extended to also detect JPX I think)	10:25.58
sebras	tor8: yeah, I see that now.	10:26.00
	tor8: vim source/tools/pdfextract.c oh!	10:26.31
	tor8: yes, that code should probably handle JPX too.	10:27.25
	tor8: I'll whip up a patch.	10:31.17
	tor8: so the extract code actually calls pdf_load_image() which converts the JPX images to decoded images at load time.	10:59.39
	tor8: hence mutool extract has no way of knowing that the image is compressed.	10:59.57
tor8	pdf_load_image loads a 'compressed buffer'	11:00.09
sebras	tor8: not for JPXs.	11:00.31
tor8	oh... yes, you're right.	11:00.55
sebras	tor8: pdf_load_jpx() calls fz_new_image_from_pixmap() after having loaded the uncompressed JPX.	11:01.09
	tor8: but for all other image formats you are correct! :)	11:01.32
tor8	we have to do that because we don't have access to the w/h in the PDF dictionary	11:01.40
	nor the colorspace	11:01.49
sebras	tor8: right, I wonder if it would suffice to decode just the headers though..?	11:02.03
	tor8: that way we could postpone the decoding of actual image data until a later point.	11:02.16
tor8	yes. this might be the cause for Robin_Watts, jogux, paulgardiner's slow to cancel files	11:03.03
	because the decoding happens on the image loading thread, not in the rendering thread where it happens for all other image formats	11:03.24
sebras	tor8: right, if the images they are decoding are JPXs, do we know?	11:03.43
tor8	does who know?	11:08.38
sebras	tor8: has anyone mentioned if the issues are specifcally with PDFs containing JPX-images?	11:09.18
tor8	oh right, it migth have been big jpegs. I'm not sure on the details.	11:09.48
	if we add a JPX info header-only decoding (and it's faster than full decoding) then we could load the header in pdf_load_image_imp and stick the data in a compressed buffer in pdf_load_jpx_imp	11:10.12
	or not bother with a fz_compressed_buffer and have a special class of fz_image for jpx that has the data	11:10.44
	in pdf_load_jpx we currently unconditionally call fz_load_jpx	11:11.51
	but we could rejig that call to happen in the fz_image.get_pixmap callback instead and do the rest based on fz_load_jpx_info	11:12.14
	or you can just point Robin_Watts at this and let him do it :)	11:12.23
	I'm not 100% convinced of the benefits	11:12.39
	if you use JPX you deserve whatever punishment you get...	11:12.51
Robin_Watts	tor8: The file that exhibited "slow to cancel" was jpeg, not jpeg2k.	11:13.09
sebras	Robin_Watts: wait until you have a file using j2k. :)	11:13.38
Robin_Watts	We hold the compressed data in an fz_compressed_buffer, so 1) we can decode without hitting the file, and 2) so we can get the original data back (for most formats at least) in PDF writing.	11:14.30
	sebras: I'm sure JPX will be worse, yes.	11:14.45
sebras	tor8: seems like fz_load_jpx_info() costs about 1/10th of fz_load_jpx() on the JPXs from the file from 699094	11:24.59
	tor8: so it might be worth doing.	11:25.04
	with openjpeg.	11:26.27
jogux	morning all. could someone review http://git.ghostscript.com/?p=user/joseph/mupdf-ios-viewer.git;a=commitdiff;h=a5e7da5fcc2c3e196f05dda50b90829392ab97c0;hp=087aef041c257325cf16ee6a83432aeac2304972 please?	11:32.03
	(it's trivial_)	11:33.26
tor8	jogux: LGTM.	11:35.23
sebras	jogux: lgtm.	11:35.27
tor8	sebras: though it might just end upp costing us 1/16th extra time to draw jpx images	11:35.59
sebras	tor8: because we need to move data into memory and then decode it.	11:36.45
tor8	yeah, the question is how slow is 'fz_load_jpx_info+fz_load_jpx' on the same data, as compared to just 'fz_load_jpx'	11:37.57
jogux	thank you tor8/sebras	11:41.40
malc_	mupdf's (muhtml's) lack of any warnings/errors when fed something it has problem with is distressing :(	11:42.34
sebras	tor8: when I retest on other files fz_load_jpx_image and fz_load_jpx have roughly the same execution time.	11:46.52
	tor8: so that would mean that we would only get slower.	11:47.28
	I wonder why the j2k libraries are not faster at doing this decoding.	11:47.50
tor8	they probably don't consider the use case of only loading the header info	11:51.19
	or the APIs don't expose 'load a header' function	11:51.34
sebras	tor8: ehm, seems like it might be us that are silly in jpx_read_image().	11:51.54
	we actually decode the image when only needing the headers.. :-/	11:52.17
	now I get JPX info: .001902s versus JPX decode: 5.012461s	11:52.43
tor8	all we save is needing to unpack the decompressed image into a fz_pixmap	11:52.57
sebras	tor8: in pdf_load_image_imp() for non-JPX we check imagemask and forcemask and if either is true we set n == 1 (leaving colorspace == NULL) and ignore the number of components according to the colorspace. but for JPX we in pdf_load_jpx_imp() call fz_convert_pixmap() to convert it to grayscale. seems like we are not handling the two cases in the same way.	12:30.26
	why wouldn't we decode the image accoridng to the colorspcae and then convert to grayscale for non-JPX like we do for JPX?	12:30.50
tor8	you mean for softmasks?	12:31.43
	that would probably be good. I'm not sure we've ever seen any files that aren't grayscale softmasks other than JPX (which are usually pretty crappily constructed)	12:32.23
sebras	ah.	12:33.03
	tor8: I guess that's as a good reason as any. :)	12:33.41
	tor8: btw, you asked me to ping you about the commit at the top of sebras/mater.	13:09.15
	master	13:09.17
tor8	sebras: in "Do not warn..." wouldn't a simple if (obj) { ... old stuff with if (!pdf_is_stream) ... } bracketing if-statement be cleaner?	13:42.32
	Fix 699086 LGTM	13:42.48
	and the scissor stack one also LGTM (I think I've reviewed those two before)	13:43.03
	let me have a look at the crypto/md5 stuff now	13:43.19
sebras	tor8: I was worried about the indentation depth. but now i see that build_filter() indents just as deep further down. I'll fix the if that you mentioned above.	13:44.24
tor8	indentation depth takes a (very much) back seat to clear logic :)	13:45.36
sebras	sebras/master updated.	13:46.13
	tor8: the key thing (pun intended) with the crypto one is that there's an MD5 sum being computed, so using more than 16 bytes is an issue.	13:46.52
	tor8: also note that this issue was brought up by a fuzzed file.	13:47.09
tor8	sebras: see page 126 of pdfref17	13:50.28
	algorithm 3.3, step 4: "Create an RC4 encryption key using the first n bytes of the output from the final MD5 hash."	13:51.00
	note 'n' not 16	13:51.06
sebras	tor8: so the values of crypt->v == 5 but crypt->r == 4	13:51.10
tor8	we should probably check that n<=16 and throw if not	13:51.19
	pdf_compute_user_password uses the same logic	13:54.02
	so that should also verify that n <=16	13:54.10
	(or just clamp it to 0..16 and let the file fail when it can't match the keys and fail to decrypt)	13:55.03
sebras	fz_mini(n, 16) seems apt.	13:55.25
tor8	fz_clampi(n, 0, 16) seems better, since it could also be negative...	13:55.42
sebras	no it cant.	13:56.06
	the only time we read crypt->length from the file is if crypt->r == 2 or 4	13:56.20
	ehm.. v	13:56.33
	crypt->v	13:56.35
	and we check if length < 40 \|\| length > 128 and throw	13:56.48
	in all other cases length is == 40 or == 256	13:56.59
tor8	ah, right	13:57.04
	then fz_mini seems apt (or if (n>16) throw()	13:57.32
sebras	though the logic in pdf_compute_user_password() has _much_ simpler.	13:57.34
	I'll rerrange pdf_authenticate_owner_password() similarly I think.	13:58.00
malc_	sebras: has(?) _much_ simpler... is?	13:58.15
	Forward 1 day (to 2018/03/09)>>>

Log of #mupdf at irc.freenode.net.