| <<<Back 1 day (to 2018/03/07) | 20180308 |
Robin_Watts | n35xdxb0: There is no config file for MuPDF. | 00:30.18 |
| You'd need to change the C code. It's not hard. | 00:30.35 |
n35xdxb0 | Robin_Watts: and then recompile? | 01:36.36 |
sebras | tor8: I found the reasons for the issues in 699094 | 10:21.01 |
| tor8: I have an ugly patch that appears to fix the immediate issues on sebras/master | 10:21.19 |
tor8 | sebras: ohhh! | 10:22.26 |
| or maybe we should just detect JPX and refuse to decompress? | 10:22.41 |
sebras | tor8: we could, but should we? after all the user has supplied -d in order to decompress them. | 10:23.13 |
tor8 | that's a lot less useful for images than content streams and fonts | 10:23.38 |
| I would almost consider a JPX image to be a 'resource' blob that shouldn't be decompressed | 10:24.30 |
| and almost the same with JPEG | 10:24.39 |
sebras | tor8: what about flated images? | 10:24.53 |
tor8 | since those things are self contained files if you just extract the raw compressed data | 10:24.56 |
sebras | tor8: ah, I see where you are coming from now. | 10:25.12 |
tor8 | flated images aren't self contained -- they have no information about colorspace and size like JPEG and JPX do | 10:25.21 |
| which is why we have a special case for extracting JPEG images in 'mutool extract' (though that ought to be extended to also detect JPX I think) | 10:25.58 |
sebras | tor8: yeah, I see that now. | 10:26.00 |
| tor8: vim source/tools/pdfextract.c oh! | 10:26.31 |
| tor8: yes, that code should probably handle JPX too. | 10:27.25 |
| tor8: I'll whip up a patch. | 10:31.17 |
| tor8: so the extract code actually calls pdf_load_image() which converts the JPX images to decoded images at load time. | 10:59.39 |
| tor8: hence mutool extract has no way of knowing that the image is compressed. | 10:59.57 |
tor8 | pdf_load_image loads a 'compressed buffer' | 11:00.09 |
sebras | tor8: not for JPXs. | 11:00.31 |
tor8 | oh... yes, you're right. | 11:00.55 |
sebras | tor8: pdf_load_jpx() calls fz_new_image_from_pixmap() after having loaded the uncompressed JPX. | 11:01.09 |
| tor8: but for all other image formats you are correct! :) | 11:01.32 |
tor8 | we have to do that because we don't have access to the w/h in the PDF dictionary | 11:01.40 |
| nor the colorspace | 11:01.49 |
sebras | tor8: right, I wonder if it would suffice to decode just the headers though..? | 11:02.03 |
| tor8: that way we could postpone the decoding of actual image data until a later point. | 11:02.16 |
tor8 | yes. this might be the cause for Robin_Watts, jogux, paulgardiner's slow to cancel files | 11:03.03 |
| because the decoding happens on the image loading thread, not in the rendering thread where it happens for all other image formats | 11:03.24 |
sebras | tor8: right, if the images they are decoding are JPXs, do we know? | 11:03.43 |
tor8 | does who know? | 11:08.38 |
sebras | tor8: has anyone mentioned if the issues are specifcally with PDFs containing JPX-images? | 11:09.18 |
tor8 | oh right, it migth have been big jpegs. I'm not sure on the details. | 11:09.48 |
| if we add a JPX info header-only decoding (and it's faster than full decoding) then we could load the header in pdf_load_image_imp and stick the data in a compressed buffer in pdf_load_jpx_imp | 11:10.12 |
| or not bother with a fz_compressed_buffer and have a special class of fz_image for jpx that has the data | 11:10.44 |
| in pdf_load_jpx we currently unconditionally call fz_load_jpx | 11:11.51 |
| but we could rejig that call to happen in the fz_image.get_pixmap callback instead and do the rest based on fz_load_jpx_info | 11:12.14 |
| or you can just point Robin_Watts at this and let him do it :) | 11:12.23 |
| I'm not 100% convinced of the benefits | 11:12.39 |
| if you use JPX you deserve whatever punishment you get... | 11:12.51 |
Robin_Watts | tor8: The file that exhibited "slow to cancel" was jpeg, not jpeg2k. | 11:13.09 |
sebras | Robin_Watts: wait until you have a file using j2k. :) | 11:13.38 |
Robin_Watts | We hold the compressed data in an fz_compressed_buffer, so 1) we can decode without hitting the file, and 2) so we can get the original data back (for most formats at least) in PDF writing. | 11:14.30 |
| sebras: I'm sure JPX will be worse, yes. | 11:14.45 |
sebras | tor8: seems like fz_load_jpx_info() costs about 1/10th of fz_load_jpx() on the JPXs from the file from 699094 | 11:24.59 |
| tor8: so it might be worth doing. | 11:25.04 |
| with openjpeg. | 11:26.27 |
jogux | morning all. could someone review http://git.ghostscript.com/?p=user/joseph/mupdf-ios-viewer.git;a=commitdiff;h=a5e7da5fcc2c3e196f05dda50b90829392ab97c0;hp=087aef041c257325cf16ee6a83432aeac2304972 please? | 11:32.03 |
| (it's trivial_) | 11:33.26 |
tor8 | jogux: LGTM. | 11:35.23 |
sebras | jogux: lgtm. | 11:35.27 |
tor8 | sebras: though it might just end upp costing us 1/16th extra time to draw jpx images | 11:35.59 |
sebras | tor8: because we need to move data into memory and then decode it. | 11:36.45 |
tor8 | yeah, the question is how slow is 'fz_load_jpx_info+fz_load_jpx' on the same data, as compared to just 'fz_load_jpx' | 11:37.57 |
jogux | thank you tor8/sebras | 11:41.40 |
malc_ | mupdf's (muhtml's) lack of any warnings/errors when fed something it has problem with is distressing :( | 11:42.34 |
sebras | tor8: when I retest on other files fz_load_jpx_image and fz_load_jpx have roughly the same execution time. | 11:46.52 |
| tor8: so that would mean that we would only get slower. | 11:47.28 |
| I wonder why the j2k libraries are not faster at doing this decoding. | 11:47.50 |
tor8 | they probably don't consider the use case of only loading the header info | 11:51.19 |
| or the APIs don't expose 'load a header' function | 11:51.34 |
sebras | tor8: ehm, seems like it might be us that are silly in jpx_read_image(). | 11:51.54 |
| we actually decode the image when only needing the headers.. :-/ | 11:52.17 |
| now I get JPX info: .001902s versus JPX decode: 5.012461s | 11:52.43 |
tor8 | all we save is needing to unpack the decompressed image into a fz_pixmap | 11:52.57 |
sebras | tor8: in pdf_load_image_imp() for non-JPX we check imagemask and forcemask and if either is true we set n == 1 (leaving colorspace == NULL) and ignore the number of components according to the colorspace. but for JPX we in pdf_load_jpx_imp() call fz_convert_pixmap() to convert it to grayscale. seems like we are not handling the two cases in the same way. | 12:30.26 |
| why wouldn't we decode the image accoridng to the colorspcae and then convert to grayscale for non-JPX like we do for JPX? | 12:30.50 |
tor8 | you mean for softmasks? | 12:31.43 |
| that would probably be good. I'm not sure we've ever seen any files that aren't grayscale softmasks other than JPX (which are usually pretty crappily constructed) | 12:32.23 |
sebras | ah. | 12:33.03 |
| tor8: I guess that's as a good reason as any. :) | 12:33.41 |
| tor8: btw, you asked me to ping you about the commit at the top of sebras/mater. | 13:09.15 |
| master | 13:09.17 |
tor8 | sebras: in "Do not warn..." wouldn't a simple if (obj) { ... old stuff with if (!pdf_is_stream) ... } bracketing if-statement be cleaner? | 13:42.32 |
| Fix 699086 LGTM | 13:42.48 |
| and the scissor stack one also LGTM (I think I've reviewed those two before) | 13:43.03 |
| let me have a look at the crypto/md5 stuff now | 13:43.19 |
sebras | tor8: I was worried about the indentation depth. but now i see that build_filter() indents just as deep further down. I'll fix the if that you mentioned above. | 13:44.24 |
tor8 | indentation depth takes a (very much) back seat to clear logic :) | 13:45.36 |
sebras | sebras/master updated. | 13:46.13 |
| tor8: the key thing (pun intended) with the crypto one is that there's an MD5 sum being computed, so using more than 16 bytes is an issue. | 13:46.52 |
| tor8: also note that this issue was brought up by a fuzzed file. | 13:47.09 |
tor8 | sebras: see page 126 of pdfref17 | 13:50.28 |
| algorithm 3.3, step 4: "Create an RC4 encryption key using the first n bytes of the output from the final MD5 hash." | 13:51.00 |
| note 'n' not 16 | 13:51.06 |
sebras | tor8: so the values of crypt->v == 5 but crypt->r == 4 | 13:51.10 |
tor8 | we should probably check that n<=16 and throw if not | 13:51.19 |
| pdf_compute_user_password uses the same logic | 13:54.02 |
| so that should also verify that n <=16 | 13:54.10 |
| (or just clamp it to 0..16 and let the file fail when it can't match the keys and fail to decrypt) | 13:55.03 |
sebras | fz_mini(n, 16) seems apt. | 13:55.25 |
tor8 | fz_clampi(n, 0, 16) seems better, since it could also be negative... | 13:55.42 |
sebras | no it cant. | 13:56.06 |
| the only time we read crypt->length from the file is if crypt->r == 2 or 4 | 13:56.20 |
| ehm.. v | 13:56.33 |
| crypt->v | 13:56.35 |
| and we check if length < 40 || length > 128 and throw | 13:56.48 |
| in all other cases length is == 40 or == 256 | 13:56.59 |
tor8 | ah, right | 13:57.04 |
| then fz_mini seems apt (or if (n>16) throw() | 13:57.32 |
sebras | though the logic in pdf_compute_user_password() has _much_ simpler. | 13:57.34 |
| I'll rerrange pdf_authenticate_owner_password() similarly I think. | 13:58.00 |
malc_ | sebras: has(?) _much_ simpler... is? | 13:58.15 |
| Forward 1 day (to 2018/03/09)>>> | |