MuPDF IRC logs

20180122
FLEXO_	Hi. Is it possible to change scrolling direction in MuPDF?	06:47.08
Steve_	Hi, I've tested the "mupdf viewer" and I found some annoying bug: I read some ebook (epub) and the Viewer "crops" some pages, that the last row is missing. First I thought it's a bug in the epub and I opened it with another Reader. No Problem, the missing words are there. Opened in mupdf Viewer, the words are missing. So why and where are the words gone?	07:23.28
kens	Steve_ : if you think you've found a bug, please open a bug report. It's not possible to even speculate about the problem without seeing the file in question, so you'll need to supply that and the easiest way to do that is to open a report.	08:07.28
_baskerville_	@kens: I've already made reports for this problem: 698875 and 698770	08:25.54
kens	OK then presumably someone will look at it in time	08:27.02
	Assuming it's the same problem	08:27.22
_baskerville_	I've started investigating it myself: mupdf relies on ceilf(ch->html->root->h / ch->html->page_h) to compute the number of pages inside a chapter.	08:28.35
kens	_baskerville_ : I regret to say there's no point telling me about it, since I'm not a MuPDF developer..... But if you want to discuss it here, go ahead	08:29.12
_baskerville_	Sorry: I thought the devs were there.	08:30.03
kens	Bit early yet, unless sebras is around, but it's late for his time zone	08:30.32
FLEXO_	still no help for the scrolling direction problem?	10:12.31
kens	You'll need one of the developers for that	10:13.03
	Though I'm not sure exactly what you mean by 'change the scrolling direction'	10:13.33
FLEXO_	if i open a PDF in the MuPDF viewer and scroll with the mouse wheel it works the opposite way (Page order) than lets say in the firefox PDF viewer	10:32.42
	thanks anyway for the answer	10:32.58
kens	Well there are several problems there. MuPDF actually refers to the underlying library, the various viewers are implementations of code which is more or less demo code on top of that library.	10:33.30
	Each of the viewers differs in its capabilities, and you haven't said which one you are using. Or on which OS.	10:33.57
	If you don't like it, you can easily change it I would think.	10:34.08
FLEXO_	OS is win 10 and the viewer is the one from the homepage Version 1.12.0	10:39.25
kens	I'm reasonably certain the only way to change the direction of operation of the mouse scrollwheel is going to be by changing the code and recompiling it.	10:39.57
	Though it doesn't sound like an especially hard task	10:40.11
FLEXO_	:D so thats a bit tricky for me. I only develop PLC code. Thanks anyway	10:41.08
malc_	FLEXO_: any particular reason why you want to use vanilla mupdf viewer and not something based on the mupdf library, such as, say, SumatraPDF?	10:44.55
FLEXO_	I really like to have it as minimal as possible. But hey, i installed Sumatra. Thanks for the hint.	10:48.13
paulgardiner	What does fz_open_null do? It looks like it might be a filter that gives access to a specified range of bytes from a stream, but then, if so, I don't get the name.	13:25.35
tor8	it's a 'null' filter in postscript terminology	13:25.52
	one that doesn't modify the bytes (just restricts to a subset of the 'parent' stream)	13:26.05
kens	A bit bucket :-)	13:26.06
	Oh a read filter	13:26.26
paulgardiner	Ah "null" in the sense that it doesn't process the bytes.	13:26.39
	Could be that null and concat are just what I need. I want to pass a stream to the pkcs7 library to tell it what bytes to hash. Possibly I can use null and concat to avoid having to also pass the byte ranges.	13:28.33
	... but concat seems to take ownership of the substreams which isn't ideal.	13:29.02
	I could be completely misunderstanding what these so.	13:29.17
	do	13:29.31
tor8	yeah, the concat stream also adds a space character between the streams (it's meant to be used for concatenating the substreams in a PDF content stream)	13:29.35
paulgardiner	Oh right. Not quite what I'm looking for.	13:30.02
tor8	paulgardiner: you could make a new filter which works similar to 'null' filter (or generalise it) to take a set of ranges rather than just one range	13:30.17
paulgardiner	Yes, I was just looking to do that, when I saw there were similar things already there.	13:30.42
tor8	it should be simpler than the concat filter because you can base it on one source stream rather than needing to switch between streams	13:31.14
paulgardiner	Yes true	13:31.26
tor8	and just seek past the bits you don't want	13:31.36
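The multi-range idea tor8 describes (one source stream, several byte ranges, seeking past the gaps) might look roughly like the sketch below. It reads from an in-memory buffer rather than a MuPDF stream, and `byte_range`, `range_reader`, `range_reader_init` and `range_reader_read` are hypothetical names, not MuPDF API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One byte range of the underlying source: [offset, offset + len). */
struct byte_range { size_t offset; size_t len; };

/* Hypothetical reader state; a stand-in for a filter, not a MuPDF type. */
struct range_reader {
    const unsigned char *src;        /* underlying source, not owned */
    size_t src_len;
    const struct byte_range *ranges; /* ranges to deliver, in order, not owned */
    size_t nranges;
    size_t cur;                      /* index of the range being read */
    size_t done;                     /* bytes already delivered from ranges[cur] */
};

void range_reader_init(struct range_reader *r,
        const unsigned char *src, size_t src_len,
        const struct byte_range *ranges, size_t nranges)
{
    assert(r != NULL);
    r->src = src;
    r->src_len = src_len;
    r->ranges = ranges;
    r->nranges = nranges;
    r->cur = 0;
    r->done = 0;
}

/* Copy up to n bytes from the selected ranges into buf, skipping everything
 * between ranges. Returns the number of bytes copied; 0 means end of data. */
size_t range_reader_read(struct range_reader *r, unsigned char *buf, size_t n)
{
    size_t total = 0;
    while (total < n && r->cur < r->nranges)
    {
        const struct byte_range *rg = &r->ranges[r->cur];
        size_t start = rg->offset + r->done;
        size_t avail = rg->len - r->done;
        size_t want = n - total;

        /* Clamp the range to the end of the source. */
        if (start >= r->src_len)
            avail = 0;
        else if (avail > r->src_len - start)
            avail = r->src_len - start;

        if (avail == 0)
        {
            /* Range exhausted: "seek" past the gap by moving to the next one. */
            r->cur++;
            r->done = 0;
            continue;
        }
        if (want > avail)
            want = avail;
        memcpy(buf + total, r->src + start, want);
        r->done += want;
        total += want;
    }
    return total;
}
```

Because the reader only borrows `src` and `ranges`, it sidesteps the ownership question entirely; a real filter would still have to decide whether it keeps or takes the caller's reference.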
paulgardiner	concat was possibly not the ideal way to do it in any case	13:32.01
	Hmmm, fz_open_null takes ownership too, I think. That may be a problem	13:36.14
	Strange, because it doesn't have "drop" in its name.	13:37.11
tor8	paulgardiner: the filters have a quirky way of owning reference counts	13:37.41
	it's a bit problematic in places, but we haven't got around to cleaning it up yet	13:37.57
	this code predates reference counting	13:38.01
	if you want to fix it up to be sane, please do! :)	13:38.25
paulgardiner	:-)	13:38.31
tor8	the complexity comes from how filter chains are built up in the pdf interpreter	13:38.41
paulgardiner	I can imagine doing a lot of damage trying to clean that up	13:39.10
tor8	paulgardiner: if you make a new filter that doesn't take ownership, just call it "fz_new_skip_filter" rather than "fz_open_skip"	13:41.00
	and when I get around to cleaning up these murky areas, I'll change the names of the filter creation functions	13:41.15
paulgardiner	I could make a wrapper filter that has no filtering effect at all but avoids dropping its argument. Bit of a hack	13:41.20
	That way I could take my stream that I don't want dropped, wrap it and pass it to the generalised null filter.	13:42.04
	In one case, I'm accessing doc->file. Would be bad if I closed that I think.	13:43.28
tor8	you could fz_open_null(fz_keep_stream(doc->file))	13:44.37
	as a temporary fix until the code is fixed up	13:44.59
paulgardiner	Oh I see. I didn't think streams were reference counted. I'm confusing them with outputs possibly	14:17.30
tor8	paulgardiner: it used to be they weren't reference counted (hence why fz_open_null takes ownership)	14:22.16
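The keep-before-passing trick can be illustrated with a toy reference-counted stream. This is only a sketch of the ownership pattern: `toy_stream` and the `toy_*` functions are made up for illustration and are not MuPDF's `fz_stream` implementation.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy reference-counted stream; a stand-in, not MuPDF's fz_stream. */
struct toy_stream {
    int refs;
    struct toy_stream *chain; /* owned reference to the wrapped stream, or NULL */
};

struct toy_stream *toy_new_stream(void)
{
    struct toy_stream *s = calloc(1, sizeof *s);
    if (s)
        s->refs = 1;
    return s;
}

/* Take an extra reference and return the same pointer. */
struct toy_stream *toy_keep_stream(struct toy_stream *s)
{
    if (s)
        s->refs++;
    return s;
}

/* Give up one reference; free the stream (and drop its chain) at zero. */
void toy_drop_stream(struct toy_stream *s)
{
    if (!s)
        return;
    assert(s->refs > 0);
    if (--s->refs == 0)
    {
        toy_drop_stream(s->chain);
        free(s);
    }
}

/* Like the filters discussed above, this takes ownership of the caller's
 * reference to chain: the caller must not drop chain again afterwards. */
struct toy_stream *toy_open_wrapper(struct toy_stream *chain)
{
    struct toy_stream *s = toy_new_stream();
    if (!s)
    {
        toy_drop_stream(chain);
        return NULL;
    }
    s->chain = chain;
    return s;
}
```

Calling `toy_open_wrapper(toy_keep_stream(file))` hands the wrapper its own reference, so dropping the wrapper later leaves the caller's reference to `file` intact.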
sebras	titanous: https://github.com/google/oss-fuzz/blob/master/projects/mupdf/pdf_fuzzer.cc#L28 what is setting ctm here?	15:10.15
	titanous: seems like you might as well use fz_identity..?	15:10.31
	paulgardiner: ugh, why does the cluster claim that the fread you put into pdf-write as part of support for levels of incremental xref sections cause a _new_ warning when I run my code? :-/	15:58.03
	paulgardiner: that's not fair.	15:58.09
kens	Happens with GS all the time	15:58.25
	The code for detecting if a warning is new is not 100% reliable	15:58.48
	If you look at my last commit it raises new warnings on jpegxr, which I didn't touch.....	15:59.10
sebras	kens: nope, apparently not. if I had edited that file so the line had changed or something, then I'd understand it.	15:59.13
titanous	sebras: I think that was actually just copied from some example, if that's not the right way to do it, I can fix, just let me know what it should be	16:07.36
sebras	if you change line 28 to be fz_matrix ctm = fz_identity; I think we should be safe.	16:08.52
	titanous: if ctm is not initialized with any values then they might spread to other places.	16:09.21
titanous	k, will do	16:09.52
sebras	tor8: Robin_Watts: do you guys mind taking a look at sebras/master?	16:10.54
	it is clustering as we write.	16:10.59
Robin_Watts	sebras: Looking.	16:13.58
	sebras: All 4 look plausible to me.	16:16.23
sebras	Robin_Watts: is that LGTM?	16:17.09
	Robin_Watts: or should I ask tor8 too?	16:17.16
Robin_Watts	lgtm	16:22.42
sebras	fredross-perry (for the logs): call_SeekableOutputStream_tell(), call_SeekableInputStream_seek(), call_SeekableOutputStream_seek(), call_SeekableInputStream_close() has a fz_throw(...., "env is NULL...") call where the indentation is not correct.	16:23.31
titanous	sebras: the fuzzer is fixed, do you think that was the cause of any of the crashes?	16:24.51
sebras	titanous: I'm not sure. I have yet to be able to reproduce a few of them.	16:25.08
	titanous: most of them I can reproduce however.	16:25.15
titanous	sebras: can you link me to the testcases on oss-fuzz that you can't repro? I can try	16:25.45
sebras	titanous: 5502 is one of those.	16:29.40
	titanous: that would be the minimized testcase from https://oss-fuzz.com/v2/testcase-detail/6194612382203904?noredirect=1	16:30.06
	can't see it using gcc ASAN, nor valgrind locally.	16:30.33
titanous	try clang-6.0, which is the version used by oss-fuzz	16:31.13
sebras	fredross-perry (for the logs): call_SeekableOutputStream_tell(), call_SeekableInputStream_seek(), call_SeekableOutputStream_seek(), call_SeekableInputStream_close() has a fz_throw(...., "env is NULL...") call where the indentation is not correct.	16:31.16
titanous	it repros using the reproduce tool	16:31.26
sebras	titanous: ok, good to know.	16:31.33
	titanous: I haven't got clang-6.0 available to install at the moment. and looking at the other bugs is probably a better idea.	16:32.04
Robin_Watts	sebras, tor8: So, we have a customer who is using MuPDF to convert from PDF to SVG.	16:33.01
	And he's complaining that we don't output 'layer' information in the produced SVG file.	16:33.18
	The layer information would come from the 'OC' stuff in PDFs.	16:33.46
titanous	sebras: https://bugs.chromium.org/p/oss-fuzz/issues/list?can=2&q=label%3AClusterFuzz-Top-Crash+project-mupdf that's the list of bugs slowing down the fuzzers, but I'd say the security bugs should be the first priority	16:34.01
Robin_Watts	To get that stuff through, we'd need to add a new device method, I think.	16:34.01
fredross-perry	sebras - in those cases the "if" is indented with one tab, and the fz_throw with 2. ??	16:34.35
sebras	titanous: that link didn't work for me.	16:34.46
titanous	sebras: make sure the right account is selected in the top right dropdown	16:35.15
sebras	titanous: got it.	16:36.42
	fredross-perry: ah, my bad! I hadn't fetched!	16:39.52
fredross-perry	ok!	16:40.05
tor8	Robin_Watts: fz_begin/end_layer or group or something?	16:42.19
Robin_Watts	tor8: That's what I'm thinking.	16:42.34
tor8	Robin_Watts: sounds like a reasonable extension, and would map fairly naturally to other language's constructs	16:43.03
	like svg <g> tags etc maybe	16:43.09
Robin_Watts	tor8: It's exactly svg <g> tags I need to make :)	16:43.20
	Can we assume strict nesting?	16:43.43
	If not, the interpreter needs to store what layer it's in.	16:44.13
	(I'm going to say layer, rather than group, cos groups already have meaning)	16:44.27
tor8	yes, strict nesting is the only sane approach, IMO	16:44.47
Robin_Watts	In PDF we have /OC /Foo BDC ... EMC, so strict nesting is implied.	16:44.59
sebras	fredross-perry: would we need to declare all members and constants in the interfaces public? I've read that abstract, default and static methods are implicitly public, but I'm not sure if our members fall into either class.	16:45.26
tor8	if there's a mismatch between q and Q in PDF with clip groups etc the nesting may be interleaved with transparency groups etc	16:45.33
	may be useful to trap and end the layer early when that nesting is broken in PDF files	16:45.47
fredross-perry	I've read that you don't need to do that for if members.	16:45.52
tor8	it shouldn't happen, but you never know	16:46.01
sebras	fredross-perry: ok.	16:46.01
Robin_Watts	tor8: The problem is what to pass in fz_begin_layer.	16:46.37
	tag properties BDC	16:46.50
sebras	fredross-perry: there's still the issue in Document_openWithStream() where stm leaks if fz_open_document_with_stream() calls fz_throw().	16:46.59
	fredross-perry: you need to use fz_try() similar to PDFDocument_saveWithStream().	16:47.17
Robin_Watts	tag is a name, so no problem. properties will be hairier.	16:47.54
fredross-perry	sebras - dropping stm is handled under fail: right now. But I can rearrange that.	16:49.26
Robin_Watts	In fact, we'd only pass on for when tag = /OC.	16:49.27
sebras	fredross-perry: ah I see now, in interfaces abstract methods need not explicitly be declared abstract, just not supplying an implementation is enough.	16:49.56
	fredross-perry: yes, but doing so is not enough.	16:50.10
	fredross-perry: if fz_open_document_with_stream() throws then we will simply longjmp() out from the function.	16:50.31
	or at least try to.	16:50.38
Robin_Watts	so properties can either be an OCG or an OCM dictionary.	16:50.38
fredross-perry	sebras - oh I see.	16:50.46
sebras	fredross-perry: this is what happens inside fz_throw().	16:50.49
fredross-perry	same with fz_new_stream, I presume?	16:51.01
sebras	fredross-perry: yes.	16:51.06
fredross-perry	ok	16:51.10
sebras	fredross-perry: fz_drop_*() and fz_free() may never call fz_throw() though.	16:51.42
fredross-perry	ok	16:51.48
sebras	fredross-perry: that's why you have to rearrange jni_attach_thread() a while back.	16:52.03
	s/have/had/	16:52.09
fredross-perry	right	16:52.57
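Why cleanup under a `fail:` label is not enough when a callee throws can be shown with plain `setjmp`/`longjmp`. The sketch below is a toy model under that assumption; it does not reproduce MuPDF's `fz_try`/`fz_throw` macros, and every `toy_*` and `open_*` name is hypothetical.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

static jmp_buf *toy_handler;  /* innermost active handler */
int toy_live_resources;       /* acquired but not yet released */

/* Jump to the innermost handler, skipping all code in between. */
void toy_throw(void)
{
    assert(toy_handler != NULL);
    longjmp(*toy_handler, 1);
}

void toy_acquire(void) { toy_live_resources++; }
void toy_release(void) { toy_live_resources--; }

/* Leaky: if may_throw() throws, the longjmp leaves this function at once
 * and the release below never runs. */
void open_leaky(void (*may_throw)(void))
{
    toy_acquire();
    may_throw();
    toy_release(); /* skipped on throw */
}

/* Safe: install a local handler, release on both paths, then rethrow. */
void open_safe(void (*may_throw)(void))
{
    jmp_buf local;
    jmp_buf *outer = toy_handler;

    toy_acquire();
    toy_handler = &local;
    if (setjmp(local) == 0)
    {
        may_throw();
        toy_handler = outer;
        toy_release();
    }
    else
    {
        toy_handler = outer;
        toy_release();
        toy_throw(); /* propagate to the caller's handler */
    }
}

/* Run fn(may_throw) under a top-level handler; returns 1 if it threw. */
int toy_run(void (*fn)(void (*)(void)), void (*may_throw)(void))
{
    jmp_buf top;
    int threw = 1;

    toy_handler = &top;
    if (setjmp(top) == 0)
    {
        fn(may_throw);
        threw = 0;
    }
    toy_handler = NULL;
    return threw;
}
```

Running `open_leaky` with a throwing callback leaves `toy_live_resources` at 1, while `open_safe` brings it back to 0 before the error propagates.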
	sebras - ok look again, thanks.	17:05.05
sebras	fredross-perry: looks safer, yes.	17:09.16
	fredross-perry: am I right in thinking that we now don't need the detach argument to jni_attach_thread() and jni_detach_thread()?	17:09.35
	fredross-perry: it seems to me that we'll return NULL on every error and hence we never reach jni_detach_thread() unless detach == 1..?	17:09.57
tor8	Robin_Watts: yeah. I'm not sure. a name/tag string would be a start at least.	17:11.30
sebras	fredross-perry: oh, and you still add a line in Document_finalize() which adds the now unnecessary idoc variable.	17:11.34
tor8	maybe a list of key/value string attributes as well.	17:11.43
fredross-perry	i'll remove unnecessary idoc variable	17:12.04
Robin_Watts	tor8: A name string is enough, I think.	17:12.12
sebras	fredross-perry: what about the detach thingy..?	17:12.25
Robin_Watts	(Possibly, we ought to send a list of strings, one for each layer that's in force)	17:12.49
	but for now, just 1 will do.	17:12.54
fredross-perry	there's a case: else if (state == JNI_OK) where detach might be 0.	17:13.44
	iow we were already attached.	17:14.09
sebras	fredross-perry: ah, yes, because in that case env is not NULL. I see.	17:15.07
fredross-perry	ok.	17:15.35
tor8	Robin_Watts: yes.	17:16.18
fredross-perry	sebras - pushed again (no extra idoc)	17:19.05
sebras	tor8: fred/master looks reasonable to me. did I miss anything?	17:23.58
fredross-perry	don't think so.	17:24.13
	thanks for all the fish.	17:24.26
sebras	fredross-perry: :)	17:24.51
fredross-perry	should I push this then?	17:25.36
sebras	fredross-perry: I'd want tor8 to LGTM it too.	17:26.12
fredross-perry	ok, let me know.	17:26.26
sebras	fredross-perry: will do.	17:26.30
	fredross-perry: if he chimes in and you're not here I might push your patch to master and let you know.	17:26.48
fredross-perry	that's fine too.	17:27.01
sebras	Robin_Watts: one more commit fixing 698885 on sebras/master, the fread() one is gone now that pauls thingies were merged.	18:15.33
	Robin_Watts: still around?	18:48.57
	I have a question about copy_node_types(). it asserts that low == high. why?	18:49.08
Robin_Watts	sebras: I am.	18:49.09
sebras	Robin_Watts: because I have this node in a fuzzed file: R177:^178<EMPTY>175(32,40,51,1) and of course it asserts.	18:49.55
Robin_Watts	If node->many, then we're encoding a type where low must == high.	18:50.09
	Honestly, this is out of cache, you'll need to bear with me.	18:50.20
sebras	my thinking is that low and high defines a range.	18:50.58
Robin_Watts	OK, so the only way many gets set is if add_range gets called with 1 as its last argument.	18:51.36
sebras	and the output for .low would be .out and the output for .low + 1 would be .out + 1, etc.	18:51.48
Robin_Watts	which only ever happens from add_mrange	18:52.35
	And add_mrange passes low as both low and high.	18:53.02
	So I suspect we're into 'indexing off the end of the table' territory here.	18:53.41
sebras	I put a breakpoint in add_mrange()	18:53.44
	and it has low == 49 and len == 2 on input, which seems sane.	18:54.01
Robin_Watts	have you built with CHECK_SPLAY defined?	18:54.18
sebras	and as you say, the call to add_range() has both low and high == 49 and many == 1.	18:54.18
	I have, and it doesn't trigger.	18:54.25
Robin_Watts	have you built with DUMP_SPLAY defined? :)	18:54.34
sebras	it only checks that the left and right childs parent is correct.	18:54.35
	it doesn't check that the ranges are correct.	18:54.45
	I have built with DUMP_SPLAY, yes.	18:54.55
	I had to undef for cmapdump.c though... ;)	18:55.04
Robin_Watts	And does the splay tree look sensible?	18:55.10
sebras	I can't tell really, I'm still trying to understand both the datastructure and the way you dump it.	18:55.33
	Robin_Watts: do you want me to pastebin it?	18:55.46
Robin_Watts	sebras: Splay trees are "just" binary trees.	18:55.59
	each node has a low/high/out. And left/right/parent pointers.	18:56.40
	node->left->low < node->low < node->right->low	18:57.29
sebras	as I understand "R177:^178<EMPTY>175(32,40,51,1)"	18:57.34
	this means that node 177 has node 178 as its parent, the left child is empty and the right child is node 175 and THIS node covers the range 0x32-0x40 with out starting at 0x51, but it is a many node.	18:58.15
	would the following also be true? node->left->high < node->low < node->right->low	18:59.07
Robin_Watts	sounds plausible.	18:59.15
	yes, I believe that's true.	18:59.32
sebras	Robin_Watts: perhaps we accidentally mess up the tree some time after having inserted the node.	18:59.47
Robin_Watts	node->left->high < node->low and node->high < node->right->low	18:59.59
	cos if node->left->high == node->low we merge the two (assuming they are the same type)	19:00.24
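The invariants mentioned so far (ranges strictly ordered and non-overlapping, and `many` nodes having `low == high`) can be checked with an in-order walk. This is a minimal sketch on a toy pointer-based node: `toy_node`, `check_in_order` and `toy_tree_is_sane` are hypothetical, and the real cmap code stores and checks its tree differently.

```c
#include <assert.h>
#include <stddef.h>

/* Toy node with fields named after the discussion; not the real cmap node. */
struct toy_node {
    unsigned int low, high, out;
    int many;
    struct toy_node *left, *right;
};

/* In-order walk: each node must start strictly after the previous one ended. */
static int check_in_order(const struct toy_node *node,
        int *have_prev, unsigned int *prev_high)
{
    if (!node)
        return 1;
    if (!check_in_order(node->left, have_prev, prev_high))
        return 0;
    if (node->low > node->high)
        return 0; /* empty or inverted range */
    if (node->many && node->low != node->high)
        return 0; /* a "1 to many" node must cover a single code */
    if (*have_prev && *prev_high >= node->low)
        return 0; /* overlaps or touches the previous range */
    *have_prev = 1;
    *prev_high = node->high;
    return check_in_order(node->right, have_prev, prev_high);
}

/* Returns 1 if the whole tree satisfies the invariants, 0 otherwise. */
int toy_tree_is_sane(const struct toy_node *root)
{
    int have_prev = 0;
    unsigned int prev_high = 0;
    return check_in_order(root, &have_prev, &prev_high);
}
```

A node like the fuzzed one quoted earlier, with `many` set but a range wider than one code, is rejected by this check.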
sebras	what does type mean in this case?	19:01.09
	Robin_Watts: the types are many, < 0xffff, and others?	19:01.41
Robin_Watts	sebras: looking.	19:01.53
titanous	sebras: as an aside, https://apt.llvm.org makes it very easy to get clang-6.0	19:02.31
Robin_Watts	sebras: So, we get values thrown at us by the CMAP file. We build the tree as a splay tree from those details.	19:03.05
	Then we break the details out into 3 arrays.	19:03.17
sebras	titanous: good to know, but I'll fix the bugs I can reproduce easily first. :)	19:03.48
titanous	cool	19:04.01
Robin_Watts	mranges are "1 to many", ranges are "1 to 1" (< 0xffff), xranges are "1 to 1" (>= 0x10000)	19:04.15
sebras	Robin_Watts: ok. I have a slight inclination of where the issue stems from: https://pastebin.com/raw/N796ueGr	19:07.57
	i.e. don't mess with my flate bytes.	19:08.29
	but we ought to be able to parse these things without assert()s though.	19:08.50
Robin_Watts	indeed.	19:10.49
sebras	Robin_Watts: ok, when I added assert(!node->many || (node->many && node->low == node->high)); into do_check() tree, I get an assert when we try to call add_mrange().	19:19.57
Robin_Watts	assert(!node->many || node->low ==node->high); would have done, wouldn't it ?	19:21.32
	http://git.ghostscript.com/?p=user/robin/mupdf.git;a=commitdiff;h=77aa044f3378dc2fb31cca285a5ee3270857ec02	19:22.16
	sebras, tor8: That's my initial begin_layer/end_layer commit.	19:22.32
sebras	Robin_Watts: can we really end up calling svg_dev_end_layer() an unbalanced number of times?	19:26.18
	yes, your assert() would be enough, true. I was being silly though and wrote assert(node->many && node->low == node->high) first.	19:27.11
	then I just extended it to paper over my thinko.	19:27.22
	Robin_Watts: and if you care about them being unbalanced in the SVG device I think we should do so in the trace device too, no..?	19:28.16
Robin_Watts	sebras: The trace device is supposed to show us the raw device calls.	19:36.24
	And, yes, we could be unbalanced if the PDF stream contains crap.	19:36.51
sebras	Robin_Watts: ok. and now having read more there doesn't seem to be any code fixing any unbalanced calls beforehand.	19:37.06
Robin_Watts	sebras: Indeed. We assume it to be balanced - isn't really much else we can do.	19:37.29
	In the SVG thing, I just do some really minimal sanity checking.	19:38.05
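One way such minimal sanity checking could look is a depth counter that refuses to close a layer that was never opened. This is a sketch only, not the code in the linked commit: `toy_svg`, `toy_begin_layer` and `toy_end_layer` are hypothetical, the use of an `id` attribute is an assumption, and no attempt is made to escape the name or keep ids unique.

```c
#include <assert.h>
#include <stdio.h>

/* Toy SVG writer state; not the MuPDF SVG device. */
struct toy_svg {
    FILE *out;
    int layer_depth; /* number of currently open layer groups */
};

/* Open a group for a layer. The name is assumed to contain no characters
 * that need XML escaping (an assumption of this sketch). */
void toy_begin_layer(struct toy_svg *svg, const char *name)
{
    assert(svg != NULL && name != NULL);
    svg->layer_depth++;
    fprintf(svg->out, "<g id=\"%s\">\n", name);
}

/* Close the innermost layer group. Returns 1 if a closing tag was written,
 * 0 if the call was unbalanced and therefore ignored. */
int toy_end_layer(struct toy_svg *svg)
{
    assert(svg != NULL);
    if (svg->layer_depth == 0)
        return 0;
    svg->layer_depth--;
    fprintf(svg->out, "</g>\n");
    return 1;
}
```

Ignoring a stray end call keeps the output well-formed even when the input stream is unbalanced, at the cost of silently hiding the problem from the caller.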
sebras	Robin_Watts: the id attribute to a g element must be unique though. so if we end up with multiple calls to BMC or BDC without any tag, we're in deep trouble.	19:38.18
	Robin_Watts: does PDF require the tags to be unique throughout the page?	19:38.37
Robin_Watts	sebras: No. You'd expect to see multiple things in each.	19:38.54
sebras	Robin_Watts: https://www.w3.org/TR/SVG/struct.html#IDAttribute am I understanding this incorrectly?	19:39.25
	as I read it there is a requirement of uniqueness, but not only can we not guarantee it, we're also expecting that the tags _will_ be the same.	19:40.41
Robin_Watts	https://www.w3.org/TR/SVG11/struct.html#IDAttribute	19:41.30
sebras	yes, I sent that link a while ago. :)	19:42.00
Robin_Watts	older version of the link, but yes :)	19:42.18
sebras	meh.	19:42.23
Robin_Watts	yeah, unique is a problem.	19:42.32
sebras	and the grouping too somehow.	19:42.46
Robin_Watts	let me get back to Vladimir and see what he says. He can probably make more examples.	19:42.53
	Thanks.	19:42.59
sebras	I'm thinking that SVG expects to have everything that belongs to one group in one g element, but we don't really do that if there are multiple calls to BMC/BDC which have identical tags.	19:43.22
	but these calls are disjoint and separate by other content.	19:43.55
	separated.	19:44.00
	structurally the code looks nice though. yey for that! :)	19:44.18
	I need to eat and then I'm going back to the many ranges, see you in a bit.	19:45.01
titanous	sebras: I'm requesting CVEs for all of these bugs, do you mind triaging https://oss-fuzz.com/v2/testcase-detail/4831843418374144 to determine if the bug is in OpenJPEG or mupdf?	21:32.15
sebras	titanous: I tried to reproduce using the opj_decompress tool that openjpeg provides but it didn't trigger. I don't know why yet, it ought to.	21:47.18
	titanous: to me the bug looks like it is in openjpeg. I'd need further time to prove it though.	21:47.38
	titanous: it seems strange to come running to openjpeg people with a bug that might be in their code but can't be reproduced cleanly with their tools.	21:48.25
titanous	makes sense, I'll just file for a CVE for mupdf for it and then we can let OpenJPEG know when you have more time to sort it out	21:49.33
tor8	Robin_Watts: having the layer name be a const pointer in the display list worries me	22:02.41
	the pdf_obj string could go away while the display list is still alive	22:02.53
	I'd suggest doing a fz_strdup and fz_free on the layer name	22:03.06
	and sebras already brought up the problem of svg/xml 'id' attributes needing to be unique	22:03.36
	other than that, looks fine	22:04.05
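The lifetime concern can be sketched with a toy display-list entry that owns a private copy of the layer name instead of borrowing the caller's pointer. Plain `malloc`/`free` stand in for `fz_strdup`/`fz_free` here, and `toy_layer_cmd` and its helpers are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy display-list entry for a "begin layer" command. */
struct toy_layer_cmd {
    char *name; /* owned copy, freed together with the command */
};

/* Create a command holding its own copy of name; NULL on allocation failure. */
struct toy_layer_cmd *toy_new_layer_cmd(const char *name)
{
    struct toy_layer_cmd *cmd;
    size_t n;

    assert(name != NULL);
    cmd = malloc(sizeof *cmd);
    if (!cmd)
        return NULL;
    n = strlen(name) + 1;
    cmd->name = malloc(n);
    if (!cmd->name)
    {
        free(cmd);
        return NULL;
    }
    memcpy(cmd->name, name, n);
    return cmd;
}

/* Free the command and its copy of the name. */
void toy_free_layer_cmd(struct toy_layer_cmd *cmd)
{
    if (!cmd)
        return;
    free(cmd->name);
    free(cmd);
}
```

Since the command no longer points into the caller's storage, the original string can change or go away while the list entry stays valid.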
sebras	tor8: question!	22:04.45
	tor8: quick look at sebras/master?	22:04.57
	:)	22:05.12
tor8	sebras: all except "Do not throw away byte when lexing tokens without strings." LGTM	22:07.47
	I need to look at that one more closely, remind me to do that tomorrow	22:08.07
sebras	why?	22:08.16
	we do the same thing in pdf_lex().	22:08.23
tor8	because I don't understand it by just reading the diff	22:08.27
sebras	ok.	22:08.33
	context. tomorrow! :)	22:08.37
tor8	it's probably okay, but I need to fire up my editor and look at the context and I'm just about to turn off my computer for the night	22:08.56
sebras	ok, I'll push the rest and let you look at this one tomorrow.	22:10.20
tor8	sebras: cool. ttytm.	22:10.29

Log of #mupdf at irc.freenode.net.