MuPDF IRC logs

	<<<Back 1 day (to 2020/07/08)	Fwd 1 day (to 2020/07/10)>>>	20200709
apollo13	hi there, we are seeing crashes with mupdf and the java bindings. We instrumented the code with asan and voila: https://gist.github.com/apollo13/eb5499fc6911ddccbbfe44169f661c2d it seems as if a thread tries to load a page while the garbage collector disposes the same page, that mupdf tries to reuse an already freed memory section		11:04.18
	Looking at the code I think it might have to do with the page list probably not beeing threadsafe?		11:05.00
ator	apollo13: the C code needs care to not trip over itself when threading, but I think the java bindings should take care of that		11:16.05
	so it looks like a bug for Robin_Watts or sebras I think		11:16.20
sebras	apollo13: what are your multiple threads doing?		11:18.30
apollo13	sebras: well, they call Document.loadPage and render the page into a bitmap		11:19.04
sebras	apollo13: is one thread doing all the accessing of the Document and then other threads working on rendered display lists?		11:19.12
apollo13	this is in a webservice, what happening is the following:		11:19.14
sebras	s/rendered//		11:19.17
apollo13	we keep the document object around and when a new http request comes in it does document.loadPage()		11:19.44
sebras	apollo13: so you have multiple threads calling Document.loadPage() on the same Document?		11:19.49
apollo13	yes		11:19.57
	but one after another		11:20.06
	not neccessarily in parallel		11:20.10
	the issue comes in conjunction with garbage collection since java calls finalize in another thread		11:20.28
	the problem here being that if (and only if) loadPage() returns an already existing page (fz_keep_page etc) while java garbage collection calls Page.finalize()		11:21.10
sebras	apollo13: that the finalize method is called means that the last reference to it has been freed according to the JVM.		11:21.17
apollo13	somehow somewhere in there seems to be a race		11:21.18
	yes		11:21.25
	but at the same time someone else calls loadPage() which would reuse (in C) the existing page		11:21.54
	we are currently working around with it by explicitly calling page.destroy() but that slows down the code quite a lot		11:24.11
sebras	ator: doc->open doesn't keep references to the pages it puts in its list.		11:25.37
	ator: no wonder apollo13 sees issues when the document tries to reuse and old page.		11:26.06
apollo13	so to summarize what we are doing: some thread does page = document.loadPage and lets the page object go out of scope. at some point the garbage collection calls Page.finalize(), if at this some point another thread calls page = document.loadPage() reusing the old page it can crash		11:26.19
	does what I say make sense? :)		11:26.33
ator	sebras: which doc->open?		11:27.14
sebras	ator: fz_document has a fz_page *open.		11:27.29
apollo13	btw can someone explain to me how fz_page next/prev work. Ie why is prev a *page and next only page?		11:27.44
sebras	ator: when we add a newly loaded page in fz_load_chapter_page() we give the reference to the caller.		11:27.51
	ator: but we do not take an extra reference when inserting the page into fz_document->open		11:28.10
ator	sebras: I think that list is supposed to be weak references		11:28.22
sebras	ator: when the JVM sees no further variables pointing to the page fron the caller it will call the page finalizer, thereby dropping the page.		11:29.13
	ator: wouldn't that situation cause a stale page pointer to remain in fz_document->open?		11:29.26
ator	fz_drop_page removes the page		11:29.27
	form the document->open list		11:29.51
sebras	ah, I didn't see that because it didn't refer to ->open		11:30.12
Robin_Watts	apollo13: prev is a pointer to the pointer to the current page.		11:30.30
	so you can unlink yourself by doing *(me->prev) = me->next		11:30.57
apollo13	sebras, ator: yes it drop and somewhere between fz_drop_page and fz_load_chapter_page (the loop there which iterates all pages) you get a dangling point I think		11:31.13
	ie if the loop find the page and then calls keep on it, drop_page could already be destroying it		11:31.43
Robin_Watts	apollo13: You know your 'next' page is always going to be NULL or an fz_page *.		11:31.46
apollo13	Robin_Watts: And the same couldn't be done for 'prev'?		11:32.25
	oh wait, I thought prev is for previous		11:32.38
	and you just said it's for the current page		11:32.52
Robin_Watts	No.		11:32.57
	prev points to "the thing that points to me". which is in most cases the previous page.		11:33.31
	Think of it this way.		11:33.45
ator	apollo13: it is a doubly-linked list. the current page's prev points to the previous page's next pointer (which is the current page)		11:33.46
Robin_Watts	next is either NULL or an fz_page.		11:34.07
apollo13	ah, slowly it's beginning to make sense :)		11:34.14
ator	or if it's the first page, the root of the linked list: the fz_document.open pointer.		11:34.21
sebras	ator: if we have one thread doing the finalizing calls and another thread calling fz_load_chapter_page() we do need some kind of lock over this list.		11:34.26
Robin_Watts	BUT if you're the first page in the list, there is no previous page, but something still has to point to you.		11:34.48
ator	sebras: I thought the get_context() stuff in JNI did the locking?		11:34.59
sebras	ator: get_context() doesn't lock anything at all, it just stores the context as TLS.		11:35.38
	and clones the first context if the currently running thrad has no context in TLS.		11:36.09
ator	sebras: so how do we not cock up when the naive java users gets into multiple threads in the normal case?		11:36.17
apollo13	ator: we only see it under high load when the GC hits at the wrong time		11:36.45
ator	the TLS makes sure we don't screw up the context and error stack, but we can crash if the user calls a document method from multiple threads?		11:36.57
apollo13	usually it just works ™		11:36.59
	ator: even from the same thread		11:37.12
sebras	ator: we can.		11:37.15
apollo13	the problem is that java's garbage collection runs in another thread (not much you can do about that afaik)		11:37.30
ator	sebras: oh. that's probably what's happening here then, the GC is running/triggered on a separate thread.		11:37.41
apollo13	yes, that is what we are seeing in the asan traces		11:37.57
sebras	ator: in C we say the same thing: work on the fz_document in one and only one thread and get your display lists, the lists can then be processed in other threads.		11:38.13
apollo13	the thread named "finalizer" is this special gc thread		11:38.21
sebras	ator: the problem here is that the GC in a second thread, which is likely not a situation we've encountered in C.		11:38.45
ator	sebras: I thought we were/did some 'synchronized' stuff in JNI to prevent accidents like that		11:38.47
Robin_Watts	Urgh, this all sounds horrid.		11:39.01
apollo13	Robin_Watts: welcome to my world		11:39.13
sebras	ator: we do not.		11:39.14
ator	sebras: right. so caveat emptor there. still, that doesn't help the current situation though :(		11:39.52
	apollo13: what is the slowdown with calling page.destroy?		11:40.30
sebras	ator: we could make all calls on Document synchronize on the document, that would help I think..?		11:40.43
apollo13	ator: what as in how much? well the page is no longer cached, so more work to do… :)		11:40.57
Robin_Watts	sebras: Yes, I'm just pondering/googling that now.		11:40.59
ator	sebras: not necessarily, the synchronized only locks per object IIRC		11:41.03
	and these are Page objects, not Document		11:41.09
	so won't synchronize to the same		11:41.14
sebras	ator: ah, yes. we're dropping the page, not the document, of course.		11:41.26
ator	that's why the first Android app JNI bindings used a global MuPDFCore object to do the synchronization		11:41.36
Robin_Watts	We'd have to make the page objects finalize by calling through a sychronised method in the document.		11:41.52
ator	Robin_Watts: or Context?		11:42.19
apollo13	Robin_Watts: I doubt that will help		11:42.20
	because loadPage will not be affected by that		11:42.39
Robin_Watts	It's Document.loadPage, right?		11:42.57
apollo13	yes		11:43.06
sebras	Robin_Watts: ator: why can't we add a lock at the C-level whenever we manipulate the document's opened page list?		11:43.09
ator	synchronized Document.loadPage and synchronized Document.destroyPage() would work I think		11:43.15
apollo13	ator: how would you sync those around one lock		11:43.31
ator	sebras: because I promise this won't be the only place :(		11:43.37
apollo13	if you just put synchronized there they'd sync independently no?		11:43.41
sebras	ator: then we're in deep trouble in C too..?		11:43.56
ator	synchronized uses a mutex on the object, not function		11:44.06
	sebras: we don't have garbage collectors running around behind the user's back poking stuff randomly		11:44.25
apollo13	ator: ah, I guess then I could at least come up with a local workaround		11:44.52
Robin_Watts	I think my proposal is this: make Document.loadPage synchronized. Add synchronized Document.destroyPage. Make Page.finalize call Document.destroyPage.		11:45.07
apollo13	that sounds like a good solution.		11:45.57
sebras	ator: "I promise this won't be the only place" what does this mean? that there are other objects where the finalizer would have the same problem?		11:46.17
Robin_Watts	of course, I'm not sure we have a Document pointer from a Page at the java level.		11:46.20
ator	I'm guessing this is only a problem because of the 'weak' references in the document list		11:46.33
sebras	ator: I wouldn't be surprised if that is the case. but if it is then Robin_Watts's suggestion wouldn't work.		11:46.41
Robin_Watts	sebras: My fix would be enough for this case.		11:46.55
ator	Robin_Watts: we can add one		11:47.07
Robin_Watts	so the can is kicked down the road until we find the next problem.		11:47.11
	ator: Yeah, that shouldn't cause problems, I hope...		11:47.35
sebras	I'm still confused as to why we can't use a lock over the page list.		11:47.39
Robin_Watts	sebras: which lock?		11:47.45
	a new lock?		11:48.15
sebras	Robin_Watts: possibly, yes.		11:48.29
ator	the malloc lock maybe?		11:48.35
Robin_Watts	ator: If we use the malloc lock, we then can't malloc :)		11:48.52
apollo13	:D		11:49.01
sebras	Robin_Watts: more importantly, we can't drop, right?		11:49.11
Robin_Watts	sebras: Indeed.		11:49.20
ator	Robin_Watts: fz_load_chapter_page has two blocks that need protecting		11:49.21
sebras	because the locks we have are not necessarily recursive.		11:49.21
ator	and fz_drop_page has one block		11:49.42
	one that scans the list, and two that insert and delete items from it		11:49.59
	none of which call any functions		11:50.10
apollo13	wouldn't fz_keep_page also need the look then?		11:50.15
Robin_Watts	We could have FZ_LOCK_DOC which we used to protect "the document structure".		11:50.50
	it'd be inserted between ALLOC and FREETYPE		11:51.04
ator	apollo13: no, that just bumps the reference count. isn't the problem happening because drop_page removes it from the list while the list is being scanned in load_page?		11:51.10
sebras	ator: do we have weak references like this elsewhere?		11:51.30
Robin_Watts	ator: right, it's 2 threads accessing the document structure at the same time.		11:51.31
apollo13	ator: yes		11:51.34
ator	Robin_Watts: all destructors will access all structures from 2 threads :(		11:51.56
	but I think the list of of destructors that do anything significant is very short		11:52.15
Robin_Watts	ator: I'm not following that assertion.		11:52.34
apollo13	ator: at least I assume so, but looking at the asan log, what is failing is fz_keep_page itself, because it gets passed and already deleted page (from the iteration)		11:52.41
sebras	apollo13: that deleted page comes from the list of opened pages stored in the document		11:53.11
apollo13	jupp		11:53.17
ator	apollo13: that's what I assume would happen. the load_page gets a page from the list at the same time another thread just freed it		11:53.31
apollo13	ok, then we are all on the same side :)		11:53.55
Robin_Watts	ator: Are you saying that a DOC lock is not sufficient?		11:54.15
	or just that it might need to be used in more places?		11:54.27
ator	Robin_Watts: I think a DOC lock is probably good enough		11:54.28
sebras	ator: Robin_Watts: is a DOC lock a good solution though?		11:54.43
ator	just might be needed in more places, depending on how often this problem crops up		11:54.43
Robin_Watts	ator, sebras: So, which of us gets to do this?		11:54.49
ator	the normal fz_drop_imp does enough locking to keep us from double-freeing in the destructor		11:55.05
sebras	or do you still prefer sychronized?		11:55.12
*ator*	touches his nose.		11:55.28
Robin_Watts	sebras: I'd rather do it in C than in Java, I think.		11:55.30
sebras	I can do it.		11:55.52
Robin_Watts	I'm happy to have a go at it. I have just finished stuff on my list, so this would be a good point.		11:56.02
sebras	Robin_Watts: ok. then you do it. :)		11:56.13
Robin_Watts	but if sebras wants it, I have other stuff to do.		11:56.14
sebras	(it sounded like you both were preoccupied)		11:56.40
apollo13	well either way I'll happily test a patch and verify that it fixes the issue since I have a good reproducer		11:56.57
Robin_Watts	OK. I'll bash at it now.		11:57.20
	apollo13: Are you around for a couple of hours?		11:57.30
	(I'm on UK time, where are you?)		11:57.40
apollo13	Robin_Watts: austria, yeah will be around		12:01.23
Robin_Watts	actually, I think we can use the malloc lock for this.		12:02.27
	which is nicer.		12:02.51
sebras	Robin_Watts: because it happens after fz_drop_imp()?		12:03.48
Robin_Watts	because we don't do any allocations/deallocations while we're modifying the structure itself.		12:04.24
	We allocate then modify, or modify then deallocate.		12:04.43
Kabouik	I am getting "fatal error: lcms2mt.h" when trying to build from source on Debian, yet I did run `apt-get build-dep` first		12:21.07
	Is there anything else I'm missing?		12:21.13
Robin_Watts	Kabouik: Our sources don't come as an apk.		12:23.43
apollo13	Kabouik: did you update the submodules first?		12:23.55
Robin_Watts	s/as an apk/via apt/		12:23.56
ator	Kabouik: if you cloned from git, you have to run 'git submodule update --init' first		12:24.20
Robin_Watts	If you're building from git, then do "git submodule update --init"		12:24.20
apollo13	I think I've seen this yesterday after just checking out the repo and forgot the submodules		12:24.21
ator	(or clone with --recursive)		12:24.24
Robin_Watts	If you're building from one of our release archives, the sources should all be in there.		12:24.46
	If you're building from your distros repackaging of our stuff, then you need to complain to them. :)		12:25.02
	apollo13, ator, sebras: This is my first attempt: https://git.ghostscript.com/?p=user/robin/mupdf.git;a=commitdiff;h=e4afff800d62ae83e3a45427bf2c0c09483e04a9		12:25.30
	Just cluster testing that now.		12:25.35
apollo13	ok, rebuilding and testing		12:26.14
*Robin_Watts*	grabs lunch. bbs.		12:27.38
Kabouik	apk? What I'm trying to do has nothing to do with apks, does it?		12:28.39
	Oh right, I read too fast		12:28.54
	Actually what I'm trying to do is this: https://askubuntu.com/a/1208373 If I understand correctly, it just fetches the sources from git anyway, but the build-dep should automate the dependencies part		12:29.35
	I'm wrong, it doesn't fetch from git, it recommends fetching from git in the logs :p		12:31.04
	I'll just do that.		12:31.14
apollo13	Kabouik: build-dep will install the dependencies, that is correct, but mupdf doesn't use those		12:31.20
	at least not in the standard build setup		12:31.29
Kabouik	What I wanted is manually build mupdf to get mupdf-gl executable, and the freeglut fixes		12:32.00
	I can build from git I guess, no problem		12:32.07
apollo13	Robin_Watts: patch looks good, no segfault till now		12:41.52
ator	Robin_Watts: LGTM		12:44.04
Robin_Watts	apollo13, ator: Thanks.		13:31.56
	Pushed to master.		13:32.27
sebras	Robin_Watts: thanks for fixing.		13:32.36
	Robin_Watts: let's hope there are no more locations...		13:32.44
Robin_Watts	apollo13: Thanks for that. Such coherent bug reports are rare, and much appreciated!		13:32.55
	sebras: indeed.		13:33.06
apollo13	Robin_Watts: np, took us a while to find (and get it somewhat reproducable). Thank you for the quick fix!		13:33.34
malc_	hmm.. what determines whether meta constant deserves a define? cause there isn't one for creation date		19:56.33
sebras	malc_: it is quite likely that there should be one.		20:15.06
malc_	sebras: it's equally likely that this is a whack a mole...		20:15.58
sebras	malc_: not really. we know what the standard names are in /Info		20:16.16
	malc_: personally I prefer FZ_META_* defines since they will not change.		20:17.05
malc_	sebras: for my program - you are correct, only CreationDate is missing a define, at large - no idea, you know better		20:19.04
	https://github.com/moosotc/llpp/blob/master/link.c#L395		20:20.42
sebras	malc_: Title, Author, Subject, Keywords, Creator, Producer, CreationDate, ModDate, Trappaed are the fields that are specified.		20:21.03
	Trapped.		20:21.34
malc_	sebras: tack!		20:21.43
sebras	maybe we should add FZ_META_INFO_CREATION_DATE and FZ_META_INFO_MOD_DATE and be done with it.		20:21.56
	I've never seen _ANYONE_ use keywords or subject fields.		20:22.15
	and I have no idea was /Trapped means.		20:22.29
malc_	maybe it's those adobe technical writers trying to tell us something		20:23.28
	<<<Back 1 day (to 2020/07/08)	Forward 1 day (to 2020/07/10)>>>

Log of #mupdf at irc.freenode.net.