Log of #mupdf at irc.freenode.net.

Search:
 <<<Back 1 day (to 2020/07/08)Fwd 1 day (to 2020/07/10)>>>20200709 
apollo13 hi there, we are seeing crashes with mupdf and the java bindings. We instrumented the code with asan and voila: https://gist.github.com/apollo13/eb5499fc6911ddccbbfe44169f661c2d it seems as if a thread tries to load a page while the garbage collector disposes the same page, that mupdf tries to reuse an already freed memory section11:04.18 
  Looking at the code I think it might have to do with the page list probably not beeing threadsafe?11:05.00 
ator apollo13: the C code needs care to not trip over itself when threading, but I think the java bindings should take care of that11:16.05 
  so it looks like a bug for Robin_Watts or sebras I think11:16.20 
sebras apollo13: what are your multiple threads doing?11:18.30 
apollo13 sebras: well, they call Document.loadPage and render the page into a bitmap11:19.04 
sebras apollo13: is one thread doing all the accessing of the Document and then other threads working on rendered display lists?11:19.12 
apollo13 this is in a webservice, what happening is the following:11:19.14 
sebras s/rendered//11:19.17 
apollo13 we keep the document object around and when a new http request comes in it does document.loadPage()11:19.44 
sebras apollo13: so you have multiple threads calling Document.loadPage() on the same Document?11:19.49 
apollo13 yes11:19.57 
  but one after another11:20.06 
  not neccessarily in parallel11:20.10 
  the issue comes in conjunction with garbage collection since java calls finalize in another thread11:20.28 
  the problem here being that if (and only if) loadPage() returns an already existing page (fz_keep_page etc) while java garbage collection calls Page.finalize()11:21.10 
sebras apollo13: that the finalize method is called means that the last reference to it has been freed according to the JVM.11:21.17 
apollo13 somehow somewhere in there seems to be a race11:21.18 
  yes11:21.25 
  but at the same time someone else calls loadPage() which would reuse (in C) the existing page11:21.54 
  we are currently working around with it by explicitly calling page.destroy() but that slows down the code quite a lot11:24.11 
sebras ator: doc->open doesn't keep references to the pages it puts in its list.11:25.37 
  ator: no wonder apollo13 sees issues when the document tries to reuse and old page.11:26.06 
apollo13 so to summarize what we are doing: some thread does page = document.loadPage and lets the page object go out of scope. at some point the garbage collection calls Page.finalize(), if at this some point another thread calls page = document.loadPage() reusing the old page it can crash11:26.19 
  does what I say make sense? :)11:26.33 
ator sebras: which doc->open?11:27.14 
sebras ator: fz_document has a fz_page *open.11:27.29 
apollo13 btw can someone explain to me how fz_page next/prev work. Ie why is prev a **page and next only *page?11:27.44 
sebras ator: when we add a newly loaded page in fz_load_chapter_page() we give the reference to the caller.11:27.51 
  ator: but we do not take an extra reference when inserting the page into fz_document->open11:28.10 
ator sebras: I think that list is supposed to be weak references11:28.22 
sebras ator: when the JVM sees no further variables pointing to the page fron the caller it will call the page finalizer, thereby dropping the page.11:29.13 
  ator: wouldn't that situation cause a stale page pointer to remain in fz_document->open?11:29.26 
ator fz_drop_page removes the page11:29.27 
  form the document->open list11:29.51 
sebras ah, I didn't see that because it didn't refer to ->open11:30.12 
Robin_Watts apollo13: prev is a pointer to the pointer to the current page.11:30.30 
  so you can unlink yourself by doing *(me->prev) = me->next11:30.57 
apollo13 sebras, ator: yes it drop and somewhere between fz_drop_page and fz_load_chapter_page (the loop there which iterates all pages) you get a dangling point I think11:31.13 
  ie if the loop find the page and then calls keep on it, drop_page could already be destroying it11:31.43 
Robin_Watts apollo13: You know your 'next' page is always going to be NULL or an fz_page *.11:31.46 
apollo13 Robin_Watts: And the same couldn't be done for 'prev'?11:32.25 
  oh wait, I thought prev is for previous11:32.38 
  and you just said it's for the current page11:32.52 
Robin_Watts No.11:32.57 
  prev points to "the thing that points to me". which is in most cases the previous page.11:33.31 
  Think of it this way.11:33.45 
ator apollo13: it is a doubly-linked list. the current page's prev points to the previous page's next pointer (which is the current page)11:33.46 
Robin_Watts next is either NULL or an fz_page.11:34.07 
apollo13 ah, slowly it's beginning to make sense :)11:34.14 
ator *or* if it's the first page, the root of the linked list: the fz_document.open pointer.11:34.21 
sebras ator: if we have one thread doing the finalizing calls and another thread calling fz_load_chapter_page() we do need some kind of lock over this list.11:34.26 
Robin_Watts BUT if you're the first page in the list, there is no previous page, but something still has to point to you.11:34.48 
ator sebras: I thought the get_context() stuff in JNI did the locking?11:34.59 
sebras ator: get_context() doesn't lock anything at all, it just stores the context as TLS.11:35.38 
  and clones the first context if the currently running thrad has no context in TLS.11:36.09 
ator sebras: so how do we not cock up when the naive java users gets into multiple threads in the normal case?11:36.17 
apollo13 ator: we only see it under high load when the GC hits at the wrong time11:36.45 
ator the TLS makes sure we don't screw up the context and error stack, but we can crash if the user calls a document method from multiple threads?11:36.57 
apollo13 usually it just works ™11:36.59 
  ator: even from the same thread11:37.12 
sebras ator: we can.11:37.15 
apollo13 the problem is that java's garbage collection runs in another thread (not much you can do about that afaik)11:37.30 
ator sebras: oh. that's probably what's happening here then, the GC is running/triggered on a separate thread.11:37.41 
apollo13 yes, that is what we are seeing in the asan traces11:37.57 
sebras ator: in C we say the same thing: work on the fz_document in one and only one thread and get your display lists, the lists can then be processed in other threads.11:38.13 
apollo13 the thread named "finalizer" is this special gc thread11:38.21 
sebras ator: the problem here is that the GC in a second thread, which is likely not a situation we've encountered in C.11:38.45 
ator sebras: I thought we were/did some 'synchronized' stuff in JNI to prevent accidents like that11:38.47 
Robin_Watts Urgh, this all sounds horrid.11:39.01 
apollo13 Robin_Watts: welcome to my world11:39.13 
sebras ator: we do not.11:39.14 
ator sebras: right. so caveat emptor there. still, that doesn't help the current situation though :(11:39.52 
  apollo13: what is the slowdown with calling page.destroy?11:40.30 
sebras ator: we could make all calls on Document synchronize on the document, that would help I think..?11:40.43 
apollo13 ator: what as in how much? well the page is no longer cached, so more work to do… :)11:40.57 
Robin_Watts sebras: Yes, I'm just pondering/googling that now.11:40.59 
ator sebras: not necessarily, the synchronized only locks per object IIRC11:41.03 
  and these are Page objects, not Document11:41.09 
  so won't synchronize to the same11:41.14 
sebras ator: ah, yes. we're dropping the page, not the document, of course.11:41.26 
ator that's why the first Android app JNI bindings used a global MuPDFCore object to do the synchronization11:41.36 
Robin_Watts We'd have to make the page objects finalize by calling through a sychronised method in the document.11:41.52 
ator Robin_Watts: or Context?11:42.19 
apollo13 Robin_Watts: I doubt that will help11:42.20 
  because loadPage will not be affected by that11:42.39 
Robin_Watts It's Document.loadPage, right?11:42.57 
apollo13 yes11:43.06 
sebras Robin_Watts: ator: why can't we add a lock at the C-level whenever we manipulate the document's opened page list?11:43.09 
ator synchronized Document.loadPage and synchronized Document.destroyPage() would work I think11:43.15 
apollo13 ator: how would you sync those around one lock11:43.31 
ator sebras: because I promise this won't be the only place :(11:43.37 
apollo13 if you just put synchronized there they'd sync independently no?11:43.41 
sebras ator: then we're in deep trouble in C too..?11:43.56 
ator synchronized uses a mutex on the object, not function11:44.06 
  sebras: we don't have garbage collectors running around behind the user's back poking stuff randomly11:44.25 
apollo13 ator: ah, I guess then I could at least come up with a local workaround11:44.52 
Robin_Watts I think my proposal is this: make Document.loadPage synchronized. Add synchronized Document.destroyPage. Make Page.finalize call Document.destroyPage.11:45.07 
apollo13 that sounds like a good solution.11:45.57 
sebras ator: "I promise this won't be the only place" what does this mean? that there are other objects where the finalizer would have the same problem?11:46.17 
Robin_Watts of course, I'm not sure we have a Document pointer from a Page at the java level.11:46.20 
ator I'm guessing this is only a problem because of the 'weak' references in the document list11:46.33 
sebras ator: I wouldn't be surprised if that is the case. but if it is then Robin_Watts's suggestion wouldn't work.11:46.41 
Robin_Watts sebras: My fix would be enough for *this* case.11:46.55 
ator Robin_Watts: we can add one11:47.07 
Robin_Watts so the can is kicked down the road until we find the next problem.11:47.11 
  ator: Yeah, that shouldn't cause problems, I hope...11:47.35 
sebras I'm still confused as to why we can't use a lock over the page list.11:47.39 
Robin_Watts sebras: which lock?11:47.45 
  a new lock?11:48.15 
sebras Robin_Watts: possibly, yes.11:48.29 
ator the malloc lock maybe?11:48.35 
Robin_Watts ator: If we use the malloc lock, we then can't malloc :)11:48.52 
apollo13 :D11:49.01 
sebras Robin_Watts: more importantly, we can't drop, right?11:49.11 
Robin_Watts sebras: Indeed.11:49.20 
ator Robin_Watts: fz_load_chapter_page has two blocks that need protecting11:49.21 
sebras because the locks we have are not necessarily recursive.11:49.21 
ator and fz_drop_page has one block11:49.42 
  one that scans the list, and two that insert and delete items from it11:49.59 
  none of which call any functions11:50.10 
apollo13 wouldn't fz_keep_page also need the look then?11:50.15 
Robin_Watts We could have FZ_LOCK_DOC which we used to protect "the document structure".11:50.50 
  it'd be inserted between ALLOC and FREETYPE11:51.04 
ator apollo13: no, that just bumps the reference count. isn't the problem happening because drop_page removes it from the list while the list is being scanned in load_page?11:51.10 
sebras ator: do we have weak references like this elsewhere?11:51.30 
Robin_Watts ator: right, it's 2 threads accessing the document structure at the same time.11:51.31 
apollo13 ator: yes11:51.34 
ator Robin_Watts: *all* destructors will access all structures from 2 threads :(11:51.56 
  but I think the list of of destructors that do anything significant is very short11:52.15 
Robin_Watts ator: I'm not following that assertion.11:52.34 
apollo13 ator: at least I assume so, but looking at the asan log, what is failing is fz_keep_page itself, because it gets passed and already deleted page (from the iteration)11:52.41 
sebras apollo13: that deleted page comes from the list of opened pages stored in the document11:53.11 
apollo13 jupp11:53.17 
ator apollo13: that's what I assume would happen. the load_page gets a page from the list at the same time another thread just freed it11:53.31 
apollo13 ok, then we are all on the same side :)11:53.55 
Robin_Watts ator: Are you saying that a DOC lock is not sufficient?11:54.15 
  or just that it might need to be used in more places?11:54.27 
ator Robin_Watts: I think a DOC lock is probably good enough11:54.28 
sebras ator: Robin_Watts: is a DOC lock a good solution though?11:54.43 
ator just might be needed in more places, depending on how often this problem crops up11:54.43 
Robin_Watts ator, sebras: So, which of us gets to do this?11:54.49 
ator the normal fz_drop_imp does enough locking to keep us from double-freeing in the destructor11:55.05 
sebras or do you still prefer sychronized?11:55.12 
ator touches his nose.11:55.28 
Robin_Watts sebras: I'd rather do it in C than in Java, I think.11:55.30 
sebras I can do it.11:55.52 
Robin_Watts I'm happy to have a go at it. I have just finished stuff on my list, so this would be a good point.11:56.02 
sebras Robin_Watts: ok. then you do it. :)11:56.13 
Robin_Watts but if sebras wants it, I have other stuff to do.11:56.14 
sebras (it sounded like you both were preoccupied)11:56.40 
apollo13 well either way I'll happily test a patch and verify that it fixes the issue since I have a good reproducer11:56.57 
Robin_Watts OK. I'll bash at it now.11:57.20 
  apollo13: Are you around for a couple of hours?11:57.30 
  (I'm on UK time, where are you?)11:57.40 
apollo13 Robin_Watts: austria, yeah will be around12:01.23 
Robin_Watts actually, I think we *can* use the malloc lock for this.12:02.27 
  which is nicer.12:02.51 
sebras Robin_Watts: because it happens after fz_drop_imp()?12:03.48 
Robin_Watts because we don't do any allocations/deallocations while we're modifying the structure itself.12:04.24 
  We allocate then modify, or modify then deallocate.12:04.43 
Kabouik I am getting "fatal error: lcms2mt.h" when trying to build from source on Debian, yet I did run `apt-get build-dep` first12:21.07 
  Is there anything else I'm missing?12:21.13 
Robin_Watts Kabouik: Our sources don't come as an apk.12:23.43 
apollo13 Kabouik: did you update the submodules first?12:23.55 
Robin_Watts s/as an apk/via apt/12:23.56 
ator Kabouik: if you cloned from git, you have to run 'git submodule update --init' first12:24.20 
Robin_Watts If you're building from git, then do "git submodule update --init"12:24.20 
apollo13 I think I've seen this yesterday after just checking out the repo and forgot the submodules12:24.21 
ator (or clone with --recursive)12:24.24 
Robin_Watts If you're building from one of our release archives, the sources should all be in there.12:24.46 
  If you're building from your distros repackaging of our stuff, then you need to complain to them. :)12:25.02 
  apollo13, ator, sebras: This is my first attempt: https://git.ghostscript.com/?p=user/robin/mupdf.git;a=commitdiff;h=e4afff800d62ae83e3a45427bf2c0c09483e04a912:25.30 
  Just cluster testing that now.12:25.35 
apollo13 ok, rebuilding and testing12:26.14 
Robin_Watts grabs lunch. bbs.12:27.38 
Kabouik apk? What I'm trying to do has nothing to do with apks, does it?12:28.39 
  Oh right, I read too fast12:28.54 
  Actually what I'm trying to do is this: https://askubuntu.com/a/1208373 If I understand correctly, it just fetches the sources from git anyway, but the build-dep should automate the dependencies part12:29.35 
  I'm wrong, it doesn't fetch from git, it recommends fetching from git in the logs :p12:31.04 
  I'll just do that.12:31.14 
apollo13 Kabouik: build-dep will install the dependencies, that is correct, but mupdf doesn't use those12:31.20 
  at least not in the standard build setup12:31.29 
Kabouik What I wanted is manually build mupdf to get mupdf-gl executable, and the freeglut fixes12:32.00 
  I can build from git I guess, no problem12:32.07 
apollo13 Robin_Watts: patch looks good, no segfault till now12:41.52 
ator Robin_Watts: LGTM12:44.04 
Robin_Watts apollo13, ator: Thanks.13:31.56 
  Pushed to master.13:32.27 
sebras Robin_Watts: thanks for fixing.13:32.36 
  Robin_Watts: let's hope there are no more locations...13:32.44 
Robin_Watts apollo13: Thanks for that. Such coherent bug reports are rare, and much appreciated!13:32.55 
  sebras: indeed.13:33.06 
apollo13 Robin_Watts: np, took us a while to find (and get it somewhat reproducable). Thank you for the quick fix!13:33.34 
malc_ hmm.. what determines whether meta constant deserves a define? cause there isn't one for creation date19:56.33 
sebras malc_: it is quite likely that there should be one.20:15.06 
malc_ sebras: it's equally likely that this is a whack a mole...20:15.58 
sebras malc_: not really. we know what the standard names are in /Info20:16.16 
  malc_: personally I prefer FZ_META_* defines since they will not change.20:17.05 
malc_ sebras: for my program - you are correct, only CreationDate is missing a define, at large - no idea, you know better20:19.04 
  https://github.com/moosotc/llpp/blob/master/link.c#L39520:20.42 
sebras malc_: Title, Author, Subject, Keywords, Creator, Producer, CreationDate, ModDate, Trappaed are the fields that are specified.20:21.03 
  Trapped.20:21.34 
malc_ sebras: tack!20:21.43 
sebras maybe we should add FZ_META_INFO_CREATION_DATE and FZ_META_INFO_MOD_DATE and be done with it.20:21.56 
  I've never seen _ANYONE_ use keywords or subject fields.20:22.15 
  and I have no idea was /Trapped means.20:22.29 
malc_ maybe it's those adobe technical writers trying to tell us something20:23.28 
 <<<Back 1 day (to 2020/07/08)Forward 1 day (to 2020/07/10)>>> 
ghostscript.com #ghostscript
Search: