| <<<Back 1 day (to 2020/07/08) | Fwd 1 day (to 2020/07/10)>>> | 20200709 |
apollo13 | hi there, we are seeing crashes with mupdf and the java bindings. We instrumented the code with asan and voila: https://gist.github.com/apollo13/eb5499fc6911ddccbbfe44169f661c2d it seems as if a thread tries to load a page while the garbage collector disposes the same page, that mupdf tries to reuse an already freed memory section | 11:04.18 |
| Looking at the code I think it might have to do with the page list probably not beeing threadsafe? | 11:05.00 |
ator | apollo13: the C code needs care to not trip over itself when threading, but I think the java bindings should take care of that | 11:16.05 |
| so it looks like a bug for Robin_Watts or sebras I think | 11:16.20 |
sebras | apollo13: what are your multiple threads doing? | 11:18.30 |
apollo13 | sebras: well, they call Document.loadPage and render the page into a bitmap | 11:19.04 |
sebras | apollo13: is one thread doing all the accessing of the Document and then other threads working on rendered display lists? | 11:19.12 |
apollo13 | this is in a webservice, what happening is the following: | 11:19.14 |
sebras | s/rendered// | 11:19.17 |
apollo13 | we keep the document object around and when a new http request comes in it does document.loadPage() | 11:19.44 |
sebras | apollo13: so you have multiple threads calling Document.loadPage() on the same Document? | 11:19.49 |
apollo13 | yes | 11:19.57 |
| but one after another | 11:20.06 |
| not neccessarily in parallel | 11:20.10 |
| the issue comes in conjunction with garbage collection since java calls finalize in another thread | 11:20.28 |
| the problem here being that if (and only if) loadPage() returns an already existing page (fz_keep_page etc) while java garbage collection calls Page.finalize() | 11:21.10 |
sebras | apollo13: that the finalize method is called means that the last reference to it has been freed according to the JVM. | 11:21.17 |
apollo13 | somehow somewhere in there seems to be a race | 11:21.18 |
| yes | 11:21.25 |
| but at the same time someone else calls loadPage() which would reuse (in C) the existing page | 11:21.54 |
| we are currently working around with it by explicitly calling page.destroy() but that slows down the code quite a lot | 11:24.11 |
sebras | ator: doc->open doesn't keep references to the pages it puts in its list. | 11:25.37 |
| ator: no wonder apollo13 sees issues when the document tries to reuse and old page. | 11:26.06 |
apollo13 | so to summarize what we are doing: some thread does page = document.loadPage and lets the page object go out of scope. at some point the garbage collection calls Page.finalize(), if at this some point another thread calls page = document.loadPage() reusing the old page it can crash | 11:26.19 |
| does what I say make sense? :) | 11:26.33 |
ator | sebras: which doc->open? | 11:27.14 |
sebras | ator: fz_document has a fz_page *open. | 11:27.29 |
apollo13 | btw can someone explain to me how fz_page next/prev work. Ie why is prev a **page and next only *page? | 11:27.44 |
sebras | ator: when we add a newly loaded page in fz_load_chapter_page() we give the reference to the caller. | 11:27.51 |
| ator: but we do not take an extra reference when inserting the page into fz_document->open | 11:28.10 |
ator | sebras: I think that list is supposed to be weak references | 11:28.22 |
sebras | ator: when the JVM sees no further variables pointing to the page fron the caller it will call the page finalizer, thereby dropping the page. | 11:29.13 |
| ator: wouldn't that situation cause a stale page pointer to remain in fz_document->open? | 11:29.26 |
ator | fz_drop_page removes the page | 11:29.27 |
| form the document->open list | 11:29.51 |
sebras | ah, I didn't see that because it didn't refer to ->open | 11:30.12 |
Robin_Watts | apollo13: prev is a pointer to the pointer to the current page. | 11:30.30 |
| so you can unlink yourself by doing *(me->prev) = me->next | 11:30.57 |
apollo13 | sebras, ator: yes it drop and somewhere between fz_drop_page and fz_load_chapter_page (the loop there which iterates all pages) you get a dangling point I think | 11:31.13 |
| ie if the loop find the page and then calls keep on it, drop_page could already be destroying it | 11:31.43 |
Robin_Watts | apollo13: You know your 'next' page is always going to be NULL or an fz_page *. | 11:31.46 |
apollo13 | Robin_Watts: And the same couldn't be done for 'prev'? | 11:32.25 |
| oh wait, I thought prev is for previous | 11:32.38 |
| and you just said it's for the current page | 11:32.52 |
Robin_Watts | No. | 11:32.57 |
| prev points to "the thing that points to me". which is in most cases the previous page. | 11:33.31 |
| Think of it this way. | 11:33.45 |
ator | apollo13: it is a doubly-linked list. the current page's prev points to the previous page's next pointer (which is the current page) | 11:33.46 |
Robin_Watts | next is either NULL or an fz_page. | 11:34.07 |
apollo13 | ah, slowly it's beginning to make sense :) | 11:34.14 |
ator | *or* if it's the first page, the root of the linked list: the fz_document.open pointer. | 11:34.21 |
sebras | ator: if we have one thread doing the finalizing calls and another thread calling fz_load_chapter_page() we do need some kind of lock over this list. | 11:34.26 |
Robin_Watts | BUT if you're the first page in the list, there is no previous page, but something still has to point to you. | 11:34.48 |
ator | sebras: I thought the get_context() stuff in JNI did the locking? | 11:34.59 |
sebras | ator: get_context() doesn't lock anything at all, it just stores the context as TLS. | 11:35.38 |
| and clones the first context if the currently running thrad has no context in TLS. | 11:36.09 |
ator | sebras: so how do we not cock up when the naive java users gets into multiple threads in the normal case? | 11:36.17 |
apollo13 | ator: we only see it under high load when the GC hits at the wrong time | 11:36.45 |
ator | the TLS makes sure we don't screw up the context and error stack, but we can crash if the user calls a document method from multiple threads? | 11:36.57 |
apollo13 | usually it just works ™ | 11:36.59 |
| ator: even from the same thread | 11:37.12 |
sebras | ator: we can. | 11:37.15 |
apollo13 | the problem is that java's garbage collection runs in another thread (not much you can do about that afaik) | 11:37.30 |
ator | sebras: oh. that's probably what's happening here then, the GC is running/triggered on a separate thread. | 11:37.41 |
apollo13 | yes, that is what we are seeing in the asan traces | 11:37.57 |
sebras | ator: in C we say the same thing: work on the fz_document in one and only one thread and get your display lists, the lists can then be processed in other threads. | 11:38.13 |
apollo13 | the thread named "finalizer" is this special gc thread | 11:38.21 |
sebras | ator: the problem here is that the GC in a second thread, which is likely not a situation we've encountered in C. | 11:38.45 |
ator | sebras: I thought we were/did some 'synchronized' stuff in JNI to prevent accidents like that | 11:38.47 |
Robin_Watts | Urgh, this all sounds horrid. | 11:39.01 |
apollo13 | Robin_Watts: welcome to my world | 11:39.13 |
sebras | ator: we do not. | 11:39.14 |
ator | sebras: right. so caveat emptor there. still, that doesn't help the current situation though :( | 11:39.52 |
| apollo13: what is the slowdown with calling page.destroy? | 11:40.30 |
sebras | ator: we could make all calls on Document synchronize on the document, that would help I think..? | 11:40.43 |
apollo13 | ator: what as in how much? well the page is no longer cached, so more work to do… :) | 11:40.57 |
Robin_Watts | sebras: Yes, I'm just pondering/googling that now. | 11:40.59 |
ator | sebras: not necessarily, the synchronized only locks per object IIRC | 11:41.03 |
| and these are Page objects, not Document | 11:41.09 |
| so won't synchronize to the same | 11:41.14 |
sebras | ator: ah, yes. we're dropping the page, not the document, of course. | 11:41.26 |
ator | that's why the first Android app JNI bindings used a global MuPDFCore object to do the synchronization | 11:41.36 |
Robin_Watts | We'd have to make the page objects finalize by calling through a sychronised method in the document. | 11:41.52 |
ator | Robin_Watts: or Context? | 11:42.19 |
apollo13 | Robin_Watts: I doubt that will help | 11:42.20 |
| because loadPage will not be affected by that | 11:42.39 |
Robin_Watts | It's Document.loadPage, right? | 11:42.57 |
apollo13 | yes | 11:43.06 |
sebras | Robin_Watts: ator: why can't we add a lock at the C-level whenever we manipulate the document's opened page list? | 11:43.09 |
ator | synchronized Document.loadPage and synchronized Document.destroyPage() would work I think | 11:43.15 |
apollo13 | ator: how would you sync those around one lock | 11:43.31 |
ator | sebras: because I promise this won't be the only place :( | 11:43.37 |
apollo13 | if you just put synchronized there they'd sync independently no? | 11:43.41 |
sebras | ator: then we're in deep trouble in C too..? | 11:43.56 |
ator | synchronized uses a mutex on the object, not function | 11:44.06 |
| sebras: we don't have garbage collectors running around behind the user's back poking stuff randomly | 11:44.25 |
apollo13 | ator: ah, I guess then I could at least come up with a local workaround | 11:44.52 |
Robin_Watts | I think my proposal is this: make Document.loadPage synchronized. Add synchronized Document.destroyPage. Make Page.finalize call Document.destroyPage. | 11:45.07 |
apollo13 | that sounds like a good solution. | 11:45.57 |
sebras | ator: "I promise this won't be the only place" what does this mean? that there are other objects where the finalizer would have the same problem? | 11:46.17 |
Robin_Watts | of course, I'm not sure we have a Document pointer from a Page at the java level. | 11:46.20 |
ator | I'm guessing this is only a problem because of the 'weak' references in the document list | 11:46.33 |
sebras | ator: I wouldn't be surprised if that is the case. but if it is then Robin_Watts's suggestion wouldn't work. | 11:46.41 |
Robin_Watts | sebras: My fix would be enough for *this* case. | 11:46.55 |
ator | Robin_Watts: we can add one | 11:47.07 |
Robin_Watts | so the can is kicked down the road until we find the next problem. | 11:47.11 |
| ator: Yeah, that shouldn't cause problems, I hope... | 11:47.35 |
sebras | I'm still confused as to why we can't use a lock over the page list. | 11:47.39 |
Robin_Watts | sebras: which lock? | 11:47.45 |
| a new lock? | 11:48.15 |
sebras | Robin_Watts: possibly, yes. | 11:48.29 |
ator | the malloc lock maybe? | 11:48.35 |
Robin_Watts | ator: If we use the malloc lock, we then can't malloc :) | 11:48.52 |
apollo13 | :D | 11:49.01 |
sebras | Robin_Watts: more importantly, we can't drop, right? | 11:49.11 |
Robin_Watts | sebras: Indeed. | 11:49.20 |
ator | Robin_Watts: fz_load_chapter_page has two blocks that need protecting | 11:49.21 |
sebras | because the locks we have are not necessarily recursive. | 11:49.21 |
ator | and fz_drop_page has one block | 11:49.42 |
| one that scans the list, and two that insert and delete items from it | 11:49.59 |
| none of which call any functions | 11:50.10 |
apollo13 | wouldn't fz_keep_page also need the look then? | 11:50.15 |
Robin_Watts | We could have FZ_LOCK_DOC which we used to protect "the document structure". | 11:50.50 |
| it'd be inserted between ALLOC and FREETYPE | 11:51.04 |
ator | apollo13: no, that just bumps the reference count. isn't the problem happening because drop_page removes it from the list while the list is being scanned in load_page? | 11:51.10 |
sebras | ator: do we have weak references like this elsewhere? | 11:51.30 |
Robin_Watts | ator: right, it's 2 threads accessing the document structure at the same time. | 11:51.31 |
apollo13 | ator: yes | 11:51.34 |
ator | Robin_Watts: *all* destructors will access all structures from 2 threads :( | 11:51.56 |
| but I think the list of of destructors that do anything significant is very short | 11:52.15 |
Robin_Watts | ator: I'm not following that assertion. | 11:52.34 |
apollo13 | ator: at least I assume so, but looking at the asan log, what is failing is fz_keep_page itself, because it gets passed and already deleted page (from the iteration) | 11:52.41 |
sebras | apollo13: that deleted page comes from the list of opened pages stored in the document | 11:53.11 |
apollo13 | jupp | 11:53.17 |
ator | apollo13: that's what I assume would happen. the load_page gets a page from the list at the same time another thread just freed it | 11:53.31 |
apollo13 | ok, then we are all on the same side :) | 11:53.55 |
Robin_Watts | ator: Are you saying that a DOC lock is not sufficient? | 11:54.15 |
| or just that it might need to be used in more places? | 11:54.27 |
ator | Robin_Watts: I think a DOC lock is probably good enough | 11:54.28 |
sebras | ator: Robin_Watts: is a DOC lock a good solution though? | 11:54.43 |
ator | just might be needed in more places, depending on how often this problem crops up | 11:54.43 |
Robin_Watts | ator, sebras: So, which of us gets to do this? | 11:54.49 |
ator | the normal fz_drop_imp does enough locking to keep us from double-freeing in the destructor | 11:55.05 |
sebras | or do you still prefer sychronized? | 11:55.12 |
ator | touches his nose. | 11:55.28 |
Robin_Watts | sebras: I'd rather do it in C than in Java, I think. | 11:55.30 |
sebras | I can do it. | 11:55.52 |
Robin_Watts | I'm happy to have a go at it. I have just finished stuff on my list, so this would be a good point. | 11:56.02 |
sebras | Robin_Watts: ok. then you do it. :) | 11:56.13 |
Robin_Watts | but if sebras wants it, I have other stuff to do. | 11:56.14 |
sebras | (it sounded like you both were preoccupied) | 11:56.40 |
apollo13 | well either way I'll happily test a patch and verify that it fixes the issue since I have a good reproducer | 11:56.57 |
Robin_Watts | OK. I'll bash at it now. | 11:57.20 |
| apollo13: Are you around for a couple of hours? | 11:57.30 |
| (I'm on UK time, where are you?) | 11:57.40 |
apollo13 | Robin_Watts: austria, yeah will be around | 12:01.23 |
Robin_Watts | actually, I think we *can* use the malloc lock for this. | 12:02.27 |
| which is nicer. | 12:02.51 |
sebras | Robin_Watts: because it happens after fz_drop_imp()? | 12:03.48 |
Robin_Watts | because we don't do any allocations/deallocations while we're modifying the structure itself. | 12:04.24 |
| We allocate then modify, or modify then deallocate. | 12:04.43 |
Kabouik | I am getting "fatal error: lcms2mt.h" when trying to build from source on Debian, yet I did run `apt-get build-dep` first | 12:21.07 |
| Is there anything else I'm missing? | 12:21.13 |
Robin_Watts | Kabouik: Our sources don't come as an apk. | 12:23.43 |
apollo13 | Kabouik: did you update the submodules first? | 12:23.55 |
Robin_Watts | s/as an apk/via apt/ | 12:23.56 |
ator | Kabouik: if you cloned from git, you have to run 'git submodule update --init' first | 12:24.20 |
Robin_Watts | If you're building from git, then do "git submodule update --init" | 12:24.20 |
apollo13 | I think I've seen this yesterday after just checking out the repo and forgot the submodules | 12:24.21 |
ator | (or clone with --recursive) | 12:24.24 |
Robin_Watts | If you're building from one of our release archives, the sources should all be in there. | 12:24.46 |
| If you're building from your distros repackaging of our stuff, then you need to complain to them. :) | 12:25.02 |
| apollo13, ator, sebras: This is my first attempt: https://git.ghostscript.com/?p=user/robin/mupdf.git;a=commitdiff;h=e4afff800d62ae83e3a45427bf2c0c09483e04a9 | 12:25.30 |
| Just cluster testing that now. | 12:25.35 |
apollo13 | ok, rebuilding and testing | 12:26.14 |
Robin_Watts | grabs lunch. bbs. | 12:27.38 |
Kabouik | apk? What I'm trying to do has nothing to do with apks, does it? | 12:28.39 |
| Oh right, I read too fast | 12:28.54 |
| Actually what I'm trying to do is this: https://askubuntu.com/a/1208373 If I understand correctly, it just fetches the sources from git anyway, but the build-dep should automate the dependencies part | 12:29.35 |
| I'm wrong, it doesn't fetch from git, it recommends fetching from git in the logs :p | 12:31.04 |
| I'll just do that. | 12:31.14 |
apollo13 | Kabouik: build-dep will install the dependencies, that is correct, but mupdf doesn't use those | 12:31.20 |
| at least not in the standard build setup | 12:31.29 |
Kabouik | What I wanted is manually build mupdf to get mupdf-gl executable, and the freeglut fixes | 12:32.00 |
| I can build from git I guess, no problem | 12:32.07 |
apollo13 | Robin_Watts: patch looks good, no segfault till now | 12:41.52 |
ator | Robin_Watts: LGTM | 12:44.04 |
Robin_Watts | apollo13, ator: Thanks. | 13:31.56 |
| Pushed to master. | 13:32.27 |
sebras | Robin_Watts: thanks for fixing. | 13:32.36 |
| Robin_Watts: let's hope there are no more locations... | 13:32.44 |
Robin_Watts | apollo13: Thanks for that. Such coherent bug reports are rare, and much appreciated! | 13:32.55 |
| sebras: indeed. | 13:33.06 |
apollo13 | Robin_Watts: np, took us a while to find (and get it somewhat reproducable). Thank you for the quick fix! | 13:33.34 |
malc_ | hmm.. what determines whether meta constant deserves a define? cause there isn't one for creation date | 19:56.33 |
sebras | malc_: it is quite likely that there should be one. | 20:15.06 |
malc_ | sebras: it's equally likely that this is a whack a mole... | 20:15.58 |
sebras | malc_: not really. we know what the standard names are in /Info | 20:16.16 |
| malc_: personally I prefer FZ_META_* defines since they will not change. | 20:17.05 |
malc_ | sebras: for my program - you are correct, only CreationDate is missing a define, at large - no idea, you know better | 20:19.04 |
| https://github.com/moosotc/llpp/blob/master/link.c#L395 | 20:20.42 |
sebras | malc_: Title, Author, Subject, Keywords, Creator, Producer, CreationDate, ModDate, Trappaed are the fields that are specified. | 20:21.03 |
| Trapped. | 20:21.34 |
malc_ | sebras: tack! | 20:21.43 |
sebras | maybe we should add FZ_META_INFO_CREATION_DATE and FZ_META_INFO_MOD_DATE and be done with it. | 20:21.56 |
| I've never seen _ANYONE_ use keywords or subject fields. | 20:22.15 |
| and I have no idea was /Trapped means. | 20:22.29 |
malc_ | maybe it's those adobe technical writers trying to tell us something | 20:23.28 |
| <<<Back 1 day (to 2020/07/08) | Forward 1 day (to 2020/07/10)>>> | |