| <<<Back 1 day (to 2016/09/25) | 20160926 |
Robin_Watts | tor8: So, we should start thinking about a MuPDF release. | 10:19.18 |
| tor8: Anything we are waiting for? | 10:19.33 |
| I guess the latest openjpeg fixes would be good. | 10:19.42 |
sebras | Robin_Watts: do we want the latest luratech drop in the release? (I don't know how you guys plan to do the commercial release..?) | 10:21.58 |
Robin_Watts | sebras: I don't think we've done a commercial release up til now. | 10:22.25 |
sebras | Robin_Watts: ok, so that would be a new thing. | 10:22.33 |
Robin_Watts | yeah. | 10:22.40 |
sebras | Robin_Watts: maybe something to bring up at the meeting? | 10:22.44 |
Robin_Watts | sebras: Yeah. | 10:22.49 |
| If we did a separate commercial release with luratech, would that actually be better than openjpeg at the moment? | 10:23.24 |
tor8 | Robin_Watts: the stuff on tor/master and a round of testing the old android/ios apps to make sure we haven't broken anything | 10:23.35 |
Robin_Watts | i.e. is the luratech integration at least as good as openjpeg ? | 10:23.38 |
sebras | Robin_Watts: I feel that this is the case, but they both fail on different test cases. | 10:24.16 |
tor8 | I think we're good on the last round of fuzzing bugs that came in | 10:24.21 |
sebras | Robin_Watts: so neither is bullet proof at the moment. | 10:24.24 |
| tor8: I found a few valgrind issues in gif and png yesterday evening. | 10:24.52 |
tor8 | sebras: if the only "problem" is that they fail differently on invalid files, I'm not concerned | 10:24.58 |
sebras | tor8: but I don't think I will be able to fix them before the release. | 10:25.02 |
| tor8: by failing on j2k/jp2 I mean valgrind issues, just to be clear. :) | 10:25.23 |
tor8 | I'm not in a hurry to release | 10:25.26 |
| sebras: right. valgrind issues are a different issue :/ | 10:25.43 |
Robin_Watts | sebras: Can you open bugs for those issues please? | 10:25.51 |
sebras | Robin_Watts: the luratech ones too? | 10:25.57 |
| Robin_Watts: that will be a lot... | 10:26.03 |
| Robin_Watts: like 40..? | 10:26.08 |
tor8 | I don't feel comfortable releasing with known valgrind issues | 10:26.15 |
Robin_Watts | I meant the gif and png. | 10:26.18 |
sebras | Robin_Watts: ok, sure. | 10:26.24 |
Robin_Watts | cos I may be able to help with some of those. | 10:26.29 |
| If luratech is valgrind faily, then I don't personally feel the need to rush to a separate commercial release until we fix those. | 10:26.56 |
sebras | Robin_Watts: ok, I'll report them later today. I need to go for dinner quite soon (typhoon again tonight/tomorrow morning) | 10:26.56 |
Robin_Watts | sebras: Sure, np. | 10:27.02 |
sebras | Robin_Watts: openjpeg is also valgrind faily. | 10:27.07 |
Robin_Watts | sebras: Right, but we've already rung that bell. | 10:27.23 |
sebras | Robin_Watts: but maybe the fixes exist upstream, I haven't figured that out yet. | 10:27.33 |
Robin_Watts | I'm going to try pulling in some upstream fixes now. | 10:27.46 |
sebras | Robin_Watts: is it possible to see the full list of failing testcases on trunk? not just the diff with the previous run..? | 10:28.23 |
Robin_Watts | sebras: I think the cluster always reports all the SEGVs/asserts. | 10:28.57 |
| It doesn't report all the cases where files exit neatly with an error. | 10:29.22 |
sebras | Robin_Watts: aha. | 10:29.42 |
Robin_Watts | So if you want to see all the cases, cluster test a diff that crashes on a non-zero error return :) | 10:29.49 |
sebras | Robin_Watts: the tiff issues I fixed were indeed reported as errors. | 10:29.56 |
| tor8: Robin_Watts: I don't know if you guys saw that I fixed a few pnm-/gif-/tiff-related things recently. | 10:30.50 |
Robin_Watts | I saw changes in pbm the other day (black -> white) and an alpha thing in tiff before that. | 10:31.57 |
| Further back than that, my memory fails me. | 10:32.12 |
sebras | ok. I had a stray WIP-prefix in there, fixed. | 10:33.13 |
tor8 | sebras: "when called code throws" is a pretty vague statement :) | 10:33.52 |
sebras | tor8: agreed, care to help in rephrasing it? | 10:34.43 |
| tor8: basically fz_parse_html() and fz_read_all() may both throw. | 10:34.56 |
Robin_Watts | Hmm. 480 commits since the version.2.1 tag. | 10:35.13 |
tor8 | Fix memory leak in bla bla bla. | 10:35.22 |
| sebras: the gif fix changes the number of bytes read from info->width * height to info->image_width * height | 10:36.12 |
| is that a separate fix not mentioned in the comment? | 10:36.19 |
sebras | tor8: yes, probably. | 10:36.59 |
| tor8: info->width/height is the size of the entire image, while info->image_width/image_height is the size of this gif frame. | 10:37.29 |
| tor8: if I split the two commits I'll check what terminology the spec uses and change the variable naming. | 10:38.00 |
| tor8: Fix memory leak when opening html/loading raw stream. ? | 10:38.17 |
tor8 | sebras: sounds good | 10:42.06 |
sebras | tor8: hm.. gif->width is called "logical screen width" in the spec and gif->image_width is "image width"... I think I'll stick with the existing naming. | 10:44.29 |
| tor8: now two gif-commits. | 10:47.50 |
tor8 | sebras: LGTM | 10:50.05 |
| sebras: hm, can we really handle bits per sample 16 in the tiff code? | 10:50.34 |
sebras | Robin_Watts: there is a getcomp() in there that seems plausible and the images I tested with looked ok. | 10:50.56 |
tor8 | sebras: oh yeah. that should be okay then I guess. | 10:51.29 |
| sebras: where do we run into tiled TIFF images? | 10:52.50 |
| not as part of XPS, I hope. | 10:53.04 |
sebras | tor8: no, I ran into them in a testsuite. | 10:53.20 |
| tor8: and was annoyed enough to fix them. :) | 10:53.32 |
| tor8: incidentally this is where I also found the problematic gifs and pngs. | 10:53.59 |
Robin_Watts | sebras: Have we added those to the cluster? | 10:54.14 |
sebras | Robin_Watts: no, not yet. | 10:54.20 |
Robin_Watts | I imported the mupdf TIFF code (with a lot of munging) into SOT. | 10:54.47 |
| so I should check your fixes to see if they apply there too. | 10:54.57 |
sebras | Robin_Watts: there is some code reorg. going on to make the tiled commit smaller. | 10:55.19 |
| Robin_Watts: but that should be easy enough to follow. | 10:55.29 |
| tor8: so...do I push to golden? | 10:55.38 |
| tor8: I'm fearing you have last minute worries. :) | 10:55.51 |
tor8 | sebras: "image is not stripped" ... | 11:01.24 |
sebras | tor8: striped? | 11:02.00 |
tor8 | that's also an inaccurate description | 11:02.34 |
| image is missing strip data | 11:02.42 |
sebras | tor8: not necessarily, it might have stripbytecounts and stripdata but no stripoffsets tag. | 11:03.23 |
tor8 | if we can't find the data, that's close enough to missing for me | 11:04.12 |
sebras | tor8: do you prefer the string to be "image is missing strip/tile metadata"? | 11:04.14 |
tor8 | it's not metadata. | 11:04.39 |
sebras | I'm thinking of the pixel values themselves (regardless of compression) as the strip data, that's why I'm thinking that the offsets and bytecounts are metadata. | 11:05.23 |
| tor8: "image is missing strip data"? | 11:07.18 |
| tor8: and later "image is missing both strip and tile data" ? | 11:07.40 |
tor8 | sebras: yes. | 11:07.50 |
sebras | tor8: final lgtm? | 11:10.18 |
tor8 | sebras: LGTM | 11:13.32 |
sebras | tor8: btw, I _really_ need to fix those openjpeg issues now because I think those errors cause cluster runs to run even slower. :-/ | 11:16.21 |
tor8 | sebras: go for it. we won't make the release before I'm back from staff meeting. | 11:16.57 |
sebras | tor8: I'll do my worst. | 11:17.07 |
tor8 | so you've got at least one week :) | 11:17.14 |
Robin_Watts | sebras: I'm just looking at updating openjpeg now. | 11:19.39 |
sebras | Robin_Watts: http://bugs.ghostscript.com/show_bug.cgi?id=697151 | 11:22.52 |
Robin_Watts | sebras: Right. That might possibly be fixed by my plan to move away from fz_unpack_tile to a stream based unpacker. | 11:27.30 |
| It's what I do in the SOT version. | 11:27.36 |
sebras | Robin_Watts: http://bugs.ghostscript.com/show_bug.cgi?id=697152 | 11:28.38 |
| Robin_Watts: I reported a slew of new fuzzing tickets. I have few more to report though. | 12:45.38 |
| Robin_Watts: then I'll get on with the openjpeg ones...? | 12:45.47 |
Robin_Watts | sebras: The new openjpeg has introduced threads... | 12:58.51 |
| Does openjpeg ignore fz_malloc etc and go straight to malloc ? | 13:10.47 |
| It does. ASS. | 13:18.19 |
| and it doesn't pass a context around either. | 13:18.31 |
| so I need to do a hb_alloc style fix. | 13:18.40 |
tor8 | Robin_Watts: ugh. | 13:18.47 |
Robin_Watts | after lunch. | 13:18.50 |
sebras | Robin_Watts: is that only for the new version? | 14:04.17 |
Robin_Watts | sebras: No, it's been a problem historically too. | 14:11.28 |
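A minimal sketch of what an "hb_alloc style fix" could look like, assuming it means building the third-party library against replacement allocators that pick up a context stashed in a global while a lock is held. Every name here (`demo_ctx`, `demo_lib_lock`, `demo_lib_malloc`, ...) is hypothetical, and the context only counts live blocks instead of wrapping a real allocator:

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical context; a real one would carry the application's allocator. */
typedef struct { size_t live_blocks; } demo_ctx;

/* Only meaningful while the library lock is held. */
static demo_ctx *current_ctx = NULL;

/* Take the (imaginary) lock and publish the context for the allocators. */
static void demo_lib_lock(demo_ctx *ctx)
{
	/* a real implementation would acquire its lock here */
	current_ctx = ctx;
}

/* Withdraw the context and release the (imaginary) lock. */
static void demo_lib_unlock(void)
{
	current_ctx = NULL;
	/* a real implementation would release its lock here */
}

/* Replacement for malloc that the library would be compiled against. */
static void *demo_lib_malloc(size_t size)
{
	void *p = malloc(size);
	if (p != NULL && current_ctx != NULL)
		current_ctx->live_blocks++;
	return p;
}

/* Replacement for free that the library would be compiled against. */
static void demo_lib_free(void *p)
{
	if (p != NULL && current_ctx != NULL)
		current_ctx->live_blocks--;
	free(p);
}
```

The library itself never sees a context argument; the caller wraps each call into the library with `demo_lib_lock(ctx)` and `demo_lib_unlock()`, which is why the choice of lock matters.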
sebras | is there any reason to keep a fz_page around after having called fz_run_page()? | 14:32.40 |
| i.e. are there any references from the displaylist or something to some objects inside the page? | 14:32.57 |
| s/objects/data/ | 14:33.08 |
Robin_Watts | sebras: There are reasons for keeping it around (you might want to run it again). | 14:33.53 |
| It is not REQUIRED that you keep it around though. | 14:34.06 |
| Any references from (say) the displaylist to objects within the page should be sufficient to keep those objects around. | 14:34.41 |
| For instance, in the gproof device, we run the page and get a load of fz_images put into the displaylist. | 14:35.06 |
| These fz_images all have references to the underlying file on disc for the page. | 14:35.22 |
| the file is only closed when the page AND the images are closed. | 14:35.35 |
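The ownership rule described above can be sketched with a plain reference count. This is a simplified illustration with hypothetical names (`demo_file`, `demo_keep_file`, `demo_drop_file`), not MuPDF's actual types: each holder (the page, each image in the display list) keeps one reference, and the file is closed only when the last one is dropped.

```c
/* Hypothetical stand-in for the underlying file on disc. */
typedef struct {
	int refs;    /* number of holders (page, images, ...) */
	int is_open; /* 1 while the file is open */
} demo_file;

/* A new holder takes a reference. */
static demo_file *demo_keep_file(demo_file *f)
{
	f->refs++;
	return f;
}

/* A holder lets go; the last one out closes the file. */
static void demo_drop_file(demo_file *f)
{
	if (--f->refs == 0)
		f->is_open = 0;
}
```

Dropping the page's reference first leaves the file open for as long as any image still holds one.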
sebras | Robin_Watts: right. | 14:40.49 |
| Robin_Watts: I think there's some funny business going on in dodrawpage() with some code assuming it has to clean up a page in case of a throw, while other code thinks it is _its_ responsibility. | 14:47.09 |
| Robin_Watts: tor8: also allocating a pixmap just to transfer colorspace and w/h to fz_load_jpx_info() feels a bit iffy. :-/ | 14:56.41 |
tor8 | sebras: you've only got yourself to blame there. you're responsible for adding the support for loading jpx metadata info to allow for JPX files in CBZ ;) | 14:58.20 |
sebras | tor8: I know, I'm not blaming anyone else! | 14:58.49 |
| tor8: I'll fix it. | 14:59.08 |
tor8 | but yes, it does seem a bit awkward so we should probably patch it up some other way | 14:59.11 |
sebras | tor8: also if fz_convert_pixmap() and friends fail then we never call opj_image_destroy()... | 15:00.51 |
| though... I could blame the ones reviewing it a bit... | 15:02.03 |
Robin_Watts | I've just pushed a commit to thirdparty/openjpeg that updates it with the latest stuff. | 15:09.01 |
| I need to push it to be able to test it. | 15:09.13 |
| We can force push if we find problems. | 15:09.20 |
sebras | Robin_Watts: why is it not enough to do clusterpush.pl instead of git cluster? | 15:09.40 |
Robin_Watts | cos I'm on windows :) | 15:10.03 |
sebras | Robin_Watts: so that's why you implemented git cluster! | 15:10.49 |
Robin_Watts | sebras: Exactly. | 15:11.02 |
| Lack of rsync on windows makes it a pain. | 15:11.14 |
sebras | Robin_Watts: but now you have the win32 linux subsystem, so you can just download a random binary and run that, right? | 15:14.07 |
tor8 | sebras: it's not been released to the public yet, you need to be part of some beta test program | 15:16.57 |
sebras | tor8: oh. | 15:17.21 |
tor8 | https://msdn.microsoft.com/commandline/wsl/install_guide | 15:17.22 |
Robin_Watts | Is it just me, or are mupdf runs much slower in the cluster runs nowadays? | 16:55.16 |
| sebras: Did you forget to update the VS projects with filter-thunder? | 17:07.10 |
| And the first if in filter-thunder looks like an oopsie. | 17:14.55 |
| Have you added the thunder tiffs to the cluster? | 17:15.29 |
sebras | Robin_Watts: the runs are indeed much slower. you added the jp2/j2k files... | 17:20.53 |
| Robin_Watts: oopsie. | 17:21.20 |
| Robin_Watts: two oneliners on sebras/master | 17:25.55 |
| Robin_Watts: and no, I haven't added any tiffs to the cluster. | 17:26.14 |
| Robin_Watts: I don't know how to. | 17:26.18 |
Robin_Watts | both lgtm. | 17:26.38 |
| sebras: OK, so on casper, look in ~regression/cluster/tests_private/ | 17:26.59 |
sebras | Robin_Watts: test_private.git? | 17:27.26 |
Robin_Watts | sebras: OK, so on casper, look in ~regression/cluster/tests_private.git/ | 17:27.26 |
| That's a non bare repo of the test files. | 17:27.58 |
| so if you copy files in and then do a commit, they are added. | 17:28.08 |
| There is a 'tiff' directory with a random smattering of files in. | 17:28.22 |
sebras | Robin_Watts: ok. is this the repo that the cluster is running from? | 17:28.22 |
Robin_Watts | It is. | 17:28.28 |
sebras | i.e. is it the live directory? | 17:28.30 |
Robin_Watts | sebras: It is. | 17:28.36 |
sebras | Robin_Watts: ok, so don't mess about in this directory _while_ a cluster run is ongoing. noted. | 17:28.47 |
Robin_Watts | The cluster runs every file in named subdirs. | 17:29.13 |
sebras | Robin_Watts: named? | 17:29.41 |
Robin_Watts | so if you have a whole new testsuite to add, then add a directory (so the 'foo' test suite might be 'tiff/foo/') and then add files there. | 17:29.56 |
| Then I'll update the cluster to include tiff/foo. | 17:30.07 |
| If you just had a couple of random files then stick 'em in tiff. | 17:30.25 |
sebras | Robin_Watts: do we list where the files came from? | 17:30.42 |
Robin_Watts | This repo is rsynced across the cluster nodes at the start of each run (I think - or it might be done with git). | 17:31.03 |
| hence you can fairly freely add files in. | 17:31.10 |
| sebras: Only files with matching suffixes are tested, so you can add a README that says where they came from if you want. | 17:31.37 |
sebras | Robin_Watts: and once a file is added, _never_ rename it or move it again or else it looks like a regression? | 17:33.30 |
Robin_Watts | sebras: Yeah. Though we're big enough and ugly enough to cope with that if we have to. | 17:33.58 |
| http://git.ghostscript.com/?p=user/robin/mupdf.git;a=commitdiff;h=aedb713d7b3f32d1d612e55073fb743cea38a38c | 17:34.02 |
sebras | Robin_Watts: so the suffixes for tiff is .tif and for JPEG it is... .jpg? .jpeg? | 17:34.05 |
Robin_Watts | big commit. | 17:34.05 |
| and: little commit: http://git.ghostscript.com/?p=user/robin/mupdf.git;a=commitdiff;h=7a4b28eb46be1867078d5034f05d4331822ea322 | 17:34.20 |
| sebras: We have no jpegs currently. | 17:34.36 |
| I can add them if required. | 17:34.48 |
sebras | Robin_Watts: I'm working on the jpx code already. | 17:34.50 |
Robin_Watts | Ah, j2k or jp2. | 17:34.59 |
| sebras: In what sense "working" ? | 17:35.17 |
sebras | Robin_Watts: there are more issues with that. like not cleaning up the img and colorspace if we fz_throw() and an intermediate pixmap.. :-( | 17:35.18 |
Robin_Watts | Please do not make me redo the merge of everything since v2.1 again. | 17:35.37 |
sebras | Robin_Watts: I'm preparing a number of commits to clean up my own mess. | 17:35.40 |
Robin_Watts | OK. | 17:35.50 |
sebras | fetches | 17:36.18 |
| Robin_Watts: do you mind skipping the colorspace patch and just go for the openjpeg update? then I'll refine the code after that? | 17:37.21 |
| Robin_Watts: I'll rebase on top of your openjpeg changes, no worries. it was me causing this mess. | 17:37.40 |
Robin_Watts | sebras: Sure, I can skip that commit easily enough. | 17:37.41 |
| sebras: Sure. I just didn't want both of us messing about inside thirdparty/openjpeg at the same time :) | 17:38.03 |
sebras | Robin_Watts: no, I'm not doing anything there at all. | 17:38.15 |
| Robin_Watts: I'll notify you before adding anything to the cluster. | 17:39.43 |
| I'm worried I might mess with the gs regressions. | 17:40.06 |
Robin_Watts | sebras: nothing you add outside of the comparefiles/customerfiles/PS/pdf/xps directories will affect anything other than mupdf. | 17:41.15 |
sebras | Robin_Watts: re: openjpeg commit | 17:46.41 |
| is it wise to use FZ_LOCK_FREETYPE? I can see how it kind of makes sense for harfbuzz, but here..? | 17:46.41 |
Robin_Watts | yeah, I might add a new lock for it. | 17:46.41 |
| At the moment I'm just trying to get *something* through the sodding cluster. | 17:46.41 |
sebras | Robin_Watts: do you mind looking a bit at pdf_load_simple_font_by_name()? | 17:51.27 |
Robin_Watts | sebras: Sure. | 17:51.44 |
sebras | half way through we take FZ_LOCK_FREETYPE | 17:51.45 |
| then we call pdf_dict_get(). | 17:51.50 |
| what happens if it throws? | 17:52.04 |
| oh! has_lock. | 17:52.11 |
| ignore me. | 17:52.14 |
| Robin_Watts: though I feel that we should be able mess about with the pdf objects without taking the lock. | 17:53.03 |
| _then_ take the lock, when we know we are only calling functions that cannot throw. | 17:53.22 |
| that way we actually don't need to unlock in the catch. | 17:53.57 |
Robin_Watts | We could delay taking the lock until just before the "if (kind == ...)" stuff. | 17:54.02 |
sebras | exactly. | 17:54.10 |
| then remove has_lock. | 17:54.18 |
| because it is no longer needed. | 17:54.22 |
Robin_Watts | But I don't feel confident in saying that things don't throw. | 17:54.25 |
| Does pdf_lookup_agl never allocate? | 17:54.49 |
sebras | Robin_Watts: it doesn't take ctx... | 17:55.01 |
Robin_Watts | Does ft_name_index never allocate for that matter? | 17:55.10 |
| good answer :) | 17:55.13 |
sebras | Robin_Watts: the only thing taking ctx is fz_warn(). | 17:55.30 |
Robin_Watts | sebras: Personally, I still prefer it with has_lock in it. Doesn't cost us much, and it protects us from inadvertently adding something that might throw later on. | 17:56.50 |
sebras | Robin_Watts: true, but we can safely move the pdf object thingies outside of the lock. | 17:57.11 |
Robin_Watts | Certainly. | 17:57.54 |
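A sketch of the pattern agreed on above: do the object lookups before taking the lock, and keep a has_lock flag so the catch path unlocks only if the lock was actually taken. It uses setjmp/longjmp as a stand-in for the real try/catch macros, and every name (`demo_lock`, `demo_may_throw`, `demo_load_with_lock`) is hypothetical rather than taken from MuPDF:

```c
#include <setjmp.h>

/* Hypothetical stand-ins for the lock and the exception machinery. */
static int demo_lock_depth = 0;
static jmp_buf demo_handler;

static void demo_lock(void)   { demo_lock_depth++; }
static void demo_unlock(void) { demo_lock_depth--; }

/* Simulates a callee that may throw when asked to fail. */
static void demo_may_throw(int fail)
{
	if (fail)
		longjmp(demo_handler, 1);
}

/* Returns 0 on success, -1 if a callee threw. On every path the lock
 * ends up released exactly as many times as it was taken. */
static int demo_load_with_lock(int fail_before_lock, int fail_under_lock)
{
	/* volatile because it is modified between setjmp and longjmp */
	volatile int has_lock = 0;

	if (setjmp(demo_handler) == 0)
	{
		/* object lookups that may throw: done before locking */
		demo_may_throw(fail_before_lock);

		demo_lock();
		has_lock = 1;
		/* believed not to throw today, but guarded anyway */
		demo_may_throw(fail_under_lock);
		demo_unlock();
		has_lock = 0;
		return 0;
	}

	/* catch path: unlock only if we got as far as locking */
	if (has_lock)
		demo_unlock();
	return -1;
}
```

Keeping the flag costs one variable and covers the case where a call under the lock starts throwing later on.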
| ok, cluster finished. | 17:59.45 |
| No reported SEGVs any more. | 17:59.53 |
sebras | Robin_Watts: I'm looking at this cause I was wondering if harfbuzz and openjpeg might interfere with freetype. | 17:59.54 |
| Robin_Watts: cool! | 17:59.58 |
Robin_Watts | harfbuzz calls happen as part of the same pipeline as the freetype ones. | 18:00.21 |
| so while we *could* have a new lock just for harfbuzz it seemed unnecessary. | 18:00.58 |
sebras | Robin_Watts: running all tests only took 268 seconds now! nice one. | 18:01.06 |
Robin_Watts | openjpeg on the other hand probably justifies its own lock. | 18:01.17 |
| We do however have a bunch of differences, and 78 files now report errors that didn't before. | 18:02.06 |
| Possibly because the code recognises they are broken where they were silently accepted before. | 18:02.30 |
sebras | Robin_Watts: mm, the luratech drop has similar issues. while the previous version would make a half-hearted attempt at decoding the file and doing out of range accesses they now just fail. | 18:04.01 |
| Robin_Watts: I'm not sure what behaviour we want. | 18:04.11 |
| Robin_Watts: both?! :) | 18:04.15 |
| Robin_Watts: i.e. decode everything you can and never step out of bounds. | 18:04.37 |
Robin_Watts | sebras: Annoyingly it's now failing on quality logic test files though. | 18:06.22 |
| Let me check I didn't screw the merge. | 18:06.33 |
ray_pc | Robin_Watts: you may want to try those on gs with luratech. I don't recall QL files failing there | 18:07.05 |
| Robin_Watts: or if you have a list of files that fail, I can try them on gs+luratech | 18:07.51 |
Robin_Watts | ray_pc: Thanks, I may take you up on that, but let me look to be sure it's not just a problem with the new openjpeg first | 18:08.32 |
ray_laptop | Robin_Watts: I misunderstood -- I thought your problem was with luratech (the one you and sebras were discussing) | 18:09.36 |
Robin_Watts | ray_laptop: No. Today I have been pulling the new openjpeg fixes in. | 18:10.09 |
sebras | ray_laptop: I was just recognizing that some of the problematic files fail for luratech as well. | 18:10.35 |
Robin_Watts | It seems reasonable to me that they should read jp2->enumcs in BEFORE testing it. | 18:10.39 |
sebras | ray_laptop: sorry for the confusion. :) | 18:10.42 |
Robin_Watts | But maybe I'm just picky like that. | 18:10.53 |
ray_laptop | sebras: no, I apologize for not following the discussion properly | 18:11.16 |
Robin_Watts | OK, so this looks like a bad merge to me, but with them having some dodgy code anyway. | 18:13.02 |
sebras | Robin_Watts: can you talk me through (int)(intptr_t)ptr ? | 18:17.28 |
| Robin_Watts: I can see why casting to intptr_t makes a bit of sense, but why do we need to cast it further? | 18:17.46 |
Robin_Watts | Urm... running all tests takes 2 hrs, 9 mins, 15 seconds :) | 18:17.48 |
| cos doing & on 64bit things can send compilers wibbly. | 18:18.19 |
| oh, I see, elapsed time, yes, sorry. | 18:19.20 |
| Oh, boy, as opposed to 631. Having some errors in there clearly slows things down massively. | 18:20.00 |
sebras | Robin_Watts: so memento needs an extra cast here? if ((intptr_t)p & 1) | 18:20.34 |
Robin_Watts | sebras: Bah. That's unsporting. | 18:20.54 |
sebras | looks up unsporting. | 18:21.30 |
Robin_Watts | sebras: OK, so the line in question is: | 18:21.54 |
| off = 16-(((int)(intptr_t)ptr) & 15); | 18:22.01 |
sebras | Robin_Watts: twice in memento, once in template_solid_color_N_256(). | 18:22.18 |
Robin_Watts | Without the (int) cast, the whole right hand side would be an intptr_t | 18:22.34 |
sebras | Robin_Watts: indeed. I get what you are doing, but I didn't get the extra cast. | 18:22.45 |
Robin_Watts | which would then be assigned into an int, and I'd get a warning. | 18:22.48 |
| (probably, possibly) | 18:23.01 |
sebras | Robin_Watts: so it is not doing & on 64-bit entities that is the problem, but rather the assignment. | 18:23.25 |
ray_laptop | what the heck is up with the cluster. It seems *really* confused | 18:23.36 |
Robin_Watts | so I could have done: (int)(16-(((intptr_t)ptr)&15)) | 18:23.38 |
| sebras: I'm going to disable the j2k tests until the cluster has caught up. | 18:24.01 |
sebras | Robin_Watts: fine with me. | 18:24.09 |
Robin_Watts | but that would have meant that compilers would have done all the 64bit calculations, and then thrown stuff away. | 18:25.05 |
ray_laptop | it seems to be taking nodes down (went down, re-running) over and over | 18:25.11 |
Robin_Watts | so better (IMHO) to cast early. | 18:25.13 |
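The two spellings of the cast can be put side by side. This is a small sketch with hypothetical helper names; it assumes, as common compilers do, that narrowing an intptr_t to int simply truncates, so the low four bits survive:

```c
#include <stdint.h>

/* Cast early: the pointer is narrowed to int first, so the mask and the
 * subtraction are done in int and the result is already an int.
 * Returns 1..16 (16 when ptr is already 16-byte aligned). */
static int demo_off_cast_early(const void *ptr)
{
	return 16 - (((int)(intptr_t)ptr) & 15);
}

/* Cast late: the mask and the subtraction are done in intptr_t and the
 * result is narrowed once at the end. Same value, wider intermediates. */
static int demo_off_cast_late(const void *ptr)
{
	return (int)(16 - (((intptr_t)ptr) & 15));
}
```

Both forms avoid assigning an intptr_t expression to an int without a cast; they differ only in how wide the intermediate arithmetic is.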
| ray_laptop: That may be because nodes run slower when they hit failures. | 18:25.26 |
| and I added a load of j2k tests the other day which currently fail. | 18:25.59 |
| I've just disabled them so hopefully it should start to behave in a couple of jobs time. | 18:26.16 |
ray_laptop | so is it going to keep going running a problem job until all nodes have been taken down trying it ? | 18:26.25 |
Robin_Watts | ray_laptop: No, it should be fine in a mo. | 18:26.42 |
ray_laptop | Robin_Watts: the list of jobs may have been "collected" and sitting around -- I don't think it regenerates the list of jobs when a node goes down | 18:27.22 |
Robin_Watts | ray_laptop: it does not. | 18:27.46 |
| but I don't think it's "if a node is asked to run this job it fails and disconnects" | 18:28.06 |
| I think it's "with all the failures going on, nodes sometimes disconnect" | 18:28.23 |
ray_laptop | Robin_Watts: so if there is a job that none of the nodes can complete before being classified as "went down", we keep killing nodes and re-running on the remaining nodes | 18:28.56 |
sebras | Robin_Watts: "Update OpenJPEG to the latest (git) version." lgtm | 18:29.14 |
Robin_Watts | sebras: Thanks. | 18:29.55 |
| ray_laptop: But as I said, I don't believe it's that. | 18:30.14 |
ray_laptop | Robin_Watts: well, it just started another cycle | 18:30.51 |
Robin_Watts | I see it at 87% and counting. | 18:31.14 |
ray_laptop | Robin_Watts: so this time 'picas' went down | 18:32.41 |
Robin_Watts | It's always possible that there are genuine network problems :) | 18:33.26 |
ray_laptop | down to about 7 nodes for "re-running" | 18:33.27 |
| Robin_Watts: well, we are all still getting regression dashboard updates | 18:34.13 |
Robin_Watts | issue495.jp2 seems to be the slow one. | 18:35.39 |
| interestingly that appears to be running on 2 nodes :) | 18:35.50 |
ray_laptop | Robin_Watts: right, hubbles and w7 | 18:36.07 |
sebras | Robin_Watts: when I ran that particular file using the old openjpeg valgrind reported Argument 'size' of function malloc has a fishy (possibly negative) value: -2621400 | 18:36.28 |
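That valgrind message typically shows up when a negative signed value is converted to size_t on its way into malloc. Where exactly it arises in openjpeg is not stated here, so the following is only a generic guard with hypothetical names (`demo_alloc_samples` and its parameters), rejecting non-positive dimensions and products that would overflow before allocating:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical guard: allocate w * h * n bytes only if every factor is
 * positive and the product fits in size_t. Returns NULL otherwise. */
static void *demo_alloc_samples(int w, int h, int n)
{
	size_t total;

	if (w <= 0 || h <= 0 || n <= 0)
		return NULL;
	if ((size_t)w > SIZE_MAX / (size_t)h)
		return NULL;
	total = (size_t)w * (size_t)h;
	if (total > SIZE_MAX / (size_t)n)
		return NULL;
	return malloc(total * (size_t)n);
}
```

With a check like this, a corrupt header yields a clean error return instead of a huge or negative allocation request.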
Robin_Watts | If it starts again, I'll take a shotgun to clustermaster. | 18:36.33 |
sebras | Robin_Watts: maybe the OOM-killer on the machines kills some processes? | 18:36.48 |
Robin_Watts | sebras: Oh, there are various watchdogs, yes. | 18:37.05 |
ray_laptop | Robin_Watts: so hubbles went to "uploading log files" | 18:37.26 |
| (as did w7) | 18:37.35 |
Robin_Watts | ray_laptop: Yeah, I think it's through. | 18:37.40 |
ray_laptop | Robin_Watts: we'll know if it starts your user job | 18:37.57 |
Robin_Watts | With the new code, it bales immediately on that file. | 18:38.49 |
| Error reading SPCod SPCoc element, apparently. | 18:39.08 |
sebras | Robin_Watts: luratech says invalid bps and quits. seems correct. | 18:39.59 |
Robin_Watts | Gah. Shotgun time. | 18:42.18 |
ray_laptop | Robin_Watts: lock and load | 18:42.47 |
sebras | I wasn't aware that my patch series would hold up the cluster. | 18:43.38 |
Robin_Watts | sebras: It's not your patch series, it was our addition of the j2k stuff. | 18:44.12 |
| and I didn't realise it would do this either. | 18:44.24 |
| I guess I need to learn stuff like this, now Marcos is off. | 18:44.40 |
| Never mind, it'll all work its way through. | 18:44.54 |
| but the addition of those files did prompt us to bring openjpeg up to date. | 18:45.20 |
| ok, off and running again. | 18:48.26 |
ray_laptop | I have to run an errand. My cable ISP at home called and tried to sell me stuff, and asked if I was happy with the speed. I said it could be better, and after I told them I see 20M down, 2M up, they said I am supposed to be getting 100 down, 10 up -- apparently I was never notified that I needed to swap out my modem | 18:49.17 |
| going to do that now (doesn't even cost anything more) :-) | 18:49.46 |
| Robin_Watts: I had peeved disabled while I was doing performance runs. I just re-enabled it, but it just said "removing from node pool" | 18:51.16 |
Robin_Watts | ray_laptop: Wait til the next job starts. | 18:51.52 |
ray_laptop | Robin_Watts: looks funky - it is supposed to start with user jobs (over normal commit jobs) | 18:52.09 |
Robin_Watts | it has? | 18:52.26 |
ray_laptop | oh, nm. I just did | 18:52.28 |
| and it looks like peeved is playing nicely with others now | 18:53.12 |
Robin_Watts | ok, cluster is cleared. | 19:59.42 |
| I've reenabled the j2k stuff, and am testing the latest version now. | 19:59.58 |
| If that works, I'll push it based on sebras lgtm earlier. | 20:00.11 |
| Some of these j2k files have different width components. | 20:20.25 |
sebras | Robin_Watts: yes. | 20:20.38 |
| Robin_Watts: so hstep/vstep comes into play. | 20:20.48 |
Robin_Watts | Urm... no, we just bale on them. | 20:21.14 |
sebras | Robin_Watts: oh, yes, for openjpeg. | 20:21.29 |
| Robin_Watts: there is a bug on that. | 20:21.36 |
| Robin_Watts: but I handle it in the luratech jpx dec. | 20:21.49 |
| Robin_Watts: well. jpx_write() to be precise. | 20:22.00 |
| Robin_Watts: if you see a palettized j2k with signed component values in the palette. let me know. | 20:23.51 |