| | 20180423 |
Robin_Watts | tor8: http://git.ghostscript.com/?p=user/robin/mupdf.git;a=commitdiff;h=67a7449fc1f186f318942b9c6b8d66d4458b7d87 | 09:17.27 |
tor8 | Robin_Watts: LGTM. | 09:18.01 |
Robin_Watts | Ta. | 09:18.07 |
tor8 | Robin_Watts: there's a long list of commits on tor/master that need reviewing/discussing | 09:18.16 |
Robin_Watts | tor8: OK, just a mo... | 09:18.38 |
cosimone | Hi everyone, is this channel appropriate to ask minor questions and doubts about the android application? (i'm referring to the one you can find on fdroid) | 09:45.03 |
moolc | cosimone: i'm not a mupdf developer, but those guys are here and will most likely not spank you for asking droid questions... sebras/tor are the ones mainly responsible for droid stuff | 09:58.48 |
| Robin_Watts: i already asked that in the past - no resolution.. Is it somehow possible to clone mupdf and submodules (important part) shallowly? | 10:07.24 |
Robin_Watts | tor8: Sorry, back. | 10:14.40 |
| moolc: I don't follow. | 10:15.25 |
| If you "git clone" then you get just the main repo, no submodules. | 10:15.41 |
| If you git clone --recursive then you get the thirdparty submodules too. | 10:15.58 |
| None of the submodules have subsubmodules as far as I know. | 10:16.14 |
tor8 | moolc: git submodules and shallow clones don't coexist well last I checked (about a year ago) | 10:16.52 |
| moolc: you can sort of work around it by cloning mupdf (deep or shallow) without --recursive | 10:17.37 |
| and then initialize the submodules manually, shallow or deep as you wish, but caveat emptor, etc, etc. | 10:17.54 |
Robin_Watts | ok, shallow clones is clearly something I don't understand. | 10:18.31 |
| tor8: So, your commits... | 10:18.41 |
| The PDF_NAME one, I am warming to. | 10:19.38 |
moolc | tor8: oh :( well let's keep our fingers crossed that not many people will use http://repo.or.cz/llpp.git/blob/b828b5a0553f3810c61628fcddb007066cdde389:/misc/bootstrap.sh | 10:20.00 |
Robin_Watts | In the current code, if I do PDF_NAME_Bogus, it'll tell me there is no such name. | 10:20.11 |
tor8 | Robin_Watts: it will now too (but maybe slightly less clearly) | 10:21.03 |
| error: use of undeclared identifier | 10:21.08 |
| 'PDF_ENUM_NAME_fooFont' | 10:21.08 |
Robin_Watts | In the new code if I do PDF_NAME(Bogus), it'll tell me there is no PDF_ENUM_NAME_Bogus. | 10:21.17 |
| I can live with that. | 10:21.21 |
| One minor idea... | 10:21.31 |
| The vast majority of the PDF_MAKE_NAME(A,B) things have A == "B" | 10:21.50 |
| is it worth having #define PDF_MAKE_NAME(A) and PDF_MAKE_AWKWARD_NAME(A,B) ? | 10:22.22 |
| The second commit, the pdf_name_eq one... | 10:23.46 |
tor8 | Robin_Watts: I did that at first, but it makes it harder to keep the list alphabetically sorted | 10:24.27 |
| now I just pipe it through 'sort' and all is safe and sound | 10:24.44 |
Robin_Watts | Good answer. | 10:24.52 |
tor8 | with the AWKWARD macro, that won't work... | 10:24.56 |
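The two-argument list macro discussed above can be sketched as an X-macro. This is a minimal illustration, not the actual MuPDF header: every `DEMO_` identifier and the three-entry list are hypothetical, and only the `PDF_MAKE_NAME(A,B)` shape and the `PDF_ENUM_NAME_` prefix come from the conversation.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, shortened name list. Every entry has the same two-argument
 * shape, so the entry lines can be kept in bytewise order with a plain sort. */
#define DEMO_NAME_LIST \
	DEMO_MAKE_NAME("Font", Font) \
	DEMO_MAKE_NAME("Length", Length) \
	DEMO_MAKE_NAME("Type", Type)

/* First expansion: one enum constant per entry. */
#define DEMO_MAKE_NAME(STRING, NAME) DEMO_ENUM_NAME_##NAME,
enum { DEMO_ENUM_DUMMY, DEMO_NAME_LIST DEMO_ENUM_LIMIT };
#undef DEMO_MAKE_NAME

/* Second expansion: the matching string table, indexed by the enum. */
#define DEMO_MAKE_NAME(STRING, NAME) STRING,
static const char *demo_name_strings[] = { "", DEMO_NAME_LIST };
#undef DEMO_MAKE_NAME

/* DEMO_NAME(Bogus) would fail to compile with an
 * "undeclared identifier DEMO_ENUM_NAME_Bogus" style error. */
#define DEMO_NAME(NAME) DEMO_ENUM_NAME_##NAME
```

With a single-argument variant plus a separate "awkward" macro, the entries would no longer share one shape, which is the sorting problem tor8 describes.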
Robin_Watts | Currently pdf_name_eq(A,B) checks for A and B being names, and A == B, and if that fails, it resolves A and resolves B, and then compares the two. | 10:25.18 |
tor8 | I figured I'd take the hit to awkwardness and not run into the same problem that we had once when sebras sorted the list, but had his LOCALE set to not-C and got some non-bytewise sort order... | 10:25.40 |
Robin_Watts | So stuff like pdf_name_eq(PDF_NAME(Foo), pdf_new_name(ctx, "Foo")) works now, but won't work in future. | 10:26.10 |
tor8 | Robin_Watts: right. you mean if either of A or B is an indirect object pointing to a numbered pdf object that is a name | 10:26.27 |
Robin_Watts | and pdf_name_eq(PDF_NAME(Foo), pdf_new_reference(ctx, PDF_NAME(Foo))); | 10:26.33 |
tor8 | pdf_new_name("Foo") will always return the constant enum thing for PDF_NAME_Foo | 10:26.57 |
Robin_Watts | tor8: It will ? | 10:27.10 |
tor8 | yes. we bsearch the PDF_NAME_LIST looking for a hit, before we alloc a new pdf_obj | 10:27.27 |
Robin_Watts | Ah, ok. | 10:27.38 |
| So my objection is just about the references. | 10:27.48 |
tor8 | and I'm not sure I care for the (oh god, my brain hurts, please don't do that) case of indirect references | 10:28.13 |
| but I guess I should run that through the cluster just to make sure we don't actually have that problem | 10:28.32 |
Robin_Watts | If we're concerned about the overhead, then we should use a static inline for doing pdf_name_eq (for the simple case, falling back to a non-inline for the ref case) | 10:28.56 |
| also, pdf_name_eq(PDF_TRUE, PDF_TRUE) should presumably not pass ? | 10:29.20 |
| Hmm. The existing code will pass that. | 10:29.31 |
tor8 | Robin_Watts: the next commit after the PDF_NAME() one removes the pdf_name_eq function completely | 10:30.24 |
| I thought you were talking about that one | 10:30.29 |
Robin_Watts | I am. | 10:30.33 |
tor8 | okay. just so we're on the same page. | 10:30.45 |
Robin_Watts | I'm saying that I don't like the direct comparison. | 10:30.54 |
| pdf_name_eq(A,B) is at pains to only say true, if A and B are both names. | 10:31.17 |
| not generic objects. | 10:31.22 |
tor8 | after that commit, pdf_name_eq is called in one place only | 10:32.58 |
| all other comparisons were with a constant | 10:33.06 |
Robin_Watts | tor8: ok... and that place is? | 10:33.26 |
tor8 | pdf_add_portfolio_schema | 10:33.47 |
Robin_Watts | Not dict_get ? | 10:34.03 |
tor8 | so is your objection that we no longer return true for pdf_name_eq when comparing /Foo with 10 0 R where it points to 10 0 obj /Foo endobj | 10:34.13 |
Robin_Watts | That is one objection. | 10:34.28 |
tor8 | pdf_dict_get does not use pdf_name_eq | 10:35.54 |
| I have to admit I didn't think about the /Foo == 10 0 R case, because I don't think I've ever seen that sort of structure ever used | 10:36.43 |
Robin_Watts | I'm struggling to build this version :( | 10:37.54 |
| Ok, so the name change commit screws the windows build royally. | 10:42.46 |
| does generate.bat delete the .h file maybe? | 10:43.26 |
| No, the solution file does. | 10:43.59 |
| pdf-font.c calls pdf_name_eq twice. | 10:45.11 |
| but probably doesn't need to. | 10:45.50 |
| OK, so broadly I'm happy with that, *if* we are happy that we don't need to redirect through references. Which seems like a big if. | 10:46.31 |
| I worry that we'll strip this out now, only to hit a problem file in the future, and have to shove it all back in again. | 10:46.54 |
| The next commit... the reordering of NULL/TRUE/FALSE. | 10:47.56 |
| I seem to remember pondering this at one time. | 10:48.04 |
| The attraction of having PDF_NULL being NULL etc. | 10:48.18 |
| and also the idea of having PDF_TRUE == 1 | 10:48.26 |
tor8 | I looked at the pdf reference, and there it says that one should never make the distinction between 'null' and a missing value | 10:48.28 |
Robin_Watts | but we can't have PDF_FALSE == 0 too. | 10:48.52 |
tor8 | so having the distinction between PDF_NULL and NULL doesn't seem useful | 10:48.53 |
| Robin_Watts: unfortunately no | 10:49.00 |
| but having them be pointer values 0, 1, and 2 is easier when debugging, at least | 10:49.16 |
Robin_Watts | OK. So the change looks wrong to me. | 10:49.37 |
| but then I'm confused by it looking wrong anyway. | 10:50.12 |
| In pdf-object.h | 10:50.21 |
| we have PDF_NAME_LIST | 10:50.28 |
| The old code used to allow slots at the start for NULL, TRUE and FALSE. | 10:50.44 |
| and then have all the others. | 10:50.52 |
| But the old code used to actually list the objects in the other order. | 10:51.16 |
tor8 | the old code had a slot at the start for a DUMMY value that was never used (to reserve the NULL pointer) | 10:51.24 |
Robin_Watts | DUMMY, then names, then NULL, TRUE, FALSE | 10:51.30 |
tor8 | and then the PDF_NULL, TRUE, FALSE after the names | 10:51.35 |
| this code puts three dummy slots at the start for NULL, TRUE, FALSE then the names starting at index 3 | 10:51.52 |
Robin_Watts | D'Oh. I can't read diffs. | 10:52.11 |
tor8 | and in pdf_new_name I start the bsearch with left=3 (skip the dummy slots) | 10:52.51 |
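The layout tor8 describes (three reserved slots, then sorted names, with the binary search starting at index 3) might look roughly like this. It is a sketch under assumptions: the table contents and all `demo_` names are made up, and the real `pdf_new_name` allocates a new object on a miss rather than returning -1.

```c
#include <assert.h>
#include <string.h>

/* Slots 0..2 are reserved for null, true and false; names start at 3. */
enum { DEMO_NULL, DEMO_TRUE, DEMO_FALSE, DEMO_FIRST_NAME };

static const char *demo_table[] = {
	"", "", "",               /* dummy slots, never matched by the search */
	"Font", "Length", "Type", /* names in bytewise sorted order */
};
#define DEMO_LIMIT ((int)(sizeof demo_table / sizeof demo_table[0]))

/* Return the index of a built-in name, or -1 when the caller would have
 * to allocate a fresh name object instead. */
static int demo_lookup_name(const char *str)
{
	int left = DEMO_FIRST_NAME; /* skip the dummy slots */
	int right = DEMO_LIMIT - 1;
	while (left <= right)
	{
		int mid = left + (right - left) / 2;
		int c = strcmp(str, demo_table[mid]);
		if (c < 0)
			right = mid - 1;
		else if (c > 0)
			left = mid + 1;
		else
			return mid;
	}
	return -1;
}
```

Starting at `DEMO_FIRST_NAME` means an empty string can never match one of the reserved slots.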
Robin_Watts | pdf_name_eq is wrong. | 10:53.12 |
| cos pdf_name_eq(PDF_NULL, PDF_NULL) will return 1 | 10:53.23 |
| (looks like it was wrong before too :( ) | 10:53.52 |
| pdf_name_eq should only ever return true, if both elements are names. | 10:54.31 |
tor8 | that's easy enough to fix | 10:54.42 |
Robin_Watts | Indeed. With that fix, I'm happy. | 10:54.50 |
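The behaviour Robin_Watts asks for (true only when both arguments are names, after following any indirect references) can be sketched with a toy object model. The `demo_obj` type and the helpers below are hypothetical stand-ins, not the MuPDF `pdf_obj` API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { DEMO_KIND_NULL, DEMO_KIND_BOOL, DEMO_KIND_NAME, DEMO_KIND_REF } demo_kind;

typedef struct demo_obj
{
	demo_kind kind;
	const char *name;              /* set for DEMO_KIND_NAME only */
	const struct demo_obj *target; /* set for DEMO_KIND_REF only */
} demo_obj;

/* Follow indirect references, with a depth cap to survive reference loops. */
static const demo_obj *demo_resolve(const demo_obj *obj)
{
	int depth = 0;
	while (obj && obj->kind == DEMO_KIND_REF && depth++ < 10)
		obj = obj->target;
	return obj;
}

/* Return 1 only if both arguments resolve to names with equal spelling. */
static int demo_name_eq(const demo_obj *a, const demo_obj *b)
{
	a = demo_resolve(a);
	b = demo_resolve(b);
	if (!a || !b)
		return 0;
	if (a->kind != DEMO_KIND_NAME || b->kind != DEMO_KIND_NAME)
		return 0;
	return a == b || strcmp(a->name, b->name) == 0;
}

/* Sample objects: /Foo, a reference to /Foo, and the boolean true. */
static const demo_obj demo_foo = { DEMO_KIND_NAME, "Foo", NULL };
static const demo_obj demo_ref_foo = { DEMO_KIND_REF, NULL, &demo_foo };
static const demo_obj demo_true = { DEMO_KIND_BOOL, NULL, NULL };
```

Here comparing `demo_true` with itself is false even though the pointers match, which is the case raised for `pdf_name_eq(PDF_NULL, PDF_NULL)`.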
| So, I'm happy with the first (if we fix the VS build). | 10:55.19 |
| I'm unhappy with the second (because dereferencing seems important to me) | 10:55.52 |
| I'm happy with the third (if we fix it to only equate names) | 10:56.12 |
tor8 | yeah. I'm starting to have second thoughts about the second (losing auto-dereferencing there may not be worth it) | 10:59.15 |
Robin_Watts | The next one... pdf_new_obj_from_str dates from c69a9ace94 | 11:00.53 |
| Which states it was added for zeniko. | 11:01.00 |
tor8 | They haven't pulled from us in 3 years | 11:01.55 |
Robin_Watts | tor8: Fair enough. | 11:02.27 |
tor8 | https://github.com/sumatrapdfreader/sumatrapdf/tree/master/mupdf/source/pdf | 11:02.28 |
Robin_Watts | (tor8: Is it worth us forking with an up to date version?) | 11:02.53 |
tor8 | so I figure they're happy enough with their fork (and don't care enough about new features to suffer through our API instability, given how many local patches they've added) | 11:03.11 |
Robin_Watts | (probably work we don't need. Depends on how much of a "shop window" sumatra is for us) | 11:03.14 |
| Next one, the removal of 'doc'. | 11:04.07 |
| I don't object. The argument against the change is regularity, but I'm not offended by it. | 11:04.34 |
| Arguably it's clearer only to put 'doc' into things that actually remember the doc. | 11:04.47 |
| Next one looks good. | 11:05.19 |
tor8 | with the dict_put_int, etc functions, we also don't need to call the pdf_new_int, etc functions nearly as often | 11:05.43 |
Robin_Watts | tor8: yeah. | 11:08.47 |
tor8 | Robin_Watts: one random idea occurred to me (feel free to hate it): PDF_NAME_EQ(ctx, var, Foo) that resolves to pdf_name_eq(ctx, var, PDF_NAME(Foo)) | 11:08.49 |
| not sure if it's worth it, but it would be shorter | 11:09.06 |
| nah, I already hate it myself. | 11:09.10 |
| just needed to type it out | 11:09.13 |
Robin_Watts | The worry I have with that is that PDF_NAME_EQ(ctx, Foo, var) won't work. | 11:09.21 |
| ok :) | 11:09.39 |
tor8 | exactly, and it hides the third argument processing, which is icky | 11:09.45 |
Robin_Watts | The cmap stuff looks clever. | 11:09.53 |
cosimone | ok thanks. so, first of all, what does the button near the search button do? it switches from grey to blue and vice versa when pressed, but nothing seems to happen | 11:11.48 |
Robin_Watts | cosimone: The chain icon? | 11:12.03 |
cosimone | yes, that one | 11:12.13 |
Robin_Watts | It makes "links" active. | 11:12.16 |
| i.e. if you click on a hyperlink it follows it. | 11:12.28 |
cosimone | oh, i see. i tried it on a document without links, that's why i didn't notice it, thanks | 11:12.51 |
Robin_Watts | no worries. | 11:12.57 |
| tor8: So how does the merging stuff work? | 11:13.17 |
cosimone | another small thing, is there any way to select text? long press doesn't seem to work | 11:13.25 |
Robin_Watts | cosimone: I can't remember a way at the moment. | 11:14.32 |
| Possibly you can go into reflow mode, and then select from there? | 11:14.41 |
cosimone | i'll try later | 11:14.52 |
| no problems if you can't recall now, it's not urgent. thanks anyway, and keep up the good work! | 11:15.24 |
Robin_Watts | tor8: So the plan is to check in cmaps produced using this. | 11:15.45 |
| cos if windows users are relying on mutool cmapdump to be able to dump cmaps, we need some to get started with. | 11:16.56 |
tor8 | Robin_Watts: yeah. the plan is to check in all but the humongous font dumps | 11:19.12 |
| Robin_Watts: which 'merging' stuff? | 11:19.37 |
Robin_Watts | the "share" stuff. | 11:19.53 |
tor8 | first it creates a flattened representation of the involved CMaps | 11:20.18 |
| which just has all the ranges expanded, so it only maps single characters | 11:20.47 |
Robin_Watts | /UniCNS-X usecmap. Gottit. | 11:20.49 |
| Nice. | 11:20.59 |
tor8 | then I extract the common subset into a -X cmap which both inherit using usecmap | 11:21.05 |
Robin_Watts | yeah, that's the bit I was struggling to see. | 11:21.24 |
| Nice trick. | 11:21.27 |
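The sharing step can be pictured as a three-way split of two flattened, sorted mapping lists, much like `comm`: entries present in both go to the shared parent, and the rest stay in each child that inherits it via usecmap. The scripts themselves are not shown in the log, so this is only a guess at the idea; the entry type, function names and sample data are all invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One flattened CMap entry: a single character code and its mapped value. */
typedef struct { unsigned int code, value; } demo_entry;

static int demo_cmp(const demo_entry *a, const demo_entry *b)
{
	if (a->code != b->code)
		return a->code < b->code ? -1 : 1;
	if (a->value != b->value)
		return a->value < b->value ? -1 : 1;
	return 0;
}

/* Split sorted lists a and b into common, only_a and only_b.
 * Each output array must have room for na (or nb) entries.
 * Returns the number of common entries. */
static size_t demo_split(const demo_entry *a, size_t na,
	const demo_entry *b, size_t nb,
	demo_entry *common,
	demo_entry *only_a, size_t *n_only_a,
	demo_entry *only_b, size_t *n_only_b)
{
	size_t i = 0, j = 0, nc = 0;
	*n_only_a = *n_only_b = 0;
	while (i < na && j < nb)
	{
		int c = demo_cmp(&a[i], &b[j]);
		if (c == 0) { common[nc++] = a[i]; i++; j++; }
		else if (c < 0) only_a[(*n_only_a)++] = a[i++];
		else only_b[(*n_only_b)++] = b[j++];
	}
	while (i < na) only_a[(*n_only_a)++] = a[i++];
	while (j < nb) only_b[(*n_only_b)++] = b[j++];
	return nc;
}

/* Made-up sample data: two entries agree, one code maps differently. */
static const demo_entry demo_a[] = { {0x20, 1}, {0x21, 2}, {0x4e00, 100}, {0x4e01, 101} };
static const demo_entry demo_b[] = { {0x20, 1}, {0x21, 2}, {0x4e00, 200}, {0x4e02, 102} };
```

An entry only counts as common when both code and value match, so a code that maps differently in the two CMaps stays in both children.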
| Are there other potential savings still lurking in here? | 11:21.41 |
tor8 | I did it before, but forgot to check in and then lost my scripts | 11:21.46 |
| so I recreated them properly again, and reran it on the latest CMaps | 11:22.06 |
| possibly, the GB* cmaps may have some common bits that could be extracted likewise | 11:23.19 |
| yeah. that could shave another 80kb | 11:26.57 |
Robin_Watts | Is it worth an exhaustive run to compare every cmap with every other cmap? | 11:27.22 |
tor8 | possibly, but the savings would get smaller and smaller | 11:29.03 |
| sharing the GB* cmaps would save ~160k | 11:29.29 |
Robin_Watts | tor8: yeah, I just wondered if there was maybe some smaller lumps that were common to lots of files (like symbols etc) | 11:29.57 |
tor8 | the next on the list would be KSC and RKSJ and those might save 10k-20k each | 11:29.58 |
| Robin_Watts: if we run flattencmap.py on all of them, is there a tool that can do a similarity score? | 11:30.36 |
| well, diffstat I guess | 11:30.40 |
Robin_Watts | pass, "comm" is a new one on me as it is :) | 11:30.54 |
| comm | wc -l ? :) | 11:31.16 |
| (Does gs include these cmaps too? Possibly we should use the reduced ones there as well.) | 11:32.26 |
| Next few look fine. Looking at the "Try other CJK languages to find missing characters" one now. | 11:37.48 |
| Is there a way to know what chars a font has without loading the whole thing? | 11:38.09 |
| Like, can we check in a CMAP first? | 11:38.17 |
| I bet fonts don't correspond 1:1 with cmaps. | 11:38.33 |
tor8 | Robin_Watts: for the " Try other CJK languages to find missing characters." commit? | 11:39.05 |
Robin_Watts | tor8: Yeah. | 11:39.17 |
tor8 | sadly no, we need to load the TTF/OTF/TTC to look at the SFNT 'cmap' table | 11:39.21 |
Robin_Watts | I feared that might be the case. Looks good anyway. | 11:39.34 |
| This is presumably for epub ? | 11:39.46 |
tor8 | if it weren't for the Han unification, we wouldn't be in this mess :) | 11:39.58 |
Robin_Watts | mmm. | 11:40.03 |
tor8 | yes. this is for epub (and eventually PDF form filling and appearance synthesis) | 11:40.18 |
| the 'japan' font doesn't have non-japanese characters, but they could still be used in a japanese language context | 11:40.47 |
| a fairly rare occurrence, but if it happens, we should look through the other fonts we have | 11:41.05 |
| and of course, there's always the case where we have a unicode character but not a specified language | 11:41.27 |
| Robin_Watts: okay, so there are more savings to be had by that sharing trick... | 11:45.24 |
| but I think I'll need to write a new 'comm' tool that can take 3 input files... | 11:46.38 |
| or more | 11:46.47 |
Robin_Watts | other commits all look good. | 11:50.53 |
tor8 | okay, I'll revert the "Use direct comparison to compare pdf_obj with constant name objects." commit, fix pdf_name_eq | 11:51.53 |
| and I might need your help to make "Remove need for namedump by using macros and preprocessor." work with MSVC | 11:52.09 |
Robin_Watts | Sure. | 11:52.21 |
tor8 | and I've got some ideas, I might remake the 'cmapshare.py' script | 11:52.37 |
| or rather the calling script | 11:52.47 |
| to merge more cmap subsets | 11:52.52 |
Robin_Watts | ok. just yell when you need me. | 11:53.02 |
tor8 | will do | 11:55.42 |
| Robin_Watts: success! | 13:12.06 |
| I have squeezed the CMap data down from 1.5M to 840k by sharing more subsets | 13:12.18 |
| oh, wait... I might be a bit over-optimistic | 13:12.46 |
| Robin_Watts: I have managed to save 50kb by sharing the common bits between GBK-EUC-H, GBKp-EUC-H, and GBK2K-H | 14:06.20 |
Robin_Watts | tor8: So the cmaps were what size? | 14:06.38 |
tor8 | I think I shall leave it at that. the remaining CMap source files are already pretty optimal | 14:06.41 |
| those three were 82kb, 82kb, and 89kb in size | 14:07.46 |
Robin_Watts | tor8: I was just wondering what the "full" set was before you tried squeezing, and what it is now. | 14:08.12 |
| 50K is always nice to have. | 14:08.36 |
| Can we push the squashed sets into gs too ? | 14:08.52 |
tor8 | no idea; you'd have to consult chrisl or kens | 14:09.10 |
kens | Eh what ? | 14:09.22 |
tor8 | it might be that PS expects more from the CMap resources that these squeezed ones don't do | 14:09.27 |
| kens: I have massaged some of the CMap resources in mupdf to 'usecmap' common subsets | 14:09.54 |
| kens: a significant amount of savings for Uni*-UCS2-H and Uni*-UTF16-H and the GBK*-H CMap resources | 14:10.25 |
kens | Hmmm | 14:10.25 |
| I can't really comment without looking at what you've done to be honest | 14:11.48 |
| How much saving are you suggesting, what's the performance overhead of reconstructing the CMap ? | 14:13.33 |
chrisl | tor8: doesn't that make updating the cmaps a pain? | 14:15.06 |
kens | I guess we could write a customer findresource to construct a CMap dictionary from a 'modified' CMap. | 14:17.18 |
| s/customer/custom/ | 14:17.27 |
chrisl | AIUI, tor8 is using the usecmap operator, so it should just work, as long as the Postscript names work out | 14:18.07 |
kens | Do we actually support usecmap ? | 14:18.21 |
chrisl | Given it's used all over the joint in the standard ones, I'd rather hope so | 14:18.59 |
kens | Hmm, seems we do | 14:19.01 |
tor8 | chrisl: it does, which is why I haven't suggested it for gs | 14:19.18 |
| and gs doesn't embed them directly in the binary, does it? | 14:19.26 |
kens | It does for romfs builds | 14:19.37 |
chrisl | By default, yes | 14:19.39 |
moolc | tor8: you guys host git on some aws like thingie? | 16:37.01 |
Robin_Watts | moolc: We do. | 16:52.31 |
| We have an aws instance that hosts various things, including our git server. | 16:52.52 |
moolc | Robin_Watts: git remote update just took ~15 minutes (from tor's branch) Receiving objects: 100% (470/470), 24.01 MiB | 58.00 KiB/s, done. | 16:56.06 |
| no wonder a friend of mine mentioned how long it took him to bootstrap my stuff that does a clone of mupdf. | 16:56.32 |
Robin_Watts | That's unusually slow. | 16:56.44 |
moolc | Robin_Watts: perhaps it makes sense for me to switch git url in http://repo.or.cz/llpp.git/blob/6f05faf8e0f8bef1697edec342e8e8cfe02b43d0:/misc/bootstrap.sh to github mirror? | 17:14.57 |
Robin_Watts | moolc: Perhaps. | 17:15.33 |
moolc | Robin_Watts: okay.. i'll try | 17:16.19 |