MuPDF IRC logs

2018/04/23
Robin_Watts	tor8: http://git.ghostscript.com/?p=user/robin/mupdf.git;a=commitdiff;h=67a7449fc1f186f318942b9c6b8d66d4458b7d87	09:17.27
tor8	Robin_Watts: LGTM.	09:18.01
Robin_Watts	Ta.	09:18.07
tor8	Robin_Watts: there's a long list of commits on tor/master that need reviewing/discussing	09:18.16
Robin_Watts	tor8: OK, just a mo...	09:18.38
cosimone	Hi everyone, is this channel appropriate to ask minor questions and doubts about the android application? (i'm referring to the one you can find on fdroid)	09:45.03
moolc	cosimone: i'm not a mupdf developer, but those guys are here and will most likely not spank you for asking droid questions... sebras/tor are the ones mainly responsible for droid stuff	09:58.48
	Robin_Watts: i already asked that in the past - no resolution.. Is it somehow possible to clone mupdf and submodules (important part) shallowly?	10:07.24
Robin_Watts	tor8: Sorry, back.	10:14.40
	moolc: I don't follow.	10:15.25
	If you "git clone" then you get just the main repo, no submodules.	10:15.41
	If you git clone --recursive then you get the thirdparty submodules too.	10:15.58
	None of the submodules have subsubmodules as far as I know.	10:16.14
tor8	moolc: git submodules and shallow clones don't coexist well last I checked (about a year ago)	10:16.52
	moolc: you can sort of work around it by cloning mupdf (deep or shallow) without --recursive	10:17.37
	and then initialize the submodules manually, shallow or deep as you wish, but caveat emptor, etc, etc.	10:17.54
Robin_Watts	ok, shallow clones is clearly something I don't understand.	10:18.31
	tor8: So, your commits...	10:18.41
	The PDF_NAME one, I am warming to.	10:19.38
moolc	tor8: oh :( well let's keep our fingers crossed that not many people will use http://repo.or.cz/llpp.git/blob/b828b5a0553f3810c61628fcddb007066cdde389:/misc/bootstrap.sh	10:20.00
Robin_Watts	In the current code, if I do PDF_NAME_Bogus, it'll tell me there is no such name.	10:20.11
tor8	Robin_Watts: it will now too (but maybe slightly less clearly)	10:21.03
	error: use of undeclared identifier	10:21.08
	'PDF_ENUM_NAME_fooFont'	10:21.08
Robin_Watts	In the new code if I do PDF_NAME(Bogus), it'll tell me there is no PDF_ENUM_NAME_Bogus.	10:21.17
	I can live with that.	10:21.21
	One minor idea...	10:21.31
	The vast majority of the PDF_MAKE_NAME(A,B) things have A == "B"	10:21.50
	is it worth having #define PDF_MAKE_NAME(A) and PDF_MAKE_AWKWARD_NAME(A,B) ?	10:22.22
	The second commit, the pdf_name_eq one...	10:23.46
tor8	Robin_Watts: I did that at first, but it makes it harder to keep the list alphabetically sorted	10:24.27
	now I just pipe it through 'sort' and all is safe and sound	10:24.44
Robin_Watts	Good answer.	10:24.52
tor8	with the AWKWARD macro, that won't work...	10:24.56
Robin_Watts	Currently pdf_name_eq(A,B) checks for A and B being names, and A == B, and if that fails, it resolves A and resolves B, and then compares the two.	10:25.18
tor8	I figured I'd take the hit to awkwardness and not run into the same problem that we had once when sebras sorted the list, but had his LOCALE set to not-C and got some non-bytewise sort order...	10:25.40
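The locale problem mentioned here can be illustrated with a small sketch. It is not the project's tooling (tor8 pipes the list through 'sort'); the function name and the sample names are hypothetical. It sorts by encoded bytes, which is what a C-locale sort does, and uses a case-insensitive key as a stand-in for a non-C locale.

```python
def bytewise_sorted(names):
    """Sort names by their encoded bytes, as sort does under LC_ALL=C."""
    return sorted(names, key=lambda s: s.encode("utf-8"))


def caseless_sorted(names):
    """Stand-in for a locale-aware order that ignores case (an assumption)."""
    return sorted(names, key=lambda s: (s.lower(), s))


# Hypothetical sample; real PDF name lists are much longer.
SAMPLE = ["a", "Z", "B"]
```

With this sample the two orders differ, which is the kind of mismatch that breaks a binary search expecting bytewise order.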
Robin_Watts	So stuff like pdf_name_eq(PDF_NAME(Foo), pdf_new_name(ctx, "Foo")) works now, but won't work in future.	10:26.10
tor8	Robin_Watts: right. you mean if either of A or B is an indirect object pointing to a numbered pdf object that is a name	10:26.27
Robin_Watts	and pdf_name_eq(PDF_NAME(Foo), pdf_new_reference(ctx, PDF_NAME(Foo)));	10:26.33
tor8	pdf_new_name("Foo") will always return the constant enum thing for PDF_NAME_Foo	10:26.57
Robin_Watts	tor8: It will ?	10:27.10
tor8	yes. we bsearch the PDF_NAME_LIST looking for a hit, before we alloc a new pdf_obj	10:27.27
Robin_Watts	Ah, ok.	10:27.38
	So my objection is just about the references.	10:27.48
tor8	and I'm not sure I care for the (oh god, my brain hurts, please don't do that) case of indirect references	10:28.13
	but I guess I should run that through the cluster just to make sure we don't actually have that problem	10:28.32
Robin_Watts	If we're concerned about the overhead, then we should use a static inline for doing pdf_name_eq (for the simple case, falling back to a non-inline for the ref case)	10:28.56
	also, pdf_name_eq(PDF_TRUE, PDF_TRUE) should presumably not pass ?	10:29.20
	Hmm. The existing code will pass that.	10:29.31
tor8	Robin_Watts: the next commit after the PDF_NAME() one removes the pdf_name_eq function completely	10:30.24
	I thought you were talking about that one	10:30.29
Robin_Watts	I am.	10:30.33
tor8	okay. just so we're on the same page.	10:30.45
Robin_Watts	I'm saying that I don't like the direct comparison.	10:30.54
	pdf_name_eq(A,B) is at pains to only say true, if A and B are both names.	10:31.17
	not generic objects.	10:31.22
tor8	after that commit, pdf_name_eq is called in one place only	10:32.58
	all other comparisons were with a constant	10:33.06
Robin_Watts	tor8: ok... and that place is?	10:33.26
tor8	pdf_add_portfolio_schema	10:33.47
Robin_Watts	Not dict_get ?	10:34.03
tor8	so is your objection that we no longer return true for pdf_name_eq when comparing /Foo with 10 0 R where it points to 10 0 obj /Foo endobj	10:34.13
Robin_Watts	That is one objection.	10:34.28
tor8	pdf_dict_get does not use pdf_name_eq	10:35.54
	I have to admit I didn't think about the /Foo == 10 0 R case, because I don't think I've ever seen that sort of structure ever used	10:36.43
Robin_Watts	I'm struggling to build this version :(	10:37.54
	Ok, so the name change commit screws the windows build royally.	10:42.46
	does generate.bat delete the .h file maybe?	10:43.26
	No, the solution file does.	10:43.59
	pdf-font.c calls pdf_name_eq twice.	10:45.11
	but probably doesn't need to.	10:45.50
	OK, so broadly I'm happy with that, if we are happy that we don't need to redirect through references. Which seems like a big if.	10:46.31
	I worry that we'll strip this out now, only to hit a problem file in the future, and have to shove it all back in again.	10:46.54
	The next commit... the reordering of NULL/TRUE/FALSE.	10:47.56
	I seem to remember pondering this at one time.	11:48.04
	The attraction of having PDF_NULL being NULL etc.	10:48.18
	and also the idea of having PDF_TRUE == 1	10:48.26
tor8	I looked at the pdf reference, and there it says that one should never make the distinction between 'null' and a missing value	10:48.28
Robin_Watts	but we can't have PDF_FALSE == 0 too.	10:48.52
tor8	so having the distinction between PDF_NULL and NULL doesn't seem useful	10:48.53
	Robin_Watts: unfortunately no	10:49.00
	but having them be pointer values 0, 1, and 2 is easier when debugging, at least	10:49.16
Robin_Watts	OK. So the change looks wrong to me.	10:49.37
	but then I'm confused by it looking wrong anyway.	10:50.12
	In pdf-object.h	10:50.21
	we have PDF_NAME_LIST	10:50.28
	The old code used to allow slots at the start for NULL, TRUE and FALSE.	10:50.44
	and then have all the others.	10:50.52
	But the old code used to actually list the objects in the other order.	10:51.16
tor8	the old code had a slot at the start for a DUMMY value that was never used (to reserve the NULL pointer)	10:51.24
Robin_Watts	DUMMY, then names, then NULL, TRUE, FALSE	10:51.30
tor8	and then the PDF_NULL, TRUE, FALSE after the names	10:51.35
	this code puts three dummy slots at the start for NULL, TRUE, FALSE then the names starting at index 3	10:51.52
Robin_Watts	D'Oh. I can't read diffs.	10:52.11
tor8	and in pdf_new_name I start the bsearch with left=3 (skip the dummy slots)	10:52.51
Robin_Watts	pdf_name_eq is wrong.	10:53.12
	cos pdf_name_eq(PDF_NULL, PDF_NULL) will return 1	10:53.23
	(looks like it was wrong before too :( )	10:53.52
	pdf_name_eq should only ever return true, if both elements are names.	10:54.31
tor8	that's easy enough to fix	10:54.42
Robin_Watts	Indeed. With that fix, I'm happy.	10:54.50
	So, I'm happy with the first (if we fix the VS build).	10:55.19
	I'm unhappy with the second (because dereferencing seems important to me)	10:55.52
	I'm happy with the third (if we fix it to only equate names)	10:56.12
tor8	yeah. I'm starting to have second thoughts about the second (losing auto-dereferencing there may not be worth it)	10:59.15
Robin_Watts	The next one... pdf_new_obj_from_str dates from c69a9ace94	11:00.53
	Which states it was added for zeniko.	11:01.00
tor8	They haven't pulled from us in 3 years	11:01.55
Robin_Watts	tor8: Fair enough.	11:02.27
tor8	https://github.com/sumatrapdfreader/sumatrapdf/tree/master/mupdf/source/pdf	11:02.28
Robin_Watts	(tor8: Is it worth us forking with an up to date version?)	11:02.53
tor8	so I figure they're happy enough with their fork (and don't care enough about new features to suffer through our API instability, given how many local patches they've added)	11:03.11
Robin_Watts	(probably work we don't need. Depends on how much of a "shop window" sumatra is for us)	11:03.14
	Next one, the removal of 'doc'.	11:04.07
	I don't object. The argument against the change is regularity, but I'm not offended by it.	11:04.34
	Arguably it's clearer only to put 'doc' into things that actually remember the doc.	11:04.47
	Next one looks good.	11:05.19
tor8	with the dict_put_int, etc functions, we also don't need to call the pdf_new_int, etc functions nearly as often	11:05.43
Robin_Watts	tor8: yeah.	11:08.47
tor8	Robin_Watts: one random idea occurred to me (feel free to hate it): PDF_NAME_EQ(ctx, var, Foo) that resolves to pdf_name_eq(ctx, var, PDF_NAME(Foo))	11:08.49
	not sure if it's worth it, but it would be shorter	11:09.06
	nah, I already hate it myself.	11:09.10
	just needed to type it out	11:09.13
Robin_Watts	The worry I have with that is that PDF_NAME_EQ(ctx, Foo, var) won't work.	11:09.21
	ok :)	11:09.39
tor8	exactly, and it hides the third argument processing, which is icky	11:09.45
Robin_Watts	The cmap stuff looks clever.	11:09.53
cosimone	ok thanks. so, first of all, what does the button near the search button do? it switches from grey to blue and vice versa when pressed, but nothing seems to happen	11:11.48
Robin_Watts	cosimone: The chain icon?	11:12.03
cosimone	yes, that one	11:12.13
Robin_Watts	It makes "links" active.	11:12.16
	i.e. if you click on a hyperlink it follows it.	11:12.28
cosimone	oh, i see. i tried it on a document without links, that's why i didn't notice it, thanks	11:12.51
Robin_Watts	no worries.	11:12.57
	tor8: So how does the merging stuff work?	11:13.17
cosimone	another small thing, is there any way to select text? long press doesn't seem to work	11:13.25
Robin_Watts	cosimone: I can't remember a way at the moment.	11:14.32
	Possibly you can go into reflow mode, and then select from there?	11:14.41
cosimone	i'll try later	11:14.52
	no problems if you can't recall now, it's not urgent. thanks anyway, and keep up the good work!	11:15.24
Robin_Watts	tor8: So the plan is to check in cmaps produced using this.	11:15.45
	cos if windows users are relying on mutool cmapdump to be able to dump cmaps, we need some to get started with.	11:16.56
tor8	Robin_Watts: yeah. the plan is to check in all but the humongous font dumps	11:19.12
	Robin_Watts: which 'merging' stuff?	11:19.37
Robin_Watts	the "share" stuff.	11:19.53
tor8	first it creates a flattened representation of the involved CMaps	11:20.18
	which just has all the ranges expanded, so it only maps single characters	11:20.47
Robin_Watts	/UniCNS-X usecmap. Gottit.	11:20.49
	Nice.	11:20.59
tor8	then I extract the common subset into a -X cmap which both inherit using usecmap	11:21.05
Robin_Watts	yeah, that's the bit I was struggling to see.	11:21.24
	Nice trick.	11:21.27
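The two steps tor8 outlines (flatten each CMap so it maps single characters, then pull the common subset into a shared parent) can be sketched like this. It is not the lost or recreated script; the range format (lo, hi, dst) and the function names are assumptions.

```python
def flatten(ranges):
    """Expand (lo, hi, dst) ranges into a {code: cid} mapping of single codes."""
    table = {}
    for lo, hi, dst in ranges:
        for offset in range(hi - lo + 1):
            table[lo + offset] = dst + offset
    return table


def share(a, b):
    """Split two flat maps into (common, a_only, b_only).

    common holds entries that are identical in both maps; it plays the role
    of the shared -X CMap that both children would inherit.
    """
    common = {k: v for k, v in a.items() if b.get(k) == v}
    a_only = {k: v for k, v in a.items() if k not in common}
    b_only = {k: v for k, v in b.items() if k not in common}
    return common, a_only, b_only


def lookup(child, parent, code):
    """Look in the child first, then fall back to the inherited parent."""
    return child.get(code, parent.get(code))
```

Each original map can be rebuilt by layering its own remainder on top of the common part, so nothing is lost by the split.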
	Are there other potential savings still lurking in here?	11:21.41
tor8	I did it before, but forgot to check in and then lost my scripts	11:21.46
	so I recreated them properly again, and reran it on the latest CMaps	11:22.06
	possibly, the GB* cmaps may have some common bits that could be extracted likewise	11:23.19
	yeah. that could shave another 80kb	11:26.57
Robin_Watts	Is it worth an exhaustive run to compare every cmap with every other cmap?	11:27.22
tor8	possibly, but the savings would get smaller and smaller	11:29.03
	sharing the GB* cmaps would save ~160k	11:29.29
Robin_Watts	tor8: yeah, I just wondered if there were maybe some smaller lumps that were common to lots of files (like symbols etc)	11:29.57
tor8	the next on the list would be KSC and RKSJ and those might save 10k-20k each	11:29.58
	Robin_Watts: if we run flattencmap.py on all of them, is there a tool that can do a similarity score?	11:30.36
	well, diffstat I guess	11:30.40
Robin_Watts	pass, "comm" is a new one on me as it is :)	11:30.54
	comm | wc -l ? :)	11:31.16
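A similarity score in the spirit of "comm | wc -l" could be computed directly on flattened maps. This sketch counts identical entries for every pair and ranks the pairs; the input shape (a dictionary of name to flat mapping) and the function names are assumptions, not an existing tool.

```python
from itertools import combinations


def shared_count(a, b):
    """Number of codes that both flat maps send to the same value."""
    return sum(1 for k, v in a.items() if b.get(k) == v)


def rank_pairs(cmaps):
    """Return (count, name1, name2) tuples, largest overlap first.

    cmaps is {name: {code: cid}}, for example the output of a flatten step.
    """
    scores = [
        (shared_count(cmaps[x], cmaps[y]), x, y)
        for x, y in combinations(sorted(cmaps), 2)
    ]
    return sorted(scores, reverse=True)
```

The pairs at the top of the ranking are the ones where extracting a shared subset is most likely to pay off.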
	(Does gs include these cmaps too? Possibly we should use the reduced ones there as well.)	11:32.26
	Next few look fine. Looking at the "Try other CJK languages to find missing characters" one now.	11:37.48
	Is there a way to know what chars a font has without loading the whole thing?	11:38.09
	Like, can we check in a CMAP first?	11:38.17
	I bet fonts don't correspond 1:1 with cmaps.	11:38.33
tor8	Robin_Watts: for the "Try other CJK languages to find missing characters." commit?	11:39.05
Robin_Watts	tor8: Yeah.	11:39.17
tor8	sadly no, we need to load the TTF/OTF/TTC to look at the SFNT 'cmap' table	11:39.21
Robin_Watts	I feared that might be the case. Looks good anyway.	11:39.34
	This is presumably for epub ?	11:39.46
tor8	if it weren't for the Han unification, we wouldn't be in this mess :)	11:39.58
Robin_Watts	mmm.	11:40.03
tor8	yes. this is for epub (and eventually PDF form filling and appearance synthesis)	11:40.18
	the 'japan' font doesn't have non-japanese characters, but they could still be used in a japanese language context	11:40.47
	a fairly rare occurrence, but if it happens, we should look through the other fonts we have	11:41.05
	and of course, there's always the case where we have a unicode character but not a specified language	11:41.27
	Robin_Watts: okay, so there are more savings to be had by that sharing trick...	11:45.24
	but I think I'll need to write a new 'comm' tool that can take 3 input files...	11:46.38
	or more	11:46.47
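A 'comm' that takes three or more inputs amounts to intersecting any number of flat maps. A minimal sketch, assuming each input is a {code: cid} dictionary and using hypothetical function names:

```python
def common_subset(maps):
    """Entries present with the same value in every map of a non-empty list."""
    first, *rest = maps
    return {k: v for k, v in first.items() if all(m.get(k) == v for m in rest)}


def remainder(flat, common):
    """What a child map still has to define after inheriting common."""
    return {k: v for k, v in flat.items() if k not in common}
```

With three related maps, the common subset would go into one shared parent and each child would keep only its remainder.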
Robin_Watts	other commits all look good.	11:50.53
tor8	okay, I'll revert the "Use direct comparison to compare pdf_obj with constant name objects." commit, fix pdf_name_eq	11:51.53
	and I might need your help to make "Remove need for namedump by using macros and preprocessor." work with MSVC	11:52.09
Robin_Watts	Sure.	11:52.21
tor8	and I've got some ideas, I might remake the 'cmapshare.py' script	11:52.37
	or rather the calling script	11:52.47
	to merge more cmap subsets	11:52.52
Robin_Watts	ok. just yell when you need me.	11:53.02
tor8	will do	11:55.42
	Robin_Watts: success!	13:12.06
	I have squeezed the CMap data down from 1.5M to 840k by sharing more subsets	13:12.18
	oh, wait... I might be a bit over-optimistic	13:12.46
	Robin_Watts: I have managed to save 50kb by sharing the common bits between GBK-EUC-H, GBKp-EUC-H, and GBK2K-H	14:06.20
Robin_Watts	tor8: So the cmaps were what size?	14:06.38
tor8	I think I shall leave it at that. the remaining CMap source files are already pretty optimal	14:06.41
	those three were 82kb, 82kb, and 89kb in size	14:07.46
Robin_Watts	tor8: I was just wondering what the "full" set was before you tried squeezing, and what it is now.	14:08.12
	50K is always nice to have.	14:08.36
	Can we push the squashed sets into gs too ?	14:08.52
tor8	no idea; you'd have to consult chrisl or kens	14:09.10
kens	Eh what ?	14:09.22
tor8	it might be that PS expects more from the CMap resources that these squeezed ones don't do	14:09.27
	kens: I have massaged some of the CMap resources in mupdf to 'usecmap' common subsets	14:09.54
	kens: a significant amount of savings for Uni-UCS2-H and Uni-UTF16-H and the GBK*-H CMap resources	14:10.25
kens	Hmmm	14:10.25
	I can't really comment without looking at what you've done to be honest	14:11.48
	How much saving are you suggesting, what's the performance overhead of reconstructing the CMap ?	14:13.33
chrisl	tor8: doesn't that make updating the cmaps a pain?	14:15.06
kens	I guess we could write a customer findresource to construct a CMap dictionary from a 'modified' CMap.	14:17.18
	s/customer/custom/	14:17.27
chrisl	AIUI, tor8 is using the usecmap operator, so it should just work, as long as the Postscript names work out	14:18.07
kens	Do we actually support usecmap ?	14:18.21
chrisl	Given it's used all over the joint in the standard ones, I'd rather hope so	14:18.59
kens	Hmm, seems we do	14:19.01
tor8	chrisl: it does, which is why I haven't suggested it for gs	14:19.18
	and gs doesn't embed them directly in the binary, does it?	14:19.26
kens	It does for romfs builds	14:19.37
chrisl	By default, yes	14:19.39
moolc	tor8: you guys host git on some aws like thingie?	16:37.01
Robin_Watts	moolc: We do.	16:52.31
	We have an aws instance that hosts various things, including our git server.	16:52.52
moolc	Robin_Watts: git remote update just took ~15 minutes (from tor's branch) Receiving objects: 100% (470/470), 24.01 MiB | 58.00 KiB/s, done.	16:56.06
	no wonder a friend of mine mentioned how long it took him to bootstrap my stuff that does a clone of mupdf.	16:56.32
Robin_Watts	That's unusually slow.	16:56.44
moolc	Robin_Watts: perhaps it makes sense for me to switch git url in http://repo.or.cz/llpp.git/blob/6f05faf8e0f8bef1697edec342e8e8cfe02b43d0:/misc/bootstrap.sh to github mirror?	17:14.57
Robin_Watts	moolc: Perhaps.	17:15.33
moolc	Robin_Watts: okay.. i'll try	17:16.19

Log of #mupdf at irc.freenode.net.