MuPDF IRC logs

	20170516
user875	hey there! have you considered adding a back function? For example, if I click on a link that jumps to another page, being able to return to the page I came from?	07:54.27
	(hopefully I have not overlooked it in the man page ...)	07:54.53
kens	tor8 that new bug report about text extraction is 'interesting'	08:59.49
tor8	kens: yeah... to say the least	09:08.06
kens	I cut the file down to just '2008'	09:08.18
tor8	in the body text, the digits come out to things like """%	09:08.44
kens	Which is 0x32 0x30 0x30 0x38, all single bytes in the CMap	09:08.50
tor8	can you put up the cut down file?	09:09.46
kens	If you look in the ToUnicode CMap it maps <0030> <20> <0032> <20> <0038> <20>	09:09.49
	Sure	09:09.53
	err, when I find it	09:10.10
	ah found it	09:11.28
	OK there it is	09:12.19
	decompressed and reduced, still 700Kb	09:12.28
tor8	thanks	09:12.36
kens	So when I run this file through GS it extracts the '2008' as ' ' which looks correct to me.	09:13.06
tor8	mupdf also extracts that one as " "	09:13.47
	<span font="HAIWKP+FZDBSJW--GB1-0" wmode="0" trm="29.014 0 0 29.014">	09:13.59
	<g unicode=" " glyph="19" x="287.20976" y="993.08688" />	09:13.59
	<g unicode=" " glyph="17" x="301.65873" y="993.08688" />	09:13.59
	<g unicode=" " glyph="17" x="316.1077" y="993.08688" />	09:13.59
	<g unicode=" " glyph="25" x="330.55668" y="993.08688" />	09:13.59
	</span>	09:13.59
kens	So I wonder if Acrobat is not using the ToUnicode CMap, but instead using maybe the GBK-EUC-H Encoding	09:14.18
	Oh oops, I messed up slightly with that file. I had doubled up the <32> to <0032>	09:14.46
	You'll want to delete the leading 00 from those text characters	09:15.03
	I was just running an experiment there	09:15.13
tor8	if I search the mutool draw -Ftrace output for instances using the same font name	09:15.19
	I get the same output	09:15.25
	<span font="HAIWKP+FZDBSJW--GB1-0" wmode="0" trm="29.0135 0 0 29.0135">	09:15.29
	<g unicode=" " glyph="19" x="287.2106" y="993.08877" />	09:15.30
	<g unicode=" " glyph="17" x="301.66883" y="993.08877" />	09:15.30
	<g unicode=" " glyph="17" x="316.12705" y="993.08877" />	09:15.30
	<g unicode=" " glyph="25" x="330.58528" y="993.08877" />	09:15.30
kens	And yet, Acrobat gets '2008' :-(	09:15.45
tor8	maybe they ignore the ToUnicode?	09:16.02
kens	Possibly	09:16.12
	But they definitely don't usually ignore it, so how to know when they will ?	09:16.24
	It does seem clear it's ignoring the ToUnicode	09:18.02
tor8	yeah, if I nuke the ToUnicode it comes out as 2008	09:18.10
kens	In MuPDF ? I haven't tried GS doing that, but I'm sure that's the result I would get	09:18.33
tor8	in object 39	09:18.52
kens	If you go back to the original file and copy/paste from Acrobat the line with 2008 in it, then it comes out as you would expect	09:19.22
	2-byte text is correct as well as single byte text	09:19.38
tor8	in mupdf with a nuked ToUnicode for that font	09:20.06
	the text comes out correct for both the latin and han characters	09:20.31
	with the ToUnicode entry intact, only the han characters come out correct	09:21.02
kens	Really ? You just treat the Han as Unicode ?	09:21.07
	When no ToUnicode I mean	09:21.17
	It does appear that Acrobat is simply ignoring the ToUnicode CMap, but I cannot see any justification for doing that	09:21.52
tor8	in mupdf we use a builtin ToUnicode based on the CID system info	09:21.55
kens	Oh OK	09:22.08
tor8	if the font does not have its own ToUnicode	09:22.23
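
A minimal sketch of the lookup order tor8 describes above: use the font's own ToUnicode table when it has one, otherwise fall back to a built-in table. This is a toy model, not MuPDF code. The names `extract_text`, `BAD_TO_UNICODE` and `BUILTIN` are hypothetical, and the built-in table is assumed to be plain ASCII rather than a real Adobe-GB1 mapping.

```python
def extract_text(codes, to_unicode=None, builtin=None):
    """Map character codes to text, preferring the font's own table."""
    # Assumption: a present ToUnicode table always wins, even if it is wrong.
    table = to_unicode if to_unicode is not None else (builtin or {})
    # Codes missing from the chosen table become U+FFFD.
    return "".join(table.get(code, "\ufffd") for code in codes)


# The four single-byte codes from the cut-down file: '2', '0', '0', '8'.
CODES = [0x32, 0x30, 0x30, 0x38]

# A broken ToUnicode in the spirit of the one discussed: digits map to U+0020.
BAD_TO_UNICODE = {0x30: "\u0020", 0x32: "\u0020", 0x38: "\u0020"}

# Hypothetical stand-in for a built-in table: printable ASCII maps to itself.
BUILTIN = {code: chr(code) for code in range(0x20, 0x7F)}

print(repr(extract_text(CODES, BAD_TO_UNICODE, BUILTIN)))  # '    '
print(repr(extract_text(CODES, None, BUILTIN)))            # '2008'
```

With the bad table in place the digits come out as spaces; with it removed, the fallback table yields 2008, which matches what tor8 reports after nuking the ToUnicode.
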
kens	The system info is OK, I wonder if Acrobat is doing that also	09:22.27
	Given the Registry is Adobe and the Ordering is GB1	09:22.44
	I wonder what happens if I nuke that	09:22.51
tor8	I suspect it might decide not to even look at a ToUnicode if the system info is one of the CJK ones	09:23.07
kens	If I garble the system info it still gets 2008 correct	09:23.33
	Let's see what happens if I mess with the ToUnicode	09:23.49
	LOL still gets 2008 correct	09:24.02
	It's clearly not using the ToUnicode at all	09:24.12
	Oh there are 2 orderings	09:25.04
tor8	could it base it on the /Encoding /GBK-EUC-H entry?	09:25.33
kens	I tried garbling that too earlier	09:25.46
tor8	hm, the /CIDSet maybe?	09:26.32
avih_	tor8: morning! 1. i think toFixed(0) should not be an alias to toFixed(1) but it seems mujs behaves like it is. 2. i don't think mujs should print to stdout. e.g. on gc or "warning: function statements are not standard". maybe the solution to 2 is use stdout by default, but allow registering a print function (plain str, no va_list)?	09:26.34
tor8	I've never used that	09:26.36
kens	Ah, if I mess up the Encoding then Acrobat throws an error	09:26.38
avih	also, congrats on the gpl win? :)	09:28.04
kens	Changing it to 83pv-RKSJ-H didn't change the Latin, I guess I need to put back the Han to test it	09:28.17
	avih long way to go yet	09:28.26
tor8	avih: morning	09:28.31
avih	as always :)	09:28.35
	kens: better than losing right off the bat though...	09:28.54
kens	Sure	09:28.59
	It'll probably end up in an out-of-court settlement is my guess	09:29.12
tor8	avih: toFixed (and similar) are very simplistic implementations at the moment	09:29.23
avih	that's one way to call it. "non standard" would be another :)	09:29.49
	do you have a list yet of non compliant stuff?	09:30.13
	kens: yeah, hopefully a nice one	09:30.44
kens	Well we'll see. Stopping them infringing would be good, sends a message out if nothing else	09:31.05
	tor8 looks like it is using the CIDSystemInfo	09:31.37
	sorry, the Encoding I mean	09:32.44
avih	kens: problem with settlements though, other than typically being shy on the details, is that they don't set a precedent :/	09:34.26
kens	If I garble the CIDSystemInfo for the font and descendant, then the text is OK, if I additionally change the Encoding from GBK-EUC-H to 83pv-RKSJ-H then the text disappears on rendering, and copy/paste returns garbage latin text for the Han	09:34.32
	avih, not legally	09:34.41
tor8	kens: with this knowledge in hand, I'm inclined to close the bug as "wontfix": it works as expected -- don't put a bad ToUnicode in the file if you want proper output.	09:34.49
kens	But they tell infringers we will sue	09:34.49
	tor8 I would agree completely	09:35.00
	The ToUnicode is incorrect and that's where the problem comes from	09:35.13
tor8	it is good to know that Adobe uses the /Encoding only if it's a known encoding	09:35.24
kens	You might want to put the information about the Encoding there, just to capture it	09:35.28
*kens*	goes back to PDF gstates	09:35.48
	more coffeee...	09:36.52
avih	tor8: what about stdout? i don't think non debugging APIs should print to stdout. but i do find the info useful. both on gc and the function warning, even if those two clearly serve different functions.	09:38.09
	(or maybe it's stderr, didn't check. but still the same)	09:38.58
	so maybe warning and info print callbacks. that should cover nicely a need to print stuff	09:39.45
	oh, i _think_ there's a bug in gc where it always prints.	09:40.36
tor8	kens: will do. thanks!	09:41.13
	avih: yeah, that's probably just an oversight.	09:41.23
	avih: the stdout/stderr printouts	09:41.32
avih	yeah. the info in those prints is good, but i'm using mujs as a lib, with multiple threads, i don't want the lib to print stuff	09:42.13
	if it goes through me then i can channel it correctly to log files or other outputs	09:43.03
tor8	avih: yes, I hear you.	09:44.34
avih	(i'd say the same for dofile/dostring, but those are easily rewritable using other APIs)	09:45.49
	(so i consider those "non production")	09:47.08
kens	tor8 just for giggles, it looks like that PDF file was produced by 'PSPNT' which as far as I can tell is the Founder Rip.	09:54.20
	Looks like they are calling it EagleRIP today	09:56.31
tor8	user875 (for the logs): to go back to before where you clicked the link, use the 't' key.	10:08.06
	avih: there's a commit on mujs:tor/master that adds a report callback function	10:11.48
avih	hmm.. so no warning/info?	10:13.55
tor8	I only found one warning message, and that one should strictly speaking be an error	10:14.22
	function statements are not in the ES5 spec	10:14.34
	but IMO they should be, and everything supports them	10:14.48
avih	gc is pure info	10:16.20
tor8	every other message is informational	10:17.45
avih	(i'm fine with the function statement thingy going into the "not a good thing happened" bucket)	10:17.57
	oh, i see your point.	10:18.21
	what is js_report good for?	10:21.52
	(as public api)	10:21.59
tor8	client functions may also want to print warning/debug/report messages and have them go to the same place	10:22.52
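
The callback idea discussed above can be sketched roughly like this: the library prints by default, the embedding application may register its own sink, and client code can send messages through the same channel. This illustrates the pattern only. `Engine`, `set_report`, `report` and `collect_garbage` are hypothetical names, not the mujs API, and the message text is made up.

```python
import sys


class Engine:
    """Toy interpreter that routes all diagnostics through one callback."""

    def __init__(self):
        # Default behaviour: print to stdout, as avih suggests for the default.
        self._report = self._print_to_stdout

    @staticmethod
    def _print_to_stdout(message):
        print(message, file=sys.stdout)

    def set_report(self, callback):
        """Register a function taking a plain string (no format arguments)."""
        self._report = callback

    def report(self, message):
        """Public entry point so client functions share the same sink."""
        self._report(message)

    def collect_garbage(self):
        # Hypothetical informational message emitted by the library itself.
        self.report("gc: nothing to collect")
```

An application running several engines on different threads could then pass something like a logger method to `set_report`, so that nothing reaches stdout unless the application chooses to send it there.
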
avih	also the regex overflow.. does the reason appear in an exception? otherwise that's a really useful info	10:23.33
tor8	it does not; I'm not sure quite how to tackle it	10:24.12
	the regexp.c code is from a separate non-JS project and I've wanted to keep the files independent of mujs	10:24.48
	otherwise it'd be easy :)	10:24.52
avih	hmm.. :)	10:25.05
	you mean the entire file is a verbatim copy from another project? that can't be true..	10:25.57
	or maybe it can. no "js" there	10:26.31
tor8	avih: it's also why the file doesn't have a 'js' prefix	10:30.19
avih	maybe it should have some reporting facility which both the js and the other project could use? but you know your code better. all i know is that it's a i'd like to get that info as a user	10:31.44
	s/it's a//	10:32.40
	regex are already hard to debug. some code uses big structures with regex in them and the line number on such is not always useful (maybe i'll recheck after your line number fix)	10:34.44
	(iirc it was a big object with many properties, where some of them were regex, and the error line number was on the last line of the struct)	10:36.01
	(which had nothing but '}')	10:36.21
tor8	we only track lines by statement	10:51.00
avih	tor8: thx for the lines thing. what does JS_ASTLIMIT serve other than using it because you can? as far as i can tell you don't pre-allocate anything depending on its size. a statement cannot be infinite by definition, so where does it become useful?	13:06.48
	(i'd understand recursion limit which you can't predict if infinite or not, but i think it's not the case here)	13:07.36
tor8	avih: prevents stack smashing and segfaults when parsing ridiculously deep (maliciously constructed) expressions	13:12.07
	something like "((((((((((((((((((((((" but for a few kilobytes more	13:12.18
avih	wouldn't malloc just fail at some point?	13:12.39
tor8	it's the stack that runs out, not the heap	13:12.51
avih	oh, it's on thee stack	13:12.52
	-e	13:12.59
tor8	recursive descent parsing has its drawbacks	13:13.02
avih	and before this patch?	13:14.09
	stack overflow on malicious input?	13:14.31
	(you could still overflow though, depending on your stack size. and musl for instance has a notoriously small default stack)	13:15.12
	(iirc 2k by default and alpine linux for instance set it to 8k. _iirc_)	13:15.57
tor8	avih: that's why I made the JS_ASTLIMIT a #define	13:16.09
	before this patch we'd segfault on such input	13:16.20
avih	right	13:16.26
	it doesn't sound too little for hand written code for sure. but generated is harder to expect (minified/amalgamated/asm.js/etc)	13:17.57
	s/expect/predict/	13:20.23
tor8	nothing to do with minification or asm.js would add such deeply nested expressions	13:22.00
	there's always a question of where to draw the line; if it turns out to be problematic we can just bump the magic number	13:22.42
avih	true	13:22.53
	how i do think it's related though. things which are compiled to js can use the language as they see fit, and nesting is a perfectly useful tool	13:24.34
	however*	13:24.39
	but anyway, cross the bridge when the time comes :)	13:25.27
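
The depth limit discussed above (JS_ASTLIMIT) can be illustrated with a tiny recursive descent parser. This is a sketch under stated assumptions, not the mujs parser: the grammar is just nested parentheses around an `x`, the names `AST_LIMIT`, `ParseError`, `parse_expr` and `parse` are hypothetical, and the value 100 is arbitrary rather than the value mujs uses.

```python
AST_LIMIT = 100  # arbitrary cap chosen for this sketch


class ParseError(Exception):
    pass


def parse_expr(src, pos, depth):
    """Parse expr := '(' expr ')' | 'x'. Returns (next position, nesting)."""
    # Refuse to recurse further before the host stack can run out.
    if depth > AST_LIMIT:
        raise ParseError("too much nesting")
    if pos < len(src) and src[pos] == "(":
        pos, inner = parse_expr(src, pos + 1, depth + 1)
        if pos >= len(src) or src[pos] != ")":
            raise ParseError("expected ')'")
        return pos + 1, inner + 1
    if pos < len(src) and src[pos] == "x":
        return pos + 1, 0
    raise ParseError("unexpected input")


def parse(src):
    """Parse a whole string and return how deeply it was nested."""
    pos, nesting = parse_expr(src, 0, 0)
    if pos != len(src):
        raise ParseError("trailing input")
    return nesting
```

Input such as a few thousand opening parentheses is rejected with a `ParseError` once the cap is reached, instead of recursing until the stack is exhausted. As in the discussion, the cap is a tunable constant, so it can be raised if legitimate input ever hits it.
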

Log of #mupdf at irc.freenode.net.