MuPDF IRC logs

	<<<Back 1 day (to 2020/05/13)	Fwd 1 day (to 2020/05/15)>>>	20200514
myopia	is artifex planning on muhtml after mujs? all current browser and html engines suffer from memory leaks from those I have tried		10:51.11
	a clean-room C implementation of a leak-proof html engine to be coupled with mujs would be nice		10:52.01
	a candidate to be squared away for a truly scriptable minimalistic browser		10:52.36
	scriptable and minimalistic		10:52.54
	probably also coupled with security proofs from a haskell code model		10:53.39
ator	myopia: no. a minimalist browser that is usable on today's web would be a universe-imploding type paradox.		11:49.46
	sebras: a couple of mujs commits on tor/master that could do with a review if you got time		11:50.53
	avih: that goes for you too. I'm thinking of finally pushing the WTF-8 mess...		11:51.52
avih	ator: summary? also what is the "WTF8 mess"?		11:52.20
ator	allow >16bit characters in JS strings that would pass from C to C untouched, without automatic SMP conversion		11:52.55
avih	(i've been using this successfully https://github.com/avih/mpv/commit/81306684 )		11:52.57
	well, i know what the problem is quite exactly. i mean, what's your solution?		11:53.27
ator	basically the same hack as duktape		11:53.27
	my solution is to do nothing and leave it up to the user :)		11:53.47
avih	ah, so js char is a codepoint?		11:53.47
	(that's the implication of char > 16 bit, right?)		11:54.25
ator	with the commits I'm proposing, you won't be able to create a >16bit char from Javascript, nor by loading it from source		11:55.08
avih	but it will preserve them if they came from the C api?		11:55.35
	what about slice etc?		11:55.40
ator	but if you pass in a string with >16bit characters via js_pushstring(), they will pass through unmolested back to js_tostring()		11:56.08
avih	what happens with CP > BMP in source files/code?		11:56.15
ator	SMP characters in UTF-8 source code will be converted to surrogate pairs		11:56.37
avih	not fun		11:57.02
	the C client will still need to handle surrogates then, in addition to UTF8 as non-surrogates		11:57.28
	ator: i don't know why you keep messing with half solutions. make it proper. let the C api be UTF8 and handle everything which results inside mujs, IMO		11:58.55
	be graceful with invalid codepoints, but for valid ones just make them work. remove this burden from the user.		11:59.33
ator	fine. then I'll be strict and only allow 16-bit characters as originally intended by ECMAScript.		12:00.07
avih	my solution is external and complete. internally it will be a bit simpler, but there's no getting away from converting every C api string if required.		12:00.37
	you can create another version of all string APIs which will be used internally only, so it doesn't get tested for conversion, but user-facing APIs should just be UTF8.		12:02.39
ator	I'm honestly more tempted to say "screw compatibility" and allow >16bit code points in the JS code		12:03.21
avih	i think this would be better than half-baked solutions. not as good as both UTF8 API and 16 bits chars, but better than half-solutions IMO		12:06.10
ator	the main problem with that is representing strings as ascii. there's no escape syntax for >16bit unicode characters.		12:06.56
	\x is 1-byte, and \u is 2-byte only		12:07.09
avih	i don't follow. example?		12:07.13
	hmm		12:07.16
	isn't there \U ?		12:07.24
ator	(of course, I could just not escape them)		12:07.26
avih	so this is only for JSON?		12:07.39
ator	JSON and console.log()/print()		12:08.20
avih	iirc JSON allows both surrogate and \U<any-codepoint>. I don't know if the js JSON spec requires differently		12:08.42
	I was wrong. JSON allows either surrogates or unescaped utf8		12:09.54
	https://en.wikipedia.org/wiki/JSON#Data_portability_issues		12:10.19
ator	I'm leaning towards not escaping SMP characters		12:10.39
avih	15.12.1.1 says "SourceCharacter but not one of " or \ or U+0000 through U+001F"		12:13.32
	i think it could be interpreted that unescaped SMP is allowed?		12:13.49
	actually, any source utf8 except 0-31 seems to be allowed		12:14.34
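Editor's note: the rule avih quotes above (escape only the double quote, the backslash, and U+0000 through U+001F) can be sketched in C roughly as follows. This is a minimal illustration under stated assumptions, not mujs code: `json_quote` is a hypothetical helper name, and for brevity it writes every control character as `\u00XX` rather than using the short forms such as `\n`.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of mujs): write a JSON string literal for the
 * NUL-terminated UTF-8 string `s` into `out`, which holds `cap` bytes.
 * Only '"', '\\' and the control bytes 0x01..0x1F are escaped; every other
 * byte, including the multi-byte UTF-8 sequences of non-BMP characters, is
 * copied through unchanged.
 * Returns the number of bytes written (excluding the terminator), or -1 if
 * the buffer might be too small. The size check is conservative: it wants
 * 8 spare bytes before each input byte is processed. */
int json_quote(const char *s, char *out, size_t cap)
{
	const unsigned char *p = (const unsigned char *)s;
	size_t n = 0;

	assert(s != NULL && out != NULL);
	if (cap < 3)
		return -1;
	out[n++] = '"';
	for (; *p; ++p) {
		/* longest escape (6 bytes) + closing quote + NUL */
		if (n + 8 > cap)
			return -1;
		if (*p == '"' || *p == '\\') {
			out[n++] = '\\';
			out[n++] = (char)*p;
		} else if (*p < 0x20) {
			n += (size_t)sprintf(out + n, "\\u%04x", (unsigned)*p);
		} else {
			out[n++] = (char)*p;
		}
	}
	out[n++] = '"';
	out[n] = '\0';
	return (int)n;
}
```

With this sketch, a string containing a 4-byte UTF-8 sequence comes out with those four bytes untouched between the quotes, while a newline comes out as `\u000a`.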
ator	well, the JSON.stringify algorithm actually only says to escape 0-31, not >127		12:25.29
	avih: okay, how about what's on tor/master now then?		12:35.42
avih	ator: i assume only HEAD~1 and HEAD~2 ?		12:37.26
ator	yes. feel free to look at the others too, but those are the ones I think you're most interested in.		12:37.54
avih	ator: so the first commit only converts 0 in js_pushlstring into whatever else you encode it as?		12:40.36
ator	no. it does not affect pushlstring. it affects how "foo\0bar" is parsed		12:41.20
avih	in source code only?		12:41.44
ator	yes.		12:41.48
avih	hmm..		12:41.52
	well, i think it should be applied in js_pushlstring as well		12:42.10
	(not sure yet how to retrieve it. there's no js_tolstring...)		12:42.37
ator	in other words, we represent embedded 0 in strings as <c080>. same as Java does in the .class bytecode and elsewhere.		12:42.42
	js_pushlstring is to push a non-zero-terminated string		12:43.12
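Editor's note: the `<c080>` representation ator mentions is the two-byte overlong encoding of U+0000, which lets a string contain a logical zero while staying NUL-terminated in C. A minimal sketch of the idea follows; the function names are hypothetical and this is not the code from the commits under review.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: copy `len` bytes from `src` to `dst`, replacing every
 * 0 byte with the two-byte sequence C0 80, then NUL-terminate `dst`.
 * `dst` must have room for 2 * len + 1 bytes.
 * Returns the number of bytes written, excluding the terminator. */
size_t encode_embedded_nul(const char *src, size_t len, char *dst)
{
	size_t i, n = 0;

	assert(src != NULL && dst != NULL);
	for (i = 0; i < len; ++i) {
		if (src[i] == '\0') {
			dst[n++] = (char)0xC0;
			dst[n++] = (char)0x80;
		} else {
			dst[n++] = src[i];
		}
	}
	dst[n] = '\0';
	return n;
}

/* Hypothetical inverse: rewrite the NUL-terminated string `s` in place,
 * turning each C0 80 pair back into a single 0 byte.
 * Returns the decoded length, which may now include embedded zeros. */
size_t decode_embedded_nul(char *s)
{
	size_t r = 0, w = 0;

	assert(s != NULL);
	while (s[r] != '\0') {
		/* s[r + 1] is always readable: at worst it is the terminator */
		if ((unsigned char)s[r] == 0xC0 && (unsigned char)s[r + 1] == 0x80) {
			s[w++] = '\0';
			r += 2;
		} else {
			s[w++] = s[r++];
		}
	}
	s[w] = '\0';
	return w;
}
```

Under these assumptions, the 12 bytes of "Hello", a zero byte, and "World!" encode to 13 bytes with no zero before the terminator, and decode back to the original 12.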
avih	exactly, because it may include embedded 0 which you don't want to terminate it		12:43.39
ator	making it a public function was a mistake		12:43.45
	no, because you're pushing a slice of another string as a new string		12:43.56
	like when the regexp code finds a match		12:44.07
avih	that may have been your intention, but generally non-0-terminated also means a binary blob		12:44.26
ator	pushing strings with an embedded zero was never the intention :)		12:44.31
avih	anyway, i think this is off topic. source files rarely have 0 in them, so only for source files i don't think you should push it.		12:45.20
	(i.e. just terminate the parsing on 0, or skip it, or whatever, it's not worth all this code IMO)		12:45.59
	(fwiw, i don't disagree that pushlstring should not handle embedded 0. if you want blob, use whatever which supports arbitrary sequences - which is not *tf8. it could be some array buffer or some encoding etc)		12:47.44
	"+If you have Javascript code that expects to work with UTF-16 surrogate pairs,		12:49.24
	+you will need to manually convert any extended characters to surrogate pairs		12:49.24
	+and back when passing strings between C and Javascript."		12:49.24
	i think that if you already broke it, at least make it complete - allow >16bit also from js, and provide these functions which convert to/from surrogate pairs		12:49.57
	(i can write them for you once you allow >16bit from inside js)		12:50.31
	however, my solution is still complete, does not break the standard, does not have perf implications for running js code, and has effectively negligible perf impact for C APIs.		12:52.31
	you keep trying to find half solutions, even with this.		12:53.28
	(i understand why.. but you can also choose to solve it correctly)		12:54.34
ator	nobody wants the "correct" solution because it's a piece of crap and the world has moved on to UTF-8 and is not stuck in 1993 when windows NT came out with its 2-byte "unicode" support...		12:57.23
	nobody in their right mind wants to mess with surrogate pairs.		12:57.35
avih	i agree, but it's still the correct js way		12:58.03
ator	your patch should still work unaffected by what is on tor/master		12:58.06
avih	i know. i was hoping to not need it.		12:58.29
ator	but now, if someone wants to use mujs with utf-8, unrestricted by having to convert to surrogate pairs, they can. if they just don't know or don't care, their code will work.		12:58.58
avih	regardless, if you allow SMP chars from inside js and provide the conversion function, i think it will be good enough		12:59.00
ator	you talk as if you haven't actually looked at the code :)		12:59.20
avih	i haven't		12:59.29
	iirc you said it won't work from inside js		12:59.41
ator	here I am thinking you object to something in the code, when you're actually objecting to something that doesn't exist >.<		13:02.47
avih	i wasn't objecting to anything. i said that if you support SMP chars in js, then you should also support it from inside js if it's not supported already.		13:03.47
	and the docs should provide the js conversion to/from surrogates, and i can write them for you once using SMP from inside js is supported.		13:05.16
	i.e. str[1]=0x10005 should work		13:05.54
	or rather FromCharCode(0x10005) should work		13:07.02
ator	the current patch supports them inside and outside of js. String.fromCharCode and String.prototype.charCodeAt() are not limited to 16-bits (as per the commit message)		13:09.25
avih	great. then it should be good enough. i can write the to/from surrogate pairs conversion if you want to add them to the docs. i expect ~5-10 LOC each		13:11.17
	(probably more like 5)		13:13.11
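Editor's note: the surrogate-pair conversion being discussed is a few lines of arithmetic in each direction. The sketch below shows that arithmetic in C. It is not the JavaScript helpers avih offers to write, and the function names are hypothetical.

```c
#include <assert.h>

/* Split a supplementary code point (U+10000..U+10FFFF) into a UTF-16
 * surrogate pair: high surrogate in *hi, low surrogate in *lo. */
void to_surrogates(unsigned long cp, unsigned *hi, unsigned *lo)
{
	assert(cp >= 0x10000UL && cp <= 0x10FFFFUL);
	cp -= 0x10000UL;
	*hi = 0xD800u + (unsigned)(cp >> 10);    /* top 10 bits */
	*lo = 0xDC00u + (unsigned)(cp & 0x3FFu); /* bottom 10 bits */
}

/* Combine a high (D800..DBFF) and low (DC00..DFFF) surrogate back into
 * the code point they represent. */
unsigned long from_surrogates(unsigned hi, unsigned lo)
{
	assert(hi >= 0xD800u && hi <= 0xDBFFu);
	assert(lo >= 0xDC00u && lo <= 0xDFFFu);
	return 0x10000UL + (((unsigned long)(hi - 0xD800u) << 10) | (lo - 0xDC00u));
}
```

For example, U+10005 splits into D800 and DC05, and combining those two values gives back 0x10005.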
	i still think supporting embedded 0 in source files is a waste of code. but up to you.		13:13.47
ator	avih: again you haven't read the source. I don't support embedded 0 in the source code file ... I'm supporting it in strings, like this: var myZeroString = "Hello\x00World!";		13:14.46
	or \u0000. not a literal 0-value byte in the source text.		13:15.47
avih	yeah, i've looked at it but didn't follow it, which is why i asked what it does, and from your replies i thought it only applies to source files. i probably understood wrong.		13:16.00
	anyway. i don't have a strong opinion about the embedded 0 and i think the other patch to support SMP chars is good enough according to what you said.		13:17.10
ator	fair enough. it's hard to convey the meaning in english :)		13:17.26
	language is too ambiguous.		13:17.33
avih	sorry, it was probably me not reading carefully enough. shit happens :)		13:18.50

Log of #mupdf at irc.freenode.net.