MuPDF IRC logs

	<<<Back 1 day (to 2020/02/18)	Fwd 1 day (to 2020/02/20)>>>	20200219
avih	ator: so any thoughts on what/if you'll do with CESU-8 API and/or js_loadfile? fwiw, the more i think about it, the more i think that internal representation as utf8 would be the best solution. i think it's the smallest leap from current code. strings are already a variable length encoding which requires scanning them during random access. making a utf8 codepoint above BMP be a virtual two surrogates on access/write is a relatively small leap, compared to testing for conversion on all APIs which use strings.		08:28.03
	(poop emojis et al are indeed rare, but less rare than they were 5 years ago)		08:31.30
	(and providing conversion utils, while better than nothing, still puts an additional meaningful burden on users which want their programs to handle supplementary codepoints correctly)		08:34.26
	also, i _think_ it should be relatively simple to make mujs allow 0 in strings. it will require few (definitely not all) places to consider len directly rather than strlen, and will require an additional api js_tolstring which will output the length as well, but otherwise all apis remain only aware of 0-terminated.		08:57.34
ator	avih: I think just using WTF-8 may be the best way. i.e. extend the utf.c to cover 4-byte sequences, and leave data as-is.		09:00.52
	avih: yeah, having embedded 0 will need a length counter in strings (could force use of the JS_TMEMSTR subtype for any strings with embedded 0)		09:01.52
avih	ator: so that "<one-codepoint-above-BMP>".length would be 1 ?		09:02.11
ator	but should be fairly easily doable		09:02.15
	avih: yes, if encoded as utf-8. or 2 if encoded as a surrogate pair.		09:02.43
	surrogate pairs are forbidden in strict utf-8, required in cesu-8, and wtf-8 doesn't care.		09:03.04
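A minimal sketch of the byte-level difference described here, using U+1F600 as the example. The function names are made up for illustration and are not MuJS APIs. `encodeUtf8` deliberately does not reject surrogate code points, which strict UTF-8 would.

```javascript
// Encode one code point (0..0x10FFFF) with the UTF-8 bit layout.
// Surrogates (0xD800..0xDFFF) are passed through as 3-byte sequences.
function encodeUtf8(cp) {
    if (cp < 0x80) return [cp];
    if (cp < 0x800) return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)];
    if (cp < 0x10000)
        return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
    return [0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
}

// CESU-8: a supplementary code point is first split into a surrogate
// pair, and each half is then encoded as its own 3-byte sequence.
function encodeCesu8(cp) {
    if (cp < 0x10000) return encodeUtf8(cp);
    var v = cp - 0x10000;
    var hi = 0xD800 + (v >> 10);
    var lo = 0xDC00 + (v & 0x3FF);
    return encodeUtf8(hi).concat(encodeUtf8(lo));
}

// Format a byte array as space-separated upper-case hex.
function hex(bytes) {
    var out = [];
    for (var i = 0; i < bytes.length; i++) {
        var h = bytes[i].toString(16).toUpperCase();
        out.push(h.length < 2 ? '0' + h : h);
    }
    return out.join(' ');
}
```

With these helpers, `hex(encodeUtf8(0x1F600))` gives `F0 9F 98 80` (4 bytes) and `hex(encodeCesu8(0x1F600))` gives `ED A0 BD ED B8 80` (6 bytes).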
avih	what if js code does String.fromCharCode(0x10000) ?		09:03.23
ator	node.js returns '\u0000' from that :)		09:04.20
	the spec says fromCharCode turns the number into a UInt16 first.		09:04.59
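The ToUint16 behaviour can be checked in any ES5-conforming engine. The sketch below only illustrates the arithmetic; `toUint16` is a hypothetical helper, not a built-in.

```javascript
// ToUint16 keeps only the low 16 bits of the (integer) argument.
function toUint16(n) {
    return n & 0xFFFF;
}

// String.fromCharCode applies ToUint16 to every argument, so a value
// above 0xFFFF wraps around instead of producing a supplementary character.
function fromCharCodeWraps(n) {
    return String.fromCharCode(n) === String.fromCharCode(toUint16(n));
}
```

So `String.fromCharCode(0x10000)` is `'\u0000'`, and `String.fromCharCode(0x10041)` is `'A'`.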
avih	i think the main point is keeping consistency between strings into mujs, strings out of mujs, and manipulating strings inside mujs		09:05.27
ator	yes, and I think wtf-8 will be the least surprising (and still allow emojis and other non-BMP text)		09:05.52
avih	if you allow s[N] to hold a codepoint rather than 16 bits, then this solves all the issues		09:05.55
	(at the expense of not complying with the spec)		09:06.23
ator	I'm fine with that, TBH		09:06.53
	compliant code and input will still work as expected		09:07.14
avih	atso make Rune 32 bits, allow fromCharCode(0x1000), and that charCodeAt(N) can return 0x1000 ?		09:09.00
	ator: well, compliant code which creates >BMP codepoint as surrogate will be read incorrectly at the C API		09:10.18
ator	not really, since the C api documentation will say WTF-8 :)		09:11.10
avih	i think the most consistent option is utf8 or wtf8 internal representation, and making codepoints above BMP a virtual surrogate. it does need consideration of how to handle ill-formed content, but it's possible		09:11.52
	and this will not break the ES5 spec, and also allow consistency of in/out/within		09:12.39
	basically, it will need a mapping of internal representation to virtual js string, and the other way around, both for arbitrary content		09:13.41
ator	javascript itself doesn't enforce properly encoded surrogate pairs though		09:14.25
avih	right		09:14.34
ator	so WTF-8 and no checking on the C side is probably for the best		09:14.34
avih	that's why it needs to work with arbitrary content, content being js string and internal representation		09:14.55
ator	that allows passing data in and out of strings properly (if also fixing the embedded 0)		09:15.01
avih	ator: i still didn't read the whole wtf8 spec. do you know if it can hold arbitrary binary content?		09:16.45
	well, hold and preserve		09:16.53
	(preserve as long as it's not slice and diced, i assume)		09:17.33
ator	"This specification defines WTF-8, a superset of UTF-8 that can losslessly represent arbitrary sequences of 16-bit code unit (even if ill-formed in UTF-16) but preserves the other well-formedness constraints of UTF-8." is the key sentence		09:18.36
avih	yes, but also "WTF-8 (Wobbly Transformation Format − 8-bit) is an encoding of code point sequences".		09:19.17
	so if you try to encode the c-string { 0x80, 0x80, 0x80, 0x80, 0x80, 0x80 } then it's not a representation of a codepoint		09:19.56
	it's possible to convert such string into a sequence of 0x80 codepoints, however that ends up encoded		09:20.29
	but then it would also need conversion on the way out.		09:21.28
ator	that c-string would end up being "\uFFFD\uFFFD\uFFFD...		09:21.46
avih	maybe conversions should only happen for js_pushlstring and js_tolstring. these are almost by definition binary APIs and not string ones. they just use the String class in js to hold the value		09:22.20
ator	if you wanted "\x80\x80\x80\x80\x80\x80" then you have to encode the code points using WTF-8 before passing to mujs		09:22.58
avih	exactly, wtf8 does not preserve arbitrary binary content, but any arbitrary content can be converted into a wtf8 which can be preserved and converted back into the original binary content		09:24.34
ator	yep.		09:25.00
	which is how you used to handle binary data in javascript before the TypedArray extensions, stick arbitrary data into its UTF-16 strings		09:25.38
	which don't actually require well formed UTF-16		09:25.49
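A sketch of the "binary data in a string" trick referred to here: each byte becomes one code unit in the range 0..0xFF, so the original bytes can always be recovered. The helper names are hypothetical. On the C side such a string would still have to be passed in an encoded form (for example, the code point U+0080 is the two bytes C2 80 in UTF-8).

```javascript
// Store each byte as one code unit (0..0xFF) of a JS string.
function bytesToString(bytes) {
    var s = '';
    for (var i = 0; i < bytes.length; i++)
        s += String.fromCharCode(bytes[i] & 0xFF);
    return s;
}

// Recover the bytes; anything above 0xFF was not produced by bytesToString.
function stringToBytes(s) {
    var out = [];
    for (var i = 0; i < s.length; i++) {
        var c = s.charCodeAt(i);
        if (c > 0xFF) throw new RangeError('not a byte string');
        out.push(c);
    }
    return out;
}
```

The round trip `stringToBytes(bytesToString(b))` returns the same bytes for any byte array, including sequences such as six 0x80 bytes that are not valid UTF-8 on their own.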
avih	ator: also, both wtf8 and cesu8 should not be used for communication, storage, etc. i don't think i've seen anything mentions library API though		09:26.19
ator	because it's all an afterthought, much like unicode handling in Java and Windows and everyone else who jumped on the unicode bandwagon early		09:26.22
avih	aye		09:27.17
	we know how we got here. but how do we handle it now? :)		09:27.33
ator	"oops, you mean you need MORE than 16 bit characters? we're so fucked!"		09:27.53
avih	lol		09:28.02
	so true		09:28.04
ator	640Kb is more than anyone will ever need!		09:28.35
avih	it's enough for mujs, most of the time :)		09:28.51
	2038 will never arrive		09:30.31
ator	btw, node.js doesn't support >16bit characters in its strings		09:31.21
	"we won't be alive when 2038 arrives so who cares?" seems like a more apt description		09:31.52
avih	that's ok. it's by spec. but i assume it does convert to/from utf8 when doing io, at the very least if it consists of well formed content		09:32.15
ator	duktape does what we propose to do here		09:33.21
avih	well, today most people can't hope to not be alive in 2038...		09:33.22
	which of the proposals?		09:33.51
ator	no, but the folks back in the 70's who made the initial assumption		09:34.06
avih	yup		09:34.13
ator	allowing >16bit chars in strings, using relaxed superset of utf-8		09:34.37
avih	interesting		09:34.54
ator	https://wiki.duktape.org/howtononbmpcharacters		09:35.45
avih	it _is_ the simplest approach, but you have to break the spec for that, or else you cannot slice and dice such strings in js code		09:35.56
	and it will break for "correct" js code which constructs surrogate pairs		09:36.24
	but i also think it's the simplest solution		09:36.42
	i.e. acceptable until a better solution comes up		09:36.59
ator	if you want JS code to handle strings, make sure to only push 16-bit code points in your C code		09:37.06
	we can add a helper js_pushcesu8 which converts any >BMP characters into surrogates		09:37.24
avih	yeah, that would be trivial. the problem is the other way around		09:37.59
ator	js_tostring? well, we could add a js_toutf8 that combines surrogate pairs but needs a place to store the memory		09:38.39
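The two conversions being discussed (splitting supplementary characters into surrogates, and combining pairs again) can be sketched on plain arrays of numbers. This shows only the logic, written in JavaScript for brevity; it is not the proposed C helpers, and the function names are made up. Unpaired surrogates are passed through unchanged, which is an assumption about how ill-formed input would be treated.

```javascript
// Code points -> 16-bit units: anything above the BMP becomes a surrogate pair.
function splitSupplementary(codePoints) {
    var units = [];
    for (var i = 0; i < codePoints.length; i++) {
        var cp = codePoints[i];
        if (cp >= 0x10000) {
            var v = cp - 0x10000;
            units.push(0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF));
        } else {
            units.push(cp);
        }
    }
    return units;
}

// 16-bit units -> code points: a high surrogate followed by a low surrogate
// is combined; any lone surrogate is kept as-is.
function combineSurrogates(units) {
    var cps = [];
    for (var i = 0; i < units.length; i++) {
        var u = units[i];
        var next = units[i + 1]; // undefined past the end, fails the range test
        if (u >= 0xD800 && u <= 0xDBFF && next >= 0xDC00 && next <= 0xDFFF) {
            cps.push(0x10000 + ((u - 0xD800) << 10) + (next - 0xDC00));
            i++; // skip the low surrogate just consumed
        } else {
            cps.push(u);
        }
    }
    return cps;
}
```

For example, `splitSupplementary([0x48, 0x1F600])` yields `[0x48, 0xD83D, 0xDE00]`, and `combineSurrogates` maps that back to `[0x48, 0x1F600]`.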
avih	also, this would invalidate most APIs which take char*		09:38.48
ator	there I don't follow you		09:39.07
avih	i.e. can't use these apis anymore with utf8 content. you have to push as cesu8 first, then run js code which uses it		09:39.33
	or maybe do some jumping through hoops like js_pushcesu8(utf8); cesu8 = js_tostring(J, -1); /* now you can't pop it */ js_setproperty(J, cesu8)		09:41.33
ator	js_pushstring_as_cesu() and js_tostring_from_cesu()		09:41.39
	like your wrappers that you posted a while ago		09:41.51
avih	push/to string is not enough. you have property names too		09:42.17
ator	oh, you mean the property names		09:42.20
avih	you have error objects, etc		09:42.31
ator	yeah. and then you have malloc issues.		09:42.55
	or the API becomes more complicated		09:43.08
avih	yes.		09:43.18
	just look how many APIs use const char * at mujs.h		09:43.58
ator	but why would it matter for property names -- either you have literal strings or it's data that you use opaquely		09:44.18
avih	i think not, but let me think about it a bit more. i'm pretty sure in mpv context you want to move property names back and forth between mpv and mujs, where in mpv they're utf8		09:45.35
	granted, you're not likely to want to slice and dice such names in js code though.		09:46.10
ator	and also, would it matter in mpv that the .length of a utf-8 string is actual characters rather than code points?		09:46.13
	surrogate pairs are madness, and being able to avoid them must be a good thing all things considered		09:46.48
avih	currently mpv does not access the .length property of a string, because there's no api which outputs it as js_tostring		09:46.59
	(and i assumed, though now i think i could have been wrong) that data is clipped at '\0' inside mujs		09:47.39
	but iirc i looked at js_pushlstring recently and it doesn't crop it iirc		09:48.05
ator	the intent is to support them, but there are bits of the code that still terminate at 0		09:48.57
avih	ator: how about this: rune can be >16bits, but fromCharCode doesn't allow >16bits. if this can happen while substrings etc don't corrupt the data, it could work		09:50.00
	so basically, the limitation is that js code which constructs or assumes surrogate pairs will work incorrectly at the C-API user, but everything else will work transparently		09:52.03
ator	js_pushlstring is currently used to push strings without a 0-terminator		09:52.05
	but eventually should be able to be used to push strings with embedded 0 too		09:52.16
avih	yes, i'm looking at jsV_newmemstring and it doesn't care about termination. it just copies n bytes		09:54.18
ator	we're halfway there. need to add a length counter to memstrings, and fix the lexer to return memstrings and not always interned strings.		09:54.51
avih	the only missing thing is js_tolstring. though it could probably be worked around by reading the .length property and doing some work with it. maybe.		09:55.06
ator	and then look at the length counter in some jsstring.c functions		09:55.14
	js_tolstring will be added		09:55.27
avih	ator: i still think that extending the virtuality of String to handle CP >= U+10000 at the underlying utf8 data is the best solution. it doesn't need the spec to be broken, it can handle utf8 input and output, surrogate pairs expectation works for js code, etc. it can be a bit delicate in handling of ill-formed underlying data, but that's about it		09:59.53
	String is already virtual anyway. you don't have real random access, so just extend it a bit.		10:00.51
ator	ill-formed underlying data is what prevents us from doing automatic surrogate pair splitting/combining		10:02.26
	because ill-formed data is very much allowed by the spec		10:02.41
avih	(also delicate in handling ill-formed utf8 code points from the js-string side)		10:02.45
	utf16 *		10:02.53
ator	javascript strings are only utf16 in name -- in reality they're just arrays of unsigned short		10:03.24
avih	yes, if ill-formed data was disallowed, there would be no issue. but i don't think it's not solvable		10:03.33
ator	mujs shouldn't (IMO) enforce a policy.		10:04.03
avih	yes, as i said ill-formed utf8, and js string data which would be considered ill-formed if treated as utf16		10:04.20
ator	the current code only allows 16-bit values in its strings, and shows them as utf8-ish at the C api level		10:04.55
avih	so these cases should definitely not be ignored. which is exactly the delicate part. but it's well defined and well contained in scope		10:05.03
ator	I'm proposing to allow 21-bit values in strings, and let the user beware		10:05.19
	so naive passing of non-BMP UTF8 will go unnoticed		10:05.45
	javascript code itself can't create strings with >16bit characters (fromCharCode etc prevent it)		10:06.00
avih	and the tradeoff is broken spec, yes?		10:06.04
	because charCodeAt (and maybe also fromCharCode) can be >= 0x1000 ?		10:06.22
ator	tradeoff is we support more than the spec allows		10:06.23
avih	0x10000 *		10:06.52
ator	given input data that conforms (up to the C api user to check when calling js_pushstring if he/she cares)		10:06.56
	then nothing should be a problem		10:07.07
	yes, charCodeAt can be >0x1000 in my proposal		10:07.27
	(and I wouldn't be too averse to allowing fromCharCode to allow >0x1000 either)		10:07.50
avih	well, it's definitely the easiest solution, code wise		10:07.53
	(not suggesting in a bad way)		10:08.09
ator	if you just want to pass emoji strings back and forth, this is the way. only legacy APIs use utf-16 and surrogate pairs these days.		10:08.28
pink_mist	shouldn't it be 0x10000 not 0x1000?		10:08.40
ator	> 0xffff		10:08.59
	so yes, >= 0x10000		10:09.09
avih	pink_mist: yes, but 0x1000 is a lot easyer to ttoe, and we both understand it means 0x10000 :)		10:09.10
	easier*		10:09.18
pink_mist	right :P		10:09.21
avih	type* (grrr)		10:09.23
	i suggest ABMP (above BMP)		10:11.29
	ator: will you work on it at some branch? i still want to think about extending the virtuality to include underlying ABMP		10:13.12
	(i.e. without breaking the spec)		10:13.38
ator	avih: see top commit on tor/master		10:15.53
	that's all that's needed now		10:16.05
	supporting embedded 0 is going to be a lot trickier		10:16.11
	because of all the builtin functions that also need to support them, that currently just use C strings		10:16.36
	SMP, SIP, and SSP		10:17.57
avih	(looking)		10:19.21
	ator: what's "SMP, SIP, and SSP" ?		10:19.49
ator	supplementary multilingual plane, supplementary ideographic plane, supplementary special-purpose plane		10:20.51
avih	aka ABMP ?		10:21.04
ator	yeh		10:21.12
	just SMP will do when communicating		10:21.24
avih	right		10:21.32
ator	'supplementary characters' are those outside the BMP in unicode lingo		10:21.41
avih	aye. SMP is fine		10:21.53
	ator: at the new docs, what kind of attention/actions are required for "but requires attention when passing strings using supplementary unicode characters to and from the MuJS library"		10:22.53
	it's important that client know what they should do to handle it as best as they can		10:23.45
	maybe "this maintains compatibility with valid UTF-8 at the C-API, at the expense that JS code will not see SMP as surrogate pairs. similarly, JS code which construct surrogate pairs will not be converted to UTF-8 at the C-API"		10:26.07
sebras	ator: cgdae noticed in Makefile we have source/pdf/cmaps/%.h: resources/cmaps/% scripts/cmapdump.py since git controls in what order it checks out files can we really be certain about the dates and whether that rule is run?		10:32.21
ator	avih: I still need to write that up properly.		10:41.55
	avih: that's a good summary.		10:42.15
	sebras: we check in the results, only if you forgot to commit the changes should that matter, no?		10:47.30
	but forgetting to commit the autogenerated files would be twice as bad in this scenario, yes!		10:48.37
sebras	ator: we do, but make is triggered by timestamps.		10:48.43
	ator: since git doesn't keep timestamps then nothing prevents the rule from being triggered if cmapdump.py is younger than the checked in .h-files.		10:49.21
ator	in a clean checkout, the files "should" be in sync and not need generating		10:49.25
	right, you mean we generate the file needlessly from a clean checkout?		10:49.54
sebras	ator: exactly, but if the timestamp of cmapdump.py happens to be later than those of the headers then the rule would be run, no?		10:49.56
ator	harmless I think :)		10:50.00
sebras	ator: not if you don't have python installed on your system.		10:50.10
ator	but maybe not if you don't have python3 installed		10:50.10
sebras	in the oss-fuzz environment there is no python binary, but there is a python3 one.		10:50.42
ator	we run python3, or should		10:50.54
	I believe I converted all the scripts to be python3		10:51.02
	(or python2 and 3 compatible)		10:51.17
	the #! says python3		10:51.24
	and the makefile calls python3		10:51.53
sebras	oh! but we were running an earlier commit!		10:52.09
	where it just said python in the Makefile.		10:52.21
ator	in that case you're hosed :D		10:52.23
sebras	yes, but should we even needlessly run cmapdump.py in a clean checkout?		10:52.45
ator	do you know of a way to prevent it?		10:53.26
avih	ator: thx. so the docs would need to specify these IMO: 1. what's the best way to use strings at the API (convert to/from cesu8?) 2. what are the limitations of this best way (no middle '\0', non valid CESU8 will not be seen correctly at JS code? cannot use substrings? etc), 3. what are the implications of using utf8 at the C api		10:54.11
ator	(other than forcibly removing the dependencies and require a separate step to regenerate them)		10:54.12
	avih: see tor/master for a slightly reworded documentation		11:05.07
sebras	ator: since we have the headers checked in, why can't we just: generate-cmaps: resources/cmaps/% scripts/cmapdump.py		11:05.48
	ator: or something along those lines.		11:05.56
ator	sebras: because you can forget to regenerate them when you change them.		11:06.07
	sebras: the same is true of the javascript and font dumps too, btw.		11:06.52
sebras	ator: you mean when we update them with a new version from upstream?		11:06.56
ator	which are updated more often than the cmaps		11:07.10
sebras	true.		11:07.14
	well, then we need python3.		11:07.27
ator	also, generate-cmaps: wouldn't work because of tracking dependencies		11:07.39
	the c files that include the generated header files depend on them, etc.		11:07.55
	so for your proposal, we'd have to disconnect the dependencies		11:08.34
	and that would mean having to 'make clean' everytime you change a file that is autogenerated		11:08.51
	(like we used to, and that led to many build skew errors)		11:09.03
	so yes, we need python3!		11:09.19
avih	ator: also, i think the docs can recommend (maybe with provided utils) to handle arbitrary binary blobs as BCD or base64 strings. this guarantees perfect handling of arbitrary data.		11:15.54
	while the plain C API assume utf8 or cesu8 or whatever you specify		11:16.29
sebras	ator: ok.		11:18.53
avih	or just plain hex where "\xDE\xED" is stored as the normal string "DEAD" in JS		11:20.11
	"\xDE\xAD" -> "DEAD"		11:20.33
ator	avih: if you want to process raw data, there's the TypedArray extension that I've been working on		12:33.47
avih	yeah, i saw you pushed it earlier (currently i can live without it)		12:48.22
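The plain-hex idea mentioned above (the bytes DE AD stored as the ordinary string "DEAD") can be sketched as follows. The helper names are hypothetical and nothing here is part of MuJS; because the result is pure ASCII, it is unaffected by the UTF-8/CESU-8/WTF-8 question.

```javascript
// Bytes -> upper-case hex string, two digits per byte.
function bytesToHex(bytes) {
    var digits = '0123456789ABCDEF';
    var s = '';
    for (var i = 0; i < bytes.length; i++) {
        var b = bytes[i] & 0xFF;
        s += digits.charAt(b >> 4) + digits.charAt(b & 0x0F);
    }
    return s;
}

// Hex string -> bytes; rejects odd lengths and non-hex characters.
function hexToBytes(hex) {
    if (hex.length % 2 !== 0 || !/^[0-9A-Fa-f]*$/.test(hex))
        throw new Error('not a valid hex string');
    var bytes = [];
    for (var i = 0; i < hex.length; i += 2)
        bytes.push(parseInt(hex.substr(i, 2), 16));
    return bytes;
}
```

Here `bytesToHex([0xDE, 0xAD])` gives `"DEAD"`, and `hexToBytes("DEAD")` gives back `[222, 173]`. The cost is that the string is twice as long as the data.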
	ator: in general though, we currently need mujs 1.0 for mpv. i don't think i want to require a minimum mujs version later than 1-below the latest		12:51.39
	so if there's new which mpv will use, it will have to be there at least some while		12:52.30
	new API *		12:52.36
	ator: btw, what did/do you think of not interning small property names (which covers array indices if set to e.g. 12) like so https://0x0.st/isYD.txt		14:30.19
	ator: in your current master (with SMP handling), something is not right. i have such emoji at the source file, the s[N] enumerate "correctly" (one char is SMP), but it seems that something breaks at js_tostring. i _think_ i'm not getting a correct utf8 sequence		14:45.40
ator	avih: okay... can you provide an example?		15:39.53
avih	ator: not yet. but i think print("<SMP>") at a source file on disk via mujs would reproduce it. but didn't try yet.		18:24.39
	ator: yeah, reproducible with |mujs <this-file>| :		18:31.15
	var s = "Hello 😀!\n";		18:31.16
	for (var i = 0; i < s.length; i++)		18:31.16
	print(s.charCodeAt(i));		18:31.16
	print(s);		18:31.16
	so for that emoji it prints 62976 for charCodeAt, but the output doesn't show correctly (and it does it i cat <this-file>)		18:32.11
	if i cat *		18:32.26
	i.e. the output of print(s) is incorrect.		18:34.34
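For comparison, a sketch of what a strictly spec-conforming engine reports for the same emoji, which it stores as a surrogate pair. The reported value 62976 is 0xF600, the low 16 bits of U+1F600. That may hint at a 16-bit truncation somewhere in the path, but this is only a guess and the log does not confirm it.

```javascript
// Collect the 16-bit code units of a string.
function charCodes(str) {
    var codes = [];
    for (var i = 0; i < str.length; i++)
        codes.push(str.charCodeAt(i));
    return codes;
}

var emoji = '\uD83D\uDE00';      // U+1F600 written as an explicit surrogate pair
var lowBits = 0x1F600 & 0xFFFF;  // low 16 bits of the code point
```

In such an engine `charCodes(emoji)` is `[55357, 56832]`, while `lowBits` is 62976, the value seen in the report.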

Log of #mupdf at irc.freenode.net.