MuPDF IRC logs

	<<<Back 1 day (to 2020/02/24)	Fwd 1 day (to 2020/02/26)>>>	20200225
avih	ator: how much do you care about ABI? when splitting the API to cesu8 and utf8 variants, one of two (i think) approaches can be taken: 1. js_foo becomes both js_c_foo and js_u_foo, where clients can choose one explicitly or use js_foo which is defined to one of those per their choice via mujs.h, however, this breaks ABI because js_foo is no longer in the lib. 2. cesu8 variants keep their existing name (e.g. js_foo), while the utf8 variants are e.g. js_u_foo.		10:36.11
	for clients which prefer, it can define all js_foo to js_u_foo for automatic utf8 API, but then the cesu8 declarations are not available. as a workaround, they can use another C file which prefers the cesu8 APIs (via mujs.h), and at that file js_foo is cesu8 and js_u_foo is utf8. i think i currently prefer the latter approach, even if it's not possible to have one c file which defaults to utf8 APIs but where the cesu8 APIs are also accessible.		10:36.11
	the problem with 2 though is that it's a not unlikely scenario IMO that a client wants to use utf8 APIs by default, but use the cesu8 variant when it knows that no conversion is required (e.g. pushing most literals). so it can be awkward to use.		10:50.17
	i'd love to have a solution where the cesu8 apis keep their ABI name, and clients can choose to map these names to the utf8 variants, but keep the cesu8 APIs accessible, but i don't know if that's possible.		10:51.47
	well, i guess i could keep the original name, and add two actual functions js_c_foo and js_u_foo, where js_c_foo is a straight wrapper to js_foo...		11:16.33
	so: type js_foo(...) { actual implementation }; type js_c_foo(...) { return js_foo(...); }; type js_u_foo(...) { utf8 wrapper to js_foo }		11:19.51
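The three-name layout described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the `toy_*` functions and the `TOY_DEFAULT_UTF8` switch are hypothetical stand-ins for the js_foo family and for whatever opt-in mujs.h might offer, and are not part of MuJS or of the patch under discussion. The sketch also assumes well-formed input.

```c
#include <assert.h>
#include <stddef.h>

/* All toy_* names are hypothetical stand-ins for the js_foo family. */

/* Original entry point: name and symbol are unchanged, so the ABI is kept.
 * Counts UTF-16 code units in a CESU-8 string: every non-continuation
 * byte starts one 16-bit unit. */
size_t toy_length(const char *cesu8)
{
	size_t n = 0;
	assert(cesu8 != NULL);
	for (; *cesu8; cesu8++)
		if (((unsigned char)*cesu8 & 0xC0) != 0x80)
			n++;
	return n;
}

/* Explicit CESU-8 name: a straight wrapper around the original. */
size_t toy_c_length(const char *cesu8)
{
	return toy_length(cesu8);
}

/* UTF-8 variant: wraps the original and adds one unit for each 4-byte
 * sequence, since that sequence would become a surrogate pair (two
 * units) after conversion. Assumes well-formed UTF-8. */
size_t toy_u_length(const char *utf8)
{
	size_t extra = 0;
	const char *p;
	assert(utf8 != NULL);
	for (p = utf8; *p; p++)
		if ((unsigned char)*p >= 0xF0)
			extra++;
	return toy_length(utf8) + extra;
}

/* Opt-in mapping a client could request. TOY_DEFAULT_UTF8 is a made-up
 * switch for this sketch. The explicit toy_c_length name stays usable. */
#ifdef TOY_DEFAULT_UTF8
#define toy_length toy_u_length
#endif
```

With this layout a client that maps the plain name to the UTF-8 variant can still reach the CESU-8 function through the explicit `toy_c_length` name, and the original symbol remains exported by the library.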
ator	the more I've thought about this, the less I like automatically changing the string contents. this conversion will have to be explained to folks who go "utf-what?" which will be the majority of users...		11:32.27
avih	interesting. but it's mujs which expects a non standard string format. users expect the api to be utf8, even if they don't know what utf8 is.		11:33.44
ator	I think, adding support for >16-bit characters in mujs strings (like duktape) and providing 'to-surrogate-pair' and 'from-surrogate-pair' variants of public functions		11:33.59
avih	they will go WTF even more when you tell them the strings should be CESU-8 or WTF-8		11:34.16
ator	will be the least surprising api, and still allow people who want their js scripts to work with surrogate pairs to do so		11:34.34
	but most people who just want to pass utf-8 strings back and forth between their scripting engine, won't care about the nuances		11:34.51
	they can just go on being oblivious -- the stuff they pass in will look like they expect from JS and come back the same way		11:35.20
avih	ator: will these functions work on strings or on codepoints?		11:35.29
	but it won't look like they expect in JS if they pass in a poop emoji		11:36.28
ator	js_pushstring_encode_surrogates and js_tostring_decode_surrogates		11:36.52
	depends on what they expect, which I think will vary widely from programmer to programmer :)		11:37.15
avih	ok, so basically you want variants only of pushstring (et al) and tostring ?		11:37.20
	what about property names, error messages, etc?		11:37.44
	ator: keep in mind that with my approach JS code has zero performance penalty or code changes, because mujs itself only uses the cesu8 api which is exactly what it was before. it's only external api which goes through automatic conversion, and only if required to convert		11:39.13
ator	js_encode_surrogates() and js_decode_surrogates() then		11:39.19
	I don't like having to use malloc for the conversion		11:40.26
avih	effectively all the mujs code remains identical as before, and users see the utf8 wrappers by default via mujs.h which maps their js_foo to js_u_foo		11:40.29
ator	and I also prefer not changing the API		11:41.22
avih	well, conversions would be rare. but if they're required, how can you avoid malloc when cesu8 strings are longer than utf8? the other way around it can be converted in-place though		11:41.30
	i'm not changing the api		11:41.39
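The in-place direction mentioned above works because a surrogate pair takes 6 bytes in CESU-8 while the same code point takes 4 bytes in UTF-8, so the write position never overtakes the read position. A minimal sketch, assuming a NUL-terminated writable buffer; `cesu8_to_utf8_inplace` and `surrogate_at` are hypothetical names for illustration and are not the functions from the patch:

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if p points at a 3-byte encoding of a surrogate in the range
 * [lo, lo + 0x3FF] and stores its value in *out. The short-circuit
 * checks never read past a terminating NUL. */
static int surrogate_at(const unsigned char *p, unsigned lo, unsigned *out)
{
	unsigned v;
	if (p[0] != 0xED || (p[1] & 0xC0) != 0x80 || (p[2] & 0xC0) != 0x80)
		return 0;
	v = 0xD000u | ((p[1] & 0x3Fu) << 6) | (p[2] & 0x3Fu);
	if (v < lo || v > lo + 0x3FFu)
		return 0;
	*out = v;
	return 1;
}

/* Rewrites each valid surrogate pair (6 bytes) as one 4-byte UTF-8
 * sequence, in place. Unpaired surrogates are copied unchanged.
 * Returns the new length in bytes. */
size_t cesu8_to_utf8_inplace(char *s)
{
	unsigned char *r = (unsigned char *)s;
	unsigned char *w = r;
	unsigned hi, lo;
	assert(s != NULL);
	while (*r) {
		if (surrogate_at(r, 0xD800u, &hi) && surrogate_at(r + 3, 0xDC00u, &lo)) {
			unsigned long cp = 0x10000ul
				+ ((unsigned long)(hi - 0xD800u) << 10)
				+ (lo - 0xDC00u);
			*w++ = (unsigned char)(0xF0 | (cp >> 18));
			*w++ = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
			*w++ = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
			*w++ = (unsigned char)(0x80 | (cp & 0x3F));
			r += 6;
		} else {
			*w++ = *r++;
		}
	}
	*w = 0;
	return (size_t)(w - (unsigned char *)s);
}
```

The opposite direction (UTF-8 to CESU-8) grows the string by 2 bytes per astral code point, which is why it needs a second buffer unless the caller reserved room in advance.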
ator	but yes, adding a set of wrapper functions for property names too (though I see little use case for it)		11:41.44
	and keeping the same 16-bit strings internally would be one possible way forward, like your proposal		11:42.05
avih	so that's my patch exactly. all the mujs code remains identical, and the _u_ APIs are wrappers which convert if required, but not used internally by mujs code		11:42.33
ator	I'm just not convinced it's worth the trouble. allowing poop emoji to be one character in JS and be oblivious could just be simpler and more in line with what people desire.		11:42.53
avih	effectively, all the changes at the patch i posted could be entirely client-side. they don't modify existing code, they just add wrappers which use the public API		11:43.36
ator	of course, we could do both. have _u_ (or _surrogate_) wrappers and allow >16-bit strings		11:43.41
avih	my patch does both (it adds conversion utils at the first commit)		11:44.26
ator	I don't see how you handle js_tostring in your patch		11:45.33
avih	(however, utf8 variants of tostring and next_iterator could not be client side)		11:45.55
	ator: yes, it's not yet at that patch. it's in later code. here it is at my WIP: https://0x0.st/iq0P.txt		11:47.18
	(normalize_utf8 is renamed from write_utf8_inplace, which converts inplace if required)		11:48.11
	ator: i can move almost all the _u_ wrappers to utf8_wrapper.c to make more obvious that existing code does not change, but mujs.h still somehow has to support access to both, generated or not.		12:05.21
	and if we don't want to allow mapping js_foo to js_u_foo automatically, then much of the kludge goes away		12:07.34
	there would be just js_foo and js_u_foo, pick your poison, but then code which used js_foo so far will need to either change all js_foo to js_u_foo, or define the mapping themselves		12:09.12
	and ensure to only do that if the lib supports the _u_ variants		12:09.53
	while my solution covers everything automatically		12:10.02
	existing code would magically become correct when recompiled and linked against a new lib, or remain as is when linked with older lib.		12:11.43
ator	maybe just throw out the baby and the bathtub, or how the idiom goes, and use 16-bit strings internally and always convert to/from in the public API.		12:13.02
	we're still missing 0 character codes		12:13.30
avih	ator: re 16 bits, i don't think so. i expect that in the vast vast vast majority of cases no conversion or allocation is required. re 2, correct.		12:14.47
ator	quickjs converts to surrogate pairs when creating strings		12:15.22
	and pairs proper surrogate pairs when passing to C, passing unpaired ones unchanged		12:16.28
avih	and this conversion test is very very fast. i still didn't benchmark currently, but previously i tested it in mpv and barely added 5% call overhead. and only for external clients at the API level		12:16.45
	running JS code has zero penalty		12:17.22
	ator: that's what my wrappers do too (re quickjs, both ways), except usually no actual conversion is required		12:18.53
ator	ah, there's a 'bool cesu8' argument to quickjs's ToCString function		12:19.54
avih	(and fixing js_dofile et al to consider source on disk as utf8 is 2 chars change at some inner function)		12:20.12
	just one js_pushstring becomes a js_u_pushstring		12:20.36
	ator: the quickjs api is highly painful in general. the client has to do all the cleanups, and it's a huge PITA		12:21.29
	ator: my test and conversion only check for valid things which can be converted back and forth losslessly. this means that unpaired surrogates are not considered convertible, and similarly encodings of codepoints above U+10FFFF in utf8 are also not considered convertible, and passed as is		12:24.31
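The "test first, convert only if required" idea on the UTF-8 side can be sketched as a single scan. This is an assumption-laden illustration rather than the code from the patch: `utf8_needs_conversion` is a hypothetical name, and the rule it applies is the one stated above, namely that only well-formed 4-byte sequences for U+10000..U+10FFFF count as convertible and everything else is passed through as is.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the string contains at least one well-formed 4-byte UTF-8
 * sequence encoding U+10000..U+10FFFF, i.e. something that would have to
 * become a surrogate pair. Returns 0 when the bytes can be used as is.
 * The short-circuit checks never read past the terminating NUL. */
int utf8_needs_conversion(const char *s)
{
	const unsigned char *p = (const unsigned char *)s;
	assert(s != NULL);
	for (; *p; p++) {
		if ((p[0] & 0xF8) == 0xF0 &&
		    (p[1] & 0xC0) == 0x80 &&
		    (p[2] & 0xC0) == 0x80 &&
		    (p[3] & 0xC0) == 0x80) {
			unsigned long cp = ((p[0] & 0x07ul) << 18)
				| ((p[1] & 0x3Ful) << 12)
				| ((p[2] & 0x3Ful) << 6)
				| (p[3] & 0x3Ful);
			/* Overlong forms and values above U+10FFFF are not convertible. */
			if (cp >= 0x10000ul && cp <= 0x10FFFFul)
				return 1;
		}
	}
	return 0;
}
```

For ASCII and BMP-only text the scan finds nothing and no allocation or copy is needed, which matches the expectation in the discussion that conversions are rare.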
ator	yeah, it doesn't look like it's optimized for ease of creating extensions/embedding like lua and tcl		12:24.54
avih	ator: do you like this better https://0x0.st/iqDy.txt ? it's the same code just arranged differently. mujs.h is not generated anymore (but with some new lines in it) and almost all the new code is at utf8wrap.c		15:20.39
	i think this is only missing next_iterator and do_file which should use utf8 from disk. there are potential optimizations with one less allocation at pushstring et al but this patch still works fine, and the optimization only matters if a conversion is performed, which should be very rare anyway.		15:22.30
	i think the code is reasonable and quick and well contained and the user can opt out if they want. the pain here is the declarations at mujs.h, only because it has so much duplication		15:31.36
	the generated mujs.h avoided that pain relatively simply, but, generating things can be a pain on its own.		15:32.20

Log of #mupdf at irc.freenode.net.