| <<<Back 1 day (to 2020/02/18) | Fwd 1 day (to 2020/02/20)>>> | 20200219 |
avih | ator: so any thoughts on what/if you'll do with CESU-8 API and/or js_loadfile? fwiw, the more i think about it, the more i think that internal representation as utf8 would be the best solution. i think it's the smallest leap from current code. strings are already variable length encoding which requires scanning them during random access. making a utf8 codepoint above BMP be a virtual two surrogates on access/write is relatively small leap, compared to testing | 08:28.03 |
| for conversion on all APIs which use strings. | 08:28.03 |
| (poop emojis et al are indeed rare, but less rare than they were 5 years ago) | 08:31.30 |
| (and providing conversion utils, while better than nothing, still puts an additional meaningful burden on users which want their programs to handle supplementary codepoints correctly) | 08:34.26 |
| also, i _think_ it should be relatively simple to make mujs allow 0 in strings. it will require few (definitely not all) places to consider len directly rather than strlen, and will require an additional api js_tolstring which will output the length as well, but otherwise all apis remain only aware of 0-terminated. | 08:57.34 |
ator | avih: I think just using WTF-8 may be the best way. i.e. extend the utf.c to cover 4-byte sequences, and leave data as-is. | 09:00.52 |
| avih: yeah, having embedded 0 will need a length counter in strings (could force use of the JS_TMEMSTR subtype for any strings with embedded 0) | 09:01.52 |
avih | ator: so that "<one-codepoint-above-BMP>".length would be 1 ? | 09:02.11 |
ator | but should be fairly easily doable | 09:02.15 |
| avih: yes, if encoded as utf-8. or 2 if encoded as a surrogate pair. | 09:02.43 |
| surrogate pairs are forbidden in strict utf-8, required in cesu-8, and wtf-8 doesn't care. | 09:03.04 |
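To make the difference concrete, here is a minimal sketch of the two encodings of one supplementary code point. The function names are hypothetical and are not part of utf.c or the MuJS API. The first encoder is "relaxed" in that it does no surrogate check; a strict UTF-8 encoder would additionally reject U+D800..U+DFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point (0..0x10FFFF) as generalized
 * UTF-8 with no surrogate check. Returns the number of bytes written (1..4). */
size_t encode_utf8_relaxed(uint32_t cp, unsigned char *out)
{
	assert(out != NULL && cp <= 0x10FFFF);
	if (cp < 0x80) {
		out[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {
		out[0] = (unsigned char)(0xC0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp < 0x10000) {
		out[0] = (unsigned char)(0xE0 | (cp >> 12));
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	}
	out[0] = (unsigned char)(0xF0 | (cp >> 18));
	out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
	out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
	out[3] = (unsigned char)(0x80 | (cp & 0x3F));
	return 4;
}

/* Hypothetical helper: CESU-8 splits a supplementary code point into a
 * UTF-16 surrogate pair and encodes each half as 3 bytes (6 bytes total). */
size_t encode_cesu8(uint32_t cp, unsigned char *out)
{
	size_t n;
	assert(out != NULL && cp <= 0x10FFFF);
	if (cp < 0x10000)
		return encode_utf8_relaxed(cp, out);
	cp -= 0x10000;
	n = encode_utf8_relaxed(0xD800 + (cp >> 10), out);
	return n + encode_utf8_relaxed(0xDC00 + (cp & 0x3FF), out + n);
}
```

For U+1F600 the first function writes F0 9F 98 80 and the second writes ED A0 BD ED B8 80, so the same character has a different byte length depending on which convention the C side uses.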
avih | what if js code does String.fromCharCode(0x10000) ? | 09:03.23 |
ator | node.js returns '\u0000' from that :) | 09:04.20 |
| the spec says fromCharCode turns the number into a UInt16 first. | 09:04.59 |
avih | i think the main point is keeping consistency between strings into mujs, strings out of mujs, and manipulating strings inside mujs | 09:05.27 |
ator | yes, and I think wtf-8 will be the least surprising (and still allow emojis and other non-BMP text) | 09:05.52 |
avih | if you allow s[N] to hold a codepoint rather than 16 bits, then this solves all the issues | 09:05.55 |
| (at the expense of not complying with the spec) | 09:06.23 |
ator | I'm fine with that, TBH | 09:06.53 |
| compliant code and input will still work as expected | 09:07.14 |
avih | ator: so make Rune 32 bits, allow fromCharCode(0x1000), and that charCodeAt(N) can return 0x1000 ? | 09:09.00 |
| ator: well, compliant code which creates >BMP codepoint as surrogate will be read incorrectly at the C API | 09:10.18 |
ator | not really, since the C api documentation will say WTF-8 :) | 09:11.10 |
avih | i think the most consistent option is utf8 or wtf8 internal representation, and making codepoints above BMP a virtual surrogate. it does need consideration of how to handle ill-formed content, but it's possible | 09:11.52 |
| and this will not break the ES5 spec, and also allow consistency of in/out/within | 09:12.39 |
| basically, it will need a mapping of internal representation to virtual js string, and the other way around, both for arbitrary content | 09:13.41 |
ator | javascript itself doesn't enforce properly encoded surrogate pairs though | 09:14.25 |
avih | right | 09:14.34 |
ator | so WTF-8 and no checking on the C side is probably for the best | 09:14.34 |
avih | that's why it needs to work with arbitrary content, content being js string and internal representation | 09:14.55 |
ator | that allows passing data in and out of strings properly (if also fixing the embedded 0) | 09:15.01 |
avih | ator: i still didn't read the whole wtf8 spec. do you know if it can hold arbitrary binary content? | 09:16.45 |
| well, hold and preserve | 09:16.53 |
| (preserve as long as it's not slice and diced, i assume) | 09:17.33 |
ator | "This specification defines WTF-8, a superset of UTF-8 that can losslessly represent arbitrary sequences of 16-bit code unit (even if ill-formed in UTF-16) but preserves the other well-formedness constraints of UTF-8." is the key sentence | 09:18.36 |
avih | yes, but also "WTF-8 (Wobbly Transformation Format − 8-bit) is an encoding of code point sequences". | 09:19.17 |
| so if you try to encode the c-string { 0x80, 0x80, 0x80, 0x80, 0x80, 0x80 } then it's not a representation of a codepoint | 09:19.56 |
| it's possible to convert such string into a sequence of 0x80 codepoints, however that ends up encoded | 09:20.29 |
| but then it would also need conversion on the way out. | 09:21.28 |
ator | that c-string would end up being "\uFFFD\uFFFD\uFFFD... | 09:21.46 |
avih | maybe conversions should only happen for js_pushlstring and js_tolstring. these are almost by definition binary APIs and not string ones. they just use the String class in js to hold the value | 09:22.20 |
ator | if you wanted "\x80\x80\x80\x80\x80\x80" then you have to encode the code points using WTF-8 before passing to mujs | 09:22.58 |
avih | exactly, wtf8 does not preserve arbitrary binary content, but any arbitrary content can be converted into a wtf8 which can be preserved and converted back into the original binary content | 09:24.34 |
ator | yep. | 09:25.00 |
| which is how you used to handle binary data in javascript before the TypedArray extensions, stick arbitrary data into its UTF-16 strings | 09:25.38 |
| which don't actually require well formed UTF-16 | 09:25.49 |
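The round trip described above (arbitrary bytes converted into code points, encoded, and converted back) could look roughly like the following sketch. It maps each byte to the code point U+0000..U+00FF; the function names are hypothetical and nothing here is an existing MuJS API. Note that a 0 byte still encodes as a 0 byte, so this alone does not get around the embedded 0 limitation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encode each input byte as the code point with the same
 * value, in UTF-8. out needs room for 2*n bytes. Returns bytes written. */
size_t bytes_to_utf8(const unsigned char *in, size_t n, unsigned char *out)
{
	size_t i, o = 0;
	assert(in != NULL && out != NULL);
	for (i = 0; i < n; i++) {
		if (in[i] < 0x80) {
			out[o++] = in[i];
		} else {
			out[o++] = (unsigned char)(0xC0 | (in[i] >> 6));
			out[o++] = (unsigned char)(0x80 | (in[i] & 0x3F));
		}
	}
	return o;
}

/* Hypothetical inverse: accepts only what bytes_to_utf8 can produce.
 * Returns bytes written, or (size_t)-1 if the input contains anything else. */
size_t utf8_to_bytes(const unsigned char *in, size_t n, unsigned char *out)
{
	size_t i = 0, o = 0;
	assert(in != NULL && out != NULL);
	while (i < n) {
		if (in[i] < 0x80) {
			out[o++] = in[i++];
		} else if ((in[i] == 0xC2 || in[i] == 0xC3) && i + 1 < n &&
				(in[i + 1] & 0xC0) == 0x80) {
			out[o++] = (unsigned char)(((in[i] & 0x03) << 6) | (in[i + 1] & 0x3F));
			i += 2;
		} else {
			return (size_t)-1;
		}
	}
	return o;
}
```

With this mapping the C string { 0x80, 0x80 } becomes C2 80 C2 80, which JS code would see as "\x80\x80", and decodes back to the original two bytes.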
avih | ator: also, both wtf8 and cesu8 should not be used for communication, storage, etc. i don't think i've seen anything that mentions library API though | 09:26.19 |
ator | because it's all an afterthought, much like unicode handling in Java and Windows and everyone else who jumped on the unicode bandwagon early | 09:26.22 |
avih | aye | 09:27.17 |
| we know how we got here. but how do we handle it now? :) | 09:27.33 |
ator | "oops, you mean you need MORE than 16 bit characters? we're so fucked!" | 09:27.53 |
avih | lol | 09:28.02 |
| so true | 09:28.04 |
ator | 640Kb is more than anyone will ever need! | 09:28.35 |
avih | it's enough for mujs, most of the time :) | 09:28.51 |
| 2038 will never arrive | 09:30.31 |
ator | btw, node.js doesn't support >16bit characters in its strings | 09:31.21 |
| "we won't be alive when 2038 arrives so who cares?" seems like a more apt description | 09:31.52 |
avih | that's ok. it's by spec. but i assume it does convert to/from utf8 when doing io, at the very least if it consists of well formed content | 09:32.15 |
ator | duktape does what we propose to do here | 09:33.21 |
avih | well, today most people can't hope to not be alive in 2038... | 09:33.22 |
| which of the proposals? | 09:33.51 |
ator | no, but the folks back in the 70's who made the initial assumption | 09:34.06 |
avih | yup | 09:34.13 |
ator | allowing >16bit chars in strings, using relaxed superset of utf-8 | 09:34.37 |
avih | interesting | 09:34.54 |
ator | https://wiki.duktape.org/howtononbmpcharacters | 09:35.45 |
avih | it _is_ the simplest approach, but you have to break the spec for that, or else you cannot slice and dice such strings in js code | 09:35.56 |
| and it will break for "correct" js code which constructs surrogate pairs | 09:36.24 |
| but i also think it's the simplest solution | 09:36.42 |
| i.e. acceptable until a better solution comes up | 09:36.59 |
ator | if you want JS code to handle strings, make sure to only push 16-bit code points in your C code | 09:37.06 |
| we can add a helper js_pushcesu8 which converts any >BMP characters into surrogates | 09:37.24 |
avih | yeah, that would be trivial. the problem is the other way around | 09:37.59 |
ator | js_tostring? well, we could add a js_toutf8 that combines surrogate pairs but needs a place to store the memory | 09:38.39 |
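A rough sketch of what the "combine surrogate pairs" direction might involve, writing into a caller-supplied buffer to sidestep the question of where the memory lives. The name cesu8_to_utf8 is hypothetical; js_toutf8 is only a proposal in this conversation, not an existing function. Unpaired surrogates are copied through unchanged, which matches the "no checking" stance rather than strict UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy in[0..n) to out, replacing every 6-byte encoded
 * surrogate pair (high then low) with the 4-byte UTF-8 form of the combined
 * code point. Everything else is copied as-is. The output is never longer
 * than the input, so out needs room for n bytes. Returns bytes written. */
size_t cesu8_to_utf8(const unsigned char *in, size_t n, unsigned char *out)
{
	size_t i = 0, o = 0;
	assert(in != NULL && out != NULL);
	while (i < n) {
		if (n - i >= 6 &&
				in[i] == 0xED && (in[i + 1] & 0xF0) == 0xA0 && (in[i + 2] & 0xC0) == 0x80 &&
				in[i + 3] == 0xED && (in[i + 4] & 0xF0) == 0xB0 && (in[i + 5] & 0xC0) == 0x80) {
			/* Decode the two 3-byte sequences back into UTF-16 code units. */
			uint32_t hi = 0xD000u | ((uint32_t)(in[i + 1] & 0x3F) << 6) | (uint32_t)(in[i + 2] & 0x3F);
			uint32_t lo = 0xD000u | ((uint32_t)(in[i + 4] & 0x3F) << 6) | (uint32_t)(in[i + 5] & 0x3F);
			uint32_t cp = 0x10000u + ((hi - 0xD800u) << 10) + (lo - 0xDC00u);
			out[o++] = (unsigned char)(0xF0 | (cp >> 18));
			out[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
			out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
			out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
			i += 6;
		} else {
			out[o++] = in[i++];
		}
	}
	return o;
}
```

Because the result is at most as long as the input, a caller could size the buffer from the length of the string it got back from the engine.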
avih | also, this would invalidate most APIs which take char* | 09:38.48 |
ator | there I don't follow you | 09:39.07 |
avih | i.e. can't use these apis anymore with utf8 content. you have to push as cesu8 first, then run js code which uses it | 09:39.33 |
| or maybe do some jumping through hoops like js_pushcesu8(utf8); cesu8 = js_tostring(J, -1); /* now you can't pop it */ js_setproperty(J, cesu8) | 09:41.33 |
ator | js_pushstring_as_cesu() and js_tostring_from_cesu() | 09:41.39 |
| like your wrappers that you posted a while ago | 09:41.51 |
avih | push/to string is not enough. you have property names too | 09:42.17 |
ator | oh, you mean the property names | 09:42.20 |
avih | you have error objects, etc | 09:42.31 |
ator | yeah. and then you have malloc issues. | 09:42.55 |
| or the API becomes more complicated | 09:43.08 |
avih | yes. | 09:43.18 |
| just look how many APIs use const char * at mujs.h | 09:43.58 |
ator | but why would it matter for property names -- either you have literal strings or it's data that you use opaquely | 09:44.18 |
avih | i think not, but let me think about it a bit more. i'm pretty sure in mpv context you want to move property names back and forth between mpv and mujs, where in mpv they're utf8 | 09:45.35 |
| granted, you're not likely to want to slice and dice such names in js code though. | 09:46.10 |
ator | and also, would it matter in mpv that the .length of a utf-8 string is actual characters rather than code points? | 09:46.13 |
| surrogate pairs are madness, and being able to avoid them must be a good thing all things considered | 09:46.48 |
avih | currently mpv does not access the .length property of a string, because there's no api which outputs it as js_tostring | 09:46.59 |
| (and i assumed, though now i think i could have been wrong) that data is clipped at '\0' inside mujs | 09:47.39 |
| but iirc i looked at js_pushlstring recently and it doesn't crop it iirc | 09:48.05 |
ator | the intent is to support them, but there are bits of the code that still terminate at 0 | 09:48.57 |
avih | ator: how about this: rune can be >16bits, but fromCharCode doesn't allow >16bits. if this can happen while substrings etc don't corrupt the data, it could work | 09:50.00 |
| so basically, the limitation is that js code which constructs or assumes surrogate pairs will work incorrectly at the C-API user, but everything else will work transparently | 09:52.03 |
ator | js_pushlstring is currently used to push strings without a 0-terminator | 09:52.05 |
| but eventually should be able to be used to push strings with embedded 0 too | 09:52.16 |
avih | yes, i'm looking at jsV_newmemstring and it doesn't care about termination. it just copies n bytes | 09:54.18 |
ator | we're halfway there. need to add a length counter to memstrings, and fix the lexer to return memstrings and not always interned strings. | 09:54.51 |
avih | the only missing thing is js_tolstring. though it could probably be worked around by reading the .length property and doing some work with it. maybe. | 09:55.06 |
ator | and then look at the length counter in some jsstring.c functions | 09:55.14 |
| js_tolstring will be added | 09:55.27 |
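As a minimal illustration of why a length counter is needed once strings may contain 0: strlen stops at the first 0 byte, so comparisons and copies have to use the stored length instead. The struct and function below are hypothetical and do not reflect the actual MuJS string representation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical counted string, for illustration only. */
struct counted_str {
	size_t len;        /* number of bytes, which may include 0 bytes */
	const char *data;  /* still 0-terminated for code that expects C strings */
};

/* Compare by stored length and content rather than by strlen/strcmp. */
int counted_equal(struct counted_str a, struct counted_str b)
{
	assert(a.data != NULL && b.data != NULL);
	return a.len == b.len && memcmp(a.data, b.data, a.len) == 0;
}
```

Two strings such as "a\0b" and "a\0c" look identical to strcmp but differ under counted_equal, which is the kind of change the builtin string functions would need.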
avih | ator: i still think that extending the virtuality of String to handle CP >= U+10000 at the underlying utf8 data is the best solution. it doesn't need the spec to be broken, it can handle utf8 input and output, surrogate pairs expectation works for js code, etc. it can be a bit delicate in handling of ill-formed underlying data, but that's about it | 09:59.53 |
| String is already virtual anyway. you don't have real random access, so just extend it a bit. | 10:00.51 |
ator | ill-formed underlying data is what prevents us from doing automatic surrogate pair splitting/combining | 10:02.26 |
| because ill-formed data is very much allowed by the spec | 10:02.41 |
avih | (also delicate in handling ill-formed utf8 code points from the js-string side) | 10:02.45 |
| utf16 * | 10:02.53 |
ator | javascript strings are only utf16 in name -- in reality they're just arrays of unsigned short | 10:03.24 |
avih | yes, if ill-formed data was disallowed, there would be no issue. but i don't think it's not solvable | 10:03.33 |
ator | mujs shouldn't (IMO) enforce a policy. | 10:04.03 |
avih | yes, as i said ill-formed utf8, and js string data which would be considered ill-formed if treated as utf16 | 10:04.20 |
ator | the current code only allows 16-bit values in its strings, and shows them as utf8-ish at the C api level | 10:04.55 |
avih | so these cases should definitely not be ignored. which is exactly the delicate part. but it's well defined and well contained in scope | 10:05.03 |
ator | I'm proposing to allow 21-bit values in strings, and let the user beware | 10:05.19 |
| so naive passing of non-BMP UTF8 will go unnoticed | 10:05.45 |
| javascript code itself can't create strings with >16bit characters (fromCharCode etc prevent it) | 10:06.00 |
avih | and the tradeoff is broken spec, yes? | 10:06.04 |
| because charCodeAt (and maybe also fromCharCode) can be >= 0x1000 ? | 10:06.22 |
ator | tradeoff is we support more than the spec allows | 10:06.23 |
avih | 0x10000 * | 10:06.52 |
ator | given input data that conforms (up to the C api user to check when calling js_pushstring if he/she cares) | 10:06.56 |
| then nothing should be a problem | 10:07.07 |
| yes, charCodeAt can be >0x1000 in my proposal | 10:07.27 |
| (and I wouldn't be too averse to allowing fromCharCode to allow >0x1000 either) | 10:07.50 |
avih | well, it's definitely the easiest solution, code wise | 10:07.53 |
| (not suggesting in a bad way) | 10:08.09 |
ator | if you just want to pass emoji strings back and forth, this is the way. only legacy APIs use utf-16 and surrogate pairs these days. | 10:08.28 |
pink_mist | shouldn't it be 0x10000 not 0x1000? | 10:08.40 |
ator | > 0xffff | 10:08.59 |
| so yes, >= 0x10000 | 10:09.09 |
avih | pink_mist: yes, but 0x1000 is a lot easyer to ttoe, and we both understand it means 0x10000 :) | 10:09.10 |
| easier* | 10:09.18 |
pink_mist | right :P | 10:09.21 |
avih | type* (grrr) | 10:09.23 |
| i suggest ABMP (above BMP) | 10:11.29 |
| ator: will you work on it at some branch? i still want to think about extending the virtuality to include underlying ABMP | 10:13.12 |
| (i.e. without breaking the spec) | 10:13.38 |
ator | avih: see top commit on tor/master | 10:15.53 |
| that's all that's needed now | 10:16.05 |
| supporting embedded 0 is going to be a lot trickier | 10:16.11 |
| because of all the builtin functions that also need to support them, that currently just use C strings | 10:16.36 |
| SMP, SIP, and SSP | 10:17.57 |
avih | (looking) | 10:19.21 |
| ator: what's "SMP, SIP, and SSP" ? | 10:19.49 |
ator | supplementary multilingual plane, supplementary ideographic plane, supplementary special-purpose plane | 10:20.51 |
avih | aka ABMP ? | 10:21.04 |
ator | yeh | 10:21.12 |
| just SMP will do when communicating | 10:21.24 |
avih | right | 10:21.32 |
ator | 'supplementary characters' are those outside the BMP in unicode lingo | 10:21.41 |
avih | aye. SMP is fine | 10:21.53 |
| ator: at the new docs, what kind of attention/actions are required for "but requires attention when passing strings using supplementary unicode characters to and from the MuJS library" | 10:22.53 |
| it's important that client know what they should do to handle it as best as they can | 10:23.45 |
| maybe "this maintains compatibility with valid UTF-8 at the C-API, at the expense that JS code will not see SMP as surrogate pairs. similarly, JS code which construct surrogate pairs will not be converted to UTF-8 at the C-API" | 10:26.07 |
sebras | ator: cgdae noticed in Makefile we have source/pdf/cmaps/%.h: resources/cmaps/% scripts/cmapdump.py since git controls in what order it checks out files can we really be certain about the dates and whether that rule is run? | 10:32.21 |
ator | avih: I still need to write that up properly. | 10:41.55 |
| avih: that's a good summary. | 10:42.15 |
| sebras: we check in the results, only if you forgot to commit the changes should that matter, no? | 10:47.30 |
| but forgetting to commit the autogenerated files would be twice as bad in this scenario, yes! | 10:48.37 |
sebras | ator: we do, but make is triggered by timestamps. | 10:48.43 |
| ator: since git doesn't keep timestamps then nothing prevents the rule from being triggered if cmapdump.py is younger than the checked in .h-files. | 10:49.21 |
ator | in a clean checkout, the files "should" be in sync and not need generating | 10:49.25 |
| right, you mean we generate the file needlessly from a clean checkout? | 10:49.54 |
sebras | ator: exactly, but if the timestamp of cmapdump.py happens to be later than those of the headers then the rule would be run, no? | 10:49.56 |
ator | harmless I think :) | 10:50.00 |
sebras | ator: not if you don't have python installed on your system. | 10:50.10 |
ator | but maybe not if you don't have python3 installed | 10:50.10 |
sebras | in the oss-fuzz environment there is no python binary, but there is a python3 one. | 10:50.42 |
ator | we run python3, or should | 10:50.54 |
| I believe I converted all the scripts to be python3 | 10:51.02 |
| (or python2 *and* 3 compatible) | 10:51.17 |
| the #! says python3 | 10:51.24 |
| and the makefile calls python3 | 10:51.53 |
sebras | oh! but we were running an earlier commit! | 10:52.09 |
| where it just said python in the Makefile. | 10:52.21 |
ator | in that case you're hosed :D | 10:52.23 |
sebras | yes, but should we even needlessly run cmapdump.py in a clean checkout? | 10:52.45 |
ator | do you know of a way to prevent it? | 10:53.26 |
avih | ator: thx. so the docs would need to specify these IMO: 1. what's the best way to use strings at the API (convert to/from cesu8?) 2. what are the limitations of this best way (no middle '\0', non valid CESU8 will not be seen correctly at JS code? cannot use substrings? etc), 3. what are the implications of using utf8 at the C api | 10:54.11 |
ator | (other than forcibly removing the dependencies and require a separate step to regenerate them) | 10:54.12 |
| ator: see tor/master for a slightly reworded documentation | 11:05.07 |
sebras | ator: since we have the headers checked in, why can't we just: generate-cmaps: resources/cmaps/% scripts/cmapdump.py | 11:05.48 |
| ator: or something along those lines. | 11:05.56 |
ator | sebras: because you can forget to regenerate them when you change them. | 11:06.07 |
| sebras: the same is true of the javascript and font dumps too, btw. | 11:06.52 |
sebras | ator: you mean when we update them with a new version from upstream? | 11:06.56 |
ator | which are updated more often than the cmaps | 11:07.10 |
sebras | true. | 11:07.14 |
| well, then we need python3. | 11:07.27 |
ator | also, generate-cmaps: wouldn't work because of tracking dependencies | 11:07.39 |
| the c files that include the generated header files depend on them, etc. | 11:07.55 |
| so for your proposal, we'd have to disconnect the dependencies | 11:08.34 |
| and that would mean having to 'make clean' everytime you change a file that is autogenerated | 11:08.51 |
| (like we used to, and that led to many build skew errors) | 11:09.03 |
| so yes, we need python3! | 11:09.19 |
avih | ator: also, i think the docs can recommend (maybe with provided utils) to handle arbitrary binary blobs as BCD or base64 strings. this guarantees perfect handling of arbitrary data. | 11:15.54 |
| while the plain C API assume utf8 or cesu8 or whatever you specify | 11:16.29 |
sebras | ator: ok. | 11:18.53 |
avih | or just plain hex where "\xDE\xED" is stored as the normal string "DEAD" in JS | 11:20.11 |
| "\xDE\xAD" -> "DEAD" | 11:20.33 |
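The plain hex idea is easy to sketch. The function name is hypothetical and this is not a utility that MuJS provides; it only shows the "\xDE\xAD" to "DEAD" mapping, which yields a pure ASCII string that survives any of the encodings discussed here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: write the uppercase hex form of in[0..n) to out as a
 * 0-terminated string. out needs room for 2*n + 1 bytes. */
void hex_encode(const unsigned char *in, size_t n, char *out)
{
	static const char digits[] = "0123456789ABCDEF";
	size_t i;
	assert(in != NULL && out != NULL);
	for (i = 0; i < n; i++) {
		out[2 * i] = digits[in[i] >> 4];
		out[2 * i + 1] = digits[in[i] & 0x0F];
	}
	out[2 * n] = '\0';
}
```

The cost is that the string is twice the size of the data, which is the tradeoff against base64.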
ator | avih: if you want to process raw data, there's the TypedArray extension that I've been working on | 12:33.47 |
avih | yeah, i saw you pushed it earlier (currently i can live without it) | 12:48.22 |
| ator: in general though, we currently need mujs 1.0 for mpv. i don't think i want to require a minimum mujs version later than 1-below the latest | 12:51.39 |
| so if there's new which mpv will use, it will have to be there at least some while | 10:52.30 |
| new API * | 12:52.36 |
| ator: btw, what did/do you think of not interning small property names (which covers array indices if set to e.g. 12) like so https://0x0.st/isYD.txt | 14:30.19 |
| ator: in your current master (with SMP handling), something is not right. i have such emoji at the source file, the s[N] enumerate "correctly" (one char is SMP), but it seems that something breaks at js_tostring. i _think_ i'm not getting a correct utf8 sequence | 14:45.40 |
ator | avih: okay... can you provide an example? | 15:39.53 |
avih | ator: not yet. but i think print("<SMP>") at a source file on disk via mujs would reproduce it. but didn't try yet. | 18:24.39 |
| ator: yeah, reproducible with |mujs <this-file>| : | 18:31.15 |
| var s = "Hello 😀!\n"; | 18:31.16 |
| for (var i = 0; i < s.length; i++) | 18:31.16 |
| print(s.charCodeAt(i)); | 18:31.16 |
| print(s); | 18:31.16 |
| so for that emoji it prints 62976 for charCodeAt, but the output doesn't show correctly (and it does it i cat <this-file>) | 18:32.11 |
| if i cat * | 18:32.26 |
| i.e. the output of print(s) is incorrect. | 18:34.34 |
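One observation about the number reported above: 62976 is 0xF600, which is what remains of U+1F600 (the emoji in the test file) when only the low 16 bits are kept. That would be consistent with the code point being truncated to 16 bits somewhere along the way, though this is an inference from the value alone and not a confirmed diagnosis. The sketch below, with a hypothetical function name, decodes the emoji's 4-byte UTF-8 sequence and shows the arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one 4-byte UTF-8 sequence into a code point.
 * Only the lead byte is checked; continuation bytes are assumed valid. */
uint32_t decode_utf8_4(const unsigned char *s)
{
	assert(s != NULL && (s[0] & 0xF8) == 0xF0);
	return ((uint32_t)(s[0] & 0x07) << 18) |
		((uint32_t)(s[1] & 0x3F) << 12) |
		((uint32_t)(s[2] & 0x3F) << 6) |
		(uint32_t)(s[3] & 0x3F);
}
```

Applied to the bytes F0 9F 98 80 this gives 0x1F600 (128512), and masking that with 0xFFFF gives 62976.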