Log of #mupdf at irc.freenode.net.

20200219
avih ator: so any thoughts on what/if you'll do with CESU-8 API and/or js_loadfile? fwiw, the more i think about it, the more i think that internal representation as utf8 would be the best solution. i think it's the smallest leap from current code. strings are already variable length encoding which requires scanning them during random access. making a utf8 codepoint above BMP be a virtual two surrogates on access/write is a relatively small leap, compared to testing08:28.03 
  for conversion on all APIs which use strings.08:28.03 
  (poop emojis et al are indeed rare, but less rare than they were 5 years ago)08:31.30 
  (and providing conversion utils, while better than nothing, still puts an additional meaningful burden on users which want their programs to handle supplementary codepoints correctly)08:34.26 
  also, i _think_ it should be relatively simple to make mujs allow 0 in strings. it will require few (definitely not all) places to consider len directly rather than strlen, and will require an additional api js_tolstring which will output the length as well, but otherwise all apis remain only aware of 0-terminated.08:57.34 
ator avih: I think just using WTF-8 may be the best way. i.e. extend the utf.c to cover 4-byte sequences, and leave data as-is.09:00.52 
  avih: yeah, having embedded 0 will need a length counter in strings (could force use of the JS_TMEMSTR subtype for any strings with embedded 0)09:01.52 
avih ator: so that "<one-codepoint-above-BMP>".length would be 1 ?09:02.11 
ator but should be fairly easily doable09:02.15 
  avih: yes, if encoded as utf-8. or 2 if encoded as a surrogate pair.09:02.43 
  surrogate pairs are forbidden in strict utf-8, required in cesu-8, and wtf-8 doesn't care.09:03.04 
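To make the byte layouts under discussion concrete, here is a minimal sketch in plain ES5-style JavaScript. It is not MuJS code: `encodeCodePoint`, `toSurrogatePair`, `encodeCesu8` and `hex` are hypothetical helper names chosen for illustration. It encodes U+1F600 once as a single 4-byte UTF-8 sequence and once as a CESU-8 surrogate pair.

```javascript
// Encode one code point as generalized UTF-8 (1 to 4 bytes).
// Surrogate code points (U+D800..U+DFFF) are encoded like any other
// 3-byte value here; strict UTF-8 would reject them.
function encodeCodePoint(cp) {
  if (cp < 0x80) return [cp];
  if (cp < 0x800) return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)];
  if (cp < 0x10000) {
    return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
  }
  return [0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
          0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
}

// Split a supplementary code point into its UTF-16 surrogate pair.
function toSurrogatePair(cp) {
  var v = cp - 0x10000;
  return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)];
}

// CESU-8: a supplementary code point becomes two 3-byte sequences,
// one per surrogate.
function encodeCesu8(cp) {
  if (cp < 0x10000) return encodeCodePoint(cp);
  var pair = toSurrogatePair(cp);
  return encodeCodePoint(pair[0]).concat(encodeCodePoint(pair[1]));
}

// Format a byte array as upper-case hex, for display only.
function hex(bytes) {
  return bytes.map(function (b) {
    return (b < 16 ? "0" : "") + b.toString(16).toUpperCase();
  }).join(" ");
}

var utf8Bytes = hex(encodeCodePoint(0x1F600)); // "F0 9F 98 80"
var cesu8Bytes = hex(encodeCesu8(0x1F600));    // "ED A0 BD ED B8 80"
```

With the 4-byte form a string holding only this character has one element; with the surrogate form it has two, which is the `.length` difference mentioned above.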
avih what if js code does String.fromCharCode(0x10000) ?09:03.23 
ator node.js returns '\u0000' from that :)09:04.20 
  the spec says fromCharCode turns the number into a UInt16 first.09:04.59 
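The ToUint16 step can be checked in any spec-conforming engine. A small sketch (the variable names are only illustrative): only the low 16 bits of each argument survive, so values at or above 0x10000 wrap around.

```javascript
// String.fromCharCode applies ToUint16 to each argument,
// so 0x10000 wraps to 0 and 0x10041 wraps to 0x41 ("A").
var wrappedZero = String.fromCharCode(0x10000);
var wrappedA = String.fromCharCode(0x10041);

var isNul = (wrappedZero === "\u0000"); // true
var isA = (wrappedA === "A");           // true
```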
avih i think the main point is keeping consistency between strings into mujs, strings out of mujs, and manipulating strings inside mujs09:05.27 
ator yes, and I think wtf-8 will be the least surprising (and still allow emojis and other non-BMP text)09:05.52 
avih if you allow s[N] to hold a codepoint rather than 16 bits, then this solves all the issues09:05.55 
  (at the expense of not complying with the spec)09:06.23 
ator I'm fine with that, TBH09:06.53 
  compliant code and input will still work as expected09:07.14 
avih ator: so make Rune 32 bits, allow fromCharCode(0x1000), and that charCodeAt(N) can return 0x1000 ?09:09.00 
  ator: well, compliant code which creates >BMP codepoint as surrogate will be read incorrectly at the C API09:10.18 
ator not really, since the C api documentation will say WTF-8 :)09:11.10 
avih i think the most consistent option is utf8 or wtf8 internal representation, and making codepoints above BMP a virtual surrogate. it does need consideration of how to handle ill-formed content, but it's possible09:11.52 
  and this will not break the ES5 spec, and also allow consistency of in/out/within09:12.39 
  basically, it will need a mapping of internal representation to virtual js string, and the other way around, both for arbitrary content09:13.41 
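One direction of that mapping can be sketched as follows, in plain JavaScript over a byte array standing in for the internal UTF-8 data. This is only an illustration under a strong assumption: the input is well-formed UTF-8. Ill-formed data, the delicate part, is not handled. The names `sequenceLength`, `decodeAt`, `virtualLength` and `virtualCharCodeAt` are hypothetical and not MuJS functions.

```javascript
// Number of bytes in the UTF-8 sequence that starts with `lead`
// (assumes `lead` really is a valid lead byte).
function sequenceLength(lead) {
  if (lead < 0x80) return 1;
  if (lead < 0xE0) return 2;
  if (lead < 0xF0) return 3;
  return 4;
}

// Decode the code point starting at byte offset `pos`.
function decodeAt(bytes, pos) {
  var n = sequenceLength(bytes[pos]);
  var masks = [0, 0x7F, 0x1F, 0x0F, 0x07];
  var cp = bytes[pos] & masks[n];
  for (var k = 1; k < n; k++) cp = (cp << 6) | (bytes[pos + k] & 0x3F);
  return { cp: cp, n: n };
}

// Length as JS code would see it: supplementary code points count twice.
function virtualLength(bytes) {
  var pos = 0, units = 0;
  while (pos < bytes.length) {
    var d = decodeAt(bytes, pos);
    units += d.cp >= 0x10000 ? 2 : 1;
    pos += d.n;
  }
  return units;
}

// 16-bit unit at virtual index `index`, scanning from the start.
function virtualCharCodeAt(bytes, index) {
  var pos = 0, unit = 0;
  while (pos < bytes.length) {
    var d = decodeAt(bytes, pos);
    if (d.cp >= 0x10000) {
      var v = d.cp - 0x10000;
      if (unit === index) return 0xD800 + (v >> 10);       // high surrogate
      if (unit + 1 === index) return 0xDC00 + (v & 0x3FF); // low surrogate
      unit += 2;
    } else {
      if (unit === index) return d.cp;
      unit += 1;
    }
    pos += d.n;
  }
  return NaN; // out of range, like charCodeAt
}

// "A", U+1F600, "!" as UTF-8 bytes.
var sample = [0x41, 0xF0, 0x9F, 0x98, 0x80, 0x21];
var sampleLength = virtualLength(sample); // 4
```

The scan is linear per access, which matches the remark that strings are already a variable-length encoding without real random access.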
ator javascript itself doesn't enforce properly encoded surrogate pairs though09:14.25 
avih right09:14.34 
ator so WTF-8 and no checking on the C side is probably for the best09:14.34 
avih that's why it needs to work with arbitrary content, content being js string and internal representation09:14.55 
ator that allows passing data in and out of strings properly (if also fixing the embedded 0)09:15.01 
avih ator: i still didn't read the whole wtf8 spec. do you know if it can hold arbitrary binary content?09:16.45 
  well, hold and preserve09:16.53 
  (preserve as long as it's not slice and diced, i assume)09:17.33 
ator "This specification defines WTF-8, a superset of UTF-8 that can losslessly represent arbitrary sequences of 16-bit code unit (even if ill-formed in UTF-16) but preserves the other well-formedness constraints of UTF-8." is the key sentence09:18.36 
avih yes, but also "WTF-8 (Wobbly Transformation Format − 8-bit) is an encoding of code point sequences".09:19.17 
  so if you try to encode the c-string { 0x80, 0x80, 0x80, 0x80, 0x80, 0x80 } then it's not a representation of a codepoint09:19.56 
  it's possible to convert such string into a sequence of 0x80 codepoints, however that ends up encoded09:20.29 
  but then it would also need conversion on the way out.09:21.28 
ator that c-string would end up being "\uFFFD\uFFFD\uFFFD...09:21.46 
avih maybe conversions should only happen for js_pushlstring and js_tolstring. these are almost by definition binary APIs and not string ones. they just use the String class in js to hold the value09:22.20 
ator if you wanted "\x80\x80\x80\x80\x80\x80" then you have to encode the code points using WTF-8 before passing to mujs09:22.58 
avih exactly, wtf8 does not preserve arbitrary binary content, but any arbitrary content can be converted into a wtf8 which can be preserved and converted back into the original binary content09:24.34 
ator yep.09:25.00 
  which is how you used to handle binary data in javascript before the TypedArray extensions, stick arbitrary data into its UTF-16 strings09:25.38 
  which don't actually require well formed UTF-1609:25.49 
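The "convert binary into code points and back" idea can be sketched in plain JavaScript: each byte becomes one code point in the range U+0000..U+00FF, and the original bytes are recovered from the code units. The helper names `bytesToBinaryString` and `binaryStringToBytes` are hypothetical. At the C level each U+0080 in such a string would itself be encoded as the two bytes C2 80, so a conversion is needed in both directions.

```javascript
// Store each byte as one code point (U+0000..U+00FF).
function bytesToBinaryString(bytes) {
  var s = "";
  for (var i = 0; i < bytes.length; i++) s += String.fromCharCode(bytes[i]);
  return s;
}

// Recover the bytes; assumes every code unit is below 0x100.
function binaryStringToBytes(str) {
  var out = [];
  for (var i = 0; i < str.length; i++) out.push(str.charCodeAt(i) & 0xFF);
  return out;
}

var original = [0x80, 0x80, 0x80, 0x80, 0x80, 0x80];
var asString = bytesToBinaryString(original);  // six U+0080 characters
var roundTrip = binaryStringToBytes(asString); // same six bytes again
```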
avih ator: also, both wtf8 and cesu8 should not be used for communication, storage, etc. i don't think i've seen anything that mentions library API though09:26.19 
ator because it's all an afterthought, much like unicode handling in Java and Windows and everyone else who jumped on the unicode bandwagon early09:26.22 
avih aye09:27.17 
  we know how we got here. but how do we handle it now? :)09:27.33 
ator "oops, you mean you need MORE than 16 bit characters? we're so fucked!"09:27.53 
avih lol09:28.02 
  so true09:28.04 
ator 640Kb is more than anyone will ever need!09:28.35 
avih it's enough for mujs, most of the time :)09:28.51 
  2038 will never arrive09:30.31 
ator btw, node.js doesn't support >16bit characters in its strings09:31.21 
  "we won't be alive when 2038 arrives so who cares?" seems like a more apt description09:31.52 
avih that's ok. it's by spec. but i assume it does convert to/from utf8 when doing io, at the very least if it consists of well formed content09:32.15 
ator duktape does what we propose to do here09:33.21 
avih well, today most people can't hope to not be alive in 2038...09:33.22 
  which of the proposals?09:33.51 
ator no, but the folks back in the 70's who made the initial assumption09:34.06 
avih yup09:34.13 
ator allowing >16bit chars in strings, using relaxed superset of utf-809:34.37 
avih interesting09:34.54 
ator https://wiki.duktape.org/howtononbmpcharacters09:35.45 
avih it _is_ the simplest approach, but you have to break the spec for that, or else you cannot slice and dice such strings in js code09:35.56 
  and it will break for "correct" js code which constructs surrogate pairs09:36.24 
  but i also think it's the simplest solution09:36.42 
  i.e. acceptable until a better solution comes up09:36.59 
ator if you want JS code to handle strings, make sure to only push 16-bit code points in your C code09:37.06 
  we can add a helper js_pushcesu8 which converts any >BMP characters into surrogates09:37.24 
avih yeah, that would be trivial. the problem is the other way around09:37.59 
ator js_tostring? well, we could add a js_toutf8 that combines surrogate pairs but needs a place to store the memory09:38.39 
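The combining step itself is small. A sketch in plain JavaScript over an array of 16-bit code units, assuming that lone surrogates are passed through unchanged rather than rejected (`combineSurrogates` is a hypothetical name, not the proposed `js_toutf8`):

```javascript
// Merge well-formed surrogate pairs into supplementary code points.
// Unpaired surrogates are kept as they are.
function combineSurrogates(units) {
  var out = [];
  for (var i = 0; i < units.length; i++) {
    var u = units[i];
    var next = units[i + 1]; // undefined at the end; comparisons are then false
    if (u >= 0xD800 && u <= 0xDBFF && next >= 0xDC00 && next <= 0xDFFF) {
      out.push(0x10000 + ((u - 0xD800) << 10) + (next - 0xDC00));
      i++; // skip the low surrogate that was just consumed
    } else {
      out.push(u);
    }
  }
  return out;
}

var combined = combineSurrogates([0x48, 0xD83D, 0xDE00, 0x21]); // [0x48, 0x1F600, 0x21]
var lone = combineSurrogates([0xD83D, 0x41]);                   // unchanged
```

The result can be longer-lived than the input only if something owns the new buffer, which is the memory question raised above.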
avih also, this would invalidate most APIs which take char*09:38.48 
ator there I don't follow you09:39.07 
avih i.e. can't use these apis anymore with utf8 content. you have to push as cesu8 first, then run js code which uses it09:39.33 
  or maybe do some jumping through hoops like js_pushcesu8(utf8); cesu8 = js_tostring(J, -1); /* now you can't pop it */ js_setproperty(J, cesu8)09:41.33 
ator js_pushstring_as_cesu() and js_tostring_from_cesu()09:41.39 
  like your wrappers that you posted a while ago09:41.51 
avih push/to string is not enough. you have property names too09:42.17 
ator oh, you mean the property names09:42.20 
avih you have error objects, etc09:42.31 
ator yeah. and then you have malloc issues.09:42.55 
  or the API becomes more complicated09:43.08 
avih yes.09:43.18 
  just look how many APIs use const char * at mujs.h09:43.58 
ator but why would it matter for property names -- either you have literal strings or it's data that you use opaquely09:44.18 
avih i think not, but let me think about it a bit more. i'm pretty sure in mpv context you want to move property names back and forth between mpv and mujs, where in mpv they're utf809:45.35 
  granted, you're not likely to want to slice and dice such names in js code though.09:46.10 
ator and also, would it matter in mpv that the .length of a utf-8 string is actual characters rather than code points?09:46.13 
  surrogate pairs are madness, and being able to avoid them must be a good thing all things considered09:46.48 
avih currently mpv does not access the .length property of a string, because there's no api which outputs it as js_tostring09:46.59 
  (and i assumed, though now i think i could have been wrong) that data is clipped at '\0' inside mujs09:47.39 
  but iirc i looked at js_pushlstring recently and it doesn't crop it iirc09:48.05 
ator the intent is to support them, but there are bits of the code that still terminate at 009:48.57 
avih ator: how about this: rune can be >16bits, but fromCharCode doesn't allow >16bits. if this can happen while substrings etc don't corrupt the data, it could work09:50.00 
  so basically, the limitation is that js code which constructs or assumes surrogate pairs will work incorrectly at the C-API user, but everything else will work transparently09:52.03 
ator js_pushlstring is currently used to push strings without a 0-terminator09:52.05 
  but eventually should be able to be used to push strings with embedded 0 too09:52.16 
avih yes, i'm looking at jsV_newmemstring and it doesn't care about termination. it just copies n bytes09:54.18 
ator we're halfway there. need to add a length counter to memstrings, and fix the lexer to return memstrings and not always interned strings.09:54.51 
avih the only missing thing is js_tolstring. though it could probably be worked around by reading the .length property and doing some work with it. maybe.09:55.06 
ator and then look at the length counter in some jsstring.c functions09:55.14 
  js_tolstring will be added09:55.27 
avih ator: i still think that extending the virtuality of String to handle CP >= U+10000 at the underlying utf8 data is the best solution. it doesn't need the spec to be broken, it can handle utf8 input and output, surrogate pairs expectation works for js code, etc. it can be a bit delicate in handling of ill-formed underlying data, but that's about it09:59.53 
  String is already virtual anyway. you don't have real random access, so just extend it a bit.10:00.51 
ator ill-formed underlying data is what prevents us from doing automatic surrogate pair splitting/combining10:02.26 
  because ill-formed data is very much allowed by the spec10:02.41 
avih (also delicate in handling ill-formed utf8 code points from the js-string side)10:02.45 
  utf16 *10:02.53 
ator javascript strings are only utf16 in name -- in reality they're just arrays of unsigned short10:03.24 
avih yes, if ill-formed data was disallowed, there would be no issue. but i don't think it's unsolvable10:03.33 
ator mujs shouldn't (IMO) enforce a policy.10:04.03 
avih yes, as i said ill-formed utf8, and js string data which would be considered ill-formed if treated as utf1610:04.20 
ator the current code only allows 16-bit values in its strings, and shows them as utf8-ish at the C api level10:04.55 
avih so these cases should definitely not be ignored. which is exactly the delicate part. but it's well defined and well contained in scope10:05.03 
ator I'm proposing to allow 21-bit values in strings, and let the user beware10:05.19 
  so naive passing of non-BMP UTF8 will go unnoticed10:05.45 
  javascript code itself can't create strings with >16bit characters (fromCharCode etc prevent it)10:06.00 
avih and the tradeoff is broken spec, yes?10:06.04 
  because charCodeAt (and maybe also fromCharCode) can be >= 0x1000 ?10:06.22 
ator tradeoff is we support more than the spec allows10:06.23 
avih 0x10000 *10:06.52 
ator given input data that conforms (up to the C api user to check when calling js_pushstring if he/she cares)10:06.56 
  then nothing should be a problem10:07.07 
  yes, charCodeAt can be >0x1000 in my proposal10:07.27 
  (and I wouldn't be too averse to allowing fromCharCode to allow >0x1000 either)10:07.50 
avih well, it's definitely the easiest solution, code wise10:07.53 
  (not suggesting in a bad way)10:08.09 
ator if you just want to pass emoji strings back and forth, this is the way. only legacy APIs use utf-16 and surrogate pairs these days.10:08.28 
pink_mist shouldn't it be 0x10000 not 0x1000?10:08.40 
ator > 0xffff10:08.59 
  so yes, >= 0x1000010:09.09 
avih pink_mist: yes, but 0x1000 is a lot easyer to ttoe, and we both understand it means 0x10000 :)10:09.10 
  easier*10:09.18 
pink_mist right :P10:09.21 
avih type* (grrr)10:09.23 
  i suggest ABMP (above BMP)10:11.29 
  ator: will you work on it at some branch? i still want to think about extending the virtuality to include underlying ABMP10:13.12 
  (i.e. without breaking the spec)10:13.38 
ator avih: see top commit on tor/master10:15.53 
  that's all that's needed now10:16.05 
  supporting embedded 0 is going to be a lot trickier10:16.11 
  because of all the builtin functions that also need to support them, that currently just use C strings10:16.36 
  SMP, SIP, and SSP10:17.57 
avih (looking)10:19.21 
  ator: what's "SMP, SIP, and SSP" ?10:19.49 
ator supplementary multilingual plane, supplementary ideographic plane, supplementary special-purpose plane10:20.51 
avih aka ABMP ?10:21.04 
ator yeh10:21.12 
  just SMP will do when communicating10:21.24 
avih right10:21.32 
ator 'supplementary characters' are those outside the BMP in unicode lingo10:21.41 
avih aye. SMP is fine10:21.53 
  ator: at the new docs, what kind of attention/actions are required for "but requires attention when passing strings using supplementary unicode characters to and from the MuJS library"10:22.53 
  it's important that clients know what they should do to handle it as best as they can10:23.45 
  maybe "this maintains compatibility with valid UTF-8 at the C-API, at the expense that JS code will not see SMP as surrogate pairs. similarly, JS code which constructs surrogate pairs will not be converted to UTF-8 at the C-API"10:26.07 
sebras ator: cgdae noticed that in the Makefile we have "source/pdf/cmaps/%.h: resources/cmaps/% scripts/cmapdump.py". since git controls in what order it checks out files, can we really be certain about the dates and whether that rule is run?10:32.21 
ator avih: I still need to write that up properly.10:41.55 
  avih: that's a good summary.10:42.15 
  sebras: we check in the results, only if you forgot to commit the changes should that matter, no?10:47.30 
  but forgetting to commit the autogenerated files would be twice as bad in this scenario, yes!10:48.37 
sebras ator: we do, but make is triggered by timestamps.10:48.43 
  ator: since git doesn't keep timestamps then nothing prevents the rule from being triggered if cmapdump.py is younger than the checked in .h-files.10:49.21 
ator in a clean checkout, the files "should" be in sync and not need generating10:49.25 
  right, you mean we generate the file needlessly from a clean checkout?10:49.54 
sebras ator: exactly, but if the timestamp of cmapdump.py happens to be later than those of the headers then the rule would be run, no?10:49.56 
ator harmless I think :)10:50.00 
sebras ator: not if you don't have python installed on your system.10:50.10 
ator but maybe not if you don't have python3 installed10:50.10 
sebras in the oss-fuzz environment there is no python binary, but there is a python3 one.10:50.42 
ator we run python3, or should10:50.54 
  I believe I converted all the scripts to be python310:51.02 
  (or python2 *and* 3 compatible)10:51.17 
  the #! says python310:51.24 
  and the makefile calls python310:51.53 
sebras oh! but we were running an earlier commit!10:52.09 
  where it just said python in the Makefile.10:52.21 
ator in that case you're hosed :D10:52.23 
sebras yes, but should we even needlessly run cmapdump.py in a clean checkout?10:52.45 
ator do you know of a way to prevent it?10:53.26 
avih ator: thx. so the docs would need to specify these IMO: 1. what's the best way to use strings at the API (convert to/from cesu8?) 2. what are the limitations of this best way (no middle '\0', non valid CESU8 will not be seen correctly at JS code? cannot use substrings? etc), 3. what are the implications of using utf8 at the C api10:54.11 
ator (other than forcibly removing the dependencies and require a separate step to regenerate them)10:54.12 
  avih: see tor/master for a slightly reworded documentation11:05.07 
sebras ator: since we have the headers checked in, why can't we just: generate-cmaps: resources/cmaps/% scripts/cmapdump.py11:05.48 
  ator: or something along those lines.11:05.56 
ator sebras: because you can forget to regenerate them when you change them.11:06.07 
  sebras: the same is true of the javascript and font dumps too, btw.11:06.52 
sebras ator: you mean when we update them with a new version from upstream?11:06.56 
ator which are updated more often than the cmaps11:07.10 
sebras true.11:07.14 
  well, then we need python3.11:07.27 
ator also, generate-cmaps: wouldn't work because of tracking dependencies11:07.39 
  the c files that include the generated header files depend on them, etc.11:07.55 
  so for your proposal, we'd have to disconnect the dependencies11:08.34 
  and that would mean having to 'make clean' everytime you change a file that is autogenerated11:08.51 
  (like we used to, and that led to many build skew errors)11:09.03 
  so yes, we need python3!11:09.19 
avih ator: also, i think the docs can recommend (maybe with provided utils) to handle arbitrary binary blobs as BCD or base64 strings. this guarantees perfect handling of arbitrary data.11:15.54 
  while the plain C API assume utf8 or cesu8 or whatever you specify11:16.29 
sebras ator: ok.11:18.53 
avih or just plain hex where "\xDE\xED" is stored as the normal string "DEAD" in JS11:20.11 
  "\xDE\xAD" -> "DEAD"11:20.33 
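The hex variant is easy to sketch in plain JavaScript. `bytesToHex` and `hexToBytes` are hypothetical helper names, not existing MuJS or mpv utilities, and `hexToBytes` assumes an even-length string of valid hex digits. Because the result is pure ASCII, it is unaffected by the UTF-8 versus CESU-8 question and by embedded zero bytes.

```javascript
// Encode each byte as two upper-case hex digits.
function bytesToHex(bytes) {
  var s = "";
  for (var i = 0; i < bytes.length; i++) {
    s += (bytes[i] < 16 ? "0" : "") + bytes[i].toString(16).toUpperCase();
  }
  return s;
}

// Decode pairs of hex digits back into bytes.
function hexToBytes(hexString) {
  var out = [];
  for (var i = 0; i < hexString.length; i += 2) {
    out.push(parseInt(hexString.substring(i, i + 2), 16));
  }
  return out;
}

var encoded = bytesToHex([0xDE, 0xAD]); // "DEAD"
var decoded = hexToBytes(encoded);      // [0xDE, 0xAD]
```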
ator avih: if you want to process raw data, there's the TypedArray extension that I've been working on12:33.47 
avih yeah, i saw you pushed it earlier (currently i can live without it)12:48.22 
  ator: in general though, we currently need mujs 1.0 for mpv. i don't think i want to require a minimum mujs version later than 1-below the latest12:51.39 
  so if there's new which mpv will use, it will have to be there at least some while12:52.30 
  new API *12:52.36 
  ator: btw, what did/do you think of not interning small property names (which covers array indices if set to e.g. 12) like so https://0x0.st/isYD.txt14:30.19 
  ator: in your current master (with SMP handling), something is not right. i have such emoji at the source file, the s[N] enumerate "correctly" (one char is SMP), but it seems that something breaks at js_tostring. i _think_ i'm not getting a correct utf8 sequence14:45.40 
ator avih: okay... can you provide an example?15:39.53 
avih ator: not yet. but i think print("<SMP>") at a source file on disk via mujs would reproduce it. but didn't try yet.18:24.39 
  ator: yeah, reproducible with |mujs <this-file>| :18:31.15 
  var s = "Hello 😀!\n";18:31.16 
  for (var i = 0; i < s.length; i++)18:31.16 
  print(s.charCodeAt(i));18:31.16 
  print(s);18:31.16 
  so for that emoji it prints 62976 for charCodeAt, but the output doesn't show correctly (and it does it i cat <this-file>)18:32.11 
  if i cat *18:32.26 
  i.e. the output of print(s) is incorrect.18:34.34 
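A quick arithmetic check on the reported value, assuming the emoji in the test string is U+1F600 as it appears in the log: 62976 is exactly the low 16 bits of that code point. Whether 16-bit truncation is the actual cause of the broken output is not established in the log; the sketch only shows that the numbers line up.

```javascript
var codePoint = 0x1F600;        // the emoji in the test string (assumed)
var low16 = codePoint & 0xFFFF; // 0xF600
var reported = 62976;           // value printed by charCodeAt in the report

var matchesTruncation = (low16 === reported); // true

// For comparison: the values one would see if the character were kept
// whole, or exposed as a UTF-16 surrogate pair.
var wholeValue = codePoint; // 128512
var pair = [0xD800 + ((codePoint - 0x10000) >> 10),
            0xDC00 + ((codePoint - 0x10000) & 0x3FF)]; // [55357, 56832]
```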