| <<<Back 1 day (to 2020/05/13) | Fwd 1 day (to 2020/05/15)>>> | 20200514 |
myopia | is artifex planning on muhtml after mujs? all current browser and html engines suffer from memory leaks from those I have tried | 10:51.11 |
| a clean-room C implementation of a leak-proof html engine to be coupled with mujs would be nice | 10:52.01 |
| a candidate to be squared away for a truly scriptable minimalistic browser | 10:52.36 |
| scriptable *and* minimalistic | 10:52.54 |
| probably also coupled with security proofs from a haskell code model | 10:53.39 |
ator | myopia: no. a minimalist browser that is usable on today's web would be a universe-imploding type paradox. | 11:49.46 |
| sebras: a couple of mujs commits on tor/master that could do with a review if you got time | 11:50.53 |
| avih: that goes for you too. I'm thinking of finally pushing the WTF-8 mess... | 11:51.52 |
avih | ator: summary? also what is the "WTF8 mess"? | 11:52.20 |
ator | allow >16bit characters in JS strings that would pass from C through JS and back to C untouched, without automatic conversion of SMP characters to surrogate pairs | 11:52.55 |
avih | (i've been using this successfully https://github.com/avih/mpv/commit/81306684 ) | 11:52.57 |
| well, i know what the problem is quite exactly. i mean, what's your solution? | 11:53.27 |
ator | basically the same hack as duktape | 11:53.27 |
| my solution is to do nothing and leave it up to the user :) | 11:53.47 |
avih | ah, so js char is a codepoint? | 11:53.47 |
| (that's the implication of char > 16 bit, right?) | 11:54.25 |
ator | with the commits I'm proposing, you won't be able to create a >16bit char from Javascript, nor by loading it from source | 11:55.08 |
avih | but it will preserve them if they came from the C api? | 11:55.35 |
| what about slice etc? | 11:55.40 |
ator | but if you pass in a string with >16bit characters via js_pushstring(), they will pass through unmolested back to js_tostring() | 11:56.08 |
avih | what happens with CP > BMP in source files/code? | 11:56.15 |
ator | SMP characters in UTF-8 source code will be converted to surrogate pairs | 11:56.37 |
avih | not fun | 11:57.02 |
| the C client will still need to handle surrogates then, in addition to UTF8 as non-surrogates | 11:57.28 |
| ator: i don't know why you keep messing with half solutions. make it proper. let the C api be UTF8 and handle everything which results inside mujs, IMO | 11:58.55 |
| be graceful with invalid codepoints, but for valid ones just make them work. remove this burden from the user. | 11:59.33 |
ator | fine. then I'll be strict and only allow 16-bit characters as originally intended by ECMAScript. | 12:00.07 |
avih | my solution is external and complete. internally it will be a bit simpler, but there's no getting away from converting every C api string if required. | 12:00.37 |
| you can create another version of all string APIs which will be used internally only, so it doesn't get tested for conversion, but user-facing APIs should just be UTF8. | 12:02.39 |
ator | I'm honestly more tempted to say "screw compatibility" and allow >16bit code points in the JS code | 12:03.21 |
avih | i think this would be better than half-baked solutions. not as good as both UTF8 API and 16 bits chars, but better than half-solutions IMO | 12:06.10 |
ator | the main problem with that is representing strings as ascii. there's no escape syntax for >16bit unicode characters. | 12:06.56 |
| \x is 1-byte, and \u is 2-byte only | 12:07.09 |
avih | i don't follow. example? | 12:07.13 |
| hmm | 12:07.16 |
| isn't there \U ? | 12:07.24 |
ator | (of course, I could just not escape them) | 12:07.26 |
avih | so this is only for JSON? | 12:07.39 |
ator | JSON and console.log()/print() | 12:08.20 |
avih | iirc JSON allows both surrogate and \U<any-codepoint>. I don't know if the js JSON spec requires differently | 12:08.42 |
| I was wrong. JSON allows either surrogates or unescaped utf8 | 12:09.54 |
| https://en.wikipedia.org/wiki/JSON#Data_portability_issues | 12:10.19 |
ator | I'm leaning towards not escaping SMP characters | 12:10.39 |
avih | 15.12.1.1 says "SourceCharacter but not one of " or \ or U+0000 through U+001F" | 12:13.32 |
| i think it could be interpreted that unescaped SMP is allowed? | 12:13.49 |
| actually, any source utf8 except 0-31 seems to be allowed | 12:14.34 |
ator | well, the JSON.stringify algorithm actually only says to escape 0-31, not >127 | 12:25.29 |
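The escaping rule discussed here (escape the quote, the backslash, and U+0000 through U+001F, and leave everything else alone) can be sketched in C roughly as follows. This is only an illustration, not the code on tor/master: `json_quote` is a hypothetical helper name, it assumes the input is a zero-terminated UTF-8 string, and for brevity it writes every control character as `\u00XX` rather than using the short forms such as `\n`.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of mujs): write s as a JSON string
 * literal into out. Only '"', '\\' and the control characters
 * U+0000..U+001F are escaped. Bytes >= 0x80, i.e. UTF-8 sequences
 * including 4-byte SMP characters, are copied through unescaped.
 * Returns the number of bytes written (excluding the terminator),
 * or -1 if out is too small. */
static int json_quote(char *out, size_t outsize, const char *s)
{
    size_t n = 0;
    const unsigned char *p = (const unsigned char *)s;

    if (outsize < 3)
        return -1;
    out[n++] = '"';
    for (; *p; ++p) {
        /* worst case per byte: 6 bytes for \u00XX, plus closing quote and NUL */
        if (n + 8 > outsize)
            return -1;
        if (*p == '"' || *p == '\\') {
            out[n++] = '\\';
            out[n++] = (char)*p;
        } else if (*p < 0x20) {
            n += (size_t)snprintf(out + n, outsize - n, "\\u%04x", (unsigned)*p);
        } else {
            out[n++] = (char)*p;
        }
    }
    out[n++] = '"';
    out[n] = '\0';
    return (int)n;
}
```

With this rule a string holding the four UTF-8 bytes of U+1F600 comes out as those same four bytes between quotes, so no escape syntax for >16bit characters is needed.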
| avih: okay, how about what's on tor/master now then? | 12:35.42 |
avih | ator: i assume only HEAD~1 and HEAD~2 ? | 12:37.26 |
ator | yes. feel free to look at the others too, but those are the ones I think you're most interested in. | 12:37.54 |
avih | ator: so the first commit only converts 0 in js_pushlstring into whatever else you encode it as? | 12:40.36 |
ator | no. it does not affect pushlstring. it affects how "foo\0bar" is parsed | 12:41.20 |
avih | in source code only? | 12:41.44 |
ator | yes. | 12:41.48 |
avih | hmm.. | 12:41.52 |
| well, i think it should be applied in js_pushlstring as well | 12:42.10 |
| (not sure yet how to retrieve it. there's no js_tolstring...) | 12:42.37 |
ator | in other words, we represent embedded 0 in strings as <c080>. same as Java does in the .class bytecode and elsewhere. | 12:42.42 |
| js_pushlstring is to push a non-zero-terminated string | 12:43.12 |
avih | exactly, because it may include embedded 0 which you don't want to terminate it | 12:43.39 |
ator | making it a public function was a mistake | 12:43.45 |
| no, because you're pushing a slice of another string as a new string | 12:43.56 |
| like when the regexp code finds a match | 12:44.07 |
avih | that may have been your intention, but generally non-0-terminated also means a binary blob | 12:44.26 |
ator | pushing strings with an embedded zero was never the intention :) | 12:44.31 |
avih | anyway, i think this is off topic. source files rarely have 0 in them, so only for source files i don't think you should push it. | 12:45.20 |
| (i.e. just terminate the parsing on 0, or skip it, or whatever, it's not worth all this code IMO) | 12:45.59 |
| (fwiw, i don't disagree that pushlstring should not handle embedded 0. if you want blob, use whatever which supports arbitrary sequences - which is not *tf8. it could be some array buffer or some encoding etc) | 12:47.44 |
| "+If you have Javascript code that expects to work with UTF-16 surrogate pairs, | 12:49.24 |
| +you will need to manually convert any extended characters to surrogate pairs | 12:49.24 |
| +and back when passing strings between C and Javascript." | 12:49.24 |
| i think that if you already broke it, at least make it complete - allow >16bit also from js, and provide these functions which convert to/from surrogate pairs | 12:49.57 |
| (i can write them for you once you allow >16bit from inside js) | 12:50.31 |
| however, my solution is still complete, does not break the standard, does not have perf implications for running js code, and has effectively negligible perf impact for C APIs. | 12:52.31 |
| you keep trying to find half solutions, even with this. | 12:53.28 |
| (i understand why.. but you can also choose to solve it correctly) | 12:54.34 |
ator | nobody wants the "correct" solution because it's a piece of crap and the world has moved on to UTF-8 and is not stuck in 1993 when windows NT came out with its 2-byte "unicode" support... | 12:57.23 |
| nobody in their right mind wants to mess with surrogate pairs. | 12:57.35 |
avih | i agree, but it's still the correct js way | 12:58.03 |
ator | your patch should still work unaffected by what is on tor/master | 12:58.06 |
avih | i know. i was hoping to not need it. | 12:58.29 |
ator | but now, if someone wants to use mujs with utf-8, unrestricted by having to convert to surrogate pairs, they can. if they just don't know or don't care, their code will work. | 12:58.58 |
avih | regardless, if you allow SMP chars from inside js and provide the conversion function, i think it will be good enough | 12:59.00 |
ator | you talk as if you haven't actually looked at the code :) | 12:59.20 |
avih | i haven't | 12:59.29 |
| iirc you said it won't work from inside js | 12:59.41 |
ator | here I am thinking you object to something in the code, when you're actually objecting to something that doesn't exist >.< | 13:02.47 |
avih | i wasn't objecting to anything. i said that if you support SMP chars in js, then you should also support it from inside js if it's not supported already. | 13:03.47 |
| and the docs should provide the js conversion to/from surrogates, and i can write them for you once using SMP from inside js is supported. | 13:05.16 |
| i.e. str[1]=0x10005 should work | 13:05.54 |
| or rather FromCharCode(0x10005) should work | 13:07.02 |
ator | the current patch supports them inside and outside of js. String.fromCharCode and String.prototype.charCodeAt() are not limited to 16-bits (as per the commit message) | 13:09.25 |
avih | great. then it should be good enough. i can write the to/from surrogate pairs conversion if you want to add them to the docs. i expect ~5-10 LOC each | 13:11.17 |
| (probably more like 5) | 13:13.11 |
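The to/from surrogate pair conversion mentioned here is standard UTF-16 arithmetic. A minimal sketch, written in C rather than the Javascript avih offers to write, and with hypothetical function names `to_surrogates` and `from_surrogates`:

```c
#include <stdint.h>

/* Split a code point above U+FFFF into a UTF-16 surrogate pair.
 * Returns 1 and fills *hi and *lo on success, or 0 if cp is not
 * in the range U+10000..U+10FFFF. */
static int to_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    if (cp < 0x10000 || cp > 0x10FFFF)
        return 0;
    cp -= 0x10000;                          /* 20 bits remain */
    *hi = (uint16_t)(0xD800 + (cp >> 10));  /* top 10 bits */
    *lo = (uint16_t)(0xDC00 + (cp & 0x3FF));/* bottom 10 bits */
    return 1;
}

/* Combine a high/low surrogate pair back into one code point.
 * Returns 0 if the two values are not a valid pair. */
static uint32_t from_surrogates(uint16_t hi, uint16_t lo)
{
    if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF)
        return 0;
    return 0x10000 + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
}
```

For the 0x10005 example used above, `to_surrogates` yields 0xD800 and 0xDC05, and `from_surrogates` maps that pair back to 0x10005.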
| i still think supporting embedded 0 in source files is a waste of code. but up to you. | 13:13.47 |
ator | avih: again you haven't read the source. I don't support embedded 0 in the source code *file* ... I'm supporting it in strings, like this: var myZeroString = "Hello\x00World!"; | 13:14.46 |
| or \u0000. not a literal 0-value byte in the source text. | 13:15.47 |
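The <c080> representation described above can be illustrated with a small encoder. This is a sketch under assumptions, not the mujs implementation: `encode_mutf8` is a hypothetical name, and the only thing it borrows from Java's modified UTF-8 is the two-byte form for U+0000. Code points above U+FFFF are written as ordinary 4-byte UTF-8 here, and no check is made for surrogates or values above U+10FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoder: write code point cp as UTF-8 into out, which
 * must have room for 4 bytes. U+0000 is written as the overlong pair
 * C0 80 so that the result never contains a zero byte and can live
 * inside a zero-terminated C string. Returns the bytes written. */
static size_t encode_mutf8(unsigned char *out, uint32_t cp)
{
    if (cp == 0) {
        out[0] = 0xC0;
        out[1] = 0x80;
        return 2;
    }
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

Under this scheme "Hello\x00World!" would be stored as the bytes of "Hello", then C0 80, then the bytes of "World!", followed by the usual terminating zero.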
avih | yeah, i've looked at it but didn't follow it, which is why i asked what it does, and from your replies i thought it only applies to source files. i probably understood wrong. | 13:16.00 |
| anyway. i don't have a strong opinion about the embedded 0 and i think the other patch to support SMP chars is good enough according to what you said. | 13:17.10 |
ator | fair enough. it's hard to convey the meaning in english :) | 13:17.26 |
| language is too ambiguous. | 13:17.33 |
avih | sorry, it was probably me not reading carefully enough. shit happens :) | 13:18.50 |