| <<<Back 1 day (to 2020/02/19) | Fwd 1 day (to 2020/02/21)>>> | 20200220 |
everytiing | hi. how can I fill form with mupdf? | 02:08.34 |
| I tried to read man page, website document, and googled. but nothing was found. :( | 02:09.52 |
sebras | everytiing: have you tried using mupdf-gl? | 04:18.21 |
| everytiing: mupdf-x11 does not really support form-filling I think. | 04:18.40 |
everytiing | @sebras you're right. it works. thanks a lot | 04:22.06 |
| It can't fill Thai language though. But that's alright. Okular also can't do it. | 04:24.34 |
sebras | everytiing: you're welcome. do you mind reporting a bug at bugs.ghostscript.com and attaching the PDF where you can't fill in Thai text? | 04:25.48 |
| everytiing: we should make it possible to input text in any language. | 04:28.52 |
everytiing | sebras sure | 04:30.10 |
sebras | everytiing: thank you. | 04:30.40 |
avih | ator: can you reproduce the issue? | 08:56.16 |
| also, i'm quite convinced that the virtuality of strings can be extended to use an underlying SMP wtf/utf-8 encoding, while sticking to the spec. in a nutshell, it's wtf8 - if it's unpaired surrogate then it's left as is, if it's paired then it's virtually on the fly conversion between utf8 and two surrogate runes | 09:32.21 |
| this needs runetochar and chartorune to hold a state, and then be aware of that at code which uses these functions (basically move the state around - an int) | 09:34.30 |
| well, the state is not held at these functions, it's held at its callers, but these functions would update the state | 09:35.05 |
| the is be 0/1 at chartorune of whether or not we're about to handle the 2nd surrogate of a valid pair, and at runetochar it's the value of the first surrogate rune when the two form a valid codepoint | 09:38.03 |
| the state* is | 09:38.58 |
ator | avih: yes, missed a typedef that limits Rune to 16 bits :) | 10:49.56 |
avih | also, re correct underlying wtf8, i think only chartorune needs a state. runetochar can remain 16 bits, but after things which touch strings, we would do a pass to convert cesu8 to utf8 inplace for valid pairs. | 10:53.56 |
ator | avih: where would you do this automatic conversion? js_pushstring or js_tostring or both? | 10:54.33 |
avih | (from my experience, this pass is very very quick, especially if no conversion is required) | 10:54.54 |
ator | there is *zero* guarantee that the strings from javascript are well-formed surrogate pairs | 10:54.57 |
| that sort of policy would be up to the user of the library, IMO | 10:55.09 |
| if you want to push a string and convert UTF-8 to WTF-8 surrogate pairs, let's add a separate function for that | 10:55.29 |
| and vice versa | 10:55.31 |
avih | ator: none, chartorune will use a state which you init to 0 when starting to iterate a string. runetochar remains 16 bits. it's only the low level rune functions which are modified | 10:56.00 |
| ator: of course there's no guarantee. hence only valid pairs are converted. unpaired surrogate remain as is (wtf8) | 10:56.52 |
ator | right, but wtf8 also supports SMP characters that won't fit in 16-bits | 10:57.10 |
| you mean the chartorune would detect SMP and generate two runes as output? | 10:57.50 |
| i.e. do utf-8 to wtf-8 conversion | 10:58.04 |
avih | and chartorune will look at the next up to 6 bytes, if it's a valid pair, it will return 0, set the char to the 1st surrogate, and set the state, and the next call will return 4 and set the char to the 2nd surrogate | 10:58.10 |
ator | or rather utf-8 to cesu-8 | 10:58.13 |
avih | yes | 10:58.53 |
| sorry, chartorune will look at 4 bytes, and return the virtual surrogates if it's valid | 10:59.32 |
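A minimal sketch of the stateful decoder described above, assuming a 16-bit Rune. The name `chartorune_stateful`, its signature, and the helper `utf16_length` are hypothetical and are not taken from mujs or from the patches linked later. A valid 4-byte UTF-8 sequence is reported as two virtual surrogates: the first call returns length 0 and sets the state, the second call returns 4. A surrogate stored as 3 bytes is passed through unchanged.

```c
#include <stddef.h>

typedef unsigned short Rune;              /* 16-bit rune, as in the discussion */

#define IS_CONT(b) (((b) & 0xC0) == 0x80) /* UTF-8 continuation byte? */

/* Hypothetical stateful decoder; *state must be 0 at the start of a string. */
int chartorune_stateful(Rune *rune, const char *str, int *state)
{
    const unsigned char *s = (const unsigned char *)str;
    unsigned long cp;

    if (s[0] < 0x80) {                    /* ASCII */
        *rune = s[0];
        *state = 0;
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0 && IS_CONT(s[1])) {
        cp = ((unsigned long)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        if (cp >= 0x80) {                 /* reject overlong forms */
            *rune = (Rune)cp;
            *state = 0;
            return 2;
        }
    } else if ((s[0] & 0xF0) == 0xE0 && IS_CONT(s[1]) && IS_CONT(s[2])) {
        cp = ((unsigned long)(s[0] & 0x0F) << 12)
           | ((unsigned long)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        if (cp >= 0x800) {                /* includes lone surrogates (wtf8) */
            *rune = (Rune)cp;
            *state = 0;
            return 3;
        }
    } else if ((s[0] & 0xF8) == 0xF0 && IS_CONT(s[1]) && IS_CONT(s[2]) && IS_CONT(s[3])) {
        cp = ((unsigned long)(s[0] & 0x07) << 18)
           | ((unsigned long)(s[1] & 0x3F) << 12)
           | ((unsigned long)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
        if (cp >= 0x10000 && cp <= 0x10FFFF) {
            cp -= 0x10000;
            if (!*state) {                /* first half: consume nothing yet */
                *rune = (Rune)(0xD800 | (cp >> 10));
                *state = 1;
                return 0;
            }
            *rune = (Rune)(0xDC00 | (cp & 0x3FF));
            *state = 0;                   /* second half: consume all 4 bytes */
            return 4;
        }
    }
    *rune = 0xFFFD;                       /* malformed input */
    *state = 0;
    return 1;
}

/* Count 16-bit code units, as a JS string length would. */
size_t utf16_length(const char *s)
{
    Rune r;
    int state = 0;
    size_t n = 0;
    while (*s) {
        s += chartorune_stateful(&r, s, &state);
        n++;
    }
    return n;
}
```

With this shape, every caller that iterates a string carries one extra int and resets it to 0 when it starts a new string.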
ator | and runetochar, for strings created in javascript? | 11:00.16 |
avih | runetochar stays normal, but after a string is touched (concat, fromCharCode, etc), we do a pass to convert valid pairs to utf8 inplace (we do have enough space because utf8 is less than cesu8) | 11:00.32 |
ator | the client would js_pushstring utf-8 with SMP and js_tostring would return a WTF-8 with surrogate pairs | 11:00.49 |
avih | exactly | 11:00.57 |
| well, no, | 11:01.10 |
ator | no, that's not what you just said :) | 11:01.23 |
avih | it will return surrogate pairs only when they're not paired | 11:01.33 |
| it will return surrogates* only when they're not paired | 11:01.50 |
| i.e. the invariant is that any valid pairs of surrogates at the JS string is stored as 5 bytes utf8/wtf8 | 11:02.18 |
| 4* bytes | 11:02.35 |
| so chartorune is virtual with a state, runetochar is normal, but we do a pass to convert inplace to utf8 where possible after a string was touched | 11:03.41 |
| and this pass is very very quick. basically all strings without pairs fail strchr(s, 0xed) | 11:04.37 |
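A rough sketch of the in-place normalization pass mentioned above, under the assumption that strings are NUL-terminated byte arrays. The function names are hypothetical. A valid surrogate pair stored as 6 bytes (two 3-byte sequences starting with 0xED) is rewritten as one 4-byte UTF-8 sequence; unpaired surrogates are copied unchanged. Since the output is never longer than the input, no allocation is needed, and strings without any 0xED byte exit early.

```c
#include <string.h>

/* Return the surrogate encoded by the 3 bytes at p if it lies in [lo, hi], else 0. */
static unsigned surrogate_at(const unsigned char *p, unsigned lo, unsigned hi)
{
    unsigned cp;
    if (p[0] != 0xED || (p[1] & 0xC0) != 0x80 || (p[2] & 0xC0) != 0x80)
        return 0;
    cp = 0xD000u | ((unsigned)(p[1] & 0x3F) << 6) | (p[2] & 0x3F);
    return (cp >= lo && cp <= hi) ? cp : 0;
}

/* Hypothetical helper: rewrite valid surrogate pairs as 4-byte UTF-8, in place. */
char *cesu8_to_utf8_inplace(char *str)
{
    unsigned char *r, *w;

    if (!strchr(str, 0xED))          /* fast path: no surrogate lead byte at all */
        return str;

    r = w = (unsigned char *)str;
    while (*r) {
        unsigned hi = surrogate_at(r, 0xD800, 0xDBFF);
        unsigned lo = hi ? surrogate_at(r + 3, 0xDC00, 0xDFFF) : 0;
        if (lo) {
            unsigned long cp = 0x10000UL + ((unsigned long)(hi - 0xD800) << 10) + (lo - 0xDC00);
            *w++ = (unsigned char)(0xF0 | (cp >> 18));
            *w++ = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            *w++ = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            *w++ = (unsigned char)(0x80 | (cp & 0x3F));
            r += 6;                  /* 6 bytes read, 4 written: w never passes r */
        } else {
            *w++ = *r++;             /* everything else is copied as is */
        }
    }
    *w = 0;
    return str;
}
```

The same call would cover the concat case: if one string ends with the first half of a pair and the next begins with the second half, running the pass on the joined buffer merges them.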
ator | let me mull it over. my gut instinct is to go "ick too complicated" :) | 11:04.42 |
| also, I should consider moving to a faster utf-8 implementation | 11:05.10 |
avih | i understand. but it's actually simple. the problem is that many places use chartorune, and they need to be modified to add the state | 11:05.24 |
ator | avih: https://bjoern.hoehrmann.de/utf-8/decoder/dfa/ | 11:05.38 |
avih | ator: i actually tested tables vs logic, tables are faster when they're at the cpu cache, slower otherwise, so it depends how frequently it's used | 11:06.45 |
| (also, table of ints is slightly faster than table of chars, even though you only need 8 bits per element) | 11:07.44 |
| also, depending on the amount of logic, tables are not THAT much faster. e.g. if you want to test ((*s & 0xf8) == 0xf0) (first char of SMP in utf8), then this is only ~5-10% slower than SMP0[*s] with such table of 256 elements | 11:09.47 |
| and that's while the table is cached. | 11:10.32 |
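For reference, the two variants being compared can be sketched as follows. The names are made up for illustration (`smp_lead_table` stands in for the `SMP0` table mentioned above), and no timing is implied here; the sketch only shows that the lookup and the bit test compute the same predicate.

```c
/* Logic variant: is c the lead byte of a 4-byte UTF-8 sequence (11110xxx)? */
int is_smp_lead_logic(unsigned char c)
{
    return (c & 0xF8) == 0xF0;
}

/* Table variant: 256 ints, following the remark that int elements were
 * measured as slightly faster than char elements. */
int smp_lead_table[256];

void init_smp_lead_table(void)
{
    int i;
    for (i = 0; i < 256; i++)
        smp_lead_table[i] = (i & 0xF8) == 0xF0;
}

int is_smp_lead_table(unsigned char c)
{
    return smp_lead_table[c];
}
```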
ator | it's never been so much slower that I've felt a pressing need to change, and I like the simple api of the plan9 utf stuff | 11:10.35 |
avih | sure | 11:11.35 |
| ator: also, the state which you pass to chartorune could be the inverse of the previous return value. so 0 initially, 0 for unpaired surrogate (return value is 3), 1 after 1st paired surrogate (because length for the 1st is 0) | 11:14.29 |
| i.e. no need for dedicated var. | 11:14.42 |
| AND it only requires conversion from valid pairs to utf8, which can always be done inplace. the other way around is virtual, so no memory management required | 11:19.33 |
| re do such conversion at js_pushstring et al (everything which takes a char * which becomes a JS string), not sure. i think it can stay as is without hurting anything. | 11:22.35 |
| chartorune will just see a BMP rune. that's fine, it won't be virtual and work as is. but if we use that string as input to modifications (concat, etc), then it will be converted to utf8 if it has valid pairs | 11:24.43 |
| but it would be on the user, because surrogate pair is invalid utf8 and invalid wtf8 | 11:34.44 |
| nothing would blow, but string operations might not work as expected. | 11:35.19 |
| so basically conversion is required only after mujs itself concats or extracts substrings | 11:36.27 |
| ator: i think the only place which can be a bit painful to use a different chartorune is at the lexer. but 1. it's only run once (except eval etc) 2. we can convert the lexed source to cesu-8 (only if required) first, and use it with the current (i.e. non state) chartorune | 11:45.55 |
| actually, maybe not even that. J->source is simply moved forward while the lexer consumes the runes, so we would have J->last_rune_len which is the chartorune state, and only jsY_next needs to be modified to use it, and it should be initialized with the state | 11:51.49 |
ator | avih: sounds like you just want JS to see SMP characters as surrogate pairs and leave everything else alone. you can probably accomplish that with just changing a few bits in jsstring.c | 11:53.57 |
| the js_utfidxtoptr and reverse, and js_runeat | 11:54.08 |
avih | ator: not only that. i also want it to work correctly for JS code which constructs strings as surrogates | 11:54.43 |
| (as correct JS code should do for SMP) | 11:54.59 |
| and that when you concat s1="...<pair-1st>" with s2="<pair-2nd>..." then the result would invariantly be stored as underlying utf8 | 11:56.05 |
| and i think this is all accomplished with these two changes: chartorune with state, normalization without memory management after mujs constructs a string from smaller elements | 11:57.51 |
| (substring _probably_ doesn't need special attention, but i didn't look at it yet) | 12:02.23 |
sebras | ator: I asked everytiing to report the lacking Thai language support. I figured it would be better if we record what languages people have asked for at least. | 12:29.13 |
ator | sebras: yes. | 12:31.01 |
avih | ator: this (untested) is the change at chartorune (only difference from your master branch is after "goto bad"), and the required change at one consumer of it (pstr): https://0x0.st/iPjB.txt | 12:48.57 |
| it should be trivial to convert all consumers to use it. it doesn't include the normalization to utf8 after string operations yet, but it is also trivial. | 12:49.54 |
| i think it's fairly elegant, minimal, zero penalties in terms of correctness, negligible (i think) penalty at chartorune, and very minor penalty (not yet implemented) when strings are normalized | 13:00.53 |
bgermann | Hi! Just wanted to drop a note here about a license-specific thing. | 13:28.00 |
| https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=951705 | 13:28.05 |
| Maybe mupdf wants to consider adding an OpenSSL exception to its license so that it can be linked with OpenSSL in Debian | 13:28.59 |
avih | ator: this is also with the normalization function, and example usage of it at one place (toLowerCase): https://0x0.st/iP_v.txt | 13:52.57 |
| ator: this is a complete working patch which changes all chartorune invocations https://0x0.st/iPLw.txt (i slightly modified the interface). it's still missing the normalizations, notably at the lexer | 15:23.41 |
| so source with SMP parses correctly but is converted to cesu8, because the lexer sees (virtual) cesu8 runes | 15:24.27 |
| also, we need to revive the 262 test suite. such changes could improve or regress conformance, and it would be nice to know that early. also worth having some performance tests to see how performance evolves over time and see if such patches regress performance meaningfully | 15:35.42 |
| ator: lex fixed too (2 LOC fix) https://0x0.st/iPL0.txt now source files with SMP work correctly. still missing normalization after string ops though | 15:52.16 |
| ator: at your master, the "avoid interning short names", i'd do s/8/12/ because array index is up to 9 decimal digits, 10 with '\0', so just make it 12 IMO | 16:06.33 |
| sorry, 10 digits, 11 with '\0'. covered by 12 | 16:07.02 |
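A quick check of the digit count, assuming the bound in question is the largest ECMAScript array index, 2^32 - 2 = 4294967294. The helper name is hypothetical.

```c
#include <stdio.h>
#include <stdint.h>

/* Bytes needed to hold v as a decimal C string, including the trailing '\0'. */
int decimal_buffer_size(uint32_t v)
{
    /* snprintf with a NULL buffer and size 0 only reports the length. */
    return snprintf(NULL, 0, "%lu", (unsigned long)v) + 1;
}
```

For 4294967294 this gives 11 bytes, which fits under the proposed threshold of 12.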
| ator: the utf8 patch as git patch https://0x0.st/iP9e.txt | 17:17.24 |