Log of #mupdf at irc.freenode.net.

20200220
everytiing hi. how can I fill form with mupdf?02:08.34 
  I tried to read man page, website document, and googled. but nothing was found. :(02:09.52 
sebras everytiing: have you tried using mupdf-gl?04:18.21 
  everytiing: mupdf-x11 does not really support form-filling I think.04:18.40 
everytiing @sebras you're right. it works. thanks a lot04:22.06 
  It can't fill Thai language though. But that's alright. Okular also can't do it.04:24.34 
sebras everytiing: you're welcome. do you mind reporting a bug at bugs.ghostscript.com and attaching the PDF where you can't fill in thai language?04:25.48 
  everytiing: we should make it possible to input text in any language.04:28.52 
everytiing sebras sure04:30.10 
sebras everytiing: thank you.04:30.40 
avih ator: can you reproduce the issue?08:56.16 
  also, i'm quite convinced that the virtuality of strings can be extended to use an underlying SMP wtf8/utf-8 encoding, while sticking to the spec. in a nutshell, it's wtf8 - if it's an unpaired surrogate then it's left as is, if it's paired then it's a virtual on-the-fly conversion between utf8 and two surrogate runes09:32.21 
  this needs runetochar and chartorune to hold a state, and then be aware of that at code which uses these functions (basically move the state around - an int)09:34.30 
  well, the state is not held at these functions, it's held at its callers, but these functions would update the state09:35.05 
  the state is 0/1 at chartorune, indicating whether or not we're about to handle the 2nd surrogate of a valid pair, and at runetochar it's the value of the first surrogate rune when the two form a valid codepoint09:38.03 
ator avih: yes, missed a typedef that limits Rune to 16 bits :)10:49.56 
avih also, re correct underlying wtf8, i think only chartorune needs a state. runetochar can remain 16 bits, but after things which touch strings, we would do a pass to convert cesu8 to utf8 in place for valid pairs.10:53.56 
ator avih: where would you do this automatic conversion? js_pushstring or js_tostring or both?10:54.33 
avih (from my experience, this pass is very very quick, especially if no conversion is required)10:54.54 
ator there is *zero* guarantee that the strings from javascript are well-formed surrogate pairs10:54.57 
  that sort of policy would be up to the user of the library, IMO10:55.09 
  if you want to push a string and convert UTF-8 to WTF-8 surrogate pairs, let's add a separate function for that10:55.29 
  and vice versa10:55.31 
avih ator: none, chartorune will use a state which you init to 0 when starting to iterate a string. runetochar remains 16 bits. it's only the low level rune functions which are modified10:56.00 
  ator: of course there's no guarantee. hence only valid pairs are converted. unpaired surrogate remain as is (wtf8)10:56.52 
ator right, but wtf8 also supports SMP characters that won't fit in 16-bits10:57.10 
  you mean the chartorune would detect SMP and generate two runes as output?10:57.50 
  i.e. do utf-8 to wtf-8 conversion10:58.04 
avih and chartorune will look at the next up to 6 bytes, if it's a valid pair, it will return 0, set the char to the 1st surrogate, and set the state, and the next call will return 4 and set the char to the 2nd surrogate10:58.10 
ator or rather utf-8 to cesu-810:58.13 
avih yes10:58.53 
  sorry, chartorune will look at 4 bytes, and return the virtual surrogates if it's valid10:59.32 
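A minimal sketch of the stateful decoder avih describes, assuming a caller-owned int state that starts at 0. The name chartorune_state, the Rune16 type and the CONT macro are hypothetical and are not taken from mujs or from avih's patch. A valid 4-byte UTF-8 sequence is reported as two virtual surrogates: the first call returns 0 and sets the state, the second returns 4. Unpaired surrogates stored as 3 bytes (wtf8) pass through unchanged.

```c
typedef unsigned short Rune16; /* hypothetical 16-bit rune type */

#define CONT(b) (((b) & 0xC0) == 0x80) /* is b a UTF-8 continuation byte? */

/*
 * Hypothetical stateful decoder, not the mujs chartorune.
 * Stores one UTF-16 code unit in *rune and returns the number of bytes
 * the caller should advance. *state must be 0 when iteration starts.
 */
int chartorune_state(Rune16 *rune, const char *str, int *state)
{
	const unsigned char *s = (const unsigned char *)str;
	unsigned long cp;

	if (s[0] < 0x80) {
		*rune = s[0];
		*state = 0;
		return 1;
	}
	if ((s[0] & 0xE0) == 0xC0 && CONT(s[1])) {
		cp = ((unsigned long)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
		if (cp >= 0x80) {
			*rune = (Rune16)cp;
			*state = 0;
			return 2;
		}
	} else if ((s[0] & 0xF0) == 0xE0 && CONT(s[1]) && CONT(s[2])) {
		cp = ((unsigned long)(s[0] & 0x0F) << 12) |
			((unsigned long)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
		if (cp >= 0x800) {
			/* Includes unpaired surrogates D800..DFFF, left as is. */
			*rune = (Rune16)cp;
			*state = 0;
			return 3;
		}
	} else if ((s[0] & 0xF8) == 0xF0 && CONT(s[1]) && CONT(s[2]) && CONT(s[3])) {
		cp = ((unsigned long)(s[0] & 0x07) << 18) |
			((unsigned long)(s[1] & 0x3F) << 12) |
			((unsigned long)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
		if (cp >= 0x10000 && cp <= 0x10FFFF) {
			cp -= 0x10000;
			if (*state == 0) {
				/* First call: virtual high surrogate, do not advance. */
				*rune = (Rune16)(0xD800 + (cp >> 10));
				*state = 1;
				return 0;
			}
			/* Second call: virtual low surrogate, consume all 4 bytes. */
			*rune = (Rune16)(0xDC00 + (cp & 0x3FF));
			*state = 0;
			return 4;
		}
	}
	/* Malformed input: emit U+FFFD and skip one byte. */
	*rune = 0xFFFD;
	*state = 0;
	return 1;
}
```

A caller would loop with `s += chartorune_state(&r, s, &state)` and keep the state variable alive across calls for the same string.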
ator and runetochar, for strings created in javascript?11:00.16 
avih runetochar stays normal, but after a string is touched (concat, fromCharCode, etc), we do a pass to convert valid pairs to utf8 inplace (we do have enough space because utf8 is less than cesu8)11:00.32 
ator the client would js_pushstring utf-8 with SMP and js_tostring would return a WTF-8 with surrogate pairs11:00.49 
avih exactly11:00.57 
  well, no,11:01.10 
ator no, that's not what you just said :)11:01.23 
avih it will return surrogate pairs only when they're not paired11:01.33 
  it will return surrogates* only when they're not paired11:01.50 
  i.e. the invariant is that any valid pairs of surrogates at the JS string is stored as 5 bytes utf8/wtf811:02.18 
  4* bytes11:02.35 
  so chartorune is virtual with a state, runetochar is normal, but we do a pass to convert inplace to utf8 where possible after a string was touched11:03.41 
  and this pass is very very quick. basically all strings without pairs fail strchr(s, 0xed)11:04.37 
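A sketch of the normalization pass described above, assuming a NUL-terminated, writable buffer. The function name normalize_pairs is hypothetical and this is not the code from avih's patch. Each valid surrogate pair stored as two 3-byte sequences (cesu8, 6 bytes) is rewritten in place as one 4-byte utf8 sequence. Unpaired surrogates stay untouched, and strings without a 0xED byte are left alone after the strchr check.

```c
#include <string.h>

/*
 * Hypothetical in-place normalization: cesu8 surrogate pairs become
 * 4-byte utf8. Returns the new length of the string in bytes.
 */
size_t normalize_pairs(char *str)
{
	/* Every 3-byte encoded surrogate starts with 0xED, so this is the fast path. */
	unsigned char *src = (unsigned char *)strchr(str, 0xED);
	unsigned char *dst = src;

	if (!src)
		return strlen(str);

	while (*src) {
		/* The && chain stops at the first mismatch, so it never reads past the NUL. */
		if (src[0] == 0xED && (src[1] & 0xF0) == 0xA0 && (src[2] & 0xC0) == 0x80 &&
		    src[3] == 0xED && (src[4] & 0xF0) == 0xB0 && (src[5] & 0xC0) == 0x80) {
			/* 10 payload bits from each surrogate. */
			unsigned long hi = ((unsigned long)(src[1] & 0x0F) << 6) | (src[2] & 0x3F);
			unsigned long lo = ((unsigned long)(src[4] & 0x0F) << 6) | (src[5] & 0x3F);
			unsigned long cp = 0x10000 + ((hi << 10) | lo);

			/* Writing 4 bytes while having read 6 never overtakes src. */
			dst[0] = (unsigned char)(0xF0 | (cp >> 18));
			dst[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
			dst[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
			dst[3] = (unsigned char)(0x80 | (cp & 0x3F));
			src += 6;
			dst += 4;
		} else {
			*dst++ = *src++;
		}
	}
	*dst = 0;
	return (size_t)(dst - (unsigned char *)str);
}
```

Because the output is never longer than the input, no allocation is needed, which matches the "we do have enough space" remark.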
ator let me mull it over. my gut instinct is to go "ick too complicated" :)11:04.42 
  also, I should consider moving to a faster utf-8 implementation11:05.10 
avih i understand. but it's actually simple. the problem is that many places use chartorune, and they need to be modified to add the state11:05.24 
ator avih: https://bjoern.hoehrmann.de/utf-8/decoder/dfa/ 11:05.38 
avih ator: i actually tested tables vs logic, tables are faster when they're in the cpu cache, slower otherwise, so it depends how frequently it's used11:06.45 
  (also, table of ints is slightly faster than table of chars, even though you only need 8 bits per element)11:07.44 
  also, depending on the amount of logic, tables are not THAT much faster. e.g. if you want to test ((*s & 0xf8) == 0xf0) (first char of SMP in utf8), then this is only ~5-10% slower than SMP0[*s] with such a table of 256 elements11:09.47 
  and that's while the table is cached.11:10.32 
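For illustration, the two variants being compared could look like the sketch below. All names are hypothetical, and the table uses int elements as in avih's remark. The sketch only shows that both forms answer the same question (is this byte the lead byte of a 4-byte utf8 sequence?) and makes no claim about their relative speed.

```c
/* Logic variant: a 4-byte utf8 lead byte has the bit pattern 11110xxx. */
int is_smp_lead_logic(unsigned char c)
{
	return (c & 0xF8) == 0xF0;
}

/* Table variant: 256 precomputed answers, one per byte value. */
int smp_lead_table[256];

void init_smp_lead_table(void)
{
	int i;
	for (i = 0; i < 256; i++)
		smp_lead_table[i] = (i & 0xF8) == 0xF0;
}

int is_smp_lead_table(unsigned char c)
{
	return smp_lead_table[c];
}
```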
ator it's never been so much slower that I've felt a pressing need to change, and I like the simple api of the plan9 utf stuff11:10.35 
avih sure11:11.35 
  ator: also, the state which you pass to chartorune could be the inverse of the previous return value. so 0 initially, 0 for unpaired surrogate (return value is 3), 1 after 1st paired surrogate (because length for the 1st is 0)11:14.29 
  i.e. no need for dedicated var.11:14.42 
  AND it only requires conversion from valid pairs to utf8, which can always be done inplace. the other way around is virtual, so no memory management required11:19.33 
  re do such conversion at js_pushstring et al (everything which takes a char * which becomes a JS string), not sure. i think it can stay as is without hurting anything.11:22.35 
  chartorune will just see a BMP rune. that's fine, it won't be virtual and will work as is. but if we use that string as input to modifications (concat, etc), then it will be converted to utf8 if it has valid pairs11:24.43 
  but it would be on the user, because surrogate pair is invalid utf8 and invalid wtf811:34.44 
  nothing would blow, but string operations might not work as expected.11:35.19 
  so basically conversion is required only after mujs itself concats or extracts substrings11:36.27 
  ator: i think the only place which can be a bit painful to use a different chartorune is at the lexer. but 1. it's only run once (except eval etc) 2. we can convert the lexed source to cesu-8 (only if required) first, and use it with the current (i.e. non state) chartorune11:45.55 
  actually, maybe not even that. J->source is simply moved forward while the lexer consumes the runes, so we would have J->last_rune_len which is the chartorune state, and only jsY_next needs to be modified to use it, and it should be initialized with the state11:51.49 
ator avih: sounds like you just want JS to see SMP characters as surrogate pairs and leave everything else alone. you can probably accomplish that with just changing a few bits in jsstring.c11:53.57 
  the js_utfidxtoptr and reverse, and js_runeat11:54.08 
avih ator: not only that. i also want it to work correctly for JS code which constructs strings as surrogates11:54.43 
  (as correct JS code should do for SMP)11:54.59 
  and that when you concat s1="...<pair-1st>" with s2="<pair-2nd>..." then the result would invariantly be stored as underlying utf811:56.05 
  and i think this is all accomplished with these two changes: chartorune with state, normalization without memory management after mujs constructs a string from smaller elements11:57.51 
  (substring _probably_ doesn't need special attention, but i didn't look at it yet)12:02.23 
sebras ator: I asked everytiing to report the lacking thai language support. I figured it would be better if we record what languages people have asked for at least.12:29.13 
ator sebras: yes.12:31.01 
avih ator: this (untested) is the change at chartorune (only difference from your master branch is after "goto bad"), and the required change at one consumer of it (pstr): https://0x0.st/iPjB.txt 12:48.57 
  it should be trivial to convert all consumers to use it. it doesn't include the normalization to utf8 after string operations yet, but it is also trivial.12:49.54 
  i think it's fairly elegant, minimal, zero penalties in terms of correctness, negligible (i think) penalty at chartorune, and very minor penalty (not yet implemented) when strings are normalized13:00.53 
bgermann Hi! Just wanted to drop a note here about a license-specific thing.13:28.00 
  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=951705 13:28.05 
  Maybe mupdf wants to consider adding an OpenSSL exception to its license so that it can be linked with OpenSSL in Debian13:28.59 
avih ator: this is also with the normalization function, and example usage of it at one place (toLowerCase): https://0x0.st/iP_v.txt 13:52.57 
  ator: this is a complete working patch which changes all chartorune invocations https://0x0.st/iPLw.txt (i slightly modified the interface). it's still missing the normalizations, notably at the lexer15:23.41 
  so source with SMP parse correctly but converted to cesu8 because the lexer sees (virtual) cesu8 runes15:24.27 
  also, we need to revive the 262 test suite. such changes could improve or regress conformance, and it would be nice to know that early. also worth having some performance tests to see how performance evolves over time and see if such patches regress performance meaningfully15:35.42 
  ator: lex fixed too (2 LOC fix) https://0x0.st/iPL0.txt now source files with SMP work correctly. still missing normalization after string ops though15:52.16 
  ator: at your master, the "avoid interning short names", i'd do s/8/12/ because array index is up to 9 decimal digits, 10 with '\0', so just make it 12 IMO16:06.33 
  sorry, 10 digits, 11 with '\0'. covered by 12 16:07.02 
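The digit count can be checked with a small sketch. The largest valid JavaScript array index is 2^32 - 2 = 4294967294. The function name below is hypothetical, and reading the 8 (or 12) as the size of a buffer for short names is an assumption about ator's commit, not something stated in the log.

```c
#include <stdio.h>

/*
 * Formats the largest valid JS array index (2^32 - 2) into buf and
 * returns the number of characters needed, excluding the trailing '\0'.
 */
int max_array_index_chars(char *buf, size_t size)
{
	return snprintf(buf, size, "%lu", 4294967294UL);
}
```

With a 12-byte buffer the index and its terminating '\0' fit with one byte to spare.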
  ator: the utf8 patch as git patch https://0x0.st/iP9e.txt 17:17.24 