MuPDF IRC logs

20200220
everytiing	hi. how can I fill form with mupdf?		02:08.34
	I tried to read man page, website document, and googled. but nothing was found. :(		02:09.52
sebras	everytiing: have you tried using mupdf-gl?		04:18.21
	everytiing: mupdf-x11 does not really support form-filling I think.		04:18.40
everytiing	@sebras you're right. it works. thanks a lot		04:22.06
	It can't fill Thai language though. But that's alright. Okular also can't do it.		04:24.34
sebras	everytiing: you're welcome. do you mind reporting a bug at bugs.ghostscript.com and attaching the PDF where you can't fill in thai language?		04:25.48
	everytiing: we should make it possible to input text in any language.		04:28.52
everytiing	sebras sure		04:30.10
sebras	everytiing: thank you.		04:30.40
avih	ator: can you reproduce the issue?		08:56.16
	also, i'm quite convinced that the virtuality of strings can be extended to use an underlying SMP wtf/utf-8 encoding, while sticking to the spec. in a nutshell, it's wtf8 - if it's unpaired surrogate then it's left as is, if it's paired then it's virtually on the fly conversion between utf8 and two surrogate runes		09:32.21
	this needs runetochar and chartorune to hold a state, and then be aware of that at code which uses these functions (basically move the state around - an int)		09:34.30
	well, the state is not held at these functions, it's held at its callers, but these functions would update the state		09:35.05
	the state is 0/1 at chartorune, indicating whether or not we're about to handle the 2nd surrogate of a valid pair, and at runetochar it's the value of the first surrogate rune when the two form a valid codepoint		09:38.03
ator	avih: yes, missed a typedef that limits Rune to 16 bits :)		10:49.56
avih	also, re correct underlying wtf8, i think only chartorune needs a state. runetochar can remain 16 bits, but after things which touch strings, we would do a pass to convert cesu8 to utf8 inplace for valid pairs.		10:53.56
ator	avih: where would you do this automatic conversion? js_pushstring or js_tostring or both?		10:54.33
avih	(from my experience, this pass is very very quick, especially if no conversion is required)		10:54.54
ator	there is zero guarantee that the strings from javascript are well-formed surrogate pairs		10:54.57
	that sort of policy would be up to the user of the library, IMO		10:55.09
	if you want to push a string and convert UTF-8 to WTF-8 surrogate pairs, let's add a separate function for that		10:55.29
	and vice versa		10:55.31
avih	ator: none, chartorune will use a state which you init to 0 when starting to iterate a string. runetochar remains 16 bits. it's only the low level rune functions which are modified		10:56.00
	ator: of course there's no guarantee. hence only valid pairs are converted. unpaired surrogate remain as is (wtf8)		10:56.52
ator	right, but wtf8 also supports SMP characters that won't fit in 16-bits		10:57.10
	you mean the chartorune would detect SMP and generate two runes as output?		10:57.50
	i.e. do utf-8 to wtf-8 conversion		10:58.04
avih	and chartorune will look at the next up to 6 bytes, if it's a valid pair, it will return 0, set the char to the 1st surrogate, and set the state, and the next call will return 4 and set the char to the 2nd surrogate		10:58.10
ator	or rather utf-8 to cesu-8		10:58.13
avih	yes		10:58.53
	sorry, chartorune will look at 4 bytes, and return the virtual surrogates if it's valid		10:59.32
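The stateful decoder avih describes above could look roughly like the following. This is a minimal sketch, not the code from the linked patches: the names `Rune16` and `sketch_chartorune` are hypothetical, and overlong sequences are not rejected. A 4-byte UTF-8 sequence is reported as two virtual surrogates: the first call returns 0 and sets the state, the second call returns 4 and clears it. An unpaired surrogate stored as 3 bytes (wtf8) decodes as is.

```c
typedef unsigned short Rune16; /* hypothetical 16-bit rune type */

/* Decode one UTF-16 code unit from UTF-8/WTF-8 input.
 * *state must be 0 when starting to iterate a string.
 * Returns the number of bytes to advance; this is 0 after the
 * first (virtual) surrogate of a 4-byte sequence. */
static int sketch_chartorune(Rune16 *rune, const char *str, int *state)
{
	const unsigned char *s = (const unsigned char *)str;
	unsigned long c;

	if (s[0] < 0x80) {
		*rune = s[0];
		return 1;
	}
	if ((s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
		*rune = (Rune16)(((s[0] & 0x1F) << 6) | (s[1] & 0x3F));
		return 2;
	}
	if ((s[0] & 0xF0) == 0xE0 && (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80) {
		/* Includes unpaired surrogates stored as 3 bytes (wtf8). */
		*rune = (Rune16)(((s[0] & 0x0F) << 12) | ((s[1] & 0x3F) << 6) | (s[2] & 0x3F));
		return 3;
	}
	if ((s[0] & 0xF8) == 0xF0 && (s[1] & 0xC0) == 0x80 &&
	    (s[2] & 0xC0) == 0x80 && (s[3] & 0xC0) == 0x80) {
		c = ((unsigned long)(s[0] & 0x07) << 18) | ((unsigned long)(s[1] & 0x3F) << 12) |
		    ((unsigned long)(s[2] & 0x3F) << 6) | (unsigned long)(s[3] & 0x3F);
		if (c >= 0x10000 && c <= 0x10FFFF) {
			c -= 0x10000;
			if (*state == 0) {
				*rune = (Rune16)(0xD800 + (c >> 10)); /* 1st surrogate */
				*state = 1;
				return 0; /* do not advance yet */
			}
			*rune = (Rune16)(0xDC00 + (c & 0x3FF)); /* 2nd surrogate */
			*state = 0;
			return 4;
		}
	}
	*rune = 0xFFFD; /* malformed input */
	return 1;
}
```

A caller would keep one `int state = 0;` per string being iterated and add the return value to its pointer after every call.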
ator	and runetochar, for strings created in javascript?		11:00.16
avih	runetochar stays normal, but after a string is touched (concat, fromCharCode, etc), we do a pass to convert valid pairs to utf8 inplace (we do have enough space because utf8 is less than cesu8)		11:00.32
ator	the client would js_pushstring utf-8 with SMP and js_tostring would return a WTF-8 with surrogate pairs		11:00.49
avih	exactly		11:00.57
	well, no,		11:01.10
ator	no, that's not what you just said :)		11:01.23
avih	it will return surrogate pairs only when they're not paired		11:01.33
	it will return surrogates* only when they're not paired		11:01.50
	i.e. the invariant is that any valid pairs of surrogates at the JS string is stored as 5 bytes utf8/wtf8		11:02.18
	4* bytes		11:02.35
	so chartorune is virtual with a state, runetochar is normal, but we do a pass to convert inplace to utf8 where possible after a string was touched		11:03.41
	and this pass is very very quick. basically all strings without pairs fail strchr(s, 0xed)		11:04.37
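The normalization pass could be sketched as below. This is an illustration under assumptions, not the code from the linked patches; `sketch_normalize` is a hypothetical name. It rewrites each valid CESU-8 surrogate pair (two 3-byte sequences starting with 0xED) as one 4-byte UTF-8 sequence in place, leaves unpaired surrogates untouched, and returns immediately when the string contains no 0xED byte.

```c
#include <string.h>

/* Convert valid CESU-8 surrogate pairs to 4-byte UTF-8, in place.
 * The output is never longer than the input (6 bytes become 4). */
static void sketch_normalize(char *str)
{
	unsigned char *r, *w;
	unsigned long hi, lo, c;
	char *first = strchr(str, '\xED');

	if (!first)
		return; /* fast path: no surrogates at all */

	r = w = (unsigned char *)first;
	while (*r) {
		if (r[0] == 0xED && (r[1] & 0xF0) == 0xA0 && (r[2] & 0xC0) == 0x80 &&
		    r[3] == 0xED && (r[4] & 0xF0) == 0xB0 && (r[5] & 0xC0) == 0x80) {
			/* Low 10 bits of each surrogate. */
			hi = ((unsigned long)(r[1] & 0x0F) << 6) | (unsigned long)(r[2] & 0x3F);
			lo = ((unsigned long)(r[4] & 0x0F) << 6) | (unsigned long)(r[5] & 0x3F);
			c = 0x10000 + ((hi << 10) | lo);
			w[0] = (unsigned char)(0xF0 | (c >> 18));
			w[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
			w[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
			w[3] = (unsigned char)(0x80 | (c & 0x3F));
			r += 6;
			w += 4;
		} else {
			*w++ = *r++; /* copy everything else unchanged */
		}
	}
	*w = 0;
}
```

Because the write pointer never overtakes the read pointer, no extra memory is needed, which matches the "no memory management" point made in the discussion.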
ator	let me mull it over. my gut instinct is to go "ick too complicated" :)		11:04.42
	also, I should consider moving to a faster utf-8 implementation		11:05.10
avih	i understand. but it's actually simple. the problem is that many places use chartorune, and they need to be modified to add the state		11:05.24
ator	avih: https://bjoern.hoehrmann.de/utf-8/decoder/dfa/		11:05.38
avih	ator: i actually tested tables vs logic, tables are faster when they're at the cpu cache, slower otherwise, so it depends how frequently it's used		11:06.45
	(also, table of ints is slightly faster than table of chars, even though you only need 8 bits per element)		11:07.44
	also, depending on the amount of logic, tables are not THAT much faster. e.g. if you want to test ((s & 0xf8) == 0xf0) (first char of SMP in utf8), then this is only ~5-10% slower than is SMP0[s] with such table of 256 elements		11:09.47
	and that's while the table is cached.		11:10.32
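The two variants being compared can be sketched as follows. The names are hypothetical and this shows only the shape of the comparison, not a benchmark. Note that the mask test also accepts 0xF5..0xF7, which are not valid UTF-8 lead bytes, so a real decoder would still range-check the decoded value.

```c
/* Logic variant: first byte of a 4-byte (SMP) UTF-8 sequence. */
static int is_smp_lead_logic(unsigned char b)
{
	return (b & 0xF8) == 0xF0;
}

/* Table variant: 256 ints, filled once, then a single indexed load. */
static int smp_lead_table[256];

static void init_smp_lead_table(void)
{
	int i;
	for (i = 0; i < 256; i++)
		smp_lead_table[i] = ((i & 0xF8) == 0xF0);
}
```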
ator	it's never been so much slower that I've felt a pressing need to change, and I like the simple api of the plan9 utf stuff		11:10.35
avih	sure		11:11.35
	ator: also, the state which you pass to chartorune could be the inverse of the previous return value. so 0 initially, 0 for unpaired surrogate (return value is 3), 1 after 1st paired surrogate (because length for the 1st is 0)		11:14.29
	i.e. no need for dedicated var.		11:14.42
	AND it only requires conversion from valid pairs to utf8, which can always be done inplace. the other way around is virtual, so no memory management required		11:19.33
	re do such conversion at js_pushstring et al (everything which takes a char * which becomes a JS string), not sure. i think it can stay as is without hurting anything.		11:22.35
	chartorune will just see a BMP rune. that's fine, it won't be virtual and work as is. but if we use that string as input to modifications (concat, etc), then it will be converted to utf8 if it has valid pairs		11:24.43
	but it would be on the user, because surrogate pair is invalid utf8 and invalid wtf8		11:34.44
	nothing would blow, but string operations might not work as expected.		11:35.19
	so basically conversion is required only after mujs itself concats or extracts substrings		11:36.27
	ator: i think the only place which can be a bit painful to use a different chartorune is at the lexer. but 1. it's only run once (except eval etc) 2. we can convert the lexed source to cesu-8 (only if required) first, and use it with the current (i.e. non state) chartorune		11:45.55
	actually, maybe not even that. J->source is simply moved forward while the lexer consumes the runes, so we would have J->last_rune_len which is the chartorune state, and only jsY_next needs to be modified to use it, and it should be initialized with the state		11:51.49
ator	avih: sounds like you just want JS to see SMP characters as surrogate pairs and leave everything else alone. you can probably accomplish that with just changing a few bits in jsstring.c		11:53.57
	the js_utfidxtoptr and reverse, and js_runeat		11:54.08
avih	ator: not only that. i also want it to work correctly for JS code which constructs strings as surrogates		11:54.43
	(as correct JS code should do for SMP)		11:54.59
	and that when you concat s1="...<pair-1st>" with s2="<pair-2nd>..." then the result would invariantly be stored as underlying utf8		11:56.05
	and i think this is all accomplished with these two changes: chartorune with state, normalization without memory management after mujs constructs a string from smaller elements		11:57.51
	(substring _probably_ doesn't need special attention, but i didn't look at it yet)		12:02.23
sebras	ator: I asked everytiing to report the lacking thai language support. I figured it would be better if we record what languages people have asked for at least.		12:29.13
ator	sebras: yes.		12:31.01
avih	ator: this (untested) is the change at chartorune (only difference from your master branch is after "goto bad"), and the required change at one consumer of it (pstr): https://0x0.st/iPjB.txt		12:48.57
	it should be trivial to convert all consumers to use it. it doesn't include the normalization to utf8 after string operations yet, but it is also trivial.		12:49.54
	i think it's fairly elegant, minimal, zero penalties in terms of correctness, negligible (i think) penalty at chartorune, and very minor penalty (not yet implemented) when strings are normalized		13:00.53
bgermann	Hi! Just wanted to drop a note here about a license-specific thing.		13:28.00
	https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=951705		13:28.05
	Maybe mupdf wants to consider adding an OpenSSL exception to its license so that it can be linked with OpenSSL in Debian		13:28.59
avih	ator: this is also with the normalization function, and example usage of it at one place (toLowerCase): https://0x0.st/iP_v.txt		13:52.57
	ator: this is a complete working patch which changes all chartorune invocations https://0x0.st/iPLw.txt (i slightly modified the interface). it's still missing the normalizations, notably at the lexer		15:23.41
	so source with SMP parse correctly but converted to cesu8 because the lexer sees (virtual) cesu8 runes		15:24.27
	also, we need to revive the 262 test suite. such changes could improve or regress conformance, and it would be nice to know that early. also worth having some performance tests to see how performance evolves over time and see if such patches regress performance meaningfully		15:35.42
	ator: lex fixed too (2 LOC fix) https://0x0.st/iPL0.txt now source files with SMP work correctly. still missing normalization after string ops though		15:52.16
	ator: at your master, the "avoid interning short names", i'd do s/8/12/ because array index is up to 9 decimal digits, 10 with '\0', so just make it 12 IMO		16:06.33
	sorry, 10 digits, 11 with '\0'. covered by 12		16:07.02
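The digit count can be checked with a small sketch, assuming the largest array index is 2^32 - 2 as in ECMAScript. The function name is hypothetical.

```c
#include <stdio.h>

/* Bytes needed to hold the largest array index as a decimal string,
 * including the terminating '\0'. */
static int max_index_buffer_size(void)
{
	char buf[12];
	/* 4294967294 is 2^32 - 2, the largest valid array index. */
	int digits = snprintf(buf, sizeof buf, "%lu", 4294967294UL);
	return digits + 1;
}
```

The result fits within the suggested limit of 12.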
	ator: the utf8 patch as git patch https://0x0.st/iP9e.txt		17:17.24

Log of #mupdf at irc.freenode.net.