MuPDF IRC logs

20200221
avih	ator: i added a few more normalizations, but i don't think this approach can work as is. there are several things which operate on c-strings (notably substrings, search, and regex) which cannot always work correctly with wtf8 storage. if this approach is kept then these things need to either be modified to work on runes and not c-strings, or (probably easier, simpler, and faster to develop and in runtime) convert all their input to cesu8 (for most strings this is a no-op), do their work, and normalize the results		09:22.13
	this does add some memory management, but relatively limited in scope, i.e. not all over the place		09:23.33
	regex does sort of work on runes, but its results are lstrings within the input, so this doesn't work for virtual surrogate runes		09:30.28
	basically, js_utfidxtoptr and js_utfptrtoidx cannot be used with wtf8 storage		09:32.56
	(except if the pointer is not between virtual surrogates, but we can't guarantee it unless we ensure the string is cesu8)		09:34.08
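
The "convert all their input to cesu8" step discussed above could look roughly like the following C sketch. The function name utf8_to_cesu8 is hypothetical and not part of mujs; the sketch assumes NUL-terminated input and leaves everything except 4-byte sequences untouched, so WTF-8 input that already encodes lone surrogates as 3-byte sequences passes through unchanged.

```c
#include <stdlib.h>
#include <string.h>

/* Write code unit u (0x0800..0xFFFF) as a 3-byte sequence at d. */
static unsigned char *put3(unsigned char *d, unsigned long u)
{
	*d++ = (unsigned char)(0xE0 | (u >> 12));
	*d++ = (unsigned char)(0x80 | ((u >> 6) & 0x3F));
	*d++ = (unsigned char)(0x80 | (u & 0x3F));
	return d;
}

/* Hypothetical helper (not a mujs API): returns a malloc'ed CESU-8 copy of
 * a UTF-8/WTF-8 string. Each 4-byte sequence (a code point above U+FFFF)
 * becomes a surrogate pair encoded as two 3-byte sequences; all other
 * bytes are copied as-is. The caller frees the result. */
char *utf8_to_cesu8(const char *src)
{
	size_t n = strlen(src);
	/* Worst case: every 4 input bytes grow to 6 output bytes. */
	unsigned char *out = malloc(n + n / 2 + 1);
	unsigned char *d = out;
	const unsigned char *s = (const unsigned char *)src;

	if (!out)
		return NULL;
	while (*s) {
		if ((s[0] & 0xF8) == 0xF0 && s[1] && s[2] && s[3]) {
			unsigned long cp = ((unsigned long)(s[0] & 0x07) << 18) |
				((unsigned long)(s[1] & 0x3F) << 12) |
				((unsigned long)(s[2] & 0x3F) << 6) |
				(unsigned long)(s[3] & 0x3F);
			if (cp >= 0x10000 && cp <= 0x10FFFF) {
				unsigned long v = cp - 0x10000;
				d = put3(d, 0xD800 + (v >> 10));   /* high surrogate */
				d = put3(d, 0xDC00 + (v & 0x3FF)); /* low surrogate */
				s += 4;
				continue;
			}
		}
		*d++ = *s++; /* ASCII, 2-byte, 3-byte, or malformed: copy */
	}
	*d = 0;
	return (char *)out;
}
```

For strings without 4-byte sequences the result is byte-identical to the input, which matches the "for most strings this is a no-op" remark; a real implementation could skip the allocation in that case. The reverse step (normalizing results back to the storage encoding) is not shown.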
quite	mupdf is not great at finding strings in "many" pdfs..		14:21.50
kens	Possibly the problem is the PDF files		14:26.32
sebras	quite: PDFs do not _have_ to encode the text in a way to make it searchable by PDF viewers.		14:26.58
	quite: but if you find a PDF where mupdf fails to search but other viewers (acrobat reader in particular) are able to do text search, then feel free to file a bug.		14:27.33
quite	even the most simple ones i make with pandoc from a md now are good examples :(		14:31.30
	okular finds my strings.		14:32.05
	how does this work? picking up the font and ocr:ing? :o		14:32.17
	yes i'm clueless..		14:32.29
sebras	quite: do you mind attaching one of those pandoc-generated PDFs in a bug at bugs.ghostscript.com so someone can take a look? no, OCRing is not involved. the text is encoded in the PDF itself. and if okular can find the strings then probably mupdf should too.		14:38.53
ator	sebras: quite: it could be related to a new-ish bug in freetype that caused problems with text search and extraction		15:04.36
quite	yes i must say that i feel like it used to work better..		15:05.06
ator	pretty sure distros build with system freetype, which would be affected		15:05.06
	quite: have you built from source or installed from your distro?		15:05.31
sebras	oh? I wasn't aware of the freetype bug having that effect.		15:05.49
ator	it's causing us to add "random" space characters in the text		15:06.04
	because the advance widths reported by freetype are bugged for certain fonts		15:06.15
	quite: anyway, an example file would help us immensely in tracking this down		15:06.45
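
To illustrate how a wrong advance width can turn into a stray space, here is a small C sketch of one common extraction heuristic: insert a space when the gap between where the previous glyph is expected to end (origin plus advance) and where the next glyph starts exceeds a threshold. This is an assumption for illustration, not MuPDF's actual code; struct glyph, extract_text and space_threshold are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical glyph record: character, x origin, and the advance width
 * reported by the font library (all in the same units). */
struct glyph {
	char ch;
	double x;
	double advance;
};

/* Copies the glyph characters into out (capacity cap, always
 * NUL-terminated when cap > 0), inserting a space wherever the gap
 * between the expected pen position and the next glyph origin exceeds
 * space_threshold. Returns the number of characters written. */
size_t extract_text(const struct glyph *g, size_t n,
		double space_threshold, char *out, size_t cap)
{
	size_t i, k = 0;

	for (i = 0; i < n; i++) {
		if (i > 0) {
			double expected = g[i - 1].x + g[i - 1].advance;
			double gap = g[i].x - expected;
			if (gap > space_threshold && k + 1 < cap)
				out[k++] = ' ';
		}
		if (k + 1 < cap)
			out[k++] = g[i].ch;
	}
	if (cap > 0)
		out[k] = '\0';
	return k;
}
```

With this heuristic, if the reported advance is smaller than the width the PDF actually laid the glyph out with, the computed gap grows and a space appears between letters of the same word, which then breaks plain substring search.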
quite	https://bugs.ghostscript.com/show_bug.cgi?id=702141		15:10.07
ator	quite: yep, that's related to the freetype bug...		15:13.13
sebras	ator: you did submit a bug report upstream though. I wonder if werner did anything with it.		15:14.05
ator	sebras: commit 82196fd87		15:14.36
	https://bugs.ghostscript.com/show_bug.cgi?id=701977		15:14.54
	https://savannah.nongnu.org/bugs/?57519		15:15.05
sebras	found it.		15:16.15
	and it doesn't seem like there's a shred of a fix there...		15:16.29
quite	gosh		15:17.37
avih	ator: one thing which can be done is to make js_String hold also a cesu8 version of the string if required. it can be an additional pointer, or, slightly hackier, store the cesu8 version after strlen+1 of the stored string.		15:43.31
	this means that when creating a js_String it also has to check if it has utf8 SMPs		15:44.16
	and for most strings it won't so it remains identical to now. and js_tostring_cesu8(J) is s = js_tostring(); return s + strlen(s) + 1		15:45.27
	well, first it needs to check if there are SMPs, and if there are then it does this strlen thingy		15:46.18
	otherwise it just returns the normal string		15:46.31
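
The "store the cesu8 version after strlen+1" idea above can be sketched in C as follows. This is only an illustration under assumptions: dual_string_new and dual_string_cesu8 are hypothetical names, not mujs functions, and the CESU-8 form is passed in by the caller instead of being computed here. "SMPs" in the log is read as any code point above U+FFFF, i.e. any 4-byte UTF-8 sequence.

```c
#include <stdlib.h>
#include <string.h>

/* True if the UTF-8 string contains a 4-byte sequence, i.e. a code
 * point above U+FFFF that CESU-8 would encode as a surrogate pair. */
static int has_supplementary(const char *s)
{
	for (; *s; ++s)
		if (((unsigned char)*s & 0xF8) == 0xF0)
			return 1;
	return 0;
}

/* Hypothetical constructor: allocates "utf8 NUL" and, only when the
 * string has 4-byte sequences, appends "cesu8 NUL" right after it.
 * cesu8 is ignored (and may be NULL) when no such sequence exists. */
char *dual_string_new(const char *utf8, const char *cesu8)
{
	size_t n = strlen(utf8);
	size_t m = has_supplementary(utf8) ? strlen(cesu8) + 1 : 0;
	char *p = malloc(n + 1 + m);

	if (!p)
		return NULL;
	memcpy(p, utf8, n + 1);
	if (m)
		memcpy(p + n + 1, cesu8, m);
	return p;
}

/* Hypothetical accessor: returns the CESU-8 view. For most strings
 * that is the string itself; otherwise it is the copy stored after
 * the first terminating NUL. */
const char *dual_string_cesu8(const char *s)
{
	if (!has_supplementary(s))
		return s;
	return s + strlen(s) + 1;
}
```

The accessor rescans the string on every call; caching a flag at creation time would avoid that at the cost of an extra field, which is the trade-off between the "additional pointer" and the "slightly hackier" layout mentioned above.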

Log of #mupdf at irc.freenode.net.