Log of #mupdf at irc.freenode.net.

2020/02/21
avih ator: i added a few more normalizations, but i don't think this approach can work as is. there are several things which operate on c-strings (notably substrings, search, and regex) which cannot always work correctly with wtf8 storage. if this approach is kept then these things need to either be modified to work on runes and not c-strings, or (probably easier, simpler, and faster to develop and at runtime) convert all their input to cesu8 (for most strings this is a no-op), do their work, and normalize the results 09:22.13
  this does add some memory management, but relatively limited in scope, i.e. not all over the place 09:23.33
  regex does sort of work on runes, but its results are lstrings within the input, so this doesn't work for virtual surrogate runes 09:30.28
  basically, js_utfidxtoptr and js_utfptrtoidx cannot be used with wtf8 storage 09:32.56
  (except if the pointer is not between virtual surrogates, but we can't guarantee it unless we ensure the string is cesu8) 09:34.08
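
A minimal sketch of the problem avih describes, assuming valid UTF-8 input. The names cesu8_length, utf8_to_cesu8 and utf16_index_to_ptr are hypothetical and are not MuJS functions. JavaScript indexes strings in UTF-16 code units, so a character above U+FFFF counts as two units. Stored as one 4-byte sequence, its second unit (the "virtual" low surrogate) has no byte address. Stored as CESU-8, each unit has its own 3-byte sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Bytes needed to store s as CESU-8, excluding the terminator.
 * Each 4-byte sequence grows into two 3-byte surrogate sequences. */
size_t cesu8_length(const char *s)
{
    size_t n = strlen(s);
    for (; *s; s++)
        if (((unsigned char)*s & 0xF8) == 0xF0)
            n += 2;
    return n;
}

/* Write the 16-bit unit u as a 3-byte sequence; returns the advanced pointer. */
static unsigned char *put3(unsigned char *q, unsigned u)
{
    *q++ = (unsigned char)(0xE0 | (u >> 12));
    *q++ = (unsigned char)(0x80 | ((u >> 6) & 0x3F));
    *q++ = (unsigned char)(0x80 | (u & 0x3F));
    return q;
}

/* Convert UTF-8 to CESU-8. dst must hold cesu8_length(src) + 1 bytes.
 * For strings without 4-byte sequences this is a plain copy. */
void utf8_to_cesu8(char *dst, const char *src)
{
    const unsigned char *p = (const unsigned char *)src;
    unsigned char *q = (unsigned char *)dst;

    assert(dst != NULL && src != NULL);
    while (*p) {
        if ((*p & 0xF8) == 0xF0 && p[1] && p[2] && p[3]) {
            unsigned long c = ((unsigned long)(p[0] & 0x07) << 18) |
                ((unsigned long)(p[1] & 0x3F) << 12) |
                ((unsigned long)(p[2] & 0x3F) << 6) | (p[3] & 0x3F);
            c -= 0x10000;
            q = put3(q, 0xD800 + (unsigned)(c >> 10));   /* high surrogate */
            q = put3(q, 0xDC00 + (unsigned)(c & 0x3FF)); /* low surrogate */
            p += 4;
        } else {
            *q++ = *p++;
        }
    }
    *q = 0;
}

/* Map a UTF-16 code unit index to a byte pointer. Returns NULL when the
 * index is out of range or falls inside a 4-byte sequence, where the
 * low surrogate has no byte position of its own. */
const char *utf16_index_to_ptr(const char *s, size_t idx)
{
    const unsigned char *p = (const unsigned char *)s;

    while (idx > 0 && *p) {
        size_t units = ((*p & 0xF8) == 0xF0) ? 2 : 1;
        if (idx < units)
            return NULL;
        idx -= units;
        p++;
        while ((*p & 0xC0) == 0x80) /* skip continuation bytes */
            p++;
    }
    return idx == 0 ? (const char *)p : NULL;
}
```

With "a", U+1F600, "b" as input, index 2 has no pointer in the 4-byte form but maps to byte offset 4 once the string is converted to CESU-8.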
quite mupdf is not great at finding strings in "many" pdfs.. 14:21.50
kens Possibly the problem is the PDF files 14:26.32
sebras quite: PDFs do not _have_ to encode the text in a way to make it searchable by PDF viewers. 14:26.58
  quite: but if you find a PDF where mupdf fails to search but other viewers (acrobat reader in particular) are able to do text search, then feel free to file a bug. 14:27.33
quite even the most simple ones i make with pandoc from an md now are good examples :( 14:31.30
  okular finds my strings. 14:32.05
  how does this work? picking up the font and ocr:ing? :o 14:32.17
  yes i'm clueless.. 14:32.29
sebras quite: do you mind attaching one of those pandoc-generated PDFs in a bug at bugs.ghostscript.com so someone can take a look? no, OCRing is not involved. the text is encoded in the PDF itself. and if okular can find the strings then probably mupdf should too. 14:38.53
ator sebras: quite: it could be related to a new-ish bug in freetype that caused problems with text search and extraction 15:04.36
quite yes i must say that i feel like it used to work better.. 15:05.06
ator probably sure distros build with system freetype, which would be affected 15:05.06
  quite: have you built from source or installed from your distro? 15:05.31
sebras oh? I wasn't aware about the freetype bug having that effect. 15:05.49
ator it's causing us to add "random" space characters in the text 15:06.04
  because the advance widths reported by freetype are bugged for certain fonts 15:06.15
  quite: anyway, an example file would help us immensely in tracking this down 15:06.45
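
A rough sketch of how a wrong advance width can turn into a spurious space, assuming the extractor inserts a space whenever the gap between the expected pen position and the next glyph exceeds a fraction of the font size. The glyph struct, the SPACE_GAP threshold and extract_text are hypothetical illustrations, not MuPDF's actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical glyph record: character, x origin, and reported advance. */
typedef struct {
    char ch;
    float x;
    float advance;
} glyph;

/* Assumed threshold: a gap wider than this fraction of the font size
 * is treated as a word break. */
#define SPACE_GAP 0.2f

/* Copy glyph characters to out, inserting a space wherever the next glyph
 * starts noticeably later than the previous origin plus advance.
 * Returns the number of characters written, excluding the terminator. */
size_t extract_text(const glyph *g, size_t n, float font_size,
                    char *out, size_t cap)
{
    size_t i, len = 0;

    assert(out != NULL && cap > 0);
    for (i = 0; i < n; i++) {
        if (i > 0) {
            float expected = g[i - 1].x + g[i - 1].advance;
            /* An under-reported advance makes this gap look too large. */
            if (g[i].x - expected > SPACE_GAP * font_size && len + 1 < cap)
                out[len++] = ' ';
        }
        if (len + 1 < cap)
            out[len++] = g[i].ch;
    }
    out[len] = 0;
    return len;
}
```

With a correct advance of 6 the glyphs "a" and "b" extract as "ab". If the advance for "a" is reported as 2 instead, the same layout extracts as "a b", and a search for "ab" no longer matches.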
quite https://bugs.ghostscript.com/show_bug.cgi?id=702141 15:10.07
ator quite: yep, that's related to the freetype bug... 15:13.13
sebras ator: you did submit a bug report upstream though. I wonder if werner did anything with it. 15:14.05
ator sebras: commit 82196fd87 15:14.36
  https://bugs.ghostscript.com/show_bug.cgi?id=701977 15:14.54
  https://savannah.nongnu.org/bugs/?57519 15:15.05
sebras found it. 15:16.15
  and it doesn't seem like there's a shred of a fix there... 15:16.29
quite gosh 15:17.37
avih ator: one thing which can be done is to make js_String hold also a cesu8 version of the string if required. it can be an additional pointer, or, slightly hackier, store the cesu8 version after strlen+1 of the stored string. 15:43.31
  this means that when creating a js_String it also has to check if it has utf8 SMPs 15:44.16
  and for most strings it won't so it remains identical to now. and js_tostring_cesu8(J) is s = js_tostring(); return s + strlen(s) + 1 15:45.27
  well, first it needs to check if there are SMPs, and if there are then it does this strlen thingy 15:46.18
  otherwise it just returns the normal string 15:46.31
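
A minimal sketch of the "store the cesu8 version after strlen+1" idea, assuming valid UTF-8 input and a plain malloc'd buffer in place of js_String. The names has_astral, dual_new and dual_cesu8 are hypothetical stand-ins. dual_cesu8 plays the role of the js_tostring_cesu8 proposed above, which does not exist in this form.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Nonzero if s contains a 4-byte UTF-8 lead byte (a character above U+FFFF). */
static int has_astral(const char *s)
{
    for (; *s; s++)
        if (((unsigned char)*s & 0xF8) == 0xF0)
            return 1;
    return 0;
}

/* Write the 16-bit unit u as a 3-byte sequence; returns the advanced pointer. */
static unsigned char *put3(unsigned char *q, unsigned u)
{
    *q++ = (unsigned char)(0xE0 | (u >> 12));
    *q++ = (unsigned char)(0x80 | ((u >> 6) & 0x3F));
    *q++ = (unsigned char)(0x80 | (u & 0x3F));
    return q;
}

/* Returns a buffer laid out as "utf8\0" or, when the string has astral
 * characters, "utf8\0cesu8\0". The caller frees it. NULL on allocation failure. */
char *dual_new(const char *utf8)
{
    const unsigned char *p = (const unsigned char *)utf8;
    unsigned char *q;
    size_t n, astral = 0, i;
    char *s;

    assert(utf8 != NULL);
    n = strlen(utf8);
    for (i = 0; i < n; i++)
        if ((p[i] & 0xF8) == 0xF0)
            astral++;

    /* Each astral character grows from 4 bytes to 6 in the CESU-8 copy. */
    s = (char *)malloc(astral ? n + 1 + (n + 2 * astral) + 1 : n + 1);
    if (!s)
        return NULL;
    memcpy(s, utf8, n + 1);
    if (!astral)
        return s; /* already valid CESU-8, nothing extra stored */

    q = (unsigned char *)s + n + 1;
    while (*p) {
        if ((*p & 0xF8) == 0xF0 && p[1] && p[2] && p[3]) {
            unsigned long c = ((unsigned long)(p[0] & 0x07) << 18) |
                ((unsigned long)(p[1] & 0x3F) << 12) |
                ((unsigned long)(p[2] & 0x3F) << 6) | (p[3] & 0x3F);
            c -= 0x10000;
            q = put3(q, 0xD800 + (unsigned)(c >> 10));
            q = put3(q, 0xDC00 + (unsigned)(c & 0x3FF));
            p += 4;
        } else {
            *q++ = *p++;
        }
    }
    *q = 0;
    return s;
}

/* Check for astral characters first; only then skip past the first terminator. */
const char *dual_cesu8(const char *s)
{
    return has_astral(s) ? s + strlen(s) + 1 : s;
}
```

One cost of this layout is that dual_cesu8 rescans the string on every call. The additional-pointer variant mentioned above would avoid the scan at the price of a larger string header.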