| <<<Back 1 day (to 2020/02/20) | Fwd 1 day (to 2020/02/22)>>> | 20200221 |
avih | ator: i added a few more normalizations, but i don't think this approach can work as is. there are several things which operate on c-strings (notably substrings, search, and regex) which cannot always work correctly with wtf8 storage. if this approach is kept then these things need to either be modified to work on runes and not c-strings, or (probably easier, simpler, and faster to develop and in runtime) convert all their input to cesu8 (for most strings this is a | 09:22.13 |
| no-op), do their work, and normalize the results | 09:22.13 |
| this does add some memory management, but relatively limited in scope, i.e. not all over the place | 09:23.33 |
| regex does sort of work on runes, but its results are lstrings within the input, so this doesn't work for virtual surrogate runes | 09:30.28 |
| basically, js_utfidxtoptr and js_utfptrtoidx cannot be used with wtf8 storage | 09:32.56 |
| (except if the pointer is not between virtual surrogates, but we can't guarantee it unless we ensure the string is cesu8) | 09:34.08 |
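
The "convert the input to cesu8" step discussed above could look roughly like the sketch below. It is only an illustration, not code from the project: `utf8_to_cesu8` and `put3` are hypothetical names, and the input is assumed to be well-formed UTF-8. Every 4-byte sequence (a code point above U+FFFF) is rewritten as a surrogate pair, with each surrogate stored as 3 bytes. All other bytes are copied unchanged, which is why the conversion is a no-op for most strings.

```c
#include <stdlib.h>
#include <string.h>

/* Write one 16-bit code unit (here: a surrogate) as 3 UTF-8-style bytes. */
static char *put3(char *out, unsigned u)
{
	*out++ = (char)(0xE0 | (u >> 12));
	*out++ = (char)(0x80 | ((u >> 6) & 0x3F));
	*out++ = (char)(0x80 | (u & 0x3F));
	return out;
}

/* Hypothetical helper: returns a malloc'd CESU-8 copy of src (caller frees).
 * Assumes src is well-formed UTF-8. */
char *utf8_to_cesu8(const char *src)
{
	const unsigned char *s = (const unsigned char *)src;
	size_t n = strlen(src);
	/* each 4-byte sequence grows to 6 bytes, so n + n/2 + 1 always suffices */
	char *dst = malloc(n + n / 2 + 1);
	char *out = dst;
	if (!dst)
		return NULL;
	while (*s) {
		if (*s >= 0xF0 && s[1] && s[2] && s[3]) {
			/* decode the code point, then split it into a surrogate pair */
			unsigned cp = ((s[0] & 0x07u) << 18) | ((s[1] & 0x3Fu) << 12) |
			              ((s[2] & 0x3Fu) << 6) | (s[3] & 0x3Fu);
			cp -= 0x10000;
			out = put3(out, 0xD800 + (cp >> 10));
			out = put3(out, 0xDC00 + (cp & 0x3FF));
			s += 4;
		} else {
			*out++ = (char)*s++;
		}
	}
	*out = 0;
	return dst;
}
```

With a string in this form, every surrogate is a real 3-byte sequence in memory, so byte pointers and rune indices can be mapped to each other without landing between "virtual" surrogates.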
quite | mupdf is not great at finding strings in "many" pdfs.. | 14:21.50 |
kens | Possibly the problem is the PDF files | 14:26.32 |
sebras | quite: PDFs do not _have_ to encode the text in a way to make it searchable by PDF viewers. | 14:26.58 |
| quite: but if you find a PDF where mupdf fails to search but other viewers (acrobat reader in particular) are able to do text search, then feel free to file a bug. | 14:27.33 |
quite | even the most simple ones i make with pandoc from a md now are good examples :( | 14:31.30 |
| okular finds my strings. | 14:32.05 |
| how does this work? picking up the font and ocr:ing? :o | 14:32.17 |
| yes i'm clueless.. | 14:32.29 |
sebras | quite: do you mind attaching one of those pandoc-generated PDFs in a bug at bugs.ghostscript.com so someone can take a look? no, OCRing is not involved. the text is encoded in the PDF itself. and if okular can find the strings then probably mupdf should too. | 14:38.53 |
ator | sebras: quite: it could be related to a new-ish bug in freetype that caused problems with text search and extraction | 15:04.36 |
quite | yes i must say that i feel like it used to work better.. | 15:05.06 |
ator | pretty sure distros build with system freetype, which would be affected | 15:05.06 |
| quite: have you built from source or installed from your distro? | 15:05.31 |
sebras | oh? I wasn't aware of the freetype bug having that effect. | 15:05.49 |
ator | it's causing us to add "random" space characters in the text | 15:06.04 |
| because the advance widths reported by freetype are bugged for certain fonts | 15:06.15 |
| quite: anyway, an example file would help us immensely in tracking this down | 15:06.45 |
quite | https://bugs.ghostscript.com/show_bug.cgi?id=702141 | 15:10.07 |
ator | quite: yep, that's related to the freetype bug... | 15:13.13 |
sebras | ator: you did submit a bug report upstream though. I wonder if werner did anything with it. | 15:14.05 |
ator | sebras: commit 82196fd87 | 15:14.36 |
| https://bugs.ghostscript.com/show_bug.cgi?id=701977 | 15:14.54 |
| https://savannah.nongnu.org/bugs/?57519 | 15:15.05 |
sebras | found it. | 15:16.15 |
| and it doesn't seem like there's a shred of a fix there... | 15:16.29 |
quite | gosh | 15:17.37 |
avih | ator: one thing which can be done is to make js_String hold also a cesu8 version of the string if required. it can be an additional pointer, or, slightly hackier, store the cesu8 version after strlen+1 of the stored string. | 15:43.31 |
| this means that when creating a js_String it also has to check if it has utf8 SMPs | 15:44.16 |
| and for most strings it won't so it remains identical to now. and js_tostring_cesu8(J) is s = js_tostring(); return s + strlen(s) + 1 | 15:45.27 |
| well, first it needs to check if there are SMPs, and if there are then it does this strlen thingy | 15:46.18 |
| otherwise it just returns the normal string | 15:46.31 |
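
The "store the cesu8 version after strlen+1" idea above can be sketched as follows. This is a standalone illustration under stated assumptions, not the real js_String code: `dual_string_new`, `dual_string_cesu8`, `count_smp` and `put_surrogate` are hypothetical names, and the input is assumed to be well-formed UTF-8. A string without 4-byte sequences is stored exactly once. Otherwise a CESU-8 copy is placed directly behind the terminating NUL of the primary string.

```c
#include <stdlib.h>
#include <string.h>

/* Count 4-byte UTF-8 lead bytes, i.e. characters above U+FFFF. */
static size_t count_smp(const char *s)
{
	size_t k = 0;
	for (; *s; ++s)
		if ((unsigned char)*s >= 0xF0)
			++k;
	return k;
}

/* Write one surrogate as 3 UTF-8-style bytes. */
static char *put_surrogate(char *out, unsigned u)
{
	*out++ = (char)(0xE0 | (u >> 12));
	*out++ = (char)(0x80 | ((u >> 6) & 0x3F));
	*out++ = (char)(0x80 | (u & 0x3F));
	return out;
}

/* Hypothetical: allocate "utf8\0" and, only when needed, "cesu8\0" after it. */
char *dual_string_new(const char *utf8)
{
	const unsigned char *s = (const unsigned char *)utf8;
	size_t n = strlen(utf8), k = count_smp(utf8);
	/* the CESU-8 copy needs 2 extra bytes per 4-byte sequence */
	char *buf = malloc(k ? (n + 1) + (n + 2 * k + 1) : n + 1);
	char *out;
	if (!buf)
		return NULL;
	memcpy(buf, utf8, n + 1);
	if (!k)
		return buf; /* common case: identical to a plain string */
	out = buf + n + 1;
	while (*s) {
		if (*s >= 0xF0 && s[1] && s[2] && s[3]) {
			unsigned cp = ((s[0] & 0x07u) << 18) | ((s[1] & 0x3Fu) << 12) |
			              ((s[2] & 0x3Fu) << 6) | (s[3] & 0x3Fu);
			cp -= 0x10000;
			out = put_surrogate(out, 0xD800 + (cp >> 10));
			out = put_surrogate(out, 0xDC00 + (cp & 0x3FF));
			s += 4;
		} else {
			*out++ = (char)*s++;
		}
	}
	*out = 0;
	return buf;
}

/* Only valid for strings created by dual_string_new. */
const char *dual_string_cesu8(const char *s)
{
	return count_smp(s) ? s + strlen(s) + 1 : s;
}
```

In this sketch the accessor rescans the string on every call. Caching a flag at creation time would avoid that, at the cost of extra state per string.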