| 20160825 |
k-man | hi | 02:23.58 |
ghostbot | Welcome to #ghostscript, the channel for Ghostscript and MuPDF. If you have a question, please ask it, don't ask to ask it. Do be prepared to wait for a reply as devs will check the logs and reply when they come on line. | 02:23.58 |
k-man | is there a way to print from mupdf? | 02:24.11 |
fontdebug | Hi, there, I'd like to get coordinates from glyphs/chars in PDF. Tried -sDEVICE=txtwrite -dTextFormat=4 with ghostscript 9.19. But bbox y values are the same, meaning bbox=flat(?) | 08:45.52 |
| corr. -dTextFormat=0, of course | 08:46.08 |
kens | txtwrite will give the co-ordinates, if you request XML output, | 08:46.15 |
| What version of Ghostscript are you using ? | 08:46.23 |
fontdebug | 9.19. My problem is the bbox y-values | 08:46.41 |
kens | And can you put a copy of the PDF somewhere public so we can look at it ? | 08:46.50 |
fontdebug | just a moment... let's see... | 08:47.10 |
kens | DropBox or something is fine, as long as you don't mind the file being public | 08:47.46 |
fontdebug | here is an example: https://www.zvdd.de/fileadmin/AGSDD-Redaktion/zvdd_MODS_Application_Profile_2.1.pdf | 08:48.19 |
sebras | k-man: I think the mupdf app for Android can print through the cloud. | 08:48.45 |
fontdebug | output of gs-9.19 is e. g. <char bbox="113 47 121 47" c="M"/> | 08:48.54 |
kens | Give me a minute, are you looking at apge 1 ? | 08:49.06 |
| page* | 08:49.10 |
fontdebug | yes, page 1 | 08:49.17 |
kens | Which font? A lot of them are not embedded | 08:49.31 |
fontdebug | oops... ABCDEE+Calibri | 08:49.50 |
kens | OK Calibri is embedded | 08:50.02 |
sebras | k-man: The other apps (e.g. the one for Linux and) don't support printing yet. | 08:50.12 |
fontdebug | the <char bbox="113 47 121 47" c="M"/> is the first letter on the first page of the mentioned pdf. | 08:51.19 |
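The symptom described here (identical y values in every bbox) can be checked mechanically. Below is a minimal Python sketch, assuming the txtwrite output contains `<char bbox="x0 y0 x1 y1" c="..."/>` elements shaped like the single sample quoted in the log. The helper names `parse_chars` and `flat_chars` are hypothetical, and a regular expression is used rather than an XML parser so that the sketch does not assume the whole output is well-formed XML.

```python
import re

# Matches <char .../> elements shaped like the sample quoted in the log.
# Assumption: attributes appear in the order bbox, c, as in that sample.
CHAR_RE = re.compile(r'<char\s+bbox="([^"]*)"\s+c="([^"]*)"\s*/>')

def parse_chars(text):
    """Return a list of (char, (x0, y0, x1, y1)) tuples found in text."""
    out = []
    for bbox, c in CHAR_RE.findall(text):
        x0, y0, x1, y1 = (float(v) for v in bbox.split())
        out.append((c, (x0, y0, x1, y1)))
    return out

def flat_chars(chars):
    """Keep only entries whose bbox has zero height (y0 == y1)."""
    return [(c, b) for c, b in chars if b[1] == b[3]]

if __name__ == "__main__":
    sample = '<char bbox="113 47 121 47" c="M"/>'
    for c, bbox in flat_chars(parse_chars(sample)):
        print("flat bbox for", repr(c), bbox)
```

If every character on a page shows up in `flat_chars`, the zero height is systematic rather than a property of one glyph.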
kens | Yeah I see that | 08:51.33 |
| Finding out why might take a little longer.... | 08:52.00 |
fontdebug | I've also tried mutool draw -F stext, but in mupdf chars below 32 (dec) are put out as "?" | 08:52.01 |
kens | Does MuPDF get the bbox correct ? | 08:52.19 |
| The text output from MuPDF has generally had more work on it than the Ghostscript one | 08:52.33 |
fontdebug | (in mupdf bbox seems correct) | 08:53.15 |
| I have some rather obscure fonts in other PDFs where ghostscript says char is  (correct), but I need proper coordinates to inspect glyphs. | 08:54.25 |
kens | character codes below 32 are not unusual, especially with embedded subsets which often start from character code 1 | 08:55.02 |
fontdebug | Did a workaround by adding/subtracting a few pixels in y, but there seem to be cases where "the whole page" is a single glyph bbox... | 08:56.24 |
kens | That could be possible, we've seen some pretty badly created fonts over the years | 08:56.52 |
fontdebug | Will compare this with mutool. If mutool is correct, perhaps I get mutool to put out  instead of "?" | 08:57.34 |
| thanks so far. | 09:03.10 |
kens | Can't say we've helped at all. At the moment I'm struggling to see where the bbox is set up, which is embarrassing since I wrote the code.... | 09:03.39 |
fontdebug | perhaps in devices/vector/gdevtxtw.c | 09:06.34 |
kens | Well yes, since that's the txtwrite device..... | 09:06.48 |
| I meant within the code path | 09:07.10 |
fontdebug | uuh... there are a lot of start.y & end.y ... | 09:08.34 |
kens | Indeed, and many of them have nothing to do with the glyph, they relate to the positions of text fragments | 09:08.59 |
| Trying to piece together text out of a PDF file is a non-trivial task | 09:09.15 |
fontdebug | PS: the y coordinates returned are slightly above the baseline | 09:09.27 |
kens | Probably the y-co-ordinate is the starting position of the text, though I haven't checked | 09:10.07 |
| OK the reason is that the font is a horizontal writing font, not vertical | 09:10.41 |
| So we deal with horizontals, but not verticals. | 09:10.54 |
| I believe that we don't currently have a decent way to get a proper glyph bounding box, which is why it's only reliable in the font writing direction. | 09:11.21 |
| So if you need an accurate BBox you are going to have to use MuPDF for now | 09:11.37 |
fontdebug | Yes, because "putting text output together" is non-trivial, we've bought PDFlib TET, but I didn't know about these features of ghostscript+mutool before. | 09:12.24 |
kens | Well, it's all very heuristic, but as I say the MuPDF one has had more attention than the Ghostscript one | 09:12.58 |
| They don't share the same code base, or even the same approach, though. So it's possible that sometimes the Ghostscript one will perform better | 09:13.30 |
| In any event, the reason the char bbox is 0 in the y direction is because it's not a vertical font. We might well change that at some point in the future, but for now that's the way it is. Better to stick with MuPDF I think. | 09:16.01 |
| I'm sure you can change it to emit character codes < 32 if no Unicode code point information is available, and it's something maybe tor8 or Robin_Watts might consider anyway | 09:16.56 |
fontdebug | in mupdf, pdf-unicode.c, approx. line 100: change font->cid_to_ucs[cpt] = '?' to font->cid_to_ucs[cpt] = cpt | 09:39.27 |
| corr.: line 99 (mupdf-1.9a) | 09:40.30 |
kens | I suspected it would be straight-forward | 09:40.56 |
fontdebug | here is the output of "hg diff", gzipped and base64'd: H4sIAD69vlcAA41OTY+CMBQ8t79ibrqp1RbBD4wu/oe9mQ3BQrGRUAL0ZPa/b0EP64FkD2/e5M3Mey83WoO3WG/kVYSRiORWorOuVcWqyfVQ3NVG2bxYKso5R7aaksnXzeHsSgQRpIwDEa9DBEJuwEQgBGWM4frfdChjufubThLwfbTYgXncI0koCHkMQIzGvOtbU5fdRTX998c4JdrWPT8pk6e9TZ16ajjCn00ra++uSbOyek8eKPfJouqKkUzumH3ODpS9rHiMdNLs29P8Mzw2gC/6C0O2Au97AQAA | 09:41.36 |
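A blob described as "gzipped and base64'd" can be turned back into text with the Python standard library. The sketch below assumes only that description (base64 wrapping a gzip stream) and uses a hypothetical helper name, `decode_patch`. It has not been run against the blob above, so no decoded output is shown.

```python
import base64
import gzip

def decode_patch(blob: str) -> str:
    """Decode a base64 string that wraps gzip-compressed text."""
    # Drop any whitespace introduced by line wrapping or copy/paste.
    compact = "".join(blob.split())
    raw = base64.b64decode(compact)
    # errors="replace" keeps going if the diff contains non-UTF-8 bytes.
    return gzip.decompress(raw).decode("utf-8", errors="replace")

if __name__ == "__main__":
    # Round trip with made-up data to show the direction of each step.
    original = "--- a/file\n+++ b/file\n"
    blob = base64.b64encode(gzip.compress(original.encode("utf-8"))).decode("ascii")
    print(decode_patch(blob))
```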
tor8 | fontdebug: if you want the raw glyph positions, mutool draw -Ftrace (but you'll also need to apply the matrices to the coords) | 09:42.51 |
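"Applying the matrices to the coords" amounts to an affine transform. The sketch below uses the PDF convention of a six-number matrix `[a b c d e f]`; that the trace output lists its matrix values in this order is an assumption to check against real output, and the function names `apply_matrix` and `concat` are hypothetical.

```python
def apply_matrix(m, x, y):
    """Transform point (x, y) by the affine matrix m = (a, b, c, d, e, f).

    PDF convention: x' = a*x + c*y + e, y' = b*x + d*y + f.
    """
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)

def concat(m1, m2):
    """Return the matrix equivalent to applying m1 first, then m2."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )

if __name__ == "__main__":
    scale = (2, 0, 0, 2, 0, 0)        # double both axes
    shift = (1, 0, 0, 1, 10, 20)      # then move by (10, 20)
    print(apply_matrix(concat(scale, shift), 3, 4))
```

When a glyph position sits under both a text matrix and an outer transform, concatenating them once and reusing the result avoids transforming each point twice.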
kens | tor8 question for you on #artifex | 09:44.26 |
fontdebug | did you mean me? | 09:47.38 |
| (-kens)? | 09:47.45 |
kens | Nope, I meant tor8 | 09:47.48 |
tor8 | fontdebug: the 'cpt' there is essentially a random number with no known correlation to unicode, which is why we set it to '?' | 10:02.59 |
kens | It's the character code ? | 10:03.13 |
tor8 | kens: It's the character code, but usually the glyph index for an Identity-H encoded font missing a ToUnicode cmap. | 10:04.09 |
kens | Right | 10:04.18 |
tor8 | so using that for text extraction and searching would, well, it wouldn't be much better than '?' :) | 10:04.35 |
kens | Agreed, but if it's what fontdebug wants..... | 10:04.50 |
tor8 | though we should probably be using U+FFFD (REPLACEMENT CHARACTER) | 10:05.33 |
| if it's what he wants, he's free to hack his own source, just beware of the risks of false positives | 10:06.10 |
fontdebug | Yes, U+FFFD seems a bit clearer than "?". | 10:34.54 |
| or something like U+FF00+cpt | 10:37.07 |
tor8 | fontdebug: cpt may be >= 256 for multibyte encodings | 10:41.19 |
| though that's obviously not the case for that particular bit of code | 10:42.02 |
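The fallback options floated in this exchange ("?", U+FFFD, the raw code, U+FF00+cpt) can be compared side by side. This is an illustrative Python sketch, not MuPDF's code; the function name `fallback_codepoint` and the policy labels are hypothetical.

```python
REPLACEMENT_CHARACTER = 0xFFFD

def fallback_codepoint(cpt, policy="replacement"):
    """Pick a code point for character code cpt when no Unicode mapping exists."""
    if policy == "question":
        return ord("?")               # indistinguishable from a real "?"
    if policy == "replacement":
        return REPLACEMENT_CHARACTER  # U+FFFD, marks "unknown" explicitly
    if policy == "raw":
        return cpt                    # codes below 32 become control characters
    if policy == "offset":
        # For small cpt this lands in the Halfwidth and Fullwidth Forms
        # block (U+FF00..U+FFEF), so it collides with real characters;
        # for cpt >= 256 it runs past U+FFFF.
        return 0xFF00 + cpt
    raise ValueError("unknown policy: " + policy)

if __name__ == "__main__":
    for cpt in (1, 31, 256):
        print(cpt, [hex(fallback_codepoint(cpt, p))
                    for p in ("question", "replacement", "raw", "offset")])
```

Only the "raw" and "offset" policies preserve the character code, and both risk the false positives mentioned above because the resulting code points already mean something else.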
fontdebug | bye. | 11:20.54 |