| <<<Back 1 day (to 2020/05/27) | Fwd 1 day (to 2020/05/29)>>> | 20200528 |
myopia | just came across djvu shell extension pack in https://www.cuminas.jp/en/downloads . it seems the thing can render djvu inside windows photo viewer, allowing page flipping even. wondering if mupdf should do this someday for pdf files? | 00:06.04 |
| https://www.tracker-software.com/shell_ext.html < here's one | 00:25.57 |
| if you can make it a shell extension / WIC code windows folks no longer need a separate pdf viewer, lol | 00:26.34 |
| apparently this tracker soft's one of the first with the right business attitude/outlook > https://www.tracker-software.com/company/news-press-events/view/230 < read this one. it's as if people in the occupied areas of Republic of China can actually purchase from them, since completing the transaction requires visa/mastercard | 01:00.57 |
ator | Zsolt: PNG also uses Deflate compression. | 08:53.24 |
pedr0 | hi all, any thoughts regarding the last message I've posted concerning the document writer interface/Pdf Output device ? | 08:57.30 |
ator | pedr0: that error message generally comes when the PDF file uses an embedded font without a proper 'cmap' table | 09:02.24 |
| so we can't figure the encoding used | 09:03.29 |
| as a result, the output PDF file won't be searchable and you can't copy text from it | 09:03.54 |
pedr0 | exactly | 09:04.01 |
| but the same portion of text was searchable before my 'rewriting' | 09:04.27 |
| meaning before I fed the document to the writer | 09:04.49 |
| what I am trying to get to is if I have done something wrong or this is simply a limitation/edge case | 09:05.17 |
ator | the information that makes text searchable is lost between the layers of code. the stuff that goes to the fz_device (and thus the document writer) is the raw font data only, the PDF structures that define the mappings to searchable text have been lost at that point | 09:05.23 |
pedr0 | oks, there isn't much I can do to circumvent the problem then | 09:05.55 |
ator | now, if the font file used has an encoding the output will work. this is the case if you use system fonts or the 14 builtin core PDF fonts. | 09:06.17 |
| if the input has a bunch of subset fonts where the PDF producer has stripped out everything but the exact glyphs needed, and all auxiliary info, in order to make the embedded font as small as possible | 09:06.53 |
| then you're out of luck, | 09:06.59 |
pedr0 | I see | 09:07.04 |
| I understand now | 09:07.12 |
ator | the problem is that the input PDF file has a bunch of tables and PDF objects that define the mapping from glyph number to searchable text | 09:07.55 |
| but that is PDF interpreter specific, and is not exposed to the fz_device interface | 09:08.10 |
pedr0 | are you referring to the encoding ? | 09:08.24 |
| property of the font ? | 09:08.32 |
ator | in a way, yes. | 09:08.38 |
| encodings and fonts in PDF is a hairy chapter :) | 09:08.47 |
pedr0 | I think I get the gist of it, we don't 'carry the encoding with us' of those subset of fonts | 09:09.19 |
ator | but in short, an embedded font in a PDF is allowed to have NO encoding, or its encoding can be random gibberish | 09:09.27 |
| but the PDF file can define other mappings on top of this "no/unknown/gibberish" encoding | 09:09.56 |
| one of them is a ToUnicode, which is required to make text searchable | 09:10.04 |
pedr0 | thanks a lot ator. | 09:10.14 |
ator | *or* the embedded font file can have the usual truetype 'cmap' encoding table, or use Type1 font glyph names | 09:10.59 |
| we can get searchable text by looking up the encoding of the glyphs that are drawn using either the ToUnicode table, or the truetype cmap table, or the type1 glyph names | 09:11.50 |
| when going through the fz_device interface, we do not pass any ToUnicode table, so on the other side only the cmap table or glyph names are available | 09:12.26 |
| I have considered changing the fz_font/fz_device interface a bit to allow the ToUnicode table to be passed along as well | 09:12.50 |
| but no promises! | 09:12.57 |
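The lookup ator describes above can be sketched roughly as follows. This is a minimal model, not MuPDF code: the `unicode_sources` struct, `table_get` and `lookup_unicode` are hypothetical names, and the order in which the three sources are tried is an assumption of the sketch. Setting `to_unicode` to NULL models what the receiving side of the fz_device interface sees.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model, not MuPDF types. Each table maps to a Unicode
 * code point, with 0 meaning "no entry"; a NULL table is absent. */
typedef struct {
    const uint32_t *to_unicode;   /* indexed by character code (PDF ToUnicode) */
    const uint32_t *cmap_unicode; /* indexed by glyph index (TrueType cmap, reversed) */
    const uint32_t *name_unicode; /* indexed by glyph index (Type1 glyph names) */
    size_t size;                  /* entries in each table that is present */
} unicode_sources;

/* Bounds-checked read that treats a missing table as "no entry". */
static uint32_t table_get(const uint32_t *table, size_t size, size_t index)
{
    return (table != NULL && index < size) ? table[index] : 0;
}

/* Try each source in turn; the order is an assumption of this sketch.
 * Returns 0 when no source knows the character (unsearchable text). */
static uint32_t lookup_unicode(const unicode_sources *src, size_t code, size_t gid)
{
    uint32_t u;
    assert(src != NULL);
    if ((u = table_get(src->to_unicode, src->size, code)) != 0)
        return u;
    if ((u = table_get(src->cmap_unicode, src->size, gid)) != 0)
        return u;
    return table_get(src->name_unicode, src->size, gid);
}
```

With a stripped subset font (no cmap, no glyph names), the same character code resolves inside the PDF interpreter but comes back as 0 once the ToUnicode table is no longer available.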
| sebras: there are no js_tofloatarray because I forgot :) | 09:14.29 |
sebras | ator: ah! :) | 09:14.39 |
| ator: I hope I made you unforget. | 09:14.54 |
pedr0 | let me try again :-) Consider a 'simple' font, the codepoint is read from the stream, an encoding is then used to transform it to another value that can be used to index the table of glyphs. Is this description correct ? | 09:15.53 |
| That's my reading of the specs. The ToUnicodeMap - it's addressed in a similar manner, correct ? | 09:16.44 |
| ToUnicode table | 09:17.00 |
ator | in the content stream, there's an operator "Tj" which shows text | 09:17.24 |
| the input to the Tj operator is a byte string | 09:17.35 |
| for a simple font, the byte is used to lookup a glyph index (the raw number of the glyph in the font, no relation to anything) via the Encoding table | 09:18.23 |
| the byte can also be used to lookup a unicode character from the ToUnicode table, if such a table exists | 09:18.51 |
| now, the Encoding and ToUnicode tables can be created in a *lot* of different ways due to various arcane complexities of font files and the PDF format | 09:19.35 |
| some things are allowed to be unspecified in the PDF objects, in which case we create them from the embedded font file | 09:20.35 |
| in other cases, what's in the embedded font file is overridden by tables specified in the PDF objects | 09:20.54 |
| sometimes there's no information in either the font file or the PDF file -- usually the case when we get unsearchable text | 09:21.24 |
| if the Encoding table is Identity, the byte value in the stream is the number of the glyph in the font file | 09:21.47 |
| which may or may not have anything to do with any normal encoding like latin-1 or koi-8 or unicode | 09:22.13 |
| if in the same file, there is no ToUnicode table to map the byte to a unicode value, we're lost | 09:22.31 |
| oh. I lied before. the encoding information makes it to the fz_device interface, but it's *per string drawn* not for the font being passed | 09:24.57 |
| because of how XPS works | 09:25.02 |
| the text drawing operator in the XPS format gives you two arrays: one of the glyph indexes, and another parallel array of unicode characters | 09:25.37 |
| because text is complicated, and there's not necessarily a one-to-one mapping between font glyph and unicode character | 09:25.59 |
| like the 'fi' ligature which is one glyph representing two unicode characters | 09:26.29 |
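The Tj decoding for a simple font, as described above, might look something like this in C. It is a sketch under stated assumptions rather than MuPDF's implementation: `simple_font`, `simple_font_init_identity` and `show_text` are hypothetical names, U+FFFD is used here to mark "no Unicode known", and each byte maps to at most one code point (a real ToUnicode entry can map one code to several characters, as with the 'fi' ligature).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UNKNOWN_CHAR 0xFFFDu /* U+FFFD stands in for "no Unicode known" */

/* Hypothetical model of a simple (one byte per character) PDF font. */
typedef struct {
    uint16_t encoding[256];   /* byte -> glyph index in the font file */
    uint32_t to_unicode[256]; /* byte -> Unicode code point, 0 = no entry */
} simple_font;

/* Identity encoding: the byte is the glyph number. No ToUnicode entries. */
static void simple_font_init_identity(simple_font *font)
{
    assert(font != NULL);
    for (int i = 0; i < 256; i++) {
        font->encoding[i] = (uint16_t)i;
        font->to_unicode[i] = 0;
    }
}

/* Decode the byte string given to Tj into two parallel arrays, one
 * glyph index and one Unicode value per byte. Returns how many bytes
 * had no Unicode mapping (those characters are not searchable). */
static size_t show_text(const simple_font *font, const unsigned char *s, size_t n,
                        uint16_t *glyphs, uint32_t *unicode)
{
    size_t unknown = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t u = font->to_unicode[s[i]];
        glyphs[i] = font->encoding[s[i]];
        unicode[i] = u ? u : UNKNOWN_CHAR;
        if (!u)
            unknown++;
    }
    return unknown;
}
```

The two output arrays mirror the per-string glyph/Unicode pairing mentioned for XPS: the glyph indexes are always available, while the Unicode side is only as good as the ToUnicode table.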
pedr0 | Yes, thanks a lot again | 10:22.26 |
| I am thinking of changing approach - if I use a pdf_processor, and 'overrode | 10:25.32 |
| sorry sorry | 10:25.36 |
| enter was pressed by mistake | 10:25.43 |
| if I use a custom pdf_processor and I 'override' the op_T[Jj] function to execute an arbitrary piece of code and then delegating to the 'original' function, is there a way to translate the operands in pain text ? Of course provided that the font contains the relevant information. | 10:27.51 |
| *plain text | 10:28.49 |
| freudian slip :-) | 10:29.07 |
| I prefer the XPS format :-) | 10:36.00 |
sebras | ator: I need some guidance. you removed the existing separation bindings in 15a8c86192a9620e2dfe14945006c532440da82a | 10:43.33 |
| ator: but you left the java classes in. what should I make of that? | 10:43.54 |
| is it the case that we don't want bindings for this? | 10:44.19 |
ator | sebras: I would not worry about adding them for now. we may want to remove the java classes too, just in case someone thinks they work though! | 11:45.04 |
| pedr0: what are you trying to do? | 11:45.56 |
| (sorry, had internet problems) | 11:46.01 |
sebras | ator: ok. do you mind looking at sebras/master when you have time. I think I'm on the right track with the jni fixes. | 11:47.54 |
ator | pedr0: if you're trying to modify text in the PDF file, have a look at pdf_filter_page_contents and its callback API. | 11:49.50 |
| we use that function to apply redactions to text and images, for example | 11:50.09 |
| see source/pdf/pdf-clean.c for an example of how to use it | 11:50.17 |
| pdf_redact_page as a starting point | 11:50.55 |
sebras | maybe I should explain that dropping the underlying object in the finalizer but setting the pointer to NULL in destroy() means that we can end up in the finalizer() twice with the same pointer that we freed the first time. | 11:51.12 |
| hence I'm setting the pointer to NULL in the finalizer(), so the next time it's called, it will safely try to drop NULL. | 11:51.44 |
ator | sebras: hm, I was thinking the split classes would go under 'jni' and the top file stay where it was :) | 11:54.07 |
| to cause the minimum amount of trouble for the other build systems | 11:54.30 |
| from_PDFObject_safe_own is a bit misleading, IMO, given that it also zeroes the pointer | 11:57.09 |
| how can we end up in finalizer twice in that case? if the GC runs between destroy calling finalize and setting the native pointer to 0? | 12:01.07 |
| native destroyNative(long ptr); finalize() { long tmp=pointer; pointer=0; destroyNative(tmp); } destroy() { finalize(); } maybe works then, so we don't have to call SetLongField from C? | 12:02.44 |
| biab, going out for a walk | 12:03.34 |
Kabouik | Hi #mupdf. I have mupdf installed on my Solus computers as well as on a LXC container running Debian. I noticed in Debian, I can use arrow keys to navigate through pages while on Solus I can only use b and space. Are the keys customizable so that I can homogenize keybindings between my machines? | 13:00.48 |
| I suspect it's just some custom changes made by packagers on either distros? | 13:01.22 |
kiwi_66 | Hello, the following PDF is built to prevent users from selecting and copying text. I can zoom in/out so it's not an image, but is a real (vector) PDF. All it allows is "Select All" : | 13:15.31 |
| https://we.tl/t-UF4rgPRFpx | 13:15.35 |
| Why is that? Can mutool remove this restriction? Thank you. | 13:15.38 |
| FWIW, "qpdf.exe --decrypt" didn't help. | 13:19.08 |
kens | It's not an encrypted file, it simply has no text in it | 13:19.23 |
| Each character appears to be an image in its own right | 13:20.24 |
| Ah no, there are a number of images which depict one or more characters | 13:20.51 |
kiwi_66 | If I can zoom in without edges, can they still be images? https://postimg.cc/xN0rcSG7 | 13:28.53 |
pedr0 | ator: thanks I will have a look, I am trying to remove text from a page - therefore I need to perform a part of analysis on the PDF first to detect what I want to get rid of, I want to remove stuff thereafter | 13:29.05 |
kens | kiwi_66: they *are* images. Zooming in just proves that they are high resolution, and (probably) that the viewer does edge smoothing | 13:29.36 |
| Also, on that image, there are plenty of jagged edges | 13:29.56 |
kiwi_66 | Ok so it's impossible to copy the text, besides using an OCR? | 13:30.17 |
kens | OCR is going to be the only solution here. Apart from a very few characters there is no actual text in the document. There's a font on page 2 which has a ToUnicode CMap, but there are only 2 character codes, one maps to a space, and the other has no entry in the ToUnicode CMap. | 13:31.08 |
kiwi_66 | Too bad. Thank you! | 13:32.16 |
kens | You're welcome | 13:32.24 |
sebras | ator: it can happen if we call destroy() manually and then the JVM calls it again due to GC. | 13:35.29 |
ator | sebras: as a race condition while destroy() is running before it sets pointer=0, right? | 14:17.41 |
| so we could prevent that race by setting pointer=0 and passing the old pointer value to the native function | 14:19.56 |
| or would that just open up for another type of race? | 14:20.20 |
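The "zero the pointer first, then hand the old value to the native function" idea can be modelled in C11 as below. This is only an illustration of the pattern, not the JNI binding code: `handle`, `handle_destroy`, `destroy_native` and `drop_count` are hypothetical names. The sketch assumes the swap is done atomically; a plain read followed by a separate write would still let two callers both see the non-zero pointer. In Java, `AtomicLong.getAndSet` would play the role that `atomic_exchange` plays here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a Java object holding a native pointer. */
typedef struct {
    _Atomic uintptr_t pointer;
} handle;

/* Counts real drops, for illustration only. */
static atomic_int drop_count;

/* Stand-in for the native drop function; dropping 0 is a no-op. */
static void destroy_native(uintptr_t ptr)
{
    if (ptr == 0)
        return;
    free((void *)ptr);
    atomic_fetch_add(&drop_count, 1);
}

/* Called by both destroy() and the finalizer in this model. The swap
 * guarantees only one caller ever receives the non-zero pointer. */
static void handle_destroy(handle *h)
{
    assert(h != NULL);
    uintptr_t old = atomic_exchange(&h->pointer, (uintptr_t)0);
    destroy_native(old);
}
```

Calling `handle_destroy` a second time on the same handle swaps out 0 and drops nothing, which is the behaviour wanted when a manual destroy() is later followed by the GC-triggered finalizer.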
| Kabouik: could be using an older version of mupdf, or one is mupdf-x11 and the other is mupdf-gl | 14:21.00 |
| we brought the keybindings between the -x11 and -gl viewers in sync a release or two back | 14:21.24 |
Kabouik | I think it's gl on both but I will check version numbers next time | 14:21.34 |
ator | the keys are only customizable by editing the source and recompiling | 14:21.45 |
Kabouik | Understood, thanks | 14:21.55 |
| Just to clarify, does the latest release include up/down arrows to navigate or does it remove them? | 14:23.41 |
| I'm not sure which version is the latest | 14:23.54 |
ator | the arrows do not change pages in the newer versions, only pan within a zoomed page | 14:24.45 |
| page up / page down / b / space / , / . are the keybindings to change page | 14:25.05 |
| b and space use "smart move", where we jump to the next section in a zoomed in page | 14:25.52 |
Kabouik | Understood, so my Debian install must be older | 14:25.55 |
ator | page up/down (and comma/period) jump to the next page, preserving the current scroll offset within the page | 14:26.23 |
sebras | ator: alright, then the commit on mupdf:sebras/master does what you desire. | 16:29.17 |
| ator: what was the historic reasons for making the finalize() members protected and add the destroy() members? | 16:30.41 |
| is commit d7ece4132d6219ee10ba9ed85a9f2a052a6bb92c the origin? if so the motivation reads "We could do this in finalize, but there's no guarantee that a finalize will occur before the muPDF context occurs." | 16:45.35 |
Kabouik | Maybe that's a corner case ator, but there are multiple laptops with no real PgUp/PgDn keys, you have to use the Fn keys to do PgUp/PgDn, sometimes in combination with arrows in a non standard size (i.e., half-sized keys) | 16:59.47 |
| That's my case of course | 16:59.54 |
| I use b/space all the time because combinations with the small arrow keys is too error prone | 17:00.21 |
sebras | Kabouik: did you try , and . too? | 17:01.18 |
| I haven't read the entire conversation, but you never mentioned those two keys. | 17:01.33 |
Kabouik | Oh well, no I didn't sebras. Perfect. | 17:01.58 |
sebras | ator: let's try with a new set of commits on mupdf:sebras/master | 20:16.53 |
| LGTM "Fix warnings in jbig2 error callback type mismatch." | 20:17.14 |
ator | sebras: bah. I think I liked your first version better :) | 22:13.01 |
| just with the _own function renamed to something more appropriate. sorry to put you through the wringer like this. | 22:13.28 |