Log of #mupdf at irc.freenode.net.

20200528 
myopia just came across djvu shell extension pack in https://www.cuminas.jp/en/downloads . it seems the thing can render djvu inside windows photo viewer, allowing page flipping even. wondering if mupdf should do this someday for pdf files?00:06.04 
  https://www.tracker-software.com/shell_ext.html < here's one00:25.57 
  if you can make it a shell extension / WIC code windows folks no longer need a separate pdf viewer, lol00:26.34 
  apparently this tracker soft's one of the first with the right business attitude/outlook > https://www.tracker-software.com/company/news-press-events/view/230 < read this one. it's as if people in the occupied areas of Republic of China can actually purchase from them, since completing the transaction requires visa/mastercard01:00.57 
ator Zsolt: PNG also uses Deflate compression.08:53.24 
pedr0 hi all, any thoughts regarding the last message I've posted concerning the document writer interface/Pdf Output device ?08:57.30 
ator pedr0: that error message generally comes when the PDF file uses an embedded font without a proper 'cmap' table09:02.24 
  so we can't figure the encoding used09:03.29 
  as a result, the output PDF file won't be searchable and you can't copy text from it09:03.54 
pedr0 exactly09:04.01 
  but the same portion of text was searchable before my 'rewriting'09:04.27 
  meaning before I fed the document to the writer09:04.49 
  what I am trying to get to is if I have done something wrong or this is simply a limitation/edge case09:05.17 
ator the information that makes text searchable is lost between the layers of code. the stuff that goes to the fz_device (and thus the document writer) is the raw font data only, the PDF structures that define the mappings to searchable text have been lost at that point09:05.23 
pedr0 oks, there isn't much I can do to circumvent the problem then09:05.55 
ator now, if the font file used has an encoding the output will work. this is the case if you use system fonts or the 14 builtin core PDF fonts.09:06.17 
  if the input has a bunch of subset fonts where the PDF producer has stripped out everything but the exact glyphs needed, and all auxiliary info, in order to make the embedded font as small as possible09:06.53 
  then you're out of luck,09:06.59 
pedr0 I see09:07.04 
  I understand now09:07.12 
ator the problem is that the input PDF file has a bunch of tables and PDF objects that define the mapping from glyph number to searchable text09:07.55 
  but that is PDF interpreter specific, and is not exposed to the fz_device interface09:08.10 
pedr0 are you referring to the encoding ?09:08.24 
  property of the font ?09:08.32 
ator in a way, yes.09:08.38 
  encodings and fonts in PDF is a hairy chapter :)09:08.47 
pedr0 I think I get the gist of it, we don't 'carry the encoding with us' of those subset of fonts09:09.19 
ator but in short, an embedded font in a PDF is allowed to have NO encoding, or its encoding can be random gibberish09:09.27 
  but the PDF file can define other mappings on top of this "no/unknown/gibberish" encoding09:09.56 
  one of them is a ToUnicode, which is required to make text searchable09:10.04 
pedr0 thanks a lot ator.09:10.14 
ator *or* the embedded font file can have the usual truetype 'cmap' encoding table, or use Type1 font glyph names09:10.59 
  we can get searchable text by looking up the encoding of the glyphs that are drawn using either the ToUnicode table, or the truetype cmap table, or the type1 glyph names09:11.50 
  when going through the fz_device interface, we do not pass any ToUnicode table, so on the other side only the cmap table or glyph names are available09:12.26 
  I have considered changing the fz_font/fz_device interface a bit to allow the ToUnicode table to be passed along as well09:12.50 
  but no promises!09:12.57 
  sebras: there are no js_tofloatarray because I forgot :)09:14.29 
sebras ator: ah! :)09:14.39 
  ator: I hope I made you unforget.09:14.54 
pedr0 let me try again :-) Consider a 'simple' font, the codepoint is read from the stream, an encoding is then used to transform it to another value that can be used to index the table of glyphs. Is this description correct ?09:15.53 
  That's my reading of the specs. The ToUnicodeMap - it's addressed in a similar manner, correct ?09:16.44 
  ToUnicode table09:17.00 
ator in the content stream, there's an operator "Tj" which shows text09:17.24 
  the input to the Tj operator is a byte string09:17.35 
  for a simple font, the byte is used to lookup a glyph index (the raw number of the glyph in the font, no relation to anything) via the Encoding table09:18.23 
  the byte can also be used to lookup a unicode character from the ToUnicode table, if such a table exists09:18.51 
  now, the Encoding and ToUnicode tables can be created in a *lot* of different ways due to various arcane complexities of font files and the PDF format09:19.35 
  some things are allowed to be unspecified in the PDF objects, in which case we create them from the embedded font file09:20.35 
  in other cases, what's in the embedded font file is overridden by tables specified in the PDF objects09:20.54 
  sometimes there's no information in either the font file or the PDF file -- usually the case when we get unsearchable text09:21.24 
  if the Encoding table is Identity, the byte value in the stream is the number of the glyph in the font file09:21.47 
  which may or may not have anything to do with any normal encoding like latin-1 or koi-8 or unicode09:22.13 
  if in the same file, there is no ToUnicode table to map the byte to a unicode value, we're lost09:22.31 
  oh. I lied before. the encoding information makes it to the fz_device interface, but it's *per string drawn* not for the font being passed09:24.57 
  because of how XPS works09:25.02 
  the text drawing operator in the XPS format gives you two arrays: one of the glyph indexes, and another parallel array of unicode characters09:25.37 
  because text is complicated, and there's not necessarily a one-to-one mapping between font glyph and unicode character09:25.59 
  like the 'fi' ligature which is one glyph representing two unicode characters09:26.29 
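
A minimal sketch of the lookup ator describes for a simple font, written in Java purely for illustration. Every name here (SimpleFontLookup, ShownText, show, demo) is hypothetical and none of it is MuPDF API; the glyph numbers and table contents are made up. Each byte of the Tj operand is looked up once in an Encoding table to get a glyph index and once in an optional ToUnicode table to get text, and the result is kept as two parallel lists, similar in spirit to the XPS arrays mentioned above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not how MuPDF stores fonts internally.
class SimpleFontLookup {
    // byte code -> glyph index in the font file (Identity by default)
    final int[] encoding = new int[256];
    // byte code -> searchable text; entries may be missing entirely
    final Map<Integer, String> toUnicode = new HashMap<>();

    // Parallel lists: glyphs.get(i) was drawn, text.get(i) is what it means.
    static final class ShownText {
        final List<Integer> glyphs = new ArrayList<>();
        final List<String> text = new ArrayList<>(); // null = unknown
    }

    SimpleFontLookup() {
        // Identity encoding: the byte value is the glyph number.
        for (int code = 0; code < 256; code++) encoding[code] = code;
    }

    // Interpret the byte string given to a Tj-like operator.
    ShownText show(byte[] operand) {
        ShownText out = new ShownText();
        for (byte b : operand) {
            int code = b & 0xFF;                // bytes are unsigned in PDF
            out.glyphs.add(encoding[code]);     // always available
            out.text.add(toUnicode.get(code));  // null when no mapping exists
        }
        return out;
    }

    // Made-up sample font: code 1 is an 'fi' ligature, code 2 is 'x',
    // code 3 has a glyph but no ToUnicode entry (unsearchable).
    static SimpleFontLookup demo() {
        SimpleFontLookup f = new SimpleFontLookup();
        f.encoding[1] = 57;
        f.encoding[2] = 12;
        f.toUnicode.put(1, "fi"); // one glyph, two characters
        f.toUnicode.put(2, "x");
        return f;
    }
}
```

Calling SimpleFontLookup.demo().show(new byte[]{1, 2, 3}) yields glyphs 57, 12, 3 alongside the text "fi", "x", and null. The null entry is the "we're lost" case: the glyph can still be drawn, but nothing says which character it stands for.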
pedr0 Yes, thanks a lot again10:22.26 
  I am thinking of changing approach - if I use a pdf_processor, and 'overrode10:25.32 
  sorry sorry10:25.36 
  enter was pressed by mistake10:25.43 
  if I use a custom pdf_processor and I 'override' the op_T[Jj] function to execute an arbitrary piece of code and then delegating to the 'original' function, is there a way to translate the operands in pain text ? Of course provided that the font contains the relevant information.10:27.51 
  *plain text10:28.49 
  freudian slip :-)10:29.07 
  I prefer the XPS format :-)10:36.00 
sebras ator: I need some guidance. you removed the existing separation bindings in 15a8c86192a9620e2dfe14945006c532440da82a10:43.33 
  ator: but you left the java classes in. what should I make of that?10:43.54 
  is it the case that we don't want bindings for this?10:44.19 
ator sebras: I would not worry about adding them for now. we may want to remove the java classes too, just in case someone thinks they work though!11:45.04 
  pedr0: what are you trying to do?11:45.56 
  (sorry, had internet problems)11:46.01 
sebras ator: ok. do you mind looking at sebras/master when you have time. I think I'm on the right track with the jni fixes.11:47.54 
ator pedr0: if you're trying to modify text in the PDF file, have a look at pdf_filter_page_contents and its callback API.11:49.50 
  we use that function to apply redactions to text and images, for example11:50.09 
  see source/pdf/pdf-clean.c for an example of how to use it11:50.17 
  pdf_redact_page as a starting point11:50.55 
sebras maybe I should explain that dropping the underlying object in the finalizer but setting the pointer to NULL in destroy() means that we can end up in the finalizer() twice with the same pointer that we freed the first time.11:51.12 
  hence I'm setting the pointer to NULL in the finalizer(), so the next time it's called, it will safely try to drop NULL.11:51.44 
ator sebras: hm, I was thinking the split classes would go under 'jni' and the top file stay where it was :)11:54.07 
  to cause the minimum amount of trouble for the other build systems11:54.30 
  from_PDFObject_safe_own is a bit misleading, IMO, given that it also zeroes the pointer11:57.09 
  how can we end up in finalizer twice in that case? if the GC runs between destroy calling finalize and setting the native pointer to 0?12:01.07 
  native destroyNative(long ptr); finalize() { long tmp=pointer; pointer=0; destroyNative(tmp); } destroy() { finalize(); } maybe works then, so we don't have to call SetLongField from C?12:02.44 
  biab, going out for a walk12:03.34 
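
A rough, self-contained sketch of the swap-then-free idea from ator's one-liner, so that a manual destroy() followed by a GC-triggered finalize() cannot free the same pointer twice. The class name NativeHandle and the counter are hypothetical, and destroyNative here is a plain Java stand-in for the real JNI call. Using AtomicLong.getAndSet for the swap is an assumption added in this sketch (the one-liner uses a plain long field); it is one possible way to make the "read old value, store 0" step atomic, not necessarily what the bindings ended up doing.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical wrapper around a native pointer; not the actual MuPDF binding.
class NativeHandle {
    // Stand-in for the native side: counts how many times memory is "freed".
    static final AtomicInteger freeCount = new AtomicInteger();

    // In the real bindings this would be a native method implemented in C.
    static void destroyNative(long ptr) {
        if (ptr != 0) freeCount.incrementAndGet();
    }

    // Assumption: AtomicLong instead of a plain long field.
    private final AtomicLong pointer;

    NativeHandle(long ptr) {
        pointer = new AtomicLong(ptr);
    }

    // Explicit release; safe to call any number of times.
    public void destroy() {
        long old = pointer.getAndSet(0); // claim the pointer atomically
        if (old != 0) destroyNative(old); // only the claimant frees it
    }

    // The GC path funnels into the same idempotent release.
    @Override
    protected void finalize() {
        destroy();
    }
}
```

With this shape the C side never has to write back into the Java field, which is the SetLongField call ator hoped to avoid. Whichever caller swaps out the non-zero value is the only one that reaches destroyNative.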
Kabouik Hi #mupdf. I have mupdf installed on my Solus computers as well as on a LXC container running Debian. I noticed in Debian, I can use arrow keys to navigate through pages while on Solus I can only use b and space. Are the keys customizable so that I can homogenize keybindings between my machines?13:00.48 
  I suspect it's just some custom changes made by packagers on either distros?13:01.22 
kiwi_66 Hello, the following PDF is built to prevent users from selecting and copying text. I can zoom in/out so it's not an image, but is a real (vector) PDF. All it allows is "Select All" :13:15.31 
  https://we.tl/t-UF4rgPRFpx13:15.35 
  Why is that? Can mutool remove this restriction? Thank you.13:15.38 
  FWIW, "qpdf.exe --decrypt" didn't help.13:19.08 
kens It's not an encrypted file, it simply has no text in it13:19.23 
  Each character appears to be an image in its own right13:20.24 
  Ah no, there are a number of images which depict one or more characters13:20.51 
kiwi_66 If I can zoom in without edges, can they still be images? https://postimg.cc/xN0rcSG713:28.53 
pedr0 ator: thanks I will have a look, I am trying to remove text from a page - therefore I need to perform a part of analysis on the PDF first to detect what I want to get rid of, I want to remove stuff thereafter13:29.05 
kens kiwi_66: they *are* images. Zooming in just proves that they are high resolution, and (probably) that the viewer does edge smoothing13:29.36 
  Also, on that image, there are plenty of jagged edges13:29.56 
kiwi_66 Ok so it's impossible to copy the text, besides using an OCR?13:30.17 
kens OCR is going to be the only solution here. Apart from a very few characters there is no actual text in the document. There's a font on page 2 which has a ToUnicode CMap, but there are only 2 character codes, one maps to a space, and the other has no entry in the ToUnicode CMap.13:31.08 
kiwi_66 Too bad. Thank you!13:32.16 
kens You're welcome13:32.24 
sebras ator: it can happen if we call destroy() manually and then the JVM calls it again due to GC.13:35.29 
ator sebras: as a race condition while destroy() is running before it sets pointer=0 right?14:17.41 
  so we could prevent that race by setting pointer=0 and passing the old pointer value to the native function14:19.56 
  or would that just open up for another type of race?14:20.20 
  Kabouik: could be using an older version of mupdf, or one is mupdf-x11 and the other is mupdf-gl14:21.00 
  we brought the keybindings between the -x11 and -gl viewers in sync a release or two back14:21.24 
Kabouik I think it's gl on both but I will check version numbers next time14:21.34 
ator the keys are only customizable by editing the source and recompiling14:21.45 
Kabouik Understood, thanks14:21.55 
  Just to clarify, does the latest release include up/down arrows to navigate or does it remove them?14:23.41 
  I'm not sure which version is the latest14:23.54 
ator the arrows do not change pages in the newer versions, only pan within a zoomed page14:24.45 
  page up / page down / b / space / , / . are the keybindings to change page14:25.05 
  b and space use "smart move", where we jump to the next section in a zoomed in page14:25.52 
Kabouik Understood, so my Debian install must be older14:25.55 
ator page up/down (and comma/period) jump to the next page, preserving the current scroll offset within the page14:26.23 
sebras ator: alright, then the commit on mupdf:sebras/master does what you desire.16:29.17 
  ator: what was the historic reasons for making the finalize() members protected and add the destroy() members?16:30.41 
  is commit d7ece4132d6219ee10ba9ed85a9f2a052a6bb92c the origin? if so the motivation reads "We could do this in finalize, but there's no guarantee that a finalize will occur before the muPDF context occurs."16:45.35 
Kabouik Maybe that's a corner case ator, but there are multiple laptops with no real PgUp/PgDn keys, you have to use the Fn keys to do PgUp/PgDn, sometimes in combination with arrows in a non standard size (i.e., half-sized keys)16:59.47 
  That's my case of course16:59.54 
  I use b/space all the time because combinations with the small arrow keys is too error prone17:00.21 
sebras Kabouik: did you try , and . too?17:01.18 
  I haven't read the entire conversation, but you never mentioned those two keys.17:01.33 
Kabouik Oh well, no I didn't sebras. Perfect.17:01.58 
sebras ator: let's try with a new set of commits on mupdf:sebras/master20:16.53 
  LGTM "Fix warnings in jbig2 error callback type mismatch."20:17.14 
ator sebras: bah. I think I liked your first version better :)22:13.01 
  just with the _own function renamed to something more appropriate. sorry to put you through the wringer like this.22:13.28 