MuPDF IRC logs

	20200528
myopia	just came across djvu shell extension pack in https://www.cuminas.jp/en/downloads . it seems the thing can render djvu inside windows photo viewer, allowing page flipping even. wondering if mupdf should do this someday for pdf files?		00:06.04
	https://www.tracker-software.com/shell_ext.html < here's one		00:25.57
	if you can make it a shell extension / WIC code windows folks no longer need a separate pdf viewer, lol		00:26.34
	apparently this tracker soft's one of the first with the right business attitude/outlook > https://www.tracker-software.com/company/news-press-events/view/230 < read this one. it's as if people in the occupied areas of Republic of China can actually purchase from them, since completing the transaction requires visa/mastercard		01:00.57
ator	Zsolt: PNG also uses Deflate compression.		08:53.24
pedr0	hi all, any thoughts regarding the last message I've posted concerning the document writer interface/Pdf Output device ?		08:57.30
ator	pedr0: that error message generally comes when the PDF file uses an embedded font without a proper 'cmap' table		09:02.24
	so we can't figure the encoding used		09:03.29
	as a result, the output PDF file won't be searchable and you can't copy text from it		09:03.54
pedr0	exactly		09:04.01
	but the same portion of text was searchable before my 'rewriting'		09:04.27
	meaning before I fed the document to the writer		09:04.49
	what I am trying to get to is if I have done something wrong or this is simply a limitation/edge case		09:05.17
ator	the information that makes text searchable is lost between the layers of code. the stuff that goes to the fz_device (and thus the document writer) is the raw font data only, the PDF structures that define the mappings to searchable text have been lost at that point		09:05.23
pedr0	oks, there isn't much I can do to circumvent the problem then		09:05.55
ator	now, if the font file used has an encoding the output will work. this is the case if you use system fonts or the 14 builtin core PDF fonts.		09:06.17
	if the input has a bunch of subset fonts where the PDF producer has stripped out everything but the exact glyphs needed, and all auxiliary info, in order to make the embedded font as small as possible		09:06.53
	then you're out of luck,		09:06.59
pedr0	I see		09:07.04
	I understand now		09:07.12
ator	the problem is that the input PDF file has a bunch of tables and PDF objects that define the mapping from glyph number to searchable text		09:07.55
	but that is PDF interpreter specific, and is not exposed to the fz_device interface		09:08.10
pedr0	are you referring to the encoding ?		09:08.24
	property of the font ?		09:08.32
ator	in a way, yes.		09:08.38
	encodings and fonts in PDF are a hairy chapter :)		09:08.47
pedr0	I think I get the gist of it, we don't 'carry the encoding with us' for those subset fonts		09:09.19
ator	but in short, an embedded font in a PDF is allowed to have NO encoding, or its encoding can be random gibberish		09:09.27
	but the PDF file can define other mappings on top of this "no/unknown/gibberish" encoding		09:09.56
	one of them is a ToUnicode, which is required to make text searchable		09:10.04
pedr0	thanks a lot ator.		09:10.14
ator	or the embedded font file can have the usual truetype 'cmap' encoding table, or use Type1 font glyph names		09:10.59
	we can get searchable text by looking up the encoding of the glyphs that are drawn using either the ToUnicode table, or the truetype cmap table, or the type1 glyph names		09:11.50
	when going through the fz_device interface, we do not pass any ToUnicode table, so on the other side only the cmap table or glyph names are available		09:12.26
	I have considered changing the fz_font/fz_device interface a bit to allow the ToUnicode table to be passed along as well		09:12.50
	but no promises!		09:12.57
	sebras: there are no js_tofloatarray because I forgot :)		09:14.29
sebras	ator: ah! :)		09:14.39
	ator: I hope I made you unforget.		09:14.54
pedr0	let me try again :-) Consider a 'simple' font: the codepoint is read from the stream, and an encoding is then used to transform it to another value that can be used to index the table of glyphs. Is this description correct?		09:15.53
	That's my reading of the specs. The ToUnicodeMap - it's addressed in a similar manner, correct ?		09:16.44
	ToUnicode table		09:17.00
ator	in the content stream, there's an operator "Tj" which shows text		09:17.24
	the input to the Tj operator is a byte string		09:17.35
	for a simple font, the byte is used to lookup a glyph index (the raw number of the glyph in the font, no relation to anything) via the Encoding table		09:18.23
	the byte can also be used to lookup a unicode character from the ToUnicode table, if such a table exists		09:18.51
	now, the Encoding and ToUnicode tables can be created in a lot of different ways due to various arcane complexities of font files and the PDF format		09:19.35
	some things are allowed to be unspecified in the PDF objects, in which case we create them from the embedded font file		09:20.35
	in other cases, what's in the embedded font file is overridden by tables specified in the PDF objects		09:20.54
	sometimes there's no information in either the font file or the PDF file -- usually the case when we get unsearchable text		09:21.24
	if the Encoding table is Identity, the byte value in the stream is the number of the glyph in the font file		09:21.47
	which may or may not have anything to do with any normal encoding like latin-1 or koi-8 or unicode		09:22.13
	if in the same file, there is no ToUnicode table to map the byte to a unicode value, we're lost		09:22.31
	oh. I lied before. the encoding information makes it to the fz_device interface, but it's per string drawn not for the font being passed		09:24.57
	because of how XPS works		09:25.02
	the text drawing operator in the XPS format gives you two arrays: one of the glyph indexes, and another parallel array of unicode characters		09:25.37
	because text is complicated, and there's not necessarily a one-to-one mapping between font glyph and unicode character		09:25.59
	like the 'fi' ligature which is one glyph representing two unicode characters		09:26.29
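The lookup ator describes above can be sketched roughly as follows. This is a toy model in Java, not MuPDF code: the class name, its fields, the sample table entries, and the use of U+FFFD for "unknown" are all assumptions made for illustration. It decodes the bytes of a Tj string for a simple font into two parallel lists, glyph indexes and Unicode strings, trying the ToUnicode table first and falling back to whatever the font file itself provides.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of simple-font text decoding; every name here is hypothetical.
final class SimpleFontSketch {
    // Encoding table: byte value -> glyph index. Starts out as Identity.
    final int[] encoding = new int[256];
    // Optional ToUnicode table from the PDF objects: byte value -> text.
    final Map<Integer, String> toUnicode = new HashMap<>();
    // What the font file itself knows (cmap or glyph names): glyph index -> text.
    // For a heavily stripped subset font this may be empty.
    final Map<Integer, String> fontCmap = new HashMap<>();

    // Parallel output lists, one entry per glyph drawn.
    final List<Integer> glyphs = new ArrayList<>();
    final List<String> text = new ArrayList<>();

    SimpleFontSketch() {
        for (int i = 0; i < encoding.length; i++) {
            encoding[i] = i; // Identity: the byte is the glyph number
        }
    }

    // Decode the byte string operand of a Tj operator.
    void showText(byte[] tj) {
        for (byte b : tj) {
            int code = b & 0xFF;
            int gid = encoding[code];
            String u = toUnicode.get(code);        // 1. ToUnicode, if present
            if (u == null) u = fontCmap.get(gid);  // 2. info inside the font file
            if (u == null) u = "\uFFFD";           // 3. nothing known: unsearchable
            glyphs.add(gid);
            text.add(u);
        }
    }

    public static void main(String[] args) {
        SimpleFontSketch font = new SimpleFontSketch();
        font.encoding[0x01] = 57;        // made-up glyph index for an 'fi' ligature
        font.toUnicode.put(0x01, "fi");  // one glyph, two Unicode characters
        font.showText(new byte[] {0x01, 0x02});
        System.out.println(font.glyphs + " " + font.text);
    }
}
```

In this sketch a consumer that only receives the font file would be limited to step 2, which is why a subset font with no usable cmap or glyph names ends up as unsearchable text.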
pedr0	Yes, thanks a lot again		10:22.26
	I am thinking of changing approach - if I use a pdf_processor, and 'overrode		10:25.32
	sorry sorry		10:25.36
	enter was pressed by mistake		10:25.43
	if I use a custom pdf_processor and I 'override' the op_T[Jj] function to execute an arbitrary piece of code and then delegating to the 'original' function, is there a way to translate the operands in pain text ? Of course provided that the font contains the relevant information.		10:27.51
	*plain text		10:28.49
	freudian slip :-)		10:29.07
	I prefer the XPS format :-)		10:36.00
sebras	ator: I need some guidance. you removed the existing separation bindings in 15a8c86192a9620e2dfe14945006c532440da82a		10:43.33
	ator: but you left the java classes in. what should I make of that?		10:43.54
	is it the case that we don't want bindings for this?		10:44.19
ator	sebras: I would not worry about adding them for now. we may want to remove the java classes too, just in case someone thinks they work though!		11:45.04
	pedr0: what are you trying to do?		11:45.56
	(sorry, had internet problems)		11:46.01
sebras	ator: ok. do you mind looking at sebras/master when you have time. I think I'm on the right track with the jni fixes.		11:47.54
ator	pedr0: if you're trying to modify text in the PDF file, have a look at pdf_filter_page_contents and its callback API.		11:49.50
	we use that function to apply redactions to text and images, for example		11:50.09
	see source/pdf/pdf-clean.c for an example of how to use it		11:50.17
	pdf_redact_page as a starting point		11:50.55
sebras	maybe I should explain that dropping the underlying object in the finalizer but setting the pointer to NULL in destroy() means that we can end up in the finalizer() twice with the same pointer that we freed the first time.		11:51.12
	hence I'm setting the pointer to NULL in the finalizer(), so the next time it's called, it will safely try to drop NULL.		11:51.44
ator	sebras: hm, I was thinking the split classes would go under 'jni' and the top file stay where it was :)		11:54.07
	to cause the minimum amount of trouble for the other build systems		11:54.30
	from_PDFObject_safe_own is a bit misleading, IMO, given that it also zeroes the pointer		11:57.09
	how can we end up in finalizer twice in that case? if the GC runs between destroy calling finalize and setting the native pointer to 0?		12:01.07
	native destroyNative(long ptr); finalize() { long tmp=pointer; pointer=0; destroyNative(tmp); } destroy() { finalize(); } maybe works then, so we don't have to call SetLongField from C?		12:02.44
	biab, going out for a walk		12:03.34
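The pattern ator proposes at 12:02.44 can be sketched as below. This is a minimal, hypothetical model rather than the actual MuPDF binding: destroyNative is stubbed in plain Java so the example runs without JNI, the class name and the drop counter are invented for the demonstration, and the synchronized keyword is an added assumption to make the read-and-clear step atomic. Both destroy() and finalize() go through one shared helper here instead of destroy() calling finalize() directly.

```java
// Hypothetical stand-in for a JNI-backed wrapper; not the real MuPDF class.
class NativeHandleSketch {
    static int drops = 0;   // counts real frees, for demonstration only
    private long pointer;   // would hold the address of the native object

    NativeHandleSketch(long pointer) {
        this.pointer = pointer;
    }

    // Stub for what would be: private native void destroyNative(long ptr);
    private static void destroyNative(long ptr) {
        if (ptr != 0) {
            drops++;        // dropping a null pointer is a harmless no-op
        }
    }

    // Clear the field first, then free the saved value, so a second call
    // (manual destroy() followed by the GC finalizer) only ever sees 0.
    private synchronized void release() {
        long tmp = pointer;
        pointer = 0;
        destroyNative(tmp);
    }

    public void destroy() {
        release();
    }

    @SuppressWarnings({"deprecation", "removal"})
    @Override
    protected void finalize() {
        release();
    }

    public static void main(String[] args) {
        NativeHandleSketch h = new NativeHandleSketch(0x1234L);
        h.destroy();
        h.destroy(); // second call is safe: the pointer is already 0
        System.out.println("drops = " + drops);
    }
}
```

Because the Java side zeroes the field before the native call, the C side would not need to call SetLongField in this arrangement.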
Kabouik	Hi #mupdf. I have mupdf installed on my Solus computers as well as on a LXC container running Debian. I noticed in Debian, I can use arrow keys to navigate through pages while on Solus I can only use b and space. Are the keys customizable so that I can homogenize keybindings between my machines?		13:00.48
	I suspect it's just some custom changes made by packagers on either distro?		13:01.22
kiwi_66	Hello, the following PDF is built to prevent users from selecting and copying text. I can zoom in/out so it's not an image, but is a real (vector) PDF. All it allows is "Select All" :		13:15.31
	https://we.tl/t-UF4rgPRFpx		13:15.35
	Why is that? Can mutool remove this restriction? Thank you.		13:15.38
	FWIW, "qpdf.exe --decrypt" didn't help.		13:19.08
kens	It's not an encrypted file, it simply has no text in it		13:19.23
	Each character appears to be an image in its own right		13:20.24
	Ah no, there are a number of images which depict one or more characters		13:20.51
kiwi_66	If I can zoom in without edges, can they still be images? https://postimg.cc/xN0rcSG7		13:28.53
pedr0	ator: thanks, I will have a look. I am trying to remove text from a page - therefore I need to perform some analysis on the PDF first to detect what I want to get rid of, and then remove it		13:29.05
kens	kiwi_66: they are images. Zooming in just proves that they are high resolution, and (probably) that the viewer does edge smoothing		13:29.36
	Also, on that image, there are plenty of jagged edges		13:29.56
kiwi_66	Ok so it's impossible to copy the text, besides using an OCR?		13:30.17
kens	OCR is going to be the only solution here. Apart from a very few characters there is no actual text in the document. There's a font on page 2 which has a ToUnicode CMap, but there are only 2 character codes, one maps to a space, and the other has no entry in the ToUnicode CMap.		13:31.08
kiwi_66	Too bad. Thank you!		13:32.16
kens	You're welcome		13:32.24
sebras	ator: it can happen if we call destroy() manually and then the JVM calls it again due to GC.		13:35.29
ator	sebras: as a race condition while destroy() is running, before it sets pointer=0, right?		14:17.41
	so we could prevent that race by setting pointer=0 and passing the old pointer value to the native function		14:19.56
	or would that just open up for another type of race?		14:20.20
	Kabouik: could be using an older version of mupdf, or one is mupdf-x11 and the other is mupdf-gl		14:21.00
	we brought the keybindings between the -x11 and -gl viewers in sync a release or two back		14:21.24
Kabouik	I think it's gl on both but I will check version numbers next time		14:21.34
ator	the keys are only customizable by editing the source and recompiling		14:21.45
Kabouik	Understood, thanks		14:21.55
	Just to clarify, does the latest release include up/down arrows to navigate or does it remove them?		14:23.41
	I'm not sure which version is the latest		14:23.54
ator	the arrows do not change pages in the newer versions, only pan within a zoomed page		14:24.45
	page up / page down / b / space / , / . are the keybindings to change page		14:25.05
	b and space use "smart move", where we jump to the next section in a zoomed in page		14:25.52
Kabouik	Understood, so my Debian install must be older		14:25.55
ator	page up/down (and comma/period) jump to the next page, preserving the current scroll offset within the page		14:26.23
sebras	ator: alright, then the commit on mupdf:sebras/master does what you desire.		16:29.17
	ator: what were the historic reasons for making the finalize() members protected and adding the destroy() members?		16:30.41
	is commit d7ece4132d6219ee10ba9ed85a9f2a052a6bb92c the origin? if so the motivation reads "We could do this in finalize, but there's no guarantee that a finalize will occur before the muPDF context occurs."		16:45.35
Kabouik	Maybe that's a corner case ator, but there are multiple laptops with no real PgUp/PgDn keys, you have to use the Fn keys to do PgUp/PgDn, sometimes in combination with arrows in a non-standard size (i.e., half-sized keys)		16:59.47
	That's my case of course		16:59.54
	I use b/space all the time because combinations with the small arrow keys are too error prone		17:00.21
sebras	Kabouik: did you try , and . too?		17:01.18
	I haven't read the entire conversation, but you never mentioned those two keys.		17:01.33
Kabouik	Oh well, no I didn't sebras. Perfect.		17:01.58
sebras	ator: let's try with a new set of commits on mupdf:sebras/master		20:16.53
	LGTM "Fix warnings in jbig2 error callback type mismatch."		20:17.14
ator	sebras: bah. I think I liked your first version better :)		22:13.01
	just with the _own function renamed to something more appropriate. sorry to put you through the wringer like this.		22:13.28

Log of #mupdf at irc.freenode.net.