MuPDF IRC logs

	20190611
pietrop	hi everybody, still here banging my head :-)		08:44.00
	is it possible to use mutool to compress a pdf previously obtained through "mutool clean -asdifggg" ?		08:44.52
ator	pietrop: mutool clean -z (as in zip)		08:46.45
	pietrop: mutool clean -diz is probably the best combination of flags		08:47.32
	decompress, then zip, to get rid of the 'ascii' filter		08:47.48
pietrop	oh sorry, that was in the documentation		08:52.30
	I did not spot it		08:52.35
	did you mean -daz ?		08:53.26
ator	no, I meant 'i' to not decompress images (or you'll lose JPEG compression on JPEG images)		08:53.59
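The flag combination discussed above can be wrapped in a small script. This is a minimal sketch, assuming `mutool` is on the PATH; the function names `build_clean_command` and `recompress` are hypothetical, and the flag meanings in the comments are taken from the conversation rather than from the mutool documentation.

```python
import shutil
import subprocess


def build_clean_command(src, dst):
    """Return the argv for `mutool clean -diz src dst`."""
    # Per the log: -d decompresses streams, -i leaves image streams alone
    # (so JPEG data stays JPEG), -z deflates ("zips") the result.
    return ["mutool", "clean", "-diz", src, dst]


def recompress(src, dst):
    """Run the command if mutool is installed; return True on success."""
    if shutil.which("mutool") is None:
        return False  # mutool not found on PATH, so nothing is run
    result = subprocess.run(build_clean_command(src, dst))
    return result.returncode == 0
```

Calling `recompress("in.pdf", "out.pdf")` would then rewrite a file previously produced with the ASCII filter; the file names are placeholders.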
pulsarpietro	Hopefully I am not asking a silly question, I am learning a bit about text drawings as some of the watermarks I want to remove are not made of images unfortunately. I frequently came across the TJ/Tj instructions and very often the text isn't plain text, but it is encoded somehow. Is it possible to mutool-ize it to be plain text ? I was also wondering how such text is obtained - is the font's charset		10:51.03
	?		10:51.09
kens	what do you mean by 'plain text' ?		10:51.31
	For example, Chinese isn't going to come out as ASCII no matter what you do		10:51.48
	Beyond that, fonts may use a custom Encoding, which means that the character code bears no relation to the glyph drawn.		10:52.28
	Often (but it's optional) a font may include a ToUnicode CMap which maps the character codes to Unicode code points.		10:52.57
	I believe that MuPDF already uses this to create text output		10:53.17
	I'm not sure about exactly how MuPDF does its text extraction, but I imagine it follows similar rules to Ghostscript: if there is a ToUnicode CMap, use it, it's the most reliable information		10:54.19
	In the absence of a ToUnicode CMap, if the font is TrueType, try to use the cmap subtable from the font. If the font is PostScript then try to use the glyph name from the Encoding or CharStrings dictionary. If it's a recognised name, translate it to Unicode; if it's a Unicode-style name, strip off the prefix and use the numeric portion as the Unicode code point.		10:55.31
	If all else fails, treat it as ASCII.		10:55.42
	Oh I forgot, if it's a CIDFont, then you can (sometimes) use the CMap associated with the CIDFont to give you the Unicode code points.		10:56.13
	I'm afraid that in the absence of a ToUnicode CMap it's all guesswork		10:56.38
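The fallback order kens describes can be sketched roughly as follows. This is an illustration only, not how MuPDF or Ghostscript implement it: `FontInfo`, `KNOWN_GLYPH_NAMES`, and both functions are hypothetical, the TrueType table is simplified to a direct code-to-text dictionary, the `uniXXXX` prefix is an assumed naming convention for Unicode-style glyph names, and the CIDFont case is left out.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class FontInfo:
    # Hypothetical container; real PDF libraries expose this differently.
    to_unicode: Dict[int, str] = field(default_factory=dict)     # ToUnicode CMap
    truetype_cmap: Dict[int, str] = field(default_factory=dict)  # simplified cmap lookup
    glyph_names: Dict[int, str] = field(default_factory=dict)    # code -> glyph name


# Tiny stand-in for a real glyph-name list.
KNOWN_GLYPH_NAMES = {"space": " ", "A": "A", "eacute": "\u00e9"}


def glyph_name_to_unicode(name: str) -> Optional[str]:
    if name in KNOWN_GLYPH_NAMES:
        return KNOWN_GLYPH_NAMES[name]
    # Assumed convention: "uni" followed by four hex digits.
    if name.startswith("uni") and len(name) == 7:
        try:
            return chr(int(name[3:], 16))
        except ValueError:
            return None
    return None


def code_to_unicode(font: FontInfo, code: int) -> str:
    if code in font.to_unicode:          # 1. ToUnicode CMap, most reliable
        return font.to_unicode[code]
    if code in font.truetype_cmap:       # 2. table from the TrueType font
        return font.truetype_cmap[code]
    name = font.glyph_names.get(code)    # 3. glyph name, if recognised
    if name is not None:
        text = glyph_name_to_unicode(name)
        if text is not None:
            return text
    # 4. last resort: treat the code as ASCII, else a replacement character
    return chr(code) if code < 128 else "\ufffd"
```

Every step after the first is a guess about what the producer intended, which is why the result without a ToUnicode CMap can be wrong even when the page renders correctly.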
ator	pulsarpietro: the arguments to the text showing operators (TJ, Tj, ', ") are just bytes, interpreted according to the font resource's Encoding entry. This does not have to correspond to any actual text encoding, it can be any arbitrary bytes that happen to produce human readable text using the embedded font.		11:00.31
pulsarpietro	I meant something I can read and recognize from the "rendered version of PDF" - they are all in English and we do render them all with pdftotext - therefore I am assuming such mapping (about which I am not very knowledgeable) is present.		11:01.16
ator	pulsarpietro: there is a "structured text device" that extracts human readable text, if available		11:01.43
kens	As ator said, not really, no. There might be a ToUnicode CMap, but there might not.		11:01.43
ator	mutool draw -Ftext		11:01.50
	this is not guaranteed to work, but on well formed files it will generally work.		11:02.14
	it all depends on the input. garbage in, garbage out, that whole deal.		11:02.28
pulsarpietro	eheh		11:02.34
kens	For example, a subset font will often be created where character code 1 is applied to the first character, character code 2 to the second and so on. So "Hello World" would become <0102030304050604070308>		11:03.04
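The subset-font numbering scheme is easy to reproduce. The sketch below assigns codes in order of first appearance and builds the reverse table that a ToUnicode CMap would carry; the function names are made up for illustration, and it assumes fewer than 256 distinct characters so each code fits in one byte. "Hello World" has eleven characters, so the hex string has eleven byte codes.

```python
def subset_encode(text):
    """Assign codes 1, 2, 3, ... to characters in order of first use."""
    codes = {}
    encoded = []
    for ch in text:
        if ch not in codes:
            codes[ch] = len(codes) + 1
        encoded.append(codes[ch])
    # The reverse mapping plays the role of a ToUnicode CMap.
    to_unicode = {code: ch for ch, code in codes.items()}
    return bytes(encoded), to_unicode


def decode(data, to_unicode):
    """Recover the text; only possible when the reverse table is available."""
    return "".join(to_unicode[b] for b in data)


data, table = subset_encode("Hello World")
hex_string = "<" + data.hex() + ">"
print(hex_string)
print(decode(data, table))
```

Without `table`, the bytes in `data` carry no hint of the original text, which is the situation a text extractor faces when the ToUnicode CMap is missing.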
ator	in a well made PDF, the fonts all use sensible encodings and have ToUnicode CMaps to allow software to convert the text drawing operations into unicode text.		11:03.27
kens	NB you often can't create a ToUnicode CMap or a 'sensible' encoding if the PDF file was created from PostScript :-)		11:04.08
ator	but as kens said, it's nowhere near a guarantee that this is the case. plenty of PDF creators never bother to write the information needed to get unicode text back out.		11:04.15
pulsarpietro	mmm... I need to do some more reading. But I can use pdftotext and it works fine, which means the text comes out correctly; doesn't that mean that a ToUnicode CMap is associated with the font?		11:09.49
kens	It might, or it might mean the encoding makes sense		11:10.08
pulsarpietro	oks		11:10.19
kens	mutool can extract text I think		11:10.27
pulsarpietro	I am experimenting with mutool draw -Ftext. Is it possible to preserve the layout?		11:25.04
paulgardiner	What's the magic incantation to make mupdf-gl save the document?		11:25.16
	A perusal of do_app and do_help within gl-main.c has so far not helped.		11:31.07
kens	Hmm, sorry I don't know the answers to these questions. AFAIK mupdf doesn't maintain the text layout when extracting, though it does write the text position information when outputting 'XML'. It's hard to recreate a rich text layout when emitting ASCII; Ghostscript makes an attempt with the txtwrite device, but it's not terribly reliable.		11:49.40
ator	paulgardiner: 'a' to open annotations and there's a button in the lower right		11:54.59
	paulgardiner: I probably should add a menu bar or something ... but sebras would object!		11:55.34
	to wasting screen space		11:55.41
sebras	ator: I would indeed. :)		12:07.22
paulgardiner	ator: ah yes. Thanks. That did it.		12:20.54
pulsarpietro	@kens: thanks		12:53.09
kens	NP		12:53.17
paulgardiner	Robin_Watts: some time ago, you pointed out a bug in the signature support code. There's a fix for the problem on my master branch, I believe.		12:57.40
Robin_Watts	looking.		12:58.03
	sizeof("STRING") == strlen("STRING") + 1 ?		12:59.34
paulgardiner	Great. Ta		12:59.36
Robin_Watts	"STRING" is a char *, or an array of chars ?		13:00.10
	I guess it's the latter, but can be seen as the former when required.		13:00.28
paulgardiner	I'd never used that before and had to google it.		13:01.33
Robin_Watts	ok, as long as one of us has :)		13:01.58
	You're assuming that the data is there as "/ByteRange" rather than /B%34teRange etc.		13:02.32
	but I guess we're writing the dictionary you're searching for ?		13:02.51
paulgardiner	Yes, I believe we write it.		13:03.03
Robin_Watts	Then lgtm.		13:03.10
paulgardiner	Thank you		13:04.42
	cluster seems happy too.		13:05.14
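Two points from the review above can be checked in a few lines. PDF name syntax allows any character of a name to be written as `#` followed by two hex digits, so a plain byte search for `/ByteRange` would miss an escaped spelling unless the name is decoded first. The sketch below is not the MuPDF code; `decode_pdf_name` is a hypothetical helper, and `ctypes` is used only to mirror C's `sizeof` on a string literal.

```python
import ctypes
import re


def decode_pdf_name(raw: bytes) -> bytes:
    """Expand #xx hex escapes in a PDF name token."""
    return re.sub(
        rb"#([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        raw,
    )


escaped = b"/B#79teRange"          # 0x79 is 'y'
plain = decode_pdf_name(escaped)   # b"/ByteRange"

# A C string literal is an array that includes the trailing NUL, so
# sizeof("STRING") is strlen("STRING") + 1. create_string_buffer
# allocates the same layout.
literal = ctypes.create_string_buffer(b"STRING")
size_with_nul = ctypes.sizeof(literal)
```

When the same program writes the dictionary it later searches, as in this case, the escaped form never appears and the direct search is safe.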
dzonerzy	Ehy guys could you check this please? https://github.com/ccxvii/mujs/issues/105		19:51.00
sebras	dzonerzy: let me try to reproduce.		20:53.16
	dzonerzy: what version of mujs are you running?		20:57.24
dzonerzy	master from git		20:59.29
	to reproduce just write		20:59.38
sebras	dzonerzy: e2b59201d5049a8ad509b280e729a871756abc99 ?		20:59.43
dzonerzy	/.(?:)/		21:00.14
	this would compile as regex and trigger the bug		21:00.25
	sure e2b59201d5049a8ad509b280e729a871756abc99		21:00.49
sebras	dzonerzy: ok. I'm unable to reproduce.		21:01.16
dzonerzy	sorry checked now it was the previous commit my bad :D		21:01.34
	it was fixed without any answer on the issue, thanks much		21:01.59
sebras	dzonerzy: maybe ator forgot to close issue 105.		21:02.42
dzonerzy	ye thanks much for the fast fix :) i love mujs		21:03.06
sebras	dzonerzy: I'll pass that on to ator. :)		21:03.20
dzonerzy	I already closed the issue		21:04.04
sebras	dzonerzy: thanks.		21:04.24
dzonerzy	found a weird behavior		21:50.33
	max size for array is 0x7fffffff		21:50.43
	anyway if you declare var a=[] and then do a[1]=1, length is increased and becomes 1, but if you set a[0xffffffff+1]=1 length remains 0 but the property is there		21:51.36
	@ator @sebras		21:51.51
sebras	dzonerzy (for the logs): I'll leave that for ator to figure out tomorrow. perhaps there's a limit of 2**31-1 elements in an array. I wouldn't be surprised. should mujs detect it and error out somehow? perhaps.		21:59.42
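For comparison, ECMAScript itself only treats a property key as an array index when it is an integer from 0 to 2**32 - 2; any other key is stored as an ordinary property and leaves `length` alone, and setting index n raises `length` to at least n + 1. The model below sketches that rule in Python. Whether mujs follows it exactly, or applies a lower limit of its own, is not confirmed in the log; `ArraySketch` is a hypothetical class.

```python
MAX_ARRAY_INDEX = 2**32 - 2  # largest key ECMAScript counts as an array index


class ArraySketch:
    """Minimal model of how array length reacts to property writes."""

    def __init__(self):
        self.props = {}
        self.length = 0

    def set(self, key, value):
        # Every key is stored as a (string-named) property.
        self.props[str(key)] = value
        # Only genuine array indices extend the length.
        if isinstance(key, int) and 0 <= key <= MAX_ARRAY_INDEX:
            self.length = max(self.length, key + 1)


a = ArraySketch()
a.set(0xFFFFFFFF + 1, 1)   # 4294967296 is outside the index range
```

After the call, `a.length` is still 0 while `a.props` contains the key `"4294967296"`, matching the shape of the behaviour reported above.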

Log of #mupdf at irc.freenode.net.