MuPDF IRC logs

	<<<Back 1 day (to 2020/04/29)	Fwd 1 day (to 2020/05/01)>>>	20200430
ator	paulgardiner: ah, yes. I see the issue. the multi-language string writing commit broke combed fields. will fix.		09:18.45
paulgardiner	ator: great thanks		09:21.57
ator	paulgardiner: do you need this in the release?		09:29.45
pedr0	hi all. I am fiddling with mutool run - is it possible to print out or somehow loop over a Buffer obtained through readStream() instructions by instructions ?		09:32.19
ator	what do you mean with "instructions by instructions"?		09:34.00
paulgardiner	ator: I'd imagine we do.		09:35.20
ator	paulgardiner: right. I'm pretty close to a fix, there are some bugs in it I can't quite figure out when mixing languages.		09:36.39
	probably Tj resets some positioning state :/		09:36.48
	switching fonts in the middle messes it up		09:36.57
	paulgardiner: I'm pretty confident I'll have a fix for you today		09:37.23
paulgardiner	Magic		09:37.30
ator	paulgardiner: eh, yes. it helps to use the right operator. Tj is not the same as TJ :)		09:38.45
	paulgardiner: see tor/release branch		09:42.11
	give that a test and review, please		09:42.22
kiwi_66	Hello, I need to massage PDFs for reading on e-readers so that pages that contain complicated layout look OK.		09:54.07
	An alternative to the so-so job done by Poppler to convert a PDF to HTML (and then on to EPUB), is to turn those problematic pages from text to picture, and merge everything back into a full PDF.		09:54.11
	It works, but those "picture" pages are much smaller than the "text" pages. Is there a way to crop the margins when turning PDFs into PNGs?		09:54.16
	https://postimg.cc/gallery/Z96s6r3		09:54.20
	I used the following commands:		09:54.24
	#Loop: Turn all fifty pages into individual PDFs		09:54.28
	mutool clean -g input.pdf 1.pdf 1		09:54.28
	mutool clean -g input.pdf 2.pdf 2		09:54.28
	etc.		09:54.28
	#Loop: Convert problematic pages from PDF to PNG		09:54.32
	#213DPI, 758px width, 1024px height		09:54.32
	mutool draw -r 213 -w 758 -h 1024 -o 13.png input.pdf 13		09:54.32
	mutool draw -r 213 -w 758 -h 1024 -o 34.png input.pdf 34		09:54.32
	etc.		09:54.32
	#Loop: Convert PNG files into PDFs		09:54.36
	mutool convert -O compress -F pdf -o 13.pdf 13.png		09:54.36
	mutool convert -O compress -F pdf -o 34.pdf 34.png		09:54.36
	etc.		09:54.36
	#Merge all individual PDFs (untouched + turned into pictures) into single PDF		09:54.38
	#TODO Find way to build list and pass it on to merge		09:54.38
	mutool merge -o new.pdf -O compress 1.pdf 2.pdf etc.		09:54.38
	Thank you.		09:54.41
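
The TODO above (building the file list for the merge step) could be sketched as follows. This is only a sketch: the helper name build_merge_list is hypothetical, the page count of 50 is taken from the "fifty pages" comment, and the merge command is printed rather than run so it can be checked first.

```shell
#!/bin/sh
# Sketch only: build the ordered list "1.pdf 2.pdf ... N.pdf" for mutool merge.
# build_merge_list is a hypothetical helper name, not part of mutool.

build_merge_list() {
    # $1 = number of pages; prints "1.pdf 2.pdf ... N.pdf" on one line
    n=$1
    i=1
    list=""
    while [ "$i" -le "$n" ]; do
        # Append a separating space only when the list is not empty
        list="$list${list:+ }$i.pdf"
        i=$((i + 1))
    done
    printf '%s\n' "$list"
}

# Print the merge command instead of running it, so it can be inspected first.
echo mutool merge -o new.pdf -O compress $(build_merge_list 50)
```

Counting in a loop keeps the pages in numeric order; a plain *.pdf glob would sort 10.pdf before 2.pdf.
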
ator	kiwi_66: margin cropping is not trivial, you may need to resort to imagemagick or some other tool to scan or crop the bitmaps		10:01.02
kiwi_66	That's what I suspected while reading the MuPDF manual. Thanks for the confirmation.		10:01.37
ator	having a PNG with white borders adds very little to the file size, plain colored areas compress very well		10:01.43
kiwi_66	I don't mind the file size, but I wanted to see if the presentation could be improved so that those "picture" pages look closer to the "text" pages.		10:02.29
kens	If you're really desperate you could use Ghostscript to determine the bounding box of the content in the original PDF, then some PostScript trickery to set the imaging 'window' and have GS render the PDF into that window. Assuming the white areas are genuinely unmarked, that would get rid of the white space in the image files. But it's non-trivial		10:04.30
kiwi_66	Will do. Thank you.		10:04.57
ator	huh. we have a bbox device, but no way to call it from mutool :/		10:08.00
kens	Hmm sounds like you need to add it, surely that shouldn't be hard ?		10:08.19
	It sounds like it might be useful to me		10:08.37
	Obviously post-release :-)		10:08.49
ator	kens: yeah. I've got a commit ready already :)		10:14.19
kens	Wow that was fast		10:14.29
ator	kens: question though is what format is most useful		10:14.30
	what does GS output?		10:14.43
kens	IIRC Ghostscript dumps it out in PostScript form, I doubt that's terribly useful		10:14.51
	%%BoundingBox: llx lly urx ury		10:15.07
ator	I made it dump it in some XML format. not the easiest to handle, but it fits with everything else we dump.		10:15.13
kens	XML makes sense		10:15.22
ator	<page bbox="llx lly urx ury" mediabox="llx lly urx ury" />		10:15.28
kens	Seems reasonable; it should be parseable by any reasonable XML parser, and it makes sense to a human reader		10:15.49
ator	easy enough to crack with awk or sed if you need to since it's line based		10:15.51
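
Since the proposed output is line based, extracting the bbox with sed might look like the sketch below. It assumes the line shape quoted in the log (<page bbox="..." mediabox="..." />); the released format may differ, the sample numbers are invented, and extract_bbox is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: pull the four bbox numbers out of lines shaped like
#   <page bbox="llx lly urx ury" mediabox="llx lly urx ury" />
# The sample numbers below are invented for illustration.

extract_bbox() {
    # Reads XML lines on stdin; prints "llx lly urx ury" for each <page> line.
    sed -n 's/.*<page bbox="\([^"]*\)".*/\1/p'
}

printf '%s\n' '<page bbox="72 90 523 770" mediabox="0 0 595 842" />' | extract_bbox
```

The four numbers could then be handed to an image tool to crop the rendered page.
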
kens	So the page bbox is the bounding box of the marks ?		10:16.20
ator	yeah. the 'bbox' is the bbox of the marks		10:16.30
	and I put the mediabox in there as well, because why not?		10:16.46
kens	Fair enough then it looks fine to me, you might ask Robin for an opinion		10:16.49
ator	yeah. I'll get Robin_Watts to review it.		10:16.59
kens	MediaBox is always useful, but obviously for GS we can't know that the input is PDF		10:17.17
ator	maybe bbox should be contentbox or drawbox or markbox or something more evocative than 'bbox'		10:17.57
kens	Boundingbox then :-)		10:18.18
ator	$ mutool draw -Fbbox input.pdf		10:18.40
kens	markedboundsbox?		10:18.46
ator	ENAMETOOLONG :)		10:18.56
kens	Yeah anything truly descriptive is		10:19.09
kiwi_66	Using convert+CBZ to turn a 9.5MB PDF into pictures generates a 640MB PDF output.		10:42.09
	"-O -compress" makes no difference.		10:42.13
	Is there a way to reduce the file size? "number of bits of antialiasing", "resolution", "colorspace", etc.		10:42.17
kens	lower resolution would be my bet		10:42.40
	But basically, that's what happens when you render vectors to bitmaps		10:43.01
kiwi_66	Using "convert -O resolution=100 -o temp.cbz", it's even bigger (700MB) than without (640MB; Original: 9MB)		10:59.19
kens	I don't know what the default resolution would have been		10:59.51
kiwi_66	How do I find this info?		11:00.38
kens	If it was (for example) 96 dpi then yes, 100 dpi will produce a larger set of bitmaps and bigger output		11:00.39
	convert is imagemagick, isn't it ? I've no idea how to tell what its default resolution is		11:01.46
kiwi_66	I'll read up on resolution, thanks		11:02.12
kens	Ah, apparently it's 72 dpi		11:02.18
kiwi_66	No, it's mutool's convert		11:02.22
kens	Oh sorry, wrong convert		11:02.29
	Then I don't know, but presumably you could find out from the code		11:02.54
pedr0	ator: sorry for the delay. For instance I want to scan the stream searching for a specific instruction, that is, stream's instructions such as 'Tj' etc etc		11:02.57
	I can get the stream - in a Buffer - I am not sure how to use such object to read the stream instruction by instruction		11:04.12
ator	pedr0: you'd have to write a tokenizer yourself, the Buffer is just an array of bytes		11:04.39
pedr0	oks		11:04.44
	can I build a string from it ?		11:05.08
kens	ator what's the default resolution of mutool convert ? I can't find it anywhere		11:05.12
ator	for (i = 0; i < buffer.length; ++i) buffer[i] accesses all the bytes		11:05.27
	you'd have to build a string from those, or tokenize directly and build up temporary strings using String.fromCharCode(buffer[i])		11:06.42
	kens: 72dpi I think		11:06.52
pedr0	is that buffer a JS object or is it a custom object part of the mutool environment ?		11:06.53
	oks - I get it		11:07.08
ator	it is a custom object wrapping a fz_buffer		11:07.09
pedr0	thanks for your help		11:07.18
kens	ator thanks, that would explain the increase in size that kiwi_66 experienced with resolution of 100 then :-)		11:07.21
ator	kens: ah yes, missed that bit of the conversation. I'm 99% certain it's 72dpi, and if not 72 then 96		11:08.17
kens	Either would explain the increase in size.		11:08.31
kiwi_66	Is there another setting besides "resolution" to have "mutool convert" build smaller files?		11:08.32
ator	my bad, it is actually 96 dpi		11:08.55
kens	Well you could render to grayscale, that would reduce the size, while discarding all the colour		11:09.00
	96 was my guess :-)		11:09.08
ator	fz_parse_draw_options() reveals my lie!		11:09.18
	opts->x_resolution = 96;		11:09.27
kens	But resolution is the killer, because it's in each direction, so doubling it quadruples the output size		11:09.34
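
The effect of resolution on size can be checked with a little arithmetic: the raw pixel count grows with the square of the resolution ratio. The sketch below uses a hypothetical helper, scale_factor; compressed file sizes will only roughly follow the raw pixel count.

```shell
#!/bin/sh
# Sketch only: raw pixel count scales with the square of the resolution ratio.
# Compressed file sizes only roughly follow this.

scale_factor() {
    # $1 = old dpi, $2 = new dpi; prints (new/old)^2 with two decimals
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", (b / a) ^ 2 }'
}

scale_factor 96 100   # the 96 dpi default vs. the resolution=100 that was tried
scale_factor 96 192   # doubling the resolution
```

The first ratio is a growth of roughly 9%, which is in line with the 640MB to 700MB change reported earlier in the log.
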
ator	kiwi_66: you could convert to grayscale or monochrome		11:10.02
	-O colorspace=mono or colorspace=gray		11:10.17
kens	monochrome would reduce it a lot		11:10.28
ator	with the obvious loss of color and anti-aliasing		11:10.31
kiwi_66	my e-reader is only b&w :-)		11:10.47
ator	how many gray levels?		11:10.55
kens	Then you may as well have monochrome		11:10.57
kiwi_66	16 shades		11:11.16
kens	Oh, that's gray scale, not monochrome, but still, not many grays		11:11.30
kiwi_66	entry level (but strong screen, thankfully for bike rides)		11:12.03
kens	Does mutool convert do 4-bit output ?		11:12.10
*kens*	suspects not		11:12.23
kiwi_66	With "-O colorspace=gray", I go from 640MB to 214MB :-)		11:14.37
ator	mutool convert -o out%d.pbm -O colorspace=mono,graphics=aa0 input.pdf		11:15.04
	you should get a bunch of pbm files that you can convert to PDF and get very small files		11:15.16
kiwi_66	ie. convert PDF to PBM, and then on to PDB?		11:17.01
	PDF		11:17.09
ator	yeah. rasterize the PDF to black-and-white images, then wrap those into a new PDF		11:17.28
kiwi_66	can mutool do this, or should I look at ImageMagick etc. ?		11:17.46
ator	it's like PNG but will become smaller if it's black and white		11:17.48
	just use 'pbm' as the suffix rather than 'png'		11:17.59
kiwi_66	ok		11:18.03
paulgardiner	That looks to work thanks Tor		11:20.33
ator	paulgardiner: great!		11:20.44
kiwi_66	mutool convert -o out%d.pbm -O colorspace=mono,graphics=aa0 input.pdf 72-74		11:25.52
	mutool merge -o output.pdf out1.pbm out2.pbm out3.pbm		11:25.56
	error: cannot recognize version marker		11:26.06
	warning: trying to repair broken xref		11:26.06
	warning: repairing PDF document		11:26.06
	error: invalid key in dict		11:26.06
	error: no objects found		11:26.06
	error: aborting process from uncaught error!		11:26.07
ator	merge only takes PDF as input		12:01.51
	zip output.cbz *.pbm; mutool convert -o output.pdf output.cbz		12:02.14
	or convert the pbm to pdf before merging		12:03.11
	like you did with PNG files earlier		12:03.20
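
The second route (convert each PBM to PDF, then merge) could be scripted roughly as below. This is a sketch: plan_pbm_merge is a hypothetical helper, the convert options are copied from the PNG step earlier in the log, and the commands are only printed so they can be reviewed before being run.

```shell
#!/bin/sh
# Sketch only: print the commands for the "convert each PBM to PDF, then merge" route.
# plan_pbm_merge is a hypothetical helper; file names must not contain spaces.

plan_pbm_merge() {
    # $1 = output PDF, remaining arguments = PBM files in page order
    out=$1
    shift
    pdfs=""
    for f in "$@"; do
        pdf="${f%.pbm}.pdf"
        echo "mutool convert -O compress -F pdf -o $pdf $f"
        pdfs="$pdfs${pdfs:+ }$pdf"
    done
    echo "mutool merge -o $out $pdfs"
}

plan_pbm_merge output.pdf out1.pbm out2.pbm out3.pbm
```

Once the printed commands look right, the output can be piped to sh to run them.
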
kiwi_66	thx		12:06.12
malc_	ator: just hit a previously unseen warning while building with clang and -Weverything - http://tpaste.us/gk5d		18:41.56
ator	Robin_Watts: ^ maybe something with the header file cleanups gone wrong?		18:58.34
sebras	Robin_Watts: ator: http://git.ghostscript.com/?p=user/sebras/mupdf.git;a=commitdiff;h=15a4819739aa387031f3a4af074c1da7ff7dcb70 and http://git.ghostscript.com/?p=user/sebras/mupdf.git;a=commitdiff;h=05301a828419fb95e2d06ee01dc23c814330760b		21:08.39
	both appear to cluster well.		21:08.44
malc_	sebras: tack		21:52.08

Log of #mupdf at irc.freenode.net.