| <<<Back 1 day (to 2020/04/29) | Fwd 1 day (to 2020/05/01)>>> | 20200430 |
ator | paulgardiner: ah, yes. I see the issue. the multi-language string writing commit broke combed fields. will fix. | 09:18.45 |
paulgardiner | ator: great thanks | 09:21.57 |
ator | paulgardiner: do you need this in the release? | 09:29.45 |
pedr0 | hi all. I am fiddling with mutool run - is it possible to print out or somehow loop over a Buffer obtained through readStream() instructions by instructions ? | 09:32.19 |
ator | what do you mean with "instructions by instructions"? | 09:34.00 |
paulgardiner | ator: I'd imagine we do. | 09:35.20 |
ator | paulgardiner: right. I'm pretty close to a fix, there are some bugs in it I can't quite figure out when mixing languages. | 09:36.39 |
| probably Tj resets some positioning state :/ | 09:36.48 |
| switching fonts in the middle messes it up | 09:36.57 |
| paulgardiner: I'm pretty confident I'll have a fix for you today | 09:37.23 |
paulgardiner | Magic | 09:37.30 |
ator | paulgardiner: eh, yes. it helps to use the right operator. Tj is not the same as TJ :) | 09:38.45 |
| paulgardiner: see tor/release branch | 09:42.11 |
| give that a test and review, please | 09:42.22 |
kiwi_66 | Hello, I need to massage PDFs for reading on e-readers so that pages that contain complicated layout look OK. | 09:54.07 |
| An alternative to the so-so job done by Poppler to convert a PDF to HTML (and then on to EPUB), is to turn those problematic pages from text to picture, and merge everything back into a full PDF. | 09:54.11 |
| It works, but those "picture" pages are much smaller than the "text" pages. Is there a way to crop the margins when turning PDFs into PNGs? | 09:54.16 |
| https://postimg.cc/gallery/Z96s6r3 | 09:54.20 |
| I used the following commands: | 09:54.24 |
| #Loop: Turn all fifty pages into individual PDFs | 09:54.28 |
| mutool clean -g input.pdf 1.pdf 1 | 09:54.28 |
| mutool clean -g input.pdf 2.pdf 2 | 09:54.28 |
| etc. | 09:54.28 |
| #Loop: Convert problematic pages from PDF to PNG | 09:54.32 |
| #213DPI, 758px width, 1024px height | 09:54.32 |
| mutool draw -r 213 -w 758 -h 1024 -o 13.png input.pdf 13 | 09:54.32 |
| mutool draw -r 213 -w 758 -h 1024 -o 34.png input.pdf 34 | 09:54.32 |
| etc. | 09:54.32 |
| #Loop: Convert PNG files into PDFs | 09:54.36 |
| mutool convert -O compress -F pdf -o 13.pdf 13.png | 09:54.36 |
| mutool convert -O compress -F pdf -o 34.pdf 34.png | 09:54.36 |
| etc. | 09:54.36 |
| #Merge all individual PDFs (untouched + turned into pictures) into single PDF | 09:54.38 |
| #TODO Find way to build list and pass it on to merge | 09:54.38 |
| mutool merge -o new.pdf -O compress 1.pdf 2.pdf etc. | 09:54.38 |
| Thank you. | 09:54.41 |
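The TODO above (building the file list for the merge step) can be sketched in Python. This is a minimal sketch, assuming the per-page files are named `1.pdf`, `2.pdf`, and so on as in the commands above; `page_number` and `build_merge_command` are hypothetical helper names. It only builds the argument list and does not run `mutool` itself.

```python
import re


def page_number(filename):
    """Return the integer in a per-page name such as '13.pdf'."""
    match = re.match(r"(\d+)\.pdf$", filename)
    if match is None:
        raise ValueError(f"not a per-page PDF name: {filename!r}")
    return int(match.group(1))


def build_merge_command(filenames, output="new.pdf"):
    """Sort per-page PDFs numerically and build the mutool merge argv."""
    # A plain string sort would put '10.pdf' before '2.pdf', so sort by number.
    ordered = sorted(filenames, key=page_number)
    return ["mutool", "merge", "-o", output, "-O", "compress", *ordered]


command = build_merge_command(["10.pdf", "2.pdf", "1.pdf", "13.pdf"])
print(" ".join(command))
```

The resulting list could be passed to `subprocess.run(command, check=True)` once the individual page files exist.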
ator | kiwi_66: margin cropping is not trivial, you may need to resort to imagemagick or some other tool to scan or crop the bitmaps | 10:01.02 |
kiwi_66 | That's what I suspected while reading the MuPDF manual. Thanks for the confirmation. | 10:01.37 |
ator | having a PNG with white borders adds very little to the file size, plain colored areas compress very well | 10:01.43 |
kiwi_66 | I don't mind the file size, but I wanted to see if the presentation could be improved so that those "picture" pages look closer to the "text" pages. | 10:02.29 |
kens | If you're really desperate you could use Ghostscript to determine the bounding box of the content in the original PDF, then some PostScript trickery to set the imaging 'window' and have GS render the PDF into that window. Assuming the white areas are genuinely unmarked, that would get rid of the white space in the image files. But it's non-trivial | 10:04.30 |
kiwi_66 | Will do. Thank you. | 10:04.57 |
ator | huh. we have a bbox device, but no way to call it from mutool :/ | 10:08.00 |
kens | Hmm sounds like you need to add it, surely that shouldn't be hard ? | 10:08.19 |
| It sounds like it might be useful to me | 10:08.37 |
| Obviously post-release :-) | 10:08.49 |
ator | kens: yeah. I've got a commit ready already :) | 10:14.19 |
kens | Wow that *was* fast | 10:14.29 |
ator | kens: question though is what format is most useful | 10:14.30 |
| what does GS output? | 10:14.43 |
kens | IIRC Ghostscript dumps it out in PostScript form, I doubt that's terribly useful | 10:14.51 |
| %%BoundingBox: llx lly urx ury | 10:15.07 |
ator | I made it dump it in some XML format. not the easiest to handle, but it fits with everything else we dump. | 10:15.13 |
kens | XML makes sense | 10:15.22 |
ator | <page bbox="llx lly urx ury" mediabox="llx lly urx ury" /> | 10:15.28 |
kens | Seems reasonable should be parseable by any reasonable XML parser and it makes sense to a human reader | 10:15.49 |
ator | easy enough to crack with awk or sed if you need to since it's line based | 10:15.51 |
kens | So the page bbox is the bounding box of the marks ? | 10:16.20 |
ator | yeah. the 'bbox' is the bbox of the marks | 10:16.30 |
| and I put the mediabox in there as well, because why not? | 10:16.46 |
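Since the proposed output is one `<page ... />` element per line, it can be picked apart with a regular expression. The following is a minimal Python sketch, assuming the attribute order and the `llx lly urx ury` layout quoted above; the sample values, `parse_page_line`, and `margins` are illustrative, and the format that actually ships may differ.

```python
import re

# Matches the line format quoted in the chat; the released format may differ.
PAGE_RE = re.compile(r'<page bbox="([^"]*)" mediabox="([^"]*)"\s*/>')


def parse_page_line(line):
    """Return (bbox, mediabox) as 4-tuples of floats, or None if no match."""
    match = PAGE_RE.search(line)
    if match is None:
        return None
    bbox = tuple(float(v) for v in match.group(1).split())
    mediabox = tuple(float(v) for v in match.group(2).split())
    return bbox, mediabox


def margins(bbox, mediabox):
    """Return the blank (left, bottom, right, top) margins in points."""
    return (
        bbox[0] - mediabox[0],
        bbox[1] - mediabox[1],
        mediabox[2] - bbox[2],
        mediabox[3] - bbox[3],
    )


# Made-up sample values for an A4-sized page.
sample = '<page bbox="72 90 523 770" mediabox="0 0 595 842" />'
bbox, mediabox = parse_page_line(sample)
print(margins(bbox, mediabox))
```

The margins could then feed a cropping step in whatever image tool does the actual trimming.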
kens | Fair enough then it looks fine to me, you might ask Robin for an opinion | 10:16.49 |
ator | yeah. I'll get Robin_Watts to review it. | 10:16.59 |
kens | MediaBox is always useful, but obviously for GS we can't know that the input is PDF | 10:17.17 |
ator | maybe bbox should be contentbox or drawbox or markbox or something more evocative than 'bbox' | 10:17.57 |
kens | Boundingbox then :-) | 10:18.18 |
ator | $ mutool draw -Fbbox input.pdf | 10:18.40 |
kens | markedboundsbox? | 10:18.46 |
ator | ENAMETOOLONG :) | 10:18.56 |
kens | Yeah anything truly descriptive is | 10:19.09 |
kiwi_66 | Using convert+CBZ to turn a 9,5MB PDF into pictures generates a 640MB PDF output. | 10:42.09 |
| "-O -compress" makes no difference. | 10:42.13 |
| Is there a way to reduce the file size? "number of bits of antialiasing", "resolution", "colorspace", etc. | 10:42.17 |
kens | lower resolution would be my bet | 10:42.40 |
| But basically, that's what happens when you render vectors to bitmaps | 10:43.01 |
kiwi_66 | Using "convert -O resolution=100 -o temp.cbz", it's even bigger (700MB) than without (640MB; Original: 9MB) | 10:59.19 |
kens | I don't know what the default resolution would have been | 10:59.51 |
kiwi_66 | How do I find this info? | 11:00.38 |
kens | If it was (for example) 96 dpi then yes, 100 dpi will produce a larger set of bitmaps and bigger output | 11:00.39 |
| convert is imagemagick, isn't it ? I've no idea how to tell what its default resolution is | 11:01.46 |
kiwi_66 | I'll read up on resolution, thanks | 11:02.12 |
kens | Ah, apparently it's 72 dpi | 11:02.18 |
kiwi_66 | No, it's mutool's convert | 11:02.22 |
kens | Oh sorry, wrong convert | 11:02.29 |
| Then I don't know, but presumably you could find out from the code | 11:02.54 |
pedr0 | ator: sorry for the delay. For instance I want to scan the stream searching for a specific instruction, that is, stream's instructions such as 'Tj' etc etc | 11:02.57 |
| I can get the stream - in a Buffer - I am not sure how to use such object to read the stream instruction by instruction | 11:04.12 |
ator | pedr0: you'd have to write a tokenizer yourself, the Buffer is just an array of bytes | 11:04.39 |
pedr0 | oks | 11:04.44 |
| can I build a string from it ? | 11:05.08 |
kens | ator what's the default resolution of mutool convert ? I can't find it anywhere | 11:05.12 |
ator | for (i = 0; i < buffer.length; ++i) buffer[i] accesses all the bytes | 11:05.27 |
| you'd have to build a string from those, or tokenize directly and build up temporary strings using String.fromCharCode(buffer[i]) | 11:06.42 |
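The tokenizer idea can be sketched as follows. This is written in Python over a `bytes` object rather than in `mutool run` JavaScript, so it shows the technique and is not drop-in code for the Buffer object. It is a simplified sketch: it handles whitespace, comments, literal and hex strings, and array/dictionary delimiters, but assumes the stream contains no inline image data (`BI ... ID ... EI`). `tokenize` and `find_show_text` are hypothetical names.

```python
WHITESPACE = b"\x00\t\n\x0c\r "
DELIMITERS = b"()<>[]{}/%"


def tokenize(data):
    """Yield tokens (as bytes) from a PDF content stream."""
    i, n = 0, len(data)
    while i < n:
        c = data[i:i + 1]
        if c in WHITESPACE:
            i += 1
        elif c == b"%":  # comment: skip to end of line
            while i < n and data[i:i + 1] not in b"\r\n":
                i += 1
        elif c == b"(":  # literal string with nested parens and escapes
            start, depth = i, 0
            while i < n:
                ch = data[i:i + 1]
                if ch == b"\\":
                    i += 1  # skip the escaped byte as well
                elif ch == b"(":
                    depth += 1
                elif ch == b")":
                    depth -= 1
                    if depth == 0:
                        break
                i += 1
            i += 1
            yield data[start:i]
        elif data[i:i + 2] in (b"<<", b">>"):  # dictionary delimiters
            yield data[i:i + 2]
            i += 2
        elif c == b"<":  # hex string
            end = data.find(b">", i)
            end = n - 1 if end < 0 else end
            yield data[i:end + 1]
            i = end + 1
        elif c in b"[]{}>)":
            yield c
            i += 1
        else:  # name, number, or operator
            start = i
            i += 1
            while i < n and data[i:i + 1] not in WHITESPACE + DELIMITERS:
                i += 1
            yield data[start:i]


def find_show_text(data):
    """Return the string operand preceding every Tj operator."""
    found, previous = [], None
    for token in tokenize(data):
        if token == b"Tj" and previous is not None:
            found.append(previous)
        previous = token
    return found


stream = b"BT /F1 12 Tf 72 720 Td (Hello \\(world\\)) Tj [(A) -50 (B)] TJ ET"
print(find_show_text(stream))
```

The same loop structure should carry over to JavaScript by indexing `buffer[i]` and comparing byte values, as described above.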
| kens: 72dpi I think | 11:06.52 |
pedr0 | is that buffer a JS object or is it a custom object part of the mutool environment ? | 11:06.53 |
| oks - I get it | 11:07.08 |
ator | it is a custom object wrapping a fz_buffer | 11:07.09 |
pedr0 | thanks for your help | 11:07.18 |
kens | ator thanks, that would explain the increase in size that kiwi_66 experienced with resolution of 100 then :-) | 11:07.21 |
ator | kens: ah yes, missed that bit of the conversation. I'm 99% certain it's 72dpi, and if not 72 then 96 | 11:08.17 |
kens | Either would explain the increase in size. | 11:08.31 |
kiwi_66 | Is there another setting besides "resolution" to have "mutool convert" build smaller files? | 11:08.32 |
ator | my bad, it is actually 96 dpi | 11:08.55 |
kens | Well you could render to grayscale, that would reduce the size, while discarding all the colour | 11:09.00 |
| 96 was my guess :-) | 11:09.08 |
ator | fz_parse_draw_options() reveals my lie! | 11:09.18 |
| opts->x_resolution = 96; | 11:09.27 |
kens | But resolution is the killer, because it's in each direction, so the output size grows with the square of the resolution: doubling it quadruples the size | 11:09.34 |
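A quick worked check of the resolution arithmetic, as a Python sketch. It assumes that output size tracks raw pixel count, which is only approximately true for compressed files; `scale_factor` is a hypothetical helper.

```python
def scale_factor(old_dpi, new_dpi):
    """Ratio of pixel counts when changing resolution in both directions."""
    return (new_dpi / old_dpi) ** 2


# Doubling the resolution quadruples the number of pixels.
print(scale_factor(96, 192))  # 4.0

# Going from the 96 dpi default to 100 dpi, starting from 640 MB.
print(round(640 * scale_factor(96, 100)))  # 694
```

The second figure is close to the roughly 700MB reported earlier for `resolution=100`, which is consistent with a 96 dpi default.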
ator | kiwi_66: you could convert to grayscale or monochrome | 11:10.02 |
| -O colorspace=mono or colorspace=gray | 11:10.17 |
kens | monochrome would reduce it a lot | 11:10.28 |
ator | with the obvious loss of color and anti-aliasing | 11:10.31 |
kiwi_66 | my e-reader is only b&w :-) | 11:10.47 |
ator | how many gray levels? | 11:10.55 |
kens | Then you may as well have monochrome | 11:10.57 |
kiwi_66 | 16 shades | 11:11.16 |
kens | Oh, that's gray scale, not monochrome, but still, not many grays | 11:11.30 |
kiwi_66 | entry level (but strong screen, thankfully for bike rides) | 11:12.03 |
kens | Does mutool convert do 4-bit output ? | 11:12.10 |
kens | suspects not | 11:12.23 |
kiwi_66 | With "-O colorspace=gray", I go from 640MB to 214MB :-) | 11:14.37 |
ator | mutool convert -o out%d.pbm -O colorspace=mono,graphics=aa0 input.pdf | 11:15.04 |
| you should get a bunch of pbm files that you can convert to PDF and get very small files | 11:15.16 |
kiwi_66 | ie. convert PDF to PBM, and then on to PDB? | 11:17.01 |
| PDF | 11:17.09 |
ator | yeah. rasterize the PDF to black-and-white images, then wrap those into a new PDF | 11:17.28 |
kiwi_66 | can mutool do this, or should I look at ImageMagick etc. ? | 11:17.46 |
ator | it's like PNG but will become smaller if it's black and white | 11:17.48 |
| just use 'pbm' as the suffix rather than 'png' | 11:17.59 |
kiwi_66 | ok | 11:18.03 |
paulgardiner | That looks to work thanks Tor | 11:20.33 |
ator | paulgardiner: great! | 11:20.44 |
kiwi_66 | mutool convert -o out%d.pbm -O colorspace=mono,graphics=aa0 input.pdf 72-74 | 11:25.52 |
| mutool merge -o output.pdf out1.pbm out2.pbm out3.pbm | 11:25.56 |
| error: cannot recognize version marker | 11:26.06 |
| warning: trying to repair broken xref | 11:26.06 |
| warning: repairing PDF document | 11:26.06 |
| error: invalid key in dict | 11:26.06 |
| error: no objects found | 11:26.06 |
| error: aborting process from uncaught error! | 11:26.07 |
ator | merge only takes PDF as input | 12:01.51 |
| zip output.cbz *.pbm; mutool convert -o output.pdf output.cbz | 12:02.14 |
| or convert the pbm to pdf before merging | 12:03.11 |
| like you did with PNG files earlier | 12:03.20 |
kiwi_66 | thx | 12:06.12 |
malc_ | ator: just hit a previously unseen warning while building with clang and -Weverything - http://tpaste.us/gk5d | 18:41.56 |
ator | Robin_Watts: ^ maybe something with the header file cleanups gone wrong? | 18:58.34 |
sebras | Robin_Watts: ator: http://git.ghostscript.com/?p=user/sebras/mupdf.git;a=commitdiff;h=15a4819739aa387031f3a4af074c1da7ff7dcb70 and http://git.ghostscript.com/?p=user/sebras/mupdf.git;a=commitdiff;h=05301a828419fb95e2d06ee01dc23c814330760b | 21:08.39 |
| both appear to cluster well. | 21:08.44 |
malc_ | sebras: thanks | 21:52.08 |