| <<<Back 1 day (to 2020/04/22) | Fwd 1 day (to 2020/04/24)>>> | 20200423 |
emfipp | can someone take a look at the pdf at https://dl.acm.org/doi/book/10.5555/578789 (just click on "PDF")? | 01:45.32 |
| mupdf 1.16.0 on windows (x64) cannot open the file | 01:45.42 |
| also, how could one possibly fix the pdf so that mupdf can open it? | 01:46.14 |
| mutool clean gives "error: cannot find startxref" | 02:00.08 |
| also "warning: object missing 'endobj' token" and "warning: PDF stream Length incorrect" | 02:00.53 |
sebras | emfipp: I'm looking at it. | 05:28.00 |
emfipp | sebras: thanks. later, I tried "mutool extract" and an error occurred on page 196 | 05:51.01 |
| which translates roughly to img-11729 | 05:51.20 |
| later I tried grepping the trailer object | 05:51.53 |
| looks like that is the thing that's missing | 05:52.06 |
| (feel free to suggest to me ways of fixing the file even if it requires some hex'ing work) | 05:53.01 |
sebras | emfipp: do you get an error message when you run mutool show aho.pdf trailer ? | 06:19.53 |
| maybe the file you downloaded was not complete? | 06:20.31 |
| I can display the copy I downloaded fine on linux using both mupdf-x11 and mupdf-gl as well as the windows binaries mupdf.exe and mupdf-gl.exe running in wine. | 06:22.30 |
| mutool extract does error out on me on page 786, because it seems to run out of memory. | 06:23.48 |
| so I can't reproduce the problems you are seeing. | 06:24.00 |
| emfipp: do you mind reporting a bug at bugs.ghostscript.com and writing down the exact commands you used, what pages were problematic and so on? one bug per problem please. :) | 06:24.38 |
| emfipp: also, where did you get the mupdf binary? did you build it yourself? | 06:24.53 |
ator | emfipp: double check that the file size of 578789.pdf is 99671413 bytes and that the md5sum is f76d4cbffe26110796b5bf0b9058a079 | 06:53.14 |
| your error sounds like a truncated file | 06:53.24 |
sebras | ator: mine is 99671378 and the md5sum is ccd8e153f8caf41d1f5333914862b45a | 06:53.54 |
| ator: I have downloaded things from acm before, I think they randomize the files somehow. | 06:54.13 |
ator | right. 30 bytes give or take is I guess fine when randomizing. | 06:55.03 |
| "cannot find startxref" is always the case with a truncated file | 06:55.21 |
sebras | yes. | 06:55.29 |
ator | and missing an endobj and a stream being cut short is another indication that it's just cut off | 06:55.39 |
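The truncation symptoms described above can be checked without a PDF library. What follows is a minimal sketch, not part of mutool: the function name `looks_truncated` and the 1024-byte tail window are assumptions made for illustration, and the sample bytes are synthetic. It only tests whether the `startxref` keyword and the `%%EOF` marker appear near the end of the data, so it is a heuristic rather than a validator.

```python
def looks_truncated(data: bytes, tail_size: int = 1024) -> bool:
    """Heuristic: a complete PDF normally ends with 'startxref', an offset, and '%%EOF'.

    Returns True when either marker is absent from the last `tail_size` bytes.
    """
    tail = data[-tail_size:]
    return b"startxref" not in tail or b"%%EOF" not in tail


# Synthetic example data (hypothetical, not a real document).
complete = (
    b"%PDF-1.4\n1 0 obj\n<< >>\nendobj\n"
    b"xref\n0 1\n0000000000 65535 f \n"
    b"trailer\n<< /Size 1 >>\nstartxref\n30\n%%EOF\n"
)
cut_short = complete[:40]  # simulate a download that stopped early

print(looks_truncated(complete))   # False
print(looks_truncated(cut_short))  # True
```

For a file on disk, reading only the last kilobyte (seek to the end, then read) would avoid loading roughly 100 MB into memory.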
sebras | I tried downloading it a second time and ended up at 99671396 bytes. I wonder what it is that they are randomizing... | 06:57.20 |
kens | It's easy enough to stick random unreferenced bytes into a PDF file | 06:57.57 |
ator | sebras: mutool show foo.pdf grep and diff? | 06:58.46 |
| looks like it's generated from scratch again, with different object and image numbering | 07:03.08 |
| so the flate compression differences on all the tiny streams is going to add up to a couple of bytes of diff | 07:03.33 |
sebras | yes, they are. | 07:04.16 |
emfipp | (while I was certain it had been complete, I think the error on page 786 seems to suggest this is an incomplete file --- 31,499,598 bytes in toto) | 07:04.25 |
sebras | the Length fields appear to change too. | 07:04.28 |
ator | the Length is the compressed length, not the uncompressed length | 07:04.49 |
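To illustrate the point about Length, here is a small sketch using Python's `zlib` as a stand-in for the Flate filter. The content stream bytes are made up for the example; the only claim is that the number written to /Length counts the compressed bytes stored in the file, not the decoded data.

```python
import zlib

# Hypothetical page content stream, repeated so it compresses well.
raw = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET\n" * 50

# What a writer would store between 'stream' and 'endstream' with /Filter /FlateDecode.
compressed = zlib.compress(raw)

# /Length records the size of the stored (compressed) bytes.
stream_length = len(compressed)

print(len(raw), stream_length)
```

Because the stored bytes depend on the exact input, a regenerated file with different object or image names can end up with slightly different Length values even when the pages look identical.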
sebras | emfipp: does acrobat reader show it without problems? | 07:04.51 |
emfipp | sorry, but I do not run programs such as adobe reader on any machine I control | 07:05.13 |
ator | good for you :) wish I had that luxury. | 07:05.24 |
emfipp | what's the file size on your end though, sebras? | 07:05.56 |
| (oddly firefox did report download complete) | 07:06.14 |
sebras | emfipp: I've downloaded the file three times, (firefox, chromium and wget) each time they are around 99671323 bytes. but your size of 31,499,598 bytes seems way off. | 07:06.46 |
emfipp | ok. I guess something's wrong with my firefox | 07:07.03 |
| or acm simply doesn't want me to download the entire file | 07:07.14 |
sebras | emfipp: it took a long time for me to download the entire file. | 07:07.35 |
emfipp | same here too. I can only http-download files of that size at certain time of day | 07:08.02 |
ator | emfipp: I think firefox may be confused, I don't think it sets a content-length so can't know if it's complete or cut short | 07:08.10 |
emfipp | or connection teardown happens | 07:08.12 |
| (and bloody hell acm gives out free books yet still no byte range requests) | 07:08.36 |
ator | curl -c cookiejar -I -L 'https://dl.acm.org/doi/pdf/10.5555/578789?download=true' shows no Content-Length header on the final redirect with mimetype application/pdf | 07:09.55 |
| and given the random nature of the file when we download it, I'll just assume it's generating the PDF on the server on the fly and not serving up a static file | 07:10.31 |
| which would explain why no byte range requests work, etc. | 07:10.44 |
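The consequence of a missing Content-Length can be sketched as follows. The helper name `download_complete` is hypothetical, and the header dictionaries are invented inputs rather than captured responses; the point is only that without a declared length a client has nothing to compare the received byte count against.

```python
from typing import Mapping, Optional


def download_complete(headers: Mapping[str, str], bytes_received: int) -> Optional[bool]:
    """Compare received bytes with Content-Length, if the server sent one.

    Returns None when no Content-Length is present, meaning completeness
    cannot be decided from the headers alone.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    declared = lowered.get("content-length")
    if declared is None:
        return None
    return int(declared) == bytes_received


# No Content-Length: the client cannot tell a full download from a cut-off one.
print(download_complete({"Content-Type": "application/pdf"}, 31499598))  # None
# With a declared length, a short download is detectable.
print(download_complete({"Content-Length": "99671413"}, 31499598))       # False
```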
emfipp | so acm is fingerprinting those pdf files? | 07:18.56 |
emfipp | shudders | 07:18.58 |
| rule no. 1: gs-reparse every pdf file downloaded from the internet | 07:19.50 |
sebras | emfipp: there is a randomized ID-field in the PDF trailer. they might use that for fingerprinting the files. acm normally only offers these files for paid download so I'm not terribly surprised. | 07:23.57 |
sarmols | Hi, I could use some help. I am using the MuPDF API and need a way to get rectangles around all the words on a page. A word is a string that is separated by either a space or a newline. Can anyone think of a way to do this? Because I just can't come up with anything, not even some really hacky way to implement this. | 08:58.45 |
Robin_Watts_ | sarmols: Have you looked at the stext functions? | 09:13.40 |
| fz_new_stext_page_from_{page,page_number,display} | 09:15.13 |
| any of those will get you an fz_stext_page. | 09:15.22 |
| That's a structure you can walk, it contains the locations/bbox for every char on a page. | 09:15.46 |
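The grouping step that turns per-character boxes into word rectangles can be sketched independently of MuPDF. The code below does not call the MuPDF API: it assumes the characters have already been read out of the structured text page into plain Python lists, one list per line, each entry a `(char, (x0, y0, x1, y1))` pair. The names `word_rects` and `union`, and the sample coordinates, are hypothetical.

```python
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)
Char = Tuple[str, Rect]                   # one character and its bounding box


def union(a: Rect, b: Rect) -> Rect:
    """Smallest rectangle containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def word_rects(lines: List[List[Char]]) -> List[Tuple[str, Rect]]:
    """Split each line on whitespace and return (word, bounding rect) pairs.

    A line break always ends the current word, matching the definition
    of a word as text separated by a space or a newline.
    """
    words: List[Tuple[str, Rect]] = []
    for line in lines:
        text = ""
        rect: Optional[Rect] = None
        for ch, box in line:
            if ch.isspace():
                if rect is not None:
                    words.append((text, rect))
                text, rect = "", None
            else:
                text += ch
                rect = box if rect is None else union(rect, box)
        if rect is not None:  # flush the last word on the line
            words.append((text, rect))
    return words


# Made-up character boxes for two lines: "Hi yo" and "u".
sample = [
    [("H", (0, 0, 5, 10)), ("i", (5, 0, 8, 10)), (" ", (8, 0, 11, 10)),
     ("y", (11, 0, 16, 12)), ("o", (16, 0, 21, 10))],
    [("u", (0, 12, 5, 22))],
]
result = word_rects(sample)
print(result)
```

In a real program the input lists would be filled by walking the blocks, lines and characters of the fz_stext_page; how the character boxes are obtained there depends on the MuPDF version in use.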
sarmols | Robin_Watts_: Great, thank you! I'll check it out. | 09:57.52 |
paulgardiner | Seeing something strange when signing documents. I'm returning a value for max_digest that assures we have well over the space allocated in the file. Usually that would mean I'd see a trailer of zeros, but it seems to be filling out with random garbage | 16:09.20 |
| Ah I bet output streams have changed and our wrapper around objC that implements one no longer works | 16:13.02 |