Log of #mupdf at irc.freenode.net.

20200423
emfipp can someone take a look at the pdf at https://dl.acm.org/doi/book/10.5555/578789 (just click on "PDF")? 01:45.32
  mupdf 1.16.0 on windows (x64) cannot open the file 01:45.42
  also, how could one possibly fix the pdf so that mupdf can open it? 01:46.14
  mutool clean gives "error: cannot find startxref" 02:00.08
  also "warning: object missing 'endobj' token" and "warning: PDF stream Length incorrect" 02:00.53
sebras emfipp: I'm looking at it. 05:28.00
emfipp sebras: thanks. later, I tried "mutool extract" and an error occurred on page 196 05:51.01
  which translates roughly to img-11729 05:51.20
  later I tried grepping the trailer object 05:51.53
  looks like that is the thing that's missing 05:52.06
  (feel free to suggest to me ways of fixing the file even if it requires some hex'ing work) 05:53.01
sebras emfipp: do you get an error message when you run mutool show aho.pdf trailer ? 06:19.53
  maybe the file you downloaded was not complete? 06:20.31
  I can display the copy I downloaded fine on linux using both mupdf-x11 and mupdf-gl as well as the windows binaries mupdf.exe and mupdf-gl.exe running in wine. 06:22.30
  mutool extract does error out on me on page 786, because it seems to run out of memory. 06:23.48
  so I can't reproduce the problems you are seeing. 06:24.00
  emfipp: do you mind reporting a bug at bugs.ghostscript.com and writing down the exact commands you used, what pages were problematic and so on? one bug per problem please. :) 06:24.38
  emfipp: also, where did you get the mupdf binary? did you build it yourself? 06:24.53
ator emfipp: double check that the file size of 578789.pdf is 99671413 bytes and that the md5sum is f76d4cbffe26110796b5bf0b9058a079 06:53.14
  your error sounds like a truncated file 06:53.24
sebras ator: mine is 99671378 and the md5sum is ccd8e153f8caf41d1f5333914862b45a 06:53.54
  ator: I have downloaded things from acm before, I think they randomize the files somehow. 06:54.13
ator right. 30 bytes give or take is I guess fine when randomizing. 06:55.03
  "cannot find startxref" is always the case with a truncated file 06:55.21
sebras yes. 06:55.29
ator and missing an endobj and a stream being cut short is another indication that it's just cut off 06:55.39
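[Editor's note] The truncation symptoms ator describes can be checked without MuPDF. The sketch below only looks for the `startxref` keyword and the `%%EOF` marker near the end of the file. The function names and the 1024-byte window are assumptions made for illustration; this is a rough heuristic, not how mutool itself locates the cross-reference table.

```python
def pdf_tail_report(data: bytes, window: int = 1024) -> dict:
    """Report which end-of-file markers appear in the last `window` bytes."""
    tail = data[-window:]  # slicing is safe even if the file is shorter than the window
    return {
        "has_startxref": b"startxref" in tail,
        "has_eof_marker": b"%%EOF" in tail,
    }


def looks_truncated(data: bytes, window: int = 1024) -> bool:
    """Heuristic: a complete PDF normally ends with startxref, an offset, and %%EOF."""
    report = pdf_tail_report(data, window)
    return not (report["has_startxref"] and report["has_eof_marker"])
```

For a file on disk, `looks_truncated(open(path, "rb").read())` would do for small files; for a file near 100 MB, reading only the last kilobyte with `seek` is the kinder option.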
sebras I tried downloading it a second time and ended up at 99671396 bytes. I wonder what it is that they are randomizing... 06:57.20
kens It's easy enough to stick random unreferenced bytes into a PDF file 06:57.57
ator sebras: mutool show foo.pdf grep and diff? 06:58.46
  looks like it's generated from scratch again, with different object and image numbering 07:03.08
  so the flate compression differences on all the tiny streams are going to add up to a couple of bytes of diff 07:03.33
sebras yes, they are. 07:04.16
emfipp (while I was certain it had been complete, I think the error on page 786 seems to suggest this is an incomplete file --- 31,499,598 bytes in toto) 07:04.25
sebras the Length fields appear to change too. 07:04.28
ator the Length is the compressed length, not the uncompressed length 07:04.49
sebras emfipp: does acrobat reader show it without problems? 07:04.51
emfipp sorry, but I do not run programs such as adobe reader on any machine I control 07:05.13
ator good for you :) wish I had that luxury. 07:05.24
emfipp what's the file size on your end though, sebras? 07:05.56
  (oddly firefox did report download complete) 07:06.14
sebras emfipp: I've downloaded the file three times (firefox, chromium and wget), each time it is around 99671323 bytes. but your size of 31,499,598 bytes seems way off. 07:06.46
emfipp ok. I guess something's wrong with my firefox 07:07.03
  or acm simply doesn't want me to download the entire file 07:07.14
sebras emfipp: it took a long time for me to download the entire file. 07:07.35
emfipp same here too. I can only http-download files of that size at certain times of day 07:08.02
ator emfipp: I think firefox may be confused, I don't think it sets a content-length so can't know if it's complete or cut short 07:08.10
emfipp or connection teardown happens 07:08.12
  (and bloody hell acm gives out free books yet still no byte range requests) 07:08.36
ator curl -c cookiejar -I -L 'https://dl.acm.org/doi/pdf/10.5555/578789?download=true' shows no Content-Length header on the final redirect with mimetype application/pdf 07:09.55
  and given the random nature of the file when we download it, I'll just assume it's generating the PDF on the server on the fly and not serving up a static file 07:10.31
  which would explain why no byte range requests work, etc. 07:10.44
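[Editor's note] ator's point about the missing Content-Length header can be put as a small check: without that header a client has nothing to compare the received byte count against. The following is a minimal sketch. The function name and the three-way return value are assumptions chosen for illustration, and it does not reflect how Firefox actually decides a download is complete.

```python
from typing import Mapping, Optional


def download_complete(headers: Mapping[str, str], bytes_received: int) -> Optional[bool]:
    """Return True/False if the size can be verified, or None if it cannot."""
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    declared = lowered.get("content-length")
    if declared is None:
        # No declared size: a cut-off transfer is indistinguishable from a full one.
        return None
    return int(declared) == bytes_received
```

With the headers reported above (a Content-Type of application/pdf and no Content-Length), the function returns `None`, which matches the situation where a 31,499,598-byte partial download was reported as finished.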
emfipp so acm is fingerprinting those pdf files? 07:18.56
emfipp shudders 07:18.58
  rule no. 1: gs-reparse every pdf file downloaded from the internet 07:19.50
sebras emfipp: there is a randomized ID-field in the PDF trailer. they might use that for fingerprinting the files. acm normally only offers these files for paid download so I'm not terribly surprised. 07:23.57
sarmols Hi, I could use some help. I am using the MuPDF API and need a way to get rectangles around all the words on a page. A word is a string that is separated by either a space or a newline. Can anyone think of a way to do this? Because I just can't come up with anything, not even some really hacky way to implement this. 08:58.45
Robin_Watts_ sarmols: Have you looked at the stext functions? 09:13.40
  fz_new_stext_page_from_{page,page_number,display} 09:15.13
  any of those will get you an fz_stext_page. 09:15.22
  That's a structure you can walk, it contains the locations/bbox for every char on a page. 09:15.46
sarmols Robin_Watts_: Great, thank you! I'll check it out. 09:57.52
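[Editor's note] Once the per-character boxes have been read out of the fz_stext_page, grouping them into word rectangles is a short loop. The sketch below deliberately does not call MuPDF. It assumes the characters have already been collected line by line as `(char, (x0, y0, x1, y1))` tuples, which is a hypothetical layout chosen for illustration. It splits on whitespace and treats the end of each line as a word boundary, following sarmols' definition of a word.

```python
from typing import Iterable, List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1), hypothetical layout
Char = Tuple[str, Rect]                   # one character and its bounding box


def union(a: Rect, b: Rect) -> Rect:
    """Smallest axis-aligned rectangle covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def word_rects(lines: Iterable[Iterable[Char]]) -> List[Tuple[str, Rect]]:
    """Group characters into words and return each word with its bounding box."""
    words: List[Tuple[str, Rect]] = []
    for line in lines:
        text = ""
        rect: Optional[Rect] = None
        for ch, box in line:
            if ch.isspace():
                # Whitespace closes the current word, if one is open.
                if rect is not None:
                    words.append((text, rect))
                text, rect = "", None
            else:
                text += ch
                rect = box if rect is None else union(rect, box)
        # The end of a line also ends a word.
        if rect is not None:
            words.append((text, rect))
    return words
```

Keeping the grouping separate from the extraction means the same loop works regardless of which of the fz_new_stext_page_from_* functions produced the page.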
paulgardiner Seeing something strange when signing documents. I'm returning a value for max_digest that assures we have well over the space allocated in the file. Usually that would mean I'd see a trailer of zeros, but it seems to be filling out with random garbage 16:09.20
  Ah I bet output streams have changed and our wrapper around objC that implements one no longer works 16:13.02