MuPDF IRC logs

20200423
emfipp	can someone take a look at the pdf at https://dl.acm.org/doi/book/10.5555/578789 (just click on "PDF")?		01:45.32
	mupdf 1.16.0 on windows (x64) cannot open the file		01:45.42
	also, how could one possibly fix the pdf so that mupdf can open it?		01:46.14
	mutool clean gives "error: cannot find startxref"		02:00.08
	also "warning: object missing 'endobj' token" and "warning: PDF stream Length incorrect"		02:00.53
sebras	emfipp: I'm looking at it.		05:28.00
emfipp	sebras: thanks. later, I tried "mutool extract" and an error occurred on page 196		05:51.01
	which translates roughly to img-11729		05:51.20
	later I tried grepping the trailer object		05:51.53
	looks like that is the thing that is missing		05:52.06
	(feel free to suggest to me ways of fixing the file even if it requires some hex'ing work)		05:53.01
sebras	emfipp: do you get an error message when you run mutool show aho.pdf trailer ?		06:19.53
	maybe the file you downloaded was not complete?		06:20.31
	I can display the copy I downloaded fine on linux using both mupdf-x11 and mupdf-gl as well as the windows binaries mupdf.exe and mupdf-gl.exe running in wine.		06:22.30
	mutool extract does error out on me on page 786, because it seems to run out of memory.		06:23.48
	so I can't reproduce the problems you are seeing.		06:24.00
	emfipp: do you mind reporting a bug at bugs.ghostscript.com and writing down the exact commands you used, what pages were problematic and so on? one bug per problem please. :)		06:24.38
	emfipp: also, where did you get the mupdf binary? did you build it yourself?		06:24.53
ator	emfipp: double check that the file size of 578789.pdf is 99671413 bytes and that the md5sum is f76d4cbffe26110796b5bf0b9058a079		06:53.14
	your error sounds like a truncated file		06:53.24
sebras	ator: mine is 99671378 and the md5sum is ccd8e153f8caf41d1f5333914862b45a		06:53.54
	ator: I have downloaded things from acm before, I think they randomize the files somehow.		06:54.13
ator	right. 30 bytes give or take is I guess fine when randomizing.		06:55.03
	"cannot find startxref" is always the case with a truncated file		06:55.21
sebras	yes.		06:55.29
ator	and missing an endobj and a stream being cut short is another indication that it's just cut off		06:55.39
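The truncation symptoms described above can be checked for directly. The following is a minimal sketch, not part of MuPDF: it assumes that a complete PDF carries the `startxref` keyword and a `%%EOF` marker close to the end of the file, and it reports a file as suspect when either is absent from the tail. The function name `looks_truncated` and the `tail_size` parameter are hypothetical.

```python
import os


def looks_truncated(path, tail_size=2048):
    """Heuristic check for a cut-off PDF.

    Assumption: a complete PDF has 'startxref' and '%%EOF' within the
    last `tail_size` bytes. A missing marker suggests the download
    stopped early; it is not proof of any other kind of damage.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        # Read only the tail; the file may be close to 100 MB.
        f.seek(max(0, size - tail_size))
        tail = f.read()
    has_startxref = b"startxref" in tail
    has_eof = b"%%EOF" in tail
    return not (has_startxref and has_eof)
```

A file that fails this check is probably better downloaded again than repaired, since the missing tail would normally hold the cross-reference data and trailer.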
sebras	I tried downloading it a second time and ended up at 99671396 bytes. I wonder what it is that they are randomizing...		06:57.20
kens	It's easy enough to stick random unreferenced bytes into a PDF file		06:57.57
ator	sebras: mutool show foo.pdf grep and diff?		06:58.46
	looks like it's generated from scratch again, with different object and image numbering		07:03.08
	so the flate compression differences on all the tiny streams are going to add up to a couple of bytes of diff		07:03.33
sebras	yes, they are.		07:04.16
emfipp	(while I was certain it had been complete, I think the error on page 786 seems to suggest this is an incomplete file --- 31,499,598 bytes in toto)		07:04.25
sebras	the Length fields appear to change too.		07:04.28
ator	the Length is the compressed length, not the uncompressed length		07:04.49
sebras	emfipp: does acrobat reader show it without problems?		07:04.51
emfipp	sorry, but I do not run programs such as adobe reader on any machine I control		07:05.13
ator	good for you :) wish I had that luxury.		07:05.24
emfipp	what's the file size on your end though, sebras?		07:05.56
	(oddly firefox did report download complete)		07:06.14
sebras	emfipp: I've downloaded the file three times, (firefox, chromium and wget) each time they are around 99671323 bytes. but your size of 31,499,598 bytes seems way off.		07:06.46
emfipp	ok. I guess something's wrong with my firefox		07:07.03
	or acm simply doesn't want me to download the entire file		07:07.14
sebras	emfipp: it took a long time for me to download the entire file.		07:07.35
emfipp	same here too. I can only http-download files of that size at certain time of day		07:08.02
ator	emfipp: I think firefox may be confused, I don't think it sets a content-length so can't know if it's complete or cut short		07:08.10
emfipp	or connection teardown happens		07:08.12
	(and bloody hell acm gives out free books yet still no byte range requests)		07:08.36
ator	curl -c cookiejar -I -L 'https://dl.acm.org/doi/pdf/10.5555/578789?download=true' shows no Content-Length header on the final redirect with mimetype application/pdf		07:09.55
	and given the random nature of the file when we download it, I'll just assume it's generating the PDF on the server on the fly and not serving up a static file		07:10.31
	which would explain why no byte range requests work, etc.		07:10.44
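The point about the missing Content-Length header can be sketched as a small decision function. This is an illustration under assumptions: `check_download` is a hypothetical helper, `headers` is a plain mapping of response header names to values, and the response is assumed to have no Content-Encoding applied (otherwise the declared length would not match the decoded byte count).

```python
def check_download(headers, bytes_received):
    """Classify a finished download as 'complete', 'truncated' or 'unknown'.

    Without a Content-Length header the client has nothing to compare
    against, so a connection that closes early looks like a normal end.
    """
    # Header names are case-insensitive, so normalise them first.
    lower = {name.lower(): value for name, value in headers.items()}
    declared = lower.get("content-length")
    if declared is None:
        return "unknown"
    try:
        expected = int(declared)
    except ValueError:
        return "unknown"
    return "complete" if bytes_received == expected else "truncated"


# No Content-Length, as observed in the log: completeness cannot be judged.
print(check_download({"Content-Type": "application/pdf"}, 31499598))
# Hypothetical case where the server did declare a length.
print(check_download({"Content-Length": "99671413"}, 31499598))
```

The first call corresponds to the situation in the log, where the browser reported a finished download although only about a third of the file had arrived.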
emfipp	so acm is fingerprinting those pdf files?		07:18.56
*emfipp*	shudders		07:18.58
	rule no. 1: gs-reparse every pdf file downloaded from the internet		07:19.50
sebras	emfipp: there is a randomized ID-field in the PDF trailer. they might use that for fingerprinting the files. acm normally only offers these files for paid download so I'm not terribly surprised.		07:23.57
sarmols	Hi, I could use some help. I am using the MuPDF API and need a way to get rectangles around all the words on a page. A word is a string that is separated by either a space or a newline. Can anyone think of a way to do this? Because I just can't come up with anything, not even some really hacky way to implement this.		08:58.45
Robin_Watts_	sarmols: Have you looked at the stext functions?		09:13.40
	fz_new_stext_page_from_{page,page_number,display}		09:15.13
	any of those will get you an fz_stext_page.		09:15.22
	That's a structure you can walk, it contains the locations/bbox for every char on a page.		09:15.46
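Once per-character boxes are available, grouping them into word rectangles is a matter of merging the boxes between whitespace characters. The sketch below shows only that grouping step in Python and does not call MuPDF. The input model is an assumption: each line is a list of `(char, (x0, y0, x1, y1))` tuples, standing in for whatever is collected while walking the stext structure, and `word_rects` is a hypothetical name.

```python
def word_rects(lines):
    """Group per-character boxes into per-word boxes.

    `lines` is a list of lines; each line is a list of
    (char, (x0, y0, x1, y1)) tuples. A word ends at any whitespace
    character and at the end of every line.
    """
    words = []
    for line in lines:
        text, box = "", None
        for ch, (x0, y0, x1, y1) in line:
            if ch.isspace():
                # Whitespace closes the current word, if one is open.
                if text:
                    words.append((text, box))
                text, box = "", None
                continue
            text += ch
            if box is None:
                box = (x0, y0, x1, y1)
            else:
                # Grow the word box to the union with this character's box.
                box = (min(box[0], x0), min(box[1], y0),
                       max(box[2], x1), max(box[3], y1))
        # A line break also ends a word.
        if text:
            words.append((text, box))
    return words
```

Treating each line break as a word boundary matches the definition given in the question; hyphenated words split across lines would come out as two rectangles with this approach.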
sarmols	Robin_Watts_: Great, thank you! I'll check it out.		09:57.52
paulgardiner	Seeing something strange when signing documents. I'm returning a value for max_digest that assures we have well over the space allocated in the file. Usually that would mean I'd see a trailer of zeros, but it seems to be filling out with random garbage		16:09.20
	Ah I bet output streams have changed and our wrapper around objC that implements one no longer works		16:13.02

Log of #mupdf at irc.freenode.net.