MuPDF IRC logs

		20180710
inflex	Is the muPDF "outline" a small-page-view column down the side of the viewer, or is it something else entirely?	10:16.57
	If muPDF doesn't already support something like that, then I suppose I could look at generating bitmapped thumbnails of the pages on load.	10:26.30
kens	While MuPDF (the core code) can certainly access the /Outlines of the PDF file, I don't think any of the demo apps expose it (ie add a user interface to allow you to select a page based on the Outlines). I could be wrong though.	10:27.40
inflex	np, all good. I noticed the mupdf-gl does have the press 'o' for outlines, but maybe that's actually more an "outline/description" of the PDF.	10:36.44
kens	Ah I wasn't aware of that, I'm afraid I don't know what it does, sorry	10:37.23
tor8	inflex: the 'o' outline is the table of contents (called the 'outline' in PDF specification)	10:37.26
inflex	thanks tor8. Is there something I can expose from the inner workings to achieve the side-pane of thumbnails, or is it something I just need to generate myself?	11:33.16
tor8	inflex: you'll have to render the page thumbnails yourself. beware that it could be very slow on image-intensive pages.	11:37.40
	rendering the thumbnail is going to involve parsing and rendering the whole page, just at a small size	11:37.56
	there are no page thumbnail images stored in the PDF format	11:38.12
	so unless you're doing multiple threads and rendering in the background, I can't recommend it	11:38.45
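Editor's sketch of the "multiple threads and rendering in the background" idea tor8 mentions. It is only an outline under stated assumptions: `render_thumbnail` is a hypothetical stand-in for a real page renderer, and `start_thumbnail_jobs` and `on_ready` are names invented here. A real renderer would also have to follow its rendering library's own threading rules.

```python
from concurrent.futures import ThreadPoolExecutor


def render_thumbnail(page_number):
    # Hypothetical placeholder: a real version would parse and render the
    # whole page at a small size, which is the slow part tor8 warns about.
    return b"thumb-%d" % page_number


def start_thumbnail_jobs(page_count, on_ready, render=render_thumbnail, workers=2):
    """Queue one render job per page and return the pool immediately."""
    pool = ThreadPoolExecutor(max_workers=workers)
    for n in range(page_count):
        future = pool.submit(render, n)
        # The callback normally runs on a worker thread, so a GUI would have
        # to hand the result over to its own UI thread from there.
        future.add_done_callback(lambda f, n=n: on_ready(n, f.result()))
    return pool
```

The caller keeps the returned pool and calls `pool.shutdown(wait=True)` when the viewer closes, so the user interface never waits on a slow, image-heavy page.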
inflex	tor8, that's fine, I had suspected as much	12:03.30
paulgardiner	tor8: I've run into a problem with signature support that I'm struggling with. The signature field dictionary refers to a byte range, which specifies what parts of the document are hashed as part of signing. When verifying, we need to check that the byte range is reasonable: it should cover what was the whole document at the time of signing (although that might be a prefix of the document at...	12:41.10
	...time of verification because of subsequent incremental updates).	12:41.12
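Editor's sketch of the "reasonable byte range" check paulgardiner describes. It assumes the common layout of two (offset, length) pairs with a single hole for the digest, and it assumes the caller already knows `revision_end`, the file length at the time of signing. Finding that length is the hard part discussed in the rest of this log. The function name and messages are invented for illustration.

```python
def check_byte_range(byte_range, revision_end):
    """Return a list of problems; an empty list means the range looks sane."""
    if len(byte_range) != 4:
        return ["expected exactly two (offset, length) pairs"]
    off1, len1, off2, len2 = byte_range
    problems = []
    if min(byte_range) < 0:
        problems.append("negative offset or length")
    if off1 != 0:
        problems.append("first range does not start at byte 0")
    if off2 < off1 + len1:
        problems.append("ranges overlap or are out of order")
    if off2 + len2 != revision_end:
        problems.append("second range does not end where the signed revision ends")
    return problems
```

For example, `check_byte_range([0, 100, 200, 50], 250)` reports no problems, while the same range checked against a signed revision of 400 bytes is flagged because the hash stops short of the end.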
	I've tried changing the loading functions to look for startxref <number> %%EOF after reading the trailer. That works for anything mupdf produces, but I'm seeing some files where the trailer is stored near the beginning of the file. I'm a bit lost how to approach this now.	12:41.18
	Possibly it fails with some file mupdf produces, come to think of it. The structure I'm seeing may be due to linearization	12:54.09
tor8	paulgardiner: I don't think I can be of much help... you and robin have changed that code so much I no longer recognize it or know what it does...	12:58.39
paulgardiner	I could search for the startxref <number> %%EOF for which the number corresponds to the start of the xref, but that might require reading through the whole document.	12:59.08
	tor8: I'm not sure I'm asking about mupdf so much as about PDF.	12:59.33
tor8	paulgardiner: the trailer can be anywhere, often at the beginning with linearized files I expect.	13:00.56
	can't you just save the size of the file when we first open the PDF?	13:01.20
	we start by scanning for 'startxref' at the end to find the trailer	13:01.32
paulgardiner	That works only for the last xref section.	13:01.45
inflex	tor8, almost wondering, since most of the PDFs that are going to be viewed through the viewer will be used over and over again (schematic diagrams), it could be possible to create PNG thumbnails once-off	13:01.54
	tor8, and have them stored alongside the actual PDF as a metafile.	13:02.21
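Editor's sketch of the once-off sidecar idea inflex floats here. The file naming scheme and both function names are assumptions made for illustration; the only logic shown is deciding whether a stored thumbnail is still usable by comparing modification times.

```python
import os


def thumbnail_path(pdf_path, page_number):
    # Hypothetical naming scheme: keep the PNG next to the PDF it belongs to.
    return "%s.thumb-%04d.png" % (pdf_path, page_number)


def thumbnail_is_fresh(pdf_path, page_number):
    """True if a sidecar thumbnail exists and is at least as new as the PDF."""
    thumb = thumbnail_path(pdf_path, page_number)
    try:
        return os.path.getmtime(thumb) >= os.path.getmtime(pdf_path)
    except OSError:
        # Missing thumbnail (or missing PDF): treat as stale and re-render.
        return False
```

A viewer would render and save a thumbnail only when `thumbnail_is_fresh` returns False, so schematics that are opened over and over pay the rendering cost once.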
paulgardiner	tor8: I need to be able to work out the size of the file corresponding to each xref section.	13:02.54
tor8	paulgardiner: you can't reasonably verify anything other than the last xref section?	13:02.58
paulgardiner	tor8: I think we need to for the case of multiple signatures.	13:03.21
tor8	I mean, any additional xref sections or data appended to the end can very much change the document	13:03.22
paulgardiner	Possibly we also need to determine if an incremental update adds only signatures.	13:03.50
tor8	and saying "this signature checks out, because the subset of the file that was used when it was signed matches, but hey, I'm just kidding somebody replaced bits with newer objects at the end of the file"	13:04.28
	is not okay	13:04.42
	I mean, it's perfectly plausible to replace the content stream of a page with a newer generation object in an incremental update	13:05.05
paulgardiner	See above.	13:05.14
tor8	that (and any other edits) should invalidate all signatures.	13:05.17
paulgardiner	the case of the only change being to add another signature.	13:05.43
tor8	wouldn't multiple signatures put each other in the byte ranges that are excluded?	13:05.53
paulgardiner	It seems not. If using AR I sign a document that has multiple signature fields, it doesn't leave room for the digest of the other signatures.	13:07.10
tor8	right. so each signature checks a certain subset of the file.	13:07.47
paulgardiner	Yep.	13:07.54
tor8	and the ranges are shorter than a file if incrementally updated	13:08.12
paulgardiner	Yep.	13:08.20
tor8	so if we incrementally update the file, we can't tell if those edits in general invalidate a signature	13:08.54
	or well, if they should invalidate it	13:09.02
paulgardiner	AR will say something like "This signature is valid but for an early version of the document, and then asks if the user would like to view the earlier version". I assume it doesn't do that if the only change is to add another signature	13:10.02
	AR can, in some cases at least, list the changes that have been made since a signing.	13:10.37
tor8	paulgardiner: looking at the spec (pdfref17.pdf page 726) there's a note	13:11.16
	If a signed document is modified and saved by incremental update, bla bla bla, it is possible to recreate the state of the document as it existed at the time of signing.	13:11.50
	it doesn't say how, other than implying the ByteRange array	13:12.07
	so I would have to assume the last range will imply the end of the file	13:12.22
	at the time of signing	13:12.28
	and it's only possible to tell the end of the file for the latest iteration (the startxref entry)	13:13.16
paulgardiner	I don't see how that relates to byte ranges	13:13.20
tor8	incremental updates don't write the end of the file of the previous version, it just chains the xrefs	13:13.38
	"Note: If a signed document is modified and saved by incremental update (see	13:14.11
	Section 3.4.5, “Incremental Updates”), the data corresponding to the byte range of the	13:14.11
	original signature is preserved. Therefore, if the signature is valid, it is possible to	13:14.11
	recreate the state of the document as it existed at the time of signing."	13:14.11
paulgardiner	You can create the state of the document at the time of signing by just ignoring the subsequent xrefs, I think	13:14.13
	No need for the byte ranges.	13:14.29
	The byte ranges specify what is hashed.	13:14.43
tor8	there is not a guaranteed one-to-one mapping between xref sections and incremental updates	13:14.46
	it is entirely possible to have multiple xref sections with only one trailer	13:15.14
paulgardiner	In any case, I don't believe there is an intention to use the byte ranges in the recreation of older versions of the document.	13:16.19
tor8	and the previous trailers are 'lost' when you incrementally save	13:16.33
	"Note: If a signed document is modified and saved by incremental update (see	13:17.08
	Section 3.4.5, “Incremental Updates”), the data corresponding to the byte range of the	13:17.08
	original signature is preserved. Therefore, if the signature is valid, it is possible to	13:17.09
	recreate the state of the document as it existed at the time of signing."	13:17.09
paulgardiner	Really? I thought we always appended to the end of the file for incremental update.	13:17.13
tor8	we do, but there's nothing in the appended data that points to the previous end of file	13:17.35
	there's the "Prev" entry	13:18.37
	which points to the previous 'xref' section but not the actual EOF	13:18.55
	since the xref can be anywhere in the file	13:19.01
	and that sentence leads me to believe the ByteRange implies the length of the previous file	13:20.12
	(especially given how vague and implementation-driven "this is what acrobat does, do that and ignore what the spec actually says" the later additions to the PDF spec are)	13:20.45
paulgardiner	I have PDF32000_2008 here. Is that the wrong version? It says it's v 1.7	13:24.36
tor8	paulgardiner: it's the ISO version of the same text -	13:24.49
	worse typography, same content	13:24.59
paulgardiner	My incremental updates section seems to be 7.5.6	13:25.32
tor8	oh dear, this is worrying... our code assumes the 'trailer' always succeeds an 'xref' (old style) section	13:27.10
	but the spec says it precedes the 'startxref'	13:27.17
	I wonder if that might trip us up into doing a repair job on valid files	13:28.35
paulgardiner	"precede" as in just before?	13:28.43
tor8	yes.	13:28.48
	of course, it then throws out the baby and the bathwater and the whole bathroom when they introduce 'Cross Reference streams' where the 'xref' and 'trailer' keywords are just gone	13:29.14
paulgardiner	Well that seems not to be true.	13:29.17
	All the problems I'm having are with xref streams.	13:29.43
tor8	paulgardiner: the 'new style' ones?	13:30.04
paulgardiner	yep	13:30.12
tor8	yeah, they don't have an 'xref' or 'trailer' keyword anywhere	13:30.16
	the only reliable end-of-file marker is the "startxref\n[0-9]+\n%%EOF\n" string	13:30.54
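Editor's sketch of scanning for that marker to list where each revision of a file might end. The pattern is a loosened form of the string tor8 quotes (any whitespace between tokens, optional end-of-line after `%%EOF`), which is an assumption rather than anything taken from MuPDF. The hits are only candidates, because the same bytes could appear inside stream data, and the scan has to read the whole file.

```python
import re

# Loosened version of "startxref\n[0-9]+\n%%EOF\n" (editor's assumption).
EOF_MARKER = re.compile(rb"startxref\s+(\d+)\s+%%EOF(?:\r\n|\r|\n)?")


def candidate_revision_ends(data):
    """Return (xref_offset, end_offset) for every marker found in data."""
    return [(int(m.group(1)), m.end()) for m in EOF_MARKER.finditer(data)]
```

A byte range whose last range ends at one of the returned `end_offset` values is at least plausible as "the whole document at the time of signing"; one that ends anywhere else deserves suspicion.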
paulgardiner	It's not the lack of keywords that is troubling me, but the position within the document.	13:31.05
tor8	but I think if you look at the ByteRange that would probably be enough?	13:31.11
paulgardiner	tor8: the problem with that, is the whole point of what I'm trying to do at the moment is to validate the byte range.	13:31.43
tor8	and check that the byte range ends at an appropriate point?	13:32.24
paulgardiner	Yeah.	13:32.44
tor8	you could do what adobe does, and say "this matched an earlier version" and add a question to 'restore the old version?' which would copy the file up to the end of the byteranges	13:32.53
paulgardiner	AR seems to do that in some cases	13:32.54
	I think you are misreading that clause.	13:33.57
tor8	though maybe, just maybe, if our assumption about oldstyle is always the sequence 'xref <sections> trailer ... startxref ... %%EOF'	13:34.00
	but that would fail for new style where the trailer is at the head of the stream and could be anywhere in the file	13:34.52
paulgardiner	It just points out the fact that an incremental update doesn't alter the bytes of the previous version of the document and hence nothing in the byte range changes	13:35.02
tor8	paulgardiner: I'm reading more into it than it says, by the sentence "Therefore, if the signature is valid, it is possible to	13:35.37
	recreate the state of the document as it existed at the time of signing."	13:35.37
	the therefore is my key word	13:35.49
	but yes, it may be I'm fantasizing	13:36.06
	but I do wonder how you could recreate an old version other than parsing the whole file from the start and scanning for %%EOF	13:36.39
	since nothing in the trailer and xref chain points to the old version's eof	13:36.57
paulgardiner	I've been reading "state" as reader state necessary to show the old version, not file state.	13:37.52
	.. which I believe can be done by just ignoring all the xref sections since the signing.	13:38.33
tor8	you could pop xref sections until you get to some other version	13:38.40
	but how do you know which one that is?	13:38.44
paulgardiner	The signature field will be referred to by the one you should stop on?	13:39.07
tor8	referred to by what?	13:39.21
	the signature field could exist in all versions	13:39.34
	I know we discussed being able to time travel by popping xrefs	13:40.22
paulgardiner	But presumably only one xref section refers to the version that you are looking at	13:40.25
tor8	I don't understand.	13:41.04
	consider this case: a file is created with a signature field. this is the original version A.	13:41.18
	then it is edited to create version B.	13:41.26
	then it is signed and saved as version C.	13:41.31
	then it is edited and saved as version D.	13:41.36
paulgardiner	And what do you wish to achieve from that point?	13:42.10
tor8	that is my question for you.	13:42.34
	what does version C look like?	13:42.58
	signing the field, writes the digest somewhere. is that in a new incremental section?	13:43.19
paulgardiner	yes	13:43.29
tor8	and this new xref section, is it included in the byteranges (minus the actual bytes with the digest checksum)?	13:43.56
paulgardiner	I assume if I drop all xref sections since that one, I see the document as it was.	13:43.57
tor8	now, someone opens version D, and asks to verify the signature	13:44.23
	they should see "this matches a previous version", right?	13:44.36
paulgardiner	yes, and possibly "here's a list of changes since then"	13:44.59
	and an offer to show the version as signed	13:45.14
tor8	and you're saying how we find this is by looking at the signature field V object, and looking to see which xref subsection it is defined in?	13:45.45
paulgardiner	That's what I've been assuming. I'm not currently trying to do that, but I assumed we might if we need to.	13:46.18
tor8	if we can be sure that the signature will be saved in an incremental update, we should be able to find it by that way	13:46.48
	because we can't track (a) the EOF for any given xref section, or (b) what it was when it was actually signed	13:47.13
	finding the subsection where the form field object was last updated would probably do well enough	13:47.32
	and we can show the document from that point onwards	13:47.45
	but we'd have to take care to flush all cached pdf_obj's when we rewind the xref view	13:47.59
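Editor's sketch of the "rewind the xref view" idea, applied to the A/B/C/D case above. It is a toy model, not MuPDF's data structures: each xref section is a plain dict from object number to file offset, listed oldest first, and all names and numbers are hypothetical. It also assumes the signature value object is written in its own incremental section, which is exactly the condition tor8 attaches.

```python
def section_defining(sections, obj_num):
    """Index of the newest section that (re)defines obj_num, or None."""
    for index in range(len(sections) - 1, -1, -1):
        if obj_num in sections[index]:
            return index
    return None


def view_as_of(sections, index):
    """Merge sections[0..index]; later sections override earlier entries."""
    view = {}
    for section in sections[:index + 1]:
        view.update(section)
    return view


# Versions A, B, C, D; object 5 is the signature field and object 7 its
# value (V) object, first written when the document is signed as version C.
sections = [{1: 15, 2: 80, 5: 200}, {2: 400}, {5: 700, 7: 760}, {2: 1000}]
signed = section_defining(sections, 7)
as_signed = view_as_of(sections, signed)
```

Here `signed` is 2 (version C) and `as_signed` still sees the version B copy of object 2, not the later edit from D. Any objects cached from the newer view would have to be dropped before using the rewound one.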
paulgardiner	Yeah, I was assuming that would work, but that wasn't what I was working on.	13:48.02
	I was just trying to find a way to validate the byte range values.	13:48.16
tor8	right, so consider if we implement the above	13:49.00
	we can rewind the view to C	13:49.03
paulgardiner	To check that the signing software didn't maliciously use a small byte range that meant almost none of the file is included in the hash	13:49.04
tor8	but when checking it, we still don't know the actual EOF	13:49.34
	so we can't find such cases	13:50.07
	do we write a new byterange when signing? I thought that was part of the original structure created by the PDF authoring software.	13:50.44
paulgardiner	So you are saying it cannot be done. AR is doing it for the case of the signing being the last thing done, but perhaps that is the only case it does it for.	13:51.06
tor8	It cannot be done trivially at least :)	13:51.21
paulgardiner	byterange is part of the signature	13:51.25
	My original question was "hey tor look at this. I can't find a trivial way to do it" :-)	13:51.59
	The byte range is usually the whole document with a hole for the digest	13:52.32
tor8	Then the TL;DR version of my answer is: "Neither can I"	13:52.36
paulgardiner	Damn! :-)	13:52.52
	for old style xrefs, I think looking for startxref after the trailer works, but for new style xrefs...	13:53.44
tor8	paulgardiner: yeah. for new style xrefs ... no can do.	13:54.01
paulgardiner	Possibly AR doesn't check byte ranges other than for the case of the signature being the last change	13:54.38

Log of #mupdf at irc.freenode.net.