| <<<Back 1 day (to 2020/08/30) | Fwd 1 day (to 2020/09/01) >>> | 20200831 |
myopia | reading the pdf specs, how the pdf interpreter resolves indirect object references has me perplexed. say, assuming the interpreter reads from beginning to end, and we are now at "1 0 obj", which references "2 0 R" in its Length entry. however, "1 0 obj" contains a purely binary data stream that includes multiple instances of "endobj" and "endstream" before the actual PDF endobj/endstream. to make matters worse, it also features | 09:00.16 |
| "2 0 obj" after some of the endobj/endstream. then, how is the pdf interpreter supposed to know where the real "2 0 obj" begins? | 09:00.16 |
chrisl | myopia: That's what the xref table is for | 09:02.20 |
myopia | erh... what if the octet stream also included "xref", "trailer" and "startxref"... | 09:03.12 |
| for "1 0 obj", that is | 09:03.21 |
| then shouldn't the spec define a special offset that gives the *real* xref? | 09:03.52 |
chrisl | It does | 09:04.01 |
myopia | can you give the section number in the form of x.y.z that deals with this? | 09:06.36 |
chrisl | I only have the PDF 1.7 ref manual currently to hand, but in there, you should read 3.4.4 ("File Trailer") | 09:08.38 |
myopia | my thanks | 09:09.19 |
| gotta say I prefer random access dictionaries in the fore with padding for room of growth | 09:12.25 |
ator | myopia: also the /Length entry in the object states how many bytes are between the stream and endstream keywords | 09:24.11 |
| myopia: to answer your #mupdf question from yesterday, "mutool convert" should be able to convert JP2 images to PDF. | 09:24.38 |
myopia | except that Length could refer to a "next" object which again goes to the "3.4.4 File Trailer" | 09:26.10 |
ator | myopia: correct. you need the xref in order to read a PDF file correctly in all cases. | 09:26.50 |
| most times you can recover a PDF file that has a missing xref table, but as you say, if the stream has "endstream" in the content then you're screwed. | 09:27.38 |
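The difference between trusting /Length and scanning for the "endstream" keyword can be sketched in a few lines of Python. This is only an illustration on a hand-made object: the helper names and the sample bytes are made up here, and /Length is a direct integer so no xref lookup is needed (an indirect "2 0 R" length would first have to be resolved through the xref table, as discussed above).

```python
# Hypothetical sample: a stream whose payload happens to contain
# "endstream", "endobj" and "2 0 obj" as ordinary data bytes.
PAYLOAD = b"junk endstream endobj 2 0 obj more junk"
OBJ = (
    b"1 0 obj\n<< /Length " + str(len(PAYLOAD)).encode() + b" >>\nstream\n"
    + PAYLOAD
    + b"\nendstream\nendobj\n"
)
# Offset of the first payload byte, right after the "stream" keyword's EOL.
START = OBJ.index(b"stream\n") + len(b"stream\n")


def read_by_length(data: bytes, start: int, length: int) -> bytes:
    """Take exactly `length` bytes, ignoring whatever the payload contains."""
    return data[start:start + length]


def read_by_scanning(data: bytes, start: int) -> bytes:
    """Naive recovery: stop at the first 'endstream' found after `start`."""
    end = data.find(b"endstream", start)
    return data[start:end]
```

With this sample, the length-based reader returns the whole payload, while the scanning reader stops at the fake "endstream" inside the data and returns only the first five bytes.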
myopia | ator: and thanks for the file conversion part, but I later tried to write my own script for the express purpose of generating a *minimum* pdf file | 09:27.56 |
ator | but most PDF streams are compressed, so the likelihood of that is extremely small | 09:27.57 |
| mutool convert will write pretty much a minimal PDF file | 09:28.58 |
| the only thing it could write smaller is to inline the resources dictionary | 09:29.23 |
myopia | continuing on xrefs, a binary file format could also solve the problem of xrefs. say, we have a file that is an array of records, which is <record type><record length><further data>. let's say the first record has to be a master record dictionary which keeps a list of offsets of later records. this special record has the form <record type = 0><record length><record next offset><further data>. the <record next offset> allows the file to append at | 09:39.05 |
| the end of the file for the purpose of incremental updates. and constructing the full master dictionary can entail reading a linked list of records | 09:39.05 |
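A rough Python sketch of the record layout myopia describes might look like the following. Every concrete choice here is an assumption made for illustration and not part of any real format: a 1-byte record type, a 4-byte big-endian payload length, 8-byte offsets, a next offset of 0 meaning "no further master record", and all the function names.

```python
import struct

HEADER = struct.Struct(">BI")  # assumed: record type (1 byte), payload length (4 bytes)
OFFSET = struct.Struct(">Q")   # assumed: 8-byte big-endian file offset


def pack_record(rtype: int, payload: bytes) -> bytes:
    return HEADER.pack(rtype, len(payload)) + payload


def pack_master(next_offset: int, offsets: list) -> bytes:
    """Master record (type 0): next master offset, then the record offsets."""
    body = OFFSET.pack(next_offset) + b"".join(OFFSET.pack(o) for o in offsets)
    return pack_record(0, body)


def read_master_chain(data, offset: int = 0) -> list:
    """Follow the linked list of master records and collect every offset."""
    found, seen = [], set()
    while True:
        if offset in seen:
            raise ValueError("cycle in master record chain")
        seen.add(offset)
        rtype, length = HEADER.unpack_from(data, offset)
        if rtype != 0:
            raise ValueError("not a master record")
        body = offset + HEADER.size
        (next_offset,) = OFFSET.unpack_from(data, body)
        for pos in range(body + OFFSET.size, body + length, OFFSET.size):
            found.append(OFFSET.unpack_from(data, pos)[0])
        if next_offset == 0:
            return found
        offset = next_offset


def build_demo() -> bytearray:
    """Write one record, then append a second one as an incremental update."""
    first_record_at = len(pack_master(0, [0]))           # size of the first master
    buf = bytearray(pack_master(0, [first_record_at]))
    buf += pack_record(1, b"hello")
    second_record_at = len(buf)
    buf += pack_record(1, b"world")
    second_master_at = len(buf)
    buf += pack_master(0, [second_record_at])
    # Only the fixed-width "next" field of the first master is patched in place.
    OFFSET.pack_into(buf, HEADER.size, second_master_at)
    return buf
```

Note that the update still overwrites one fixed-width field near the start of the file, which a design that reads from the end avoids entirely.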
kens | Feel free to propose that to Adobe as a replacement for PDF. | 09:39.31 |
| In the meantime we have to live with what the specification says | 09:39.44 |
ator | myopia: the xrefs are a binary file format, with fixed length records. it also happens to be mostly-human-readable. | 09:48.06 |
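To illustrate the fixed-length records ator mentions: in a classic (non-stream) cross-reference table each entry is 20 bytes, made of a 10-digit number, a space, a 5-digit generation number, a space, "n" or "f", and a 2-byte end-of-line. A minimal parsing sketch, with function names invented here and no handling of subsection headers or xref streams:

```python
ENTRY_SIZE = 20  # bytes per entry in a classic xref table


def parse_xref_entry(entry: bytes):
    """Return (number, generation, in_use) for one 20-byte entry.

    For in-use ("n") entries the number is the object's byte offset;
    for free ("f") entries it is the next free object number instead.
    """
    if len(entry) != ENTRY_SIZE:
        raise ValueError("xref entries are exactly 20 bytes")
    number = int(entry[0:10])
    generation = int(entry[11:16])
    kind = entry[17:18]
    if kind not in (b"n", b"f"):
        raise ValueError("entry type must be 'n' or 'f'")
    return number, generation, kind == b"n"


def entry_at(entries: bytes, index: int):
    """Random access: entry `index` lives at a computable position."""
    start = index * ENTRY_SIZE
    return parse_xref_entry(entries[start:start + ENTRY_SIZE])


# Hypothetical two-entry subsection body (without the "xref" / "0 2" lines).
SAMPLE = b"0000000000 65535 f \n0000000017 00000 n \n"
```

Because every entry has the same size, looking up object N within a subsection is a multiplication and a slice, with no scanning of the text.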
myopia | (and the text-binary pdf format could assign/mandate an entry "TrailerOffset" or "XRefOffset" alongside "Version" and "Producer" without breaking memory caching order by reading from the end) | 09:51.16 |
ator | startxref at the end of the file has the offset of the most recent xref section (the first one you read); the trailer at the end of that xref may have a /Prev which links to older xref sections. that's what's used to link together incremental updates. | 09:52.37 |
| you need to start reading from the end, because when incrementally updating you don't want to change what's already there, you just append. | 09:53.34 |
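The read-from-the-end procedure described here can be sketched as follows. This is a simplified illustration, not how any particular PDF library does it: it assumes classic xref tables (no xref streams), finds /Prev with a regular expression instead of parsing the trailer dictionary properly, and all function names plus the skeleton file built by `build_demo` are invented for the example.

```python
import re


def find_startxref(data: bytes) -> int:
    """Read the offset that follows the last 'startxref' near the end of the file."""
    tail = data[-1024:]
    pos = tail.rfind(b"startxref")
    if pos < 0:
        raise ValueError("no startxref found")
    match = re.match(rb"startxref\s+(\d+)", tail[pos:])
    if match is None:
        raise ValueError("malformed startxref")
    return int(match.group(1))


def xref_chain(data: bytes) -> list:
    """Offsets of all xref sections, newest first, following /Prev links."""
    offsets = []
    offset = find_startxref(data)
    while offset is not None and offset not in offsets:
        offsets.append(offset)
        trailer = data.find(b"trailer", offset)
        end = data.find(b"startxref", trailer)
        if trailer < 0 or end < 0:
            break
        prev = re.search(rb"/Prev\s+(\d+)", data[trailer:end])
        offset = int(prev.group(1)) if prev else None
    return offsets


def build_demo():
    """A skeleton file: one original section plus one appended update."""
    data = b"%PDF-1.7\n1 0 obj\n<< >>\nendobj\n"
    xref1 = len(data)
    data += (b"xref\n0 1\n0000000000 65535 f \ntrailer\n<< /Size 2 >>\n"
             b"startxref\n" + str(xref1).encode() + b"\n%%EOF\n")
    data += b"2 0 obj\n<< >>\nendobj\n"   # appended; nothing above is modified
    xref2 = len(data)
    data += (b"xref\n2 1\n0000000000 00000 n \ntrailer\n<< /Size 3 /Prev "
             + str(xref1).encode() + b" >>\nstartxref\n"
             + str(xref2).encode() + b"\n%%EOF\n")
    return data, xref1, xref2
```

On the skeleton file, `xref_chain` returns the appended section's offset first and the original section's offset second, which is the order a reader would apply them in.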
myopia | (you could also pre-pad the number of characters with whitespace, say "/TrailerOffset 200 SP SP SP... CR LF", but once we are on the boat of reading from the end, this becomes a pattern of convenience, I guess, and this is probably what gives PDF its random access pattern) | 09:57.43 |