| | 2012/10/19 |
paulgardiner | Robin_Watts: ping | 11:34.11 |
Robin_Watts | pong | 11:34.15 |
paulgardiner | When a field changes in a way that will affect its appearance, I need to repaint the relevant areas of pages where they appear. Fields aren't associated with a page, but they have widget annotations as children that are specific to a page. Given an annotation, I need to find what page it's on efficiently. | 11:37.38 |
| I don't want to search through the page's tree of annotations. | 11:38.06 |
| Annotations can have a reference to the page, but it's optional. I'm thinking I should add a reference if it doesn't already exist when we process the annotations | 11:38.48 |
| That shouldn't break anything, should it? | 11:39.39 |
Robin_Watts | Let me restate that to see if I've followed it... | 11:40.06 |
paulgardiner | That would be appreciated. Maybe then I'll understand it. :-) | 11:41.53 |
Robin_Watts | A given field can appear on one or more pages by having a widget annotation on that page. The only way a field can appear on a page is by having such a widget annotation | 11:41.53 |
| Every annotation can only appear on a single page. | 11:42.28 |
paulgardiner | Yes, I believe so. | 11:42.54 |
Robin_Watts | Annotations can optionally have a reference to the page that they are on, and you propose to make this compulsory by making such a reference if there isn't one already existing. | 11:43.10 |
| Assuming that every annotation can only appear on a single page, that sounds fine. | 11:43.30 |
paulgardiner | Yes, although now I'm concerned that it might still be difficult to derive the page number | 11:43.43 |
Robin_Watts | sainsburys delivery. back in a mo, sorry. | 11:43.51 |
paulgardiner | np | 11:43.57 |
kens | I'm not certain that an annotation can appear on only one page | 11:55.53 |
| because each page has an Annots entry | 11:56.24 |
paulgardiner | Hmmm, I'd better check. Not sure what made me think it so. | 11:57.56 |
Robin_Watts | back. | 11:58.41 |
| yeah, that's vaguely my worry. | 11:58.46 |
| If every page has a list of the annotations that are on it of the form: /Annots [24 0 R 25 0 R 26 0 R] etc... | 11:59.15 |
paulgardiner | Annotations that have a page-object reference presumably appear on one page. | 11:59.15 |
Robin_Watts | it would be possible for several pages to refer to the same annotation (e.g. 24 0 R) | 11:59.40 |
paulgardiner | Somewhere I picked up the idea that fields could appear on multiple pages, and they did so by having multiple annotations, each appearing on a single page | 12:00.02 |
Robin_Watts | Page 148 of the 1.7 spec shows /Annots [23 0 R 24 0 R] as an example | 12:00.46 |
paulgardiner | Ah. Bottom of Page 605 | 12:01.35 |
| A given annotation dictionary may be referenced from the Annots array of only one page. Attempting to share an annotation dictionary among multiple pages produces unpredictable behavior | 12:02.06 |
Robin_Watts | OK, perfect. | 12:02.13 |
| That sounds like exactly the restriction you need to be able to pull your trick. | 12:02.33 |
paulgardiner | But there's another problem. The current plan lets me get the page object, but I need the page number | 12:02.50 |
Robin_Watts | Presumably, it's the P entry you are looking at ? | 12:02.54 |
paulgardiner | Yep | 12:03.04 |
Robin_Watts | You can go from page object -> page number. | 12:03.22 |
paulgardiner | That was my hope, but I haven't found it yet | 12:03.47 |
Robin_Watts | by doing a linear search of xref->page_refs and doing pdf_to_num on each. | 12:03.49 |
paulgardiner | I was hoping to avoid searches | 12:04.52 |
Robin_Watts | You *could* add a PageNum entry to that dict? | 12:05.06 |
| but it's not nice. | 12:05.21 |
paulgardiner | It's looking like the only reasonable way | 12:05.36 |
| I already add a "Dirty" entry temporarily to fields. | 12:06.01 |
Robin_Watts | paulgardiner: Or you could build a mapping from object number -> page. | 12:06.10 |
| and hold it in the xref. | 12:06.28 |
tor8 | pdf_lookup_page_number(xref, obj) | 12:06.36 |
paulgardiner | tor8: Does that already exist? | 12:06.52 |
Robin_Watts | tor8: Ah, right, that's the encapsulation of the linear search ? | 12:06.58 |
tor8 | it exists. and it's a wrap of a linear search. any improvements should go in there :) | 12:07.17 |
| like building a sorted array of obj-num to page-num mappings that can be binary searched | 12:07.39 |
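A minimal sketch of the sorted-array idea suggested above, assuming the page object numbers are already available as a plain array in page order. It is not MuPDF's code: `page_map_entry`, `page_map_build` and `page_map_lookup` are hypothetical names, and only the standard `qsort` and `bsearch` are used.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical entry: a page's object number and its zero-based page index. */
typedef struct {
    int obj_num;
    int page_num;
} page_map_entry;

/* Order entries by object number for qsort() and bsearch(). */
static int cmp_entry(const void *a, const void *b)
{
    const page_map_entry *ea = a;
    const page_map_entry *eb = b;
    return (ea->obj_num > eb->obj_num) - (ea->obj_num < eb->obj_num);
}

/* Build the sorted map from page object numbers listed in page order.
 * Returns NULL if n <= 0 or allocation fails; the caller frees the result. */
page_map_entry *page_map_build(const int *page_obj_nums, int n)
{
    page_map_entry *map;
    int i;

    assert(page_obj_nums != NULL);
    if (n <= 0)
        return NULL;
    map = malloc((size_t)n * sizeof *map);
    if (map == NULL)
        return NULL;
    for (i = 0; i < n; i++) {
        map[i].obj_num = page_obj_nums[i];
        map[i].page_num = i;
    }
    qsort(map, (size_t)n, sizeof *map, cmp_entry);
    return map;
}

/* Binary search for obj_num; returns the page index, or -1 if not a page. */
int page_map_lookup(const page_map_entry *map, int n, int obj_num)
{
    page_map_entry key;
    const page_map_entry *hit;

    if (map == NULL || n <= 0)
        return -1;
    key.obj_num = obj_num;
    key.page_num = 0;
    hit = bsearch(&key, map, (size_t)n, sizeof *map, cmp_entry);
    return hit ? hit->page_num : -1;
}
```

Building the map costs one sort per document; each lookup is then logarithmic in the page count instead of linear.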
paulgardiner | Ok. I think the fact that that already exists implies I should use it rather than the trick of adding an entry. This isn't horrendously time critical anyway. | 12:08.23 |
| Must the obj be a page reference? Or would that work with an annotation? I'm guessing the former. | 12:08.53 |
tor8 | it must be a page reference indirect object | 12:09.19 |
| it compares against the references used in the page tree | 12:09.33 |
paulgardiner | So I still need to add "P" entries to the annotations that don't already have them. | 12:10.05 |
| ? | 12:10.12 |
Robin_Watts | Yes, but that's much less nasty, as you're simply filling out an optional bit of the spec. | 12:10.47 |
paulgardiner | Unless I give in completely, and add pdf_lookup_annotations_page? | 12:11.03 |
| Hmm, that would be slow though because it would have to look through the annotation lists of every page. | 12:11.45 |
| ... but could be sped up with the right data structure | 12:12.28 |
| Robin_Watts: yes, it isn't really nasty at all, I guess. | 12:13.49 |
| Thanks. I'll battle on. | 12:14.05 |
| Presumably 10,000 page documents don't tend to be forms anyway. | 12:15.27 |
Robin_Watts | could imagine an O'Reilly book with a form in the back for "tell me about updates to this book" etc? | 12:18.16 |
paulgardiner | I'll ignore that. :-) | 12:23.51 |
Robin_Watts | I think that should work with your planned "fill in the P option" plan. | 12:27.19 |
| http://www.printercomparison.com/default.asp?newsID=1509 | 13:30.56 |
| I'm all for supporting a range of products, but *42* new laser printers? | 13:31.15 |
chrisl | kens: I have to go out, and I almost certainly won't be back for 4 o'clock....... | 13:39.26 |
kens | OK chrisl no problem | 13:39.40 |
chrisl | If I do get back at a vaguely sensible time, I'll give you a call | 13:40.05 |
paulgardiner | Robin_Watts: there's a few commits on paulg/master if you have a moment. | 13:40.22 |
Robin_Watts | ok. | 13:40.56 |
paulgardiner | Robin_Watts, tor8: I still haven't sorted out this ensuring page references are present in annotations. | 13:53.44 |
| What looks like the natural place to do it has the page object in the form of a dict. The P entry is supposed to be an indirect reference. Is there a lookup function for that?... I should look really, I guess. | 13:55.18 |
Robin_Watts | xref->page_objs and xref->page_refs are kept for exactly this reason. | 13:56.51 |
| Where are you in the code? | 13:56.59 |
paulgardiner | Ah right | 13:57.16 |
| Line 410 of pdf_page.c | 13:58.02 |
Robin_Watts | In pdf_load_page? | 13:58.27 |
paulgardiner | I have the pageref | 13:58.30 |
Robin_Watts | (My code has changes in that file) | 13:58.37 |
paulgardiner | It's ok. The ref is already in a variable | 13:58.49 |
Robin_Watts | Right, so the pageref is what you want. | 13:58.51 |
paulgardiner | Yep, thanks. | 13:59.26 |
tor8 | paulgardiner: pdf_lookup_page_number takes the indirect reference | 14:01.39 |
paulgardiner | tor8: ah right. So another reason I should make sure it's the ref I put in the "P" entry. | 14:05.28 |
kens | Robin_Watts : ping ? | 14:18.06 |
Robin_Watts | pong | 14:18.12 |
kens | Can you help me understand a log entry ? | 14:18.22 |
Robin_Watts | I can try. | 14:18.28 |
kens | My regression test has: | 14:18.46 |
| The following 2 regression file(s) have started producing errors: | 14:18.46 |
| tests_private/comparefiles/446-01.ps.pdf.pkmraw.300.0 gs pdfwrite inches miles Error_reading_Ghostscript_produced_PDF/PS_file | 14:18.46 |
| So I look in the logs for miles ? | 14:19.02 |
Robin_Watts | I think so. | 14:19.12 |
kens | OK well if I do that I don't see an error, so I'm puzzled.... | 14:19.24 |
| ===tests_private__comparefiles__446-01.ps.pdf.pkmraw.300.0=== | 14:19.34 |
| gs pdfwrite | 14:19.34 |
| ./gs/bin/gs -sOutputFile=./temp/tests_private__comparefiles__446-01.ps.pdf.pkmraw.300.0.pdf -sDEVICE=pdfwrite -r300 -sDEFAULTPAPERSIZE=letter -dNOPAUSE -dBATCH -dClusterJob -dJOBSERVER - < ./tests_private/comparefiles/446-01.ps | 14:19.34 |
| GPL Ghostscript 9.07 (2012-07-31) | 14:19.34 |
| Copyright (C) 2012 Artifex Software, Inc. All rights reserved. | 14:19.34 |
| Ooops | 14:20.04 |
| Don't know how much of that you saw before I got kicked off | 14:20.15 |
Robin_Watts | kens: OK. I have the same log. | 14:20.28 |
kens | But I see the report here: | 14:20.35 |
| http://ghostscript.com/cgi-bin/clustermonitor.cgi?log=log&machine=miles&report=ken | 14:20.35 |
Robin_Watts | Note that there are 2 entries for each pdfwrite. | 14:20.47 |
| First you have the pdfwrite step, which as you say completes with no error. | 14:20.57 |
kens | So there are, the first is the pdfwrite conversion I guess | 14:21.10 |
Robin_Watts | Then you have another step which is it reading the pdfwritten thing, and writing the pkmraw output. | 14:21.20 |
kens | Ah, and the second does indeed fail | 14:21.20 |
Robin_Watts | And that... yeah. | 14:21.24 |
kens | OK well I don't see that locally so I guess I'd better try it with the command line, thanks | 14:21.40 |
Robin_Watts | no worries. | 14:21.46 |
kens | Hmm, well that does reproduce it, something to do with the environment then | 14:25.22 |
Robin_Watts | tor8, kens, sebras, anyone else... | 14:54.41 |
| When we'd looked at hints tables before, we'd decided that nothing uses them, right? | 14:54.59 |
sebras | Robin_Watts: yeah, that's what kens said about acrobat at least. | 14:55.13 |
Robin_Watts | How then can you know what order pages go in? | 14:55.17 |
| It's easy to know the first page to use for a file. | 14:55.26 |
| Do we then assume that you don't display any more pages until the whole lot has arrived? | 14:55.39 |
| (or at least you only display blank pages, but the right number of blanks) | 14:55.55 |
| (possibly of the wrong size) | 14:56.06 |
kens | Robin_Watts : I don't really think anything uses the hints at all, so you can only use page 1 | 14:56.14 |
sebras | well, what kens said was that whether the hints stream was bogus or not didn't affect the "optimized" state according to acrobat. | 14:56.18 |
Robin_Watts | Right, so if we DID use the hints table we could display subsequent pages as we go. | 14:56.48 |
kens | One of the Acrobat implementation notes says that, although technically the 'first' page need not be ordinal page 1, that's the only way Acrobat writes it. | 14:56.53 |
| Robin_Watts : Technically I believe we could but I would want to reread the spec (again) before committing myself. | 14:57.13 |
| Also, I wouldn't count on being able to do it from the Acrobat output, which is even worse than GS's | 14:57.30 |
sebras | Robin_Watts: as I understand it that must be the point of the hints table, no..? | 14:57.40 |
Robin_Watts | sebras: Indeed. | 14:57.55 |
kens | I guess my point is that you can't rely on this stuff being correct. Also, there's another implementation note about compressed objects which basically says 'can't use hint streams then' | 14:58.21 |
| And since most PDF files from Adobe apps use compressed objects.... | 14:58.37 |
sebras | this is just the case of badly implemented generators, not a bad spec, right? at least that is my understanding. | 14:58.49 |
Robin_Watts | Actually, even with hint streams how do I find what number object is the page object? | 14:59.21 |
sebras | kens: objects in object streams, or just objects with compressed stream contents? | 14:59.34 |
kens | The spec is OK (if daft, opaque, hard to understand and harder to implement), the 'problem' is that Acrobat Distiller (and by implication other Adobe products) don't follow the spec well. | 14:59.45 |
| Robin_Watts : the headers tell you that stuff IIRC | 15:00.06 |
Robin_Watts | The top level dict tells me the object number for page 1. | 15:00.23 |
kens | sebras object streams | 15:00.24 |
| Robin_Watts : hang on I'll go and get the spec out again.... | 15:00.35 |
sebras | kens: are object streams really that common? and also, will the adobe apps really generate object streams when they are asked to generate linearized pdfs!? if that's an implementation limitation that seems really strange. | 15:01.53 |
Robin_Watts | I can find "the first object number" for any given page by looking at item 1 in the page offset hint table and accumulating. | 15:02.04 |
kens | Yes, that's it Robin_Watts | 15:02.23 |
| I believe the first object for the page is the page object | 15:02.35 |
Robin_Watts | But I can't find which entry within the block of objects is the page number. | 15:02.43 |
kens | There is no page number | 15:03.00 |
Robin_Watts | Where in the spec does it say the first element has to be the page object? (It may do, I can't find it) | 15:03.01 |
| is the page object, sorry | 15:03.10 |
kens | I believe item 1 for the page is the page object | 15:03.22 |
kens | is still reading | 15:03.57 |
Robin_Watts | Ah, got it. You're right. | 15:04.19 |
| So if I use the hint tables, I can get pages out early. | 15:04.37 |
kens | Assuming they are correct. | 15:04.59 |
Robin_Watts | indeed. | 15:05.06 |
kens | You can test our linearisation :-) | 15:05.08 |
| I bet GS isn't right | 15:05.17 |
henrys | if colorado is a swing state 4 years from now I'm leaving the state for the election period. | 15:07.00 |
Robin_Watts | can see the headlines now: "colorado swings" | 15:07.57 |
| henrys: We're sick of the election coverage over here, and it's not even our election. I can only imagine how you feel. | 15:08.53 |
| Surely it's a simple choice though "another 4 years of underachievement" vs "ABSOLUTELY BATSHIT CRAZY!" | 15:09.18 |
henrys | to sic the entire U.S. campaign machinery on a few relatively small states is too much. | 15:11.41 |
| replacing the electoral college with a popular vote would spread the hell around evenly | 15:16.33 |
Robin_Watts | But that'll never happen. | 15:17.18 |
kens | OK, one more try.... | 15:17.46 |
| hopefully no errors this time | 15:17.57 |
Robin_Watts | loses an hour looking for a bug that turns out to be = instead of ==. | 16:05.45 |
| kens: Where did you see the implementation note that said "compressed objects -> no hint streams" ? | 16:17.21 |
kens | PDF reference manual, give me a minute | 16:17.42 |
| p1025 | 16:21.43 |
Robin_Watts | I have something on page 1040 | 16:22.03 |
kens | For files containing object streams, hint data can specify the location and size of the object streams only (or uncompressed objects), not the individual compressed objects | 16:22.03 |
| Robin_Watts : p1040 that's the one | 16:22.51 |
Robin_Watts | ok. | 16:23.38 |
| So that doesn't say "no hint streams". It just says "hint streams even more broken". | 16:24.00 |
| Presumably you have to assume that entries for pages won't be shared within a single compressed stream. | 16:24.49 |
kens | essentially useless | 16:24.53 |
Robin_Watts | so the first compressed object in a stream is the page object. | 16:25.06 |
| Oh, but you can't know what objects are in a compressed stream, without the xref. | 16:25.19 |
| gah. Useless :( | 16:25.25 |
kens | what I said was 'can't use hint streams' | 16:25.26 |
| Time to be off, night all | 16:42.20 |
Robin_Watts | tor8: Well, I have a first version of progressive file loading working. | 17:19.58 |
| I've added a -b option to mupdf that lets you give it a bps figure, and it then simulates the chosen file arriving at that many bps. | 17:20.44 |
| So if I: mupdf -b 409600 pdf_reference17.pdf I first off get an error box telling me "not enough data to open the file". Then I click "OK" and it tries again, and I get another error box telling me "not enough data to count pages". Then I click OK again a few times, and that goes away and I get page 1 (possibly with no fonts). | 17:22.14 |
| If I navigate to pages I haven't got yet I get blank pages. | 17:22.38 |
| Then when the whole file arrives, it loads properly and I can navigate between properly sized/filled in pages. | 17:23.19 |
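One way the simulated arrival could be modelled, as a sketch only: it assumes "bps" means bits per second and that the reader is allowed to see `bps * elapsed / 8` bytes, capped at the file size. The function name and signature are hypothetical and do not come from the patch.

```c
#include <assert.h>

/* Number of bytes of the file treated as "arrived" after elapsed_sec
 * seconds at a simulated rate of bps bits per second. */
long long simulated_bytes_available(long long bps, double elapsed_sec,
                                    long long file_size)
{
    double bytes;

    assert(bps >= 0 && file_size >= 0);
    if (elapsed_sec <= 0.0)
        return 0;
    bytes = (double)bps * elapsed_sec / 8.0;   /* bits to bytes */
    if (bytes >= (double)file_size)
        return file_size;                      /* whole file has arrived */
    return (long long)bytes;
}
```

A read past the returned offset is the point where the "not enough data" condition would be raised.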
sebras | Robin_Watts: what prompts the file reader to progress? | 18:08.29 |
Robin_Watts | currently when you change page and reload. | 18:11.24 |
sebras | Robin_Watts: ok, so it's not the dialog-box-clicking in the scenario outlined above? | 18:16.10 |
Robin_Watts | sebras: Basically, when we try to read beyond the end of the data we have currently got, it throws a TRYLATER exception. | 18:17.01 |
| I've modified the app so that when it's initially trying to open, it just puts a dialogue up and retries. | 18:17.41 |
| Once we have got far enough that the page object for the first page is loaded, we can then at least start to display something, and no more dialogues. | 18:18.22 |
| I'm working to the idea that if the app asks us to do something, either we should say "not yet", or we should say "here is the best I can do", or "Done". | 18:19.49 |
| "not yet" is achievable by the TRYLATER exception. | 18:20.08 |
| "here is the best I can do" can be done by the cookie coming back with an 'incomplete, try later' flag set. | 18:20.30 |
| and Done is normal exit. | 18:20.38 |
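The three outcomes above can be sketched as follows. This is an illustration under stated assumptions, not the patch itself: the real code signals "not yet" with a TRYLATER exception, whereas this sketch uses a return code, and `load_status`, `render_cookie` and `render_page` are hypothetical names. The byte offsets stand in for "how much data the page object needs" and "how much the whole page needs".

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    LOAD_DONE,        /* normal exit: everything needed was present */
    LOAD_INCOMPLETE,  /* "here is the best I can do" */
    LOAD_TRY_LATER    /* "not yet": not even the page object has arrived */
} load_status;

/* Hypothetical stand-in for the cookie passed in to rendering. */
typedef struct {
    int incomplete;   /* set when the result should be retried later */
} render_cookie;

/* Decide the outcome from how many bytes have arrived. The cookie may be
 * NULL, since it is only supplied for rendering operations. */
load_status render_page(long bytes_available, long page_obj_end,
                        long resources_end, render_cookie *cookie)
{
    assert(page_obj_end <= resources_end);
    if (bytes_available < page_obj_end)
        return LOAD_TRY_LATER;
    if (bytes_available < resources_end) {
        if (cookie != NULL)
            cookie->incomplete = 1;
        return LOAD_INCOMPLETE;
    }
    return LOAD_DONE;
}
```

A caller would retry on `LOAD_TRY_LATER`, display the partial result and schedule a re-render on `LOAD_INCOMPLETE`, and stop on `LOAD_DONE`.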
| I guess I should add a mechanism for the app to keep polling to say "is it worth me retrying yet?" | 18:21.05 |
| at the moment whenever we try to load a page, if the pageobject is NULL I call pdf_progressive_advance | 18:23.15 |
| and that 'gobbles' more objects from the file (akin to doing a repair), inserting them into xref as it goes. | 18:23.38 |
| For the polling mechanism I could call the same function, and then give a return code based on whether we pass a significant point (i.e. another page loaded, or end of file reached or something). | 18:24.33 |
| I've just pushed the patch to my repo if you want to look at it. It's probably still broken in many places though... | 18:25.38 |
| got to go help cook. bbs. | 18:25.48 |
sebras | is back. | 18:36.48 |
| Robin_Watts: I'm thinking that this means that there are three ways of information there. | 18:38.58 |
| would it make sense to have cookie return -1 or something to mean TRYLATER? | 18:39.19 |
| I'm not even convinced that this is a good approach myself, just writing out loud. | 18:39.40 |
| maybe -1 has a special meaning already. | 18:40.00 |
| how much does pdf_progressive_advance() advance by? | 18:40.36 |
| oh, cooking. ok. I'll look in the sources. | 18:43.56 |
Robin_Watts | sebras: The cookie is a structure that's passed in. | 18:55.27 |
| so we can add as many fields to it as we want. | 18:55.37 |
| but the cookie isn't present for all operations - only for rendering. | 18:55.46 |
tor8 | Robin_Watts: just a thought; refactor pdf_xref and give it two modes -- normal and repair/progressive. a lot of the stuff we do when repairing is common with what we have to do for progressive loading. | 18:59.42 |
| and if we hit an xref/object/parsing error in normal mode, "restart" it in repair mode | 19:00.14 |
| so if we get a bad xref that looks initially okay, we won't completely break like currently | 19:00.53 |
sebras | tor8: on the subject of breaking -- you saw me mentioning that we die on improper hex-strings in repair-mode right? | 19:07.40 |
tor8 | sebras: no. I think I've seen that problem though. | 19:08.23 |
| but is it die, or just spew a lot of warnings? | 19:08.39 |
sebras | tor8: warnings + error and then dead. | 19:09.03 |
| one of the ioccc entries generates a .pdf-file that doubles as a .c-file. | 19:09.26 |
| hence it contains #include <stdio.h> which trips us up. | 19:09.40 |
tor8 | rats. I think Robin_Watts did something about parsing hex strings to fix that (or cause it... never can know when Robin's been busy ;) | 19:09.40 |
sebras | I tested it late in the evening, so it may very well be me. | 19:11.36 |