| <<<Back 1 day (to 2014/12/23) | 20141224 |
Robin_Watts | henrys: Please don't feel you should miss out on skiing because of us. | 00:07.36 |
henrys | Robin_Watts: nope looking forward to going up to the park with all of you. | 00:11.33 |
Robin_Watts | Booked flights and hotels today. We now have 3 holidays queued up :) | 00:12.15 |
henrys | nice | 00:12.31 |
Robin_Watts | malc_: You were asking about pdf_lookup_page_loc_imp | 22:03.30 |
malc_ | aye | 22:03.35 |
Robin_Watts | Essentially that used to be implemented as a simple recursive function. | 22:03.50 |
| but that turned out to be bad, for 2 reasons... | 22:04.04 |
malc_ | stackblowup | 22:04.10 |
| i suppose | 22:04.13 |
| try's | 22:04.14 |
Robin_Watts | Firstly, some files could cause stack blowup, yes. | 22:04.29 |
| In particular we saw some files that were particularly pathological. We saw page trees of the form: | 22:05.37 |
| [ <Page1> [<Page 2> [<Page3> [<Page4> ...] ] ] ] | 22:06.00 |
| The only other complexities here are 1) the need to keep a parent pointer for a given node, and 2) the need to ensure we don't go into an infinite loop (which we do by marking nodes as we search) | 22:07.17 |
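[Editor's note: a minimal self-contained sketch of the recursive lookup being described. The `node` struct and `lookup_page` are illustrative stand-ins, not the real MuPDF types; the point is that the degenerate nesting above costs one stack frame per page, and that marking nodes guards against cyclic trees.]

```c
#include <stddef.h>

/* Schematic page tree: a node is either a leaf /Page or an internal
 * /Pages node with children. Not the real MuPDF API. */
typedef struct node {
    int is_page;          /* leaf /Page vs internal /Pages */
    int marked;           /* cycle-detection mark, as described above */
    struct node **kids;
    int nkids;
} node;

/* Return the want'th leaf page (0-based), or NULL. *countp tracks how
 * many leaves we have passed so far. Each tree level costs one stack
 * frame, so a [ <Page1> [<Page2> [...] ] ] tree recurses once per page.
 * (Real code would also clear the marks after the search.) */
static node *lookup_page(node *n, int want, int *countp)
{
    if (n == NULL || n->marked)
        return NULL;      /* broken or cyclic tree: give up */
    n->marked = 1;
    if (n->is_page)
        return (*countp)++ == want ? n : NULL;
    for (int i = 0; i < n->nkids; i++) {
        node *hit = lookup_page(n->kids[i], want, countp);
        if (hit)
            return hit;
    }
    return NULL;
}
```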
malc_ | saw that yeah | 22:07.54 |
Robin_Watts | So ideally, a file would produce a nice balanced node tree for the pages. | 22:09.38 |
| I'm not sure what facilities we have within mupdf for making page trees. | 22:11.17 |
| You could drive the low level PDF object manipulation functions yourself. | 22:11.39 |
| What are you hoping to achieve? | 22:11.43 |
malc_ | have a linear array of all the objects representing pages | 22:13.45 |
| for fast lookup | 22:13.54 |
| basically the way it was before | 22:14.14 |
Robin_Watts | malc_: The problem with that is that when we do manipulations that change the page tree, that gets out of date. | 22:17.20 |
malc_ | sure | 22:17.27 |
| but | 22:17.28 |
Robin_Watts | Also it means we have to read the entire tree to start with. | 22:17.31 |
malc_ | unless you do that | 22:17.42 |
Robin_Watts | A better scheme might be to have a page cache. | 22:17.50 |
malc_ | you cannot present the user with the information of where the hell he is | 22:17.55
 | think scrollbar or somesuch | 22:18.01
Robin_Watts | malc_: Eh? | 22:18.08 |
| We can know how many pages there are, without having loaded them all. | 22:18.30 |
malc_ | you need to know how tall the entire document is; you can't know that unless you count the individual page heights | 22:18.46
Robin_Watts | malc_: Right, yes. | 22:18.54
Robin_Watts | But we have to be a tad circumspect about this, especially as we add support for new formats, like the forthcoming epub support. | 22:19.24 |
| We want to move away from knowing the number of pages at load time - cos for epub that requires us to lay out the whole damn document. | 22:19.46 |
malc_ | even el cheapo ebooks show you how many pages there are | 22:20.43
| but that's beside the point i suppose | 22:20.58 |
Robin_Watts | malc_: Right, but if you load a book in an ebook reader it will tell you 'x' pages, but it lies. | 22:21.25 |
| (at least many of them lie). Change the font size, and the number of pages doesn't change. | 22:21.42 |
| Moving pages doesn't always change the page number. | 22:21.49 |
| The smart way to do this would be to have the app load the PDF file, read the size of the first page, read the number of pages, and then guess at a size. | 22:22.52 |
| Then you could run through in the background reading a page at a time and adjusting the height to be correct. | 22:23.12 |
| but that's app level cleverness, not core cleverness, to my mind. | 22:23.34 |
| To have to load 5158 pages before you show the first one just because you want the scrollbar size to be exact seems... excessive. | 22:24.07 |
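[Editor's note: a sketch of the app-level "guess, then refine in the background" scheme Robin describes. All names here are hypothetical, not MuPDF API: the estimator guesses every unmeasured page matches the average height seen so far, seeded by the first page.]

```c
/* Progressive document-height estimate for a scrollbar.
 * Illustrative only; not part of the MuPDF core. */
typedef struct {
    int npages;        /* known cheaply at load time (for PDF) */
    int nknown;        /* pages whose real height we have measured */
    double known_sum;  /* sum of measured heights */
    double first_h;    /* height of page 1, the initial guess */
} height_est;

static double estimated_total_height(const height_est *e)
{
    /* Measured pages contribute their true height; unmeasured pages
     * are assumed to match the average seen so far (initially page 1). */
    double avg = e->nknown ? e->known_sum / e->nknown : e->first_h;
    return e->known_sum + (e->npages - e->nknown) * avg;
}

/* Called from the background pass as each page's height arrives. */
static void record_page_height(height_est *e, double h)
{
    e->known_sum += h;
    e->nknown++;
}
```

The UI can redraw the scrollbar whenever the estimate changes; it converges to the exact total once the background pass finishes.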
malc_ | i don't have to load 5158 pages | 22:26.08 |
| only get their mediaboxen | 22:26.13 |
| big difference | 22:26.16 |
 | and guesswork is inadequate if you support whitespace trimming and such | 22:27.11
Robin_Watts | That requires us to hunt through though. | 22:27.20 |
malc_ | Robin_Watts: sorry for being obtuse, but i'm still unsure how to achieve what i want | 22:58.47
Robin_Watts | malc_: If your intention is to speedily run through the entire file fetching the page boxes, then we probably need to write some new code. | 23:00.30 |
| Possibly to have a page iterator. | 23:00.40 |
| Or page 'map' function. | 23:00.52 |
| But that will require some coding within the mupdf core, which you could do. | 23:01.11 |
| Alternatively, if you just want to remain a 'user' of the core, then you'd need to do some cleverer coding. Like guessing at a size based on the first page, and then refining that guess on a background thread. | 23:02.25 |
malc_ | Robin_Watts: i was thinking about stealing https://github.com/sumatrapdfreader/sumatrapdf/blob/master/src/PdfEngine.cpp (line 1097 and below) but prefer to understand what i'm doing... | 23:05.21
Robin_Watts | malc_: The simplest 'fast' way of traversing all the pages would be to copy that pdf_lookup_page_loc_imp function as a new function and modify it. | 23:05.32
 | Where pdf_lookup_page_loc_imp skips to a particular page within the page tree, just make it run through every single entry. | 23:06.07
| And for each entry it finds, call a function pointer that's passed in. | 23:06.25 |
| That gives you a 'map this function across every page' function, right? | 23:06.38 |
| You can then call that new 'map' function with a function that extracts the page mediabox. | 23:07.16 |
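[Editor's note: a self-contained sketch of the "map a function across every page" idea just described. The `pnode` struct, `map_pages`, and the visitor signature are illustrative assumptions, not the real MuPDF API; a real version would live in the core next to pdf_lookup_page_loc_imp.]

```c
#include <stddef.h>

/* Schematic page tree node (stand-in for a /Pages or /Page object). */
typedef struct pnode {
    int is_page;
    double mediabox_h;      /* stand-in for the page's /MediaBox data */
    struct pnode **kids;
    int nkids;
} pnode;

/* Visitor called once per leaf page, with a caller-supplied argument. */
typedef void (*page_fn)(pnode *page, void *arg);

/* Depth-first walk over the whole tree, calling fn on every leaf.
 * (The real thing would also need cycle-detection marks, as discussed.) */
static void map_pages(pnode *n, page_fn fn, void *arg)
{
    if (n == NULL)
        return;
    if (n->is_page) {
        fn(n, arg);
        return;
    }
    for (int i = 0; i < n->nkids; i++)
        map_pages(n->kids[i], fn, arg);
}

/* Example visitor: accumulate page heights, e.g. for a scrollbar. */
static void sum_heights(pnode *page, void *arg)
{
    *(double *)arg += page->mediabox_h;
}
```

Calling `map_pages(root, sum_heights, &total)` visits every page in document order without the caller knowing anything about the tree shape.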
malc_ | Robin_Watts: that's what i was trying to do :) | 23:07.19 |
Robin_Watts | Ok. | 23:07.23 |
malc_ | but i'm failing | 23:07.30 |
Robin_Watts | ok. | 23:07.40 |
| The current 'stack' in there is used to store each node as we pass through it. | 23:09.28 |
| At the moment we only ever move down the tree - hence the stack only gets extended (within the traversal) | 23:09.49 |
| and then gets wound up at the end. | 23:09.56 |
| You'll need to change that to step both up and down the tree. (basically you're doing a depth first search) | 23:10.16 |
| You can probably lose indexp and parentp. | 23:10.36 |
| and skip goes away too. | 23:10.43 |
| You may be best dropping back to a simple recursive implementation while you get it right, and then turn that into an iterative one later - it all depends what you're most comfortable with. | 23:11.39 |
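[Editor's note: a sketch of the iterative form Robin describes, replacing recursion with an explicit stack of (node, next-child-index) frames that steps both down and up the tree. Types, names, and the fixed `MAX_DEPTH` are illustrative assumptions, not the MuPDF internals (which grow their stack dynamically).]

```c
#include <stddef.h>

typedef struct tnode {
    int is_page;
    int value;              /* stand-in for per-page data */
    struct tnode **kids;
    int nkids;
} tnode;

#define MAX_DEPTH 256       /* real code would grow this dynamically */

/* Depth-first visit of every leaf page without recursion; writes up to
 * max_out page values into out and returns the total leaf count. */
static int visit_pages_iter(tnode *root, int *out, int max_out)
{
    struct { tnode *n; int next; } stack[MAX_DEPTH];
    int depth = 0, count = 0;
    stack[0].n = root;
    stack[0].next = 0;
    while (depth >= 0) {
        tnode *n = stack[depth].n;
        if (n->is_page) {
            if (count < max_out)
                out[count] = n->value;
            count++;
            depth--;                      /* leaf done: step back up */
        } else if (stack[depth].next < n->nkids) {
            tnode *kid = n->kids[stack[depth].next++];
            if (depth + 1 < MAX_DEPTH) {  /* step down into next child */
                stack[++depth].n = kid;
                stack[depth].next = 0;
            }
        } else {
            depth--;                      /* subtree exhausted: step up */
        }
    }
    return count;
}
```

Unlike the lookup version, there is no `skip` and no parent/index out-parameters: the explicit stack records the path, so the walk can rewind and continue through every entry.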
malc_ | Robin_Watts: yes, i was pondering that route too, thanks | 23:12.05 |
| Robin_Watts: and done.. thanks a lot | 23:31.11 |
Robin_Watts | malc_: Fab. let me know how you get on. This may be something we should consider generalising. | 23:42.06 |
| not entirely sure how given we want it to work with lots of file formats, but... | 23:42.25 |
malc_ | Robin_Watts: i was discussing it with Tor and he was against the idea of having some sort of visitor function supplied to lookup | 23:43.06 |
| which would have allowed one to cache the stuff | 23:43.16 |
Robin_Watts | malc_: I think I agree with him that we don't want to clutter the existing lookup functions. | 23:44.03 |
| What we're talking about here is a different beast. An addition to the API if you like. | 23:44.31 |
malc_ | Robin_Watts: sure | 23:46.08 |