| <<<Back 1 day (to 2014/12/23) | 20141224 |
Robin_Watts | henrys: Please don't feel you should miss out on skiing because of us. | 00:07.36 |
henrys | Robin_Watts: nope looking forward to going up to the park with all of you. | 00:11.33 |
Robin_Watts | Booked flights and hotels today. We now have 3 holidays queued up :) | 00:12.15 |
henrys | nice | 00:12.31 |
Robin_Watts | malc_: You were asking about pdf_lookup_page_loc_imp | 22:03.30 |
malc_ | aye | 22:03.35 |
Robin_Watts | Essentially that used to be implemented as a simple recursive function. | 22:03.50 |
| but that turned out to be bad, for 2 reasons... | 22:04.04 |
malc_ | stackblowup | 22:04.10 |
| i suppose | 22:04.13 |
| try's | 22:04.14 |
Robin_Watts | Firstly, some files could cause stack blowup, yes. | 22:04.29 |
| In particular we saw some files that were particularly pathological. We saw page trees of the form: | 22:05.37 |
| [ <Page1> [<Page 2> [<Page3> [<Page4> ...] ] ] ] | 22:06.00 |
| The only other complexities here are 1) the need to keep a parent pointer for a given node, and 2) the need to ensure we don't go into an infinite loop (which we do by marking nodes as we search) | 22:07.17 |
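[Editor's note: a minimal self-contained sketch of the recursive lookup being described. The `node` struct and `lookup_page` are illustrative stand-ins, not the real MuPDF types; the point is that the degenerate nesting above costs one stack frame per page, and that marking nodes guards against cyclic trees.]

```c
#include <stddef.h>

/* Schematic page tree: a node is either a leaf /Page or an internal
 * /Pages node with children. Not the real MuPDF API. */
typedef struct node {
    int is_page;          /* leaf /Page vs internal /Pages */
    int marked;           /* cycle-detection mark, as described above */
    struct node **kids;
    int nkids;
} node;

/* Return the want'th leaf page (0-based), or NULL. *countp tracks how
 * many leaves we have passed so far. Each tree level costs one stack
 * frame, so a [ <Page1> [<Page2> [...] ] ] tree recurses once per page.
 * (Real code would also clear the marks after the search.) */
static node *lookup_page(node *n, int want, int *countp)
{
    if (n == NULL || n->marked)
        return NULL;      /* broken or cyclic tree: give up */
    n->marked = 1;
    if (n->is_page)
        return (*countp)++ == want ? n : NULL;
    for (int i = 0; i < n->nkids; i++) {
        node *hit = lookup_page(n->kids[i], want, countp);
        if (hit)
            return hit;
    }
    return NULL;
}
```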
malc_ | saw that yeah | 22:07.54 |
Robin_Watts | So ideally, a file would produce a nice balanced node tree for the pages. | 22:09.38 |
| I'm not sure what facilities we have within mupdf for making page trees. | 22:11.17 |
| You could drive the low level PDF object manipulation functions yourself. | 22:11.39 |
| What are you hoping to achieve? | 22:11.43 |
malc_ | have a linear array of all the objects representing pages | 22:13.45 |
| for fast lookup | 22:13.54 |
| basically the way it was before | 22:14.14 |
Robin_Watts | malc_: The problem with that is that when we do manipulations that change the page tree, that gets out of date. | 22:17.20 |
malc_ | sure | 22:17.27 |
| but | 22:17.28 |
Robin_Watts | Also it means we have to read the entire tree to start with. | 22:17.31 |
malc_ | unless you do that | 22:17.42 |
Robin_Watts | A better scheme might be to have a page cache. | 22:17.50 |
malc_ | you cannot present the user with the information of where the hell he is | 22:17.55
 | think scrollbar or somesuch | 22:18.01
Robin_Watts | malc_: Eh? | 22:18.08 |
| We can know how many pages there are, without having loaded them all. | 22:18.30 |
malc_ | you need to know how tall the entire document is; you can't know that unless you count the individual page heights | 22:18.46
Robin_Watts | malc_: Right, yes. | 22:18.54
Robin_Watts | But we have to be a tad circumspect about this, especially as we add support for new formats, like the forthcoming epub support. | 22:19.24 |
| We want to move away from knowing the number of pages at load time - cos for epub that requires us to lay out the whole damn document. | 22:19.46 |
malc_ | even el cheapo ebooks show you how many pages there are | 22:20.43
| but that's beside the point i suppose | 22:20.58 |
Robin_Watts | malc_: Right, but if you load a book in an ebook reader it will tell you 'x' pages, but it lies. | 22:21.25 |
| (at least many of them lie). Change the font size, and the number of pages doesn't change. | 22:21.42 |
| Moving pages doesn't always change the page number. | 22:21.49 |
| The smart way to do this would be to have the app load the PDF file, read the size of the first page, read the number of pages, and then guess at a size. | 22:22.52 |
| Then you could run through in the background reading a page at a time and adjusting the height to be correct. | 22:23.12 |
| but that's app level cleverness, not core cleverness, to my mind. | 22:23.34 |
| To have to load 5158 pages before you show the first one just because you want the scrollbar size to be exact seems... excessive. | 22:24.07 |
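[Editor's note: a sketch of the app-level "guess, then refine in the background" scheme Robin describes. All names here are hypothetical, not MuPDF API: the estimator guesses every unmeasured page matches the average height seen so far, seeded by the first page.]

```c
/* Progressive document-height estimate for a scrollbar.
 * Illustrative only; not part of the MuPDF core. */
typedef struct {
    int npages;        /* known cheaply at load time (for PDF) */
    int nknown;        /* pages whose real height we have measured */
    double known_sum;  /* sum of measured heights */
    double first_h;    /* height of page 1, the initial guess */
} height_est;

static double estimated_total_height(const height_est *e)
{
    /* Measured pages contribute their true height; unmeasured pages
     * are assumed to match the average seen so far (initially page 1). */
    double avg = e->nknown ? e->known_sum / e->nknown : e->first_h;
    return e->known_sum + (e->npages - e->nknown) * avg;
}

/* Called from the background pass as each page's height arrives. */
static void record_page_height(height_est *e, double h)
{
    e->known_sum += h;
    e->nknown++;
}
```

The UI can redraw the scrollbar whenever the estimate changes; it converges to the exact total once the background pass finishes.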
malc_ | i don't have to load 5158 pages | 22:26.08 |
| only get their mediaboxen | 22:26.13 |
| big difference | 22:26.16 |
 | and guesswork is inadequate if you support whitespace trimming and such | 22:27.11
Robin_Watts | That requires us to hunt through though. | 22:27.20 |
malc_ | Robin_Watts: sorry for being obtuse, but i'm still unsure how to achieve what i want | 22:58.47
Robin_Watts | malc_: If your intention is to speedily run through the entire file fetching the page boxes, then we probably need to write some new code. | 23:00.30 |
| Possibly to have a page iterator. | 23:00.40 |
| Or page 'map' function. | 23:00.52 |
| But that will require some coding within the mupdf core, which you could do. | 23:01.11 |
| Alternatively, if you just want to remain a 'user' of the core, then you'd need to do some cleverer coding. Like guessing at a size based on the first page, and then refining that guess on a background thread. | 23:02.25 |
malc_ | Robin_Watts: i was thinking about stealing https://github.com/sumatrapdfreader/sumatrapdf/blob/master/src/PdfEngine.cpp (line 1097 and below) but prefer to understand what i'm doing... | 23:05.21
Robin_Watts | malc_: The simplest 'fast' way of traversing all the pages would be to copy that pdf_lookup_page_loc_imp function as a new function and modify it. | 23:05.32
 | Where pdf_lookup_page_loc_imp skips to a particular page within the page tree, just make it run through every single entry. | 23:06.07
| And for each entry it finds, call a function pointer that's passed in. | 23:06.25 |
| That gives you a 'map this function across every page' function, right? | 23:06.38 |
| You can then call that new 'map' function with a function that extracts the page mediabox. | 23:07.16 |
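[Editor's note: a self-contained sketch of the "map a function across every page" idea just described. The `pnode` struct, `map_pages`, and the visitor signature are illustrative assumptions, not the real MuPDF API; a real version would live in the core next to pdf_lookup_page_loc_imp.]

```c
#include <stddef.h>

/* Schematic page tree node (stand-in for a /Pages or /Page object). */
typedef struct pnode {
    int is_page;
    double mediabox_h;      /* stand-in for the page's /MediaBox data */
    struct pnode **kids;
    int nkids;
} pnode;

/* Visitor called once per leaf page, with a caller-supplied argument. */
typedef void (*page_fn)(pnode *page, void *arg);

/* Depth-first walk over the whole tree, calling fn on every leaf.
 * (The real thing would also need cycle-detection marks, as discussed.) */
static void map_pages(pnode *n, page_fn fn, void *arg)
{
    if (n == NULL)
        return;
    if (n->is_page) {
        fn(n, arg);
        return;
    }
    for (int i = 0; i < n->nkids; i++)
        map_pages(n->kids[i], fn, arg);
}

/* Example visitor: accumulate page heights, e.g. for a scrollbar. */
static void sum_heights(pnode *page, void *arg)
{
    *(double *)arg += page->mediabox_h;
}
```

Calling `map_pages(root, sum_heights, &total)` visits every page in document order without the caller knowing anything about the tree shape.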
malc_ | Robin_Watts: that's what i was trying to do :) | 23:07.19 |
Robin_Watts | Ok. | 23:07.23 |
malc_ | but i'm failing | 23:07.30 |
Robin_Watts | ok. | 23:07.40 |
| The current 'stack' in there is used to store each node as we pass through it. | 23:09.28 |
| At the moment we only ever move down the tree - hence the stack only gets extended (within the traversal) | 23:09.49 |
| and then gets wound up at the end. | 23:09.56 |
| You'll need to change that to step both up and down the tree. (basically you're doing a depth first search) | 23:10.16 |
| You can probably lose indexp and parentp. | 23:10.36 |
| and skip goes away too. | 23:10.43 |
| You may be best dropping back to a simple recursive implementation while you get it right, and then turn that into an iterative one later - it all depends what you're most comfortable with. | 23:11.39 |
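[Editor's note: a sketch of the iterative form Robin describes, replacing recursion with an explicit stack of (node, next-child-index) frames that steps both down and up the tree. Types, names, and the fixed `MAX_DEPTH` are illustrative assumptions, not the MuPDF internals (which grow their stack dynamically).]

```c
#include <stddef.h>

typedef struct tnode {
    int is_page;
    int value;              /* stand-in for per-page data */
    struct tnode **kids;
    int nkids;
} tnode;

#define MAX_DEPTH 256       /* real code would grow this dynamically */

/* Depth-first visit of every leaf page without recursion; writes up to
 * max_out page values into out and returns the total leaf count. */
static int visit_pages_iter(tnode *root, int *out, int max_out)
{
    struct { tnode *n; int next; } stack[MAX_DEPTH];
    int depth = 0, count = 0;
    stack[0].n = root;
    stack[0].next = 0;
    while (depth >= 0) {
        tnode *n = stack[depth].n;
        if (n->is_page) {
            if (count < max_out)
                out[count] = n->value;
            count++;
            depth--;                      /* leaf done: step back up */
        } else if (stack[depth].next < n->nkids) {
            tnode *kid = n->kids[stack[depth].next++];
            if (depth + 1 < MAX_DEPTH) {  /* step down into next child */
                stack[++depth].n = kid;
                stack[depth].next = 0;
            }
        } else {
            depth--;                      /* subtree exhausted: step up */
        }
    }
    return count;
}
```

Unlike the lookup version, there is no `skip` and no parent/index out-parameters: the explicit stack records the path, so the walk can rewind and continue through every entry.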
malc_ | Robin_Watts: yes, i was pondering that route too, thanks | 23:12.05 |
| Robin_Watts: and done.. thanks a lot | 23:31.11 |
Robin_Watts | malc_: Fab. let me know how you get on. This may be something we should consider generalising. | 23:42.06 |
| not entirely sure how given we want it to work with lots of file formats, but... | 23:42.25 |
malc_ | Robin_Watts: i was discussing it with Tor and he was against the idea of having some sort of visitor function supplied to lookup | 23:43.06 |
| which would have allowed one to cache the stuff | 23:43.16 |
Robin_Watts | malc_: I think I agree with him that we don't want to clutter the existing lookup functions. | 23:44.03 |
| What we're talking about here is a different beast. An addition to the API if you like. | 23:44.31 |
malc_ | Robin_Watts: sure | 23:46.08 |