| <<<Back 1 day (to 2017/08/13) | 20170814 |
tor8 | sebras: Guest26768: (for the logs) mutool convert is *not* a good tool to use to split a document | 09:37.55 |
| for the same reasons that using ghostscript and the pdfwrite device is not a good way to do it | 09:38.10 |
| you want 'mutool clean -o subset.pdf input.pdf 5-10' to extract a subset of pages | 09:38.32 |
| mutool convert recreates a new PDF file from the graphics... it's only going to bear a visual resemblance to the input file | 09:38.52 |
| Guest57283: an image is drawn by filling the unit rectangle [0 0 1 1], which can be transformed using the matrix to fill whatever region of the page you desire | 09:41.27 |
| when creating a PDF from scratch, you're right in that you're not building the resource dictionary properly | 09:43.11 |
| you can look at the 'mutool run' example in docs/examples/pdf-create.js for an example you can adapt to C | 09:44.11 |
Guest52101 | Hi tor, thanks for the help. For drawing an image, is it better to add to the dictionary (my second attempt)? | 11:21.19 |
tor8 | Guest52101: depending on what you want to accomplish in the long run, using the pdf document writer or creating your own content stream are both valid options | 11:32.04 |
| Guest52101: you create the resObj dictionary. in that one you need to create a second imResObj dictionary which has an entry for the imObj | 11:33.01 |
| so if imObj = pdf_add_image() that has the image object | 11:33.11 |
| then you need to create a new PDF dictionary imResObj = pdf_new_dict(). add the imObj using the same name as in your content stream: pdf_dict_puts(imResObj, "Im0", imObj) | 11:33.58 |
| then add the imResObj to the resource dictionary: resObj = pdf_new_dict(); pdf_dict_puts(resObj, "XObject", imResObj) | 11:34.21 |
| then when you create the page you pass the resObj to the pdf_add_page call | 11:35.39 |
| you do not need the pdf_add_object_drop(ctx, doc, imgObj) call | 11:36.21 |
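Collected in one place, the steps above look roughly like the sketch below. This is an illustration only: the exact argument lists vary between MuPDF versions (check include/mupdf/pdf.h and docs/examples/pdf-create.js), and `ctx`, `doc`, `image`, `mediabox` and `contents` are assumed to be set up elsewhere.

```c
/* sketch: wire an image into a page's resource dictionary
   (MuPDF C API; signatures approximate, verify against your version) */
pdf_obj *imObj = pdf_add_image(ctx, doc, image);   /* the image XObject */

pdf_obj *imResObj = pdf_new_dict(ctx, doc, 1);
pdf_dict_puts(ctx, imResObj, "Im0", imObj);        /* same name as in the content stream */

pdf_obj *resObj = pdf_new_dict(ctx, doc, 1);
pdf_dict_puts(ctx, resObj, "XObject", imResObj);

/* content stream e.g. "q 200 0 0 200 50 100 cm /Im0 Do Q", then: */
pdf_obj *page = pdf_add_page(ctx, doc, &mediabox, 0, resObj, contents);
pdf_insert_page(ctx, doc, -1, page);
```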
| Guest52101: the "200 0 0 200 50 100 cm" line is equivalent to fz_matrix ctm = { 200, 0, 0, 200, 50, 100 } and passing that to fz_fill_image if using the pdf document writer | 11:38.06 |
| passing &fz_identity to fz_fill_image draws a 1-pixel by 1-pixel sized image in the upper left corner | 11:38.38 |
| BTW, the coordinate systems between the two approaches differ -- with the pdf document writer interface +Y is descending and the origin (0,0) is in the top left | 11:39.27 |
| in PDF content streams, +Y is ascending and the origin is in the bottom left | 11:39.42 |
avih | tor8: hey :) so what do you think of mujs' regex match and its stack allocation of 280k? i'd have written a patch for it but it's not clear to me how that spawn works. it seems to be modifying the js PC and i don't really follow the code flow there... | 12:04.11 |
| a trivial solution would be to allocate it on the heap instead, and another approach would be to allocate an array of pointers rather than of structs, and populate it as the need grows. | 12:05.59 |
tor8 | avih: reducing the REG_MAXSUB is also a trivial fix | 12:07.14 |
avih | (each item of that array is, iirc, ~34 pointers and an int, which is 280 bytes on a 64-bit system, so 280k for 1000 items, while musl/alpine's default thread stack is 80k) | 12:07.16 |
tor8 | given that the Resub is part of the regex matching thread state | 12:07.36 |
avih | yes, but that will also limit the behavior | 12:07.41 |
| (it is also possible to patch it just for alpine and create the js threads with bigger stack - which worked too) | 12:08.36 |
tor8 | MAXSUB can easily be reduced to 10 and still stay within the ecma spec limits | 12:09.09 |
| actually, it should be 99 to be perfectly compliant | 12:09.38 |
| but I certainly hope no-one uses more than 9 captures | 12:09.48 |
avih | re smaller array, iirc i already bumped into the limit at least once (not sure if it's this array, but a regex limit nevertheless) when trying Babel - an ES6 to ES5 transpiler | 12:10.06 |
tor8 | doing a malloc/free for each match is also possible | 12:10.39 |
| the spawn function just clones the current thread's state to fork | 12:12.11 |
| the regex matcher runs all possible branches of a regex program in parallel. | 12:13.09 |
| it is not a backtracking matcher | 12:13.26 |
| now if you need smaller space, it would be pretty easy/trivial to write a backtracking matcher but it will have (like most fancy regex engines) pathological behaviour for certain classes of regex | 12:14.15 |
| a*a*a* for instance | 12:14.19 |
| you could write a backtracking matcher using the same bytecode program | 12:14.50 |
| 80k stack is very tight | 12:17.23 |
avih | it is, but surprisingly relatively few packages need bigger stacks. sometimes it's improved upstream because it makes sense, sometimes an alpine-specific patch is applied to the package, increasing the stack for a specific thread or for all of a program's threads. | 12:21.23 |
| it's easy to get spoiled when glibc's default stack is 8M :) | 12:22.26 |
| but it has visible advantages. other than using the heap for what it's designed for, alpine memory usage is extremely low. booted without X, with bash and some httpd services, it uses less than 30M of ram. run xfce with a nice theme and it's still less than 100M of ram, for instance. | 12:25.07 |
tor8 | avih: there are two commits on the top of tor/wip you could give a spin | 12:38.40 |
| I haven't tested them myself yet, beyond verifying that it actually compiles | 12:38.53 |
avih | thanks. my regex use case is very limited (and i'm far from a regex guru). but i can check that it still works. | 12:39.50 |
tor8 | yeah, there could be some silly typo or two in there | 12:40.07 |
avih | tor8: hmm.. did you really prefer to implement a backtracking matcher over handling allocation a bit differently? | 12:43.11 |
| also at the first commit message s/Invorke/Invoke/ | 12:53.03 |
| tor8: i have a feeling matchbt recurses for each new char of the regex expression. if it's the same with non-bt, then maybe it also spawns more threads than you expect it to. | 13:03.52 |
| (i added printf at the beginning of matchbt which prints sp, and there are a LOT of prints) | 13:04.42 |
| (for simple regex) | 13:04.48 |
| sorry, it recurses for every char of the tested string. so if you test a 1k string, as far as i can tell you're going to get a 1000-deep recursion | 13:13.18 |
| this can't be good | 13:13.43 |
tor8 | avih: it recurses for every *split* in the regex | 13:25.45 |
| i.e. each time it needs to branch | 13:25.52 |
avih | tor8: is it expected to recurse every new char of the tested string? | 13:26.11 |
tor8 | and a simple regex needs to be tested for each position in the string, so it needs to split for each character in 'top' loop | 13:26.22 |
avih | it doesn't seem right to me TBH | 13:26.27 |
| tor8: is the recursion depth equivalent to the number of items used in the Rethread array? | 13:27.08 |
tor8 | the regex /foo/ is compiled to the same program as /^.*foo/ | 13:27.10 |
avih | because this seems to me like it could hit memory limits very quickly when you search a substring in a big-ish string | 13:27.52 |
tor8 | avih: gcc -DTEST regexp.c utf*.c | 13:28.18 |
| then run ./a.out 'foo' and see the program that it compiles | 13:28.29 |
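The compiled program for /foo/ looks roughly like the sketch below (instruction numbers illustrative; the real -DTEST output may differ in detail). The leading split/any/jmp loop is what makes it equivalent to /^.*foo/:

```
0: split 3 1   ; fork: try the literal at 3, or consume a char at 1 and loop
1: any
2: jmp 0
3: char 'f'
4: char 'o'
5: char 'o'
6: match
```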
avih | i see it but can't interpret it. | 13:29.55 |
| regardless, i find it really hard to believe that substring match in a string of 5k will use 5k threads or recursion depth | 13:30.25 |
tor8 | "split 3 1" means to fork and continue matching from both 3 and 1 | 13:30.47 |
| in matchbt, try swapping pc->x and pc->y in case I_SPLIT | 13:31.40 |
| so we recurse on the pc->y and iterate on pc->x instead | 13:32.12 |
avih | replaced, the DTEST output looks the same. gonna try in my use case. | 13:33.08 |
| recursion depth seems the same to me | 13:33.59 |
tor8 | the output should be the same, the recursion should be different | 13:34.12 |
avih | i added/have printf at the beginning of matchbt, so every print is equivalent to entry. it looks the same | 13:35.20 |
| (but indeed it doesn't count depth. just entries) | 13:35.54 |
| actually not exactly the same. it's about half as deep in some cases. | 13:36.53 |
tor8 | with x/y reversed, it stops earlier | 13:37.25 |
avih | or half entries. but still entry per char of the tested string. only previously i also saw the "back", and now it indeed stops earlier | 13:37.43 |
| (i.e. on some cases it was 2 * strlen and now it's "just" strlen entries) | 13:38.18 |
tor8 | avih: yes, it calls the function once for each character, but the depth in the two cases should differ | 13:38.31 |
| in the swapped x/y they should be flat | 13:38.44 |
avih | hmm.. ok. let me count the depth too then (i do believe you of course :) ) | 13:39.06 |
tor8 | just add printf('}') before the return statements | 13:39.59 |
| https://pastebin.com/raw/5M1RZ6vS | 13:41.22 |
| think of the 'split' instruction as a combined if-else instruction | 13:43.09 |
avih | yes, i can see. what are the x and y? progression in the regex and the tested string? | 13:43.25 |
tor8 | the x and y are the two branches of the if-else chain | 13:43.47 |
avih | (i added a depth argument and just print it) | 13:43.53 |
tor8 | with the swapped x/y it should be pretty similar in behavior to say PCRE | 13:44.49 |
| in terms of stack use and recursion etc | 13:44.57 |
avih | tor8: was the non bt implementation also spawning as many threads as the bt implementation recursion depth? | 13:45.03 |
| so it couldn't for instance test a 2k string? | 13:45.46 |
tor8 | the parallel implementation spawns "threads" and reuses them as they die | 13:46.08 |
| so each 'split' instruction it encounters will just add a new 'thread' to the queue | 13:47.02 |
| instead of recursing here and now | 13:47.09 |
| and once the current thread has died, it pops the next spawned thread off the queue and runs that | 13:47.33 |
avih | are they actual threads? and wouldn't the initial run while testing 2k string still hit the limit? | 13:47.53 |
tor8 | this has *nothing* to do with system threads | 13:48.05 |
avih | ok, that explains some things. so just "initial state which needs to be exhausted"? | 13:48.43 |
| anyway, how would you describe the recursion depth now roughly in terms of complexity? (O(...) ) | 13:49.52 |
tor8 | the regex matcher is a sort of VM | 13:49.54 |
| with very specialised 'threads' where each fork of the regex spawns a new clone of the program for matching | 13:50.25 |
avih | yeah, i sort of realized that. | 13:50.26 |
tor8 | then it just runs them sequentially until they die | 13:50.40 |
avih | so Rethread size is sort of the regex vm stack? | 13:51.10 |
tor8 | the instruction set doesn't have a 'branch' instruction like normal programs, it does branching by forking and running the branch in a new thread | 13:51.29 |
| rethread size is the max number of threads that can be queued | 13:51.40 |
avih | and typically each iteration queues one? | 13:52.14 |
| (i guess i should just read it if i really want to understand how it works, rather than wasting your time...) | 13:52.58 |
tor8 | avih: https://swtch.com/~rsc/regexp/regexp1.html | 13:53.53 |
avih | and the recursion approach and/or the x/y swap changes from one paradigm to the other? | 13:55.46 |
tor8 | neither, really. this just explains the approach taken by my implementation compared to a naive regex implementation. | 13:57.38 |
avih | so yours is Thompson's NFA? | 13:58.26 |
tor8 | ah crap, something's wrong with the matchbt parenthesis capturing | 14:00.15 |
| it's based on the same algorithm | 14:00.24 |
| https://swtch.com/~rsc/regexp/regexp2.html is the followup, which is what I modeled the implementation on | 14:01.06 |
avih | i actually have very good testcases for regex, other than babel. highlightjs and prism are syntax highlighters based on regex and written in js. i'm using them elsewhere in my own source-highlight replacement which i've written in node.js, and duktape is able to run them while mujs wasn't. maybe it'll be fixed now :) | 14:04.09 |
| (in mujs iirc one of them reached depth limit and the other you claimed is invalid regex) | 14:05.06 |
| (or maybe duktape was only able to run one of them - and slow, but it definitely got further than mujs) | 14:06.32 |
tor8 | avih: gotta go, back in a few hours | 14:11.04 |
avih | k, thanks for your time | 14:11.13 |
| tor8: fwiw, without the x-y swap, it indeed explodes very quickly (on alpine and elsewhere). with the xy swap it works nicely for 500k x "hello world " and match(/hello/g).length | 17:32.55 |
tor8 | avih: there's a fixed version on tor/wip | 17:32.58 |
avih | it also happens to be almost 2x faster than the "threads" code | 17:33.09 |
tor8 | I'm thinking of implementing the variant that executes the threads in lock-step which should let us reduce the stack use even more | 17:33.18 |
avih | tor8: and interestingly, your wip branch as is, with the fast array, is yet another ~6x faster than the recursion (and xy swap) | 17:33.52 |
tor8 | the "threads" code uses a fixed amount of stack, the backtracking one can explode the C stack due to recursion | 17:34.00 |
avih | i understand this, yes, though you could limit the length artificially too | 17:34.26 |
tor8 | the fast array stuff is *really* incomplete :) | 17:34.29 |
avih | i accidentally checked out tor8/wip rather than just cherry-picking the top commits, so i noticed :p | 17:34.59 |
tor8 | the 'threads' implementation could probably be somewhat optimized | 17:35.05 |
avih | anyway, the recursive approach with the xy swap seems very good, and didn't explode on alpine (though i didn't try the "bad" split cases). and it should be easy to artificially limit the depth too | 17:35.57 |
| however, assuming the "thread" approach is better, why not just allocate this array on the heap? | 17:36.36 |
tor8 | a lockstep non-backtracking matcher (which is possible, I just never got around to it) will run in O(n * m) speed (where n=regexp length, m=search string length) | 17:37.15 |
avih | it's a fixed size array; even with consecutive calls to match, the allocator should easily reuse the just-released space without much penalty imo | 17:37.45 |
tor8 | it might be a constant factor slower for normal cases but should cope much more gracefully with pathological cases | 17:37.57 |
| and the MAXTHREADS in a lockstep implementation need only ever be the same length as the regex program | 17:38.18 |
avih | what about the "crap matching parenthesis" thingy you mentioned earlier? | 17:38.52 |
tor8 | so a lockstep implementation could allocate the exact amount needed on the heap | 17:38.55 |
| avih: I fixed that bug on tor/wip | 17:39.11 |
avih | oh. sec | 17:39.17 |
| tor8: i don't think i see it. is it part of the recursive code which isn't at the non recursive code? | 17:40.44 |
| (i.e. the threaded code is still broken?) | 17:41.18 |
tor8 | avih: nothing is broken now. | 17:51.49 |
avih | tor8: so what was broken before? just the recursive code which you later fixed? | 17:52.19 |
tor8 | I had some typos in how the Resub state was handled | 17:52.27 |
| in the recursive code | 17:52.34 |
avih | at the new recursive code | 17:52.40 |
| k | 17:52.40 |
| tor8: something is wrong with the new squashed patch: 1. it doesn't enable recursive by default. 2. even if i set the two places with opts=REG_RECURSIVE it still doesn't use the recursive match. 3. even if i also set flags=REG_RECURSIVE at two places at jsB_new_RegExp (instead of 0), it STILL doesn't use the recursive match. | 18:46.49 |
| i don't think i can enable the recursive implementation for "hello".match(/h/g) | 18:47.14 |
| there's clearly a code path which still sets it to 0 (eflags at regexec is 0 after all the above changes). it needs a more convenient way to choose an implementation, i don't think the caller should control it but rather a flag at regex.c, and i really think it should be recursive by default, possibly with depth limitation like the array size limits it for the "threaded" implementation | 18:49.34 |
| or, just allocate the array on the heap and call it a day instead of the new implementation. | 18:50.28 |
| also, imo, until a patchset is reasonably stable you shouldn't squash or force-push. reflog is a lot less convenient to sift through than plain commit history | 18:52.28 |
Guest64308 | Hi tor8, thanks for the help. Now both methods show the image on the page. But, in both cases, the image is distorted and cropped. How do I show it properly? | 18:53.01 |
| I am using fz_image *image = fz_new_image_from_file(ctx, "2012_2.png"); | 18:53.53 |
avih | specifically, jsstring.c has a lot of places like this: if (js_regexec(re->prog, a, &m, a > text ? REG_NOTBOL : 0)) ... | 19:25.17 |
| where it uses 0 flags, which disables REG_RECURSIVE | 19:25.41 |
| these are implementation details, and should be an implementation flag outside of the callers' control, imo. or, at the very least, let callers control it via a specific api that changes the mode once, on init, rather than on every regexec call | 19:26.51 |
| or just make it a #define IMP_RECURSIVE at regex.c and then #ifdef out one of the implementations. | 19:34.33 |
Guest7162 | Hello everyone. Besides \mupdf\docs\, are there other examples in C for using mupdf? For instance, splitting pdfs, drawing images, compressing using JBIG2, etc? | 22:26.37 |
| Forward 1 day (to 2017/08/15)>>> | |