| <<<Back 1 day (to 2017/08/13) | 20170814 |
tor8 | sebras: Guest26768: (for the logs) mutool convert is *not* a good tool to use to split a document | 09:37.55 |
| for the same reasons that using ghostscript and the pdfwrite device is not a good way to do it | 09:38.10 |
| you want 'mutool clean -o subset.pdf input.pdf 5-10' to extract a subset of pages | 09:38.32 |
| mutool convert recreates a new PDF file from the graphics... it's only going to bear a visual resemblance to the input file | 09:38.52 |
| Guest57283: an image is drawn by filling the unit rectangle [0 0 1 1], which can be transformed using the matrix to fill whatever region of the page you desire | 09:41.27 |
| when creating a PDF from scratch, you're right in that you're not building the resource dictionary properly | 09:43.11 |
| you can look at the 'mutool run' example in docs/examples/pdf-create.js for an example you can adapt to C | 09:44.11 |
Guest52101 | Hi tor, thanks for the help. For drawing an image, is it better to add to the dictionary (my second attempt)? | 11:21.19 |
tor8 | Guest52101: depending on what you want to accomplish in the long run, using the pdf document writer or creating your own content stream are both valid options | 11:32.04 |
| Guest52101: you create the resObj dictionary. in that one you need to create a second imResObj dictionary which has an entry for the imObj | 11:33.01 |
| so if imObj = pdf_add_image() that has the image object | 11:33.11 |
| then you need to create a new PDF dictionary imResObj = pdf_new_dict(). add the imObj using the same name as in your content stream: pdf_dict_puts(imResObj, "Im0", imObj) | 11:33.58 |
| then add the imResObj to the resource dictionary: resObj = pdf_new_dict(); pdf_dict_puts(resObj, "XObject", imResObj) | 11:34.21 |
| then when you create the page you pass the resObj to the pdf_add_page call | 11:35.39 |
| you do not need the pdf_add_object_drop(ctx, doc, imgObj) call | 11:36.21 |
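Collected in one place, the steps above look roughly like the sketch below. This is an illustration only: the exact argument lists vary between MuPDF versions (check include/mupdf/pdf.h and docs/examples/pdf-create.js), and `ctx`, `doc`, `image`, `mediabox` and `contents` are assumed to be set up elsewhere.

```c
/* sketch: wire an image into a page's resource dictionary
   (MuPDF C API; signatures approximate, verify against your version) */
pdf_obj *imObj = pdf_add_image(ctx, doc, image);   /* the image XObject */

pdf_obj *imResObj = pdf_new_dict(ctx, doc, 1);
pdf_dict_puts(ctx, imResObj, "Im0", imObj);        /* same name as in the content stream */

pdf_obj *resObj = pdf_new_dict(ctx, doc, 1);
pdf_dict_puts(ctx, resObj, "XObject", imResObj);

/* content stream e.g. "q 200 0 0 200 50 100 cm /Im0 Do Q", then: */
pdf_obj *page = pdf_add_page(ctx, doc, &mediabox, 0, resObj, contents);
pdf_insert_page(ctx, doc, -1, page);
```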
| Guest52101: the "200 0 0 200 50 100 cm" line is equivalent to fz_matrix ctm = { 200, 0, 0, 200, 50, 100 } and passing that to fz_fill_image if using the pdf document writer | 11:38.06 |
| passing &fz_identity to fz_fill_image draws a 1-pixel by 1-pixel sized image in the upper left corner | 11:38.38 |
| BTW, the coordinate systems between the two approaches differ -- with the pdf document writer interface +Y is descending and the origin (0,0) is in the top left | 11:39.27 |
| in PDF content streams, +Y is ascending and the origin is in the bottom left | 11:39.42 |
avih | tor8: hey :) so what do you think of mujs' regex match and its stack allocation of 280k? i'd have written a patch for it but it's not clear to me how that spawn works. it seems to be modifying the js PC and i don't really follow the code flow there... | 12:04.11 |
| a trivial solution would be to allocate it on the heap instead, and another approach would be to allocate an array of pointers rather than of structs, and populate it as the need grows. | 12:05.59 |
tor8 | avih: reducing the REG_MAXSUB is also a trivial fix | 12:07.14 |
avih | (each item of that array is, iirc, ~34 pointers and an int, which is 280 bytes on a 64-bit system, so 280k for 1000 items, while musl/alpine's default thread stack is 80k) | 12:07.16 |
tor8 | given that the Resub is part of the regex matching thread state | 12:07.36 |
avih | yes, but that will also limit the behavior | 12:07.41 |
| (it is also possible to patch it just for alpine and create the js threads with bigger stack - which worked too) | 12:08.36 |
tor8 | MAXSUB can easily be reduced to 10 and still stay within the ecma spec limits | 12:09.09 |
| actually, it should be 99 to be perfectly compliant | 12:09.38 |
| but I certainly hope no-one uses more than 9 captures | 12:09.48 |
avih | re smaller array, iirc i already bumped into the limit at least once (not sure if it's this array, but a regex limit nevertheless) when trying Babel - an ES6 to ES5 transpiler | 12:10.06 |
tor8 | doing a malloc/free for each match is also possible | 12:10.39 |
| the spawn function just clones the current thread's state to fork | 12:12.11 |
| the regex matcher runs all possible branches of a regex program in parallel. | 12:13.09 |
| it is not a backtracking matcher | 12:13.26 |
| now if you need smaller space, it would be pretty easy/trivial to write a backtracking matcher but it will have (like most fancy regex engines) pathological behaviour for certain classes of regex | 12:14.15 |
| a*a*a* for instance | 12:14.19 |
| you could write a backtracking matcher using the same bytecode program | 12:14.50 |
| 80k stack is very tight | 12:17.23 |
avih | it is, but surprisingly relatively few packages need bigger stacks. sometimes it's improved upstream because it makes sense, sometimes an alpine-specific patch is applied to the package, increasing the stack for a specific thread or for all of a program's threads. | 12:21.23 |
| it's easy to get spoiled when glibc's default stack is 8M :) | 12:22.26 |
| but it has visible advantages. other than using the heap for what it's designed for, alpine memory usage is extremely low. booted without X, with bash and some httpd services, it uses less than 30M of ram. run xfce with a nice theme and it's still less than 100M of ram, for instance. | 12:25.07 |
tor8 | avih: there are two commits on the top of tor/wip you could give a spin | 12:38.40 |
| I haven't tested them myself yet, beyond verifying that it actually compiles | 12:38.53 |
avih | thanks. my regex use case is very limited (and i'm far from a regex guru). but i can check that it still works. | 12:39.50 |
tor8 | yeah, there could be some silly typo or two in there | 12:40.07 |
avih | tor8: hmm.. did you really prefer to implement a backtracking matcher over handling allocation a bit differently? | 12:43.11 |
| also at the first commit message s/Invorke/Invoke/ | 12:53.03 |
| tor8: i have a feeling matchbt recurses for each new char of the regex expression. if it's the same with non-bt, then maybe it also spawns more threads than you expect it to. | 13:03.52 |
| (i added printf at the beginning of matchbt which prints sp, and there are a LOT of prints) | 13:04.42 |
| (for simple regex) | 13:04.48 |
| sorry, it recurses for every char of the tested string. so if you test a 1k string, as far as i can tell you're going to get a 1000-deep recursion | 13:13.18 |
| this can't be good | 13:13.43 |
tor8 | avih: it recurses for every *split* in the regex | 13:25.45 |
| i.e. each time it needs to branch | 13:25.52 |
avih | tor8: is it expected to recurse every new char of the tested string? | 13:26.11 |
tor8 | and a simple regex needs to be tested for each position in the string, so it needs to split for each character in 'top' loop | 13:26.22 |
avih | it doesn't seem right to me TBH | 13:26.27 |
| tor8: is the recursion depth equivalent to the number of items used in the Rethread array? | 13:27.08 |
tor8 | the regex /foo/ is compiled to the same program as /^.*foo/ | 13:27.10 |
avih | because this seems to me like it could hit memory limits very quickly when you search a substring in a big-ish string | 13:27.52 |
tor8 | avih: gcc -DTEST regexp.c utf*.c | 13:28.18 |
| then run ./a.out 'foo' and see the program that it compiles | 13:28.29 |
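The compiled program for /foo/ looks roughly like the sketch below (instruction numbers illustrative; the real -DTEST output may differ in detail). The leading split/any/jmp loop is what makes it equivalent to /^.*foo/:

```
0: split 3 1   ; fork: try the literal at 3, or consume a char at 1 and loop
1: any
2: jmp 0
3: char 'f'
4: char 'o'
5: char 'o'
6: match
```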
avih | i see it but can't interpret it. | 13:29.55 |
| regardless, i find it really hard to believe that substring match in a string of 5k will use 5k threads or recursion depth | 13:30.25 |
tor8 | "split 3 1" means to fork and continue matching from both 3 and 1 | 13:30.47 |
| in matchbt, try swapping pc->x and pc->y in case I_SPLIT | 13:31.40 |
| so we recurse on the pc->y and iterate on pc->x instead | 13:32.12 |
avih | replaced, the DTEST output looks the same. gonna try in my use case. | 13:33.08 |
| recursion depth seems the same to me | 13:33.59 |
tor8 | the output should be the same, the recursion should be different | 13:34.12 |
avih | i added/have printf at the beginning of matchbt, so every print is equivalent to entry. it looks the same | 13:35.20 |
| (but indeed it doesn't count depth. just entries) | 13:35.54 |
| actually not exactly the same. it's about half as deep in some cases. | 13:36.53 |
tor8 | with x/y reversed, it stops earlier | 13:37.25 |
avih | or half entries. but still entry per char of the tested string. only previously i also saw the "back", and now it indeed stops earlier | 13:37.43 |
| (i.e. on some cases it was 2 * strlen and now it's "just" strlen entries) | 13:38.18 |
tor8 | avih: yes, it calls the function once for each character, but the depth in the two cases should differ | 13:38.31 |
| in the swapped x/y they should be flat | 13:38.44 |
avih | hmm.. ok. let me count the depth too then (i do believe you of course :) ) | 13:39.06 |
tor8 | just add printf('}') before the return statements | 13:39.59 |
| https://pastebin.com/raw/5M1RZ6vS | 13:41.22 |
| think of the 'split' instruction as a combined if-else instruction | 13:43.09 |
avih | yes, i can see. what are the x and y? progression in the regex and the tested string? | 13:43.25 |
tor8 | the x and y are the two branches of the if-else chain | 13:43.47 |
avih | (i added a depth argument and just print it) | 13:43.53 |
tor8 | with the swapped x/y it should be pretty similar in behavior to say PCRE | 13:44.49 |
| in terms of stack use and recursion etc | 13:44.57 |
avih | tor8: was the non bt implementation also spawning as many threads as the bt implementation recursion depth? | 13:45.03 |
| so it couldn't for instance test a 2k string? | 13:45.46 |
tor8 | the parallel implementation spawns "threads" and reuses them as they die | 13:46.08 |
| so each 'split' instruction it encounters will just add a new 'thread' to the queue | 13:47.02 |
| instead of recursing here and now | 13:47.09 |
| and once the current thread has died, it pops the next spawned thread off the queue and runs that | 13:47.33 |
avih | are they actual threads? and wouldn't the initial run while testing 2k string still hit the limit? | 13:47.53 |
tor8 | this has *nothing* to do with system threads | 13:48.05 |
avih | ok, that explains some things. so just "initial state which needs to be exhausted"? | 13:48.43 |
| anyway, how would you describe the recursion depth now roughly in terms of complexity? (O(...) ) | 13:49.52 |
tor8 | the regex matcher is a sort of VM | 13:49.54 |
| with very specialised 'threads' where each fork of the regex spawns a new clone of the program for matching | 13:50.25 |
avih | yeah, i sort of realized that. | 13:50.26 |
tor8 | then it just runs them sequentially until they die | 13:50.40 |
avih | so Rethread size is sort of the regex vm stack? | 13:51.10 |
tor8 | the instruction set doesn't have a 'branch' instruction like normal programs, it does branching by forking and running the branch in a new thread | 13:51.29 |
| rethread size is the max number of threads that can be queued | 13:51.40 |
avih | and typically each iteration queues one? | 13:52.14 |
| (i guess i should just read it if i really want to understand how it works, rather than wasting your time...) | 13:52.58 |
tor8 | avih: https://swtch.com/~rsc/regexp/regexp1.html | 13:53.53 |
avih | and the recursion approach and/or the x/y swap changes from one paradigm to the other? | 13:55.46 |
tor8 | neither, really. this just explains the approach taken by my implementation compared to a naive regex implementation. | 13:57.38 |
avih | so yours is Thompson's NFA? | 13:58.26 |
tor8 | ah crap, something's wrong with the matchbt parenthesis capturing | 14:00.15 |
| it's based on the same algorithm | 14:00.24 |
| https://swtch.com/~rsc/regexp/regexp2.html is the followup, which is what I modeled the implementation on | 14:01.06 |
avih | i actually have very good testcases for regex, other than babel. highlightjs and prism are syntax highlighters based on regex and written in js. i'm using them elsewhere in my own source-highlight replacement which i've written in node.js, and duktape is able to run them while mujs wasn't. maybe it'll be fixed now :) | 14:04.09 |
| (in mujs iirc one of them reached depth limit and the other you claimed is invalid regex) | 14:05.06 |
| (or maybe duktape was only able to run one of them - and slow, but it definitely got further than mujs) | 14:06.32 |
tor8 | avih: gotta go, back in a few hours | 14:11.04 |
avih | k, thanks for your time | 14:11.13 |
| tor8: fwiw, without the x-y swap, it indeed explodes very quickly (on alpine and elsewhere). with the xy swap it works nicely for 500k x "hello world " and match(/hello/g).length | 17:32.55 |
tor8 | avih: there's a fixed version on tor/wip | 17:32.58 |
avih | it also happens to be almost 2x faster than the "threads" code | 17:33.09 |
tor8 | I'm thinking of implementing the variant that executes the threads in lock-step which should let us reduce the stack use even more | 17:33.18 |
avih | tor8: and interestingly, your wip branch as is, with the fast array, is yet another ~6x faster than the recursion (and xy swap) | 17:33.52 |
tor8 | the "threads" code uses a fixed amount of stack, the backtracking one can explode the C stack due to recursion | 17:34.00 |
avih | i understand this, yes, though you could limit the length artificially too | 17:34.26 |
tor8 | the fast array stuff is *really* incomplete :) | 17:34.29 |
avih | i accidentally checked out tor8/wip rather than just cherry-picking the top commits, so i noticed :p | 17:34.59 |
tor8 | the 'threads' implementation could probably be somewhat optimized | 17:35.05 |
avih | anyway, the recursive approach with the xy swap seems very good, and didn't explode on alpine (though i didn't try the "bad" split cases). and it should be easy to artificially limit the depth too | 17:35.57 |
| however, assuming the "thread" approach is better, why not just allocate this array on the heap? | 17:36.36 |
tor8 | a lockstep non-backtracking matcher (which is possible, I just never got around to it) will run in O(n * m) speed (where n=regexp length, m=search string length) | 17:37.15 |
avih | it's a fixed size array; even with consecutive calls to match, the allocator should easily reuse the just-released space without much penalty imo | 17:37.45 |
tor8 | it might be a constant factor slower for normal cases but should cope much more gracefully with pathological cases | 17:37.57 |
| and the MAXTHREADS in a lockstep implementation need only ever be the same length as the regex program | 17:38.18 |
avih | what about the "crap matching parenthesis" thingy you mentioned earlier? | 17:38.52 |
tor8 | so a lockstep implementation could allocate the exact amount needed on the heap | 17:38.55 |
| avih: I fixed that bug on tor/wip | 17:39.11 |
avih | oh. sec | 17:39.17 |
| tor8: i don't think i see it. is it part of the recursive code which isn't at the non recursive code? | 17:40.44 |
| (i.e. the threaded code is still broken?) | 17:41.18 |
tor8 | avih: nothing is broken now. | 17:51.49 |
avih | tor8: so what was broken before? just the recursive code which you later fixed? | 17:52.19 |
tor8 | I had some typos in how the Resub state was handled | 17:52.27 |
| in the recursive code | 17:52.34 |
avih | at the new recursive code | 17:52.40 |
| k | 17:52.40 |
| tor8: something is wrong with the new squashed patch: 1. it doesn't enable recursive by default. 2. even if i set the two places with opts=REG_RECURSIVE it still doesn't use the recursive match. 3. even if i also set flags=REG_RECURSIVE at two places at jsB_new_RegExp (instead of 0), it STILL doesn't use the recursive match. | 18:46.49 |
| i don't think i can enable the recursive implementation for "hello".match(/h/g) | 18:47.14 |
| there's clearly a code path which still sets it to 0 (eflags at regexec is 0 after all the above changes). it needs a more convenient way to choose an implementation, i don't think the caller should control it but rather a flag at regex.c, and i really think it should be recursive by default, possibly with depth limitation like the array size limits it for the "threaded" implementation | 18:49.34 |
| or, just allocate the array on the heap and call it a day instead of the new implementation. | 18:50.28 |
| also, imo, until a patchset is reasonably stable you shouldn't squash or force-push. reflog is a lot less convenient to sift through than plain commit history | 18:52.28 |
Guest64308 | Hi tor8, thanks for the help. Now both methods show the image on the page. But, in both cases, the image is distorted and cropped. How do I show it properly? | 18:53.01 |
| I am using fz_image *image = fz_new_image_from_file(ctx, "2012_2.png"); | 18:53.53 |
avih | specifically, jsstring.c has a lot of places like this: if (js_regexec(re->prog, a, &m, a > text ? REG_NOTBOL : 0)) ... | 19:25.17 |
| where it uses 0 flags, which disables REG_RECURSIVE | 19:25.41 |
| these are implementation details, and should be an implementation flag outside of the callers' control, imo. or, at the very least, let callers control it via a specific api that changes the mode once, on init, rather than on every regexec call | 19:26.51 |
| or just make it a #define IMP_RECURSIVE at regex.c and then #ifdef out one of the implementations. | 19:34.33 |
Guest7162 | Hello everyone. Besides \mupdf\docs\, are there other examples in C for using mupdf? For instance, splitting pdfs, drawing images, compressing using JBIG2, etc? | 22:26.37 |
| Forward 1 day (to 2017/08/15)>>> | |