Log of #mupdf at irc.freenode.net.

20170814
tor8 sebras: Guest26768: (for the logs) mutool convert is *not* a good tool to use to split a document09:37.55 
  for the same reasons that using ghostscript and the pdfwrite device is not09:38.10 
  you want 'mutool clean -o subset.pdf input.pdf 5-10' to extract a subset of pages09:38.32 
  mutool convert recreates a new PDF file from the graphics... it's only going to bear a visual resemblance to the input file09:38.52 
  Guest57283: an image is drawn by filling the unit rectangle [0 0 1 1], which can be transformed using the matrix to fill whatever region of the page you desire09:41.27 
  when creating a PDF from scratch, you're right in that you're not building the resource dictionary properly09:43.11 
  you can look at the 'mutool run' example in docs/examples/pdf-create.js for an example you can adapt to C09:44.11 
Guest52101 Hi tor, thanks for the help. For drawing an image, is it better to add to the dictionary (my second attempt)?11:21.19 
tor8 Guest52101: depending on what you want to accomplish in the long run, using the pdf document writer or creating your own content stream are both valid options11:32.04 
  Guest52101: you create the resObj dictionary. in that one you need to create a second imResObj dictionary which has an entry for the imObj11:33.01 
  so if imObj = pdf_add_image() that has the image object11:33.11 
  then you need to create a new PDF dictionary imResObj = pdf_new_dict(). add the imObj using the same name as in your content stream: pdf_dict_puts(imResObj, "Im0", imObj)11:33.58 
  then add the imResObj to the resource dictionary: resObj = pdf_new_dict(); pdf_dict_puts(resObj, "XObject", imResObj)11:34.21 
  then when you create the page you pass the resObj to the pdf_add_page call11:35.39 
  you do not need the pdf_add_object_drop(ctx, doc, imgObj) call11:36.21 
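
Written out in raw PDF syntax, the steps above produce roughly the following. This is only a sketch: the object number 12 is made up for illustration, and the name Im0 is assumed to be the one used in the content stream.

```
% page resource dictionary (resObj) with the nested XObject dictionary (imResObj)
<< /XObject << /Im0 12 0 R >> >>

% page content stream: scale the unit rectangle to 200x200, move it to (50, 100), draw
q
200 0 0 200 50 100 cm
/Im0 Do
Q
```

The name after /XObject must match the name used with the Do operator, otherwise the viewer cannot find the image.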
  Guest52101: the "200 0 0 200 50 100 cm" line is equivalent to fz_matrix ctm = { 200, 0, 0, 200, 50, 100 } and passing that to fz_fill_image if using the pdf document writer11:38.06 
  passing &fz_identity to fz_fill_image draws a 1-pixel by 1-pixel sized image in the upper left corner11:38.38 
  BTW, the coordinate systems between the two approaches differ -- with the pdf document writer interface +Y is descending and the origin (0,0) is in the top left11:39.27 
  in PDF content streams, +Y is ascending and the origin is in the bottom left11:39.42 
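
To make the matrix and coordinate remarks concrete, here is a minimal C sketch. It is not MuPDF code: the `matrix` and `point` types and both helpers are hypothetical stand-ins (MuPDF has its own fz_matrix and fz_point). The arithmetic is the usual PDF affine transform, plus a conversion between a bottom-up and a top-down y coordinate for a given page height.

```c
#include <assert.h>

/* Hypothetical stand-ins for illustration; not MuPDF's own types. */
typedef struct { float a, b, c, d, e, f; } matrix;
typedef struct { float x, y; } point;

/* PDF-style affine transform: x' = a*x + c*y + e, y' = b*x + d*y + f. */
static point transform_point(point p, matrix m)
{
    point r;
    r.x = p.x * m.a + p.y * m.c + m.e;
    r.y = p.x * m.b + p.y * m.d + m.f;
    return r;
}

/* Convert a y coordinate between "origin at bottom, +Y up" and
 * "origin at top, +Y down" for a page of the given height.
 * The same formula works in both directions. */
static float flip_y(float y, float page_height)
{
    assert(page_height > 0.0f);
    return page_height - y;
}
```

With the matrix { 200, 0, 0, 200, 50, 100 }, the corner (0,0) of the unit rectangle lands on (50,100) and the corner (1,1) on (250,300), so the image covers a 200 by 200 region. With the identity matrix the unit rectangle stays 1 by 1, which is why the image shows up as a single tiny dot.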
avih tor8: hey :) so what do you think of mujs' regex match and its stack allocation of 280k? i'd have written a patch for it but it's not clear to me how that spawn works. it seems to be modifying the js PC and i don't really follow the codeflow there...12:04.11 
  a trivial solution would be to allocate it on the heap instead, and another approach would be to allocate an array of pointers rather than of structs, and populate it as the need grows.12:05.59 
tor8 avih: reducing the REG_MAXSUB is also a trivial fix12:07.14 
avih (each item of that array is, iirc ~34 pointers and an int, which is 280 bytes on a 64-bit system, so 280k for 1000 items, while the musl/alpine default stack is 80k)12:07.16 
tor8 given that the Resub is part of the regex matching thread state12:07.36 
avih yes, but that will also limit the behavior12:07.41 
  (it is also possible to patch it just for alpine and create the js threads with bigger stack - which worked too)12:08.36 
tor8 MAXSUB can easily be reduced to 10 and still stay within the ecma spec limits12:09.09 
  actually, it should be 99 to be perfectly compliant12:09.38 
  but I certainly hope no-one uses more than 9 captures12:09.48 
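
The size arithmetic in this exchange can be reproduced with a small C sketch. The struct layout below is an assumption chosen to match "34 pointers and an int" (a program counter, a string pointer, and 16 capture pairs); it is not copied from the mujs source, and the names Capture, Captures and Thread are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAXSUB 16       /* assumed capture limit: 16 pairs = 32 pointers */
#define MAXTHREAD 1000  /* the "1000 items" from the discussion */

/* One capture group: start and end pointer into the subject string. */
typedef struct { const char *sp, *ep; } Capture;

/* Capture state carried by every regex "thread". */
typedef struct {
    int nsub;
    Capture sub[MAXSUB];
} Captures;

/* One queued thread: where it is in the program and in the string. */
typedef struct {
    const void *pc;
    const char *sp;
    Captures sub;
} Thread;

/* Size of one thread if the capture limit were lowered to maxsub. */
static size_t thread_bytes_for(size_t maxsub)
{
    assert(maxsub <= MAXSUB);
    return sizeof(Thread) - (MAXSUB - maxsub) * sizeof(Capture);
}

/* Size of the whole fixed-size thread array for that capture limit. */
static size_t thread_array_bytes(size_t maxsub)
{
    return thread_bytes_for(maxsub) * MAXTHREAD;
}
```

On a typical 64-bit target (8-byte pointers, 4-byte int) this layout gives 280 bytes per item and 280000 bytes for the array. With a capture limit of 10 the same assumed layout gives 184 bytes per item.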
avih re smaller array, iirc i already bumped into the limit at least once (not sure if it's this array, but a regex limit nevertheless) when trying babel - es6 to es5 transpiler12:10.06 
tor8 doing a malloc/free for each match is also possible12:10.39 
  the spawn function just clones the current thread's state to fork12:12.11 
  the regex matcher runs all possible branches of a regex program in parallel.12:13.09 
  it is not a backtracking matcher12:13.26 
  now if you need smaller space, it would be pretty easy/trivial to write a backtracking matcher but it will have (like most fancy regex engines) pathological behaviour for certain classes of regex12:14.15 
  a*a*a* for instance12:14.19 
  you could write a backtracking matcher using the same bytecode program12:14.50 
  80k stack is very tight12:17.23 
avih it is, but surprisingly relatively few packages need bigger stacks. sometimes it's improved upstream because it makes sense, sometimes an alpine specific patch is applied to the package by increasing a specific or all threads' stack for a program.12:21.23 
  it's easy to get spoiled when glibc's default stack is 8M :)12:22.26 
  but it has visible advantages. other than using the heap for what it's designed for, alpine memory usage is extremely low. booted without X, in bash and some httpd services, it uses less than 30M ram. run xfce with a nice theme and it's still less than 100M ram. for instance.12:25.07 
tor8 avih: there are two commits on the top of tor/wip you could give a spin12:38.40 
  I haven't tested them myself yet, beyond verifying that it actually compiles12:38.53 
avih thanks. my regex use case is very limited (and i'm far from a regex guru). but i can check that it still works.12:39.50 
tor8 yeah, there could be some silly typo or two in there12:40.07 
avih tor8: hmm.. did you really prefer to implement a backtracking matcher over handling allocation a bit differently?12:43.11 
  also at the first commit message s/Invorke/Invoke/12:53.03 
  tor8: i have a feeling matchbt recurses for each new char of the regex expression. if it's the same with non-bt, then maybe it also spawns more threads than you expect it to.13:03.52 
  (i added printf at the beginning of matchbt which prints sp, and there are a LOT of prints)13:04.42 
  (for simple regex)13:04.48 
  sorry, it recurses for every char of the tested string. so if you test a 1k string, as far as i can tell you're gonna have 1000 deep recursion13:13.18 
  this can't be good13:13.43 
tor8 avih: it recurses for every *split* in the regex13:25.45 
  i.e. each time it needs to branch13:25.52 
avih tor8: is it expected to recurse every new char of the tested string?13:26.11 
tor8 and a simple regex needs to be tested for each position in the string, so it needs to split for each character in 'top' loop13:26.22 
avih it doesn't seem right to me TBH13:26.27 
  tor8: is a recursion depth equivalent to number of items used in the Rethread array?13:27.08 
tor8 the regex /foo/ is compiled to the same program as /^.*foo/13:27.10 
avih the*13:27.13 
  because this seems to me like it could hit memory limits very quickly when you search a substring in a big-ish string13:27.52 
tor8 avih: gcc -DTEST regexp.c utf*.c13:28.18 
  then run ./a.out 'foo' and see the program that it compiles13:28.29 
avih i see it but can't interpret it.13:29.55 
  regardless, i find it really hard to believe that substring match in a string of 5k will use 5k threads or recursion depth13:30.25 
tor8 "split 3 1" means to fork and continue matching from both 3 and 113:30.47 
  in matchbt, try swapping pc->x and pc->y in case I_SPLIT13:31.40 
  so we recurse on the pc->y and iterate on pc->x instead13:32.12 
avih replaced, the DTEST output looks the same. gonna try in my use case.13:33.08 
  recursion depth seems the same to me13:33.59 
tor8 the output should be the same, the recursion should be different13:34.12 
avih i added/have printf at the beginning of matchbt, so every print is equivalent to entry. it looks the same13:35.20 
  (but indeed it doesn't count depth. just entries)13:35.54 
  actually not exactly the same. it's about half deep on some cases.13:36.53 
tor8 with x/y reversed, it stops earlier13:37.25 
avih or half entries. but still entry per char of the tested string. only previously i also saw the "back", and now it indeed stops earlier13:37.43 
  (i.e. on some cases it was 2 * strlen and now it's "just" strlen entries)13:38.18 
tor8 avih: yes, it calls the function once for each character, but the depth in the two cases should differ13:38.31 
  in the swapped x/y they should be flat13:38.44 
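
The depth difference can be seen with a toy backtracking matcher in C. This is a sketch, not the matchbt from tor/wip: the instruction set, the `Inst` type and the `recurse_on_y` flag are made up for illustration, and the log does not show which of the two orders corresponds to the original or the swapped code. The program is /foo/ compiled as an unanchored search, where "split 3 1" forks to 3 (try to match here) and 1 (skip one character and loop).

```c
#include <assert.h>
#include <stddef.h>

enum { OP_CHAR, OP_ANY, OP_SPLIT, OP_JUMP, OP_MATCH };

typedef struct { int op; int c; int x, y; } Inst;

static const Inst prog_foo[] = {
    { OP_SPLIT, 0, 3, 1 },   /* 0: split 3 1 */
    { OP_ANY,   0, 0, 0 },   /* 1: any       */
    { OP_JUMP,  0, 0, 0 },   /* 2: jump 0    */
    { OP_CHAR, 'f', 0, 0 },  /* 3: char f    */
    { OP_CHAR, 'o', 0, 0 },  /* 4: char o    */
    { OP_CHAR, 'o', 0, 0 },  /* 5: char o    */
    { OP_MATCH, 0, 0, 0 },   /* 6: match     */
};

static int max_depth; /* deepest recursion level seen in the last run */

/* Backtracking matcher: at a split, recurse into one branch and keep
 * iterating on the other. recurse_on_y picks which branch is recursed. */
static int match_bt(const Inst *prog, int pc, const char *sp,
                    int recurse_on_y, int depth)
{
    if (depth > max_depth)
        max_depth = depth;
    for (;;) {
        const Inst *i = &prog[pc];
        switch (i->op) {
        case OP_MATCH:
            return 1;
        case OP_CHAR:
            if (*sp != i->c) return 0;
            sp++; pc++;
            break;
        case OP_ANY:
            if (*sp == 0) return 0;
            sp++; pc++;
            break;
        case OP_JUMP:
            pc = i->x;
            break;
        case OP_SPLIT:
            if (recurse_on_y) {
                if (match_bt(prog, i->y, sp, recurse_on_y, depth + 1)) return 1;
                pc = i->x;
            } else {
                if (match_bt(prog, i->x, sp, recurse_on_y, depth + 1)) return 1;
                pc = i->y;
            }
            break;
        }
    }
}

/* Run the /foo/ program on subject and report the peak recursion depth. */
static int peak_depth(const char *subject, int recurse_on_y)
{
    assert(subject != NULL);
    max_depth = 0;
    match_bt(prog_foo, 0, subject, recurse_on_y, 0);
    return max_depth;
}
```

In both orders the function is entered about once per character of the subject, which matches what the printf counting showed. The difference is the nesting: recursing into the short "try to match here" branch returns almost immediately, so the depth stays flat, while recursing into the looping branch stacks one frame per character. Note that the choice also decides which branch is tried first, so in a real matcher it affects which match is found, not only the stack use.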
avih hmm.. ok. let me count the depth too then (i do believe you of course :) )13:39.06 
tor8 just add printf("}") before the return statements13:39.59 
  https://pastebin.com/raw/5M1RZ6vS13:41.22 
  think of the 'split' instruction as a combined if-else instruction13:43.09 
avih yes, i can see. what are the x and y? progression in the regex and the tested string?13:43.25 
tor8 the x and y are the two branches of the if-else chain13:43.47 
avih (i added a depth argument and just print it)13:43.53 
tor8 with the swapped x/y it should be pretty similar in behavior to say PCRE13:44.49 
  in terms of stack use and recursion etc13:44.57 
avih tor8: was the non bt implementation also spawning as many threads as the bt implementation recursion depth?13:45.03 
  so it couldn't for instance test a 2k string?13:45.46 
tor8 the parallel implementation spawns "threads" and reuses them as they die13:46.08 
  so each 'split' instruction it encounters will just add a new 'thread' to the queue13:47.02 
  instead of recursing here and now13:47.09 
  and once the current thread has died, it pops the next spawned thread off the queue and runs that13:47.33 
avih are they actual threads? and wouldn't the initial run while testing 2k string still hit the limit?13:47.53 
tor8 this has *nothing* to do with system threads13:48.05 
avih ok, that explains some things. so just "initial state which needs to be exhausted"?13:48.43 
  anyway, how would you describe the recursion depth now roughly in terms of complexity? (O(...) )13:49.52 
tor8 the regex matcher is a sort of VM13:49.54 
  with very specialised 'threads' where each fork of the regex spawns a new clone of the program for matching13:50.25 
avih yeah, i sort of realized that.13:50.26 
tor8 then it just runs them sequentially until they die13:50.40 
avih so Rethread size is sort of the regex vm stack?13:51.10 
tor8 the instruction set doesn't have a 'branch' instruction like normal programs, it does branching by forking and running the branch in a new thread13:51.29 
  rethread size is the max number of threads that can be queued13:51.40 
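
The "threads" idea described here can be sketched in C without any recursion. This is an illustration under assumptions, not the mujs matcher: the instruction set, the `Thread` type, the small MAXTHREAD value and the -1 overflow return are all made up, and capture state is left out entirely. A split does not recurse; it saves one branch in a fixed-size array and carries on with the other, and when the running thread dies the most recently saved one is picked up.

```c
#include <assert.h>
#include <stddef.h>

#define MAXTHREAD 8 /* hypothetical limit; max number of saved threads */

enum { OP_CHAR, OP_ANY, OP_SPLIT, OP_JUMP, OP_MATCH };

typedef struct { int op; int c; int x, y; } Inst;

/* /foo/ as an unanchored search, same shape as "split 3 1" in the log. */
static const Inst prog_foo[] = {
    { OP_SPLIT, 0, 3, 1 },
    { OP_ANY,   0, 0, 0 },
    { OP_JUMP,  0, 0, 0 },
    { OP_CHAR, 'f', 0, 0 },
    { OP_CHAR, 'o', 0, 0 },
    { OP_CHAR, 'o', 0, 0 },
    { OP_MATCH, 0, 0, 0 },
};

/* A saved "thread": a position in the program and in the string. */
typedef struct { int pc; const char *sp; } Thread;

/* Returns 1 on match, 0 on no match, -1 if the thread array overflows.
 * *peak receives the largest number of threads that were saved at once. */
static int match_threads(const Inst *prog, const char *s, int *peak)
{
    Thread ready[MAXTHREAD];
    int nready = 0;

    assert(s != NULL);
    ready[nready].pc = 0;
    ready[nready].sp = s;
    nready++;
    *peak = nready;

    while (nready > 0) {
        Thread t = ready[--nready]; /* pick up the next saved thread */
        int alive = 1;
        while (alive) {
            const Inst *i = &prog[t.pc];
            switch (i->op) {
            case OP_MATCH:
                return 1;
            case OP_CHAR:
                if (*t.sp != i->c) alive = 0;
                else { t.sp++; t.pc++; }
                break;
            case OP_ANY:
                if (*t.sp == 0) alive = 0;
                else { t.sp++; t.pc++; }
                break;
            case OP_JUMP:
                t.pc = i->x;
                break;
            case OP_SPLIT:
                if (nready >= MAXTHREAD) return -1;
                ready[nready].pc = i->y; /* save one branch for later */
                ready[nready].sp = t.sp;
                nready++;
                if (nready > *peak) *peak = nready;
                t.pc = i->x;             /* continue with the other */
                break;
            }
        }
    }
    return 0;
}
```

The C stack use is fixed by the size of the array, which is the trade-off discussed above: a bigger array costs stack space up front, a smaller one makes the overflow case easier to hit.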
avih and typically each iteration queues one?13:52.14 
  (i guess i should just read it if i really want to understand how it works, rather than wasting your time...)13:52.58 
tor8 avih: https://swtch.com/~rsc/regexp/regexp1.html13:53.53 
avih and the recursion approach and/or the x/y swap changes from one paradigm to the other?13:55.46 
tor8 neither, really. this just explains the approach taken by my implementation compared to a naive regex implementation.13:57.38 
avih so yours is Thompson's NFA?13:58.26 
tor8 ah crap, something's wrong with the matchbt parenthesis capturing14:00.15 
  it's based on the same algorithm14:00.24 
  https://swtch.com/~rsc/regexp/regexp2.html is the followup, which is what I modeled the implementation on14:01.06 
avih i actually have very good testcases for regex, other than babel. highlightjs and prism are syntax highlighters based on regex and written in js. i'm using them elsewhere in my own source-highlight replacement which i've written in node.js, and duktape is able to run them while mujs wasn't. maybe it'll be fixed now :)14:04.09 
  (in mujs iirc one of them reached depth limit and the other you claimed is invalid regex)14:05.06 
  (or maybe duktape was only able to run one of them - and slow, but it definitely got further than mujs)14:06.32 
tor8 avih: gotta go, back in a few hours14:11.04 
avih k, thanks for your time14:11.13 
  tor8: fwiw, without the x-y swap, it indeed explodes very quickly (on alpine and elsewhere). with the xy swap it works nicely for 500k x "hello world " and match(/hello/g).length17:32.55 
tor8 avih: there's a fixed version on tor/wip17:32.58 
avih it also happens to be almost 2x faster than the "threads" code17:33.09 
tor8 I'm thinking of implementing the variant that executes the threads in lock-step which should let us reduce the stack use even more17:33.18 
avih tor8: and interestingly, your wip branch as is, with the fast array, is yet another ~x6 faster than the recursion (and xy swap)17:33.52 
tor8 the "threads" code uses a fixed amount of stack, the backtracking one can explode the C stack due to recursion17:34.00 
avih i understand this, yes, though you could limit the length artificially too17:34.26 
tor8 the fast array stuff is *really* incomplete :)17:34.29 
avih i accidentally checked out tor8/wip rather than just cherry-picking the top commits, so i noticed :p17:34.59 
tor8 the 'threads' implementation could probably be somewhat optimized17:35.05 
avih anyway, the recursive approach with xy swap seems very good, and didn't explode on alpine, though without the "bad" split cases. and it should be easy to artificially limit the depth too17:35.57 
  however, assuming the "thread" approach is better, why not just allocate this array on the heap?17:36.36 
tor8 a lockstep non-backtracking matcher (which is possible, I just never got around to it) will run in O(n * m) speed (where n=regexp length, m=search string length)17:37.15 
avih it's a fixed size array, even with consecutive calls to match, the allocator should easily use the just-released space without much penalty imo17:37.45 
tor8 it might be a constant factor slower for normal cases but should cope much more gracefully with pathological cases17:37.57 
  and the MAXTHREADS in a lockstep implementation need only ever be the same length as the regex program17:38.18 
avih what about the "crap matching parenthesis" thingy you mentioned earlier?17:38.52 
tor8 so a lockstep implementation could allocate the exact amount needed on the heap17:38.55 
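
A lockstep matcher of the kind mentioned here can be sketched as follows. This is an illustrative C sketch in the style of the articles linked earlier, not code from mujs: the instruction set, the `List` type and MAXPROG are made up, it only matches at the start of the string, and it leaves out capture groups, which is the part that makes a real implementation harder. All live threads advance over the same input character together, and a per-step mark array ensures each program position is listed at most once, so the list never grows beyond the program length.

```c
#include <assert.h>
#include <string.h>

#define MAXPROG 32 /* hypothetical upper bound on program length */

enum { OP_CHAR, OP_ANY, OP_SPLIT, OP_JUMP, OP_MATCH };

typedef struct { int op; int c; int x, y; } Inst;

/* a*a*a*b, anchored at the start of the subject */
static const Inst prog_aaab[] = {
    { OP_SPLIT, 0, 1, 3 }, { OP_CHAR, 'a', 0, 0 }, { OP_JUMP, 0, 0, 0 },
    { OP_SPLIT, 0, 4, 6 }, { OP_CHAR, 'a', 0, 0 }, { OP_JUMP, 0, 3, 0 },
    { OP_SPLIT, 0, 7, 9 }, { OP_CHAR, 'a', 0, 0 }, { OP_JUMP, 0, 6, 0 },
    { OP_CHAR, 'b', 0, 0 }, { OP_MATCH, 0, 0, 0 },
};

typedef struct { int n; int pc[MAXPROG]; } List;

/* Add pc to the list, following jumps and splits right away.
 * mark[] guarantees every pc is visited at most once per step. */
static void add_thread(const Inst *prog, List *l, unsigned char *mark, int pc)
{
    if (mark[pc])
        return;
    mark[pc] = 1;
    switch (prog[pc].op) {
    case OP_JUMP:
        add_thread(prog, l, mark, prog[pc].x);
        break;
    case OP_SPLIT:
        add_thread(prog, l, mark, prog[pc].x);
        add_thread(prog, l, mark, prog[pc].y);
        break;
    default:
        l->pc[l->n++] = pc;
        break;
    }
}

/* Returns 1 on match, 0 otherwise; *peak is the longest list seen. */
static int match_lockstep(const Inst *prog, int nprog, const char *s, int *peak)
{
    List a, b, *cur = &a, *next = &b, *tmp;
    unsigned char mark[MAXPROG];
    int i;

    assert(nprog <= MAXPROG);
    cur->n = 0;
    memset(mark, 0, sizeof mark);
    add_thread(prog, cur, mark, 0);
    *peak = cur->n;

    for (;;) {
        next->n = 0;
        memset(mark, 0, sizeof mark);
        for (i = 0; i < cur->n; i++) {
            int pc = cur->pc[i];
            switch (prog[pc].op) {
            case OP_MATCH:
                return 1;
            case OP_CHAR:
                if (*s != 0 && *s == prog[pc].c)
                    add_thread(prog, next, mark, pc + 1);
                break;
            case OP_ANY:
                if (*s != 0)
                    add_thread(prog, next, mark, pc + 1);
                break;
            }
        }
        if (*s == 0 || next->n == 0)
            return 0;
        if (next->n > *peak)
            *peak = next->n;
        tmp = cur; cur = next; next = tmp;
        s++;
    }
}
```

Each input character costs at most one pass over the program, which is where the O(n * m) bound comes from, and a long run of "a" without a trailing "b" is rejected without the repeated retries a backtracking matcher would make on a*a*a*.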
  avih: I fixed that bug on tor/wip17:39.11 
avih oh. sec17:39.17 
  tor8: i don't think i see it. is it part of the recursive code which isn't at the non recursive code?17:40.44 
  (i.e. the threaded code is still broken?)17:41.18 
tor8 avih: nothing is broken now.17:51.49 
avih tor8: so what was broken before? just the recursive code which you later fixed?17:52.19 
tor8 I had some typos in how the Resub state was handled17:52.27 
  in the recursive code17:52.34 
avih at the new recursive code17:52.40 
  k17:52.40 
  tor8: something is wrong with the new squashed patch: 1. it doesn't enable recursive by default. 2. even if i set the two places with opts=REG_RECURSIVE it still doesn't use the recursive match. 3. even if i also set flags=REG_RECURSIVE at two places at jsB_new_RegExp (instead of 0), it STILL doesn't use the recursive match.18:46.49 
  i don't think i can enable the recursive implementation for "hello".match(/h/g)18:47.14 
  there's clearly a code path which still sets it to 0 (eflags at regexec is 0 after all the above changes). it needs a more convenient way to choose an implementation, i don't think the caller should control it but rather a flag at regex.c, and i really think it should be recursive by default, possibly with depth limitation like the array size limits it for the "threaded" implementation18:49.34 
  or, just allocate the array on the heap and call it a day instead of the new implementation.18:50.28 
  also, imo, until a patchset is reasonably stable, you shouldn't squash or force push. reflog is a lot less convenient to sift through than plain commit history18:52.28 
Guest64308 Hi tor8, thanks for the help. Now both methods show the image on the page. But, in both cases, the image is distorted and cropped. How do I show it properly?18:53.01 
  I am using fz_image *image = fz_new_image_from_file(ctx, "2012_2.png");18:53.53 
avih specifically, jsstring.c has a lot of places like this: if (js_regexec(re->prog, a, &m, a > text ? REG_NOTBOL : 0)) ...19:25.17 
  where it uses 0 flags, which disables REG_RECURSIVE19:25.41 
  it's implementation details, and should be an implementation flag outside of the control of the callers, imo. or, at the very least, let callers control it but via specific api to change its mode, once, on init, rather than on every regexec call19:26.51 
  or just make it a #define IMP_RECURSIVE at regex.c and then #ifdef out one of the implementations.19:34.33 
Guest7162 Hello everyone. Besides \mupdf\docs\, are there other examples in C for using mupdf? For instance, splitting pdfs, drawing images, compressing using JBIG2, etc?22:26.37 