Log of #mupdf at irc.freenode.net.

20170516
user875 hi there! have you considered adding a back function? For example, if I click on a link that jumps to another page, I'd like to be able to return to the page I came from?07:54.27 
  (hopefully I have not overlooked it in the man page ...)07:54.53 
kens tor8 that new bug report about text extraction is 'interesting'08:59.49 
tor8 kens: yeah... to say the least09:08.06 
kens I cut the file down to just '2008'09:08.18 
tor8 in the body text, the digits come out to things like """%09:08.44 
kens Which is 0x32 0x30 0x30 0x38, all single bytes in the CMap09:08.50 
tor8 can you put up the cut down file?09:09.46 
kens If you look in the ToUnicode CMap it maps <0030> <20> <0032> <20> <0038> <20>09:09.49 
  Sure09:09.53 
  err, when I find it09:10.10 
  ah found it09:11.28 
  OK there it is09:12.19 
  decompressed and reduced, still 700Kb09:12.28 
tor8 thanks09:12.36 
kens So when I run this file through GS it extracts the '2008' as ' ' which looks correct to me.09:13.06 
tor8 mupdf also extracts that one as " "09:13.47 
  <span font="HAIWKP+FZDBSJW--GB1-0" wmode="0" trm="29.014 0 0 29.014">09:13.59 
  <g unicode=" " glyph="19" x="287.20976" y="993.08688" />09:13.59 
  <g unicode=" " glyph="17" x="301.65873" y="993.08688" />09:13.59 
  <g unicode=" " glyph="17" x="316.1077" y="993.08688" />09:13.59 
  <g unicode=" " glyph="25" x="330.55668" y="993.08688" />09:13.59 
  </span>09:13.59 
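The effect of the bad ToUnicode entries can be reproduced with a small sketch. It assumes the corrected single-byte mappings kens describes (`<30>`, `<32>` and `<38>`, each mapped to `<20>`); the `beginbfchar` wrapper, the helper names and the U+FFFD fallback are illustrative assumptions, not code from MuPDF or Ghostscript.

```python
import re

# Hypothetical bfchar section modelled on the mappings quoted in the log:
# three single-byte source codes, each mapped to 0x20 (space).
BAD_TOUNICODE = """
3 beginbfchar
<30> <20>
<32> <20>
<38> <20>
endbfchar
"""

def parse_bfchar(cmap_text):
    """Return {source code: character} for every '<src> <dst>' pair."""
    pairs = re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", cmap_text)
    return {int(src, 16): chr(int(dst, 16)) for src, dst in pairs}

def extract(codes, to_unicode):
    # Codes without an entry become U+FFFD; that fallback is an
    # assumption of this sketch, not documented behaviour.
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)

table = parse_bfchar(BAD_TOUNICODE)
text = extract(b"2008", table)  # bytes 0x32 0x30 0x30 0x38
print(repr(text))
```

Every digit of '2008' has an entry pointing at a space, so an extractor that trusts the ToUnicode CMap returns four spaces, which is what both GS and mupdf report.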
kens So I wonder if Acrobat is not using the ToUnicode CMap, but instead using maybe the GBK-EUC-H Encoding09:14.18 
  Oh oops, I messed up slightly with that file. I had doubled up the <32> to <0032>09:14.46 
  You'll want to delete the leading 00 from those text characters09:15.03 
  I was just running an experiment there09:15.13 
tor8 if I search the mutool draw -Ftrace output for instances using the same font name09:15.19 
  I get the same output09:15.25 
  <span font="HAIWKP+FZDBSJW--GB1-0" wmode="0" trm="29.0135 0 0 29.0135">09:15.29 
  <g unicode=" " glyph="19" x="287.2106" y="993.08877" />09:15.30 
  <g unicode=" " glyph="17" x="301.66883" y="993.08877" />09:15.30 
  <g unicode=" " glyph="17" x="316.12705" y="993.08877" />09:15.30 
  <g unicode=" " glyph="25" x="330.58528" y="993.08877" />09:15.30 
kens And yet, Acrobat gets '2008' :-(09:15.45 
tor8 maybe they ignore the ToUnicode?09:16.02 
kens Possibly09:16.12 
  But they definitely don't usually ignore it, so how to know when they will ?09:16.24 
  It does seem clear it's ignoring the ToUnicode09:18.02 
tor8 yeah, if I nuke the ToUnicode it comes out as '2008'09:18.10 
kens In MuPDF ? I haven't tried GS doing that, but I'm sure that's the result I would get09:18.33 
tor8 in object 3909:18.52 
kens If you go back to the original file and copy/paste from Acrobat the line with 2008 in it, then it comes out as you would expect09:19.22 
  2-byte text is correct as well as single byte text09:19.38 
tor8 in mupdf with a nuked ToUnicode for that font09:20.06 
  the text comes out correct for both the latin and han characters09:20.31 
  with the ToUnicode entry intact, only the han characters come out correct09:21.02 
kens Really ? You just treat the Han as Unicode ?09:21.07 
  When no ToUnicode I mean09:21.17 
  It does appear that Acrobat is simply ignoring the ToUnicode CMap, but I cannot see any justification for doing that09:21.52 
tor8 in mupdf we use a builtin ToUnicode based on the CID system info09:21.55 
kens Oh OK09:22.08 
tor8 if the font does not have its own ToUnicode09:22.23 
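The fallback tor8 describes (use the font's own ToUnicode if present, otherwise a built-in table chosen from the CID system info) can be sketched as below. The dictionary layout, function names and the three-entry table are hypothetical. The CID values 17, 19 and 25 are taken from the glyph numbers in the trace output and are assumed here to be CIDs for '0', '2' and '8'.

```python
# Toy stand-in for a built-in table keyed on (Registry, Ordering).
# Real tables are far larger; these three entries are illustrative only.
BUILTIN_TO_UNICODE = {
    ("Adobe", "GB1"): {17: "0", 19: "2", 25: "8"},
}

def pick_to_unicode(font):
    """Prefer the font's own ToUnicode, else fall back to a built-in table."""
    own = font.get("ToUnicode")
    if own is not None:
        return own
    info = font.get("CIDSystemInfo", {})
    key = (info.get("Registry"), info.get("Ordering"))
    return BUILTIN_TO_UNICODE.get(key, {})

def extract_text(cids, font):
    table = pick_to_unicode(font)
    return "".join(table.get(cid, "\ufffd") for cid in cids)

system_info = {"Registry": "Adobe", "Ordering": "GB1"}
bad_to_unicode = {17: " ", 19: " ", 25: " "}  # every digit mapped to a space

font_intact = {"ToUnicode": bad_to_unicode, "CIDSystemInfo": system_info}
font_nuked = {"CIDSystemInfo": system_info}   # ToUnicode entry removed

cids = [19, 17, 17, 25]
print(repr(extract_text(cids, font_intact)))
print(repr(extract_text(cids, font_nuked)))
```

With the ToUnicode entry intact the bad mapping wins and the digits come out as spaces; with it removed the built-in table is used and the text reads '2008', matching what tor8 reports for object 39.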
kens The system info is OK, I wonder if Acrobat is doing that also09:22.27 
  Given the Registry is Adobe and the Ordering is GB109:22.44 
  I wonder what happens if I nuke that09:22.51 
tor8 I suspect it might decide not to even look at a ToUnicode if the system info is one of the CJK ones09:23.07 
kens If I garble the system info it still gets 2008 correct09:23.33 
  Let's see what happens if I mess with the ToUnicode09:23.49 
  LOL still gets 2008 correct09:24.02 
  It's clearly not using the ToUnicode at all09:24.12 
  Oh there are 2 orderings09:25.04 
tor8 could it base it on the /Encoding /GBK-EUC-H entry?09:25.33 
kens I tried garbling that too earlier09:25.46 
tor8 hm, the /CIDSet maybe?09:26.32 
avih_ tor8: morning! 1. i think toFixed(0) should not be an alias to toFixed(1) but it seems mujs behaves like it is. 2. i don't think mujs should print to stdout. e.g. on gc or "warning: function statements are not standard". maybe the solution to 2 is use stdout by default, but allow registering a print function (plain str, no va_list)?09:26.34 
tor8 I've never used that09:26.36 
kens Ah, if I mess up the Encoding then Acrobat throws an error09:26.38 
avih also, congrats on the gpl win? :)09:28.04 
kens Changing it to 83pv-RKSJ-H didn't change the Latin, I guess I need to put back the Han to test it09:28.17 
  avih long way to go yet09:28.26 
tor8 avih: morning09:28.31 
avih as always :)09:28.35 
  kens: better than losing right off the bat though...09:28.54 
kens Sure09:28.59 
  It'll probably end up in an out-of-court settlement is my guess09:29.12 
tor8 avih: toFixed (and similar) are very simplistic implementations at the moment09:29.23 
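As a rough illustration of why `toFixed(0)` and `toFixed(1)` should differ, here is a sketch of the rounding rule as the ES5 definition of `Number.prototype.toFixed` reads: take the exact value of the double and, on a tie, pick the larger magnitude. The function name is hypothetical and the sketch ignores NaN, values of 1e21 and above, and negative zero.

```python
from decimal import Decimal, ROUND_HALF_UP

def to_fixed(x, digits=0):
    """Format x with exactly `digits` fraction digits, ties rounding up in magnitude."""
    exact = Decimal(x)                 # exact binary value of the float
    quantum = Decimal(10) ** -digits   # 1, 0.1, 0.01, ...
    return str(exact.quantize(quantum, rounding=ROUND_HALF_UP))

print(to_fixed(2.5, 0))
print(to_fixed(2.5, 1))
print(to_fixed(1.005, 2))
```

The first two calls give different strings for the same input, so treating zero digits as an alias for one digit changes the result. The third call shows why the exact binary value matters: 1.005 is stored slightly below the tie, so it rounds down.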
avih that's one way to call it. "non standard" would be another :)09:29.49 
  do you have a list yet of non compliant stuff?09:30.13 
  kens: yeah, hopefully a nice one09:30.44 
kens Well we'll see. Stopping them infringing would be good, sends a message out if nothing else09:31.05 
  tor8 looks like it is using the CIDSystemInfo09:31.37 
  sorry, the Encoding I mean09:32.44 
avih kens: problem with settlements though, other than typically being shy on the details, is that they don't set a precedent :/09:34.26 
kens If I garble the CIDSystemInfo for the font and descendant, then the text is OK, if I additionally change the Encoding from GBK-EUC-H to 83pv-RKSJ-H then the text disappears on rendering, and copy/paste returns garbage Latin text for the Han09:34.32 
  avih, not legally09:34.41 
tor8 kens: with this knowledge in hand, I'm inclined to close the bug as "wontfix": it works as expected -- don't put a bad ToUnicode in the file if you want proper output.09:34.49 
kens But they tell infringers we will sue09:34.49 
  tor8 I would agree completely09:35.00 
  The ToUnicode is incorrect and that's where the problem comes from09:35.13 
tor8 it is good to know that Adobe uses the /Encoding only if it's a known encoding09:35.24 
kens You might want to put the information about the Encoding there, just to capture it09:35.28 
kens goes back to PDF gstates09:35.48 
  more coffeee...09:36.52 
avih tor8: what about stdout? i don't think non debugging APIs should print to stdout. but i do find the info useful. both on gc and the function warning, even if those two clearly serve different functions.09:38.09 
  (or maybe it's stderr, didn't check. but still the same)09:38.58 
  so maybe warning and info print callbacks. that should cover nicely a need to print stuff09:39.45 
  oh, i _think_ there's a bug in gc where it always prints.09:40.36 
tor8 kens: will do. thanks!09:41.13 
  avih: yeah, that's probably just an oversight.09:41.23 
  avih: the stdout/stderr printouts09:41.32 
avih yeah. the info in those prints is good, but i'm using mujs as a lib, with multiple threads, i don't want the lib to print stuff09:42.13 
  if it goes through me then i can channel it correctly to log files or other outputs09:43.03 
tor8 avih: yes, I hear you.09:44.34 
avih (i'd say the same for dofile/dostring, but those are easily rewritable using other APIs)09:45.49 
  (so i consider those "non production")09:47.08 
kens tor8 just for giggles, it looks like that PDF file was produced by 'PSPNT' which as far as I can tell is the Founder Rip.09:54.20 
  Looks like they are calling it EagleRIP today09:56.31 
tor8 user875 (for the logs): to go back to before where you clicked the link, use the 't' key.10:08.06 
  avih: there's a commit on mujs:tor/master that adds a report callback function10:11.48 
avih hmm.. so no warning/info?10:13.55 
tor8 I only found one warning message, and that one should strictly speaking be an error10:14.22 
  function statements are not in the ES5 spec10:14.34 
  but IMO they should be, and everything supports them10:14.48 
avih gc is pure info10:16.20 
tor8 every other message is informational10:17.45 
avih (i'm fine with the function statement thingy going into the "not a good thing happened" bucket)10:17.57 
  oh, i see your point.10:18.21 
  what is js_report good for?10:21.52 
  (as public api)10:21.59 
tor8 client functions may also want to print warning/debug/report messages and have them go to the same place10:22.52 
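The report-callback idea discussed here can be sketched generically. The `Engine` class and its method names are hypothetical and are not the mujs API. The callback takes a plain string, as avih suggested, and the default prints to stderr, which is an assumption since the log leaves open whether the library used stdout or stderr.

```python
import sys

class Engine:
    """Hypothetical embeddable interpreter; not the mujs API."""

    def __init__(self):
        # Default sink: print to stderr (an assumption of this sketch).
        self._report = lambda message: print(message, file=sys.stderr)

    def set_report(self, callback):
        """Register a function taking one plain string."""
        self._report = callback

    def report(self, message):
        # Public, so client functions can send messages to the same place.
        self._report(message)

    def collect_garbage(self):
        # Informational message goes through the callback, not straight to a stream.
        self.report("gc: finished")

log = []
engine = Engine()
engine.set_report(log.append)   # host decides where messages go
engine.collect_garbage()
engine.report("client: custom warning")
print(log)
```

Because the library never writes to a stream itself once a callback is registered, a multi-threaded host can route each engine's messages to its own log file.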
avih also the regex overflow.. does the reason appear in an exception? otherwise that's a really useful info10:23.33 
tor8 it does not; I'm not sure quite how to tackle it10:24.12 
  the regexp.c code is from a separate non-JS project and I've wanted to keep the files independent of mujs10:24.48 
  otherwise it'd be easy :)10:24.52 
avih hmm.. :)10:25.05 
  you mean the entire file is a verbatim copy from another project? that can't be true..10:25.57 
  or maybe it can. no "js" there10:26.31 
tor8 avih: it's also why the file doesn't have a 'js' prefix10:30.19 
avih maybe it should have some reporting facility which both the js and the other project could use? but you know your code better. all i know is that it's a i'd like to get that info as a user10:31.44 
  s/it's a//10:32.40 
  regex are already hard to debug. some code uses big structures with regex in them and the line number on such is not always useful (maybe i'll recheck after your line number fix)10:34.44 
  (iirc it was a big object with many properties, where some of them were regex, and the error line number was on the last line of the struct)10:36.01 
  (which had nothing but '}')10:36.21 
tor8 we only track lines by statement10:51.00 
avih tor8: thx for the lines thing. what does JS_ASTLIMIT serve other than using it because you can? as far as i can tell you don't pre-allocate anything depending on its size. a statement cannot be infinite by definition, so where does it become useful?13:06.48 
  (i'd understand recursion limit which you can't predict if infinite or not, but i think it's not the case here)13:07.36 
tor8 avih: prevents stack smashing and segfaults when parsing ridiculously deep (maliciously constructed) expressions13:12.07 
  something like "((((((((((((((((((((((" but for a few kilobytes more13:12.18 
avih wouldn't malloc just fail at some point?13:12.39 
tor8 it's the stack that runs out, not the heap13:12.51 
avih oh, it's on thee stack13:12.52 
  -e13:12.59 
tor8 recursive descent parsing has its drawbacks13:13.02 
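A minimal sketch of the kind of guard tor8 describes: a recursive descent parser that counts nesting depth and fails cleanly past a fixed limit. `AST_LIMIT`, `ParseError` and the toy grammar (parentheses around a single digit) are assumptions for illustration; the actual value of JS_ASTLIMIT is not given in the log.

```python
AST_LIMIT = 100  # hypothetical value; the real JS_ASTLIMIT is not stated in the log

class ParseError(Exception):
    pass

def parse(src):
    """Parse nested parentheses around one digit; return the nesting depth."""
    pos = 0
    depth = 0

    def expr():
        nonlocal pos, depth
        depth += 1
        if depth > AST_LIMIT:
            # Fail with a normal error instead of exhausting the stack.
            raise ParseError("expression nested too deeply")
        try:
            if pos < len(src) and src[pos] == "(":
                pos += 1
                inner = expr()  # one stack frame per '('
                if pos >= len(src) or src[pos] != ")":
                    raise ParseError("expected ')'")
                pos += 1
                return inner + 1
            if pos < len(src) and src[pos].isdigit():
                pos += 1
                return 0
            raise ParseError("unexpected input")
        finally:
            depth -= 1

    result = expr()
    if pos != len(src):
        raise ParseError("trailing input")
    return result

print(parse("((1))"))
```

Each opening parenthesis costs one stack frame, so without the counter the maximum input depth is set by the stack size rather than by the parser. In Python the unguarded version would end in a RecursionError; in C the equivalent is the segfault mentioned in the log.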
avih and before this patch?13:14.09 
  stack overflow on malicious input?13:14.31 
  (you could still overflow though, depending on your stack size. and musl for instance has a notoriously small default stack)13:15.12 
  (iirc 2k by default and alpine linux for instance set it to 8k. _iirc_)13:15.57 
tor8 avih: that's why I made the JS_ASTLIMIT a #define13:16.09 
  before this patch we'd segfault on such input13:16.20 
avih right13:16.26 
  it doesn't sound too little for hand written code for sure. but generated is harder to expect (minified/amalgamated/asm.js/etc)13:17.57 
  s/expect/predict/13:20.23 
tor8 nothing to do with minification or asm.js would add such deeply nested expressions13:22.00 
  there's always a question of where to draw the line; if it turns out to be problematic we can just bump the magic number13:22.42 
avih true13:22.53 
  how i do think it's related though. things which are compiled to js can use the language as they see fit, and nesting is a perfectly useful tool13:24.34 
  however*13:24.39 
  but anyway, cross the bridge when the time comes :)13:25.27 