| | 20170516 |
user875 | hey there! have you considered adding a back function? For example, if I click on a link that jumps to another page, being able to return to the page I came from? | 07:54.27 |
| (hopefully I have not overlooked it in the man page ...) | 07:54.53 |
kens | tor8 that new bug report about text extraction is 'interesting' | 08:59.49 |
tor8 | kens: yeah... to say the least | 09:08.06 |
kens | I cut the file down to just '2008' | 09:08.18 |
tor8 | in the body text, the digits come out to things like """% | 09:08.44 |
kens | Which is 0x32 0x30 0x30 0x38, all single bytes in the CMap | 09:08.50 |
tor8 | can you put up the cut down file? | 09:09.46 |
kens | If you look in the ToUnicode CMap it maps <0030> <20> <0032> <20> <0038> <20> | 09:09.49 |
| Sure | 09:09.53 |
| err, when I find it | 09:10.10 |
| ah found it | 09:11.28 |
| OK there it is | 09:12.19 |
| decompressed and reduced, still 700Kb | 09:12.28 |
tor8 | thanks | 09:12.36 |
kens | So when I run this file through GS it extracts the '2008' as ' ' which looks correct to me. | 09:13.06 |
tor8 | mupdf also extracts that one as " " | 09:13.47 |
| <span font="HAIWKP+FZDBSJW--GB1-0" wmode="0" trm="29.014 0 0 29.014"> | 09:13.59 |
| <g unicode=" " glyph="19" x="287.20976" y="993.08688" /> | 09:13.59 |
| <g unicode=" " glyph="17" x="301.65873" y="993.08688" /> | 09:13.59 |
| <g unicode=" " glyph="17" x="316.1077" y="993.08688" /> | 09:13.59 |
| <g unicode=" " glyph="25" x="330.55668" y="993.08688" /> | 09:13.59 |
| </span> | 09:13.59 |
kens | So I wonder if Acrobat is not using the ToUnicode CMap, but instead using maybe the GBK-EUC-H Encoding | 09:14.18 |
| Oh oops, I messed up slightly with that file. I had doubled up the <32> to <0032> | 09:14.46 |
| You'll want to delete the leading 00 from those text characters | 09:15.03 |
| I was just running an experiment there | 09:15.13 |
tor8 | if I search the mutool draw -Ftrace output for instances using the same font name | 09:15.19 |
| I get the same output | 09:15.25 |
| <span font="HAIWKP+FZDBSJW--GB1-0" wmode="0" trm="29.0135 0 0 29.0135"> | 09:15.29 |
| <g unicode=" " glyph="19" x="287.2106" y="993.08877" /> | 09:15.30 |
| <g unicode=" " glyph="17" x="301.66883" y="993.08877" /> | 09:15.30 |
| <g unicode=" " glyph="17" x="316.12705" y="993.08877" /> | 09:15.30 |
| <g unicode=" " glyph="25" x="330.58528" y="993.08877" /> | 09:15.30 |
kens | And yet, Acrobat gets '2008' :-( | 09:15.45 |
tor8 | maybe they ignore the ToUnicode? | 09:16.02 |
kens | Possibly | 09:16.12 |
| But they definitely don't usually ignore it, so how to know when they will ? | 09:16.24 |
| It does seem clear it's ignoring the ToUnicode | 09:18.02 |
tor8 | yeah, if I nuke the ToUnicode it comes out as 2008 | 09:18.10 |
kens | In MuPDF ? I haven't tried GS doing that, but I'm sure that's the result I would get | 09:18.33 |
tor8 | in object 39 | 09:18.52 |
kens | If you go back to the original file and copy/paste from Acrobat the line with 2008 in it, then it comes out as you would expect | 09:19.22 |
| 2-byte text is correct as well as single byte text | 09:19.38 |
tor8 | in mupdf with a nuked ToUnicode for that font | 09:20.06 |
| the text comes out correct for both the latin and han characters | 09:20.31 |
| with the ToUnicode entry intact, only the han characters come out correct | 09:21.02 |
kens | Really ? You just treat the Han as Unicode ? | 09:21.07 |
| When no ToUnicode I mean | 09:21.17 |
| It does appear that Acrobat is simply ignoring the ToUnicode CMap, but I cannot see any justification for doing that | 09:21.52 |
tor8 | in mupdf we use a builtin ToUnicode based on the CID system info | 09:21.55 |
kens | Oh OK | 09:22.08 |
tor8 | if the font does not have its own ToUnicode | 09:22.23 |
kens | The system info is OK, I wonder if Acrobat is doing that also | 09:22.27 |
| Given the Registry is Adobe and the Ordering is GB1 | 09:22.44 |
| I wonder what happens if I nuke that | 09:22.51 |
tor8 | I suspect it might decide not to even look at a ToUnicode if the system info is one of the CJK ones | 09:23.07 |
kens | If I garble the system info it still gets 2008 correct | 09:23.33 |
| Let's see what happens if I mess with the ToUnicode | 09:23.49 |
| LOL still gets 2008 correct | 09:24.02 |
| It's clearly not using the ToUnicode at all | 09:24.12 |
| Oh there are 2 orderings | 09:25.04 |
tor8 | could it base it on the /Encoding /GBK-EUC-H entry? | 09:25.33 |
kens | I tried garbling that too earlier | 09:25.46 |
tor8 | hm, the /CIDSet maybe? | 09:26.32 |
avih_ | tor8: morning! 1. i think toFixed(0) should not be an alias to toFixed(1) but it seems mujs behaves like it is. 2. i don't think mujs should print to stdout. e.g. on gc or "warning: function statements are not standard". maybe the solution to 2 is use stdout by default, but allow registering a print function (plain str, no va_list)? | 09:26.34 |
tor8 | I've never used that | 09:26.36 |
kens | Ah, if I mess up the Encoding then Acrobat throws an error | 09:26.38 |
avih | also, congrats on the gpl win? :) | 09:28.04 |
kens | Changing it to 83pv-RKSJ-H didn't change the Latin, I guess I need to put back the Han to test it | 09:28.17 |
| avih long way to go yet | 09:28.26 |
tor8 | avih: morning | 09:28.31 |
avih | as always :) | 09:28.35 |
| kens: better than losing right off the bat though... | 09:28.54 |
kens | Sure | 09:28.59 |
| It'll probably end up in an out-of-court settlement is my guess | 09:29.12 |
tor8 | avih: toFixed (and similar) are very simplistic implementations at the moment | 09:29.23 |
avih | that's one way to call it. "non standard" would be another :) | 09:29.49 |
| do you have a list yet of non compliant stuff? | 09:30.13 |
| kens: yeah, hopefully a nice one | 09:30.44 |
kens | Well we'll see. Stopping them infringing would be good, sends a message out if nothing else | 09:31.05 |
| tor8 looks like it is using the CIDSystemInfo | 09:31.37 |
| sorry, the Encoding I mean | 09:32.44 |
avih | kens: problem with settlements though, other than typically being shy on the details, is that they don't set a precedent :/ | 09:34.26 |
kens | If I garble the CIDSystemInfo for the font and descendant, then the text is OK, if I additionally change the Encoding from GBK-EUC-H to 83pv-RKSJ-H then the text disappears on rendering, and copy/paste returns garbage Latin text for the Han | 09:34.32 |
| avih, not legally | 09:34.41 |
tor8 | kens: with this knowledge in hand, I'm inclined to close the bug as "wontfix": it works as expected -- don't put a bad ToUnicode in the file if you want proper output. | 09:34.49 |
kens | But they tell infringers we will sue | 09:34.49 |
| tor8 I would agree completely | 09:35.00 |
| The ToUnicode is incorrect and that's where the problem comes from | 09:35.13 |
tor8 | it is good to know that Adobe uses the /Encoding only if it's a known encoding | 09:35.24 |
kens | You might want to put the information about the Encoding there, just to capture it | 09:35.28 |
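
The two lookup orders discussed above can be sketched as a toy model. This is only an illustration under stated assumptions: the tables, the function names, and the `KNOWN_ENCODINGS` set are hypothetical, and the "encoding first" order is what kens and tor8 infer about Acrobat from their experiments, not documented behaviour.

```python
# Toy model of the two lookup orders discussed in the log.
# All tables and names here are illustrative assumptions, not real PDF data.

REPLACEMENT = "\ufffd"

# '2008' as the single-byte character codes kens quotes.
CODES = [0x32, 0x30, 0x30, 0x38]

# The broken embedded ToUnicode: the digits map to U+0020 (space).
BAD_TOUNICODE = {0x30: " ", 0x32: " ", 0x38: " "}

# Hypothetical set of encodings the viewer has built-in tables for.
KNOWN_ENCODINGS = {"GBK-EUC-H"}


def builtin_unicode(code):
    """Stand-in for a built-in table: single-byte codes are treated as ASCII."""
    return chr(code) if 0x20 <= code <= 0x7E else REPLACEMENT


def extract_tounicode_first(codes, tounicode):
    """Embedded ToUnicode wins; the built-in table is only a fallback."""
    if tounicode is not None:
        return "".join(tounicode.get(c, REPLACEMENT) for c in codes)
    return "".join(builtin_unicode(c) for c in codes)


def extract_encoding_first(codes, tounicode, encoding):
    """A known /Encoding wins; the embedded ToUnicode is ignored in that case."""
    if encoding in KNOWN_ENCODINGS:
        return "".join(builtin_unicode(c) for c in codes)
    return extract_tounicode_first(codes, tounicode)


print(repr(extract_tounicode_first(CODES, BAD_TOUNICODE)))
print(repr(extract_tounicode_first(CODES, None)))
print(repr(extract_encoding_first(CODES, BAD_TOUNICODE, "GBK-EUC-H")))
```

With the bad ToUnicode in place, the first order yields spaces, which matches what MuPDF and GS extract; removing the ToUnicode, or preferring the known encoding, yields the digits.
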
kens | goes back to PDF gstates | 09:35.48 |
| more coffeee... | 09:36.52 |
avih | tor8: what about stdout? i don't think non debugging APIs should print to stdout. but i do find the info useful. both on gc and the function warning, even if those two clearly serve different functions. | 09:38.09 |
| (or maybe it's stderr, didn't check. but still the same) | 09:38.58 |
| so maybe warning and info print callbacks. that should cover nicely a need to print stuff | 09:39.45 |
| oh, i _think_ there's a bug in gc where it always prints. | 09:40.36 |
tor8 | kens: will do. thanks! | 09:41.13 |
| avih: yeah, that's probably just an oversight. | 09:41.23 |
| avih: the stdout/stderr printouts | 09:41.32 |
avih | yeah. the info in those prints is good, but i'm using mujs as a lib, with multiple threads, i don't want the lib to print stuff | 09:42.13 |
| if it goes through me then i can channel it correctly to log files or other outputs | 09:43.03 |
tor8 | avih: yes, I hear you. | 09:44.34 |
avih | (i'd say the same for dofile/dostring, but those are easily rewritable using other APIs) | 09:45.49 |
| (so i consider those "non production") | 09:47.08 |
kens | tor8 just for giggles, it looks like that PDF file was produced by 'PSPNT' which as far as I can tell is the Founder Rip. | 09:54.20 |
| Looks like they are calling it EagleRIP today | 09:56.31 |
tor8 | user875 (for the logs): to go back to before where you clicked the link, use the 't' key. | 10:08.06 |
| avih: there's a commit on mujs:tor/master that adds a report callback function | 10:11.48 |
avih | hmm.. so no warning/info? | 10:13.55 |
tor8 | I only found one warning message, and that one should strictly speaking be an error | 10:14.22 |
| function statements are not in the ES5 spec | 10:14.34 |
| but IMO they should be, and everything supports them | 10:14.48 |
avih | gc is pure info | 10:16.20 |
tor8 | every other message is informational | 10:17.45 |
avih | (i'm fine with the function statement thingy going into the "not a good thing happened" bucket) | 10:17.57 |
| oh, i see your point. | 10:18.21 |
| what is js_report good for? | 10:21.52 |
| (as public api) | 10:21.59 |
tor8 | client functions may also want to print warning/debug/report messages and have them go to the same place | 10:22.52 |
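
The idea in this exchange, a library that never prints directly but routes every message through a callback the embedder registers, can be sketched roughly as follows. The `Engine` class and its method names are hypothetical and written in Python for brevity; they are not the mujs API.

```python
import sys


class Engine:
    """Hypothetical library object; not the mujs API."""

    def __init__(self):
        # Default: write to stderr, so behaviour is unchanged if no callback is set.
        self._report = lambda message: print(message, file=sys.stderr)

    def set_report(self, callback):
        """Register a callable that takes one plain string."""
        self._report = callback

    def report(self, message):
        """Public entry point so client code can log through the same channel."""
        self._report(message)

    def collect_garbage(self):
        # Placeholder for real work; the library reports instead of printing.
        self.report("gc: finished")


log = []
engine = Engine()
engine.set_report(log.append)    # the embedder decides where messages go
engine.collect_garbage()         # library-originated message
engine.report("client: hello")   # client-originated message, same sink
```

Keeping `report` public is what lets client functions send their own messages to the same place, which is the use tor8 describes.
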
avih | also the regex overflow.. does the reason appear in an exception? otherwise that's a really useful info | 10:23.33 |
tor8 | it does not; I'm not sure quite how to tackle it | 10:24.12 |
| the regexp.c code is from a separate non-JS project and I've wanted to keep the files independent of mujs | 10:24.48 |
| otherwise it'd be easy :) | 10:24.52 |
avih | hmm.. :) | 10:25.05 |
| you mean the entire file is a verbatim copy from another project? that can't be true.. | 10:25.57 |
| or maybe it can. no "js" there | 10:26.31 |
tor8 | avih: it's also why the file doesn't have a 'js' prefix | 10:30.19 |
avih | maybe it should have some reporting facility which both the js and the other project could use? but you know your code better. all i know is that it's a i'd like to get that info as a user | 10:31.44 |
| s/it's a// | 10:32.40 |
| regex are already hard to debug. some code uses big structures with regex in them and the line number on such is not always useful (maybe i'll recheck after your line number fix) | 10:34.44 |
| (iirc it was a big object with many properties, where some of them were regex, and the error line number was on the last line of the struct) | 10:36.01 |
| (which had nothing but '}') | 10:36.21 |
tor8 | we only track lines by statement | 10:51.00 |
avih | tor8: thx for the lines thing. what does JS_ASTLIMIT serve other than using it because you can? as far as i can tell you don't pre-allocate anything depending on its size. a statement cannot be infinite by definition, so where does it become useful? | 13:06.48 |
| (i'd understand recursion limit which you can't predict if infinite or not, but i think it's not the case here) | 13:07.36 |
tor8 | avih: prevents stack smashing and segfaults when parsing ridiculously deep (maliciously constructed) expressions | 13:12.07 |
| something like "((((((((((((((((((((((" but for a few kilobytes more | 13:12.18 |
avih | wouldn't malloc just fail at some point? | 13:12.39 |
tor8 | it's the stack that runs out, not the heap | 13:12.51 |
avih | oh, it's on thee stack | 13:12.52 |
| -e | 13:12.59 |
tor8 | recursive descent parsing has its drawbacks | 13:13.02 |
avih | and before this patch? | 13:14.09 |
| stack overflow on malicious input? | 13:14.31 |
| (you could still overflow though, depending on your stack size. and musl for instance has a notoriously small default stack) | 13:15.12 |
| (iirc 2k by default and alpine linux for instance set it to 8k. _iirc_) | 13:15.57 |
tor8 | avih: that's why I made the JS_ASTLIMIT a #define | 13:16.09 |
| before this patch we'd segfault on such input | 13:16.20 |
avih | right | 13:16.26 |
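
A minimal sketch of the guard tor8 describes: a recursive descent parser that counts nesting depth and raises an ordinary error before the call stack can run out. The grammar, the class names, and the value of `AST_LIMIT` are assumptions made for illustration; this is not mujs code.

```python
AST_LIMIT = 100  # illustrative value, not claimed to match mujs


class ParseError(Exception):
    pass


class Parser:
    """Parses: expr := '(' expr ')' | digits"""

    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.depth = 0

    def parse(self):
        value = self.expr()
        if self.pos != len(self.text):
            raise ParseError("unexpected trailing input")
        return value

    def expr(self):
        # Each nesting level costs one stack frame, so cap the depth explicitly.
        self.depth += 1
        if self.depth > AST_LIMIT:
            raise ParseError("too much nesting")
        try:
            if self.peek() == "(":
                self.pos += 1
                value = self.expr()
                if self.peek() != ")":
                    raise ParseError("expected ')'")
                self.pos += 1
                return value
            return self.number()
        finally:
            self.depth -= 1

    def peek(self):
        return self.text[self.pos] if self.pos < len(self.text) else ""

    def number(self):
        start = self.pos
        while self.peek().isdigit():
            self.pos += 1
        if start == self.pos:
            raise ParseError("expected number")
        return int(self.text[start:self.pos])


def try_parse(text):
    """Return the parsed value, or the error message if parsing fails."""
    try:
        return Parser(text).parse()
    except ParseError as err:
        return str(err)
```

Here `try_parse("((42))")` returns the number, while `try_parse("(" * 5000)` returns the nesting error instead of exhausting the stack. As avih notes, the right limit still depends on the available stack size, which is why a tunable constant is the natural choice.
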
| it doesn't sound too little for hand written code for sure. but generated is harder to expect (minified/amalgamated/asm.js/etc) | 13:17.57 |
| s/expect/predict/ | 13:20.23 |
tor8 | nothing to do with minification or asm.js would add such deeply nested expressions | 13:22.00 |
| there's always a question of where to draw the line; if it turns out to be problematic we can just bump the magic number | 13:22.42 |
avih | true | 13:22.53 |
| how i do think it's related though. things which are compiled to js can use the language as they see fit, and nesting is a perfectly useful tool | 13:24.34 |
| however* | 13:24.39 |
| but anyway, cross the bridge when the time comes :) | 13:25.27 |