| <<<Back 1 day (to 2017/05/22) | 20170523 |
tor8 | avih: morning. | 08:58.41 |
| finally got some time to look at your patch | 08:58.47 |
| adding a special fast case for integers in jsV_numbertostring does speed things up quite a bit | 08:59.10 |
| even on top of the grisu2 code | 08:59.21 |
| avih: is there a particular reason for the isnormal() and range checks, rather than just if ((double)(int)f == f) ? | 09:00.49 |
avih | tor8: i didn't want to handle anything around 0. for instance, without it, that fast path would be entered for -0.0 and i preferred to leave that to the more complete code (which in your case handled it a the ==0 test, and i'm not sure it's right). the <= INT_MAX is required since because casting an out of range double to int is undefined behavior | 09:37.07 |
| and there are also subnormals, which i preferred not handle. i.e. i took the easy case of numbers which are non-0 integers | 09:39.12 |
| (the <= INT_MAX also validly rule out nan and inf, and otherwise casting any of them to int is also UB) | 09:42.06 |
tor8 | implementation defined, or undefined? | 09:43.33 |
avih | (but isnormal filters them too, so the isnormal part only addition is for things around 0) | 09:43.51 |
| i believe undefined | 09:44.09 |
tor8 | then range checks and (double)(int)f == f should be enough | 09:46.40 |
| that should take care of subnormals too | 09:46.49 |
avih | how so? or maybe add range also > 0 ? | 09:47.08 |
| everything around 0 is dodgy though. i think (f > 0) can still be subnormal, depending on compiler/cpu modes | 09:48.43 |
tor8 | in jsV_numbertostring we don't care about the signedness of 0 | 09:49.34 |
avih | and also not about subnormals? if both are correct, then it just needs d >= INT_MIN && d <= INT_MAX && d == (int)d | 09:50.41 |
| actually, no. it must be non negative for js_itoa. so >= 0 && <= INT_MAX would be enough | 09:51.44 |
tor8 | I can change js_itoa to allow signed numbers. | 09:52.44 |
| In fact, I did that as part of this change. | 09:52.52 |
avih | i considered it too, but didn't know what its role is in the grand scheme of things | 09:53.09 |
tor8 | minor :) | 09:53.15 |
avih | :) | 09:53.18 |
tor8 | it's a convenience function for js_getindex and several other places that handle array indices | 09:53.34 |
| which are generally only positive | 09:53.40 |
| look at mujs tor/master branch for a commit | 09:54.09 |
avih | right. i think more than generally though, isn't it? iirc the es5 spec defines "valid array index" as equivalent to anything which uint32 can hold | 09:54.44 |
| or maybe UINT32_MAX - 1 or some such | 09:56.22 |
tor8 | yes. there's a bunch of voodoo for array objects (auto updating the .length property if you set a property which is classified as an "array index") | 09:56.37 |
avih | indeed | 09:57.11 |
tor8 | given that we work with signed integers (mixing signed and unsigned math is just too bug prone, IMO) our use of that is slightly more limited | 09:57.13 |
avih | nothing prevents you from using uint32 at hasindex etc | 09:57.48 |
| tor8: you have a bug. you're entering undefined behavior for f == INT_MIN. because -f is bigger than INT_MAX | 09:59.34 |
tor8 | avih: where? | 10:00.54 |
avih | you could test > INT_MIN instead of >=, but I don't think you get guarantees that intmin is -intmax - 1 | 10:01.10 |
| oh, sorry, js_itoa now handles negs | 10:02.13 |
| what's 'a' used for? | 10:03.16 |
| ah, it's at js_itoa | 10:03.34 |
| it's ok. same as my patch, it misses half the range which uint can handle and int cannot | 10:05.03 |
| tor8: did you measure any perf impact the patch has on top of grisu2? | 10:06.01 |
| also, i think unrelated to the patch, if you notice the numbers i posted for "frac strings" then grisu2 improves x3+, which i don't understand why. do you? | 10:07.12 |
tor8 | the grisu2 patch also adds a new faster strtod | 10:08.58 |
| I'm sure it's got different bugs from the other strtod we used... | 10:09.29 |
avih | yes, but it shouldn't be entered for a["1.5"], right? | 10:09.29 |
tor8 | if a is an Array object, it is entered | 10:09.55 |
| to check whether the property is an integer | 10:10.02 |
avih | oh? | 10:10.04 |
tor8 | js_isarrayindex | 10:10.12 |
| I'm rewriting that to be faster too | 10:10.17 |
avih | but it doesn't seem to affect perf of a["1"] | 10:10.26 |
| but a["0.5"] is now more than 3x faster | 10:10.53 |
tor8 | probably due to the difference in behaviour of old strtod? | 10:11.18 |
avih | in general though, i think the next step in array access improvement would be to use C arrays under some conditions (probably e.g. at least N items and sparseness factor up to 2 or some such) without going through the properties search code. | 10:12.18 |
| that grisu2 is fantastic. | 10:13.02 |
tor8 | a is the unsigned integer used to convert the integer to a decimal string | 10:13.27 |
| avih: I've got timings doing 'a[2] = a[0] + a[1]' one million times | 10:15.55 |
avih | how does it compare? | 10:16.10 |
tor8 | pre-grisu2: 9000ms, grisu2: 652ms, fast-path numbertostring: 627ms, fast-path isarrayindex: 600ms | 10:18.52 |
avih | so 10% at most? | 10:19.37 |
| (for this test case) | 10:19.45 |
tor8 | for this integer only test case, yes | 10:19.52 |
avih | well, a[0] is already fastpath in the pre-grisu2 code, so you're only testing two out of three cases which fast paths improve. | 10:21.12 |
| you should test a[2] = a[1] + a[1] + a[1] + a[1] at least IMO | 10:21.38 |
tor8 | a[2] and a[1] both exercise all the fast path codes | 10:23.30 |
| the a[0] is largely irrelevant and just adds a bit more fixed overhead | 10:23.40 |
avih | they do, but your loop does a lot more than accessing a[2] and a[1] | 10:23.53 |
| which is why the diff you're seeing does not represent the actual improvement | 10:24.26 |
tor8 | of course. percentages are only percentages of the current program, not of the time spent in number conversions. | 10:25.14 |
avih | yep. the higher the percentage of the time the test prog spends in using the fastpath, the closer the diff becomes to the actual improvement. | 10:26.39 |
tor8 | just moving the loop into a function (as opposed to at top level script scope) takes it from 600ms to 400ms | 10:27.11 |
avih | heh | 10:27.23 |
tor8 | because of the variable accesses being optimized to direct stack slots rather than using globals in the global scope object | 10:27.34 |
avih | don't start unrolling loop and manually inlining code :) | 10:27.43 |
tor8 | nah, I doubt unrolling a js loop is going to have any measurable impact at all | 10:28.16 |
avih | oh, you meant at the js code. | 10:28.24 |
tor8 | it's the property accesses that cost | 10:28.26 |
avih | yeah, that's why i think it's the next meaningful step, and till then, i think grisu fills the gap very nicely | 10:29.26 |
| and don't forget the new fastpath also has some cost when it's not entered. | 10:30.56 |
| small, but everything counts | 10:31.16 |
| especially when the fastpath now is only slightly faster than grisu2, it means it could be almost doubling the evaluation time when not entered | 10:32.54 |
| (less actually, only the checks are made, not the js_itoa code) | 10:33.45 |
| seems that "milo" is almost twice faster than grisu2 https://github.com/miloyip/dtoa-benchmark | 10:35.03 |
| not sure what "null" is. if it's actually a generic conversion then it's another x2-3 faster. | 10:36.31 |
| ah "the null implementation does nothing" | 10:37.00 |
tor8 | "milo" is C++ | 10:37.03 |
avih | hmm.. other than some static_cast<...> i don't see cpp stuff in a glance, looking here https://github.com/miloyip/dtoa-benchmark/blob/master/src/milo/dtoa_milo.h | 10:39.26 |
tor8 | what about the classes then? | 10:42.30 |
avih | must have missed them? :) | 10:42.40 |
| where? | 10:42.52 |
| the code does have some annoying compiler tests and different codepaths | 10:44.02 |
jogux | sebras: okay, I'm here :) | 10:44.42 |
avih | tor8: is the grisu2 code expected to work on arm and other processors? | 10:44.56 |
tor8 | avih: yes. anything that has uint64_t. | 10:45.18 |
avih | that's good | 10:45.30 |
tor8 | no special voodoo defines required like gay's dtoa | 10:45.44 |
avih | IMO then don't add my/your fastpath patch. grisu2 is really good, and the next step should be IMO C array access where possible to avoid property search | 10:46.50 |
tor8 | faster number <-> string conversions for integers is still worth doing, and it's easy low hanging fruit | 10:47.38 |
| avoiding property search is non-trivial | 10:47.54 |
avih | i get both :) | 10:48.29 |
| and while it does hang low, it's a pretty small fruit :) but hey, it's not nothing :) | 10:49.22 |
sebras | jogux: nice! | 10:54.17 |
| jogux: can I trouble you with attempting to compile and test this one? http://git.ghostscript.com/?p=user/sebras/mupdf-ios-viewer.git;a=commit;h=087aef04 | 10:54.59 |
| jogux: I hope I got it reasonable correct but objc is not my forte. | 10:55.12 |
tor8 | avih: timings with optimised loop at 10 million entries, grisu: 4.2s, fast-path numbertostring: 3.3s, fast-path isarrayindex: 2.1s | 11:02.46 |
| avih: commits on tor/master if you want to try your own timings | 11:03.09 |
avih | tor8: right. so it's worth it. what's the last case though? | 11:03.22 |
| i'll give it a go, yes. | 11:03.39 |
jogux | there's some problem with iOS mupdf builds btw, Xcode now frequently gets stuck in the custom build step. I don't know if it's a Xcode upgrade or a mupdf change that's caused it :( | 11:03.55 |
tor8 | rewrote js_isarrayindex to do fast string to integer conversion with early bailout on non-integer and overflow | 11:03.59 |
jogux | sebras: That seems to work. I see 'MuPDF 1.11 Documents' in the homescreen toolbar now. | 11:04.27 |
avih | gotcha. | 11:04.56 |
sebras | jogux: nice. | 11:05.08 |
tor8 | instead of the "this is what the spec says, make it work, don't care about speed" implementation it used to have | 11:05.20 |
sebras | jogux: I'm worried that the title string might leak. | 11:05.26 |
tor8 | which did a string-to-number + number-to-string + strcmp | 11:05.31 |
sebras | jogux: I'm not sure how reference counting and stuff works here. | 11:05.36 |
jogux | sebras: all the methods you are calling return autoreleased strings. | 11:05.48 |
| (so it won't leak) | 11:05.55 |
sebras | jogux: alright, how would I find this out? (so I don't have to worry about this next time) | 11:06.22 |
jogux | there's some rule like unless the method has alloc or new or something obvious like that in the name it returns an autoreleased thing. | 11:06.37 |
avih | tor8: that's some roundtrip :) | 11:06.44 |
| i hope it enjoyed the view at least | 11:07.04 |
jogux | sebras: But no one worries about this anymore, we should convert mupdf to use ARC (Automatic Reference Counting) sometime. | 11:07.10 |
| Other than mupdf it's years since I've had to do manual memory management on iOS. | 11:07.21 |
sebras | jogux: do you want me to check this fix in, or is it better that you do it? | 11:07.48 |
jogux | sebras: you might as well :) | 11:08.00 |
sebras | or is it merge/commit/push/argh? | 11:08.07 |
| will do. | 11:08.13 |
| jogux: thanks for helping out. :) | 11:09.57 |
jogux | no problem | 11:10.03 |
avih | tor8: your branch is a meaningful speed up for the test case i posted at the PR. i say push :) | 11:19.46 |
| (just tested) | 11:19.59 |
| compared to mujs master with grisu, the "int numbers" case is 30% faster, "int strings" and "frac strings" ~15% faster, and "frac numbers" ~8% faster | 11:22.48 |
| that's pretty good in my book | 11:22.53 |
| however, in a real world code which was suffering terribly before grisu, and which made me investigate and submit the PR, the diff from grisu2 to your branch is barely measurable. probably 2% or so. though surely there are real world cases which will benefit more. as reference, grisu/my-PR sped up my "real world" code almost x2. | 11:27.37 |
tor8 | yes, pre-grisu to grisu was a 20x speed difference | 11:30.38 |
avih | yes | 11:30.44 |
tor8 | the extra 2x is barely noticeable after that :) | 11:30.53 |
avih | :) | 11:31.18 |
| tor8: fwiw, my code which was too-slow-for-my-taste is the set_timer and insert_sorted functions here https://0x0.st/EqT.txt | 11:32.49 |
| (this iterative-move insert_sorted actually ends up faster than splice, which initially surprised me, but then i realized splice also creates a new array, and has to iterate all the items pretty much the same as insert_sorted now does anyway) | 11:34.09 |
tor8 | yes, and it needs to deal with odd cases that your code doesn't need to worry about | 11:35.13 |
avih | yes | 11:36.28 |
| when splice is implemented with memmove i might get back to splice ;) | 11:52.00 |
| tor8: fwiw, on the test cases from the PR, with your branch, duktape is now almost 2x faster for "int numbers", ~30% faster on int/frac strings, and finally almost 2x slower on frac numbers :) | 12:13.51 |
| (admittedly though, accessing an array with fractional numbers is not too common ;) ) | 12:14.33 |
| but i _think_ it uses C arrays where possible, and if that's true, then the always-search-property implementation is really nice and fast | 12:19.53 |
tor8 | duktape also has 6x as many lines of code... | 12:36.21 |
avih | very true indeed | 12:40.24 |
| 2x slower in something which doesn't pretend to be blazingly fast to begin with is something i'd take every time over x6 LOC | 12:41.25 |
| hence "FWIW". just to give you some motivation ;) | 12:42.08 |
| tor8: do string vars assignments duplicate chars? or just char*? | 15:33.34 |
tor8 | avih: it depends. | 15:49.23 |
avih | whether it fits in a value or not? | 15:49.33 |
| and if not? | 15:49.39 |
tor8 | short strings (< 16 bytes) fit in the value | 15:49.41 |
| other strings are all by pointer | 15:49.47 |
avih | cool. thx | 15:49.52 |
| Forward 1 day (to 2017/05/24)>>> | |