| <<<Back 1 day (to 2020/08/05) | Fwd 1 day (to 2020/08/07)>>> | 20200806 |
malc_ | An llpp user has hit an issue with clang+mupdf+build-type=native, some of the details are at: https://github.com/moosotc/llpp/pull/144 | 03:54.37 |
sebras | malc_: I'm confused by the log pointed to by that pull request: https://github.com/nojb/llpp/runs/951298331?check_suite_focus=true#step:5:2780 | 04:08.42 |
| malc_: it shows that it is compiling lablGL/ml_gl.c what is that? | 04:08.56 |
malc_ | sebras: part of lablGL that is shipped with llpp (kinda like you do with thirdparty libs) | 04:09.33 |
sebras | malc_: please help me understand how this is mupdf's fault... I'm not sure I get this yet. | 04:10.44 |
malc_ | sebras: i just don't know, the original discussion happened over e-mail with Didier and Nicolás part of that i can share if you wish | 04:13.50 |
| sebras: it might have something to do with the way i build C parts of llpp with clang... namely -Weverything is enabled and selected options are manually turned off | 04:16.10 |
sebras | malc_: seems like someone else has seen the same? https://github.com/xianyi/OpenBLAS/pull/2329 | 04:17.49 |
malc_ | sebras: perhaps... the only mac machine i have access to is in disarray right now :( have to sort out disk shortage there first | 04:19.00 |
sebras | malc_: they seem to implicate -O2 vs -O1 | 04:19.27 |
malc_ | sebras: perhaps... i can not help at all before sorting out those problems with disk space... disk is consumed by mails, mail.app crashes when trying to free up some space indicating lack of space as the reason for failure... programmers death: rinse repeat | 04:21.23 |
sebras | malc_: but the problems happen in CI, right? wouldn't you be able to run another llpp branch through the same CI setup to see what helps fix the problem? | 04:22.45 |
malc_ | sebras: Nicolás added the CI stuff after the problem was identified, IOW i don't really follow you | 04:23.37 |
sebras | malc_: if the CI stuff is there, couldn't you change the Makefile/setup of the CI runs to see what compiler flags work in the CI build environment? | 04:24.59 |
| thereby I'm thinking you don't have to test this on your local machine. | 04:25.11 |
| malc_: but this line is rather problematic: https://github.com/nojb/llpp/runs/951325249?check_suite_focus=true#step:5:332 | 04:25.38 |
| malc_: so clang 11.0.3 segfaults. | 04:26.42 |
malc_ | sebras: it appears so | 04:26.50 |
sebras | malc_: no wonder the other project managed to "resolve" the problem with -O1. | 04:31.16 |
malc_ | sebras: hmm this (arch) linux box has clang 10.0.1 | 04:33.40 |
| apple is on the bleeding edge | 04:33.47 |
sebras | malc_: maybe not apple themselves, but whatever system is running CI | 04:34.35 |
| malc_: but what nojb seems to say is that make build=release works better than make build=native | 04:35.27 |
malc_ | sebras: i think either you are mistaken or i am confused, in any case breakfast time | 04:35.30 |
| sebras: yes that is what he is saying | 04:35.45 |
| and i have already pushed his change | 04:35.51 |
sebras | malc_: so you have. so this is more of info for us in case someone sees the same problem on macos when building with clang 11.0.3, we might advise them to build the release build instead of the native build..? | 04:38.20 |
malc_ | sebras: it is a native macos caveat emptor, yeah | 04:42.53 |
pedr0 | hi all, I am using mutool draw to translate a PDF into an HTML file, I can see that it produces a lot of '<p>' tags with coordinates nailed down as part of the tags' attribute - is it possible to have that generating a tag per-character ? I would basically like to have the same level of information as in the XML trace debug but the ease of rendering of an HTML file, which I can open with a web browser and write some JS code to deal with it. | 13:50.44 |
| feel free to tell me what I am talking about is complete rubbish :-) | 14:54.51 |
ator | pedr0: try the SVG output? | 14:58.15 |
| though I don't think that will be quite what you want either | 14:58.31 |
Robin_Watts_ | pedr0: Are you only interested in text? or other graphics too? | 14:59.23 |
| If you're only interested in text, then what you are looking for is for a different textual output of the stext structures. | 15:02.45 |
| If you're interested in non-text stuff too, then you'd need to implement your own device. | 15:02.59 |
| both of those are doable, but a different output of the stext structures is much simpler. | 15:03.14 |
pedr0 | I am sorry, what's stext ? | 15:07.07 |
| Okay, I get it. | 15:09.42 |
Robin_Watts_ | pedr0: OK, so MuPDF does text extraction to 'structured text'. | 15:09.55 |
pedr0 | mutool draw -F stext - correct ? | 15:10.09 |
Robin_Watts_ | That's a set of structures that holds text locations/styles etc. | 15:10.15 |
| -F stext will dump that in the rawest possible textual form, yes. | 15:10.31 |
| so either you can read that in and postprocess it as you want, OR you can fiddle in the C so it directly outputs what you want. | 15:10.54 |
pedr0 | I think I will need to write some code myself as what I am doing may be slightly specific to my application, I've the additional complication of having to do this within a web-page, which is why I was asking about web-assembly. | 15:12.11 |
| Yesterday I guess. | 15:12.21 |
Robin_Watts_ | ok, we have a webassembly thing on the way. | 15:13.00 |
pedr0 | The HTML representation - forgive me my ignorance, how are fonts handled ? It may well be that a PDF font does not exist in the web-browser, what happens in that case ? Again, am I talking rubbish maybe ? | 15:13.56 |
| I am thinking of tweaking the HTML generator to output a character at a time '<p style="...">character</p>' - I understand it seems crazy but it helps in my case. am I on-boarding myself towards misery and desperation you reckon ? | 15:15.46 |
Robin_Watts_ | pedr0: for html we just list the font name. We don't attempt to export the font data. | 15:15.52 |
pedr0 | so it works as soon as it's available - correct ? | 15:16.23 |
| and it's up to the web-browser to do the right thing if the font isn't available. | 15:16.57 |
Robin_Watts_ | pedr0: Going from PDF to html is going to be a thankless task if you're expecting to match exact layout. | 15:17.04 |
| cos PDF doesn't say "here is some text, lay it out". Effectively what it says is "Put this glyph here." "Put this glyph here". etc. | 15:17.49 |
pedr0 | Yes I am aware of that and I am reasonably happy with what is currently generated, I just need to have a more granular view of the characters on the screen and their position. | 15:17.56 |
Robin_Watts_ | pedr0: Well, then working with the info in stext is probably as good as you can do. | 15:18.23 |
pedr0 | You must forgive me, I've no clear idea of many details, maybe I can re-phrase in this way: | 15:20.45 |
| <p style="position:absolute;white-space:pre;margin:0;padding:0;top:217pt;left:454pt"><span style="font-family:F3,sans-serif;font-size:9.480375pt">... | 15:20.49 |
| That is the output of the mutool draw -> HTML | 15:21.08 |
| the bit '<span style="font-family:F3,sans-serif;font-size:9.480375pt">' | 15:21.27 |
| where is it coming from ? Because then I look at the stext I can't find those information there - again, I am probably wrong but I don't see them. | 15:22.04 |
| Because ... *when* I look at .... | 15:23.07 |
Robin_Watts_ | Just a tick... | 15:23.53 |
| ok, so for stext output I see: | 15:27.42 |
| <font name="Helvetica" size="10"> | 15:28.03 |
| <char quad="72 40.59 78.11 40.59 72 50.18 78.11 50.18" x="72" y="48" color="#000000" c="Z"/> | 15:28.03 |
| for html I see: <span style="font-family:Helvetica,serif;font-size:10pt">ZLIB(3) </span></p> | 15:28.46 |
| So it's essentially the same information in different forms. | 15:29.18 |
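A rough sketch of the "read it in and postprocess it" route, written in Python. It parses the two stext lines quoted above and emits one absolutely positioned `<p>` per character, in the style of the HTML output shown earlier. The `<page>` wrapper, the closing `</font>`, and the helper names `parse_chars` and `char_to_html` are assumptions made for illustration; the real stext output nests these elements inside further structure, so treat this as a starting point only.

```python
import html
import xml.etree.ElementTree as ET

# Built from the two stext lines quoted in the log. The <page> wrapper and
# the closing </font> are assumptions added so the fragment parses alone.
SAMPLE = """<page>
<font name="Helvetica" size="10">
<char quad="72 40.59 78.11 40.59 72 50.18 78.11 50.18" x="72" y="48" color="#000000" c="Z"/>
</font>
</page>"""


def parse_chars(xml_text):
    """Return one dict per <char>, carrying the enclosing font name and size."""
    chars = []
    root = ET.fromstring(xml_text)
    for font in root.iter("font"):
        for ch in font.iter("char"):
            v = [float(n) for n in ch.get("quad").split()]
            chars.append({
                "c": ch.get("c"),
                "font": font.get("name"),
                "size": float(font.get("size")),
                "origin": (float(ch.get("x")), float(ch.get("y"))),
                # Four (x, y) corner points, in the order they appear.
                "quad": [(v[i], v[i + 1]) for i in range(0, 8, 2)],
            })
    return chars


def char_to_html(ch):
    """Emit one positioned <p> for a single character (hypothetical layout)."""
    left = min(x for x, _ in ch["quad"])
    top = min(y for _, y in ch["quad"])
    return ('<p style="position:absolute;white-space:pre;margin:0;padding:0;'
            f'top:{top:g}pt;left:{left:g}pt">'
            f'<span style="font-family:{ch["font"]};font-size:{ch["size"]:g}pt">'
            f'{html.escape(ch["c"])}</span></p>')


chars = parse_chars(SAMPLE)
page_html = "\n".join(char_to_html(ch) for ch in chars)
```

Taking the minimum x and y of the quad as the top-left position only makes sense for unrotated text; rotated glyphs would need a CSS transform or a different approach.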
pedr0 | so I am essentially a blinded fool :) Sorry about that. | 15:33.41 |
| and thanks a lot, this is *very* useful for me | 15:34.00 |
Robin_Watts_ | no worries. | 15:35.22 |
| Can you tell us more about your use case? It's always interesting to hear how it's being used. | 15:35.41 |
pedr0 | Listen, I am trying to draw a pdf on a page, then what I want to do is to draw a grid on it and extract all text within those 'user-defined cells' | 15:38.43 |
| *a webpage | 15:38.50 |
| any suggestion, you're welcome | 15:39.22 |
Robin_Watts_ | pedr0: So you could render the PDF to an image, display the image on the webpage, and then just use the stext for the actual extraction? | 15:40.02 |
| i.e. you draw the grid on the image. | 15:40.29 |
pedr0 | I was thinking of that, but let me phrase this better. I could render the PDF on a webpage and I draw overlays that defined the grid, now, making the simplification of a 1:1 mapping among coordinates (PDF->Web Canvas/Browser/Image) I could then gather all elements falling into the defined areas. | 15:41.29 |
| is that roughly what you meant ? | 15:41.55 |
Robin_Watts_ | yes. | 15:42.24 |
| Presumably you're already having to map from the coords in an stext to coords within the HTML page. You'd be doing the same thing, and discarding stuff that fell outside of the bounds as you extract. | 15:43.11 |
pedr0 | Yes, I think I am following you. | 15:43.38 |
Robin_Watts_ | That would, I think, give a nicer visual experience for the user. | 15:44.13 |
| We routinely use the bboxes returned from an stext in order to do cut/paste/highlighting on pdf rendered pages. | 15:44.51 |
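The grid idea discussed above could be sketched roughly like this in Python: draw the cells over the rendered image, convert each cell back to page coordinates, and keep the characters whose quad centre falls inside. The names `quad_centre` and `text_in_cell`, the `scale` parameter (image pixels per page unit), and the second sample character are all hypothetical; only the "Z" quad comes from the log.

```python
def quad_centre(quad):
    """Average of the four corner points; independent of corner order."""
    return (sum(x for x, _ in quad) / 4.0, sum(y for _, y in quad) / 4.0)


def text_in_cell(chars, cell_px, scale=1.0):
    """Collect characters whose quad centre lies inside a user-drawn cell.

    cell_px is (x0, y0, x1, y1) in image pixels; scale is the assumed
    number of image pixels per page unit (1.0 for a 1:1 mapping).
    """
    x0, y0, x1, y1 = (v / scale for v in cell_px)
    picked = []
    for ch in chars:
        cx, cy = quad_centre(ch["quad"])
        if x0 <= cx < x1 and y0 <= cy < y1:
            picked.append(ch["c"])
    return "".join(picked)


chars = [
    # Quad taken from the stext sample in the log.
    {"c": "Z", "quad": [(72, 40.59), (78.11, 40.59), (72, 50.18), (78.11, 50.18)]},
    # Made-up second character, placed well outside the first cell.
    {"c": "!", "quad": [(300, 400), (305, 400), (300, 410), (305, 410)]},
]

first_cell = text_in_cell(chars, (70, 40, 80, 52))
same_cell_at_2x = text_in_cell(chars, (140, 80, 160, 104), scale=2.0)
```

Testing the centre rather than the whole quad is a design choice: it assigns each character to exactly one cell even when a glyph straddles a grid line.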
pedr0 | the quad='' field - is that the CTM ? and x,y are those two elements the top/left corner of the bbox for the glyph ? | 15:45.35 |
| you know what, don't bother I will find out myself digging around, I think I can make it | 15:46.32 |
Robin_Watts_ | no, quad is 4 coords. | 15:46.59 |
| each glyph fits into a box. | 15:47.11 |
| so, top/left/right/bottom. | 15:47.18 |
pedr0 | I see | 15:47.26 |
Robin_Watts_ | but the image of that box under the ctm is not a box, cos it might be rotated. | 15:47.33 |
| so the quad is the list of 4 corner points (hence 8 values). | 15:47.52 |
| (x y x y x y x y) | 15:47.57 |
| does that make sense? | 15:48.16 |
pedr0 | I must admit it does not. | 15:49.13 |
| A glyph is a box, and it can be rotated | 15:49.22 |
| does the quad specify, given the 'original' glyph's box, where to place its corners on the actual page ? | 15:50.02 |
Robin_Watts_ | pedr0: The quad is the image of the original glyph's box, yes. | 15:51.02 |
| glyphs are always designed within a box which has orthogonal axes, right? | 15:51.27 |
pedr0 | yes | 15:51.33 |
Robin_Watts_ | so left/right/top/bottom is all you need. | 15:51.39 |
| but if that's mapped onto the page with (say) a 45 degree rotation, the resultant bounding box that encloses all the glyph is WAY larger than we want. | 15:52.08 |
| much better to have the 4 corners of where the original bbox got mapped to. | 15:52.36 |
| and that requires 4 coordinates (corner 0, corner 1, corner 2, corner 3), not just 2 (top/left, bottom/right) | 15:53.32 |
pedr0 | Let me try | 15:56.26 |
| A glyph is drawn within a box, OK. | 15:56.44 |
Robin_Watts_ | Suppose the glyph is drawn in the box between (0,0) and (1,1) | 15:57.37 |
| now suppose that's mapped onto the screen with a scale factor and a rotation. | 15:57.59 |
| so the (0,0) corner stays at (0,0). | 15:58.11 |
pedr0 | Then, if I rotate the box, let's say by 45 degrees, the rectangle which encloses the rotated box is bigger than the original box. Is that what you meant ? | 15:58.18 |
| I am sorry, keep going. | 15:58.37 |
Robin_Watts_ | the (0, 1) corner goes to (-1, 1) | 15:58.37 |
| the (1, 1) corner goes to (0, 2) | 15:58.57 |
| the (1, 0) corner goes to (1,1) | 15:59.10 |
| When you come to select that glyph on the screen, it'd be really misleading to have to render an axis aligned highlight box from (-1,0) to (1,2) | 16:00.06 |
| so you'd rather draw a non-axis-aligned highlight between (0,0) (-1,1) (0,2) (1,1) | 16:00.45 |
| So we give the quad as 0,0,-1,1,0,2,1,1 | 16:00.59 |
| make sense? | 16:01.04 |
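The worked example above can be checked numerically. The Python sketch below assumes the mapping is a 45 degree anti-clockwise rotation about (0,0) combined with a scale of sqrt(2), which in the usual PDF matrix form (a, b, c, d, e, f) is (1, 1, -1, 1, 0, 0); that matrix is inferred from the corner values in the log, not stated there. The function names are illustrative only.

```python
def transform(point, m):
    """Apply a PDF-style matrix (a, b, c, d, e, f) to an (x, y) point."""
    a, b, c, d, e, f = m
    x, y = point
    return (a * x + c * y + e, b * x + d * y + f)


def quad_area(points):
    """Shoelace formula; points must be listed in order around the perimeter."""
    total = 0
    for i, (x0, y0) in enumerate(points):
        x1, y1 = points[(i + 1) % len(points)]
        total += x0 * y1 - x1 * y0
    return abs(total) / 2


# Rotation by 45 degrees anti-clockwise with scale sqrt(2) (assumed).
CTM = (1, 1, -1, 1, 0, 0)

# Corners of the unit glyph box, walked in the same order as in the log.
box = [(0, 0), (0, 1), (1, 1), (1, 0)]
quad = [transform(p, CTM) for p in box]

# Axis-aligned box that encloses the rotated glyph box.
xs = [x for x, _ in quad]
ys = [y for _, y in quad]
bbox = (min(xs), min(ys), max(xs), max(ys))
bbox_area = (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])

print(quad)                        # [(0, 0), (-1, 1), (0, 2), (1, 1)]
print(bbox)                        # (-1, 0, 1, 2)
print(quad_area(quad), bbox_area)  # 2.0 4
```

The enclosing axis-aligned box has twice the area of the quad here, which is why a highlight drawn from the quad hugs the glyph much more closely.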
pedr0 | The quad are the actual coordinate on the screen. | 16:02.38 |
| *coordinates | 16:02.43 |
| is that right ? | 16:02.53 |
| I think I need a little bit of time to digest the thing, it's been a while since I had to deal with some algebra | 16:03.44 |
| was that a rotation of 45 degrees , anti clock-wise ? Just to double check my drawings here | 16:08.22 |
| 'centered' on (0,0) | 16:09.13 |
| but yes it seems to me that the quad are the actual coordinates on the screen, what about x,y ? | 16:09.28 |
| (shoot me down if I talk rubbish, is always good to be aware that you have no clue what you're talking about) :-) | 16:10.56 |
Robin_Watts_ | Rotation plus a scale, done in my head, so bear with my crap maths :) | 16:12.13 |
| The quad are the image of the corners. the x,y is the image of the 'cursor position' of the glyph. | 16:13.22 |
| We move to the cursor position and then draw the glyph. | 16:13.52 |
| Often glyphs have descenders that go beyond the baseline (like the tail on a g etc). | 16:14.22 |
pedr0 | cursor position before or after the glyph ? | 16:14.45 |
Robin_Watts_ | so the bbox is typically slightly negative on the bottom y. Whereas the glyph counts as being positioned at y = 0. | 16:14.54 |
| cursor position before the glyph is printed. | 16:15.07 |
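To make the difference between the quad and the x,y cursor position concrete, here is a small Python sketch using the numbers from the stext sample earlier in the log. It assumes an unrotated glyph and that y increases down the page, which is what the sample values suggest; `extents_about_baseline` is a hypothetical helper name.

```python
# Values copied from the <char> element quoted earlier in the log.
quad = [(72, 40.59), (78.11, 40.59), (72, 50.18), (78.11, 50.18)]
origin = (72, 48)  # the x,y "cursor position" of the glyph


def extents_about_baseline(quad, origin):
    """Return how far the box reaches above and below the cursor's y.

    Assumes an unrotated glyph and y growing downwards, as in the sample.
    """
    top = min(y for _, y in quad)
    bottom = max(y for _, y in quad)
    return origin[1] - top, bottom - origin[1]


above, below = extents_about_baseline(quad, origin)
print(round(above, 2), round(below, 2))  # 7.41 2.18
```

So in this sample the box extends a little past the cursor's y on the descender side, while the cursor's x coincides with the left edge of the box, consistent with the cursor being positioned before the glyph is drawn.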
pedr0 | Thank you very, very much. Hope to give back sooner or later. | 16:16.37 |
| This is much clearer now | 16:16.46 |