Log of #mupdf at irc.freenode.net.

20200806 
malc_ llpp users have hit an issue with clang+mupdf+build-type=native, some of the details are at: https://github.com/moosotc/llpp/pull/144 03:54.37 
sebras malc_: I'm confused by the log pointed to by that pull request: https://github.com/nojb/llpp/runs/951298331?check_suite_focus=true#step:5:2780 04:08.42 
  malc_: it shows that it is compiling lablGL/ml_gl.c what is that?04:08.56 
malc_ sebras: part of lablGL that is shipped with llpp (kinda like you do with thirdparty libs)04:09.33 
sebras malc_: please help me understand how this is mupdf's fault... I'm not sure I get this yet.04:10.44 
malc_ sebras: i just don't know, the original discussion happened over e-mail with Didier and Nicolás part of that i can share if you wish04:13.50 
  sebras: it might have something to do with the way i build C parts of llpp with clang... namely -Weverything is enabled and selected options are manually turned off04:16.10 
sebras malc_: seems like someone else has seen the same? https://github.com/xianyi/OpenBLAS/pull/2329 04:17.49 
malc_ sebras: perhaps... the only mac machine i have access to is in disarray right now :( have to sort out disk shortage there first04:19.00 
sebras malc_: they seem to implicate -O2 vs -O1 04:19.27 
malc_ sebras: perhaps... i can not help at all before sorting out those problems with disk space... disk is consumed by mails, mail.app crashes when trying to free up some space indicating lack of space as the reason for failure... programmers death: rinse repeat04:21.23 
sebras malc_: but the problems happen in CI, right? wouldn't you be able to run another llpp branch through the same CI setup to see what helps fix the problem?04:22.45 
malc_ sebras: Nicolás added the CI stuff after the problem was identified, IOW i don't really follow you04:23.37 
sebras malc_: if the CI stuff is there, couldn't you change the Makefile/setup of the CI runs to see what compiler flags work in the CI build environment?04:24.59 
  thereby I'm thinking you don't have to test this on your local machine.04:25.11 
  malc_: but this line is rather problematic: https://github.com/nojb/llpp/runs/951325249?check_suite_focus=true#step:5:332 04:25.38 
  malc_: so clang 11.0.3 segfaults.04:26.42 
malc_ sebras: it appears so04:26.50 
sebras malc_: no wonder the other project managed to "resolve" the problem with -O1.04:31.16 
malc_ sebras: hmm this (arch) linux box has clang 10.0.1 04:33.40 
  apple is on the bleeding edge04:33.47 
sebras malc_: maybe not apple themselves, but whatever system is running CI04:34.35 
  malc_: but what nojb seems to say is that make build=release works better than make build=native04:35.27 
malc_ sebras: i think either you are mistaken or i am confused, in any case breakfast time04:35.30 
  sebras: yes that is what he is saying04:35.45 
  and i have already pushed his change04:35.51 
sebras malc_: so you have. so this is more of info for us in case someone sees the same problem on macos when building with clang 11.0.3, we might advise them to build the release build instead of the native build..?04:38.20 
malc_ sebras: it is a native macos caveat emptor, yeah04:42.53 
pedr0 hi all, I am using mutool draw to translate a PDF into an HTML file, I can see that it produces a lot of '<p>' tags with coordinates nailed down as part of the tags' attributes - is it possible to have it generate a tag per character ? I would basically like to have the same level of information as in the XML trace debug but the ease of rendering of an HTML file, which I can open with a web browser and write some JS code to deal with it.13:50.44 
  feel free to tell me what I am talking about is complete rubbish :-)14:54.51 
ator pedr0: try the SVG output?14:58.15 
  though I don't think that will be quite what you want either14:58.31 
Robin_Watts_ pedr0: Are you only interested in text? or other graphics too?14:59.23 
  If you're only interested in text, then what you are looking for is for a different textual output of the stext structures.15:02.45 
  If you're interested in non-text stuff too, then you'd need to implement your own device.15:02.59 
  both of those are doable, but a different output of the stext structures is much simpler.15:03.14 
pedr0 I am sorry, what's stext ?15:07.07 
  Okay, I get it.15:09.42 
Robin_Watts_ pedr0: OK, so MuPDF does text extraction to 'structured text'.15:09.55 
pedr0 mutool draw -F stext - correct ?15:10.09 
Robin_Watts_ That's a set of structures that holds text locations/styles etc.15:10.15 
  -F stext will dump that in the rawest possible textual form, yes.15:10.31 
  so either you can read that in and postprocess it as you want, OR you can fiddle in the C so it directly outputs what you want.15:10.54 
pedr0 I think I will need to write some code myself as what I am doing may be slightly specific to my application, I've the additional complication of having to do this within a web-page, which is why I was asking about web-assembly.15:12.11 
  Yesterday I guess.15:12.21 
Robin_Watts_ ok, we have a webassembly thing on the way.15:13.00 
pedr0 The HTML representation - forgive me my ignorance, how are fonts handled ? It may well be that a PDF font does not exist in the web-browser, what happens in that case ? Again, am I talking rubbish maybe ?15:13.56 
  I am thinking of tweaking the HTML generator to output one character at a time '<p style="...">character</p>' - I understand it seems crazy but it helps in my case. am I on-boarding myself towards misery and desperation you reckon ?15:15.46 
Robin_Watts_ pedr0: for html we just list the font name. We don't attempt to export the font data.15:15.52 
pedr0 so it works as soon as it's available - correct ?15:16.23 
  and it's up to the web-browser to do the right thing if the font isn't available.15:16.57 
Robin_Watts_ pedr0: Going from PDF to html is going to be a thankless task if you're expecting to match exact layout.15:17.04 
  cos PDF doesn't say "here is some text, lay it out". Effectively what it says is "Put this glyph here." "Put this glyph here". etc.15:17.49 
pedr0 Yes I am aware of that and I am reasonably happy with what is currently generated, I just need to have a more granular view of the characters on the screen and their position.15:17.56 
Robin_Watts_ pedr0: Well, then working with the info in stext is probably as good as you can do.15:18.23 
pedr0 You must forgive me, I have no clear idea of many details, maybe I can re-phrase in this way:15:20.45 
  <p style="position:absolute;white-space:pre;margin:0;padding:0;top:217pt;left:454pt"><span style="font-family:F3,sans-serif;font-size:9.480375pt">...15:20.49 
  That is the output of the mutool draw -> HTML15:21.08 
  the bit '<span style="font-family:F3,sans-serif;font-size:9.480375pt">'15:21.27 
  where is it coming from ? Because then I look at the stext I can't find that information there - again, I am probably wrong but I don't see it.15:22.04 
  Because ... *when* I look at ....15:23.07 
Robin_Watts_ Just a tick...15:23.53 
  ok, so for stext output I see:15:27.42 
  <font name="Helvetica" size="10">15:28.03 
  <char quad="72 40.59 78.11 40.59 72 50.18 78.11 50.18" x="72" y="48" color="#000000" c="Z"/>15:28.03 
  for html I see: <span style="font-family:Helvetica,serif;font-size:10pt">ZLIB(3) </span></p>15:28.46 
  So it's essentially the same information in different forms.15:29.18 
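
A minimal sketch of reading that stext fragment back in, for anyone following along. It uses only Python's standard `xml.etree.ElementTree`; the sample string is the `<font>`/`<char>` pair quoted above with a closing `</font>` added so it parses on its own, and the helper name `read_chars` and the dict keys are made up for illustration (they are not part of MuPDF).

```python
import xml.etree.ElementTree as ET

# Fragment copied from the stext output quoted above; the closing </font>
# is added here so the snippet is well-formed by itself.
SAMPLE = """<font name="Helvetica" size="10">
<char quad="72 40.59 78.11 40.59 72 50.18 78.11 50.18" x="72" y="48" color="#000000" c="Z"/>
</font>"""

def read_chars(xml_text):
    """Return one dict per <char>, carrying the enclosing font name and size."""
    chars = []
    root = ET.fromstring(xml_text)
    # iter() also yields the root itself when the root is a <font> element.
    for font in root.iter("font"):
        for ch in font.iter("char"):
            chars.append({
                "c": ch.get("c"),
                "font": font.get("name"),
                "size": float(font.get("size")),
                "origin": (float(ch.get("x")), float(ch.get("y"))),
                "quad": [float(v) for v in ch.get("quad").split()],
            })
    return chars

chars = read_chars(SAMPLE)
print(chars[0]["c"], chars[0]["font"], chars[0]["size"], chars[0]["origin"])
```

The font name and size that appear in the HTML `<span style=...>` are the same values carried on the `<font>` element here, which is the point made above.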
pedr0 so I am essentially a blinded fool :) Sorry about that.15:33.41 
  and thanks a lot, this is *very* useful for me15:34.00 
Robin_Watts_ no worries.15:35.22 
  Can you tell us more about your use case? It's always interesting to hear how it's being used.15:35.41 
pedr0 Listen, I am trying to draw a pdf on a page, then what I want to do is to draw a grid on it and extract all text within those 'user-defined cells'15:38.43 
  *a webpage15:38.50 
  any suggestion, you're welcome15:39.22 
Robin_Watts_ pedr0: So you could render the PDF to an image, display the image on the webpage, and then just use the stext for the actual extraction?15:40.02 
  i.e. you draw the grid on the image.15:40.29 
pedr0 I was thinking of that, but let me phrase this better. I could render the PDF on a webpage and draw overlays that define the grid, now, making the simplification of a 1:1 mapping between coordinates (PDF->Web Canvas/Browser/Image) I could then gather all elements falling into the defined areas.15:41.29 
  is that roughly what you meant ?15:41.55 
Robin_Watts_ yes.15:42.24 
  Presumably you're already having to map from the coords in an stext to coords within the HTML page. You'd be doing the same thing, and discarding stuff that fell outside of the bounds as you extract.15:43.11 
pedr0 Yes, I think I am following you.15:43.38 
Robin_Watts_ That would, I think, give a nicer visual experience for the user.15:44.13 
  We routinely use the bboxes returned from an stext in order to do cut/paste/highlighting on pdf rendered pages.15:44.51 
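
A rough sketch of the grid idea discussed above, assuming the characters have already been read out of the stext output as `(character, x, y)` tuples in reading order. The function name `text_in_cells`, the cell names, the `scale` parameter and the sample coordinates are all hypothetical; each character is assigned to a cell by its `x`,`y` position only, which is a simplification.

```python
def text_in_cells(chars, cells, scale=1.0):
    """Collect characters whose (scaled) position falls inside each cell.

    chars: list of (character, x, y) in page coordinates.
    cells: dict mapping a cell name to (x0, y0, x1, y1) in image coordinates.
    scale: factor mapping page coordinates to image coordinates (1.0 = 1:1).
    """
    result = {name: "" for name in cells}
    for c, x, y in chars:
        sx, sy = x * scale, y * scale
        for name, (x0, y0, x1, y1) in cells.items():
            if x0 <= sx < x1 and y0 <= sy < y1:
                result[name] += c
                break  # a character belongs to at most one cell
        # characters outside every cell are simply discarded
    return result

# Made-up sample data: two characters in the upper cell, one in the lower
# cell, and one that falls outside the grid.
sample_chars = [("Z", 72, 48), ("L", 78.11, 48), ("I", 84, 200), ("B", 500, 500)]
grid = {"A": (0, 0, 100, 100), "B": (0, 100, 100, 300)}
extracted = text_in_cells(sample_chars, grid)
print(extracted)
```

Testing the full quad against the cell instead of a single point would be stricter, at the cost of having to decide what to do with glyphs that straddle a cell border.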
pedr0 the quad='' field - is that the CTM ? and x,y are those two elements the top/left corner of the bbox for the glyph ?15:45.35 
  you know what, don't bother I will find out myself digging around, I think I can make it15:46.32 
Robin_Watts_ no, quad is 4 coords.15:46.59 
  each glyph fits into a box.15:47.11 
  so, top/left/right/bottom.15:47.18 
pedr0 I see15:47.26 
Robin_Watts_ but the image of that box under the ctm is not a box, cos it might be rotated.15:47.33 
  so the quad is the list of 4 corner points (hence 8 values).15:47.52 
  (x y x y x y x y)15:47.57 
  does that make sense?15:48.16 
pedr0 I must admit it does not.15:49.13 
  A glyph is a box, and it can be rotated15:49.22 
  does the quad specify, given the 'original' glyph's box, where to place its corners on the actual page ?15:50.02 
Robin_Watts_ pedr0: The quad is the image of the original glyph's box, yes.15:51.02 
  glyphs are always designed within a box which has orthogonal axes, right?15:51.27 
pedr0 yes15:51.33 
Robin_Watts_ so left/right/top/bottom is all you need.15:51.39 
  but if that's mapped onto the page with (say) a 45 degree rotation, the resultant bounding box that encloses all the glyph is WAY larger than we want.15:52.08 
  much better to have the 4 corners of where the original bbox got mapped to.15:52.36 
  and that requires 4 coordinates (corner 0, corner 1, corner 2, corner 3), not just 2 (top/left, bottom/right)15:53.32 
pedr0 Let me try15:56.26 
  A glyph is drawn within a box, OK.15:56.44 
Robin_Watts_ Suppose the glyph is drawn in the box between (0,0) and (1,1)15:57.37 
  now suppose that's mapped onto the screen with a scale factor and a rotation.15:57.59 
  so the (0,0) corner stays at (0,0).15:58.11 
pedr0 Then, if I rotate the box, let's say by 45 degrees, the rectangle which encloses the rotated box is bigger than the original box. Is that what you meant ?15:58.18 
  I am sorry, keep going.15:58.37 
Robin_Watts_ the (0, 1) corner goes to (-1, 1)15:58.37 
  the (1, 1) corner goes to (0, 2)15:58.57 
  the (1, 0) corner goes to (1,1)15:59.10 
  When you come to select that glyph on the screen, it'd be really misleading to have to render an axis aligned highlight box from (-1,0) to (1,2)16:00.06 
  so you'd rather draw a non-axis-aligned highlight between (0,0) (-1,1) (0,2) (1,1)16:00.45 
  So we give the quad as 0,0,-1,1,0,2,1,1 16:00.59 
  make sense?16:01.04 
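
The arithmetic in the example above can be checked with a short sketch. The matrix below (a 45 degree rotation combined with a scale of the square root of 2, written in the six-number `a b c d e f` affine form) is an assumption chosen because it reproduces the corner values given in the conversation; the helper names are made up for illustration.

```python
def transform(matrix, point):
    """Apply an affine matrix (a, b, c, d, e, f) to a point (x, y)."""
    a, b, c, d, e, f = matrix
    x, y = point
    return (a * x + c * y + e, b * x + d * y + f)

def bounding_box(points):
    """Axis-aligned box (x0, y0, x1, y1) enclosing all points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))

def polygon_area(points):
    """Area of a simple polygon given in perimeter order (shoelace formula)."""
    total = 0
    for i, (x0, y0) in enumerate(points):
        x1, y1 = points[(i + 1) % len(points)]
        total += x0 * y1 - x1 * y0
    return abs(total) / 2

# Rotation by 45 degrees with scale sqrt(2): cos and sin both become 1.
ctm = (1, 1, -1, 1, 0, 0)

# Corners of the glyph box from (0,0) to (1,1), in perimeter order.
box = [(0, 0), (0, 1), (1, 1), (1, 0)]

quad = [transform(ctm, p) for p in box]
bbox = bounding_box(quad)
print(quad, bbox, polygon_area(quad))
```

The quad covers half the area of the axis-aligned box around it, which is why a highlight drawn from the quad hugs the glyph while one drawn from the bounding box does not.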
pedr0 The quad are the actual coordinate on the screen.16:02.38 
  *coordinates16:02.43 
  is that right ?16:02.53 
  I think I need a little bit of time to digest the thing, it's been a while since I had to deal with some algebra16:03.44 
  was that a rotation of 45 degrees , anti clock-wise ? Just to double check my drawings here16:08.22 
  'centered' on (0,0)16:09.13 
  but yes it seems to me that the quad are the actual coordinates on the screen, what about x,y ?16:09.28 
  (shoot me down if I talk rubbish, is always good to be aware that you have no clue what you're talking about) :-)16:10.56 
Robin_Watts_ Rotation plus a scale, done in my head, so bear with my crap maths :)16:12.13 
  The quad are the image of the corners. the x,y is the image of the 'cursor position' of the glyph.16:13.22 
  We move to the cursor position and then draw the glyph.16:13.52 
  Often glyphs have descenders that go beyond the baseline (like the tail on a g etc).16:14.22 
pedr0 cursor position before or after the glyph ?16:14.45 
Robin_Watts_ so the bbox is typically slightly negative on the bottom y. Whereas the glyph counts as being positioned at y = 0.16:14.54 
  cursor position before the glyph is printed.16:15.07 
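
As a small illustration of the difference between the quad and the `x`,`y` position, here are the numbers from the `<char>` sample quoted earlier in this log. The sketch assumes that `y` grows downward in that output, which is inferred from the sample values rather than stated in the conversation; the variable names are illustrative only.

```python
# Values copied from the <char> sample quoted earlier in this log.
quad = [72, 40.59, 78.11, 40.59, 72, 50.18, 78.11, 50.18]
origin_y = 48.0  # the y attribute: where the glyph is positioned

ys = quad[1::2]          # every second value is a y coordinate
top, bottom = min(ys), max(ys)

# Assumption: y grows downward, so the smaller y is the top of the box.
above_baseline = origin_y - top     # room for the body of the glyph
below_baseline = bottom - origin_y  # room for descenders

print(round(above_baseline, 2), round(below_baseline, 2))
```

Under that assumption the box extends a little past the glyph's position on the descender side, matching the remark above about the bbox bottom.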
pedr0 Thank you very, very much. Hope to give back sooner or later.16:16.37 
  This is much clearer now16:16.46 