IRC Logs

Log of #ghostscript at irc.freenode.net.

2016/01/19
Robin_Watts tor8: Are you here?10:31.35 
tor8 Robin_Watts: I am.10:31.40 
Robin_Watts Ok, so this bidi stuff.10:31.47 
  I've basically got a first implementation working that works along the lines of your "My original thought" bit.10:32.36 
tor8 Yes, I've seen the code on your branch.10:32.57 
Robin_Watts It all seems very natural to do it that way, so I'm confused as to why you want to switch to a different approach.10:33.34 
  It has the important property that the order of chars sent to be drawn is never changed - stuff still gets plotted in the "logical" order, which is important for text extraction.10:34.20 
  It means the bidi evaluation code is only run once.10:34.53 
tor8 Robin_Watts: hm, have you tested the text extraction with this?10:35.01 
  because the text extraction assumes the text arrives in visual order and does a bidi reversing pass10:35.18 
Robin_Watts No. I expect the current text extraction code to be confused :)10:35.21 
  but I think the right thing to do is to keep characters being sent in the logical order.10:35.45 
tor8 we should look at what PDF documents in R2L languages tend to do10:35.49 
  yeah, that might be true10:36.20 
  hm, actually, that might be true10:36.43 
  we junk the logical order when we make the text extraction character soup10:37.00 
  which is something I'm starting to like less and less10:37.16 
  it seemed like the right choice back then, but it's come back to bite us more often than not :(10:37.34 
kens As I recall, right-to-left reading languages still put the characters in left-to-right order, in general, because that's how PDF text handling works10:38.27 
tor8 okay, so this approach looks doable. I'm not thrilled with the detect_directionality post-processing step but I also don't see how to do it on the fly with the bidi code we've copied in10:38.34 
  kens: yeah, glyph advances always go left to right in PDF don't they?10:39.07 
kens I believe so, yes10:39.14 
  You could fiddle it with negative numbers but that would be unreasonably complex, easier to reverse the text order10:39.36 
Robin_Watts tor8: We can modify the text extraction code to look to see if chars are from left to right, or right to left scripts, and append/prepend within the soup.10:39.59 
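A minimal sketch of the append/prepend idea, assuming characters arrive in visual (left-to-right) order as they typically do in PDF. The function name `visual_to_logical` is hypothetical and not part of mupdf; it uses Python's `unicodedata` in place of ucdn. Digits, mirroring, and explicit embeddings are deliberately ignored, so this only illustrates the prepend-within-a-run step.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}  # strong right-to-left bidi classes
LTR_CLASSES = {"L"}        # strong left-to-right bidi class


def visual_to_logical(visual_chars):
    """Rebuild logical order from characters given in visual order."""
    out = []
    run_start = None  # index in `out` where the current RTL run begins
    pending = []      # non-strong chars seen inside an RTL run, not yet placed
    for ch in visual_chars:
        cls = unicodedata.bidirectional(ch)
        if cls in RTL_CLASSES:
            if run_start is None:
                run_start = len(out)
            # Prepend within the run; neutrals caught between two RTL
            # chars belong to the run and are reversed along with it.
            out[run_start:run_start] = [ch] + pending[::-1]
            pending = []
        elif run_start is not None and cls not in LTR_CLASSES:
            # Hold neutrals until we know which side they belong to.
            pending.append(ch)
        else:
            # Strong LTR char, or anything outside an RTL run: plain append.
            out.extend(pending)
            pending = []
            run_start = None
            out.append(ch)
    out.extend(pending)
    return "".join(out)
```

Left-to-right text passes through unchanged, while a Hebrew or Arabic run that was drawn left to right comes back in reading order. A real implementation would need the full bidi rules for numbers and brackets.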
tor8 Robin_Watts: my main motivation for doing it in the draw step was to skip the detect_directionality step, and because I'll need the API for form filling and that would let us reuse more bits of functionality and look similar10:40.52 
Robin_Watts tor8: For editing, you'll want to pass in a unicode block, and get the fragments within that block identified.10:41.48 
  That's exactly what the underlying code does.10:41.56 
  The wrapper around that, which we currently call, copes with collating the fz_html_text nodes into a buffer, characterising that buffer, and splitting the fragments.10:42.36 
tor8 the underlying code duplicates a lot of the unicode character property tables we already have, but I assume you intend the bidi code to be polished up so it fits the mupdf style anyway?10:42.44 
Robin_Watts tor8: I'm certainly up for that, yes.10:42.56 
  If there are duplications in tables, and we can easily remove them, then we can do that.10:43.15 
tor8 the ucdn character database we have already has pretty well optimized character categorization and mirror pair lookups etc10:43.23 
  and they work for unicode 8.0 with stuff outside the BMP as well10:43.37 
  but it looks like it should be easy to replace those calls10:43.52 
Robin_Watts BMP ?10:43.54 
tor8 base multilingual plane ... the 16-bit part10:44.03 
Robin_Watts right.10:44.08 
kens Basic Multilingual Plane10:44.08 
Robin_Watts tor8: So I reckon that what you need for editing, should fall out fairly neatly as just another wrapper around the core characterisation functions.10:45.22 
tor8 Robin_Watts: yeah, I expect so10:45.32 
Robin_Watts SO had 2 different wrappers around the core, probably for exactly this. I only imported the one we needed.10:45.49 
tor8 duplicating the fribidi api should be doable without too much trouble10:45.53 
  why don't you mirror the characters directly in the detect_directionality step?10:46.31 
Robin_Watts so, if you have no strong objections, I'll tidy up what I've got to make it more mupdf-y and then we can get that in?10:46.31 
  tor8: because detect_directionality is supposed to keep the 'logical' characters.10:46.55 
tor8 Robin_Watts: okay, fair enough10:47.03 
  Robin_Watts: yeah. I don't mind camelCase for the internals but the whitespace and external apis should be kept in check10:47.39 
  and use the ucdn stuff if possible10:47.49 
Robin_Watts tor8: I'll have a look.10:47.54 
tor8 ucdn_get_bidi_class and ucdn_get_mirrored and ucdn_mirror should be all you need10:48.32 
Robin_Watts tor8: hmm.13:22.33 
  ucdn lists 4 char classes that the bidi code does not.13:22.49 
  (LRI, RLI, FSI, PDI)13:22.58 
  and there is a note in the bidi stuff that says "BDI_ON = BDI_N = 0, or the code doesn't work"13:23.29 
  which of course is not the encoding that ucdn uses.13:23.40 
  Ah, ok, those came in at 6.3.0, and the bidi code only goes to 6.2.0 (not 8.0.0)13:26.36 
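One way to bridge the two encodings is a small remapping layer, sketched here with Python's `unicodedata` standing in for ucdn. Everything in it is an assumption except the constraint that ON must be 0: the ordering in `LEGACY_ORDER`, the name `legacy_bidi_code`, and the choice to fold the four isolate classes into BN are illustrative only, not what the bidi code or mupdf actually does.

```python
import unicodedata

# Hypothetical internal numbering for a pre-6.3.0 bidi implementation.
# Only the first entry is taken from the discussion: ON must encode as 0.
LEGACY_ORDER = [
    "ON", "L", "R", "AN", "EN", "AL", "NSM", "CS", "ES", "ET",
    "BN", "S", "WS", "B", "RLO", "RLE", "LRO", "LRE", "PDF",
]
LEGACY_CODE = {name: i for i, name in enumerate(LEGACY_ORDER)}

# Classes added in Unicode 6.3.0 that the older code does not know about.
ISOLATE_CLASSES = {"LRI", "RLI", "FSI", "PDI"}


def legacy_bidi_code(ch):
    """Map a character to the older implementation's class numbering."""
    cls = unicodedata.bidirectional(ch)
    if cls in ISOLATE_CLASSES:
        # Assumption: treat isolate controls as boundary neutrals.
        cls = "BN"
    # Unassigned code points yield '' and fall back to ON (0) here.
    return LEGACY_CODE.get(cls, LEGACY_CODE["ON"])
```

Keeping the translation in one function means the class tables can come from a newer Unicode database without touching the numbering the resolution code depends on.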
gerry2016 joined..14:35.09 
Robin_Watts gerry2016: Welcome to irc.14:39.16 
  You need to register your nick with freenode, and then I can invite you to #artifex.14:40.16 
  You need to do: /msg nickserv register password email@address.com14:41.19 
gerry2016 yeah I did that before but it seemed to fail, maybe cos the nick I connected with was already in use, I'll try again with this nick..14:53.49 
  think it's registered now14:57.15 
Robin_Watts gerry2016: You're not logged in yet.14:58.14 
  Try /msg nickserv identify password14:58.28 
HenryStiles gerry2016: any luck getting to #artifex? I was planning to have our meeting there.15:18.41 
Robin_Watts Hi Ron___.16:52.41 
Ron___ greetings16:52.54 
Robin_Watts I'm going to take a stab at guessing that there is already a "Ron" and a "Ron_" and a "Ron__" online.16:53.21 
Ron___ Let me change to .....16:53.36 
RonL ok. changed to RonL16:54.07 
Robin_Watts Ok, so /msg nickserv register password email@address16:55.32 
  Then they'll send you an email with some instructions. Let me know when you've got that and followed the instructions.16:56.10 
RonL OK. Registered and verified17:06.30 
  I used the email address ron@rlicht.com17:06.42 
kens goodnight folks17:07.01 
Robin_Watts RonL: OK. I don't think you're identified yet.17:08.00 
  Try: /msg nickserv identify password17:08.12 
  Each time you connect to irc, you'll want to do: /msg nickserv identify password and /msg Chanserv invite #artifex17:09.32 
RonL Robin - I'm sorry I don't understand your instructions. In what channel do I enter the /msg command?17:11.51 
Robin_Watts RonL: #ghostscript will do.17:12.15 
  The message to nickserv says: "Hey, it's really me, here's my password to prove it".17:12.40 
  The message to chanserv says: "Hey, invite me to #artifex"17:13.06 
RonL and when you typed "password", you mean I enter MY password, right?17:13.11 
Robin_Watts RonL, yes :)17:13.18 
RonL (thanks)17:13.59 
  identify command returned "you are already logged in"17:14.45 
Robin_Watts Ok, that's good.17:14.57 
RonL the invite command seems to have worked too.17:15.33 
bofh_ btw I found out why pdfextract sometimes generates invalid TTF files19:36.01 
  it saves CID TrueType fonts (which amazingly are a thing) as a TTF SFNT without a CMAP table19:36.20 
  which, while legal, seems to break just about anything that isn't freetype19:36.30 
  in fact, a ton of programs straight-up crash on these files.19:36.42 
  (I have a fix, but it's sorta nasty, it essentially decompiles the sfnt, adds an identity cmap table, and then regenerates the sfnt).19:37.10 
  05:35 < tor8> we should look at what PDF documents in R2L languages tend to do <- determine rotation, then determine L2R or R2L, then sort blocks into columns, then do a reading-order sort is what nearly everyone says to do in my experience19:40.02 
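The ordering described above (direction first, then columns, then a reading-order sort) can be sketched roughly as follows. This is not stext-device code: the `Block` type and `reading_order` function are hypothetical, the page is assumed to be upright already (rotation handled earlier), direction is passed in rather than detected, and y is assumed to grow downwards.

```python
from dataclasses import dataclass


@dataclass
class Block:
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def reading_order(blocks, rtl=False):
    """Group blocks into columns by horizontal overlap, then order them."""
    columns = []
    for b in sorted(blocks, key=lambda blk: blk.x0):
        for col in columns:
            left = min(c.x0 for c in col)
            right = max(c.x1 for c in col)
            if b.x0 < right and b.x1 > left:
                # Overlaps this column's horizontal extent.
                col.append(b)
                break
        else:
            columns.append([b])
    # Columns run left to right, or right to left for R2L pages.
    columns.sort(key=lambda col: min(c.x0 for c in col), reverse=rtl)
    # Within a column, read top to bottom.
    return [b for col in columns for b in sorted(col, key=lambda blk: blk.y0)]
```

Doing the direction decision before column grouping is what lets the column order flip for R2L pages; real documents also need handling for headings that span columns and for mixed-direction pages.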
Robin_Watts bofh_: For text extraction?19:40.36 
bofh_ pdf text extraction, yes19:40.44 
Robin_Watts the problem is not how to do text extraction - we already have stuff in place for that.19:40.53 
bofh_ the current one does rtl before column alignment though, which will generate invalid output 19:41.18 
  sometimes, at least19:41.23 
  (I'm referring to stext-device)19:41.30 
Robin_Watts The problem is more that we want to present stuff to the device interface so that (where it is sane to do so) the device interface gets driven in the same way by epub files as pdf files.19:41.51 
bofh_ in fact it doesn't really do column *alignment* at all, despite stext_analyze() finding all that info19:42.00 
Robin_Watts bofh_: PDF text extraction is a research level thing, with lots of heuristics etc.19:42.41 
bofh_ oh I agree19:42.52 
Robin_Watts Our implementation isn't perfect at all, but it's a start.19:42.54 
bofh_ it's basically hell19:42.55 
Robin_Watts yes, that was what I thought.19:43.02 
  bofh_: Did you try the bidirectional stuff in epub?19:43.21 
bofh_ I'm trying to essentially implement pdftotext -layout in stext-device and even just that's being an incredible series of hilarious edge cases19:43.50 
  I tried it yesterday and it seems to do the right thing to my eyes, but I'll have to cross-check with someone I know that actually can read Hebrew or Arabic.19:44.18 
  (also I see tor already raised the point that the ucdn source already has bidi info. it also has NFD/NFKD, so I should post my patch for using that in text search if anyone wants it)19:45.31 
  Robin_Watts: consistency in the device interface is nice, I agree.19:46.04 
  epub has a separate layout step that pdf lacks, though.19:46.25 
Robin_Watts bofh_: Yes.20:14.59 
  The primary purpose of the device interface is that it has enough information to render things.20:15.24 
  (and we don't want layout happening below that interface for example, cos people implementing simple devices don't want to have to deal with that)20:15.52 
  but we should try to make the information we pass across that interface be as coherent as possible so that stuff like text extraction can work.20:16.26 
  I think the current text extraction code will get r2l text wrong, in that it'll extract it in l2r display order rather than r2l logical order.20:17.04 