| | 20160119 |
Robin_Watts | tor8: Are you here? | 10:31.35 |
tor8 | Robin_Watts: I am. | 10:31.40 |
Robin_Watts | Ok, so this bidi stuff. | 10:31.47 |
| I've basically got a first implementation working along the lines of your "My original thought" bit. | 10:32.36 |
tor8 | Yes, I've seen the code on your branch. | 10:32.57 |
Robin_Watts | It all seems very natural to do it that way, so I'm confused as to why you want to change to be working in a different way. | 10:33.34 |
| It has the important property that the order of chars sent to be drawn is never changed - stuff still gets plotted in the "logical" order, which is important for text extraction. | 10:34.20 |
| It means the bidi evaluation code is only run once. | 10:34.53 |
tor8 | Robin_Watts: hm, have you tested the text extraction with this? | 10:35.01 |
| because the text extraction assumes the text arrives in visual order and does a bidi reversing pass | 10:35.18 |
Robin_Watts | No. I expect the current text extraction code to be confused :) | 10:35.21 |
| but I think the right thing to do is to keep characters being sent in the logical order. | 10:35.45 |
tor8 | we should look at what PDF documents in R2L languages tend to do | 10:35.49 |
| yeah, that might be true | 10:36.20 |
| hm, actually, that might be true | 10:36.43 |
| we junk the logical order when we make the text extraction character soup | 10:37.00 |
| which is something I'm starting to like less and less | 10:37.16 |
| it seemed like the right choice back then, but it's come back to bite us more often than not :( | 10:37.34 |
kens | As I recall, right-to-left reading languages still put the characters in left-to-right order, in general, because that's how PDF text handling works | 10:38.27 |
tor8 | okay, so this approach looks doable. I'm not thrilled with the detect_directionality post-processing step but I also don't see how to do it on the fly with the bidi code we've copied in | 10:38.34 |
| kens: yeah, glyph advances always go left to right in PDF don't they? | 10:39.07 |
kens | I believe so, yes | 10:39.14 |
| You could fiddle it with negative numbers but that would be unreasonably complex; easier to reverse the text order | 10:39.36 |
Robin_Watts | tor8: We can modify the text extraction code to look to see if chars are from left to right, or right to left scripts, and append/prepend within the soup. | 10:39.59 |
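A minimal sketch of the kind of "bidi reversing pass" tor8 mentions, assuming the extractor receives characters in visual (left-to-right on the page) order. It is not MuPDF's code: the function names are hypothetical, and it only reverses maximal runs of strong right-to-left characters, which is far simpler than the full Unicode bidirectional algorithm.

```python
import unicodedata

# Strong right-to-left bidi classes (Hebrew-like "R" and Arabic-like "AL").
RTL_CLASSES = {"R", "AL"}


def is_rtl(ch):
    """True if the character has a strong right-to-left bidi class."""
    return unicodedata.bidirectional(ch) in RTL_CLASSES


def visual_to_logical(visual):
    """Reverse each maximal run of strong RTL characters (hypothetical helper).

    Neutral characters such as spaces end a run, so the order of several
    RTL words in a row is NOT corrected by this simplified pass.
    """
    out = []
    run = []
    for ch in visual:
        if is_rtl(ch):
            run.append(ch)
        else:
            out.extend(reversed(run))  # flush the pending RTL run, reversed
            run = []
            out.append(ch)
    out.extend(reversed(run))  # flush a run that reaches the end of the text
    return "".join(out)
```

If characters instead arrive in logical order, as Robin_Watts proposes, a pass like this would have to be dropped or inverted; applying it anyway would scramble the text.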
tor8 | Robin_Watts: my main motivation for doing it in the draw step was to skip the detect_directionality step, and because I'll need the API for form filling and that would let us reuse more bits of functionality and look similar | 10:40.52 |
Robin_Watts | tor8: For editing, you'll want to pass in a unicode block, and get the fragments within that block identified. | 10:41.48 |
| That's exactly what the underlying code does. | 10:41.56 |
| The wrapper around that, which we currently call, copes with collating the fz_html_text nodes into a buffer, characterising that buffer, and splitting the fragments. | 10:42.36 |
tor8 | the underlying code duplicates a lot of the unicode character property tables we already have, but I assume you intend the bidi code to be polished up so it fits the mupdf style anyway? | 10:42.44 |
Robin_Watts | tor8: I'm certainly up for that, yes. | 10:42.56 |
| If there are duplications in tables, and we can easily remove them, then we can do that. | 10:43.15 |
tor8 | the ucdn character database we have already has pretty well optimized character categorization and mirror pair lookups etc | 10:43.23 |
| and they work for unicode 8.0 with stuff outside the BMP as well | 10:43.37 |
| but it looks like it should be easy to replace those calls | 10:43.52 |
Robin_Watts | BMP ? | 10:43.54 |
tor8 | base multilingual plane ... the 16-bit part | 10:44.03 |
Robin_Watts | right. | 10:44.08 |
kens | Basic Multilingual Plane | 10:44.08 |
Robin_Watts | tor8: So I reckon that what you need for editing, should fall out fairly neatly as just another wrapper around the core characterisation functions. | 10:45.22 |
tor8 | Robin_Watts: yeah, I expect so | 10:45.32 |
Robin_Watts | SO had 2 different wrappers around the core, probably for exactly this. I only imported the one we needed. | 10:45.49 |
tor8 | duplicating the fribidi api should be doable without too much trouble | 10:45.53 |
| why don't you mirror the characters directly in the detect_directionality step? | 10:46.31 |
Robin_Watts | so, if you have no strong objections, I'll tidy up what I've got to make it more mupdf-y and then we can get that in? | 10:46.31 |
| tor8: because detect_directionality is supposed to keep the 'logical' characters. | 10:46.55 |
tor8 | Robin_Watts: okay, fair enough | 10:47.03 |
| Robin_Watts: yeah. I don't mind camelCase for the internals but the whitespace and external apis should be kept in check | 10:47.39 |
| and use the ucdn stuff if possible | 10:47.49 |
Robin_Watts | tor8: I'll have a look. | 10:47.54 |
tor8 | ucdn_get_bidi_class and ucdn_get_mirrored and ucdn_mirror should be all you need | 10:48.32 |
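As a rough illustration of the three lookups tor8 lists, here is a Python analogue built on the standard `unicodedata` module. It is not ucdn and does not use ucdn's signatures: the function names are hypothetical, and the mirror table is a deliberately tiny hand-written stand-in, since `unicodedata` has no mirror-pair lookup.

```python
import unicodedata

# Hypothetical, deliberately tiny mirror table; a real character database
# covers many more pairs than these.
MIRROR_PAIRS = {"(": ")", ")": "(", "[": "]", "]": "[", "<": ">", ">": "<"}


def bidi_class(ch):
    """Bidi class name such as 'L', 'R', 'AL', 'EN' or 'ON'."""
    return unicodedata.bidirectional(ch)


def is_mirrored(ch):
    """True if the character is flagged as mirrored in bidirectional text."""
    return bool(unicodedata.mirrored(ch))


def mirror(ch):
    """Mirrored counterpart if the small table knows one, else the character itself."""
    return MIRROR_PAIRS.get(ch, ch)
```

For example, `bidi_class("\u05d0")` (Hebrew alef) gives `"R"`, and `mirror("(")` gives `")"`.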
Robin_Watts | tor8: hmm. | 13:22.33 |
| ucdn lists 4 char classes that the bidi code does not. | 13:22.49 |
| (LRI, RLI, FSI, PDI) | 13:22.58 |
| and there is a note in the bidi stuff that says "BDI_ON = BDI_N = 0, or the code doesn't work" | 13:23.29 |
| which of course is not the encoding that ucdn uses. | 13:23.40 |
| Ah, ok, those came in at 6.3.0, and the bidi code only goes to 6.2.0 | 13:26.36 |
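One way to bridge the mismatch is a translation table between the database's classes and the older code's internal numbering. The sketch below is only an assumption about how that could look: the internal numbers are hypothetical (apart from ON being 0, as the quoted note requires), and folding the four isolate classes into BN is a guess at a reasonable fallback for pre-6.3 code, not something the log settles on.

```python
import unicodedata

# Hypothetical internal numbering for the older bidi code; ON must be 0.
INTERNAL = {
    "ON": 0, "L": 1, "R": 2, "AN": 3, "EN": 4, "AL": 5, "NSM": 6,
    "CS": 7, "ES": 8, "ET": 9, "BN": 10, "S": 11, "WS": 12, "B": 13,
    "RLO": 14, "RLE": 15, "LRO": 16, "LRE": 17, "PDF": 18,
}

# Assumed fallback: classes added in Unicode 6.3 are folded into BN so that
# code written against 6.2 never sees a class it does not know.
ISOLATE_FALLBACK = {"LRI": "BN", "RLI": "BN", "FSI": "BN", "PDI": "BN"}


def internal_class(ch):
    """Map a character to the hypothetical internal class number."""
    name = unicodedata.bidirectional(ch)
    name = ISOLATE_FALLBACK.get(name, name)
    # Unknown or empty class names fall back to ON (0).
    return INTERNAL.get(name, INTERNAL["ON"])
```

Whether BN is the right substitute would need checking against the Unicode data files and against the bidi code's own handling of ignorable characters.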
gerry2016 | joined.. | 14:35.09 |
Robin_Watts | gerry2016: Welcome to irc. | 14:39.16 |
| You need to register your nick with freenode, and then I can invite you to #artifex. | 14:40.16 |
| You need to do: /msg nickserv register password email@address.com | 14:41.19 |
gerry2016 | yeah i did that before but it seemed to fail, maybe cos the nick i connected with was already in use, ill try again with this nick.. | 14:53.49 |
| think its registered now | 14:57.15 |
Robin_Watts | gerry2016: You're not logged in yet. | 14:58.14 |
| Try /msg nickserv identify password | 14:58.28 |
HenryStiles | gerry2016: any luck getting to #artifex? I was planning to have our meeting there. | 15:18.41 |
Robin_Watts | Hi Ron___. | 16:52.41 |
Ron___ | greetings | 16:52.54 |
Robin_Watts | I'm going to take a stab at guessing that there is already a "Ron" and a "Ron_" and a "Ron__" online. | 16:53.21 |
Ron___ | Let me change to ..... | 16:53.36 |
RonL | ok. changed to RonL | 16:54.07 |
Robin_Watts | Ok, so /msg nickserv register password email@address | 16:55.32 |
| Then they'll send you an email with some instructions. Let me know when you've got that and followed the instructions. | 16:56.10 |
RonL | OK. Registered and verified | 17:06.30 |
| I used the email address ron@rlicht.com | 17:06.42 |
kens | goodnight folks | 17:07.01 |
Robin_Watts | RonL: OK. I don't think you're identified yet. | 17:08.00 |
| Try: /msg nickserv identify password | 17:08.12 |
| Each time you connect to irc, you'll want to do: /msg nickserv identify password and /msg Chanserv invite #artifex | 17:09.32 |
RonL | Robin - I'm sorry I don't understand your instructions. In what channel do I enter the /msg command? | 17:11.51 |
Robin_Watts | RonL: #ghostscript will do. | 17:12.15 |
| The message to nickserv says: "Hey, it's really me, here's my password to prove it". | 17:12.40 |
| The message to chanserv says: "Hey, invite me to #artifex" | 17:13.06 |
RonL | and when you typed "password", you mean I enter MY password, right? | 17:13.11 |
Robin_Watts | RonL, yes :) | 17:13.18 |
RonL | (thanks) | 17:13.59 |
| identify command returned "you are already logged in" | 17:14.45 |
Robin_Watts | Ok, that's good. | 17:14.57 |
RonL | the invite command seems to have worked too. | 17:15.33 |
bofh_ | btw I found out why pdfextract sometimes generates invalid TTF files | 19:36.01 |
| it saves CID TrueType fonts (which amazingly are a thing) as a TTF SFNT without a CMAP table | 19:36.20 |
| which, while legal, seems to break just about anything that isn't freetype | 19:36.30 |
| in fact, a ton of programs straight-up crash on these files. | 19:36.42 |
| (I have a fix, but it's sorta nasty, it essentially decompiles the sfnt, adds an identity cmap table, and then regenerates the sfnt). | 19:37.10 |
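To check whether an extracted font is affected, one can read the sfnt table directory and look for a `cmap` entry (the tag is lowercase in the file). This is a minimal sketch with hypothetical function names, not bofh_'s fix: it only detects the missing table and does not rebuild the font.

```python
import struct


def sfnt_table_tags(data):
    """Return the table tags listed in an sfnt table directory."""
    if len(data) < 12:
        raise ValueError("too short to be an sfnt file")
    # Header: uint32 version, uint16 numTables, then three more uint16 fields.
    _version, num_tables = struct.unpack(">IH", data[:6])
    tags = []
    for i in range(num_tables):
        start = 12 + 16 * i
        record = data[start:start + 16]
        if len(record) < 16:
            raise ValueError("truncated table directory")
        # Each record: 4-byte tag, uint32 checksum, uint32 offset, uint32 length.
        tag, _checksum, _offset, _length = struct.unpack(">4sIII", record)
        tags.append(tag.decode("latin-1"))
    return tags


def has_cmap(data):
    """True if the table directory lists a 'cmap' table."""
    return "cmap" in sfnt_table_tags(data)
```

Calling `has_cmap(open(path, "rb").read())` on an extracted file would flag the fonts that need an identity cmap added.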
| 05:35 < tor8> we should look at what PDF documents in R2L languages tend to do <- determine rotation, then determine L2R or R2L, then sort blocks into columns, then do a reading-order sort is what nearly everyone says to do in my experience | 19:40.02 |
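The last two steps of that recipe (sort blocks into columns, then sort in reading order) can be sketched as below. It is a toy under loud assumptions: blocks are hypothetical `(x0, y0, x1, y1, text)` tuples with y growing downward, rotation and direction detection are skipped (direction is passed in as a flag), and columns are found by simple horizontal overlap, which real pages will often defeat.

```python
def reading_order(blocks, rtl=False):
    """Order text blocks column by column (hypothetical helper).

    blocks: iterable of (x0, y0, x1, y1, text) tuples, y growing downward.
    rtl: if True, columns are read from right to left.
    """
    columns = []  # each entry: [min_x0, max_x1, [blocks in this column]]
    for block in sorted(blocks, key=lambda b: b[0]):
        for col in columns:
            # Join the first column whose horizontal span overlaps the block.
            if block[0] < col[1] and block[2] > col[0]:
                col[0] = min(col[0], block[0])
                col[1] = max(col[1], block[2])
                col[2].append(block)
                break
        else:
            columns.append([block[0], block[2], [block]])
    columns.sort(key=lambda c: c[0], reverse=rtl)
    ordered = []
    for col in columns:
        # Within a column, read from top to bottom.
        ordered.extend(sorted(col[2], key=lambda b: b[1]))
    return [b[4] for b in ordered]
```

Doing the column grouping before any per-line direction handling matches the ordering bofh_ describes; a block that spans the full page width would merge columns here, which is one of the many edge cases a real implementation has to handle.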
Robin_Watts | bofh_: For text extraction? | 19:40.36 |
bofh_ | pdf text extraction, yes | 19:40.44 |
Robin_Watts | the problem is not how to do text extraction - we already have stuff in place for that. | 19:40.53 |
bofh_ | the current one does rtl before column alignment though, which will generate invalid output | 19:41.18 |
| sometimes, at least | 19:41.23 |
| (I'm referring to stext-device) | 19:41.30 |
Robin_Watts | The problem is more that we want to present stuff to the device interface so that (where it is sane to do so) the device interface gets driven in the same way by epub files as pdf files. | 19:41.51 |
bofh_ | in fact it doesn't really do column *alignment* at all, despite stext_analyze() finding all that info | 19:42.00 |
Robin_Watts | bofh_: PDF text extraction is a research level thing, with lots of heuristics etc. | 19:42.41 |
bofh_ | oh I agree | 19:42.52 |
Robin_Watts | Our implementation isn't perfect at all, but it's a start. | 19:42.54 |
bofh_ | it's basically hell | 19:42.55 |
Robin_Watts | yes, that was what I thought. | 19:43.02 |
| bofh_: Did you try the bidirectional stuff in epub? | 19:43.21 |
bofh_ | I'm trying to essentially implement pdftotext -layout in stext-device and even just that's being an incredible series of hilarious edge cases | 19:43.50 |
| I tried it yesterday and it seems to do the right thing to my eyes, but I'll have to cross-check with someone I know that actually can read Hebrew or Arabic. | 19:44.18 |
| (also I see tor already raised the point that the ucdn source already has bidi info. it also has NFD/NFKD, so I should post my patch for using that in text search if anyone wants it) | 19:45.31 |
| Robin_Watts: consistency in the device interface is nice, I agree. | 19:46.04 |
| epub has a separate layout step that pdf lacks, though. | 19:46.25 |
Robin_Watts | bofh_: Yes. | 20:14.59 |
| The primary purpose of the device interface is that it has enough information to render things. | 20:15.24 |
| (and we don't want layout happening below that interface for example, cos people implementing simple devices don't want to have to deal with that) | 20:15.52 |
| but we should try to make the information we pass across that interface be as coherent as possible so that stuff like text extraction can work. | 20:16.26 |
| I think the current text extraction code will get r2l text wrong, in that it'll extract it in l2r display order, rather than r2l logical order. | 20:17.04 |