Ghostscript IRC logs

Log of #ghostscript at irc.freenode.net.

		20160119
Robin_Watts	tor8: Are you here?	10:31.35
tor8	Robin_Watts: I am.	10:31.40
Robin_Watts	Ok, so this bidi stuff.	10:31.47
	I've basically got a first implementation working that works along the lines of your "My original thought" bit.	10:32.36
tor8	Yes, I've seen the code on your branch.	10:32.57
Robin_Watts	It all seems very natural to do it that way, so I'm confused as to why you want to change to be working in a different way.	10:33.34
	It has the important property that the order of chars sent to be drawn is never changed - stuff still gets plotted in the "logical" order, which is important for text extraction.	10:34.20
	It means the bidi evaluation code is only run once.	10:34.53
tor8	Robin_Watts: hm, have you tested the text extraction with this?	10:35.01
	because the text extraction assumes the text arrives in visual order and does a bidi reversing pass	10:35.18
Robin_Watts	No. I expect the current text extraction code to be confused :)	10:35.21
	but I think the right thing to do is to keep characters being sent in the logical order.	10:35.45
tor8	we should look at what PDF documents in R2L languages tend to do	10:35.49
	yeah, that might be true	10:36.20
	hm, actually, that might be true	10:36.43
	we junk the logical order when we make the text extraction character soup	10:37.00
	which is something I'm starting to like less and less	10:37.16
	it seemed like the right choice back then, but it's come back to bite us more often than not :(	10:37.34
kens	As I recall, right-to-left reading languages still put the characters in left-to-right order, in general, because that's how PDF text handling works	10:38.27
tor8	okay, so this approach looks doable. I'm not thrilled with the detect_directionality post-processing step but I also don't see how to do it on the fly with the bidi code we've copied in	10:38.34
	kens: yeah, glyph advances always go left to right in PDF don't they?	10:39.07
kens	I believe so, yes	10:39.14
	You could fiddle it with negative numbers but that would be unreasonably complex, easier to reverse the text order	10:39.36
Robin_Watts	tor8: We can modify the text extraction code to look to see if chars are from left to right, or right to left scripts, and append/prepend within the soup.	10:39.59
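
The append/prepend idea above can be sketched roughly as follows. This is a minimal illustration, not the MuPDF text extraction code: it uses Python's `unicodedata` module rather than anything from MuPDF, the function name `visual_to_logical` is hypothetical, and it assumes a left-to-right base direction while ignoring numbers and embedding levels, which a full bidi implementation must handle.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}  # strong right-to-left bidi classes


def visual_to_logical(visual_chars):
    """Rebuild logical order from characters arriving in left-to-right visual order.

    Simplified sketch: right-to-left characters are prepended to the current
    run, everything else is appended. Digits and nested levels are not handled.
    """
    out = []      # finished logical text
    rtl_run = []  # current right-to-left run, built by prepending
    pending = []  # neutrals seen after the run started, direction undecided
    for ch in visual_chars:
        cls = unicodedata.bidirectional(ch)
        if cls in RTL_CLASSES:
            # Neutrals between two RTL characters belong to the run.
            for neutral in pending:
                rtl_run.insert(0, neutral)
            pending = []
            rtl_run.insert(0, ch)
        elif cls == "L" or not rtl_run:
            # A strong LTR character (or anything outside a run) closes the run.
            out.extend(rtl_run)
            out.extend(pending)
            rtl_run, pending = [], []
            out.append(ch)
        else:
            pending.append(ch)
    out.extend(rtl_run)
    out.extend(pending)
    return "".join(out)
```

For a purely right-to-left phrase, the visual order is the logical string reversed, so feeding the reversed string through `visual_to_logical` should give back the original.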
tor8	Robin_Watts: my main motivation for doing it in the draw step was to skip the detect_directionality step, and because I'll need the API for form filling and that would let us reuse more bits of functionality and look similar	10:40.52
Robin_Watts	tor8: For editing, you'll want to pass in a unicode block, and get the fragments within that block identified.	10:41.48
	That's exactly what the underlying code does.	10:41.56
	The wrapper around that, which we currently call, copes with collating the fz_html_text nodes into a buffer, characterising that buffer, and splitting the fragments.	10:42.36
tor8	the underlying code duplicates a lot of the unicode character property tables we already have, but I assume you intend the bidi code to be polished up so it fits the mupdf style anyway?	10:42.44
Robin_Watts	tor8: I'm certainly up for that, yes.	10:42.56
	If there are duplications in tables, and we can easily remove them, then we can do that.	10:43.15
tor8	the ucdn character database we have already has pretty well optimized character categorization and mirror pair lookups etc	10:43.23
	and they work for unicode 8.0 with stuff outside the BMP as well	10:43.37
	but it looks like it should be easy to replace those calls	10:43.52
Robin_Watts	BMP ?	10:43.54
tor8	base multilingual plane ... the 16-bit part	10:44.03
Robin_Watts	right.	10:44.08
kens	Basic Multilingual Plane	10:44.08
Robin_Watts	tor8: So I reckon that what you need for editing, should fall out fairly neatly as just another wrapper around the core characterisation functions.	10:45.22
tor8	Robin_Watts: yeah, I expect so	10:45.32
Robin_Watts	SO had 2 different wrappers around the core, probably for exactly this. I only imported the one we needed.	10:45.49
tor8	duplicating the fribidi api should be doable without too much trouble	10:45.53
	why don't you mirror the characters directly in the detect_directionality step?	10:46.31
Robin_Watts	so, if you have no strong objections, I'll tidy up what I've got to make it more mupdf-y and then we can get that in?	10:46.31
	tor8: because detect_directionality is supposed to keep the 'logical' characters.	10:46.55
tor8	Robin_Watts: okay, fair enough	10:47.03
	Robin_Watts: yeah. I don't mind camelCase for the internals but the whitespace and external apis should be kept in check	10:47.39
	and use the ucdn stuff if possible	10:47.49
Robin_Watts	tor8: I'll have a look.	10:47.54
tor8	ucdn_get_bidi_class and ucdn_get_mirrored and ucdn_mirror should be all you need	10:48.32
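
For readers unfamiliar with these lookups, here is a rough analogue using Python's standard `unicodedata` module. It is not the ucdn API: `unicodedata.bidirectional` and `unicodedata.mirrored` stand in for the class and mirrored-flag queries, and the `MIRROR_PAIRS` table and the helper names are hypothetical, since the standard library has no mirror-pair lookup.

```python
import unicodedata

# Hypothetical, deliberately tiny mirror-pair table for illustration only.
# A real character database covers many more pairs.
MIRROR_PAIRS = {
    "(": ")", ")": "(",
    "[": "]", "]": "[",
    "{": "}", "}": "{",
    "<": ">", ">": "<",
}


def bidi_class(ch):
    """Return the Unicode bidi class name, e.g. 'L', 'R', 'AL', 'ON'."""
    return unicodedata.bidirectional(ch)


def is_mirrored(ch):
    """True if the character has the Bidi_Mirrored property."""
    return unicodedata.mirrored(ch) == 1


def mirror(ch):
    """Return the mirrored counterpart if this sketch knows one, else ch."""
    if is_mirrored(ch):
        return MIRROR_PAIRS.get(ch, ch)
    return ch
```

A right-to-left run containing `(` would draw `mirror("(")` instead, while letters pass through unchanged.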
Robin_Watts	tor8: hmm.	13:22.33
	ucdn lists 4 char classes that the bidi code does not.	13:22.49
	(LRI, RLI, FSI, PDI)	13:22.58
	and there is a note in the bidi stuff that says "BDI_ON = BDI_N = 0, or the code doesn't work"	13:23.29
	which of course is not the encoding that ucdn uses.	13:23.40
	Ah, ok, those came in at 6.3.0, and the bidi code only goes to 8.0.0	13:26.28
	Ah, ok, those came in at 6.3.0, and the bidi code only goes to 6.2.0 not 8.0.0	13:26.36
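
One way to picture the mismatch is a translation layer between the class names a newer database reports and the codes an older bidi implementation expects. The sketch below uses Python's `unicodedata` in place of ucdn. The numeric codes in `INTERNAL` are invented for illustration; only the constraint that the neutral class has code 0 comes from the note quoted above. Folding the four isolate classes into the neutral class is an assumption made here as a stopgap, not something proposed in the discussion.

```python
import unicodedata

# Hypothetical internal codes for a pre-6.3 bidi implementation.
# Only "ON must be 0" is taken from the quoted note; the rest is made up.
INTERNAL = {
    "ON": 0, "L": 1, "R": 2, "AN": 3, "EN": 4, "AL": 5, "NSM": 6,
    "CS": 7, "ES": 8, "ET": 9, "BN": 10, "S": 11, "WS": 12, "B": 13,
    "RLO": 14, "RLE": 15, "LRO": 16, "LRE": 17, "PDF": 18,
}

# Classes added in Unicode 6.3 that the older code does not know about.
ISOLATES = {"LRI", "RLI", "FSI", "PDI"}


def internal_class(ch):
    """Translate a database bidi class into the older implementation's code."""
    cls = unicodedata.bidirectional(ch)
    if cls in ISOLATES or cls == "":
        # Assumption: treat isolates and unassigned characters as neutral.
        return INTERNAL["ON"]
    return INTERNAL[cls]
```

With a table like this the database can keep its own encoding while the algorithm still sees the neutral class as 0.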
*gerry2016*	joined..	14:35.09
Robin_Watts	gerry2016: Welcome to irc.	14:39.16
	You need to register your nick with freenode, and then I can invite you to #artifex.	14:40.16
	You need to do: /msg nickserv register password email@address.com	14:41.19
gerry2016	yeah i did that before but it seemed to fail, maybe cos the nick i connected with was already in use, ill try again with this nick..	14:53.49
	think its registered now	14:57.15
Robin_Watts	gerry2016: You're not logged in yet.	14:58.14
	Try /msg nickserv identify password	14:58.28
HenryStiles	gerry2016: any luck getting to #artifex. I was planning to have our there.	15:18.10
	gerry2016: any luck getting to #artifex? I was planning to have our meeting there.	15:18.41
Robin_Watts	Hi Ron___.	16:52.41
Ron___	greetings	16:52.54
Robin_Watts	I'm going to take a stab at guessing that there is already a "Ron" and a "Ron_" and a "Ron__" online.	16:53.21
Ron___	Let me change to .....	16:53.36
RonL	ok. changed to RonL	16:54.07
Robin_Watts	Ok, so /msg nickserv register password email@address	16:55.32
	Then they'll send you an email with some instructions. Let me know when you've got that and followed the instructions.	16:56.10
RonL	OK. Registered and verified	17:06.30
	I used the email address ron@rlicht.com	17:06.42
kens	goodnight folks	17:07.01
Robin_Watts	RonL: OK. I don't think you're identified yet.	17:08.00
	Try: /msg nickserv identify password	17:08.12
	Each time you connect to irc, you'll want to do: /msg nickserv identify password and /msg Chanserv invite #artifex	17:09.32
RonL	Robin - I'm sorry I don't understand your instructions. In what channel do I enter the /msg command?	17:11.51
Robin_Watts	RonL: #ghostscript will do.	17:12.15
	The message to nickserv says: "Hey, it's really me, here's my password to prove it".	17:12.40
	The message to chanserv says: "Hey, invite me to #artifex"	17:13.06
RonL	and when you typed "password", you mean I enter MY password, right?	17:13.11
Robin_Watts	RonL, yes :)	17:13.18
RonL	(thanks)	17:13.59
	identify command returned "you are already logged in"	17:14.45
Robin_Watts	Ok, that's good.	17:14.57
RonL	the invite command seems to have worked too.	17:15.33
bofh_	btw I found out why pdfextract sometimes generates invalid TTF files	19:36.01
	it saves CID TrueType fonts (which amazingly are a thing) as a TTF SFNT without a CMAP table	19:36.20
	which, while legal, seems to break just about anything that isn't freetype	19:36.30
	in fact, a ton of programs straight-up crash on these files.	19:36.42
	(I have a fix, but it's sorta nasty, it essentially decompiles the sfnt, adds an identity cmap table, and then regenerates the sfnt).	19:37.10
	05:35 < tor8> we should look at what PDF documents in R2L languages tend to do <- determine rotation, then determine L2R or R2L, then sort blocks into columns, then do a reading-order sort is what nearly everyone says to do in my experience	19:40.02
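
The column and reading-order steps in that sequence might look something like the sketch below. It is not how stext-device works: the `Block` type and `reading_order` function are hypothetical, rotation is assumed to be already normalised, y is assumed to grow downwards, and columns are found with a simple horizontal-overlap test where real extractors use far more heuristics.

```python
from dataclasses import dataclass


@dataclass
class Block:
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def reading_order(blocks, rtl=False):
    """Group blocks into columns by horizontal overlap, then read each
    column top to bottom. Columns run left to right unless rtl is set."""
    columns = []
    for b in sorted(blocks, key=lambda blk: blk.x0):
        for col in columns:
            left = min(c.x0 for c in col)
            right = max(c.x1 for c in col)
            if b.x0 < right and b.x1 > left:
                col.append(b)  # overlaps this column horizontally
                break
        else:
            columns.append([b])  # no overlap: start a new column
    columns.sort(key=lambda col: min(c.x0 for c in col), reverse=rtl)
    return [b for col in columns for b in sorted(col, key=lambda blk: blk.y0)]
```

The `rtl` flag only changes the order in which columns are visited; the direction of text inside each block is a separate step.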
Robin_Watts	bofh_: For text extraction?	19:40.36
bofh_	pdf text extraction, yes	19:40.44
Robin_Watts	the problem is not how to do text extraction - we already have stuff in place for that.	19:40.53
bofh_	the current one does rtl before column alignment though, which will generate invalid output	19:41.18
	sometimes, at least	19:41.23
	(I'm referring to stext-device)	19:41.30
Robin_Watts	The problem is more that we want to present stuff to the device interface so that (where it is sane to do so) the device interface gets driven in the same way by epub files as pdf files.	19:41.51
bofh_	in fact it doesn't really do column alignment at all, despite stext_analyze() finding all that info	19:42.00
Robin_Watts	bofh_: PDF text extraction is a research level thing, with lots of heuristics etc.	19:42.41
bofh_	oh I agree	19:42.52
Robin_Watts	Our implementation isn't perfect at all, but it's a start.	19:42.54
bofh_	it's basically hell	19:42.55
Robin_Watts	yes, that was what I thought.	19:43.02
	bofh_: Did you try the bidirectional stuff in epub?	19:43.21
bofh_	I'm trying to essentially implement pdftotext -layout in stext-device and even just that's being an incredible series of hilarious edge cases	19:43.50
	I tried it yesterday and it seems to do the right thing to my eyes, but I'll have to cross-check with someone I know that actually can read Hebrew or Arabic.	19:44.18
	(also I see tor already raised the point that the ucdn source already has bidi info. it also has NFD/NFKD, so I should post my patch for using that in text search if anyone wants it)	19:45.31
	Robin_Watts: consistency in the device interface is nice, I agree.	19:46.04
	epub has a separate layout step that pdf lacks, though.	19:46.25
Robin_Watts	bofh_: Yes.	20:14.59
	The primary purpose of the device interface is that it has enough information to render things.	20:15.24
	(and we don't want layout happening below that interface for example, cos people implementing simple devices don't want to have to deal with that)	20:15.52
	but we should try to make the information we pass across that interface be as coherent as possible so that stuff like text extraction can work.	20:16.26
	I think the current text extraction code will get r2l text wrong, in that it'll extract it in l2r display order, rather than r2l logical order.	20:17.04