| 20141010 |
rsc | So I can influence the font name for example. | 09:50.29 |
| Right now the "font name" is "DejaVuLGCSans-Identity-H" | 09:50.43 |
kens | Well tell it to use DejaVuLGCSans then. But I have no idea if that will work at all | 09:51.07 |
| The PostScript has 2 byte encoded text | 09:51.32 |
rsc | Is it enough to do a search and replace in the *.ps? | 09:51.37 |
kens | God no. | 09:51.43 |
| PostScript is a programming language | 09:51.54 |
| As I said, your text is double byte, you would need a type 0 or CIDFont to be able to handle that properly | 09:52.27 |
| The application generating the PostScript needs to not do that in order for you to get something which will work | 09:52.55 |
rsc | How to get "type 0"? I thought CIDFont is an issue? | 09:54.47 |
kens | Type 0 is what a CIDFont turns into when you load it. | 09:55.13 |
rsc | No I am totally confused. I thought CIDFont is the reason why I can't copy & paste without garbled results. | 09:55.43 |
| *Now | 09:55.46 |
kens | But it can also be manufactured by other means. However I doubt you can do it from the application | 09:55.48 |
| The fact that the application is using a CIDFont is why the copy/paste doesn't work, yes | 09:56.08 |
| And because the application is using a CIDFont, it emits the text in a form encoded suitably for the CIDFont. That form will *NOT* work with a type 1 or type 42 (what you would think of as TrueType) font. | 09:56.57 |
| So you cannot simply search and replace the font name, replacing a CIDFont with a regular font, because the text will not then be suitably encoded for that font. | 09:57.44 |
| FWIW the CIDSystemInfo attached to the CIDFont in the PDF file does say that the Ordering is Unicode, so a smart application could use that to figure out what the text is | 09:58.33 |
rsc | Then neither evince nor Adobe is smart. | 09:58.54 |
kens | Well it's a heuristic, and not totally reliable. It would be effort to code that, so I guess most people don't bother | 09:59.22 |
| The chances of it being present, and correct, are small | 09:59.34 |
| I must admit I'm not completely sure how our own txtwrite device is getting mostly useful text out, and I wrote that device..... | 10:00.10 |
rsc | The copy & paste result from Adobe looks "correct" but it is "10 00 41" - while "41" is "A" | 10:00.39 |
kens | Like I said, you are using 2 byte encodings (Unicode) so the 1st byte is always going to be 0x00 for Western languages | 10:01.27 |
| Frankly there is no way you are going to get a PDF file you can reliably cut and paste from, starting from the PostScript you are using. All the PDF consumers are going to be forced to fall back on guesswork (because there is no ToUnicode information available) so therefore unreliable | 10:03.25 |
| Some may work better than others. | 10:03.43 |
rsc | But the text for copy & paste is not separately supplied? It is generated from the pdf writer? | 10:07.03 |
| And can I not simply make it trim the first byte? :) | 10:07.36 |
kens | trimming the first byte from what exactly ? The PostScript file ? The PDF file ? The cut and pasted text ? | 10:08.25 |
| I don't know what you mean by the text being separately supplied | 10:08.53 |
rsc | Copy & paste result is 0x10 0x00 0x41 if I am not mistaken. So where does this exactly come from? | 10:09.03 |
| Is it possible to have some kind of hackish workaround there to only have 0x41 or 0x00 0x41 instead? | 10:09.42 |
| (to get a correct copy & paste result) | 10:09.54 |
kens | Well it comes from the application doing the cut and paste I guess. Where it comes from exactly I can't guess. However the text is present in the PDF file so it comes (basically) from there | 10:09.58 |
rsc | If I have an "A" in a PDF, is it there twice? Once for representation and once for copy & paste? | 10:10.42 |
kens | rsc by changing which file ? The only thing you can change is the cut and pasted text, change either the PostScript file by removing the bytes and it will give you an error when you try to process it, change the PDF file by removing the bytes and it will not open | 10:11.07 |
| rsc, no the text is only there once. | 10:11.17 |
| Cut/paste/search is done by examining the text in the PDF file. First you look up which font is being used (also in the PDF file) then you take the correct number of bytes and make a numeric character code from it. | 10:12.24 |
| What happens after that depends on the font and the information available. | 10:12.35 |
| If there's a ToUnicode CMap then you take the character code as an index, and that tells you the corresponding Unicode code point. | 10:13.03 |
| That's 100% reliable and the way most things work | 10:13.14 |
| If you don't have a ToUnicode CMap then you are left with guessing. | 10:13.28 |
rsc | So if I have to stick to CIDFont (which is likely because I can not change the application fundamentally), I need ToUnicode CMap definitely to get rid of this, right? | 10:13.53 |
kens | You can use the glyph names from type 1 fonts. You can look up the POST table (if it's present) from a TrueType font. If neither of those is available then most apps simply say 'let's hope it's ASCII' | 10:14.19 |
| rsc yes, if you are using a CIDFont the *only* reliable mechanism is a ToUnicode CMap | 10:14.45 |
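
The lookup kens describes can be sketched roughly as follows. This is a minimal illustration, not how any particular PDF consumer is implemented: the function name, the dictionary standing in for a ToUnicode CMap, and the "hope it's ASCII" fallback are all assumptions made for the example.

```python
from typing import Dict, Optional

def extract_text(raw: bytes, code_width: int,
                 to_unicode: Optional[Dict[int, str]]) -> str:
    """Turn the bytes of a PDF text string into Unicode text.

    code_width is the number of bytes per character code for the font in use
    (1 for a simple font, 2 for a typical Identity-H CIDFont).
    to_unicode is a stand-in for a ToUnicode CMap; None means it is absent.
    """
    out = []
    usable = len(raw) - len(raw) % code_width  # ignore a trailing partial code
    for i in range(0, usable, code_width):
        # Take the correct number of bytes and build a numeric character code.
        code = int.from_bytes(raw[i:i + code_width], "big")
        if to_unicode is not None:
            # Reliable path: the character code indexes the mapping.
            out.append(to_unicode.get(code, "\ufffd"))
        else:
            # Guesswork path: assume the code is ASCII and hope for the best.
            out.append(chr(code) if 0x20 <= code < 0x7F else "\ufffd")
    return "".join(out)
```

With a subset font the character codes can be arbitrary (here 1 and 2 standing for "A" and "B"), which is why the mapping has to be built per font and the guesswork path fails on the same bytes.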
rsc | Okay. For that usecase it would be enough if I cover characters from Western Europe. | 10:15.07 |
kens | You are using subset fonts, so you can't produce a 'one size fits all' ToUnicode CMap | 10:15.55 |
rsc | Let me go one step back. That fscking application here supports either "latin1" only by using Type1 fonts or "unicode" by using TTF. | 10:17.20 |
kens | If you say so. | 10:17.35 |
rsc | Can I somehow figure out if it uses CID for the "latin1 only" stuff? | 10:17.43 |
kens | Look at the PostScript and see what font name it uses | 10:17.57 |
| If it's a name of the form <font name>-Identity-H or similar then it's a CIDFont | 10:18.24 |
| Also you can look at the text in the PostScript and see if it's single byte or double byte encoded | 10:18.44 |
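
Both checks can be approximated with a few lines of script. This is only a rough heuristic sketch: the function names are made up for the example, the name test only covers the -Identity-H and -Identity-V suffixes, and the byte test assumes Western text written as a PostScript hex string, where every first byte of a 2-byte code would be 0x00.

```python
import re

# Font names ending in a predefined identity CMap name suggest a CIDFont.
CID_SUFFIX = re.compile(r"-Identity-[HV]$")

def looks_like_cid_font(font_name: str) -> bool:
    """True if the font name has the form <font name>-Identity-H (or -V)."""
    return bool(CID_SUFFIX.search(font_name))

def looks_double_byte(hex_digits: str) -> bool:
    """Guess whether the contents of a hex string such as <00410042> are 2-byte codes.

    Heuristic only: an even number of bytes, with every high byte equal to 0x00.
    """
    data = bytes.fromhex(hex_digits)
    return len(data) > 0 and len(data) % 2 == 0 and all(b == 0 for b in data[0::2])
```

For example, `looks_like_cid_font("DejaVuLGCSans-Identity-H")` is true while `looks_like_cid_font("NimbusSanL-Regu")` is false.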
rsc | NimbusSanL-Regu, Type 1C tells Evince here. | 10:19.30 |
kens | If your PostScript contained a GlyphNames2Unicode entry in the font dictionary then you would get a ToUnicode CMap generated for you, but since the PostScript doesn't actually have the font embedded, that can't happen | 10:20.01 |
| rsc yes that's a type 2 font, but it's basically the same | 10:20.26 |
| You should have single byte encoded text, I would guess it will copy/paste/search as you expect | 10:20.48 |
rsc | Copy/paste/search works, thus likely single byte encoded. | 10:21.17 |
kens | Yes. | 10:21.23 |
| Like I said, in the absence of any other information, applications will usually assume ASCII, and Latin1 is basically ASCII | 10:21.54 |
rsc | How can I generate such a "ToUnicode CMap"? | 10:22.52 |
kens | Like I said, you can't, it needs to be done programatically by the application embedding the font. | 10:23.24 |
| In case you hadn't guessed, you're in a very complicated area of PDF here | 10:23.45 |
rsc | Can't I provide some mapping list to Ghostscript? | 10:25.00 |
kens | Not really, no. | 10:25.14 |
rsc | Means a "ToUnicode CMap" is only a hypothetical but not practical solution? | 10:25.35 |
kens | It's highly practical for certain tasks; starting from another PDF file, or PostScript generated on Windows for instance. | 10:26.16 |
| But if your application isn't generating it, it's not easy to add afterwards. | 10:26.35 |
rsc | What would the application have to do exactly? | 10:27.33 |
kens | OK well there is no concept of a ToUnicode CMap in PostScript. The Windows PostScript driver has a specific extension which includes a /GlyphNames2Unicode entry in an embedded font dictionary and we support that extension. | 10:28.38 |
| So an application (or PostScript producer) would have to firstly embed the font (your app doesn't so it fails at the first hurdle) then it would have to add the entry to the dictionary and fill it in so that the character codes are matched to Unicode (actually UTF-16) values. In your case that would be an identity mapping of course. | 10:29.49 |
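
To make "identity mapping" and "actually UTF-16" concrete: if the character codes already are Unicode code points, each code maps to its own UTF-16 encoding. The sketch below shows only that arithmetic. It does not reproduce the actual layout of a /GlyphNames2Unicode dictionary, which is not described in this conversation, and the function name is hypothetical.

```python
def identity_to_utf16be(code: int) -> bytes:
    """Map a character code that is already a Unicode code point to UTF-16BE bytes."""
    return chr(code).encode("utf-16-be")

# Western text: the first byte is 0x00, the second is the familiar ASCII value.
assert identity_to_utf16be(0x41) == b"\x00\x41"
# Code points above U+FFFF need a surrogate pair, so the result is 4 bytes.
assert len(identity_to_utf16be(0x1F600)) == 4
```

Read in the other direction, `b"\x00\x41".decode("utf-16-be")` gives `"A"`, which is why the pasted bytes look like the right text with a zero byte in front of each character.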
rsc | Uhm. Nothing that can be easily done as non-C-programmer I guess. | 10:32.40 |
kens | No, I'm afraid not. | 10:32.49 |
| Just embedding the font would be a complex task | 10:32.58 |
rsc | But it is generic and not really application specific? | 10:35.17 |
| So is it something where Artifex could stick a price to it? | 10:35.42 |
kens | The ToUnicode CMap is part of the PDF specification, the GlyphNames2Unicode extension is specific to the Adobe PostScript driver on Windows | 10:35.54 |
rsc | No Windows involved here, just Linux. | 10:36.09 |
kens | I'm not sure what you are asking about.... | 10:36.25 |
chrisl | We'd have to modify every application that emits PostScript...... | 10:36.45 |
rsc | I thought if it could be an option to let you change the application to include the GlyphNames2Unicode entries to the PostScript. | 10:37.40 |
kens | As chrisl says, we would have to modify every application that emits PostScript. We would also have to change at least the one you are using to embed the fonts too. We don't have that kind of manpower | 10:38.23 |
rsc | Why every application that emits PostScript? I thought the issue is that my application here doesn't just do the right thing? | 10:38.56 |
kens | You seem to be talking generically, not about a specific application | 10:39.20 |
rsc | Oh, sorry if I was imprecise about that. | 10:39.47 |
kens | If you mean your specific application then it would need to be modified to embed fonts in the output, and add the relevant GlyphNames2Unicode information | 10:39.53 |
chrisl | And, frankly, especially right now, we don't have the man power to take on work like that | 10:40.05 |
kens | It would be a major undertaking for the people who maintain that application, well outside of anything we could undertake, especially at the moment. | 10:40.43 |
rsc | kens: okay, because it takes months to change that? | 10:41.03 |
| kens: can you give me a very rough estimation how huge it would be? | 10:41.16 |
| I anyway need to run to somebody and ask for budget etc. | 10:41.34 |
kens | Well we don't have any background in that application, so we would first have to understand it. Embedding fonts is a *very* complicated process and that in itself would take an experienced engineer (experienced with fonts and PostScript) months to write and test fully. | 10:42.17 |
rsc | Okay, so months. | 10:42.44 |
kens | Please don't ask us to undertake such a task, we would have to say no. | 10:42.50 |
| rsc months *if* you have an engineer experienced in PostScript and fonts. | 10:43.12 |
rsc | kens: yes, I got this. | 10:43.20 |
kens | There are very few of those in the world. | 10:43.20 |
rsc | Is changing the application from CIDFont to something else better faster done? | 10:44.03 |
| s/better/ | 10:44.11 |
kens | I imagine the application uses CIDFonts for the very excellent reason that its the only way to support non-Western languages | 10:44.40 |
rsc | Is it? But how does say, libreoffice, solve this? I don't see "TrueType (CID)" there in such PDFs. | 10:45.29 |
kens | So changing to another font type probably isn't an option. I imagine that text is stored internally as Unicode code point values, so it would be hard to change | 10:45.37 |
| rsc You can include 2 methods of course, one for Western text and one for non-Western (>256 characters in the language) | 10:46.26 |
| More complex of course | 10:46.34 |
| Supporting two methods for achieving the same end is usually something engineers ahte | 10:46.59 |
| s/ahte/hate/ | 10:47.40 |
chrisl | kens: what's the procedure when a bountiable bug is resolved? | 10:50.02 |
kens | I don't recall right now | 10:50.15 |
| Probably best to notify henry | 10:50.33 |
chrisl | I'll do that.... | 10:50.45 |
kens | Doesn't Shelly already know the procedure ? He must have claimed before..... | 10:51.27 |
chrisl | Yeh, I wasn't sure if it's a "pull" procedure from Shelly's end, or a "push" procedure from henrys | 10:52.02 |
kens | I have a suspicion it's a pull, but I could easily be mistaken, no harm in contacting henry anyway | 10:52.22 |
chrisl | Okay, I've let both Henry and Shelly know...... | 11:07.24 |
kens | Seems reasonable | 11:07.32 |
nsz | tor8: yesterday i tried the urls on http://git.ghostscript.com/?p=user/tor/mujs.git;a=summary but could not clone them | 11:25.08 |
| looking at the commit diff in browser looked ok, except i'd use isalpha instead of manual 'a'<=c && ... | 11:25.50 |
| libc isalpha generates smaller and faster code | 11:26.04 |
| http://git.musl-libc.org/cgit/musl/tree/include/ctype.h#n30 | 11:26.23 |
| this is how isalpha should be implemented | 11:26.35 |
| hm actually libc isalpha is not correct semantically but the musl implementation is how to do efficiently what you do there | 11:30.20 |
tor8 | nsz: libc isalpha is setlocale dependent, so unusable | 11:39.12 |
| and musl's isalpha (while minimal and elegant) only tests A-Za-z, not the full unicode range | 11:40.31 |
| nsz: I'm concerned that you couldn't clone the repo though | 11:41.00 |
nsz | i mean you could do muslisaplha(c) || isalpharune(c) | 11:42.16 |
| but it's just a minor nitpick | 11:42.29 |
tor8 | nsz: ahem, my bad. I'd confused git-export-daemon-ok and git-daemon-export-ok. should be able to clone now. | 11:42.45 |
nsz | :) | 11:42.55 |
tor8 | nsz: true, but as you said, it's a minor nitpick :) | 11:43.00 |
nsz | i'd assume the current code is just optimization and isalpharune handles the ascii case as well | 11:43.34 |
tor8 | nsz: yeah. isalpharune handles ascii as well, but it's quite a bit slower since it involves a binary search through a table | 11:44.03 |
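
The trade-off discussed here (a cheap ASCII test first, a table search only when needed) can be sketched as below. This is an illustration in Python rather than the mujs or musl source: the function names are made up, the 32-bit mask emulates C unsigned arithmetic, and the range table is a tiny, deliberately incomplete sample rather than a real Unicode table.

```python
from bisect import bisect_right

# Hypothetical, incomplete sample of inclusive (first, last) alphabetic ranges.
ALPHA_RANGES = [
    (0x41, 0x5A), (0x61, 0x7A),                    # A-Z, a-z
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0xFF),      # Latin-1 letters
    (0x391, 0x3A1), (0x3A3, 0x3A9), (0x3B1, 0x3C9) # Greek letters
]
_STARTS = [lo for lo, _ in ALPHA_RANGES]

def is_alpha_ascii(c: int) -> bool:
    """Locale-independent A-Za-z test: fold case with |32, then one range check."""
    return (((c | 32) - ord("a")) & 0xFFFFFFFF) < 26

def is_alpha_rune(c: int) -> bool:
    """Binary search for the range that could contain c."""
    i = bisect_right(_STARTS, c) - 1
    return i >= 0 and c <= ALPHA_RANGES[i][1]

def is_alpha(c: int) -> bool:
    # Fast path first; fall back to the slower table search otherwise.
    return is_alpha_ascii(c) or is_alpha_rune(c)
```

The fast path never consults the locale and is well defined for any integer input, which sidesteps both concerns raised about libc isalpha in this exchange.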
nsz | clone works but i can only check things later | 11:49.07 |
tor8 | nsz: no rush | 11:49.45 |
nsz | btw locale is not an issue with isalpha unless setlocale is called (and the libc supports more than one 8bit encodings) | 11:49.58 |
| the problem is that c>255 is ub | 11:50.19 |
rsc | kens: okay, thanks so far. | 11:52.31 |
tor8 | nsz: we're a library, we have no control over whether the user has called setlocale or not :( | 12:42.29 |
| hence we need to reimplement strtof and printf. such a stupid design, setlocale. | 12:42.49 |
nsz | yes that's a shame | 13:05.34 |
| btw strtof and float printf are tricky to implement correctly | 13:06.32 |
| (musl libc has correctly rounded implementations of these in c) | 13:06.54 |
henrys | chrisl, kens: shelly usually batches up a few and send me email then I review them. You don't need to do anything | 14:19.12 |
kens | thanks henrys | 14:19.48 |
zx | hello | 14:20.38 |
ghostbot | Welcome to #ghostscript, the channel for Ghostscript and MuPDF. If you have a question, please ask it, don't ask to ask it. Do be prepared to wait for a reply as devs will check the logs and reply when they come on line. | 14:20.38 |
chrisl | henrys: okay, cool, thanks! | 14:21.04 |
zx | I want to insert an image into an existing pdf file with mupdf. But there is no documentation, I hope you can help me. Can you give me some examples? I use C language. I have added an annotation, I can see the rect but can't find the image. | 14:21.43 |
| anyone here? | 14:23.27 |
Robin_Watts | zx: yes, people are here | 14:23.58 |
kens | I see we're getting customer emails not cc'ed to support again. | 14:24.14 |
| Halfway through some conversation :-( | 14:24.23 |
zx | can you give me a good way to insert an image into an existing pdf file | 14:26.43 |
Robin_Watts | zx: Not using current mupdf, no. | 14:27.13 |
kens | Adobe Illustrator,possibly Photoshop | 14:27.17 |
Robin_Watts | zx: You could try to use the new filter stuff in mupdf. | 14:27.46 |
| That would enable you to tack on arbitrary content to the end of the content streams. | 14:28.15 |
zx | could you give me an example | 14:28.20 |
Robin_Watts | but that requires a degree of PDF knowledge. | 14:28.25 |
| No examples, no, it's still very new code. | 14:28.36 |
| It was written to allow people to add watermarks. | 14:28.42 |
| It may still only be on my repo... | 14:29.06 |
| zx: http://git.ghostscript.com/?p=user/robin/mupdf.git;a=summary | 14:30.56 |
| The 'add post processing option to page operator cleaning' commit is the one you need. | 14:31.13 |
zx | ok thanks a lot | 14:32.13 |
Robin_Watts | essentially you call pdf_clean_page_contents and pass in the page you want to work with. | 14:32.28 |
| You also pass in a pdf_page_contents_process_fn. | 14:32.38 |
| That is called back after the page contents are cleaned, with the page contents in a buffer. | 14:33.02 |
| You can then append to the buffer. | 14:33.09 |
| Let me know how you get on with it. It's very new (almost completely untested) code. | 14:33.24 |
kens | gives up on the customer email, one for marcos to sanitise | 14:35.36 |
henrys | chrisl: NOCACHE doesn't work in pcl because it is done in gs_init.ps. so I need a call to 0 setcachelimit in all the other languages when we parse the parameter. | 15:47.40 |
chrisl | henrys: you could implement NOCACHE in pcl | 15:48.14 |
| or I can do it.... | 15:48.45 |
henrys | chrisl: no I got it. | 15:51.21 |
mvrhel_laptop | good morning | 15:53.11 |
kens | morning | 15:53.19 |
henrys | chrisl: I hate booleans that start with NO but I guess NOCACHE is something we are stuck with. | 17:10.18 |
rayjj | henrys: there's a lot of NO... options in the Ghostscript set | 17:11.36 |
henrys | we should try and be more positive | 17:12.09 |
rayjj | I think Peter sort of changed styles over time. | 17:12.16 |
| but I agree that -dUseCache=/false would be better (or even better -dUseFontCache=/false so we know which cache) | 17:13.06 |
chrisl | I guess the preference was for options that didn't need a "=....." | 17:16.07 |
kens | is amused by the Good emails :-) | 19:53.27 |