| 20160223 |
mvrhel_laptop | Robin_Watts: for the logs, there is a commit on my mupdf repos for you to review. It fixes some mem leaks in the pdf-device. When I was checking for leaks in my pdf_create branch I stumbled upon these in the master | 00:19.43 |
Robin_Watts | let me look. | 00:21.37 |
| looks good to me. | 00:22.32 |
mvrhel_laptop | oh Thanks Robin_Watts did not expect you to be up | 01:06.45 |
| I will push to golden then | 01:06.52 |
halabund | Can Ghostscript convert everything in a PDF to RGB (even if it is CMYK)? Also, can it get rid of layers in the PDF (whatever those are)? | 10:14.02 |
kens | It can certainly colour convert, it's in the documentation. | 10:14.23 |
| As for layers, it depends what you mean by layers | 10:14.31 |
halabund | My new colleagues don't know LaTeX, so I am forced to work in that abomination of MS Word and it keeps destroying some embedded PDFs. Google tells me that these two things may be the reason | 10:14.33 |
kens | But basically Ghostscript will attempt to maintain the content of a PDF file when processing it. So probably not, but like I said, it depends what you mean by layers | 10:15.17 |
| Oh of course you could always convert it into an EPS and embed that into MS Word | 10:18.08 |
| But depending on the content of the PDF you may not be happy with the result | 10:18.30 |
Robin_Watts | tor8: You here? | 11:55.10 |
tor8 | Robin_Watts: yes. | 11:56.41 |
Robin_Watts | So, was pondering this shapy texty thing when running this morning. | 11:56.56 |
tor8 | okay. | 11:57.24 |
Robin_Watts | At the moment we're managing to keep stuff fairly high level in the device interface. | 11:57.40 |
| It would seem to be a bit of a shame if we lose the ability to pass high level text through the device interface. | 11:58.17 |
| So, how would you feel about fz_text_spans gaining text direction information? | 11:58.49 |
tor8 | you mean the bidi direction? | 11:59.08 |
Robin_Watts | yes, ish. | 11:59.35 |
| For every piece of text, we potentially have some extra information. | 11:59.56 |
| 1) What language it's specified to be in in the source text. | 12:00.06 |
| 2) The direction given to it in the source text. | 12:00.16 |
| 3) The direction of the text (unset/l2r/r2l/number-in-r2l-context) | 12:01.09 |
tor8 | I am not entirely sold on the idea ... for PDF, XPS, etc we won't have any of this information set | 12:01.18 |
Robin_Watts | tor8: And so for PDF/XPS we can ignore it. | 12:01.30 |
| But for html (and other sources) we can pass it through. | 12:01.50 |
tor8 | which means you'll get varying results depending on what you do on text that looks identical but comes from different sources | 12:01.52 |
Robin_Watts | tor8: Yes. | 12:02.07 |
| but with the language information, text may not look identical, for instance. | 12:03.01 |
| (once we get that hooked up to harfbuzz) | 12:03.09 |
| We already have wmode as an int. | 12:03.46 |
| If we change that to be a bitfield, then we get all the other stuff included for no extra size. | 12:03.59 |
tor8 | Robin_Watts: ahem. I just zapped the wmode (and put it back into the fz_font where it belongs)... | 12:04.09 |
| so there's space for another field to take its place | 12:04.15 |
Robin_Watts | ok. | 12:04.20 |
tor8 | on the other hand, if we make it into a bitfield, we could put the wmode and all other extra bits you want here back into it | 12:04.47 |
Robin_Watts | We can also have a field in there for 'should be shaped' or not. | 12:05.04 |
| so PDF can leave that blank. | 12:05.23 |
tor8 | I was just being annoyed by passing around the wmode argument to all the text functions just because of XPS having IsSideways as an extra attribute not part of the font | 12:05.32 |
Robin_Watts | and we can have a routine that takes an unshaped fz_text to a shaped one. | 12:05.43 |
tor8 | but I could change my mind if we want to pass around other extra bits of information | 12:05.43 |
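The bitfield idea discussed above could look roughly like the sketch below. It is only an illustration of packing wmode, a direction value, a 'should be shaped' flag and a language id into one int-sized field. Every name here (`span_flags`, `SPAN_DIR_*`, the 15-bit language id) is hypothetical and is not taken from the MuPDF source.

```c
/* Hypothetical per-span flags; not MuPDF's actual structures. */
enum {
    SPAN_DIR_UNSET = 0,
    SPAN_DIR_L2R = 1,
    SPAN_DIR_R2L = 2,
    SPAN_DIR_NUM_IN_R2L = 3 /* number inside an r2l context */
};

typedef struct {
    unsigned int wmode : 1;         /* 0 = horizontal, 1 = vertical */
    unsigned int direction : 2;     /* one of SPAN_DIR_* */
    unsigned int needs_shaping : 1; /* left at 0 by PDF/XPS */
    unsigned int language : 15;     /* assumed packed language id, 0 = unknown */
} span_flags;

/* PDF/XPS know only the writing mode; everything else stays unset. */
static span_flags span_flags_for_pdf(int wmode)
{
    span_flags f = {0};
    f.wmode = wmode ? 1u : 0u;
    return f;
}

/* HTML (and similar sources) can pass the extra information through. */
static span_flags span_flags_for_html(int wmode, unsigned int dir, unsigned int lang)
{
    span_flags f = {0};
    f.wmode = wmode ? 1u : 0u;
    f.direction = dir & 3u;
    f.needs_shaping = 1u;
    f.language = lang & 0x7fffu;
    return f;
}
```

On common compilers the whole struct occupies the same space as the single int it would replace, which is the "no extra size" point; the exact layout of bitfields is implementation-defined, though.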
| hmmm, you mean stuffing raw unpositioned text into the fz_text and then shaping that to position it? not sure that belongs in there. | 12:06.47 |
| maybe if we add the extra bits of info that pdf text objects have, like the leading and charspace etc | 12:07.04 |
| so we can replicate the PDF text commands that go between BT and ET with the fz_text functions | 12:07.29 |
| text layout interfaces are complicated ... I would like to keep the fz_text simple, as a plain container for already laid out text | 12:08.28 |
| http://git.ghostscript.com/?p=user/tor/mupdf.git;a=blob;f=source/fitz/text.c;h=de4211cc8569eb61bcd30f9df7073e7cae43a5a5;hb=a1066e62b3337e3cb4c1108070f5f4b89d8fab3b#l99 | 12:09.17 |
| I'm pretty sure vertical text layout with that function is still "broken" -- we don't offset the origins using the metrics, etc | 12:10.44 |
| but if you could stuff harfbuzz into that function, that's all I wanted to start with | 12:11.11 |
| annotating the text spans with language and bidi levels; we could do that to let the text extraction device be smarter | 12:11.38 |
| or rather, let it be dumber by reading that info instead of trying to guess | 12:11.54 |
Robin_Watts | The problem is that if we have fz_text as a really dumb low-level "just put this text here" block of data, it means that text extraction etc or html-write or whatever has to work MUCH harder to extract the original information. | 12:13.46 |
tor8 | Robin_Watts: yeah, but it already needs to work that hard for PDF | 12:14.18 |
Robin_Watts | Having something that carries high level information which the low level info can be easily obtained from covers both ends. | 12:14.32 |
tor8 | still, I can see the point of having some high level information in there about bidi at least would be useful | 12:14.53 |
| seeing as we already carry along the unicode values | 12:15.06 |
Robin_Watts | yes, it needs to work hard for PDF, but it would be nice to remove some of the guesswork for cases that we can get away with. | 12:15.09 |
| yeah. | 12:15.10 |
| The bidi stuff is enough that we can work backwards from the shaped stuff losslessly, I think. | 12:15.42 |
tor8 | then I'm okay with adding bidi levels; and simply make the pdf/xps guess the bidi info | 12:16.40 |
| and then simplify the structured text bidi reversing stuff | 12:17.00 |
| in fact, I'd be perfectly okay with starting over the structured text extraction from scratch :) | 12:17.19 |
Robin_Watts | tor8: The text extraction falls loosely into 2 parts. | 12:44.26 |
| There is the gluing of text fragments back into spans, and then the derivation of things like columns etc from those spans. | 12:44.55 |
| I'm broadly happy with the approach we take for the first half of that problem (certainly it's better than things we've done before, cos it copes with text at an angle etc). | 12:45.31 |
| but it could probably be improved a bit. | 12:45.49 |
| The second half of the problem is a horrible nightmare though. I started it with good intentions and ended up just happy to get out alive. | 12:46.21 |
tor8 | Robin_Watts: yes, the first half is probably okay... it's the second half I'm having doubts about. | 12:49.09 |
Robin_Watts | tor8: The second half is a horrible problem. I am absolutely sure that it's possible to do a better job. | 12:49.43 |
| I'm also sure that it's a potential black hole for time. | 12:49.59 |
| It feels like a university level research project to me. | 12:50.28 |
| i.e. go away, and spend some time on it, and at the end of 3 years you might not have anything that works, but you should have enough stuff to write up a thesis on things you tried. | 12:51.07 |
tor8 | Robin_Watts: yes. not really something that belongs in a shipping product... | 12:54.06 |
Robin_Watts | The only saving graces of the stuff we have is that 1) it kinda works, and 2) it's optional. | 12:55.07 |
| You text extract, then you can call the analysis or not. | 12:55.22 |
ediee | Hi | 13:05.38 |
ghostbot | Welcome to #ghostscript, the channel for Ghostscript and MuPDF. If you have a question, please ask it, don't ask to ask it. Do be prepared to wait for a reply as devs will check the logs and reply when they come on line. | 13:05.38 |
ediee | I want to know that whether the images can be extracted from the pdf page?? | 13:06.07 |
tor8 | Robin_Watts: I wonder if (with the bidi flags added) we could skip the extraction step for stuff like search and copy&paste | 13:06.23 |
kens | If by 'images' you mean bitmaps, then yes | 13:06.26 |
ediee | ok | 13:06.48 |
tor8 | just get the fz_text objects and work from there. we'd still need to add the space insertion heuristics for pdf files that don't emit spaces. | 13:06.56 |
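A space insertion heuristic of the kind mentioned here might, as a rough sketch, compare the gap between consecutive glyphs against a fraction of the font size. The `toy_glyph` type, the function name and the 0.2 threshold are all assumptions made for illustration; this is not how MuPDF's text extraction actually decides.

```c
#include <stddef.h>

/* Hypothetical glyph record: character code, x origin and advance width. */
typedef struct {
    int ucs;
    float x;
    float advance;
} toy_glyph;

/*
 * Copy the glyphs of one horizontal run into out (ASCII only, for brevity),
 * inserting a space wherever the gap between two glyphs exceeds an assumed
 * threshold of 0.2 * font_size. Returns the length written, excluding the NUL.
 */
static size_t toy_join_with_spaces(const toy_glyph *g, size_t n,
                                   float font_size, char *out, size_t cap)
{
    const float threshold = 0.2f * font_size;
    size_t len = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (i > 0) {
            float gap = g[i].x - (g[i - 1].x + g[i - 1].advance);
            /* Only add a space if the file did not emit one itself. */
            if (gap > threshold && g[i - 1].ucs != ' ' && g[i].ucs != ' '
                && len + 1 < cap)
                out[len++] = ' ';
        }
        if (len + 1 < cap)
            out[len++] = (char)g[i].ucs;
    }
    if (cap > 0)
        out[len] = '\0';
    return len;
}
```

A real implementation would also have to cope with rotated text, vertical writing modes and kerning, which is part of why this is described as heuristics.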
ediee | In mupdf's reflow mode why the images is not showing?? | 13:07.16 |
kens | tor8 Robin_Watts that question is for you | 13:08.50 |
Robin_Watts | ediee: Dunno. | 13:09.04 |
| Presumably this is on Android ? | 13:09.32 |
ediee | yess... | 13:09.39 |
Robin_Watts | Is it all images, or just specific ones? | 13:09.48 |
ediee | In mupdf's reflow mode if the images exists in pdf page then it will not show | 13:09.57 |
| all the images | 13:10.05 |
| it shows only text | 13:10.09 |
Robin_Watts | I would have expected jpegs to work. Others should be converted to PNGs and get shown too. | 13:10.36 |
ediee | but how?? | 13:10.53 |
| bcoz we wont get any images array to do that | 13:11.11 |
kens | Are you writing an app, ediee? | 13:11.40 |
ediee | yess... planning to do so | 13:12.06 |
| but reflow mode is in dilemma | 13:12.16 |
kens | http://www.bbc.co.uk/news/education-35631030 OK are you clear on the licencing terms | 13:12.18 |
Robin_Watts | ediee: OK, so before we go any further, let's just check you understand the licensing terms. | 13:12.24 |
kens | D'oh | 13:12.25 |
Robin_Watts | MuPDF is released under 2 licences. You must use one of the licenses, or you can't distribute your app at all. | 13:12.53 |
ediee | ok... i didnt know that | 13:13.12 |
Robin_Watts | The first license is the GNU AGPL. | 13:13.19 |
ediee | wat are the 2 licenses... ?? | 13:13.24 |
Robin_Watts | This is a free license. It says (basically) that you can use the code for free, but in exchange, you must be prepared to give away the source for your ENTIRE app to any end user of your app that asks for it. | 13:14.09 |
| i.e. if fred bloggs gets your app, he gets the right to ask for the entire source code, which he can then pass on to anyone else he wants. | 13:15.03 |
| So, most people writing commercial applications think that that's a non-starter. | 13:15.24 |
| If you're writing a free app, then that may be fine though. | 13:15.48 |
ediee | im writing a free app only.. | 13:16.09 |
Robin_Watts | And you're happy to give away the full source code too ? | 13:16.24 |
ediee | and wat abt the second license? | 13:17.09 |
Robin_Watts | (Some people write free apps that talk to their own specific services, so they are unhappy to give away the source code.) | 13:17.15 |
| The second license is the Artifex Commercial license. | 13:17.29 |
| This costs money, but in exchange you are freed from all the strictures of the GNU AGPL. | 13:17.57 |
ediee | ok... wat are all the features I will get in v | 13:18.43 |
| Commercial license | 13:18.44 |
| ok... wat are all the features I will get in Commercial license?? | 13:18.58 |
Robin_Watts | ediee: Exactly the same code. | 13:19.03 |
| Exactly the same features. | 13:19.09 |
| Just you get to distribute it without having to abide by the terms of the GNU AGPL. | 13:19.32 |
ediee | ok | 13:19.42 |
| can I get solved with the reflow issue? | 13:19.50 |
| wat i described previously? | 13:19.57 |
Robin_Watts | ediee: Some commercial licenses come with support included. | 13:20.17 |
ediee | means? | 13:20.28 |
Robin_Watts | (or you can buy a separate support contract). | 13:20.30 |
| ediee: We're generally a friendly bunch, and will (time permitting) help out where we can. | 13:20.51 |
| Problems for commercial customers take priority of course. | 13:21.06 |
ediee | ok | 13:21.11 |
| so can u solve my problem? | 13:21.18 |
| for reflow mode? | 13:21.24 |
Robin_Watts | So, the way the reflow stuff works is that the page is run through the text extraction device. | 13:21.36 |
| This gives us a set of structures at the end (lines of text on the page etc). | 13:21.59 |
ediee | ok... but text extraction has no issues... the issue is with images | 13:22.11 |
Robin_Watts | We then have some code that converts those structures back into HTML. | 13:22.18 |
| And that's what the reflow code uses. | 13:22.25 |
ediee | if the page has images like mathematical formulaes, scientific notations, etc.... | 13:22.41 |
| they all wont be displayed in reflow mode | 13:22.52 |
Robin_Watts | If you set a flag on the text extraction device then it will keep images as part of that text extraction process too. | 13:22.54 |
| This did all work fine before. | 13:23.07 |
ediee | set a flag? | 13:23.16 |
| whr? | 13:23.20 |
Robin_Watts | It's possible it's been broken and we haven't noticed it. | 13:23.24 |
ediee | can u show some sample code? | 13:23.25 |
Robin_Watts | ediee: Are you using our example MuPDF app as a basis? | 13:23.51 |
kens | At this point, sharing an example file that does not work might be helpful | 13:24.16 |
ediee | Robin : yesss | 13:24.45 |
| kens : ok.. then can u plz share some links | 13:24.56 |
| which i can refer | 13:25.01 |
kens | No, I'm suggesting you share a file with us | 13:25.12 |
Robin_Watts | ediee: OK, so in platform/android/jni/mupdf.c | 13:25.27 |
ediee | ok | 13:25.45 |
| robin : can u plz elaborate? | 13:26.01 |
Robin_Watts | ediee: I'm telling you to load that into an editor. | 13:26.46 |
| Then look for the JNI_FN(MuPDFCore_textAsHtml) function | 13:27.03 |
| In there, you should see a call: | 13:27.19 |
| dev = fz_new_stext_device(ctx, sheet, text); | 13:27.27 |
| After that, try adding: | 13:27.32 |
| fz_disable_device_hints(ctx, dev, FZ_IGNORE_IMAGE); | 13:27.56 |
| That should tell the text extraction to stop ignoring images. | 13:28.20 |
| Then try that out. | 13:28.36 |
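For readers unfamiliar with device hints, the effect of the suggested call can be modelled with a toy bitmask. This is a sketch only: the `toy_` names are hypothetical, and the assumption that the text extraction device starts out with an ignore-images hint set is inferred from the remark above that disabling it makes the device "stop ignoring images".

```c
/* Toy model of device hints; none of these names are MuPDF's own. */
enum {
    TOY_IGNORE_IMAGE = 1,
    TOY_IGNORE_SHADE = 2
};

typedef struct {
    int hints;       /* bitmask of TOY_IGNORE_* values */
    int images_kept; /* how many images the device has recorded */
} toy_device;

/* Assumed default: a text extraction device ignores images and shades. */
static void toy_init_text_device(toy_device *dev)
{
    dev->hints = TOY_IGNORE_IMAGE | TOY_IGNORE_SHADE;
    dev->images_kept = 0;
}

/* Clear the given hint bits, leaving the others untouched. */
static void toy_disable_hints(toy_device *dev, int hints)
{
    dev->hints &= ~hints;
}

/* An image arriving at the device is dropped while the hint is set. */
static void toy_fill_image(toy_device *dev)
{
    if (dev->hints & TOY_IGNORE_IMAGE)
        return;
    dev->images_kept++;
}
```

In this model, an image sent before `toy_disable_hints(dev, TOY_IGNORE_IMAGE)` is silently dropped, and one sent afterwards is kept, while the shade hint stays in force.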
ediee | ok | 13:29.07 |
| Robin : let me try and get back to u | 13:29.17 |
| textAsHtml is used for reflow mode?? | 13:29.40 |
Robin_Watts | I believe so. | 13:32.19 |
ediee | Robin : stop ignoring images means I assume that it should include image... right? | 13:33.09 |
Robin_Watts | Yes. | 13:33.18 |
ediee | Robin : what happens if the pdf page itself is an image.. for e.g., a scan copy... | 13:34.58 |
Robin_Watts | ediee: Then reflow ain't gonna help much :) | 13:38.28 |
ediee | ok... but it will display the page... i presume | 13:38.45 |
Robin_Watts | ediee: Should do. | 13:38.52 |
ediee | ok... :) | 13:39.27 |
| Robin : let me try this | 13:39.37 |
| Robin : it does not shows images | 13:42.07 |
| i have tried | 13:42.10 |
| i think there is no img tag in JNI_FN(MuPDFCore_textAsHtml) | 13:42.34 |
| there we write all the html | 13:42.42 |
Robin_Watts | fz_print_stext_page_html(ctx, out, text) knows how to write img tags. | 13:42.58 |
| OK, so presumably you are either on a windows or a linux box ? | 13:43.30 |
ediee | linux | 13:43.43 |
Robin_Watts | OK, so build "mutool" for linux. | 13:43.56 |
ediee | but I want so file... to include in my android app | 13:44.23 |
Robin_Watts | Should be as easy as doing "make build=debug" in the top level. | 13:44.30 |
| ediee: Yes, I know what you want, this is a test. | 13:44.54 |
ediee | how to build mutool | 13:45.30 |
| ? | 13:45.31 |
Robin_Watts | Should be as easy as doing "make build=debug" in the top level. | 13:45.47 |
| Once you've built that, run: mutool draw -o out.html in.pdf | 13:46.09 |
ediee | ok | 13:46.09 |
Robin_Watts | and then hopefully there should be images in the out.html file. | 13:46.37 |
ediee | Robin : ok let me check | 13:47.00 |
| im getting fatal error while doing "make build=debug" | 13:48.13 |
| error is : fatal error: X11/Xcursor/Xcursor.h: No such file or directory | 13:48.18 |
Robin_Watts | make build=debug HAVE_X11=no | 13:49.12 |
HenryStiles | kens: I meant to tell you sometime ago there isn't intended to be a "set" in pjl. The only way to set something is through the language. I wanted to keep that as is. Do you need that for some reason, it looked like you just added it for completeness. | 13:52.16 |
kens | I don't remember adding a SET, is this the C code ? | 13:52.53 |
| Because as I recall it only works with DEFAULT | 13:53.11 |
ediee | Robin : I cant able to use mutool draw | 13:55.21 |
Robin_Watts | ediee: Why not? | 13:55.45 |
ediee | i dont know | 13:55.58 |
| the draw option is not there | 13:56.05 |
Robin_Watts | ediee: What version of mupdf are you using? | 13:56.18 |
ediee | 1.8 | 13:56.24 |
Robin_Watts | Do you have build/debug/mudraw ? | 13:57.03 |
| (You were running build/debug/mutool draw, right?) | 13:57.28 |
HenryStiles | kens: you added pjl_set_envvar and pjl_set_defvar, no? | 13:57.57 |
kens | Err probably | 13:58.09 |
| And yes, I think I added set_defvar for completeness | 13:58.37 |
| Also possibly because there was a C warning, but I'm unsure of that now. If it's a problem then you can pull it back out | 13:59.00 |
HenryStiles | kens: yeah just verifying you didn't need it for something with PDF/A | 13:59.46 |
ediee | Robin : yess it got work now | 13:59.57 |
| im checking the output | 14:00.08 |
| Robin : no the output is not as like as pdf page | 14:01.49 |
Robin_Watts | ediee: That's not what I asked. | 14:02.07 |
| What I asked was "are there images in the output" ? | 14:02.13 |
ediee | yess.. there is images in the output | 14:02.33 |
Robin_Watts | ediee: Right. | 14:02.47 |
kens | HenryStiles : If I need it I'd be calling it, so removing them will stop it compiling :-) | 14:03.01 |
ediee | Robin : but the page is not is the format what the original pdf has | 14:03.44 |
| ? | 14:03.46 |
Robin_Watts | ediee: So, if you've done the alteration to mupdf.c as I described above, and rebuilt correctly, then there will be images in the page that is sent to the webview for reflow. | 14:04.13 |
| The layout not being correct is an entirely different question :) | 14:04.32 |
ediee | ok... now wat abt the layout?? | 14:04.59 |
| its getting different | 14:05.03 |
Robin_Watts | ediee: Well, I can't comment on that without seeing an example file. | 14:05.14 |
| And even then, this is likely to be something that will require me to invest some time into looking at it. | 14:05.38 |
ediee | Robin : ok | 14:07.20 |
| I will try | 14:07.26 |
| and let u knw | 14:07.30 |
| thanks you for ur support | 14:08.51 |
| will u be available tomorrow? | 14:08.59 |
Robin_Watts | ediee: I will be here tomorrow, yes. | 14:11.12 |
ediee | ok let me try today.. I will chat with u tomorrow abt today's progress | 14:11.45 |
inarus | Hi | 15:06.44 |
ghostbot | Welcome to #ghostscript, the channel for Ghostscript and MuPDF. If you have a question, please ask it, don't ask to ask it. Do be prepared to wait for a reply as devs will check the logs and reply when they come on line. | 15:06.44 |
inarus | I need concatenate pdf. I am working for a company. Can I use the publicly available soft or do I need the commercial one? | 15:10.27 |
kens | Which software are you referring to ? | 15:10.50 |
| In either case (MuPDF, Ghostscript) the software is provided under the terms of the GNU AGPL; provided you abide by the terms of the licence, you can use it. Otherwise you need a commercial licence. | 15:11.40 |
| Please note that if you are referring to Ghostscript it does *NOT* concatenate PDF files. | 15:12.00 |
inarus | Ok. My bad, I read some Web pages explaining how to concatenate pdf files with ghostscript. I must have misunderstood | 15:14.12 |
kens | Many people think that Ghostscript concatenates PDF files, it does not. It interprets the input and can create a *new* PDF file which is visually the same as the input(s). However, the actual contents of the PDF files are not reflected in the output, so it is not concatenating the files. | 15:15.24 |
inarus | Do you mean that content and/or formating could be missed? | 15:19.37 |
kens | The *visual* appearance should be the same. Metadata may not be carried over and the internal representation will not be the same | 15:20.06 |
Robin_Watts | inarus: Stuff like Outlines or Annotations etc | 15:20.23 |
kens | No, Outlines and Annotations are preserved | 15:20.33 |
Robin_Watts | kens: Stuff *like* Outlines and Annotations :) | 15:21.00 |
kens | But the Creator won't be, nor will some other elements, and the fonts may be differently described, the character codes could be different, images may be compressed differently etc | 15:21.07 |
inarus | Ok I get it. That might be a major issue for me, thank you | 15:22.21 |
kens | NP | 15:22.27 |
rayjj | inarus: the logs may not have caught up, but for most purposes, gs can combine PDF's into a single PDF. Links that specify a page number may be a problem (kens can address that) | 15:27.12 |
| kens: does the pdfwrite adjust the page number destination in links for PDF's after the first input ? | 15:28.00 |
kens | Up to a point yes | 15:28.37 |
tor8 | Robin_Watts: a bunch of commits on tor/master for review. sebras' stuff is LGTM but a second pair of eyes wouldn't hurt. | 21:34.40 |
marcosw | HenryStiles: ping | 23:06.44 |
HenryStiles | marcosw: hi | 23:38.15 |