Log of #mupdf at irc.freenode.net.

20171218
sebras JayVii: with mutool you can probably do it if you know the structure of PDF00:35.11 
  JayVii: better annotation support is currently in development, but not yet finished.00:35.42 
  JayVii: I assume you are talking about text annotations that popup if an icon is clicked?00:36.20 
  JayVii: if so the icon should be drawn currently, but it cannot be clicked to show the text yet.00:37.01 
  JayVii: if you run mutool show doc.pdf p it will list what PDF objects belong to each page. if you then do something like mutool show doc.pdf 42 where 42 is the PDF object belonging to the page you want you ought to see the PDF object contents for that page.00:39.56 
  JayVii: in that object there should be a /Annots if the page has annotations.00:40.12 
  JayVii: this is usually an array of object numbers, so if you then do mutool show doc.pdf 123 where 123 is each of these object numbers you will see how each such annotation is created.00:41.00 
  JayVii: a text annotation typically has a /Subtype /Text in there somewhere and likely the text is in /Contents.00:41.51 
  JayVii: this is the easiest case, there might be details that make it harder, like all the encoding of the Contents, or that the contents is in another indirect PDF object. it all depends on how the PDF was created and by what software.00:43.40 
  JayVii: once text annotation support is fully implemented of course the text should be visible somehow, but we're not there yet.00:44.22 
  JayVii: it is quite late so I'm heading out now, good night!00:44.52 
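The lookup sebras describes (page object, then /Annots, then each annotation object) can be sketched in Python. This is only an illustration under stated assumptions: the sample page object text is hypothetical, the helper names are made up for this example, and the sketch only handles the easy case where /Annots is an inline array of indirect references. It does not parse real `mutool show` output beyond that, and it does not cover /Annots stored in a separate indirect object.

```python
import re

# Matches an inline /Annots array, e.g. "/Annots [ 123 0 R 124 0 R ]".
ANNOTS_RE = re.compile(r"/Annots\s*\[([^\]]*)\]")
# Matches one indirect reference "<object number> <generation> R".
REF_RE = re.compile(r"(\d+)\s+(\d+)\s+R")


def annotation_object_numbers(page_object_text):
    """Return the object numbers listed in a page object's inline /Annots array."""
    match = ANNOTS_RE.search(page_object_text)
    if match is None:
        return []  # no inline /Annots array found in this page object
    return [int(num) for num, _gen in REF_RE.findall(match.group(1))]


def show_commands(pdf_path, object_numbers):
    """Build one `mutool show <file> <object number>` argv per annotation."""
    return [["mutool", "show", pdf_path, str(n)] for n in object_numbers]


# Hypothetical page object text for illustration, not taken from a real file.
SAMPLE_PAGE = """42 0 obj
<< /Type /Page /Parent 3 0 R /Annots [ 123 0 R 124 0 R ] /Contents 43 0 R >>
endobj"""

if __name__ == "__main__":
    for argv in show_commands("doc.pdf", annotation_object_numbers(SAMPLE_PAGE)):
        print(" ".join(argv))
```

Each printed command could then be run by hand to look for /Subtype /Text and /Contents in the annotation object.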
  kens: I'll take a look at 698819 and 698820.11:00.55 
kens Thanks sebras, feel free to change the assignments, I never know who to put them to for MuPDF11:01.14 
sebras kens: I think it is safe to put tor8.11:01.32 
kens <sigh> there's a 21 as well now11:01.34 
sebras kens: I'll take them when I can.11:01.51 
  kens: ok. that will follow later then.11:01.58 
kens Thanks, I wouldn't be surprised to find they are duplicates and already fixed by your previous patches.11:02.11 
sebras kens: right.11:02.23 
  I would expect so too.11:02.33 
kens I know they can never reasonably be completely up to date, but it would be nice if they *tried* to use current code.....11:03.33 
sebras kens: they have been using 1.12.0 which we released recently.11:08.21 
kens Yes, but it's not the current code11:08.30 
sebras kens: the xref-related fixes did not go in before the tag though.11:08.33 
kens and hence doesn't (I assume) have your xref fixes11:08.43 
sebras exactly.11:08.49 
kens Like I said, I understand they can't be 100% up to date, but they *could* test the latest SHA1 before filing the bug report11:09.29 
sebras kens: so how do I mark these bugs? RESOLVED INVALID?11:09.43 
  kens: especially for security bugs. they are devs, so they should know how to find it/compile it/etc.11:10.16 
kens IMO If they are already fixed, then 'WORKSFORME' though you might mention that these have been recently fixed and if they tested the current code they would find that out before opening bug reports11:10.19 
sebras kens: ok, so they all triggered the same issue.11:21.18 
kens all fixed ?11:21.27 
sebras kens: yes, they were all fixed by the fix to 698804.11:21.43 
  kens: so it was just a matter of testing a bit to confirm this fact.11:21.55 
kens Yeah, rather thought they might be, they all looked very similar11:22.05 
sebras kens: however I was reminded of the bugzilla work to fix the bugs. :)11:22.46 
  kens: ehm.. not fix them, mark them as fixed!.11:22.58 
kens right, which is a good thing I guess11:23.07 
sebras kens: it keeps bugzilla tidy, and allows us to see what _real_ issues remain, so yes, I think it is good. :)11:25.25 
kens Yep, I'm all in favour :-D11:25.37 
sebras kens: I think the guy was just retriggered since we released a new version that did not fix the security issues.11:26.04 
kens Umm, possibly.....11:26.18 
  But again, that's always going to be possible, if you report a bug late enough in the release cycle11:26.39 
sebras kens: indeed.11:26.56 
JayVii sebras thanks for the clarification. I'll see if i get along with doing it via mutool for now22:50.01 
janzo is it possible to convert a pdf to a flat pdf (images)22:55.11 
  using mutool22:55.22 
  maybe something like mutool draw input.pdf -format png piped into a pdf22:57.09 
sebras janzo: so using a pdf as input, you want to render each page to a PNG file and then you want to create a new pdf using these PNG files?22:58.17 
  where each page is one of the PNG files..?22:58.32 
janzo yea23:12.07 
  currently i am using imagemagick convert to do this23:13.23 
  convert input.pdf output.pdf23:13.35 
  second question. i see you can output pdf to text, mutool.exe convert -o test.txt test.pdf23:15.58 
  but i dont see any options in the documentation23:16.07 
  i've been using pdftotext (poppler tools) and it has a few options like layout, etc.23:16.54 
  does this have anything similar?23:17.32 
sebras janzo: text output ought to be documented on the man page. I see it. :)23:21.55 
janzo im looking here https://artifex.com/developers-mupdf-documentation/command-line-tools/#show23:24.36 
sebras janzo: ah, then you should be looking at either https://artifex.com/developers-mupdf-documentation/command-line-tools/#draw or https://mupdf.com/docs/manual-mutool-draw.html23:25.12 
  janzo: or if you are on linux: "man mutool"23:25.40 
  janzo: to output each page as a PNG: mutool convert -o page%04d.png -F png doc.pdf23:28.47 
janzo ok thanks23:29.03 
sebras janzo: to create a pdf from a single PNG you can create a file page1.txt with these contents https://pastebin.com/raw/kf5uXkZ623:29.59 
  next you run mutool create -o new.pdf page1.txt23:30.09 
  you can list extra page2.txt page3.txt at the end if you want to add more pages.23:30.38 
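The contents of the pastebin page file are not reproduced in this log, so the following Python sketch is an assumption about what such a file could look like. It relies on the page description format of `mutool create` (the `%%MediaBox` and `%%Image` directives followed by ordinary PDF content stream operators). The function names, the resource name `Im0`, and the page size are hypothetical choices for illustration.

```python
def page_description(image_path, width_pt, height_pt):
    """Return page description text that fills one page with one image."""
    return (
        f"%%MediaBox 0 0 {width_pt} {height_pt}\n"
        f"%%Image Im0 {image_path}\n"
        # Scale the unit image square up to the page size, then paint it.
        f"q {width_pt} 0 0 {height_pt} 0 0 cm /Im0 Do Q\n"
    )


def create_command(output_pdf, page_files):
    """Build the `mutool create` argv; each page file becomes one page."""
    return ["mutool", "create", "-o", output_pdf, *page_files]


if __name__ == "__main__":
    # Assumed US Letter page size in points (612 x 792).
    print(page_description("page0001.png", 612, 792), end="")
    print(" ".join(create_command("new.pdf", ["page1.txt", "page2.txt"])))
```

Writing one such text file per rendered PNG and passing them all to the command would give one PDF page per image.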
janzo i see. no real easy command lines for this23:30.59 
sebras janzo: flattening PDFs into images causes the fonts and vector graphics to be converted into pixels.23:31.41 
  janzo: and that just looks bad if you zoom in, so it is not something I (or we) generally do. :)23:32.03 
janzo ok here's the problem im trying to solve. we are trying to read a pdf's text, but not all pdfs have legible text, some are encrypted and look like bad characters23:36.33 
  the solution we are using now is making them into images and then doing ocr23:37.03 
sebras encrypted?23:44.13 
  janzo: PDF files can be encrypted, but if they are and you do not know the password then you would not be able to view it at all.23:44.39 
  janzo: as I understand you, you _do_ see text, just not the text you expect..?23:44.58 
  janzo: does it only happen in mupdf? does it look the same in acrobat or evince or some other PDF viewer?23:45.39 
  janzo: if it looks the same in mupdf as in other viewers then there is probably something wrong with the file.23:46.13 
  janzo: if it looks worse in mupdf I would appreciate if you could make the file available. e.g. by reporting a bug over at bugs.ghostscript.com23:46.47 
Robin_Watts sebras: I suspect he means that when he cuts and pastes from the PDF he doesn't get the chars he expects.23:46.47 
sebras Robin_Watts: ah! yes, perhaps it is copy/paste that doesn't work.23:47.04 
janzo yes copy / paste23:47.25 
  the pdf is readable23:47.34 
Robin_Watts janzo: Ok, so the problem is that text in PDF is sent as a series of glyph ids, NOT a series of unicode values.23:47.57 
janzo the text is obfuscated or something, i dont know the right terminology23:47.58 
Robin_Watts Fonts are often subsetted, and glyphs sent out of order, so you get glyphs 1,2,3,4,5 rather than chars 65, 73, 85, 67 etc.23:48.42 
  You *can* have a mapping from glyph to unicode embedded in the PDF but frequently people don't bother.23:49.08 
  hence yes, you're right, you can get PDFs where they look visually correct, but the info is not there.23:49.29 
  And there is nothing we can do about it.23:49.40 
  OCR is a reasonable solution.23:49.46 
  Using the latest mupdf:23:50.17 
  mutool draw -o out.pclm in.pdf23:50.42 
  Or perhaps:23:50.51 
  mutool draw -o out.pdf -F pclm in.pdf23:51.05 
janzo whats pclm23:51.09 
Robin_Watts It's a modified form of PDF used for some printers.23:51.45 
  Basically it's image strips wrapped up in a PDF.23:51.55 
  so I think it's what you're asking for.23:52.07 
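Robin_Watts's point about glyph ids versus Unicode values can be illustrated with a toy model in Python. Everything here is hypothetical: the glyph numbering, the mapping table, and the fallback behaviour are simplifications for illustration, not how MuPDF or any other extractor is implemented.

```python
# Hypothetical subset font: glyph ids were handed out in order of first use,
# so they have no relation to character codes.
TO_UNICODE = {1: "H", 2: "e", 3: "l", 4: "o"}


def extract_text(glyph_ids, to_unicode=None):
    """Mimic copy/paste from a page whose text is a series of glyph ids."""
    if to_unicode is None:
        # No glyph-to-Unicode mapping embedded: the ids are all there is,
        # so treating them as character codes yields garbage.
        return "".join(chr(g) for g in glyph_ids)
    # U+FFFD marks glyphs the mapping does not cover.
    return "".join(to_unicode.get(g, "\ufffd") for g in glyph_ids)


if __name__ == "__main__":
    page_text = [1, 2, 3, 3, 4]  # renders as "Hello" on screen
    print(repr(extract_text(page_text, TO_UNICODE)))
    print(repr(extract_text(page_text)))
```

With the mapping the extracted string matches what is drawn. Without it the page still looks correct, but the extracted string is control characters, which is the situation where OCR on a rendered image helps.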
sebras Robin_Watts: ah, smart. why didn't I think of that?23:52.11 
  Robin_Watts: so the intermediate step is not PNG, but then that doesn't matter.23:52.30 
janzo ok i think the resolution needs to be higher23:52.42 
  it's looking blurry23:52.47 
Robin_Watts mutool draw -r 200 -o out.pdf -F pclm in.pdf23:52.53 
janzo i use 288 in gs23:53.02 
Robin_Watts 200 is fax quality.23:53.04 
janzo ocr doesnt like blurry text :)23:53.24 
Robin_Watts 600 dpi is high quality laser. I'd imagine anything over 300 would be fine.23:53.26 
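To compare the resolutions mentioned here, a small Python sketch can compute the pixel size of a rendered page and build the `mutool draw` command Robin_Watts gave. The US Letter page size (612 x 792 points) and the function names are assumptions for illustration. The only fact relied on is that PDF default user space is 72 units per inch.

```python
POINTS_PER_INCH = 72  # PDF default user space: 1 unit = 1/72 inch


def rendered_size(width_pt, height_pt, dpi):
    """Return (width, height) in pixels for a page rendered at `dpi`."""
    return (
        round(width_pt * dpi / POINTS_PER_INCH),
        round(height_pt * dpi / POINTS_PER_INCH),
    )


def draw_command(input_pdf, output_pdf, dpi):
    """Build the argv for rendering a PDF to PCLm at the given resolution."""
    return ["mutool", "draw", "-r", str(dpi), "-o", output_pdf, "-F", "pclm", input_pdf]


if __name__ == "__main__":
    for dpi in (200, 288, 300, 600):
        width, height = rendered_size(612, 792, dpi)  # assumed US Letter page
        print(dpi, "dpi:", width, "x", height, "pixels")
    print(" ".join(draw_command("in.pdf", "out.pdf", 300)))
```

Higher resolutions give OCR more pixels per character at the cost of larger output files.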
janzo interesting side effect. one of the fields fonts got bigger23:55.09 
  or smaller23:56.03 
  but i like this solution23:56.33 