| 20171218 |
sebras | JayVii: with mutool you can probably do it if you know the structure of PDF | 00:35.11 |
| JayVii: better annotation support is currently in development, but not yet finished. | 00:35.42 |
| JayVii: I assume you are talking about text annotations that popup if an icon is clicked? | 00:36.20 |
| JayVii: if so the icon should be drawn currently, but it cannot be clicked to show the text yet. | 00:37.01 |
| JayVii: if you run mutool show doc.pdf p it will list what PDF objects belong to each page. if you then do something like mutool show doc.pdf 42 where 42 is the PDF object belonging to the page you want you ought to see the PDF object contents for that page. | 00:39.56 |
| JayVii: in that object there should be a /Annots if the page has annotations. | 00:40.12 |
| JayVii: this is usually an array of object numbers, so if you then do mutool show doc.pdf 123 where 123 is each of these object numbers you will see how each such annotation is created. | 00:41.00 |
| JayVii: a text annotation typically has a /Subtype /Text in there somewhere and likely the text is in /Contents. | 00:41.51 |
| JayVii: this is the easiest case, there might be details that make it harder, like all the encoding of the Contents, or that the contents is in another indirect PDF object. it all depends on how the PDF was created and by what software. | 00:43.40 |
| JayVii: once text annotation support is fully implemented of course the text should be visible somehow, but we're not there yet. | 00:44.22 |
| JayVii: it is quite late so I'm heading out now, good night! | 00:44.52 |
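The lookup sebras describes (page object, then /Annots, then each annotation object) can be sketched in Python. This is only an illustration: `annot_object_numbers` and `sample_page` are hypothetical names, the sample text merely resembles a page dictionary, and the exact layout printed by mutool show may differ. It also handles only the easy case of an inline array; if /Annots is itself an indirect reference, the function returns an empty list.

```python
import re

def annot_object_numbers(page_object_text):
    """Return the object numbers listed in a page object's inline /Annots array."""
    # Look for "/Annots [ ... ]" and capture what sits between the brackets.
    match = re.search(r"/Annots\s*\[(.*?)\]", page_object_text, re.DOTALL)
    if match is None:
        return []  # no inline /Annots array found in this page object
    # Indirect references look like "123 0 R": object number, generation, R.
    refs = re.findall(r"(\d+)\s+(\d+)\s+R", match.group(1))
    return [int(number) for number, _generation in refs]

# Hypothetical page object, loosely in the shape a PDF page dictionary takes.
sample_page = """<<
  /Type /Page
  /Annots [ 123 0 R 124 0 R ]
  /Contents 45 0 R
>>"""

for number in annot_object_numbers(sample_page):
    # Each of these objects could then be inspected with mutool show.
    print("mutool show doc.pdf", number)
```

Each printed command corresponds to the "mutool show doc.pdf 123" step, after which one would look for /Subtype /Text and /Contents by eye.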
| kens: I'll take a look at 698819 and 698820. | 11:00.55 |
kens | Thanks sebras, feel free to change the assignments, I never know who to put them to for MuPDF | 11:01.14 |
sebras | kens: I think it is safe to put tor8. | 11:01.32 |
kens | <sigh> there's a 21 as well now | 11:01.34 |
sebras | kens: I'll take them when I can. | 11:01.51 |
| kens: ok. that will follow later then. | 11:01.58 |
kens | Thanks, I wouldn't be surprised to find they are duplicates and already fixed by your previous patches. | 11:02.11 |
sebras | kens: right. | 11:02.23 |
| I would expect so too. | 11:02.33 |
kens | I know they can never reasonably be completely up to date, but it would be nice if they *tried* to use current code..... | 11:03.33 |
sebras | kens: they have been using 1.12.0 which we released recently. | 11:08.21 |
kens | Yes, but it's not the current code | 11:08.30 |
sebras | kens: the xref-related fixes did not go in before the tag though. | 11:08.33 |
kens | and hence doesn't (I assume) have your xref fixes | 11:08.43 |
sebras | exactly. | 11:08.49 |
kens | Like I said, I understand they can't be 100% up to date, but they *could* test the latest SHA1 before filing the bug report | 11:09.29 |
sebras | kens: so how do I mark these bugs? RESOLVED INVALID? | 11:09.43 |
| kens: especially for security bugs. they are devs, so they should know how to find it/compile it/etc. | 11:10.16 |
kens | IMO if they are already fixed, then 'WORKSFORME', though you might mention that these have been recently fixed and if they tested the current code they would find that out before opening bug reports | 11:10.19 |
sebras | kens: ok, so they all triggered the same issue. | 11:21.18 |
kens | all fixed ? | 11:21.27 |
sebras | kens: yes, they were all fixed by the fix to 698804. | 11:21.43 |
| kens: so it was just a matter of testing a bit to confirm this fact. | 11:21.55 |
kens | Yeah, rather thought they might be, they all looked very similar | 11:22.05 |
sebras | kens: however I was reminded of the bugzilla work to fix the bugs. :) | 11:22.46 |
| kens: ehm.. not fix them, mark them as fixed!. | 11:22.58 |
kens | right, which is a good thing I guess | 11:23.07 |
sebras | kens: it keeps bugzilla tidy, and allows us to see what _real_ issues remain, so yes, I think it is good. :) | 11:25.25 |
kens | Yep, I'm all in favour :-D | 11:25.37 |
sebras | kens: I think the guy was just retriggered since we released a new version that did not fix the security issues. | 11:26.04 |
kens | Umm, possibly..... | 11:26.18 |
| But again, that's always going to be possible, if you report a bug late enough in the release cycle | 11:26.39 |
sebras | kens: indeed. | 11:26.56 |
JayVii | sebras thanks for the clarification. I'll see if i get along with doing it via mutool for now | 22:50.01 |
janzo | is it possible to convert a pdf to a flat pdf (images) | 22:55.11 |
| using mutool | 22:55.22 |
| maybe something like mutool draw input.pdf -format png piped into a pdf | 22:57.09 |
sebras | janzo: so using a pdf as input, you want to render each page to a PNG file and then you want to create a new pdf using these PNG files? | 22:58.17 |
| where each page is one of the PNG files..? | 22:58.32 |
janzo | yea | 23:12.07 |
| currently i am using imagemagick convert to do this | 23:13.23 |
| convert input.pdf output.pdf | 23:13.35 |
| second question. i see you can output pdf to text, mutool.exe convert -o test.txt test.pdf | 23:15.58 |
| but i dont see any options in the documentation | 23:16.07 |
| i've been using pdftotext (poppler tools) and it has a few options like layout, etc. | 23:16.54 |
| does this have anything similar? | 23:17.32 |
sebras | janzo: text output ought to be documented on the man page. I see it. :) | 23:21.55 |
janzo | im looking here https://artifex.com/developers-mupdf-documentation/command-line-tools/#show | 23:24.36 |
sebras | janzo: ah, then you should be looking at either https://artifex.com/developers-mupdf-documentation/command-line-tools/#draw or https://mupdf.com/docs/manual-mutool-draw.html | 23:25.12 |
| janzo: or if you are on linux: "man mutool" | 23:25.40 |
| janzo: to output each page as a PNG: mutool convert -o page%04d.png -F png doc.pdf | 23:28.47 |
janzo | ok thanks | 23:29.03 |
sebras | janzo: to create a pdf from a single PNG you can create a file page1.txt with these contents https://pastebin.com/raw/kf5uXkZ6 | 23:29.59 |
| next you run mutool create -o new.pdf page1.txt | 23:30.09 |
| you can list extra page2.txt page3.txt at the end if you want to add more pages. | 23:30.38 |
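The contents of the pastebin are not reproduced in this log. As a rough sketch of what such a page description might look like, the Python below generates one, assuming the `%%MediaBox` and `%%Image` special comments from the mutool create manual followed by a content stream that paints the image. `page_description`, the image name and the page size are hypothetical, and the linked file may well differ.

```python
def page_description(image_path, width_pt, height_pt):
    """Build the text of a one-image page description for `mutool create`."""
    lines = [
        # Page size in PDF points (1/72 inch).
        f"%%MediaBox 0 0 {width_pt} {height_pt}",
        # Register the image file under the resource name Im0.
        f"%%Image Im0 {image_path}",
        # Content stream: scale the unit square to the page size, paint Im0.
        f"q {width_pt} 0 0 {height_pt} 0 0 cm /Im0 Do Q",
    ]
    return "\n".join(lines) + "\n"

# Assumed values: a US Letter page filled by the first rendered PNG.
print(page_description("page0001.png", 612, 792), end="")
```

Writing that text to page1.txt and running mutool create -o new.pdf page1.txt would then follow the steps given above.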
janzo | i see. no real easy command lines for this | 23:30.59 |
sebras | janzo: flattening PDFs into images causes the fonts and vector graphics to be converted into pixels. | 23:31.41 |
| janzo: and that just looks bad if you zoom in, so it is not something I (or we) generally do. :) | 23:32.03 |
janzo | ok here's the problem im trying to solve. we are trying to read a pdf's text, but not all pdfs have legible text, some are encrypted and look like bad characters | 23:36.33 |
| the solution we are using now is making them into images and then doing ocr | 23:37.03 |
sebras | encrypted? | 23:44.13 |
| janzo: PDF files can be encrypted, but if they are and you do not know the password then you would not be able to view it at all. | 23:44.39 |
| janzo: as I understand you, you _do_ see text, just not the text you expect..? | 23:44.58 |
| janzo: does it only happen in mupdf? does it look the same in acrobat or evince or some other PDF viewer? | 23:45.39 |
| janzo: if it looks the same in mupdf as in other viewers then there is probably something wrong with the file. | 23:46.13 |
| janzo: if it looks worse in mupdf I would appreciate it if you could make the file available, e.g. by reporting a bug over at bugs.ghostscript.com | 23:46.47 |
Robin_Watts | sebras: I suspect he means that when he cuts and pastes from the PDF he doesn't get the chars he expects. | 23:46.47 |
sebras | Robin_Watts: ah! yes, perhaps it is copy/paste that doesn't work. | 23:47.04 |
janzo | yes copy / paste | 23:47.25 |
| the pdf is readable | 23:47.34 |
Robin_Watts | janzo: Ok, so the problem is that text in PDF is sent as a series of glyph ids, NOT a series of unicode values. | 23:47.57 |
janzo | the text is obfuscated or something, i dont know the right terminology | 23:47.58 |
Robin_Watts | Fonts are often subsetted, and glyphs sent out of order, so you get glyphs 1,2,3,4,5 rather than chars 65, 73, 85, 67 etc. | 23:48.42 |
| You *can* have a mapping from glyph to unicode embedded in the PDF but frequently people don't bother. | 23:49.08 |
| hence yes, you're right, you can get PDFs where they look visually correct, but the info is not there. | 23:49.29 |
| And there is nothing we can do about it. | 23:49.40 |
| OCR is a reasonable solution. | 23:49.46 |
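A toy Python model of the point Robin_Watts makes here, not MuPDF code: the subset font, its glyph numbering and the `extract` helper are all hypothetical. It shows why text can render correctly yet copy out as garbage when no glyph-to-Unicode mapping is embedded.

```python
# Hypothetical subset font: glyph ids are handed out in order of first use,
# so they have nothing to do with character codes.
text = "HELLO"
glyph_for_char = {}
for ch in text:
    glyph_for_char.setdefault(ch, len(glyph_for_char) + 1)

# This is all the page content stores: a series of glyph ids.
glyph_ids = [glyph_for_char[ch] for ch in text]

# The optional mapping back to Unicode that a PDF may or may not embed.
to_unicode = {gid: ch for ch, gid in glyph_for_char.items()}

def extract(ids, mapping=None):
    """Recover text from glyph ids, with or without a glyph-to-Unicode map."""
    if mapping is None:
        # Without the map, an extractor can only guess, e.g. by treating
        # the ids as character codes, which yields control characters here.
        return "".join(chr(gid) for gid in ids)
    return "".join(mapping[gid] for gid in ids)

print(glyph_ids)
print(repr(extract(glyph_ids, to_unicode)))
print(repr(extract(glyph_ids)))
```

The renderer only needs the glyph outlines, so the page looks right in both cases; only extraction and copy/paste depend on the mapping.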
| Using the latest mupdf: | 23:50.17 |
| mutool draw -o out.pclm in.pdf | 23:50.42 |
| Or perhaps: | 23:50.51 |
| mutool draw -o out.pdf -F pclm in.pdf | 23:51.05 |
janzo | whats pclm | 23:51.09 |
Robin_Watts | It's a modified form of PDF used for some printers. | 23:51.45 |
| Basically it's image strips wrapped up in a PDF. | 23:51.55 |
| so I think what you're asking for. | 23:52.07 |
sebras | Robin_Watts: ah, smart. why didn't I think of that? | 23:52.11 |
| Robin_Watts: so the intermediate step is not PNG, but then that doesn't matter. | 23:52.30 |
janzo | ok i think the resolution needs to be higher | 23:52.42 |
| it's looking blurry | 23:52.47 |
Robin_Watts | mutool draw -r 200 -o out.pdf -F pclm in.pdf | 23:52.53 |
janzo | i use 288 in gs | 23:53.02 |
Robin_Watts | 200 is fax quality. | 23:53.04 |
janzo | ocr doesnt like blurry text :) | 23:53.24 |
Robin_Watts | 600 dpi is high quality laser. I'd imagine anything over 300 would be fine. | 23:53.26 |
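To get a feel for the resolutions mentioned, here is a small Python sketch of the arithmetic, assuming the usual PDF unit of 72 points per inch and a US Letter page of 612 x 792 points (the page size is an assumption; janzo's documents may differ). `pixel_size` and `draw_command` are hypothetical helpers; the command they build is the one given above.

```python
def pixel_size(width_pt, height_pt, dpi):
    """Pixel dimensions of a page rendered at the given resolution."""
    # One PDF point is 1/72 inch, so pixels = points * dpi / 72.
    return round(width_pt * dpi / 72), round(height_pt * dpi / 72)

def draw_command(dpi, src="in.pdf", dst="out.pdf"):
    """Argument list for the mutool draw invocation quoted above."""
    return ["mutool", "draw", "-r", str(dpi), "-o", dst, "-F", "pclm", src]

# The three resolutions that come up in the conversation.
for dpi in (200, 288, 300):
    print(dpi, pixel_size(612, 792, dpi), " ".join(draw_command(dpi)))
```

The argument list could be passed to subprocess.run on a machine with mutool installed; it is only printed here.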
janzo | interesting side effect. one of the fields fonts got bigger | 23:55.09 |
| or smaller | 23:56.03 |
| but i like this solution | 23:56.33 |