MuPDF IRC logs

20171218
sebras	JayVii: with mutool you can probably do it if you know the structure of PDF	00:35.11
	JayVii: better annotation support is currently in development, but not yet finished.	00:35.42
	JayVii: I assume you are talking about text annotations that popup if an icon is clicked?	00:36.20
	JayVii: if so the icon should be drawn currently, but it cannot be clicked to show the text yet.	00:37.01
	JayVii: if you run mutool show doc.pdf p it will list what PDF objects belong to each page. if you then do something like mutool show doc.pdf 42 where 42 is the PDF object belonging to the page you want you ought to see the PDF object contents for that page.	00:39.56
	JayVii: in that object there should be a /Annots if the page has annotations.	00:40.12
	JayVii: this is usually an array of object numbers, so if you then do mutool show doc.pdf 123 where 123 is each of these object numbers you will see how each such annotation is created.	00:41.00
	JayVii: a text annotation has typically a /Subtype /Text in there somewhere and likely the text is in /Contents.	00:41.51
	JayVii: this is the easiest case, there might be details that make it harder, like all the encoding of the Contents, or that the contents is in another indirect PDF object. it all depends on how the PDF was created and by what software.	00:43.40
	JayVii: once text annotation support is fully implemented of course the text should be visible somehow, but we're not there yet.	00:44.22
	JayVii: it is quite late so I'm heading out now, good night!	00:44.52
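The lookup described above (page object, then /Annots, then each referenced object) can be sketched in Python. This is a minimal sketch, not part of mutool: the function name and the sample page text are hypothetical, and it assumes the page object has already been dumped to text (for example with mutool show doc.pdf 42) and that /Annots is a direct array of references rather than itself an indirect object.

```python
import re


def annot_object_numbers(page_object_text):
    """Return the object numbers referenced from a direct /Annots array."""
    match = re.search(r"/Annots\s*\[(.*?)\]", page_object_text, re.DOTALL)
    if match is None:
        # No /Annots entry, or /Annots is an indirect reference (not handled here).
        return []
    refs = re.findall(r"(\d+)\s+(\d+)\s+R", match.group(1))
    return [int(obj_num) for obj_num, _generation in refs]


# Hypothetical page object text, for illustration only.
SAMPLE_PAGE = """42 0 obj
<<
  /Type /Page
  /Annots [ 123 0 R 124 0 R ]
  /Contents 43 0 R
>>
endobj"""

print(annot_object_numbers(SAMPLE_PAGE))
```

Each number returned is a candidate for a further mutool show doc.pdf N call, where a text annotation would typically show /Subtype /Text and a /Contents entry.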
	kens: I'll take a look at 698819 and 698820.	11:00.55
kens	Thanks sebras, feel free to change the assignments, I never know who to put them to for MuPDF	11:01.14
sebras	kens: I think it is safe to put tor8.	11:01.32
kens	<sigh> there's a 21 as well now	11:01.34
sebras	kens: I'll take them when I can.	11:01.51
	kens: ok. that will follow later then.	11:01.58
kens	Thanks, I wouldn't be surprised to find they are duplicates and already fixed by your previous patches.	11:02.11
sebras	kens: right.	11:02.23
	I would expect so too.	11:02.33
kens	I know they can never reasonably be completely up to date, but it would be nice if they tried to use current code.....	11:03.33
sebras	kens: they have been using 1.12.0 which we released recently.	11:08.21
kens	Yes, but its not the current code	11:08.30
sebras	kens: the xref-related fixes did not go in before the tag though.	11:08.33
kens	and hence doesn't (I assume) have your xref fixes	11:08.43
sebras	exactly.	11:08.49
kens	Like I said, I understand they can't be 100% up to date, but they could test the latest SHA1 before filing the bug report	11:09.29
sebras	kens: so how do I mark these bugs? RESOLVED INVALID?	11:09.43
	kens: especially for security bugs. they are devs, so they should know how to find it/compile it/etc.	11:10.16
kens	IMO If they are already fixed, then 'WORKSFORME' though you might mention that these have been recently fixed and if they tested the current code they would find that out before opening bug reports	11:10.19
sebras	kens: ok, so they all triggered the same issue.	11:21.18
kens	all fixed ?	11:21.27
sebras	kens: yes, they were all fixed by the fix to 698804.	11:21.43
	kens: so it was just a matter of testing a bit to confirm this fact.	11:21.55
kens	Yeah, rather thought they might be, they all looked very similar	11:22.05
sebras	kens: however I was reminded to do the bugzilla work to fix the bugs. :)	11:22.46
	kens: ehm.. not fix them, mark them as fixed!.	11:22.58
kens	right, which is a good thing I guess	11:23.07
sebras	kens: it keeps bugzilla tidy, and allows us to see what _real_ issues remain, so yes, I think it is good. :)	11:25.25
kens	Yep, I'm all in favour :-D	11:25.37
sebras	kens: I think the guy was just retriggered since we released a new version that did not fix the security issues.	11:26.04
kens	Umm, possibly.....	11:26.18
	But again, that's always going to be possible, if you report a bug late enough in the release cycle	11:26.39
sebras	kens: indeed.	11:26.56
JayVii	sebras thanks for the clarification. I'll see if i get along with doing it via mutool for now	22:50.01
janzo	is it possible to convert a pdf to a flat pdf (images)	22:55.11
	using mutool	22:55.22
	maybe something like mutool draw input.pdf -format png piped into a pdf	22:57.09
sebras	janzo: so using a pdf as input, you want to render each page to a PNG file and then you want to create a new pdf using these PNG files?	22:58.17
	where each page is one of the PNG files..?	22:58.32
janzo	yea	23:12.07
	currently i am using imagemagick convert to do this	23:13.23
	convert input.pdf output.pdf	23:13.35
	second question. i see you can output pdf to text, mutool.exe convert -o test.txt test.pdf	23:15.58
	but i dont see any options in the documentation	23:16.07
	i've been using pdftotext (poppler tools) and it has a few options like layout, etc.	23:16.54
	does this have anything similar?	23:17.32
sebras	janzo: text output ought to be documented on the man page. I see it. :)	23:21.55
janzo	im looking here https://artifex.com/developers-mupdf-documentation/command-line-tools/#show	23:24.36
sebras	janzo: ah, then you should be looking at either https://artifex.com/developers-mupdf-documentation/command-line-tools/#draw or https://mupdf.com/docs/manual-mutool-draw.html	23:25.12
	janzo: or if you are on linux: "man mutool"	23:25.40
	janzo: to output each page as a PNG: mutool convert -o page%04d.png -F png doc.pdf	23:28.47
janzo	ok thanks	23:29.03
sebras	janzo: to create a pdf from a single PNG you can create a file page1.txt with these contents https://pastebin.com/raw/kf5uXkZ6	23:29.59
	next you run mutool create -o new.pdf page1.txt	23:30.09
	you can list extra page2.txt page3.txt at the end if you want to add more pages.	23:30.38
janzo	i see. no real easy command lines for this	23:30.59
sebras	janzo: flattening PDFs into images causes the fonts and vector graphics to be converted into pixels.	23:31.41
	janzo: and that just looks bad if you zoom in, so it is not something I (or we) generally do. :)	23:32.03
janzo	ok here's the problem im trying to solve. we are trying to read a pdf's text, but not all pdfs have legible text, some are encrypted and look like bad characters	23:36.33
	the solution we are using now is making them into images and then doing ocr	23:37.03
sebras	encrypted?	23:44.13
	janzo: PDF files can be encrypted, but if they are and you do not know the password then you would not be able to view it at all.	23:44.39
	janzo: as I understand you, you _do_ see text, just not the text you expect..?	23:44.58
	janzo: does it only happen in mupdf? does it look the same in acrobat or evince or some other PDF viewer?	23:45.39
	janzo: if it looks the same in mupdf as in other viewers then there is probably something wrong with the file.	23:46.13
	janzo: if it looks worse in mupdf I would appreciate if you could make the file available. e.g. by reporting a bug over at bugs.ghostscript.com	23:46.47
Robin_Watts	sebras: I suspect he means that when he cuts and pastes from the PDF he doesn't get the chars he expects.	23:46.47
sebras	Robin_Watts: ah! yes, perhaps it is copy/paste that doesn't work.	23:47.04
janzo	yes copy / paste	23:47.25
	the pdf is readable	23:47.34
Robin_Watts	janzo: Ok, so the problem is that text in PDF is sent as a series of glyph ids, NOT a series of unicode values.	23:47.57
janzo	the text is obfuscated or something, i dont know the right terminology	23:47.58
Robin_Watts	Fonts are often subsetted, and glyphs sent out of order, so you get glyphs 1,2,3,4,5 rather than chars 65, 73, 85, 67 etc.	23:48.42
	You can have a mapping from glyph to unicode embedded in the PDF but frequently people don't bother.	23:49.08
	hence yes, you're right, you can get PDFs where they look visually correct, but the info is not there.	23:49.29
	And there is nothing we can do about it.	23:49.40
	OCR is a reasonable solution.	23:49.46
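The glyph id versus Unicode point can be illustrated with a toy model in Python. This is only a sketch under stated assumptions: the subsetting scheme (glyph ids handed out in order of first use) and both function names are hypothetical, and real PDFs carry the mapping in a ToUnicode CMap rather than a Python dict.

```python
def subset_encode(text):
    """Toy subset font: assign glyph ids 1, 2, 3... in order of first use."""
    glyph_of = {}
    glyph_ids = []
    for ch in text:
        if ch not in glyph_of:
            glyph_of[ch] = len(glyph_of) + 1
        glyph_ids.append(glyph_of[ch])
    # The optional glyph -> Unicode mapping a producer may or may not embed.
    to_unicode = {gid: ch for ch, gid in glyph_of.items()}
    return glyph_ids, to_unicode


def extract_text(glyph_ids, to_unicode=None):
    """Recover text from glyph ids, with or without the mapping."""
    if to_unicode is None:
        # Without a mapping, all an extractor has is the raw glyph ids.
        return "".join(chr(gid) for gid in glyph_ids)
    return "".join(to_unicode[gid] for gid in glyph_ids)


glyph_ids, to_unicode = subset_encode("HELLO")
print(glyph_ids)
print(repr(extract_text(glyph_ids, to_unicode)))
print(repr(extract_text(glyph_ids)))
```

With the mapping the original string comes back; without it the result is control characters, even though drawing the same glyph ids with the embedded font would still look correct on screen.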
	Using the latest mupdf:	23:50.17
	mutool draw -o out.pclm in.pdf	23:50.42
	Or perhaps:	23:50.51
	mutool draw -o out.pdf -F pclm in.pdf	23:51.05
janzo	whats pclm	23:51.09
Robin_Watts	It's a modified form of PDF used for some printers.	23:51.45
	Basically it's image strips wrapped up in a PDF.	23:51.55
	so I think it's what you're asking for.	23:52.07
sebras	Robin_Watts: ah, smart. why didn't I think of that?	23:52.11
	Robin_Watts: so the intermediate step is not PNG, but then that doesn't matter.	23:52.30
janzo	ok i think the resolution needs to be higher	23:52.42
	it's looking blurry	23:52.47
Robin_Watts	mutool draw -r 200 -o out.pdf -F pclm in.pdf	23:52.53
janzo	i use 288 in gs	23:53.02
Robin_Watts	200 is fax quality.	23:53.04
janzo	ocr doesnt like blurry text :)	23:53.24
Robin_Watts	600 dpi is high quality laser. I'd imagine anything over 300 would be fine.	23:53.26
janzo	interesting side effect. one of the fields fonts got bigger	23:55.09
	or smaller	23:56.03
	but i like this solution	23:56.33

Log of #mupdf at irc.freenode.net.