| | 20170206 |
tor8 | sebras: for the logs. in PDF the 0 byte is whitespace. see table 3.1 white-space characters. | 11:17.39 |
| I suspect Robin_Watts just picked the first best iswhite() function for use in mudraw :) | 11:18.13 |
| the use of octal characters bothers me though... I'd just use decimal if I were to have written it today. | 11:19.26 |
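The white-space rule tor8 mentions can be sketched as follows. This is a minimal illustration, not the mudraw iswhite() code: the name `is_pdf_whitespace` is made up here, and the six byte values are the ones listed in the PDF reference's white-space table, written in decimal as tor8 suggests.

```python
# White-space characters per the PDF reference's white-space table,
# written in decimal: NUL, HT, LF, FF, CR, SP.
PDF_WHITESPACE = frozenset({0, 9, 10, 12, 13, 32})


def is_pdf_whitespace(byte: int) -> bool:
    """Return True if the byte value counts as white space in PDF syntax."""
    return byte in PDF_WHITESPACE


if __name__ == "__main__":
    # Unlike C's isspace(), NUL (0) is white space here and VT (11) is not.
    print([b for b in range(33) if is_pdf_whitespace(b)])
```

The set lookup keeps the table in one place, so the difference from a generic isspace()-style test (NUL in, vertical tab out) is easy to see.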
Robin_Watts | tor8: i think I have a couple of commits pending. | 11:30.03 |
tor8 | Robin_Watts: fix win32 and -i to invert LGTM, though maybe use capital -I for invert? | 11:31.09 |
| to match the -I invert flag of mudraw | 11:31.39 |
Robin_Watts | Good point. Will fix. | 11:31.47 |
| fixed commit, plus 1 more tiny one. | 11:36.04 |
tor8 | Robin_Watts: might want to fix '-i' in the commit message too? | 11:37.31 |
Robin_Watts | D'Oh. | 11:37.41 |
| Done. | 11:38.25 |
tor8 | Robin_Watts: those 3 commits LGTM. | 11:38.57 |
Robin_Watts | Ta | 11:39.12 |
tor8 | I tried an approach to improve staying on the same "page" when doing text re-layout | 11:40.01 |
| I'm not entirely happy with it though | 11:40.07 |
| I take a temporary 'bookmark' to a location in the text (the first bit of text on a page), layout the document with the new font size, and find the page that contains the same marked location again | 11:41.06 |
| but it's obviously not reversible. changing font size up and then back down to the same, you end up at a different page | 11:41.32 |
Robin_Watts | How about... take a temporary bookmark to a location in the text. | 11:42.05 |
tor8 | maybe I'm just trying too hard, and there's a simpler approach by just using current_page / page_count? | 11:42.07 |
Robin_Watts | Keep that temporary bookmark until you change page. | 11:42.17 |
| That way if you zoom up and down, the bookmark is still in the same place. | 11:42.39 |
tor8 | taking the bookmarks is expensive, but keeping it around unless a page change occurs might help that use case yeah | 11:43.11 |
| my other gripe is how they are temporary and that's going to lead users into confusion trying to save the bookmarks in a preference file or something | 11:43.38 |
| at the moment I just use a raw pointer to the fz_html_flow node | 11:44.09 |
Robin_Watts | I'm not sure what else you could use, unless you start using counts of html flow entries or something - and that'll be screwed when we change the structure at all. | 11:45.05 |
tor8 | yeah. still not thrilled about the API implications though. | 11:45.35 |
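The reversibility problem and Robin_Watts's suggestion can be illustrated with a toy model. Everything here is an assumption made for the sketch: a "page" is just a fixed number of consecutive flow items, the bookmark is an item index rather than a pointer to an fz_html_flow node, and the `Reader` class is hypothetical, not MuPDF API.

```python
class Reader:
    """Toy reflowable document: a page holds `per_page` consecutive items."""

    def __init__(self, per_page: int, sticky: bool):
        self.per_page = per_page
        self.page = 0
        self.sticky = sticky      # keep the bookmark until the page changes?
        self.bookmark = None      # index of the marked item, or None

    def goto(self, page: int) -> None:
        self.page = page
        self.bookmark = None      # an explicit page change drops the bookmark

    def relayout(self, per_page: int) -> None:
        if self.bookmark is None or not self.sticky:
            # Mark the first item on the current page.
            self.bookmark = self.page * self.per_page
        self.per_page = per_page
        # Land on the page that now contains the marked item.
        self.page = self.bookmark // per_page


def round_trip(sticky: bool) -> int:
    """Go to page 7, change the layout, change it back; return the page."""
    reader = Reader(per_page=10, sticky=sticky)
    reader.goto(7)
    reader.relayout(15)
    reader.relayout(10)
    return reader.page


if __name__ == "__main__":
    print(round_trip(sticky=False), round_trip(sticky=True))
```

In this model, re-taking the bookmark on every re-layout drifts from page 7 to page 6 after a round trip, while keeping the bookmark until an explicit page change returns to page 7. It says nothing about the cost of taking a bookmark or about how such a bookmark could be persisted.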
acharles | Hi. | 19:57.02 |
ray_laptop | acharles: hi back | 21:07.06 |
| (1 hr later) | 21:07.20 |
acharles | haha | 21:07.36 |
| I had a few questions about ghostscript and the way it handles pdf files. | 21:07.57 |
ray_laptop | acharles: go ahead | 21:08.10 |
| I can most likely answer those | 21:09.02 |
| the general answer is "very well" :-) | 21:09.38 |
acharles | Is the PostScript interpreter used to process pdf files as well or does pdf have a different code path? | 21:10.09 |
ray_laptop | the PS interpreter processes the PDF input, invoking PS operators to actually do stuff (images, text, other graphics) | 21:13.24 |
| the PostScript that does that is in Resource/Init/pdf*.ps | 21:13.47 |
| Note that the "scanner" that processes PDF input is in C -- it's not like PostScript is trying to read the PDF directly (except at the very start to find out if the input is PDF or not) | 21:15.30 |
acharles | Ah, so you take advantage of the fact that pdf is a subset of postscript to use postscript functions to process the pdf. | 21:15.51 |
ray_laptop | acharles: well, PDF is actually a disjoint set (not a subset), but yes, the syntax is similar enough that the scanner has only a few special "tweaks" for PDF | 21:17.23 |
| but things like << ... ... >> defining a dictionary and strings being enclosed in (...) | 21:18.17 |
acharles | Ah, I didn't know it was disjoint. | 21:18.41 |
ray_laptop | well, at the operator level, PDF has transparency and the concept of "streams" but doesn't have some of the noisome PS operators like file manipulation | 21:20.01 |
acharles | Ah, I guess that makes sense. | 21:21.02 |
ray_laptop | but our scanner has "PDF_SCAN_RULES" for a couple of exceptions | 21:23.47 |
kens | lurking | 21:23.49 |
ray_laptop | hi kens. ISTR there is something about names in PDF as well, right ? | 21:24.24 |
kens | spaces in names | 21:24.37 |
| and other non-printable characters | 21:24.44 |
| or even printable | 21:24.52 |
| The original point was that the graphics model of PDF matched that of PostScript, so a one-to-one mapping was trivial; it therefore made sense to write a PDF processor in PostScript. | 21:25.36 |
| Since then, well things have changed.... | 21:25.45 |
| And many PDF files break the specification, but Acrobat opens them so we have to too. Which makes our handling much, much more complicated than it should be. | 21:26.42 |
ray_laptop | ah, it is that ANY character except NUL can be in a name using hex | 21:26.43 |
kens | I think a NULL can be in a name too, you just escape it with # | 21:27.03 |
ray_laptop | kens: PDF 1.7 spec section 3.2 (p 57) excludes NUL | 21:27.52 |
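The name rule being discussed (any character except NUL can appear in a name, written as # followed by two hex digits) can be sketched like this. It is an illustration only, not Ghostscript's scanner; the function name `decode_pdf_name` is hypothetical, and it assumes the token has already been split out of the input.

```python
import re

# A '#' followed by exactly two hexadecimal digits.
_HEX_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")


def decode_pdf_name(token: bytes) -> bytes:
    """Decode a name token such as b'/Lime#20Green' to its byte value."""
    if not token.startswith(b"/"):
        raise ValueError("a PDF name token starts with '/'")
    # Replace each #xx escape with the single byte it stands for.
    decoded = _HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), token[1:])
    if b"\x00" in decoded:
        # Per the spec section ray_laptop cites, NUL is excluded from names.
        raise ValueError("NUL is not allowed in a PDF name")
    return decoded


if __name__ == "__main__":
    print(decode_pdf_name(b"/Lime#20Green"))
```

Rejecting `#00` follows ray_laptop's reading of the spec; real-world files may not respect it, which is the kind of leniency kens describes.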
kens | acharles is there a reason for wanting to know all this? It's probably not useful.... | 21:27.59 |
| ray_laptop well, this is all from memory for me, I don't have the spec open in front of me | 21:28.12 |
acharles | Yes, there is. :) | 21:28.16 |
ray_laptop | kens: I cheated and opened the spec :-) | 21:28.30 |
kens | It might be easier to explain what your goal is | 21:28.35 |
acharles | My goal is "secure pdf processing", but that's vague. I'm just doing some investigative work and I figured asking here made more sense than reading the code for days on end. :) | 21:30.20 |
kens | Well, PDF is pretty secure if you avoid JavaScript | 21:30.39 |
acharles | Yes, but postscript is not | 21:30.57 |
kens | Though it's also a good idea to prevent PostScript XObjects | 21:31.01 |
ray_laptop | kens: I think GS disables PS XObjects by default | 21:31.38 |
kens | acharles, but you cannot execute random PostScript in a PDF file using Ghostscript | 21:31.45 |
| Obviously if you send PostScript that's a different matter | 21:32.03 |
ray_laptop | acharles: and GS has "SAFER" mode that is supposed to make it more secure | 21:32.04 |
| (for PS or PDF input) | 21:32.30 |
acharles | Yeah, I'm assuming -dSAFER is enabled. | 21:32.44 |
ray_laptop | and since GS doesn't use JS, there isn't a problem there | 21:32.57 |
acharles | How does GS detect pdf vs ps input? | 21:33.08 |
kens | Though (as the recent news showed) if you are running a job server it's a good idea to set the job server password to something other than 0 :-) | 21:33.26 |
ray_laptop | acharles: using PS code in Resource/Init/pdf_main.ps | 21:33.30 |
kens | acharles depends how you invoke it | 21:33.38 |
| ray_laptop you can use pdfrun directly | 21:33.46 |
ray_laptop | kens: true, then we don't even try to "detect" | 21:34.04 |
acharles | how does pdfrun work? | 21:34.21 |
ray_laptop | acharles: basically if you don't use pdfrun and just "run" an input file, it looks at the first 1024 bytes for the PDF header | 21:34.53 |
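The detection ray_laptop describes can be approximated as below. This is a sketch of the idea only: Ghostscript does the real check in PostScript (Resource/Init/pdf_main.ps), and the exact rule it applies may differ. The function name `looks_like_pdf` and the use of the `%PDF-` marker as "the PDF header" are assumptions made for the example.

```python
def looks_like_pdf(data: bytes, window: int = 1024) -> bool:
    """Return True if a PDF header marker appears in the first `window` bytes."""
    return b"%PDF-" in data[:window]


if __name__ == "__main__":
    print(looks_like_pdf(b"%PDF-1.7\n"), looks_like_pdf(b"%!PS-Adobe-3.0\n"))
```

Searching a window rather than requiring the marker at offset 0 tolerates leading junk before the header, which fits kens's remark that many files break the specification.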
kens | Its an internal Ghostscript thing, you give it a filename and it runs it as a PDF file | 21:34.54 |
| Hmm, actually that may not be entirely correct. | 21:35.32 |
| Probably best not to rely on memory at this time of night | 21:35.43 |
ray_laptop | kens: actually, I think you have to send it a PS file | 21:35.50 |
| filetype, not a string that contains the filename | 21:36.07 |
kens | ah runpdfbegin maybe | 21:36.31 |
| Oops no there it is, runpdf, which calls runpdfbegin :-) | 21:36.57 |
ray_laptop | acharles: so you need to make a PDF file type, which can be done with: (filename.pdf) (r) file runpdf | 21:37.14 |
| kens: right -- they both expect a filetype | 21:37.37 |
kens | Indeed | 21:37.42 |
| It would be trivial to define a function to take a filename, but why bother.... | 21:38.07 |
ray_laptop | kens: agreed | 21:38.19 |
acharles | Ah, that's not exposed as a command line option? | 21:38.23 |
kens | You can use -c and -f | 21:38.33 |
| to send PostScript directly | 21:38.40 |
| so -c "(filename.pdf) (r) runpdf" -f | 21:39.01 |
ray_laptop | kens: the -f doesn't really do anything other than get out of -c mode, so is rather useless if -dBATCH is given | 21:40.15 |
kens | Note that the pdf*.ps files constitute a rather large PostScript program, one of the things it will do is attempt to validate the PDF file. So if you send it a PostScript file it **won't** run it, it will just complain it's not a valid PDF file | 21:40.25 |
acharles | What does the (r) parameter mean? I mean, it pushes r on the stack. | 21:40.35 |
kens | makes it readable, like +r in C | 21:40.45 |
ray_laptop | acharles: you can also do: echo (filename.pdf (r) file runpdf | gs ... - | 21:40.47 |
kens | If you wanted a writable file you would use (w) | 21:41.11 |
acharles | Ah | 21:41.36 |
ray_laptop | acharles: for that refer to the PLRM | 21:41.38 |
acharles | Ah, file is the PS operator for opening a file | 21:42.16 |
| that makes sense. | 21:42.26 |
kens | yes exactly. It will leave a file object on the stack, which is then consumed by the runpdf executable function | 21:42.41 |
ray_laptop | darn, I forgot the ) after the filename.pdf and kens forgot the "file" operator, but acharles, I assume you get the idea | 21:42.49 |
kens | Hey, its late here :) | 21:43.01 |
acharles | I do | 21:43.45 |
kens | thinks I'm doing well to be making any sense at all.... | 21:44.16 |
acharles | I only first read the PLRM on Friday and I'm not used to stack based languages. But I think I'm learning fast. :P | 21:44.53 |
kens | If you only want to process PDF files, why not use MuPDF ? | 21:45.29 |
ray_laptop | acharles: so if you want to use "runpdf", if the input file is NOT PDF, it will confuse the pdf_main.ps code that is trying to open it as a PDF and won't expose you to accidentally executing PS | 21:45.30 |
kens | I wouldn't say it will confuse it exactly, it will reject it as an invalid and unfixable PDF file | 21:46.16 |
ray_laptop | e.g., gs -c "(examples/colorcir.ps) (r) file runpdf quit" | 21:48.27 |
| gives: Error: /syntaxerror in pdfopen | 21:48.43 |
| acharles: well, you don't have to read ALL 912 pages -- just the first 700 or so ;-) | 21:49.43 |
acharles | Does MuPDF offer pdf compression? | 21:49.44 |
kens | I still think that if you don't need PostScript (or PCL) input, MuPDF is probably more appropriate. | 21:49.45 |
ray_laptop | acharles: yes | 21:49.50 |
kens | acharles ah, you want to *modify* the PDF files ? | 21:49.59 |
ray_laptop | acharles: but pdf output from mupdf is rather limited | 21:50.39 |
kens | was assuming rendering the PDF files was the goal | 21:50.39 |
acharles | read an input file and create an output file that contains the same pdf, but linearized and compressed (perhaps with lower quality) | 21:51.03 |
kens | Currently Ghostscript has more options for doing that. | 21:51.21 |
ray_laptop | acharles: I don't know which (if any) mupdf can do, | 21:51.41 |
acharles | And the runpdf command gives me an error about invalid file access, which I assume is due to using -dSAFER | 21:51.54 |
kens | MuPDF can compress and linearize the file (though linearization is pointless) but I don't think it can currently do things like downsample images or subset fonts | 21:52.02 |
| acharles yes, it will be. | 21:52.15 |
| You only really need to worry about -dSAFER if you are using PostScript, PDF has no file operators | 21:52.56 |
| Umm actually that's not totally true. | 21:53.09 |
| It can link to other files. | 21:53.17 |
| other PDF files I should say | 21:53.24 |
| Anyway, I have to be off. Got to go and feed the cat | 21:54.16 |
| Goodnight all | 21:54.21 |
acharles | Night | 21:54.30 |
| Thanks | 21:54.33 |
ray_laptop | acharles: I am in PST, so I'll be around for a while yet | 21:55.08 |
acharles | I'm also PST | 21:55.59 |
ray_laptop | acharles: -dSAFER will limit the files you can read and write to. | 21:56.01 |
acharles | Can I use -dSAFER and read from the pdf input file? | 21:56.48 |
ray_laptop | if you use -dDELAYSAFER and open the input file, such as with (filename.pdf) (r) file then you can use .setsafe to go into SAFER mode before running the file (with "run" or "runpdf") | 21:57.31 |
| the filenames named on the command line as arguments are automatically allowed in SAFER mode | 21:58.14 |
acharles | And we use -dColorImageResolution and -dSubsetFonts. | 21:58.14 |
| So, I guess MuPDF isn't an option | 21:58.25 |
ray_laptop | acharles: yes, those are on GS options (pdfwrite options) | 21:58.35 |
| acharles: the use of .setsafe is in doc/Language.htm that also discusses PermitFileReading, PermitFileWriting, etc. | 21:59.38 |
acharles | Should SAFER prevent using the status command on files when processing PostScript files? (unrelated to pdf processing) | 22:00.40 |
| so, I'm running `gs -dSAFER -dDELAYSAFER -c "(file.pdf) (r) file .setsafe runpdf" -f` | 22:04.28 |
ray_laptop | acharles: that is a question | 22:04.29 |
acharles | It seems to work. | 22:04.32 |
| And it gives me an error if I give it a PS file. | 22:04.52 |
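The command acharles reports can be assembled programmatically, which sidesteps shell quoting of the PostScript fragment. A minimal sketch, assuming only the flags and operators quoted in this conversation; the helper names `ps_string` and `runpdf_argv` are hypothetical, and the sketch makes no claim about whether SAFER is actually effective here (ray_laptop has doubts).

```python
from typing import List


def ps_string(text: str) -> str:
    """Wrap text as a PostScript (...) string, escaping \\, ( and )."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return "(" + escaped + ")"


def runpdf_argv(pdf_path: str) -> List[str]:
    """Build the argument list for the invocation quoted in the log."""
    # Open the file first, then enter SAFER mode, then hand the file to runpdf.
    program = ps_string(pdf_path) + " (r) file .setsafe runpdf"
    return ["gs", "-dSAFER", "-dDELAYSAFER", "-c", program, "-f"]


if __name__ == "__main__":
    print(runpdf_argv("file.pdf"))
```

The list can be passed to `subprocess.run` without `shell=True`, so the parentheses and spaces in the PostScript program reach gs as a single argument. Output device and pdfwrite options such as -dColorImageResolution are left out.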
ray_laptop | hmm... I need to look into SAFER mode. It isn't doing what I expect (at least on Windows). I wonder if it is bitrotted | 22:07.18 |
| this is *NOT* good | 22:10.39 |
can-of-bees | having a hard time googling this -- is there a way for ghostscript to return the version of pdf? e.g. can i feed gs a pdf and have it tell me if the pdf is pdf/a? | 22:12.30 |
| thanks in advance | 22:12.40 |
ray_laptop | can-of-bees: not currently. It is contained in the XML Metadata, but our toolbin/pdf_info.ps doesn't currently dump any of the Metadata | 23:08.07 |
| it is possible to write PS (or extend pdf_info.ps) to allow you to dump all or part of the Metadata | 23:08.44 |
| The Metadata object is in the Catalog object (the document Root object from the trailer) | 23:10.43 |
acharles | ray_laptop: Did you determine if SAFER is working as intended? | 23:48.01 |
ray_laptop | acharles: haven't had a chance to look into it yet | 23:49.30 |
| sorry | 23:49.34 |