| <<<Back 1 day (to 2019/11/26) | Fwd 1 day (to 2019/11/28) >>> | 20191127 |
cryptopsy | any pdfcropping tool which can do bbox ? | 13:14.00 |
| i want to trim all margin whitespace without calculating myself | 13:14.10 |
chrisl | There probably isn't a trivial tool for that, because it will need to interpret and (at least) scan convert the contents. | 13:16.42 |
cryptopsy | there used to be | 13:16.56 |
chrisl | Oh, well use that then | 13:17.05 |
cryptopsy | the scan is the easy part | 13:17.07 |
kens | Ghostscript's bbox tool can tell you the bbox of the page contents | 13:17.14 |
cryptopsy | yep | 13:17.26 |
kens | Not 'scan' but 'scan-convert' | 13:17.26 |
cryptopsy | gs \ | 13:17.45 |
| -q -dBATCH -dNOPAUSE \ | 13:17.48 |
| -sDEVICE=bbox \ | 13:17.50 |
| "$1" | 13:17.52 |
| it just needs to do a little string parsing | 13:18.08 |
kens | What ? I think you are mistaken | 13:18.21 |
cryptopsy | to calculate -g5400x7200 \ | 13:18.24 |
chrisl | No, it really, really needs to do a *lot* more than that | 13:18.26 |
cryptopsy | -c "<</PageOffset [-36 -36]>> setpagedevice" \ | 13:18.26 |
| it just needs to set that 5400, 7200, 36 | 13:18.40 |
kens | Setting PageOffset has nothing to do with cropping content | 13:18.46 |
cryptopsy | this uses the bounding box for the perfect page not the largest bounding box, which would require a little more arithmetic | 13:18.57 |
| for the first page* | 13:19.06 |
kens | Well since we have not seen your file we have no clue what is on the first page, nor its content. So your numbers make no sense | 13:19.27 |
cryptopsy | the 2nd gs command will crop for sure, it's all about getting the right numbers | 13:19.34 |
kens | Also I cannot see why you are using the resolution and declaring the media in pixels, you'd do better to stay in points | 13:19.47 |
cryptopsy | i could easily just use the 2nd page or 3rd page | 13:19.58 |
| its a book | 13:20.03 |
| true that the 1st page is often a different size than the rest | 13:20.10 |
kens | No, the second Ghostscript command does no cropping. It simply moves the marking content with respect to the underlying media | 13:20.22 |
cryptopsy | those are pts | 13:20.30 |
chrisl | If you think you can get an accurate bounding box of the marking operations without scan converting, I'll be most interested to hear how | 13:20.41 |
kens | -g5400x7200 is not points | 13:20.42 |
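To make the units point concrete: -g gives the page size in device pixels, so its size in points depends on the device resolution (points = pixels * 72 / dpi). A minimal sketch of that conversion follows. The helper name pixels_to_points is made up for illustration, and the 720 dpi figure is an assumption (it is the usual pdfwrite default; substitute whatever -r value is actually in effect).

```shell
# pixels_to_points PIXELS DPI
# Convert a device-pixel dimension (as passed to -g) into PostScript points.
pixels_to_points() {
    awk -v px="$1" -v dpi="$2" 'BEGIN { printf "%g\n", px * 72 / dpi }'
}

# -g5400x7200 at an assumed 720 dpi:
pixels_to_points 5400 720   # width in points
pixels_to_points 7200 720   # height in points
```

Under that assumption -g5400x7200 describes a 540 x 720 point page. Staying in points throughout (for example with -dDEVICEWIDTHPOINTS and -dDEVICEHEIGHTPOINTS, which come up later in this log) avoids the conversion entirely.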
cryptopsy | i can guess the bb based on some page's bb | 13:24.16 |
| this seems reasonable? | 13:25.39 |
kens | Since you have not supplied an example file, it's impossible to tell | 13:25.54 |
cryptopsy | i mean, theoretically | 13:26.35 |
| i have some files here but i'm currently writing the script | 13:26.54 |
kens | To be frank you have not defined the problem sufficiently well for me to have an informed opinion. | 13:27.00 |
| Bluntly; I haven't a clue what you are talking about now | 13:27.15 |
cryptopsy | is -sDEVICE=bbox adequate in guessing a bbox? | 13:27.37 |
chrisl | It doesn't guess | 13:27.47 |
kens | It's not guessing anything | 13:27.47 |
cryptopsy | while in and of itself it doesn't guess, used in this context it provides a guess by using the bbox of a page to represent the entirety of the pages | 13:28.22 |
kens | No, there is no guesswork | 13:28.37 |
cryptopsy | the guessing is me accepting that bbox as a possibility to represent the entirety of the pages | 13:28.58 |
kens | Well that'll be wrong then | 13:29.06 |
cryptopsy | why would it be wrong in a book? | 13:29.15 |
kens | Many PDF files have pages of different sizes and orientations | 13:29.18 |
cryptopsy | yes, i know | 13:29.24 |
| the question is whether bbox is good at returning the bbox values | 13:29.44 |
kens | You keep talking generally then jumping to 'I'm talking about a book' but won't supply an example, so it's not surprising we're talking at cross-purposes | 13:29.53 |
cryptopsy | i could pull a pdf at random what does that prove | 13:30.13 |
| would you like to test it on 10k pdf books? | 13:30.22 |
kens | It gives us a concrete example to talk about | 13:30.27 |
cryptopsy | i could provide you with 10, 100 | 13:30.41 |
kens | OK this is not a forum for discussing random problems. The answer to your question is that the bbox device will provide the precise bounding box of a given page from a file, at a given resolution. | 13:31.36 |
cryptopsy | here is a pdf chrome-extension://mhjfbmdgcfjbbpaeojofohoefgiehjai/index.html?https://www.ema.europa.eu/en/documents/report/meeting-report-paediatric-high-grade-glioma-medicines-expert-workshop_en.pdf | 13:32.09 |
| err | 13:32.16 |
kens | I've answered your question above | 13:32.23 |
cryptopsy | you wanted a pdf, here is your pdf https://www.ema.europa.eu/en/documents/report/meeting-report-paediatric-high-grade-glioma-medicines-expert-workshop_en.pdf | 13:32.33 |
kens | No, frankly I'm not discussing this any longer. | 13:32.46 |
cryptopsy | lol | 13:32.50 |
| i am taking your replies one at a time | 13:32.58 |
kens | If you have a Ghostscript question ask it. If you have random PDF questions, please take them elsewhere | 13:33.27 |
cryptopsy | i am truly baffled why you changed your tone this way | 13:33.44 |
| ¯\_(ツ)_/¯ | 13:33.54 |
| book was not the right choice of word, i should have said 'document' since a book can be a pdf with scanned images of book pages | 13:35.51 |
| so what conditions would make bbox return wrong values? | 13:36.44 |
kens | None | 13:36.55 |
cryptopsy | i have mentioned one above, a scanned image. Can pdfs come broken somehow, such that no margins are set, but rather the text is somehow confined to a space on the page to pretend there was a margin? | 13:37.34 |
| how does bbox work? | 13:37.40 |
kens | Well the source is available to you | 13:38.01 |
cryptopsy | does it compare pixels? probably not ? | 13:38.03 |
| oh, yea, the source ... | 13:38.10 |
| neat | 13:38.20 |
kens | And yes, it does render the input and (effectively) count pixels | 13:39.09 |
cryptopsy | does gs offer something like poppler's pdfinfo ? | 13:40.24 |
kens | I don't know what poppler's thing does. Ghostscript has pdf_info.ps | 13:40.47 |
cryptopsy | didnt come with my 9.50 ghostscript-gpl-9.50 | 13:41.22 |
kens | And you got that from where ? | 13:42.14 |
chrisl | http://git.ghostscript.com/?p=ghostpdl.git;a=blob;f=toolbin/pdf_info.ps | 13:42.20 |
cryptopsy | normal gentoo install | 13:42.23 |
kens | Then you probably want to complain to the Gentoo package maintainer | 13:42.36 |
cryptopsy | there was no options for compiling 'examples' or 'scripts' or anything extra | 13:42.40 |
kens | Well no | 13:42.48 |
cryptopsy | you'd think so but that would not solve things | 13:42.54 |
kens | Then your solution is to get the code from us and build it yourself. | 13:43.21 |
cryptopsy | thats the difference between bb and hiresbb ? | 13:48.02 |
| %%BoundingBox: 18 0 567 683 | 13:48.31 |
| %%HiResBoundingBox: 18.593999 0.000000 566.927983 682.865979 | 13:48.33 |
kens | I assume you mean 'what' rather than 'that', and the answer is one is more accurate | 13:48.56 |
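The two comment lines quoted above carry the same box at different precision: lower-left x, lower-left y, upper-right x, upper-right y, in points. In this sample the integer box matches the high-resolution one rounded outward to whole points. Below is a small sketch of the "little string parsing" mentioned earlier, assuming the bbox device output has already been captured as text (Ghostscript writes these lines to stderr, hence the 2>&1 in the comment). The names hires_bbox and file.pdf are illustrative only.

```shell
# hires_bbox: read bbox-device output on stdin and print
# "llx lly urx ury" from the first %%HiResBoundingBox line.
hires_bbox() {
    awk '/^%%HiResBoundingBox:/ { print $2, $3, $4, $5; exit }'
}

# Sample input copied from the log. With a real file one might pipe in:
#   gs -q -dBATCH -dNOPAUSE -sDEVICE=bbox -sPageList=1 file.pdf 2>&1
printf '%s\n' \
    '%%BoundingBox: 18 0 567 683' \
    '%%HiResBoundingBox: 18.593999 0.000000 566.927983 682.865979' |
    hires_bbox
```

Only the first matching line is used, so for a multi-page run this reports the first processed page; as noted in the discussion, other pages may have different boxes.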
cryptopsy | can gs hide Substituting font text when processing pages so that only page numbers are seen? | 13:53.37 |
kens | This is covered in the documentation | 13:54.04 |
| However, this is important information. It tells you that the fonts being used are not present. Using different fonts will potentially affect the bounding box | 13:55.13 |
cryptopsy | yep | 13:56.18 |
| my input file doesnt contain any /CropBox lines but -c "[/CropBox [36 36 60 60] /PAGES pdfmark" \ failed to crop it, even though the change was made to the file | 14:09.09 |
| i am putting together an example now | 14:09.28 |
kens | Adding a CropBox doesn't crop the file in any meaningful sense, it's merely an instruction to the consumer that, when rendering, the content should be cropped in this fashion *if* the consumer chooses to use the CropBox | 14:10.17 |
cryptopsy | i am using mupdf-gl does it not use cropbox by default? | 14:10.45 |
kens | No idea, that would be a MuPDF question | 14:10.58 |
cryptopsy | in that case i should use DEVICEWIDTHPOINTS DEVICEHEIGHTPOINTS, -c translate and -c rectclip | 14:12.31 |
| i was not able to crop this file https://clbin.com/yWOIV with this command gs -o cropped.pdf -sDEVICE=pdfwrite -dDEVICEWIDTHPOINTS=595 -dDEVICEHEIGHTPOINTS=842 -dFIXEDMEDIA -c "24 72 translate" -c " 0 0 235 422 rectclip" -f $1 | 14:16.57 |
| or maybe ... | 14:17.19 |
kens | That file really needs to go to dropbox or something | 14:17.42 |
| Copying and pasting from a browser isn't likely to work correctly | 14:17.55 |
cryptopsy | its a 170kb pdf i dont have a commandline tool for uploading files so i just used that clbin | 14:18.06 |
| file opens if renamed | 14:18.15 |
kens | Maybe on your system, I'm using Windows | 14:18.31 |
cryptopsy | save-as "file.pdf" | 14:18.44 |
| still doesn't ? | 14:18.47 |
chrisl | It's probably borked the binary | 14:19.08 |
cryptopsy | i think windows uses \r\n instead of \n | 14:19.21 |
| it crops, rectclip wasnt sufficient | 14:20.26 |
| is the documentation for rectclip online? | 14:22.03 |
kens | IIRC each page of a PDF file from the PDF interpreter will execute initgraphics, which will reset the clip | 14:22.06 |
| rectclip is part of the PostScript Language | 14:22.18 |
| And as I said above, it won't do you any good | 14:22.36 |
cryptopsy | man that's old https://ghostscript.com/pipermail/gs-text-api/2004-January/000146.html | 14:22.46 |
kens | The PDF interpreter executes 'initgraphics' before starting each page of the PDF file. That operator resets the graphics state, which blows away the clip you have set | 14:23.35 |
cryptops1 | crash | 14:26.43 |
| can i quiet the startup message but not the page processing? -q quiets all | 14:31.17 |
kens | No | 14:31.30 |
cryptops1 | i can't find the translate documentation on the doc page https://www.ghostscript.com/doc/current/Use.htm | 14:36.16 |
kens | translate is a PostScript operator | 14:36.35 |
| We don't provide documentation for the PostScript language | 14:37.02 |
cryptops1 | what about rectclip? | 14:37.55 |
| welp, i am out of ideas | 14:38.07 |
kens | That too is part of the PostScript language | 14:38.09 |
cryptops1 | how do i run gs on one page? is sPageList=pagenumber deprecated? | 14:41.38 |
kens | -sPageList works | 14:41.52 |
cryptops1 | it says: "These command line options are no longer specific to PDF, but have some specific differences with PDF files" | 14:41.54 |
| ok | 14:41.56 |
kens | As do FirstPage and LastPage | 14:41.59 |
cryptopsy | .j $postscript | 15:20.19 |
| oh, that channel is dead ... | 15:20.39 |
| i can't get my document to translate, does gs offer some ability for this? | 15:21.39 |
| my attempt was with -c "-$x -$y translate" | 15:22.16 |
kens | I've already said, twice, that using translate in PostScript preceding the PDF page is not going to work. That's because the PDF interpreter executes a setpagedevice before every page from the PDF file (in order to set the required media size) and setpagedevice does an implicit 'initgraphics'. The initgraphics operator resets the graphics state to its default. That means it throws away your translate. | 15:23.16 |
| So you can't use that and expect anything to happen | 15:23.54 |
cryptopsy | even with -dFIXEDMEDIA ? | 15:23.56 |
kens | That has nothing to do with it | 15:24.08 |
cryptopsy | saw a highly upvoted thread on stackoverflow doing it | 15:24.11 |
kens | Then it's wrong, or it's not doing what you think it is | 15:24.30 |
cryptopsy | yes it must be wrong since it isn't working | 15:24.41 |
| what can i do? | 15:24.44 |
kens | Well that depends entirely on what you are trying to achieve | 15:24.57 |
cryptopsy | i achieved a crop i would like to center the content | 15:25.16 |
kens | You can use <</PageOffset x y>> setpagedevice to move the content across the media. Because the PageOffset key is in the page device dictionary it is preserved when setpagedevice is executed, which means it continues to take effect on every page | 15:26.29 |
cryptopsy | i found your SO thread | 15:26.43 |
| reading now | 15:26.45 |
| https://stackoverflow.com/questions/46051517/ghostscript-crop-pdf-not-correctly | 15:26.53 |
kens | Actually that should be <</PageOffset [x y]>> setpagedevice | 15:26.59 |
cryptopsy | hmm i actually i had a script with that ... | 15:28.49 |
| its really nasty it processes one page at a time in a loop | 15:29.14 |
| https://i.imgur.com/UTqjX8q.png | 15:29.42 |
kens | doesn't grok bash scripting | 15:30.06 |
cryptopsy | i hope it will work once i add this PageOffset thing | 15:30.42 |
| mm | 15:32.10 |
| it over translated | 15:33.35 |
| -c "<</PageOffset [-$x -$y]>>setpagedevice" \ | 15:34.33 |
| oops | 15:34.35 |
| https://i.imgur.com/j9eRlJR.png | 15:34.42 |
| these were the x y w h from gs -q -dBATCH -dNOPAUSE -sDEVICE=bbox -sPageList=1 "$1" | 15:35.06 |
kens | Looks correct to me | 15:37.42 |
| the left and bottom extents of the content seem to be at the left and bottom of the output | 15:38.06 |
cryptopsy | x y w h before 62.387998 42.479999 532.691984 827.639975 | 15:38.13 |
| x y w h after 0.000000 0.000000 470.303986 785.160046 | 15:38.20 |
| i guess bbox didn't box as i expected | 15:39.14 |
| since there is still a top and right margin | 15:39.20 |
| this is the part i was dreading , bash arithmetic | 15:40.12 |
| i have to subtract another margin from w and h | 15:40.26 |
| why did bbox do this? | 15:40.49 |
kens | I suspect that the logo contains white | 15:41.16 |
| That gets counted as marking the page | 15:41.30 |
cryptopsy | which logo? | 15:41.42 |
kens | The blue circle at the top of the page | 15:41.54 |
cryptopsy | lets try bbox on page 3 | 15:42.18 |
| https://i.imgur.com/UKYjXf7.png | 15:43.16 |
| lets try another file | 15:43.43 |
kens | Well I'd have to dissect the PDF file to figure out why it's doing that, but my guess is still that the PDF draws white on the page | 15:43.51 |
cryptopsy | no problem - could be luck of a draw - i pull that off the net | 15:44.09 |
kens | Looks like the majority of those pages are images. | 15:45.25 |
cryptopsy | https://i.imgur.com/GziQxQt.png | 15:45.34 |
| this seems like there is exactly another margin's width of whitespace there | 15:46.09 |
kens | Well as I say, that file seems to consist mostly of images. Images can contain white space, and it is still counted as being marked | 15:46.39 |
cryptopsy | this was another file | 15:46.59 |
| pure text you can select it with the mouse | 15:47.17 |
kens | Being able to select text does not mean it is not an image | 15:47.31 |
| This is a common technique used by OCR packages, the text is actually invisible | 15:47.43 |
cryptopsy | if i zoom into this it isn't scanned text | 15:48.16 |
| the background is a perfect white | 15:48.28 |
kens | Well I don't have the file so I can't comment. | 15:48.44 |
cryptopsy | you would know better i am not familiar with this kind of trickery | 15:48.48 |
kens | TBH we are well past the 'goodwill' point here, we don't give technical support to open source users. If you think you've found a bug feel free to report it in our bug tracker. | 15:49.39 |
| I need to concentrate on actual paid work | 15:49.48 |
cryptopsy | https://gofile.io/?c=nJTeMk | 15:50.03 |
| alright | 15:50.05 |
| ♥ тнαηк уσυ ѕєηραι ♥ | 15:50.08 |
| subtracting another x,y from w,h, crops perfectly. I guess most pdfs are set up that way | 16:01.39 |
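The leftover top and right margin has a simple explanation that fits the numbers in the log: the four values printed by the bbox device are the lower-left and upper-right corners (llx lly urx ury), not x y w h. The content width and height are therefore urx - llx and ury - lly, which is the extra subtraction arrived at in the last message. A hedged sketch of that arithmetic, using awk because the values are not integers; crop_geometry is a hypothetical helper name, and the sample numbers are the "before" values quoted earlier.

```shell
# crop_geometry LLX LLY URX URY
# Prints "width height offset_x offset_y" in points, where the offsets are
# the amounts that would move the lower-left corner of the content to 0,0.
crop_geometry() {
    awk -v llx="$1" -v lly="$2" -v urx="$3" -v ury="$4" 'BEGIN {
        printf "%.6f %.6f %.6f %.6f\n", urx - llx, ury - lly, 0 - llx, 0 - lly
    }'
}

# Corner values copied from the "before" line in the log.
geometry=$(crop_geometry 62.387998 42.479999 532.691984 827.639975)
read -r w h ox oy <<EOF
$geometry
EOF
echo "width=$w height=$h offset=[$ox $oy]"

# These values could then feed the options already used in this log
# (untested sketch; in.pdf and cropped.pdf are placeholder names):
#   gs -o cropped.pdf -sDEVICE=pdfwrite \
#      -dDEVICEWIDTHPOINTS="$w" -dDEVICEHEIGHTPOINTS="$h" -dFIXEDMEDIA \
#      -c "<</PageOffset [$ox $oy]>> setpagedevice" -f in.pdf
```

The computed width and height (about 470.30 by 785.16) agree with the upper-right corner reported in the "after" line once the content had been shifted to the origin. This still sizes every page from one page's box, so documents with mixed page sizes or orientations would need a per-page box or the union of all boxes.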